12.4. Manipulate HTML Section
The settings in this section are used to manipulate the source code retrieved from the target web site. You can add, remove, and/or change things.
Note
This section exists under the Category Tab and the Post Tab of the Site Settings Page.
The HTML manipulations are applied in the following order when crawling a page.
- Find and replace in raw HTML
- Find and replace in all post pages (in general settings)
- Find and replace in HTML at first load
- Find and replace in element attributes
- Exchange element attributes
- Remove element attributes
- Find and replace in element HTML
Important
After the Find and replace in raw HTML setting is applied, the filters added to the request filter settings (i.e. Category request filters and Post request filters) are applied. See Lifecycle of Events to learn when each setting is applied.
Important
The settings you configure under this section are applied to all tests made by using the test buttons of other settings. This gives you a more robust way to create your settings: you can see the effects of the HTML manipulations as soon as you click the test button of another setting. Otherwise, you would have to use the Tester Page every time you make a change, which is very time-consuming.
The following video demonstrates this.
As you can see, the rules defined in the Find and replace in element HTML setting are applied when testing the post title selector. Moreover, they are applied in the Visual Inspector as well. This way, you can see the effects of your manipulations immediately.
When the test button of a setting under this section is used, the rules defined in the settings that come before that setting are applied first. For example, when testing a rule of Find and replace in element attributes, the rules of Find and replace in raw HTML, Find and replace in all post pages, and Find and replace in HTML at first load are applied first, in that order.
Among the rules defined in the same setting, only the rule whose test button is clicked is applied. The other rules of that setting are not applied.
The following video demonstrates these.
As you can see, the rule defined in the Find and replace in element attributes setting is applied when testing a rule of the Find and replace in element HTML setting. However, the first rule defined in the Find and replace in element HTML setting is not applied when testing the second rule of that setting.
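The selection logic described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the plugin's code: the function name `rules_applied_for_test` and the dictionary layout are hypothetical, and the rules are plain placeholder strings.

```python
# Order of the settings in this section, as listed in the documentation.
SETTING_ORDER = [
    "Find and replace in raw HTML",
    "Find and replace in all post pages",
    "Find and replace in HTML at first load",
    "Find and replace in element attributes",
    "Exchange element attributes",
    "Remove element attributes",
    "Find and replace in element HTML",
]


def rules_applied_for_test(rules_by_setting, tested_setting, tested_rule_index):
    """Return (setting, rule) pairs in the order they would be applied.

    rules_by_setting is a hypothetical dict mapping a setting name to its
    list of rules; the real plugin stores its options differently.
    """
    applied = []
    for setting in SETTING_ORDER:
        rules = rules_by_setting.get(setting, [])
        if setting == tested_setting:
            # Only the clicked rule of the tested setting is applied.
            applied.append((setting, rules[tested_rule_index]))
            break
        # Every rule of an earlier setting is applied first.
        applied.extend((setting, rule) for rule in rules)
    return applied


example = {
    "Find and replace in element attributes": ["attr-rule"],
    "Find and replace in element HTML": ["html-rule-1", "html-rule-2"],
}
# Testing the second rule of the last setting skips "html-rule-1".
print(rules_applied_for_test(example, "Find and replace in element HTML", 1))
```

Here the attribute rule is applied before the tested rule, while the first rule of the tested setting is left out, which matches the behavior described above.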
12.4.1. Test code for find-and-replaces in HTML
When the test button of a setting under this section is clicked, it uses this code to perform the tests. So, enter any text that you want to test.
Note
The test buttons also use the test URL. If there are both a test URL and test code, both are used as test subjects.
12.4.2. Find and replace in raw HTML
This is a type of Find and Replace Setting. The rules you enter in this setting are applied to the source code of the page as soon as the response is retrieved from the target web site. In other words, this is the first change that is applied to the source code. Hence, you can use this setting to, for example, fix HTML errors in the source code that make the plugin unable to parse the HTML code.
When you click the test button of this setting, the results show four different values for each type of test subject. So, if there are both a test URL and test code, you will see eight results, four for each.
Tip
If something that has to be in the source code of the target web page is not found when testing your settings, then there is probably an error in the source code of the target web site. In that case, you can use this setting to see what the plugin sees when crawling the page. If there is something wrong, you can fix it so that the plugin can crawl the target web page.
The four types of results are as follows.
- Crawler HTML with current find-replace applied
  - This shows the HTML code after two operations. First, the plugin applies the find-replace rule whose test button is clicked. Then, it parses the HTML. If there is something wrong with the HTML, you can tell by looking at the length of the HTML code and its last lines. If the HTML code is much shorter than the raw response text, there is probably something wrong with the code. You can fix the errors and check whether the length of the parsed HTML is close to that of the raw response.
- Raw content
  - This is the HTML code retrieved from the target web site. This code is not parsed. It is treated as text, not HTML. In other words, this is the raw response text that the target site sends. The parsed HTML's length should be close to the length of the raw content if there is no error in the HTML code retrieved from the target site.
- Crawler HTML with no find-replace applied
  - This is the HTML code that is created by parsing the raw content retrieved from the target site. This is created without applying any find-replace rules.
- Crawler HTML with all raw HTML find-replaces applied
  - This is created by applying all find-replace rules entered into this setting and then parsing the HTML. This is what the plugin sees when matching CSS selectors or doing any other operation by using your settings. Hence, this is the final result. This result must be error-free. You can compare the length of this code with that of the raw content to make sure the HTML code is correctly parsed.
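The length comparison suggested above can be expressed as a small check. The following Python sketch is only an illustration: the function name `looks_truncated` and the 0.5 threshold are assumptions made for this example, not values used by the plugin.

```python
def looks_truncated(raw_html: str, parsed_html: str, min_ratio: float = 0.5) -> bool:
    """Return True when the parsed HTML is much shorter than the raw response.

    min_ratio is an arbitrary threshold chosen for this sketch; pick a value
    that suits the pages you crawl.
    """
    if not raw_html:
        return False
    return len(parsed_html) / len(raw_html) < min_ratio


# A parsed result that kept only a fifth of the raw text is suspicious.
print(looks_truncated("x" * 1000, "x" * 200))
# A parsed result that is close in length to the raw text is not.
print(looks_truncated("x" * 1000, "x" * 950))
```

A parsed length that differs only slightly from the raw length is normal, because parsing may normalize whitespace or add missing tags.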
Tip
PHP’s DOMDocument is not able to differentiate an html tag that is actually an element from an html tag defined as text inside a script element. Therefore, if there is text that contains </html> or </body> in the source code of the target web page, DOMDocument parses it as a closing tag instead of plain text. In this case, the rest of the HTML code is stripped out. To fix this, you can simply remove these texts from the page by using this setting.
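To illustrate the kind of replacement that helps here, the following Python sketch removes literal </html> and </body> strings, but only when they appear inside script elements. It uses Python's `re` module as a stand-in: the function name and the patterns are assumptions for this example, and the rule you enter in the plugin may look different.

```python
import re

# Matches one script element: opening tag, body, closing tag.
SCRIPT_BLOCK = re.compile(r"(<script\b[^>]*>)(.*?)(</script>)", re.IGNORECASE | re.DOTALL)
# Matches the closing tags that confuse the parser when they occur in script text.
STRAY_CLOSERS = re.compile(r"</(html|body)>", re.IGNORECASE)


def remove_closers_inside_scripts(raw_html: str) -> str:
    """Delete literal </html> and </body> strings that appear inside script elements."""
    def clean(match):
        opening, body, closing = match.groups()
        return opening + STRAY_CLOSERS.sub("", body) + closing

    return SCRIPT_BLOCK.sub(clean, raw_html)


raw = "<html><body><script>var t = 'x</body></html>';</script><p>Kept</p></body></html>"
print(remove_closers_inside_scripts(raw))
# prints: <html><body><script>var t = 'x';</script><p>Kept</p></body></html>
```

The real closing tags at the end of the document are left untouched, so only the text that would mislead the parser is removed.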
12.4.3. Find and replace in HTML at first load
This is a type of Find and Replace Setting. You can use this to change/remove/add certain things in the source code of the target page. The replacements will be applied after the replacements defined in Find and replace in raw HTML and Find and replace in all post pages (in general settings) are applied, respectively.
12.4.4. Find and replace in element attributes
This is both a type of Find and Replace Setting and Selector and Attribute Setting. You can change/remove/add certain things in the value of an attribute of an element. The replacements will be applied after the replacements defined in Find and replace in HTML at first load are applied.
Important
If you change the values that you use in the CSS selector, the test results will be empty because the CSS selector you use does not match the element anymore.
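As a rough illustration of what this setting does, the sketch below replaces text inside one attribute of every matching element. It is written in Python with the standard `xml.etree.ElementTree` module, which needs well-formed markup, and it matches elements by tag name instead of a full CSS selector. The function name `replace_in_attribute` is hypothetical and the plugin itself is not implemented this way.

```python
import xml.etree.ElementTree as ET


def replace_in_attribute(root, tag, attribute, find, replace):
    """Replace text in one attribute of every element with the given tag name."""
    for element in root.iter(tag):
        value = element.get(attribute)
        if value is not None:
            element.set(attribute, value.replace(find, replace))


root = ET.fromstring('<div><a href="http://site.com/page">Link</a></div>')
replace_in_attribute(root, "a", "href", "http://", "https://")
print(ET.tostring(root, encoding="unicode"))
# prints: <div><a href="https://site.com/page">Link</a></div>
```

If the selector had depended on the old value, for example by matching only links that start with http://, it would no longer match after the replacement, which is the situation the note above warns about.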
12.4.5. Exchange element attributes
This is a type of Selector and Attribute Setting. However, it has two attribute values. Using this setting, you can exchange the values of two attributes of the same element. The replacements will be applied after the replacements defined in Find and replace in element attributes are applied.
This is particularly useful when there are lazy-loaded elements that you want to show in your posts. Here is a sample lazy-loading img element:
<img src="placeholder.png" data-src="https://site.com/actual-image.jpg">
Note
The actual image URL might not be in the data-src attribute. This is just an attribute name used for this example. In your case, the real image URL might be in another attribute.
In this code, the src attribute’s value is just a placeholder image. The real image’s URL is in the data-src attribute. Browsers use only the src attribute’s value to display an image. Therefore, when the page that contains this code is loaded in the browser, the image is not loaded immediately. After the page is loaded, the web site uses JavaScript to put the value of the data-src attribute into the src attribute so that the browser loads the image and displays it to the user. This is done to decrease the loading time of a web page. The problem for us is that our site might not have the JavaScript code that does this replacement. In that case, we just need the actual image URL in the src attribute so that the browser can display the image.
You can do this replacement quite easily by using this setting. The following configuration exchanges the values of the src and data-src attributes.
| Selector: | img |
|---|---|
| Attribute 1: | src |
| Attribute 2: | data-src |
After this, the img element will be turned into this:
<img src="https://site.com/actual-image.jpg" data-src="placeholder.png">
The browser can load and display the image without any problems now. This is just a sample usage of this setting. You can use it in any way you want.
Important
If the value of attribute 2 does not exist, the values will not be exchanged.
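The exchange can be sketched as follows. This is a minimal Python illustration using the standard `xml.etree.ElementTree` module, so the sample markup is written in well-formed form and elements are matched by tag name. The function name `exchange_attributes` is hypothetical, and using an empty string when attribute 1 is missing is an assumption of this sketch, since the behavior for that case is not described here.

```python
import xml.etree.ElementTree as ET


def exchange_attributes(root, tag, attribute_1, attribute_2):
    """Swap two attribute values on every element with the given tag name."""
    for element in root.iter(tag):
        value_2 = element.get(attribute_2)
        if value_2 is None:
            # Attribute 2 does not exist, so nothing is exchanged.
            continue
        # Assumption of this sketch: a missing attribute 1 counts as empty.
        value_1 = element.get(attribute_1, "")
        element.set(attribute_1, value_2)
        element.set(attribute_2, value_1)


root = ET.fromstring(
    '<div>'
    '<img src="placeholder.png" data-src="https://site.com/actual-image.jpg" />'
    '<img src="logo.png" />'
    '</div>'
)
exchange_attributes(root, "img", "src", "data-src")
for image in root.iter("img"):
    print(image.attrib)
```

The first image ends up with the real URL in src, while the second image is left as it is because it has no data-src attribute.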
12.4.6. Remove element attributes
You can use this setting to remove one or many attributes of an element. This is a type of Selector and Attribute Setting. What is different is that you can enter many attribute names into the attribute input by separating them with commas. The removals will be applied after the rules defined in Exchange element attributes setting are applied.
Let’s assume there is the following img element:
<img src="https://site.com/actual-image.jpg" data-src="placeholder.png" data-test="dummy-value">
Let’s say you do not want the data-src attribute and its value to be shown on your web site. To remove it, the following values can be entered in this setting:
| Selector: | img |
|---|---|
| Attribute: | data-src |
This turns that code into the following:
<img src="https://site.com/actual-image.jpg" data-test="dummy-value">
If you also want to remove data-test and its value, you can configure this setting as follows:
| Selector: | img |
|---|---|
| Attribute: | data-src, data-test |
This turns that code into the following:
<img src="https://site.com/actual-image.jpg">
Important
If you change the values that you use in the CSS selector, the test results will be empty because the CSS selector you use does not match the element anymore.
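The removal, including the comma-separated attribute list, can be sketched like this. The example is a Python illustration using the standard `xml.etree.ElementTree` module with well-formed sample markup, and it matches elements by tag name. The function name `remove_attributes` is hypothetical and does not come from the plugin.

```python
import xml.etree.ElementTree as ET


def remove_attributes(root, tag, attribute_names):
    """Remove the comma-separated attributes from every element with the given tag name."""
    names = [name.strip() for name in attribute_names.split(",") if name.strip()]
    for element in root.iter(tag):
        for name in names:
            # pop with a default ignores attributes the element does not have.
            element.attrib.pop(name, None)


root = ET.fromstring(
    '<div><img src="https://site.com/actual-image.jpg" '
    'data-src="placeholder.png" data-test="dummy-value" /></div>'
)
remove_attributes(root, "img", "data-src, data-test")
print(root.find("img").attrib)
# prints: {'src': 'https://site.com/actual-image.jpg'}
```

Whitespace around the names is stripped, so "data-src, data-test" and "data-src,data-test" behave the same in this sketch.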
12.4.7. Find and replace in element HTML
This is both a type of Find and Replace Setting and Selector and Attribute Setting. Here, there is no attribute input because the setting is for changing the entire HTML code of the matching elements. The replacements will be applied after the rules defined in the Remove element attributes setting are applied.
You can change anything in the HTML of one or more elements by using this setting. You can add or remove attributes, such as classes and ids. You can change the tag name of elements, such as replacing a p tag with a div tag. You can even create your own HTML element. You are free to make any changes you want as long as the changes result in valid HTML code.
Important
If you change the values that you use in the CSS selector, the test results will be empty because the CSS selector you use does not match the element anymore.
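The idea of replacing within the entire HTML of an element can be sketched as: serialize the element, apply the replacement to that text, and parse the result back. The Python example below uses the standard `xml.etree.ElementTree` and `re` modules as stand-ins, with well-formed sample markup and matching by tag name. The function name `replace_in_element_html` and the regular expression are assumptions made for this illustration, not the plugin's implementation.

```python
import re
import xml.etree.ElementTree as ET


def replace_in_element_html(root, tag, pattern, replacement):
    """Rewrite the serialized HTML of every matching element below the root."""
    parents = {child: parent for parent in root.iter() for child in parent}
    for element in list(root.iter(tag)):
        parent = parents.get(element)
        if parent is None:
            continue  # the root itself is not replaced in this sketch
        tail = element.tail
        element.tail = None  # keep trailing text out of the serialized HTML
        html = ET.tostring(element, encoding="unicode")
        # Parsing fails if the replacement does not produce valid markup.
        new_element = ET.fromstring(re.sub(pattern, replacement, html, flags=re.DOTALL))
        new_element.tail = tail
        parent[list(parent).index(element)] = new_element


root = ET.fromstring('<div><p class="intro">Hello</p></div>')
# Turn the p element into a div while keeping its attributes and content.
replace_in_element_html(root, "p", r"^<p\b(.*)</p>$", r"<div\1</div>")
print(ET.tostring(root, encoding="unicode"))
# prints: <div><div class="intro">Hello</div></div>
```

Changing only the opening tag would leave mismatched tags, and the parse step in this sketch would raise an error. That mirrors the requirement above that the changes must result in valid HTML code.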