12.4. Manipulate HTML Section

The settings in this section manipulate the source code retrieved from the target web site. You can add, remove, or change parts of it.

Note

This section exists under both the Category Tab and the Post Tab of the Site Settings Page.

The HTML manipulations are applied in the following order when crawling a page.

  1. Find and replace in raw HTML
  2. Find and replace in all post pages (in general settings)
  3. Find and replace in HTML at first load
  4. Find and replace in element attributes
  5. Exchange element attributes
  6. Remove element attributes
  7. Find and replace in element HTML

Important

After the Find and replace in raw HTML setting is applied, the filters added to the request filter settings (i.e. Category request filters and Post request filters) are applied. See Lifecycle of Events to learn when each setting is applied.

Important

The settings you configure under this section are applied to all tests made by using the test buttons of other settings. This gives you a more robust way to create your settings: you can see the effects of the HTML manipulations as soon as you click the test button of another setting. Otherwise, you would have to use the Tester Page every time you make a change, which is very time-consuming.

The following video demonstrates this.

As you can see, the rules defined in the Find and replace in element HTML setting are applied when testing the post title selector. Moreover, they are applied in the Visual Inspector as well. This way, you can see the effects of your manipulations immediately.

When the test button of a setting under this section is used, the rules defined in the settings that come before that setting are applied first. For example, when testing a rule of Find and replace in element attributes, the rules of Find and replace in raw HTML, Find and replace in all post pages, and Find and replace in HTML at first load are applied first, in that order.

Among the rules defined in the same setting, only the rule whose test button is clicked is applied, not the others.

The following video demonstrates these.

As you can see, the rule defined in Find and replace in element attributes setting is applied when testing a rule of Find and replace in element HTML setting. However, the first rule defined in Find and replace in element HTML setting is not applied when testing the second rule of that setting.
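The selection logic described above can be sketched in a few lines of Python. This is only an illustration of the rule order, not the plugin's own code: the function name rules_applied_for_test and the configured dictionary are hypothetical, and the rule labels are placeholders.

```python
# Order of the settings, copied from the list at the top of this section.
SETTING_ORDER = [
    "Find and replace in raw HTML",
    "Find and replace in all post pages",
    "Find and replace in HTML at first load",
    "Find and replace in element attributes",
    "Exchange element attributes",
    "Remove element attributes",
    "Find and replace in element HTML",
]


def rules_applied_for_test(configured, tested_setting, clicked_index):
    """Return (setting, rule) pairs in the order they would run for one test.

    configured is a hypothetical dict mapping a setting name to its rules.
    """
    applied = []
    for setting in SETTING_ORDER:
        rules = configured.get(setting, [])
        if setting == tested_setting:
            # Within the tested setting, only the clicked rule runs.
            applied.append((setting, rules[clicked_index]))
            break
        # Every rule of an earlier setting runs first.
        applied.extend((setting, rule) for rule in rules)
    return applied


configured = {
    "Find and replace in element attributes": ["attr rule 1"],
    "Find and replace in element HTML": ["html rule 1", "html rule 2"],
}
# Test the second rule of "Find and replace in element HTML".
plan = rules_applied_for_test(configured, "Find and replace in element HTML", 1)
print(plan)
```

In this sketch, testing the second element HTML rule runs the attribute rule first and skips the first element HTML rule, which mirrors the behavior described above.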

12.4.1. Test code for find-and-replaces in HTML

When the test button of a setting under this section is clicked, it uses this code to perform the tests. So, enter any code that you want to test.

Note

The test buttons also use the test URL. If both a test URL and test code are provided, both are used as test subjects.

12.4.2. Find and replace in raw HTML

This is a type of Find and Replace Setting. The rules you enter in this setting are applied to the source code of the page as soon as the response is retrieved from the target web site. In other words, this is the first change that is applied to the source code. Hence, you can use this setting to, for example, fix HTML errors in the source code that make the plugin unable to parse the HTML code.

When you click the test button of this setting, the results show four different values for each type of test subject. So, if there are both a test URL and test code, you will see eight results, four for each.

Tip

If something that has to be in the source code of the target web page is not found when testing your settings, then there is probably an error in the source code of the target web site. In that case, you can use this setting to see what the plugin sees when crawling the page. If there is something wrong, you can fix it so that the plugin can crawl the target web page.

The four types of results are as follows.

Crawler HTML with current find-replace applied
This shows the HTML code after two operations. First, the plugin applies the find-replace rule whose button is clicked. Then, it parses the HTML. The result shown is the HTML code after both operations. If there is something wrong with the HTML, you can tell by looking at the length of the HTML code and its last lines. If the parsed HTML is much shorter than the raw response text, there is probably something wrong with the code. You can fix the errors and check whether the length of the parsed HTML is close to that of the raw response.
Raw content
This is the HTML code retrieved from the target web site. This code is not parsed. It is treated as plain text, not HTML. In other words, this is the raw response text that the target site sends. If there is no error in the HTML code retrieved from the target site, the length of the parsed HTML should be close to the length of the raw content.
Crawler HTML with no find-replace applied
This is the HTML code that is created by parsing the raw content retrieved from the target site. This is created without applying any find-replace rules.
Crawler HTML with all raw HTML find-replaces applied
This is created by applying all find-replace rules entered into this setting and then parsing the HTML. This is what the plugin will see when matching CSS selectors or doing any other operation by using your settings. Hence, this is the final result. This result must be error-free. You can compare the length of this code with that of raw content to make sure the HTML code is correctly parsed.

Tip

PHP’s DOMDocument cannot differentiate an html tag that is actually an element from an html tag that appears as text inside a script element. Therefore, if the source code of the target web page contains the text </html> or </body>, DOMDocument parses it as a closing tag instead of plain text. In this case, the rest of the HTML code is stripped out. To fix this, you can simply remove these texts from the page by using this setting.
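To make the kind of replacement concrete, here is a minimal Python sketch that removes </html> and </body> text found inside script elements while leaving the real closing tags alone. It is an illustration only: the plugin itself runs on PHP, the function name strip_closing_tags_in_scripts is hypothetical, and the regular expressions are an assumption about what a suitable find-replace rule could look like.

```python
import re

# Matches a whole script element: opening tag, body, closing tag.
SCRIPT_RE = re.compile(r"(<script\b[^>]*>)(.*?)(</script>)", re.IGNORECASE | re.DOTALL)
# Matches the closing tags that confuse the parser when they appear as text.
CLOSERS_RE = re.compile(r"</(?:html|body)>", re.IGNORECASE)


def strip_closing_tags_in_scripts(raw_html):
    """Remove '</html>' and '</body>' text that appears inside script elements."""
    def clean(match):
        opening, body, closing = match.groups()
        return opening + CLOSERS_RE.sub("", body) + closing

    return SCRIPT_RE.sub(clean, raw_html)


raw = (
    "<html><body>"
    "<script>var tpl = '<div>hi</div></body></html>';</script>"
    "<p>Content after the script</p>"
    "</body></html>"
)
fixed = strip_closing_tags_in_scripts(raw)
print(fixed)
```

Note that this changes the string inside the script, which is usually acceptable when the script is not needed in the saved post.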

12.4.3. Find and replace in HTML at first load

This is a type of Find and Replace Setting. You can use this to change/remove/add certain things in the source code of the target page. The replacements will be applied after the replacements defined in Find and replace in raw HTML and Find and replace in all post pages (in general settings) are applied, respectively.

12.4.4. Find and replace in element attributes

This is both a Find and Replace Setting and a Selector and Attribute Setting. You can change/remove/add certain things in the value of an attribute of an element. The replacements will be applied after the replacements defined in Find and replace in HTML at first load are applied.

Important

If you change the values that you use in the CSS selector, the test results will be empty because the CSS selector you use does not match the element anymore.
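The following Python sketch illustrates both the replacement and the caveat above. It models an element as a plain dictionary of attributes, which is a simplification, and it assumes plain-text replacement. The function names and the sample link are hypothetical, and matches_selector stands in for the CSS selector a[href^="http://"].

```python
def replace_in_attribute(attrs, attribute, find, replace):
    """Return a copy of attrs with find replaced in one attribute's value."""
    updated = dict(attrs)
    if attribute in updated:
        updated[attribute] = updated[attribute].replace(find, replace)
    return updated


def matches_selector(attrs):
    # Stand-in for the CSS selector a[href^="http://"].
    return attrs.get("href", "").startswith("http://")


link = {"href": "http://site.com/page"}
changed = replace_in_attribute(link, "href", "http://", "https://")
print(changed)
# The selector matched before the replacement but no longer matches after it.
print(matches_selector(link), matches_selector(changed))
```

Because the replacement rewrites the very value the selector relies on, the selector stops matching, which is why the test results come back empty in that situation.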

12.4.5. Exchange element attributes

This is a type of Selector and Attribute Setting. However, it takes two attributes instead of one. Using this setting, you can exchange the values of two attributes of the same element. The exchanges will be applied after the replacements defined in Find and replace in element attributes are applied.

This is particularly useful when there are lazy-loaded elements that you want to show in your posts. Here is a sample lazy-loaded img element.

<img src="placeholder.png" data-src="https://site.com/actual-image.jpg">

Note

The actual image URL might not be in the data-src attribute. This is just an attribute name used for this example. In your case, the real image URL might be in another attribute.

In this code, the value of the src attribute is just a placeholder image. The real image's URL is in the data-src attribute. Browsers use only the value of the src attribute to display an image. Therefore, when the page that contains this code is loaded in the browser, the image is not loaded immediately. After the page is loaded, the web site uses JavaScript to copy the value of the data-src attribute into the src attribute so that the browser loads the image and displays it to the user. This is done to decrease the loading time of the web page.

The problem for us is that our own site might not have JavaScript code that does this replacement. In that case, we need the actual image URL in the src attribute so that the browser can display the image. You can do this quite easily by using this setting. The following configuration exchanges the values of the src and data-src attributes.

Selector: img
Attribute 1: src
Attribute 2: data-src

After this, the img element will be turned into this:

<img src="https://site.com/actual-image.jpg" data-src="placeholder.png">

The browser can load and display the image without any problems now. This is just a sample usage of this setting. You can use it in any way you want.

Important

If the value of attribute 2 does not exist, the values will not be exchanged.
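The exchange, including the rule about attribute 2, can be sketched in Python as follows. The element is modeled as a dictionary of attributes for brevity. The function name exchange_attributes is hypothetical, "value does not exist" is interpreted here as the attribute being absent, and treating a missing attribute 1 as an empty string is an assumption, since the text above does not cover that case.

```python
def exchange_attributes(attrs, attribute_1, attribute_2):
    """Return a copy of attrs with the values of two attributes swapped."""
    updated = dict(attrs)
    if attribute_2 not in updated:
        # Attribute 2 has no value, so nothing is exchanged.
        return updated
    # Assumption: a missing attribute 1 is treated as an empty value.
    first_value = updated.get(attribute_1, "")
    updated[attribute_1] = updated[attribute_2]
    updated[attribute_2] = first_value
    return updated


lazy_img = {"src": "placeholder.png", "data-src": "https://site.com/actual-image.jpg"}
plain_img = {"src": "https://site.com/other.jpg"}

exchanged = exchange_attributes(lazy_img, "src", "data-src")
untouched = exchange_attributes(plain_img, "src", "data-src")
print(exchanged)
print(untouched)
```

The lazy-loaded image ends up with the real URL in src, while the image without a data-src attribute is left as it was.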

12.4.6. Remove element attributes

You can use this setting to remove one or more attributes of an element. This is a type of Selector and Attribute Setting. What is different is that you can enter several attribute names into the attribute input by separating them with commas. The removals will be applied after the rules defined in the Exchange element attributes setting are applied.

Let’s assume there is the following img element:

<img src="https://site.com/actual-image.jpg" data-src="placeholder.png" data-test="dummy-value">

Let’s say you do not want the data-src attribute and its value to be shown on your web site. To remove it, you can enter the following values into this setting:

Selector: img
Attribute: data-src

This turns that code into the following:

<img src="https://site.com/actual-image.jpg" data-test="dummy-value">

If you also want to remove data-test and its value, you can configure this setting as follows:

Selector: img
Attribute: data-src, data-test

This turns that code into the following:

<img src="https://site.com/actual-image.jpg">
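As a rough Python sketch of the same behavior, the comma-separated attribute input can be split into names and each named attribute dropped from the element. The element is again modeled as a dictionary, and the function name remove_attributes is hypothetical.

```python
def remove_attributes(attrs, attribute_input):
    """Return a copy of attrs without the attributes named in attribute_input.

    attribute_input is a comma-separated string such as "data-src, data-test".
    """
    names = {name.strip() for name in attribute_input.split(",") if name.strip()}
    return {key: value for key, value in attrs.items() if key not in names}


img = {
    "src": "https://site.com/actual-image.jpg",
    "data-src": "placeholder.png",
    "data-test": "dummy-value",
}
one_removed = remove_attributes(img, "data-src")
two_removed = remove_attributes(img, "data-src, data-test")
print(one_removed)
print(two_removed)
```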

Important

If you change the values that you use in the CSS selector, the test results will be empty because the CSS selector you use does not match the element anymore.

12.4.7. Find and replace in element HTML

This is both a Find and Replace Setting and a Selector and Attribute Setting. Here, there is no attribute input because the setting is for changing the entire HTML code of the matching elements. The replacements will be applied after the rules defined in the Remove element attributes setting are applied.

You can change anything in the HTML of one or more elements by using this setting. You can add or remove attributes, such as classes and ids. You can change the tag name of elements, such as replacing a p tag with a div tag. You can even create your own HTML element. You are free to make any changes you want as long as the changes result in valid HTML code.
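As an illustration of the tag-name change mentioned above, the following Python sketch rewrites the outer tag of one element's HTML with a regular expression. It is not the plugin's implementation: the function name change_tag is hypothetical, and the pattern is only one assumed way to write such a find-replace rule.

```python
import re


def change_tag(element_html, old_tag, new_tag):
    """Replace the outer tag of one element's HTML, keeping attributes and content."""
    pattern = re.compile(
        r"^<" + re.escape(old_tag) + r"(\s[^>]*)?>(.*)</" + re.escape(old_tag) + r">$",
        re.DOTALL,
    )

    def rebuild(match):
        attributes = match.group(1) or ""
        return "<%s%s>%s</%s>" % (new_tag, attributes, match.group(2), new_tag)

    return pattern.sub(rebuild, element_html)


paragraph = '<p class="intro">Hello <b>world</b></p>'
converted = change_tag(paragraph, "p", "div")
print(converted)
```

Elements whose outer tag is not the one being searched for, such as a pre element when searching for p, are returned unchanged in this sketch.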

Important

If you change the values that you use in the CSS selector, the test results will be empty because the CSS selector you use does not match the element anymore.