7.1.2. Recrawling Section

Recrawling means crawling already-crawled posts again. In other words, it means updating the posts by retrieving the source code of the target page again and saving certain details of the posts again.

The following parts of the posts will not be updated:

  • Permalink (the name of the post in its URL)
  • Publish date

All other parts of the posts will be updated when recrawling.
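
The rule above can be illustrated with a minimal sketch. The field names and the apply_recrawl helper below are hypothetical and are not part of the plugin; they only show that the permalink and the publish date survive a recrawl while everything else is overwritten.

```python
# Fields that the section above says are kept on recrawl.
# The key names are hypothetical, chosen only for this illustration.
PRESERVED_FIELDS = ("permalink", "publish_date")


def apply_recrawl(saved_post: dict, recrawled_data: dict) -> dict:
    """Return the post after a recrawl.

    Every field is overwritten by the freshly retrieved data,
    except the preserved fields, which keep their saved values.
    """
    updated = {**saved_post, **recrawled_data}
    for field in PRESERVED_FIELDS:
        if field in saved_post:
            updated[field] = saved_post[field]
    return updated


saved = {
    "permalink": "my-first-post",
    "publish_date": "2019-09-02 16:00",
    "title": "Old title",
    "content": "Old content",
}
fresh = {
    "permalink": "renamed-post",
    "publish_date": "2019-09-10 12:00",
    "title": "New title",
    "content": "New content",
}

result = apply_recrawl(saved, fresh)
print(result)
```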

7.1.2.1. Recrawling is active?

Enable this to activate automatic recrawling of the posts that are crawled by the plugin. When you enable this, all posts of the sites whose Active for recrawling? setting is checked will be updated. If a site’s Active for recrawling? setting is not checked, the posts saved by using that site’s settings will not be recrawled.

Note

For the posts of a site to be recrawled, both this setting and the site’s Active for recrawling? setting must be enabled.
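
As a small sketch, the condition in the note is a logical AND of the two settings. The function and parameter names are hypothetical and do not come from the plugin.

```python
def will_be_recrawled(recrawling_is_active: bool, site_active_for_recrawling: bool) -> bool:
    """A site's posts are recrawled only when the general setting and
    the site's own setting are both enabled."""
    return recrawling_is_active and site_active_for_recrawling


# Print the outcome for every combination of the two settings.
for general in (True, False):
    for site in (True, False):
        print(general, site, will_be_recrawled(general, site))
```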

7.1.2.2. Post Recrawl Interval

This setting defines how often the post recrawling event runs. See Run count for post recrawling event for an example.

Note

This setting is visible only if Recrawling is active? is enabled.

7.1.2.3. Run count for post recrawling event

This setting defines how many times the post recrawling event runs in each interval defined in Post Recrawl Interval.

Let’s assume the following case. You have 3 sites that are active for recrawling. Each site previously saved a few posts which are currently all suitable for recrawling. To keep the example simple, let’s assume that Maximum recrawl count per post is set to 1. The sites and their posts for this example case are given in the following table.

Note

In this example, the posts are assumed to have only 1 page.

Table 7.7 Sites active for recrawling and their recrawling-suitable posts
Site Name   Posts (the oldest first)
Site 1      Post 1.1, Post 1.2, Post 1.3, Post 1.4, Post 1.5
Site 2      Post 2.1, Post 2.2, Post 2.3
Site 3      Post 3.1, Post 3.2, Post 3.3, Post 3.4, Post 3.5, Post 3.6, Post 3.7, Post 3.8, Post 3.9, Post 3.10

Finally, let’s assume that you have set this setting’s value to 2 and Post Recrawl Interval setting’s value to Every 5 minutes. In this case, the posts will be recrawled as shown in the following table.

Table 7.8 Simulation of recrawling for the example case
Time (minute)   Site     Recrawled Posts       Remaining Post Count
0               Site 1   Post 1.1, Post 1.2    3
5               Site 2   Post 2.1, Post 2.2    1
10              Site 3   Post 3.1, Post 3.2    8
15              Site 1   Post 1.3, Post 1.4    1
20              Site 2   Post 2.3              0
25              Site 3   Post 3.3, Post 3.4    6
30              Site 1   Post 1.5              0
35              Site 3   Post 3.5, Post 3.6    4
40              Site 3   Post 3.7, Post 3.8    2
45              Site 3   Post 3.9, Post 3.10   0
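
The simulation in Table 7.8 can be reproduced with a short sketch. This is not the plugin’s code. It assumes, based only on the table, that each event handles the next site in rotation that still has suitable posts and recrawls up to the run count of that site’s oldest posts. The function name simulate_recrawling and its parameters are hypothetical.

```python
from collections import deque


def simulate_recrawling(sites: dict, run_count: int, interval_min: int) -> list:
    """Return rows of (minute, site, recrawled posts, remaining post count)."""
    # Copy the post lists into queues so the caller's data is left untouched.
    queues = {name: deque(posts) for name, posts in sites.items()}
    order = list(queues)
    rows = []
    minute = 0
    next_index = 0
    while any(queues.values()):
        # Skip sites that have no suitable posts left.
        while not queues[order[next_index]]:
            next_index = (next_index + 1) % len(order)
        site = order[next_index]
        queue = queues[site]
        # One post per run, at most run_count runs in this event.
        recrawled = [queue.popleft() for _ in range(min(run_count, len(queue)))]
        rows.append((minute, site, recrawled, len(queue)))
        minute += interval_min
        next_index = (next_index + 1) % len(order)
    return rows


sites = {
    "Site 1": [f"Post 1.{i}" for i in range(1, 6)],
    "Site 2": [f"Post 2.{i}" for i in range(1, 4)],
    "Site 3": [f"Post 3.{i}" for i in range(1, 11)],
}

rows = simulate_recrawling(sites, run_count=2, interval_min=5)
for minute, site, posts, remaining in rows:
    print(minute, site, ", ".join(posts), remaining)
```

With a run count of 2 and a 5-minute interval, the sketch yields the same ten rows as the table, ending at minute 45.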

Note

In the case of multi-page posts, each page of a post is handled in a different run. You can assume that the plugin makes only 1 request to the target site in each run of the event. Since each page of a post has a different URL, the plugin crawls/recrawls the pages in different runs. The plugin will not crawl/recrawl another post until all pages of the current post are crawled/recrawled.
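
As a rough sketch of the note above, the following lists the single request made in each run when posts have several pages. The plan_runs helper and the post names are hypothetical.

```python
def plan_runs(posts: list) -> list:
    """Given (post name, page count) pairs, list the one request made per run.

    All pages of a post are handled before the next post is started.
    """
    runs = []
    for name, page_count in posts:
        for page in range(1, page_count + 1):
            runs.append(f"{name} (page {page} of {page_count})")
    return runs


# A 3-page post followed by a 1-page post needs 4 runs in total.
runs = plan_runs([("Post A", 3), ("Post B", 1)])
for number, request in enumerate(runs, start=1):
    print(f"Run {number}: {request}")
```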

Note

All the runs of an event are executed back to back when the event is run. For example, if an event’s run count is set to 10 and its interval is set to Every 5 minutes, the 10 runs are executed one after another, each starting as soon as the previous run is finished. The 5-minute interval will not be divided into 10 equal intervals.

7.1.2.4. Maximum recrawl count per post

This setting defines how many times a single post can be updated. For example, if you set it to 3, a post will be updated at most 3 times.

7.1.2.5. Minimum time between two recrawls for a post (in minutes)

This setting defines the minimum number of minutes that must pass between two recrawls of a post. For example, if you set it to 120, a post will be updated only if its previous update was at least 2 hours ago.

Let’s assume you set this setting to 30 and you have 1 active site that has only 1 crawled post, which has never been recrawled before. Let’s also assume that the first recrawl will take place on 02.09.2019 at 16:00. Finally, let’s assume that you have set Maximum recrawl count per post to 5 and Post Recrawl Interval to Every minute. In this case, the recrawl times of the post are as follows:

Table 7.9 Recrawl times for the post in the assumed case
Recrawl Count Recrawl Date
1 02.09.2019 @ 16:00
2 02.09.2019 @ 16:30
3 02.09.2019 @ 17:00
4 02.09.2019 @ 17:30
5 02.09.2019 @ 18:00

The post is recrawled 5 times because Maximum recrawl count per post has been set to 5. The post is updated in 30-minute intervals because this setting has been set to 30, although Post Recrawl Interval has been set to Every minute. Because there was only 1 active site, the recrawls took place exactly 30 minutes apart.

Note

If there were other sites active for recrawling, the recrawl intervals would probably be greater than 30 minutes. That is because, 30 minutes after the previous recrawl of the post, it might be another site’s turn to be recrawled.
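
The single-site case in Table 7.9 can be sketched as follows. The recrawl_times function and its parameter names are hypothetical. The sketch only models one site with one post, where the event fires at every interval and the post is recrawled when the minimum time has passed and the maximum recrawl count has not been reached.

```python
from datetime import datetime, timedelta


def recrawl_times(first_recrawl, event_interval_min, min_gap_min, max_recrawl_count):
    """Return the moments at which the single post is recrawled."""
    times = []
    now = first_recrawl
    last = None
    while len(times) < max_recrawl_count:
        # Recrawl only if enough time has passed since the previous recrawl.
        if last is None or now - last >= timedelta(minutes=min_gap_min):
            times.append(now)
            last = now
        # The event fires again after the recrawl interval.
        now += timedelta(minutes=event_interval_min)
    return times


times = recrawl_times(
    first_recrawl=datetime(2019, 9, 2, 16, 0),
    event_interval_min=1,   # Post Recrawl Interval: Every minute
    min_gap_min=30,         # Minimum time between two recrawls
    max_recrawl_count=5,    # Maximum recrawl count per post
)
for count, moment in enumerate(times, start=1):
    print(count, moment.strftime("%d.%m.%Y @ %H:%M"))
```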

7.1.2.6. Recrawl posts newer than (in minutes)

This setting defines how fresh a post must be to be recrawled. If a post’s creation date is more than this many minutes before the current time, the post will not be recrawled. It is better to explain this with an example.

Note

Post creation date is the WordPress post’s Publish date. So, if you also save the post creation date by retrieving it from the target web site, then that date will be considered when determining the freshness of the post.

Let’s assume that you have set this setting’s value to 1440 minutes, which is equal to 1 day. Let’s also assume that you have the posts given in the following table, which were saved by the sites that are currently active for recrawling. Finally, let’s assume the current date is 10.09.2019 12:00. For this case, the suitability of each post for recrawling is given in the following table.

Table 7.10 Suitability of the posts for recrawling where current date is 10.09.2019 12:00 and the value of this setting is set to 1440.
Post Title Publish Date Suitable
Post 1 02.09.2019 @ 16:00 No
Post 2 02.09.2019 @ 20:00 No
Post 3 05.09.2019 @ 12:00 No
Post 4 08.09.2019 @ 21:00 No
Post 5 09.09.2019 @ 10:00 No
Post 6 09.09.2019 @ 12:00 Yes
Post 7 09.09.2019 @ 12:01 Yes
Post 8 10.09.2019 @ 10:00 Yes

Posts 1 to 5 are not suitable because they were published more than 1440 minutes before the current date. Post 6 is suitable because it was published exactly 1440 minutes ago, which is not older than the defined limit. Post 7 is suitable because it was published 1439 minutes ago, which is less than the defined 1440 minutes. Post 8 is suitable because it was published just 120 minutes ago.
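
The freshness rule, as stated in this setting’s description (a post older than the defined number of minutes is not recrawled), can be sketched as below. The function name is hypothetical, and treating a post that is exactly at the limit as suitable is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def is_suitable_for_recrawl(publish_date, now, newer_than_min):
    """A post is skipped only when it is older than the limit.

    Assumption: a post exactly at the limit still qualifies.
    """
    return now - publish_date <= timedelta(minutes=newer_than_min)


now = datetime(2019, 9, 10, 12, 0)
limit = 1440  # 1 day, in minutes

# Check posts of different ages, given in minutes before the current date.
results = {}
for age_min in (120, 1440, 1441):
    publish_date = now - timedelta(minutes=age_min)
    results[age_min] = is_suitable_for_recrawl(publish_date, now, limit)
    print(age_min, results[age_min])
```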