7.1.2. Recrawling Section
Recrawling means crawling already-crawled posts again. In other words, it means updating the posts by retrieving the source code of the target page again and saving certain details of the posts again.
The following parts of the posts will not be updated:
- Permalink (the name of the post in its URL)
- Publish date
All other parts of the posts will be updated when recrawling.
7.1.2.1. Recrawling is active?
Enable this to activate automatic recrawling of the posts that are crawled by the plugin. When you enable this, all posts of the sites whose Active for recrawling? setting is checked will be updated. If a site’s Active for recrawling? setting is not checked, the posts saved by using that site’s settings will not be recrawled.
For the posts to be recrawled, both this setting and the site's Active for recrawling? setting must be enabled.
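The rule above can be sketched as a small predicate. This is only an illustration in Python, not the plugin's own code; the function name, the parameter names, and the example sites are hypothetical.

```python
def should_recrawl_site(recrawling_is_active: bool,
                        site_active_for_recrawling: bool) -> bool:
    """Return True only when both the general setting and the site setting are enabled."""
    return recrawling_is_active and site_active_for_recrawling


# Hypothetical sites: name -> value of the site's "Active for recrawling?" setting.
sites = {"Site A": True, "Site B": False}

# With the general "Recrawling is active?" setting enabled, only Site A qualifies.
recrawled_sites = [
    name for name, active in sites.items()
    if should_recrawl_site(True, active)
]
```

If the general setting is disabled, the predicate returns False for every site, whatever the site's own setting is.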
7.1.2.2. Post Recrawl Interval
The time interval selected in this setting defines how often the post recrawling event runs. See Run count for post recrawling event for an example.
This setting is visible only if Recrawling is active? is enabled.
7.1.2.3. Run count for post recrawling event
This setting defines how many times the post recrawling event should run in each time interval defined in Post Recrawl Interval.
Let’s assume the following case. You have 3 sites that are active for recrawling. Each site previously saved a few posts, all of which are currently suitable for recrawling. To keep the example simple, let’s assume that Maximum recrawl count per post is set to 1. The sites and their posts for this example case are given in the following table.
In this example, the posts are assumed to have only 1 page.
|Site Name||Posts (the oldest at the top)|
Finally, let’s assume that you have set a value for this setting and set the Post Recrawl Interval setting’s value to Every 5 minutes. In this case, the posts will be recrawled as shown in the following table.
|Time (minute)||Site||Recrawled Posts||Remaining Post Count|
In case of multi-page posts, each page of a post will be handled in a different run. You can assume that the plugin makes only 1 request to the target site at each run of the event. Since each page of a post has a different URL, the plugin will crawl/recrawl the pages in different runs. The plugin will not crawl/recrawl another post until all pages of the current post are crawled/recrawled.
All the runs will be executed as soon as the event is run. For example, if an event’s run count is set to 10 and its interval is set to 5 minutes, each run will be executed as soon as the previous run is finished. The 5-minute time interval will not be divided into 10 equal intervals.
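To illustrate the timing described above, the following Python sketch lists when the runs of a single event firing would start. It is not the plugin's code: the function name, the 16:00 start time, and the 3-second run duration are assumptions made only for illustration.

```python
from datetime import datetime, timedelta


def run_start_times(event_time, run_count, run_duration):
    """Start times of the runs triggered by a single firing of the event.

    Each run starts as soon as the previous one finishes, so the runs are
    packed at the beginning of the interval instead of being spread over it.
    """
    return [event_time + i * run_duration for i in range(run_count)]


# Assumed values: the event fires at 16:00 and every run takes 3 seconds.
starts = run_start_times(
    datetime(2019, 9, 2, 16, 0),
    run_count=10,
    run_duration=timedelta(seconds=3),
)

# The last of the 10 runs starts long before the next firing at 16:05.
last_start = starts[-1]
next_firing = datetime(2019, 9, 2, 16, 5)
```

Under these assumptions all 10 runs start within the first half minute, and the rest of the 5-minute interval stays idle until the event fires again.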
7.1.2.4. Maximum recrawl count per post
The number set in this setting defines how many times a single post can be updated. For example, if you set this setting’s value to 3, a post will be updated at most 3 times.
7.1.2.5. Minimum time between two recrawls for a post (in minutes)
The number set in this setting defines the minimum number of minutes that must pass between two recrawls of a post. For example, if you set this number to 120, a post will be updated only if its previous update was at least 2 hours ago.
Let’s assume you set this number to 30 minutes and you have 1 active site that has only 1 crawled post, which has never been recrawled before. Let’s also assume that the first recrawl will take place at 16:00. Finally, let’s assume that you have set Maximum recrawl count per post to 5 and Post Recrawl Interval to Every minute. In this case, the recrawl times of the post are as follows:
|Recrawl Count||Recrawl Date|
|1||02.09.2019 @ 16:00|
|2||02.09.2019 @ 16:30|
|3||02.09.2019 @ 17:00|
|4||02.09.2019 @ 17:30|
|5||02.09.2019 @ 18:00|
The post is recrawled 5 times because Maximum recrawl count per post has been set to 5. The post is updated in 30-minute intervals because this setting has been set to 30 and Post Recrawl Interval has been set to Every minute. Because there was only 1 active site, the recrawls took place exactly 30 minutes apart.
If there were other sites that are active for recrawling, then the recrawl intervals would probably be greater than 30 minutes. That is because, 30 minutes after the previous recrawl of the post, it might be another site’s turn to be recrawled.
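The single-site example above can be reproduced with a short simulation. This is a minimal sketch in Python under the stated assumptions (one site, one post, the event firing every minute starting at 16:00). The function and parameter names are hypothetical and do not come from the plugin.

```python
from datetime import datetime, timedelta


def simulate_recrawls(first_run, event_interval, min_gap, max_recrawls, until):
    """Return the times at which a single post would be recrawled.

    The event fires every `event_interval`. At each firing the post is
    recrawled only if it has never been recrawled or if its previous recrawl
    was at least `min_gap` ago, and only until `max_recrawls` is reached.
    """
    recrawl_times = []
    now = first_run
    while now <= until and len(recrawl_times) < max_recrawls:
        last = recrawl_times[-1] if recrawl_times else None
        if last is None or now - last >= min_gap:
            recrawl_times.append(now)
        now += event_interval
    return recrawl_times


times = simulate_recrawls(
    first_run=datetime(2019, 9, 2, 16, 0),
    event_interval=timedelta(minutes=1),   # Post Recrawl Interval: Every minute
    min_gap=timedelta(minutes=30),         # Minimum time between two recrawls
    max_recrawls=5,                        # Maximum recrawl count per post
    until=datetime(2019, 9, 2, 23, 59),
)
formatted = [t.strftime("%d.%m.%Y @ %H:%M") for t in times]
```

The sketch deliberately ignores other sites. With more than one active site, a firing that falls exactly 30 minutes after the previous recrawl might be spent on another site, which is why the real gaps can be longer.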
7.1.2.6. Recrawl posts newer than (in minutes)
The number set in this setting defines how fresh a post must be to be recrawled. If a post’s creation date is this many minutes or more before the current time, the post will not be recrawled. This is best explained with an example.
Post creation date is the WordPress post’s Publish date. So, if you also save the post creation date by retrieving it from the target web site, that date will be considered when determining the freshness of the post.
Let’s assume that you have set this setting’s value to 1440 minutes, which is equal to 1 day. Let’s also assume that you have the posts given in the following table, which were saved by the sites that are currently active for recrawling. Finally, let’s assume that the current date is 10.09.2019 12:00. For this case, the suitability of each post for recrawling is given in the table below.
|Post Title||Publish Date||Suitable|
|Post 1||02.09.2019 @ 16:00||No|
|Post 2||02.09.2019 @ 20:00||No|
|Post 3||05.09.2019 @ 12:00||No|
|Post 4||08.09.2019 @ 21:00||No|
|Post 5||09.09.2019 @ 10:00||No|
|Post 6||09.09.2019 @ 12:00||No|
|Post 7||09.09.2019 @ 12:01||Yes|
|Post 8||10.09.2019 @ 10:00||Yes|
Post 6 is not suitable because it was published exactly 1440 minutes ago. Post 7 is suitable because it was published 1439 minutes ago, which is less than the defined 1440 minutes. Post 8 is suitable because it was published just 120 minutes ago.
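The suitability column of the table can be checked with a few lines of Python. This is a sketch of the rule as the example describes it, not the plugin's implementation, and the function name is hypothetical. The strict comparison is an inference from Post 6, which is exactly 1440 minutes old and is not suitable.

```python
from datetime import datetime, timedelta


def is_fresh_enough(publish_date, now, newer_than_minutes):
    """A post is suitable only if it is strictly newer than the limit."""
    return now - publish_date < timedelta(minutes=newer_than_minutes)


now = datetime(2019, 9, 10, 12, 0)  # current date in the example
posts = {
    "Post 1": datetime(2019, 9, 2, 16, 0),
    "Post 2": datetime(2019, 9, 2, 20, 0),
    "Post 3": datetime(2019, 9, 5, 12, 0),
    "Post 4": datetime(2019, 9, 8, 21, 0),
    "Post 5": datetime(2019, 9, 9, 10, 0),
    "Post 6": datetime(2019, 9, 9, 12, 0),   # exactly 1440 minutes ago
    "Post 7": datetime(2019, 9, 9, 12, 1),   # 1439 minutes ago
    "Post 8": datetime(2019, 9, 10, 10, 0),  # 120 minutes ago
}

suitable = {
    title: is_fresh_enough(published, now, newer_than_minutes=1440)
    for title, published in posts.items()
}
```

Only Post 7 and Post 8 come out as suitable, which matches the table.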