6.1.2. Recrawling Section
Recrawling means crawling already-crawled posts again. In other words, it means updating the posts by retrieving the source code of the target page again and saving certain details of the posts again.
The following parts of the posts will not be updated:
- Permalink (the name of the post in its URL)
- Publish date
All other parts of the posts will be updated when recrawling.
6.1.2.1. Recrawling is active?
Enable this to activate automatic recrawling of the posts that are crawled by the plugin. When you enable this, all posts of the sites whose Active for recrawling? setting is checked will be updated. If a site’s Active for recrawling? setting is not checked, the posts saved by using that site’s settings will not be recrawled.
Note
For the posts to be recrawled, both this setting and the site’s Active for recrawling? setting must be enabled.
6.1.2.2. Post Recrawl Interval
This setting defines how often the post recrawling event runs. See Run count for post recrawling event for an example.
Note
This setting is visible only if Recrawling is active? is enabled.
6.1.2.3. Run count for post recrawling event
This setting defines how many times the post recrawling event runs in each interval defined in Post Recrawl Interval.
Let’s assume the following case. You have 3 sites that are active for recrawling. Each site previously saved a few posts, all of which are currently suitable for recrawling. To keep the example simple, let’s assume that Maximum recrawl count per post is set to 1. The sites and their posts for this example case are given in the following table.
Note
In this example, each post is assumed to have only 1 page.
Site Name | Posts (oldest first) |
---|---|
Site 1 | Post 1.1, Post 1.2, Post 1.3, Post 1.4, Post 1.5 |
Site 2 | Post 2.1, Post 2.2, Post 2.3 |
Site 3 | Post 3.1, Post 3.2, Post 3.3, Post 3.4, Post 3.5, Post 3.6, Post 3.7, Post 3.8, Post 3.9, Post 3.10 |
Finally, let’s assume that you have set this setting’s value to 2 and the Post Recrawl Interval setting’s value to Every 5 minutes. In this case, the posts will be recrawled as shown in the following table.
Time (minute) | Site | Recrawled Posts | Remaining Post Count |
---|---|---|---|
0 | Site 1 | Post 1.1, Post 1.2 | 3 |
5 | Site 2 | Post 2.1, Post 2.2 | 1 |
10 | Site 3 | Post 3.1, Post 3.2 | 8 |
15 | Site 1 | Post 1.3, Post 1.4 | 1 |
20 | Site 2 | Post 2.3 | 0 |
25 | Site 3 | Post 3.3, Post 3.4 | 6 |
30 | Site 1 | Post 1.5 | 0 |
35 | Site 3 | Post 3.5, Post 3.6 | 4 |
40 | Site 3 | Post 3.7, Post 3.8 | 2 |
45 | Site 3 | Post 3.9, Post 3.10 | 0 |
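The schedule above can be reproduced with a short simulation. This is only a sketch of the behavior the table suggests, not the plugin’s actual code: it assumes that each event serves a single site, that sites take turns in a fixed order, and that a site with no remaining posts is skipped. The function name simulate_recrawl and its parameters are hypothetical.

```python
from collections import deque


def simulate_recrawl(sites, run_count, interval_minutes):
    """Return one (minute, site, recrawled posts, remaining count) tuple per event.

    Assumed model (inferred from the example table, not from the plugin source):
    each event serves one site, sites take turns in the given order, and a site
    with no posts left is skipped without consuming an event.
    """
    queues = {name: deque(posts) for name, posts in sites.items()}
    order = deque(sites)  # round-robin order of site names
    schedule = []
    minute = 0
    while any(queues.values()):
        site = order[0]
        order.rotate(-1)  # the next site gets the next turn
        if not queues[site]:
            continue  # nothing left for this site
        # One run recrawls one (single-page) post; at most run_count per event.
        batch_size = min(run_count, len(queues[site]))
        batch = [queues[site].popleft() for _ in range(batch_size)]
        schedule.append((minute, site, batch, len(queues[site])))
        minute += interval_minutes
    return schedule


sites = {
    "Site 1": [f"Post 1.{i}" for i in range(1, 6)],
    "Site 2": [f"Post 2.{i}" for i in range(1, 4)],
    "Site 3": [f"Post 3.{i}" for i in range(1, 11)],
}
for minute, site, posts, remaining in simulate_recrawl(sites, 2, 5):
    print(minute, site, ", ".join(posts), remaining)
```

With a run count of 2 and a 5-minute interval, the printed rows match the table: the last event happens at minute 45 and recrawls Post 3.9 and Post 3.10.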
Note
In the case of multi-page posts, each page of a post is handled in a different run. You can assume that the plugin makes only 1 request to the target site in each run of the event. Since each page of a post has a different URL, the plugin crawls/recrawls the pages in different runs. The plugin will not crawl/recrawl another post until all pages of the current post are crawled/recrawled.
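To see how multi-page posts consume runs, the following sketch groups page requests into events for a single site. It is an illustration only: the function name plan_page_requests is hypothetical, and it assumes that runs left over in an event after a post’s last page continue with the next post.

```python
def plan_page_requests(posts, run_count):
    """Group page requests into events, one request per run.

    posts: list of (title, page_count) tuples, oldest first.
    Assumption: leftover runs in an event continue with the next post.
    """
    # Every page is a separate request, and a post's pages stay together.
    requests = [
        (title, page)
        for title, page_count in posts
        for page in range(1, page_count + 1)
    ]
    # Each event executes at most run_count requests.
    return [requests[i:i + run_count] for i in range(0, len(requests), run_count)]


events = plan_page_requests([("Post A", 3), ("Post B", 1)], run_count=2)
for number, event in enumerate(events, start=1):
    print(number, event)
```

Under these assumptions, a 3-page post with a run count of 2 needs two events: the first event requests pages 1 and 2, and the second requests page 3 before moving on to the next post.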
Note
All the runs are executed back to back when the event is run. For example, if an event’s run count is set to 10 and its interval is set to Every 5 minutes, each run starts as soon as the previous run is finished. The 5-minute time interval will not be divided into 10 equal intervals.
6.1.2.4. Maximum recrawl count per post
The number set in this setting defines how many times a single post can be updated. For example, if you set this setting’s value to 3, a post will be updated at most 3 times.
6.1.2.5. Minimum time between two recrawls for a post (in minutes)
The number set in this setting defines the minimum number of minutes that must pass between two recrawls of a post. For example, if you set this number to 120, a post will be updated only if its previous update was at least 2 hours ago.
Let’s assume you set this number to 30 minutes and you have 1 active site that has only 1 crawled post, which has never been recrawled before. Let’s also assume that the first recrawl will take place on 02.09.2019 at 16:00. Finally, let’s assume that you have set Maximum recrawl count per post to 5 and Post Recrawl Interval to Every minute.
In this case, the recrawl times of the post are as follows:
Recrawl Count | Recrawl Date |
---|---|
1 | 02.09.2019 @ 16:00 |
2 | 02.09.2019 @ 16:30 |
3 | 02.09.2019 @ 17:00 |
4 | 02.09.2019 @ 17:30 |
5 | 02.09.2019 @ 18:00 |
The post is recrawled 5 times because Maximum recrawl count per post has been set to 5. The post is updated in 30-minute intervals because this setting has been set to 30, although Post Recrawl Interval has been set to Every minute. Because there was only 1 active site, the recrawls took place exactly 30 minutes apart.
Note
If other sites were also active for recrawling, the recrawl intervals would probably be greater than 30 minutes. That is because, 30 minutes after the previous recrawl of the post, it might be another site’s turn to be recrawled.
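The single-site example above can be checked with a small simulation. This is a sketch under the assumptions of the example (one active site, one post), not the plugin’s implementation; the function name recrawl_times and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def recrawl_times(first_check, check_interval, min_gap, max_recrawls):
    """Return the recrawl times of a single post on a single active site.

    The event fires every check_interval, but the post is recrawled only if
    at least min_gap has passed since its previous recrawl, and at most
    max_recrawls times in total.
    """
    times = []
    now = first_check
    while len(times) < max_recrawls:
        if not times or now - times[-1] >= min_gap:
            times.append(now)
        now += check_interval  # wait for the next run of the event
    return times


times = recrawl_times(
    first_check=datetime(2019, 9, 2, 16, 0),
    check_interval=timedelta(minutes=1),   # Post Recrawl Interval: Every minute
    min_gap=timedelta(minutes=30),         # minimum time between two recrawls
    max_recrawls=5,                        # Maximum recrawl count per post
)
for count, moment in enumerate(times, start=1):
    print(count, moment.strftime("%d.%m.%Y @ %H:%M"))
```

The event fires every minute, but only the checks at 16:00, 16:30, 17:00, 17:30 and 18:00 satisfy the 30-minute minimum, which matches the table.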
6.1.2.6. Recrawl posts newer than (in minutes)
The number set in this setting defines how fresh a post must be to be recrawled. If a post’s creation date is this many minutes or more before the current time, the post will not be recrawled. This is easier to explain with an example.
Note
Post creation date is the WordPress post’s Publish date. So, if you also save the post creation date by retrieving it from the target web site, then that will be considered when determining the freshness of the post.
Let’s assume that you have set this setting’s value to 1440 minutes, which is equal to 1 day. Let’s also assume that you have the posts given in the following table, which were saved by the sites that are currently active for recrawling. Finally, let’s assume that the current date is 10.09.2019 12:00. For this case, the suitability of each post for recrawling is given in the following table.
Post Title | Publish Date | Suitable |
---|---|---|
Post 1 | 02.09.2019 @ 16:00 | No |
Post 2 | 02.09.2019 @ 20:00 | No |
Post 3 | 05.09.2019 @ 12:00 | No |
Post 4 | 08.09.2019 @ 21:00 | No |
Post 5 | 09.09.2019 @ 10:00 | No |
Post 6 | 09.09.2019 @ 12:00 | No |
Post 7 | 09.09.2019 @ 12:01 | Yes |
Post 8 | 10.09.2019 @ 10:00 | Yes |
Post 6 is not suitable because its publish date is exactly 1440 minutes ago. Post 7 is suitable because its publish date is 1439 minutes ago, which is less than the defined 1440 minutes. Post 8 is suitable because its publish date is just 120 minutes ago.
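The freshness rule from this example can be expressed as a strict comparison. The sketch below is an illustration inferred from the table, not the plugin’s code; the function name is_fresh is hypothetical.

```python
from datetime import datetime, timedelta


def is_fresh(publish_date, now, newer_than_minutes):
    """Return True if the post is new enough to be recrawled.

    The comparison is strict: a post published exactly newer_than_minutes
    ago is not suitable, as with Post 6 in the table.
    """
    return now - publish_date < timedelta(minutes=newer_than_minutes)


now = datetime(2019, 9, 10, 12, 0)
posts = {
    "Post 5": datetime(2019, 9, 9, 10, 0),
    "Post 6": datetime(2019, 9, 9, 12, 0),
    "Post 7": datetime(2019, 9, 9, 12, 1),
    "Post 8": datetime(2019, 9, 10, 10, 0),
}
for title, publish_date in posts.items():
    print(title, "Yes" if is_fresh(publish_date, now, 1440) else "No")
```

With the value set to 1440, Post 5 and Post 6 are reported as not suitable, while Post 7 and Post 8 are suitable.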