7.1.1. Scheduling

Scheduling consists of automatically collecting post URLs from the target sites’ category pages and then saving those posts to your WordPress site. It involves two events:

URL Collection
This means crawling the category pages of the sites (defined in the Category > Category URLs setting), extracting the post URLs, and saving them to the database.
Post Crawling
This means crawling one of the post URLs saved to your database by the URL collection event and creating a post in your WordPress site from the source code of that URL.

7.1.1.1. Scheduling is active?

Enable this to activate automatic crawling. When it is enabled, the plugin collects post URLs from the categories entered in the Category > Category URLs setting and automatically crawls the collected post URLs.

Note

For the posts to be crawled, both this setting and the site’s Active for scheduling? setting must be enabled.

7.1.1.2. Post URL Collection Interval

This setting defines how often the post URL collection job runs.

Note

This setting is visible only if Scheduling is active? is enabled.

Note

The collected URLs will be crawled (saved as posts) in the order they are collected.

For example, let’s say you selected Every 5 minutes. In this case, every 5 minutes, the plugin will collect URLs from one of the categories of one of the active sites. Let’s assume you have 3 active sites and they are configured as follows:

Table 7.1 Active sites and their categories for the assumed case
Site Name    Categories
Site 1       Category 1.1 (2 pages), Category 1.2 (1 page)
Site 2       Category 2.1 (1 page), Category 2.2 (2 pages), Category 2.3 (2 pages)
Site 3       Category 3.1 (1 page)

Let’s also assume that you do not limit the number of category pages that can be crawled, i.e. Maximum number of pages to crawl per category is set to 0. Finally, let’s assume that you set the value of Maximum number of pages to check to find new post URLs to 1. Here is how the plugin will collect the URLs with this configuration:

Table 7.2 Simulation of URL collection event for the assumed case
Time (minute) Site - Category - Page
0 Site 1 - Category 1.1 - Page 1
5 Site 2 - Category 2.1 - Page 1
10 Site 3 - Category 3.1 - Page 1
15 Site 1 - Category 1.1 - Page 2
20 Site 2 - Category 2.2 - Page 1
25 Site 3 - Category 3.1 - Page 1
30 Site 1 - Category 1.2 - Page 1
35 Site 2 - Category 2.2 - Page 2
40 Site 3 - Category 3.1 - Page 1
45 Site 1 - Category 1.1 - Page 1
50 Site 2 - Category 2.3 - Page 1
55 Site 3 - Category 3.1 - Page 1
60 Site 1 - Category 1.2 - Page 1
65 Site 2 - Category 2.3 - Page 2
70 Site 3 - Category 3.1 - Page 1
75 Site 1 - Category 1.1 - Page 1
80 Site 2 - Category 2.1 - Page 1
85 Site 3 - Category 3.1 - Page 1
90 Site 1 - Category 1.2 - Page 1
95 Site 2 - Category 2.2 - Page 1
100 Site 3 - Category 3.1 - Page 1
105 Site 1 - Category 1.1 - Page 1
110 Site 2 - Category 2.3 - Page 1
115 Site 3 - Category 3.1 - Page 1
120 Site 1 - Category 1.2 - Page 1
125 Site 2 - Category 2.1 - Page 1
130 Site 3 - Category 3.1 - Page 1
135 Site 1 - Category 1.1 - Page 1
140 Site 2 - Category 2.2 - Page 1
145 Site 3 - Category 3.1 - Page 1
150 Site 1 - Category 1.2 - Page 1
155 Site 2 - Category 2.3 - Page 1
160 Site 3 - Category 3.1 - Page 1
165 Site 1 - Category 1.1 - Page 1
170 Site 2 - Category 2.1 - Page 1
175 Site 3 - Category 3.1 - Page 1
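
The rotation in Table 7.2 can be reproduced with a short simulation. This is only a sketch of the behavior described above, not the plugin's actual code: the function names, the data layout, and the assumption that sites are visited in a strict round-robin order are all hypothetical.

```python
from itertools import cycle


def site_pages(site, categories, max_pages_per_category=0, max_pages_to_check=1):
    """Yield (site, category, page) in the order one site's pages are visited.

    Assumed model: the first pass walks every allowed page of each category,
    later passes only re-check the first few pages for new post URLs.
    """
    first_pass = True
    while True:
        for category, total_pages in categories:
            if not first_pass:
                limit = min(total_pages, max_pages_to_check)
            elif max_pages_per_category == 0:  # 0 means "no limit"
                limit = total_pages
            else:
                limit = min(total_pages, max_pages_per_category)
            for page in range(1, limit + 1):
                yield site, category, page
        first_pass = False


def simulate_url_collection(sites, interval, events):
    """Return (minute, site, category, page) rows, one page per event."""
    # Assumption: the active sites take turns, one site per event.
    streams = cycle([site_pages(name, cats) for name, cats in sites])
    return [(n * interval, *next(next(streams))) for n in range(events)]


SITES = [
    ("Site 1", [("Category 1.1", 2), ("Category 1.2", 1)]),
    ("Site 2", [("Category 2.1", 1), ("Category 2.2", 2), ("Category 2.3", 2)]),
    ("Site 3", [("Category 3.1", 1)]),
]

for minute, site, category, page in simulate_url_collection(SITES, 5, 12):
    print(f"{minute} {site} - {category} - Page {page}")
```

Changing `interval` or the page counts in `SITES` shows how long it takes before every category has been fully collected and the plugin falls back to checking only the first page.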

7.1.1.3. Post Crawl Interval

This setting defines how often the post crawling job runs.

Note

This setting is visible only if Scheduling is active? is enabled.

Let’s say you set this to Every minute. Then, every minute, the plugin will crawl a URL that has already been collected and is available in your database. URLs that have been collected but not yet saved as posts are said to be in the queue. So, the plugin will take the first URL in the queue for a site, request its source code from the target server, apply your settings to extract data from the source code, and save the result as a post in your site.

Let’s also assume that you have 3 active sites that have different numbers of posts in their queues, as shown in the following table.

Note

Each time, the oldest URL in a site’s queue is saved first. You can think of this as a queue at a bank: whoever arrived first is served first.

Table 7.3 Active sites and URLs in the queue belonging to them for the assumed case
Site Name    Post URLs in the Queue (oldest first)
Site 1       URL 1.1, URL 1.2, URL 1.3, URL 1.4, URL 1.5
Site 2       URL 2.1, URL 2.2, URL 2.3, URL 2.4, URL 2.5, URL 2.6, URL 2.7, URL 2.8, URL 2.9, URL 2.10
Site 3       URL 3.1, URL 3.2

Let’s finally assume that you have set Run count for post crawling event to 2. The plugin will crawl the URLs in the queue and save them as posts as shown in the following table.

Table 7.4 Simulation of post crawling event for the assumed case
Time (minute)    Site      Crawled URLs          Remaining URLs in the queue
0                Site 1    URL 1.1, URL 1.2      3
1                Site 2    URL 2.1, URL 2.2      8
2                Site 3    URL 3.1, URL 3.2      0
3                Site 1    URL 1.3, URL 1.4      1
4                Site 2    URL 2.3, URL 2.4      6
5                Site 1    URL 1.5               0
6                Site 2    URL 2.5, URL 2.6      4
7                Site 2    URL 2.7, URL 2.8      2
8                Site 2    URL 2.9, URL 2.10     0
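
The same schedule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the plugin's implementation: the function name and data layout are hypothetical, and the rule that a site with an empty queue is skipped in the rotation is inferred from the table.

```python
from collections import deque


def simulate_post_crawling(queues, run_count, interval=1):
    """Return (minute, site, crawled_urls, remaining) rows until all queues are empty."""
    queues = {site: deque(urls) for site, urls in queues.items()}
    order = list(queues)  # sites take turns in this order
    rows, minute, position = [], 0, 0
    while any(queues.values()):
        # Assumption: a site whose queue is empty is skipped.
        while not queues[order[position % len(order)]]:
            position += 1
        site = order[position % len(order)]
        queue = queues[site]
        # Oldest URLs first (FIFO), at most run_count of them per event.
        crawled = [queue.popleft() for _ in range(min(run_count, len(queue)))]
        rows.append((minute, site, crawled, len(queue)))
        minute += interval
        position += 1
    return rows


QUEUES = {
    "Site 1": [f"URL 1.{n}" for n in range(1, 6)],
    "Site 2": [f"URL 2.{n}" for n in range(1, 11)],
    "Site 3": [f"URL 3.{n}" for n in range(1, 3)],
}

for minute, site, crawled, remaining in simulate_post_crawling(QUEUES, run_count=2):
    print(minute, site, ", ".join(crawled), remaining)
```

A `deque` is used because URLs are always taken from the front of the queue, which matches the first-come, first-served order described in the note above.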

Note

In the case of multi-page posts, each page of a post is handled in a separate run. You can assume that the plugin makes only 1 request to the target site at each run of the event. Since each page of a post has a different URL, the plugin crawls/recrawls the pages in different runs. The plugin will not crawl/recrawl another post until all pages of the current post have been crawled/recrawled.
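
To see how multi-page posts consume runs, the following sketch expands each post into one run per page and groups the runs by event. It assumes a single site and a hypothetical `(url, page_count)` representation of the queue; it is not the plugin's code.

```python
def schedule_runs(posts, run_count):
    """Group page requests into events; each run handles exactly one page.

    posts: list of (post_url, page_count), oldest first (hypothetical layout).
    """
    # A post's pages are crawled consecutively before the next post starts.
    runs = [(url, page) for url, pages in posts for page in range(1, pages + 1)]
    return [runs[i:i + run_count] for i in range(0, len(runs), run_count)]


# A 3-page post followed by a 1-page post, with a run count of 2.
for event in schedule_runs([("URL 1.1", 3), ("URL 1.2", 1)], run_count=2):
    print(event)
```

With these inputs the 3-page post needs two events to finish, and the second post only starts in the second run of the second event.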

7.1.1.4. Maximum number of pages to crawl per category

This setting defines the maximum number of pages of a category from which post URLs can be collected. When you set this setting’s value to 0, all pages of the category will be crawled to collect post URLs.

For example, let’s assume that you entered a category URL with 100 pages into the Category > Category URLs setting. If you set this setting’s value to 50, the plugin will first collect URLs from the first 50 pages. After all 50 pages have been crawled, the plugin will start again from the first page of the category to find new post URLs. This time, however, it will not go through all 50 pages. Instead, it will check only the number of pages defined in the Maximum number of pages to check to find new post URLs setting.

7.1.1.5. Maximum number of pages to check to find new post URLs

This setting defines how many pages of a category, starting from its first page, should be checked for new post URLs. New post URLs are URLs that have not been added to the database (i.e. the queue) before.

This setting is applied after the number of pages defined in Maximum number of pages to crawl per category is crawled. For an example, see Maximum number of pages to crawl per category.
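
The interaction of the two settings can be summarized as a small function. This is a sketch with hypothetical names; how the plugin treats edge cases (for example a check value of 0) is not covered by the description and is not modeled here.

```python
def pages_for_pass(total_pages, max_pages_per_category, max_pages_to_check, first_pass):
    """Return the category page numbers visited in one pass over a category."""
    if not first_pass:
        # Later passes only look for new post URLs on the first few pages.
        limit = min(total_pages, max_pages_to_check)
    elif max_pages_per_category == 0:
        # 0 means every page of the category is crawled.
        limit = total_pages
    else:
        limit = min(total_pages, max_pages_per_category)
    return list(range(1, limit + 1))


# The 100-page category from the example, limited to 50 pages.
# The check value of 2 is an arbitrary illustrative choice.
initial = pages_for_pass(100, 50, 2, first_pass=True)
later = pages_for_pass(100, 50, 2, first_pass=False)
print(len(initial), later)
```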

7.1.1.6. Run count for URL collection event

This setting defines how many times the URL collection event should run each time the interval defined in Post URL Collection Interval elapses.

Let’s assume the following case. You have set Post URL Collection Interval to Every 5 minutes and set the value of this setting to 2. You have 2 active sites as shown in the following table:

Table 7.5 Active sites and their categories for the assumed case
Site Name    Categories
Site 1       Category 1.1 (2 pages), Category 1.2 (1 page)
Site 2       Category 2.1 (1 page), Category 2.2 (2 pages), Category 2.3 (2 pages)

Let’s finally assume that Maximum number of pages to check to find new post URLs is set to 1. Here is how the plugin will collect the URLs with this configuration:

Table 7.6 Simulation of URL collection event for the assumed case
Time (minute)    Site - Category - Page
0                Site 1 - Category 1.1 - Page 1, Site 1 - Category 1.1 - Page 2
5                Site 2 - Category 2.1 - Page 1, Site 2 - Category 2.2 - Page 1
10               Site 1 - Category 1.2 - Page 1, Site 1 - Category 1.1 - Page 1
15               Site 2 - Category 2.2 - Page 2, Site 2 - Category 2.3 - Page 1
20               Site 1 - Category 1.2 - Page 1, Site 1 - Category 1.1 - Page 1
25               Site 2 - Category 2.3 - Page 2, Site 2 - Category 2.1 - Page 1
30               Site 1 - Category 1.2 - Page 1, Site 1 - Category 1.1 - Page 1
35               Site 2 - Category 2.2 - Page 1, Site 2 - Category 2.3 - Page 1
40               Site 1 - Category 1.2 - Page 1, Site 1 - Category 1.1 - Page 1
45               Site 2 - Category 2.1 - Page 1, Site 2 - Category 2.2 - Page 1
50               Site 1 - Category 1.2 - Page 1, Site 1 - Category 1.1 - Page 1
55               Site 2 - Category 2.3 - Page 1, Site 2 - Category 2.1 - Page 1
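
The effect of the run count on the page sequence of Table 7.6 can be sketched as follows. The names are hypothetical, and the model assumes that every run of one event stays on the same site, as the table suggests; it is not taken from the plugin's source.

```python
from itertools import cycle, islice


def category_pages(site, categories, max_pages_to_check=1):
    """Yield page labels for one site: a full first pass, then first pages only."""
    first_pass = True
    while True:
        for category, total_pages in categories:
            limit = total_pages if first_pass else min(total_pages, max_pages_to_check)
            for page in range(1, limit + 1):
                yield f"{site} - {category} - Page {page}"
        first_pass = False


def simulate(sites, run_count, interval, events):
    """Return (minute, [pages crawled in this event]) for each event."""
    # Assumption: sites alternate between events, run_count pages per event.
    streams = cycle([category_pages(name, cats) for name, cats in sites])
    return [(n * interval, list(islice(next(streams), run_count)))
            for n in range(events)]


SITES = [
    ("Site 1", [("Category 1.1", 2), ("Category 1.2", 1)]),
    ("Site 2", [("Category 2.1", 1), ("Category 2.2", 2), ("Category 2.3", 2)]),
]

for minute, pages in simulate(SITES, run_count=2, interval=5, events=6):
    print(minute, ", ".join(pages))
```

Raising `run_count` shortens the initial phase in which every page of every category is visited, which is the point of the tip that follows.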

Tip

You can increase this number to quickly collect all URLs from all pages of the categories at first. For example, if a category has 30 pages, you can set this number to 3 until URLs of all 30 pages are collected. After that, you can decrease this value because the plugin will collect only new URLs from the category pages from that point on.

Note

All runs are executed back to back when the event fires. For example, if an event’s run count is set to 10 and its interval is set to Every 5 minutes, each run starts as soon as the previous run finishes. The 5-minute interval is not divided into 10 equal intervals.
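
The timing difference can be made concrete with a small calculation. The run durations below are invented for illustration, and the function name is hypothetical.

```python
from itertools import accumulate


def back_to_back_starts(event_start, run_durations):
    """Start time of each run when a run begins as soon as the previous one ends."""
    return [event_start + offset
            for offset in accumulate([0] + run_durations[:-1])]


# Run count 10, each run assumed to take 4 seconds, event fired at t = 0 s.
print(back_to_back_starts(0, [4] * 10))
# Evenly dividing a 300-second interval would instead give these start times,
# which is NOT how the runs are scheduled:
print(list(range(0, 300, 30)))
```

Under these assumed durations all 10 runs finish within the first 40 seconds, and nothing else happens until the event fires again 5 minutes after the previous one.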

7.1.1.7. Run count for post crawling event

This setting defines how many times the post crawling event should run each time the interval defined in Post Crawl Interval elapses. For an example case, see Post Crawl Interval.

Note

All runs are executed back to back when the event fires. For example, if an event’s run count is set to 10 and its interval is set to Every 5 minutes, each run starts as soon as the previous run finishes. The 5-minute interval is not divided into 10 equal intervals.