1.1. About WordPress Content Crawler
WordPress Content Crawler is a WordPress plugin that can crawl (scrape, grab, retrieve) content from almost any site on the Internet (See: Can I get content from X site?). It uses CSS selectors to locate and retrieve content in the target web page’s source code. CSS selectors are easy to learn and make it simple to pinpoint information on the target site. The plugin also comes with a Visual Inspector: click an element to find its CSS selector. The tool can also find a CSS selector that matches items similar to the one you click, which is helpful if you want to retrieve, for example, all URLs in a category page. For more information, please refer to its documentation.
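To illustrate how a CSS selector can pick out all post URLs on a category page, here is a minimal Python sketch. It is not the plugin's code: the HTML fragment, the `a.post-link` selector, and the helper names are hypothetical, and the matcher only understands the simple `tag.class` selector form because Python's standard library has no full CSS selector engine.

```python
from html.parser import HTMLParser


class SimpleSelector(HTMLParser):
    """Collects one attribute from every element matching a `tag.class` selector."""

    def __init__(self, selector, attribute):
        super().__init__()
        # "a.post-link" -> tag "a", class "post-link"; "a" -> tag "a", no class.
        self.tag, _, self.css_class = selector.partition(".")
        self.attribute = attribute
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == self.tag and (not self.css_class or self.css_class in classes):
            value = attrs.get(self.attribute)
            if value is not None:
                self.matches.append(value)


def select_attribute(html, selector, attribute):
    """Return the given attribute of every element matched by the selector."""
    parser = SimpleSelector(selector, attribute)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical category page: two post links and one pagination link.
CATEGORY_PAGE = """
<ul class="posts">
  <li><a class="post-link" href="/post/first">First</a></li>
  <li><a class="post-link featured" href="/post/second">Second</a></li>
  <li><a class="nav" href="/page/2">Next page</a></li>
</ul>
"""

post_urls = select_attribute(CATEGORY_PAGE, "a.post-link", "href")
print(post_urls)
```

The single selector matches both post links and skips the pagination link, which is the kind of "similar items" selection described above.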
The plugin has over 200 settings so that you can not only retrieve content but also shape it the way you want. For example, you can change the HTML of elements in the source code, remove elements, change attributes of elements, exchange the values of two attributes of an element, find and replace anything in the source code, assign certain elements to a short code and use them in the templates, and much more. You can refer to the Site Settings Page section to learn more about all of the plugin’s settings. You can also read the Common Options section.
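As an illustration of one of these manipulations, exchanging the values of two attributes, the following Python sketch swaps `src` and `data-src` on `img` elements. This is only a sketch under stated assumptions: the function name, the lazy-loading scenario, and the sample HTML are hypothetical and do not reflect how the plugin implements the option. The rewriter also drops comments and doctype declarations, which a real implementation would need to preserve.

```python
from html import escape
from html.parser import HTMLParser


class AttributeSwapper(HTMLParser):
    """Re-emits HTML, exchanging two attribute values on matching tags."""

    def __init__(self, tag, first, second):
        # Keep character references as-is so text is re-emitted unchanged.
        super().__init__(convert_charrefs=False)
        self.tag, self.first, self.second = tag, first, second
        self.output = []

    def _swap(self, tag, attrs):
        values = dict(attrs)
        if tag != self.tag or self.first not in values or self.second not in values:
            return attrs
        swapped = {self.first: values[self.second], self.second: values[self.first]}
        return [(name, swapped.get(name, value)) for name, value in attrs]

    def _render(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.output.append(self._render(tag, self._swap(tag, attrs)))

    def handle_startendtag(self, tag, attrs):
        self.output.append(self._render(tag, self._swap(tag, attrs), " />"))

    def handle_endtag(self, tag):
        self.output.append(f"</{tag}>")

    def handle_data(self, data):
        self.output.append(data)

    def handle_entityref(self, name):
        self.output.append(f"&{name};")

    def handle_charref(self, name):
        self.output.append(f"&#{name};")


def exchange_attributes(html, tag, first, second):
    """Return html with the values of `first` and `second` exchanged on `tag`."""
    swapper = AttributeSwapper(tag, first, second)
    swapper.feed(html)
    swapper.close()
    return "".join(swapper.output)


# Hypothetical lazy-loaded image: the real URL sits in data-src.
source = '<p>Intro</p><img src="placeholder.gif" data-src="photo.jpg" alt="A photo">'
result = exchange_attributes(source, "img", "src", "data-src")
print(result)
```

Elements that lack either attribute are left untouched, so the swap only affects the images it was meant for.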
The plugin works using WP-Cron, a WordPress feature that lets plugins and themes define jobs to be triggered at certain times. These jobs run in the background. The plugin defines jobs to automatically crawl, recrawl (update), and delete posts. This way, the plugin can run in the background and do all of its work automatically. Hence, a working WP-Cron is a requirement of the plugin. To learn more about all requirements of the plugin, please refer to the Requirements section.
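The idea of recurring background jobs can be sketched conceptually in Python. This is not WP-Cron and not the plugin's code: the `Scheduler` class, the job names, and the intervals are hypothetical, chosen only to show how jobs that are due get run and then rescheduled.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class Job:
    next_run: float  # the only field used for ordering in the queue
    name: str = field(compare=False)
    interval: float = field(compare=False)
    callback: Callable[[], None] = field(compare=False)


class Scheduler:
    """Runs recurring jobs whose next run time has been reached."""

    def __init__(self):
        self._queue = []

    def schedule(self, name, interval, callback, first_run=0.0):
        if interval <= 0:
            raise ValueError("interval must be positive")
        heapq.heappush(self._queue, Job(first_run, name, interval, callback))

    def run_due(self, now):
        """Run every job due at `now`, reschedule it, and return the names run."""
        ran = []
        while self._queue and self._queue[0].next_run <= now:
            job = heapq.heappop(self._queue)
            job.callback()
            ran.append(job.name)
            job.next_run = now + job.interval
            heapq.heappush(self._queue, job)
        return ran


log = []
scheduler = Scheduler()
# Hypothetical intervals in seconds for the three kinds of jobs described above.
scheduler.schedule("collect_urls", 60, lambda: log.append("collect"))
scheduler.schedule("crawl_posts", 120, lambda: log.append("crawl"))
scheduler.schedule("delete_posts", 3600, lambda: log.append("delete"))

first = scheduler.run_due(now=0)
second = scheduler.run_due(now=60)
third = scheduler.run_due(now=120)
print(sorted(first), second, sorted(third))
```

Nothing runs unless `run_due` is called, which mirrors why a working scheduler is a hard requirement: if the trigger never fires, no crawling, recrawling, or deleting happens.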
Here is a video tutorial that explains the plugin’s working logic, along with some CSS knowledge. This video is a good starting point to understand the plugin.
For other video tutorials, you can check out this YouTube playlist.
1.1.1. Main Features
You can find guides on how to do the things listed below on the Guides page.
- Save every post detail
- Title, excerpt, content, tags, categories, slug, date, custom meta, taxonomies, meta keywords, meta description, featured image, post images, status… Just everything.
- Crawl (scrape, grab, save) posts
- After the settings are configured, the plugin finds URLs of the posts and crawls them automatically in the background.
- Delete posts
- You want to delete old crawled posts? The plugin can delete them automatically.
- Save categories
- The target category does not exist in your site? No problem. The plugin can create the target categories for you. Just define the CSS selectors that find category names. They can even be created as subcategories.
- Save taxonomies
- Save taxonomy values by retrieving them from the target site or entering manually. Saving details of custom post types is easier than ever.
- Custom post meta
- Save anything as custom post meta. You can use a CSS selector or just type the value.
- Alternative selectors
- You can write alternative selectors to get the data even if the target site has post pages designed differently from each other.
- Paginated posts
- Target post has more than one page? No worries. You can save paginated posts as well.
- Remove unnecessary elements
- Sometimes you need to get rid of some elements, such as advertisements, comments, you name it. Just write its CSS selector and it is removed.
- Post types
- Set post type. It can be a post, a page, a product, or any other post type available in your WordPress installation.
- Password protection
- You can set a password for the posts to show them only to the users who have the password.
- Test everything on the fly
- Test post crawling, URL collection, CSS selectors, regular expressions, find and replace options and proxies on the fly. You can also enable caching to perform the tests much faster and reduce the requests sent to the target site.
- Using the tools, you can save posts manually with their URL, recrawl posts with their ID or delete already-saved URLs.
- Post status
- You can directly publish the saved posts or keep them as drafts to check them before publishing.
- Save images as gallery
- You can save the images in the target page as a gallery and provide a template for each image to make it suitable for the gallery library that you use on the frontend. You can also save the images as a WooCommerce gallery by just checking one checkbox.
- Use a proxy or proxies to get content from the sites to which your IP does not have access.
- Crawl as many posts as you want
- You can set how many times post crawling or URL collection CRON events should run. This way, you can, e.g., save 100 posts every minute. Just be careful and consider your server’s capacity.
- Get data from JSON
- When you enable JSON parsing for a CSS selector, you can get the values from the JSON easily.
- Automatic translation
- Use the artificial intelligence of Google Cloud Translation API, Microsoft Translator Text API, Yandex Translate API or Amazon Translate API to automatically translate the posts. Note that these are paid services, except for Yandex Translate API. The paid ones also offer the service for free for a limited amount of time. You can see their pricing pages to learn more.
- Automatic spinning
- Use spinning to automatically rewrite crawled posts’ contents to improve search engine optimization. The plugin currently implements only Spin Rewriter API, which is a paid service. You can visit their website to learn the pricing details.
- Duplicate post check
- Check for duplicate posts by URL, post title and/or post content. If you are using WooCommerce, products whose SKU already exists are considered duplicates and will not be added to your site.
- Save WooCommerce products
- Save price, inventory, shipping, attributes, and advanced options. You can save the product as a simple or an external product. You can also set downloadable file options and define the product as virtual. The options are available for WooCommerce versions greater than or equal to 3.3.
- Handle files like a pro
- Rename, copy, and move saved files easily. You can also define title, description, caption, and alt texts for the saved media files using templates in which you can use any short code. It is also possible to give random names to the saved files.
- Quick save
- With the quick save button, you can save the settings much more quickly. No need to wait for the page to reload.
- Save “srcset” attributes
- When alternative sizes of the saved images are available, the plugin assigns them to the srcset attribute of img elements so that your pages load faster on different screen sizes.
- Learn when there is a problem. The plugin will show you the details of the error so that you can fix it right away.
- Navigate between settings easily
- Fix navigation to the top! The plugin stores where you were before switching to a new tab and restores your previous location when you activate that tab again. No more getting lost among the settings.
- Add URLs to the database
- The plugin collects URLs automatically. However, if you want it to crawl only certain URLs, you can add them to the database manually using the manual crawling tool. This way, the specified URLs will be crawled automatically according to your scheduling options.
- You can import and export site settings easily. Just copy and paste the code created by the plugin.
- Detailed dashboard
- See what’s going on in the background. Active sites, number of posts crawled, number of posts updated, last crawled and updated posts, last added URLs, last and next run of CRON events, currently being saved posts and URLs…
- Use the most secure PHP
- The plugin supports the latest versions of PHP.
- Online documentation
- You can check the online documentation whenever you feel a need.
- Video tutorials
- Watch video tutorials to easily learn how to use the plugin.
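The duplicate post check listed above can be sketched in Python as follows. This is a minimal illustration, not the plugin's implementation: the `DuplicateChecker` class, the rule that a match on any enabled criterion counts as a duplicate, and the case and whitespace normalization are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    url: str
    title: str
    content: str


def _normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


class DuplicateChecker:
    """Flags a post as duplicate if any enabled criterion matches a saved post."""

    def __init__(self, check_url=True, check_title=False, check_content=False):
        flags = {"url": check_url, "title": check_title, "content": check_content}
        self.enabled = [name for name, on in flags.items() if on]
        self.seen = {"url": set(), "title": set(), "content": set()}

    @staticmethod
    def _keys(post):
        return {
            "url": post.url.rstrip("/"),
            "title": _normalize(post.title),
            "content": _normalize(post.content),
        }

    def is_duplicate(self, post):
        keys = self._keys(post)
        return any(keys[name] in self.seen[name] for name in self.enabled)

    def remember(self, post):
        for name, key in self._keys(post).items():
            self.seen[name].add(key)


checker = DuplicateChecker(check_url=True, check_title=True)
checker.remember(Post("https://example.com/a/", "Hello World", "First body"))

same_url = Post("https://example.com/a", "Another title", "Other body")
same_title = Post("https://example.com/b", "hello   world", "Other body")
new_post = Post("https://example.com/c", "Fresh title", "First body")

print(checker.is_duplicate(same_url), checker.is_duplicate(same_title), checker.is_duplicate(new_post))
```

Because the content check is disabled in this usage example, the third post is accepted even though its body matches a saved one.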
To learn more about how to install and activate the plugin, please refer to the Installation and Activation section.