In the digital age, almost everyone has an online presence. Most people will look online before setting foot in a store because everything is available online, even if it’s just information on where to get the best products. We even look up cinema times online!
As such, staying ahead of the competition regarding visibility is no longer merely a matter of having a good marketing strategy. Newspaper and magazine articles, television and radio advertising, and even billboards (for those who can afford them) are no longer enough, even though they’re still arguably necessary.
Now, you also have to ensure that your site is better than your competitors’, from layout to content, and beyond. If you don’t, you’ll slip away into obscurity, like a well-kept secret among the locals—which doesn’t bode well for any business.
This is where search engine optimization (SEO) comes in. A host of SEO tools and techniques is available to help put you ahead and improve your search engine ranking, and with it your online visibility. These range from your use of keywords, backlinks, and imagery to your layout and categorization (usability and customer experience). One of these tools is the website crawler.
What is a Website Crawler?
A website crawler is a software program used to scan sites, reading the content (and other information) so as to generate entries for the search engine index. All search engines use website crawlers (also known as spiders or bots). They typically work on submissions made by site owners and “crawl” new or recently modified sites and pages to update the search engine index.
The crawler earned its moniker based on the way it works: by crawling through each page one at a time, following internal links until the entire site has been read, as well as following backlinks to determine the full scope of a site’s content. Crawlers can also be set to read the entire site or only specific pages that are then selectively crawled and indexed. By doing so, the website crawler can update the search engine index on a regular basis.
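The page-by-page traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SITE` dictionary is a hypothetical in-memory stand-in for real pages, and `crawl` is a made-up helper name, not part of any search engine's actual crawler.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site": each URL maps to the hrefs found on that page.
SITE = {
    "https://example.com/": ["/about", "/products", "https://other.example/"],
    "https://example.com/about": ["/", "/contact"],
    "https://example.com/products": ["/products/widget"],
    "https://example.com/products/widget": ["/products"],
    "https://example.com/contact": [],
}


def crawl(start_url, fetch_links):
    """Visit every internal page reachable from start_url, breadth-first."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for href in fetch_links(url):
            absolute = urljoin(url, href)
            # Stay on the same host and never queue a page twice.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


pages = crawl("https://example.com/", lambda url: SITE.get(url, []))
```

A real crawler would replace the lambda with a function that downloads the page and extracts its links, but the queue-and-seen-set logic stays the same.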
Website crawlers don’t have free rein, however. The Standard for Robot Exclusion (SRE) dictates the so-called “rules of politeness” for crawlers. Under these specifications, a crawler first requests information from the respective server (in practice, its robots.txt file) to discover which files it may and may not read, and which files it must exclude from its submission to the search engine index. Crawlers that abide by the SRE are also unable to bypass firewalls, a further implementation designed to protect site owners’ privacy rights.
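To make the exclusion rules concrete, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt content and the `ExampleBot` user agent are invented for illustration; a real crawler would download the file from the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a site owner might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a (made-up) crawler named ExampleBot may read each URL.
allowed = parser.can_fetch("ExampleBot", "https://example.com/products")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")

# The delay, in seconds, that the site asks crawlers to leave between requests.
delay = parser.crawl_delay("ExampleBot")
```

Note that these rules are advisory: they only restrain crawlers that choose to check them.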
Lastly, the SRE also requires that website crawlers use a specialized algorithm. This algorithm allows the crawler to create search strings of operators and keywords, which are built into the database (search engine index) of websites and pages for future search results. The algorithm also stipulates that the crawler waits between successive server requests, to prevent it from negatively impacting the site’s response time for real (human) users visiting the site.
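The waiting rule can be sketched as a small throttle. This is an assumption-laden illustration, not a prescribed algorithm: `PoliteThrottle` and its parameters are hypothetical names, and the delay value is arbitrary.

```python
import time


class PoliteThrottle:
    """Enforce a minimum gap between successive requests to one server."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until min_delay seconds have passed since the last request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


throttle = PoliteThrottle(min_delay=0.05)
for path in ["/", "/about", "/contact"]:
    throttle.wait()
    # A real crawler would issue its HTTP request for `path` here.
```

The clock and sleep functions are injectable so the behavior can be checked without actually waiting.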
What Are the Benefits of Using a Website Crawler?
The search engine index is a list where the search engine’s data is stored, allowing it to produce the search engine results page (SERP). Without this index, search engines would take considerably longer to generate results. Each time you make a query, the search engine would have to go through every single website and page (or other data) relating to the keyword(s) used in your search. Not only that, but it would also have to follow up on any other information each page has access to (including backlinks, internal site links, and the like) and then make sure the results are structured in a way that presents the most relevant information first.
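A toy version of such an index shows why lookups are fast: the work of reading every page is done once, up front. The sketch below assumes a hypothetical set of already-crawled pages and uses made-up helper names (`build_index`, `search`); real search engine indexes are far more elaborate.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> extracted text.
PAGES = {
    "https://example.com/": "fresh coffee beans roasted daily",
    "https://example.com/tea": "loose leaf tea and fresh herbal blends",
    "https://example.com/gear": "coffee grinders and tea kettles",
}


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs containing every word in the query, sorted for stable output."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


index = build_index(PAGES)
results = search(index, "fresh coffee")
```

Answering a query now means a couple of dictionary lookups and a set intersection, rather than rereading every page.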
This means that without a website crawler to build the index, each time you type a query into your search bar, the search engine would take minutes (if not hours) to produce any results. While this is an obvious benefit for users, what is the advantage for site owners and managers?
Using the algorithm mentioned above, the website crawler reviews sites for the above information and develops a database of search strings. These strings include keywords and operators, which are the search commands used (and which are usually archived per IP address). This database is then uploaded to the search engine index to update its information, accommodating new sites and recently updated site pages to ensure fair (but relevant) opportunity.
Crawlers, therefore, allow for businesses to submit their sites for review and be included in the SERP based on the relevancy of their content. Without overriding current search engine ranking based on popularity and keyword strength, the website crawler offers new and updated sites (and pages) the opportunity to be found online. Not only that, but it allows you to see where your site’s SEO ranking can be improved.
How to Choose a Website Crawler?
Site crawlers have been around since the early 90s. Since then, hundreds of options have become available, each varying in usability and functionality. New website crawlers seem to pop up every day, making it an ever-expanding market. But, developing an efficient website crawler isn’t easy—and finding the right option can be overwhelming, not to mention costly if you happen to pick the wrong one.
Here are seven things to look out for in a website crawler:
Now that you know what to look for in a website crawler, it’s time we made things easier for you by narrowing your search down from (literally) thousands to the best 60 options.
1. DYNO Mapper
With a focus on sitemap building (which the website crawler feature uses to determine which pages it’s allowed to read), DYNO Mapper is an impressive and functional software option.
DYNO Mapper’s website crawler lets you enter the URL (Uniform Resource Locator—the website address, such as www.example.com) of any site and instantly discover its site map, and build your own automatically.
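For readers curious what a site map looks like to a program, here is a rough sketch that reads an XML sitemap with Python's standard library. The sitemap content is invented, `sitemap_urls` is a hypothetical helper, and nothing here reflects how DYNO Mapper itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap in the sitemaps.org XML format.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>
"""

# Namespace prefix used when searching the parsed tree.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every page address listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


urls = sitemap_urls(SITEMAP_XML)
```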
There are three packages to choose from, each allowing a different number of projects (sites) and crawl limitations regarding the number of pages scanned. If you’re only interested in your site and a few competitors, the Regular package (at $480 a year, billed annually) is a good fit. However, their Freelancer ($696 per year) and Most Popular ($1296 a year) packages are better options for more advanced users, especially those who want to be able to crawl numerous sites and up to 50 000 pages.
With a 14-day free trial (and two months off if you do opt for annual billing), you can’t go wrong.
2. Screaming Frog SEO Spider
Screaming Frog offers a host of search engine optimization tools, and their SEO Spider is one of the best website crawlers available. You’ll instantly find where your site needs improvement, discovering broken links and differentiating between temporary and permanent redirects.
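The distinction between redirect types comes down to HTTP status codes. The sketch below is a generic illustration, not Screaming Frog's logic: `classify` is a hypothetical helper that labels a response by its status code.

```python
from http import HTTPStatus

# Status codes that tell clients the move is permanent (301, 308).
PERMANENT = {HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.PERMANENT_REDIRECT}
# Status codes that signal a temporary move (302, 303, 307).
TEMPORARY = {HTTPStatus.FOUND, HTTPStatus.SEE_OTHER, HTTPStatus.TEMPORARY_REDIRECT}


def classify(status_code):
    """Label a link check result by its HTTP status code."""
    if status_code in PERMANENT:
        return "permanent redirect"
    if status_code in TEMPORARY:
        return "temporary redirect"
    if 400 <= status_code < 600:
        return "broken"
    return "ok"


labels = {code: classify(code) for code in (200, 301, 302, 404)}
```

The difference matters for SEO because a permanent redirect tells search engines to index the new address, while a temporary one tells them to keep the old address.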
While their free version is somewhat competent, to get the most out of the Screaming Frog SEO Spider tool, you’ll want to opt for the paid version. Priced at about $197 (paid on an annual basis), it allows for unlimited pages (memory dependent) as well as a host of functions missing from the free version. These include crawl configuration, Google Analytics integration, customized data extraction, and free technical support.
Screaming Frog claim that some of the biggest sites use their services, including Apple, Disney, and even Google themselves. The fact that they’re regularly featured in some of the top SEO blogs goes a long way to promote their SEO Spider.
3. DeepCrawl
DeepCrawl is something of a specialized website crawler, admitting on their homepage that they’re not a “one size fits all tool.” They offer a host of solutions, however, which you can integrate or leave out as you choose, depending on your needs. These include regular crawls for your site (which can be automated), recovery from Panda and (or) Penguin penalties, and comparison to your competitors.
There are five packages to choose from, ranging from $864 annually (you get one month free by opting for an annual billing cycle) to as high as $10 992 a year. Their corporate package, which offers the most features, is individually priced, and you’ll need to contact their support team to work out a cost.
Overall, the Agency package ($5484 a year) is their most affordable option for anyone wanting telephonic support and three training sessions. However, the Consultant plan ($2184 annually) is quite capable of meeting most site owners’ needs and does include email support.
4. Apifier
Designed to extract the site map and data from websites, Apifier processes information in a readable format for you surprisingly quickly (they claim to do so in a matter of seconds, which is impressive, to say the least).
Developers do have the option of signing up for free, but the free package does not include all the basics. To get the best out of Apifier, you’ll want to opt for the Medium Business plan at $1548 annually ($129 a month), but the Extra Small option at $228 annually is also quite competent.
5. OnCrawl
Since Google understands only a portion of your site, OnCrawl offers you the ability to read all of it with semantic data algorithms and analysis with daily monitoring.
The features available include SEO audits, which can help you improve your site’s search engine optimization and identify what works and what doesn’t. You’ll be able to see exactly how your SEO and usability are affecting your traffic (number of visitors). OnCrawl even monitors how well Google can read your site with their crawler and will help you to improve and control what does and doesn’t get read.
OnCrawl’s Starter package ($136 a year) comes with a 30-day money-back guarantee, but it’s so limited that you’ll likely end up upgrading to one of the bigger packages, which don’t offer the same guarantee. Pro will set you back $261 a year (you get two months free with the annual plan) but will also cover almost every requirement.
6. SEO Chat Website Crawler and XML Site Map Builder
We now start moving away from the paid website crawlers to the free options available, starting with the SEO Chat Website Crawler and XML Site Map Builder. Also referred to as SEO Chat’s Ninja Website Crawler Tool, the online software mimics the Google sitemap generator to scan your site. It also offers spell checking and identifies page errors, such as broken links.
It’s incredibly easy to use and integrates with any number of SEO Chat’s other free online SEO tools. After entering the site URL (either typing it out or using copy/paste), you can choose whether you want to scan up to 100, 500, or 1000 pages from the site.
Of course, there are some limitations in place. You’ll have to register (albeit for free) if you want the tool to crawl more than 100 pages, and you can only run five scans a day.
7. Webmaster World Website Crawler Tool and Google Sitemap Builder
The Webmaster World Website Crawler Tool and Google Sitemap Builder is another free scanner available online. Designed and developed in a very similar manner to the SEO Chat Ninja Website Crawler Tool above, it also allows you to punch in (or copy/paste) a site URL and opt to crawl up to 100, 500, or 1000 of its pages. Because the two tools have been built using almost the same code, it comes as no surprise that you’ll need to register for a free account if you want it to scan more than 100 pages.
Another similarity is that it can take up to half an hour to complete a website crawl, but allows you to receive the results via email. Unfortunately, you’re still limited to five scans per day.
However, where the Webmaster World tool does outshine the SEO Chat Ninja is in its site builder capabilities. Instead of being limited to XML, you’ll be able to use HTML too. The data provided is also interactive.
8. Rob Hammond’s SEO Crawler
Rob Hammond offers a host of architectural and on-page search engine optimization tools, one of which is a highly efficient free SEO Crawler. The online tool allows you to scan website URLs on the move, being compatible with a limited range of devices that seem to favor Apple products. There are also some advanced features that allow you to include, ignore, or even remove regular expressions (the search strings we mentioned earlier) from your crawl.
Results from the website crawl are in a TSV file, which can be downloaded and used with Excel. The report includes any SEO issues that are automatically discovered, as well as a list of the total external links, meta keywords, and much more besides.
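If you would rather process such a report in code than in Excel, a TSV file can be read with Python's standard `csv` module. The column names and rows below are hypothetical, since the exact layout of the report is not described here.

```python
import csv
import io

# Hypothetical excerpt of a crawl report; real column names may differ.
REPORT_TSV = (
    "url\tstatus\ttitle\n"
    "https://example.com/\t200\tHome\n"
    "https://example.com/old\t404\tNot Found\n"
)

# Tab-separated values are just CSV with a tab as the delimiter.
rows = list(csv.DictReader(io.StringIO(REPORT_TSV), delimiter="\t"))

# Collect the pages whose status code signals a client or server error.
broken = [row["url"] for row in rows if row["status"].startswith(("4", "5"))]
```

For a downloaded file, swap the `io.StringIO` object for `open(path, newline="")`.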
The only catch is that you can only search up to 300 URLs for free. It isn’t made clear on Hammond’s site whether this is tracked according to your IP address, or if you’ll have to pay to make additional crawls—which is a disappointing omission.
9. WebCrawler.com
WebCrawler.com is easily the most obviously titled tool on our list, and the site itself seems a little overly simplistic, but it’s quite functional. The search function on the site’s homepage is a little deceptive, acting as a search engine would and bringing up results of the highest ranking pages containing the URL you enter. There is a certain genius to this, though: you can immediately see which pages are ranking better than others, which allows you to quickly determine which SEO methods are working best for your sites.
One of the great features of WebCrawler.com is that you can integrate it into your site, allowing your users to benefit from the tool. By adding a bit of HTML code to your site (which they provide for you free of charge as well), you can have the WebCrawler.com tool appear on your site as a banner, sidebar, or text link.
10. Web Crawler by Diffbot
Another rather simply named online scanner, the Web Crawler by Diffbot is a free version of the API Crawlbot included in their paid packages. It extracts information on a range of features of pages. The data contained are titles, text, HTML coding, comments, date of publication, entity tags, author, images, videos, and a few more.
While the site claims to crawl pages within seconds, it can take a few minutes if there are a lot of internal links on your site. There’s an ill-structured web results page that can be viewed online, but you can also download the report in one of two formats: CSV or JSON.
You’re also limited in the number of searches, but it isn’t stipulated as to exactly what that limitation is—although you can share the tool on social media to gain 300 more crawls before being prompted to sign up for a 14-day free trial for any of Diffbot’s paid packages.
11. The Internet Archive’s Heritrix
The Internet Archive’s Heritrix is the first open source website crawler we’ll be mentioning. Because it (and, in fact, the rest of the crawlers that follow it on our list) requires some knowledge of coding and programming languages, it’s not for everyone, but it’s still well worth the mention.
Everyone is free to download and use Heritrix, for redistribution and (or) modification (allowing you to build your website crawler using Heritrix as a foundation), within the limitations stipulated in the Apache License.
12. Apache Nutch
Nutch 2.x, on the other hand, stems from Nutch 1.x but is still in development (it’s still usable, however, and you can use it as a foundation for developing your own website crawler). The key difference is that Nutch 2.x uses Apache Gora, allowing for the implementation of a more flexible model/stack storage solution.
Both versions of Apache Nutch are modular and provide interface extensions like parsing, indexation, and a scoring filter. While it’s capable of running off a single workstation, Apache does recommend that users run it on a Hadoop cluster for maximum effect.
13. Scrapy
Scrapy is a collaborative open source website crawler framework, designed with Python for cross-platform use. Developed to provide the basis for a high-level web crawler tool, Scrapy is capable of performing data mining as well as monitoring, with automated testing. Because the coding allows for requests to be submitted and processed asynchronously, you can run multiple crawl types (for quotes, for keywords, for links, et cetera) at the same time. If one request fails or an error occurs, it also won’t interfere with the other crawls running at the same time.
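The idea of asynchronous requests that fail independently can be illustrated with Python's built-in `asyncio` module. To be clear, this is not Scrapy code (Scrapy has its own engine); `fetch` and `crawl_all` are hypothetical stand-ins, and the failure is simulated rather than caused by a real network error.

```python
import asyncio


async def fetch(url):
    """Stand-in for a network request; one URL is made to fail on purpose."""
    await asyncio.sleep(0.01)
    if "broken" in url:
        raise ConnectionError(f"could not reach {url}")
    return f"content of {url}"


async def crawl_all(urls):
    """Run every request concurrently and collect results in input order."""
    # return_exceptions=True keeps one failed request from cancelling the rest.
    return await asyncio.gather(*(fetch(url) for url in urls), return_exceptions=True)


urls = [
    "https://example.com/a",
    "https://example.com/broken",
    "https://example.com/b",
]
results = asyncio.run(crawl_all(urls))
```

The failed request shows up in `results` as an exception object, while the other two entries hold their content as usual.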
This flexibility allows for very fast crawls, but Scrapy is also designed to be SRE compliant. Using the actual coding and tutorials, you can quickly set up waiting times, limits on the number of searches an IP range can do in a given period, or even restrict the number of crawls done on each domain.
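As a rough idea of what that setup looks like, the fragment below shows a few politeness-related options as they might appear in a Scrapy project's `settings.py`. The setting names are Scrapy's own, but the values are arbitrary examples chosen for illustration, not recommendations.

```python
# settings.py for a hypothetical Scrapy project; values are illustrative.

# Honor each site's robots.txt rules before requesting a page.
ROBOTSTXT_OBEY = True

# Wait this many seconds between successive requests to the same site.
DOWNLOAD_DELAY = 2.0

# Cap the number of simultaneous requests sent to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy's AutoThrottle extension adjust the delay to server load.
AUTOTHROTTLE_ENABLED = True
```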
14. DataparkSearch Engine
Developed using C++ and compatible with several platforms, DataparkSearch Engine is designed to organize search results in a website, group of websites, local systems, and intranets. Some of the key features include HTTP, HTTPS, FTP, NNTP, and news URL scheme support, as well as an htdb URL for SQL database indexation. DataparkSearch Engine is also able to index text/plain, text/xml, text/html, audio/mpeg, and image/gif types natively, as well as multilingual websites and pages with content negotiation.
Using vector calculation, results can be sorted by relevancy. Popularity ranking reports are classified as “Goo,” which adds weight to incoming links, as well as “Neo,” based on the neural network model. You can also view your results according to the last time a site or page has been modified, or by a combination of relevancy and popularity rank to determine its importance. DataparkSearch Engine also allows for a significant reduction in search times by incorporating active caching mechanisms.
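One common way to rank by relevancy with vectors is cosine similarity over word counts. The sketch below shows that general technique only; it is an assumption that it resembles DataparkSearch Engine's calculation, and the documents, `cosine`, and `rank` are all invented for the example.

```python
import math
from collections import Counter

# Hypothetical indexed documents: name -> text.
DOCS = {
    "a": "web crawler index",
    "b": "crawler crawler crawler",
    "c": "cooking recipes",
}


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return the names of matching documents, most relevant first."""
    query_vector = Counter(query.lower().split())
    scored = [
        (cosine(query_vector, Counter(text.lower().split())), name)
        for name, text in docs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for score, name in scored if score > 0]


ranking = rank("crawler", DOCS)
```

Document "b" ranks first because every word in it matches the query, while "c" shares no words with the query and is dropped.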
15. GNU Wget
Formed as a free software package, GNU Wget leans toward retrieving information on the most common internet protocols, namely HTTP, HTTPS, and FTP. Not only that, but you’ll also be able to mirror a site (if you so wish) using some of GNU Wget’s many features.
If a download of information and files is interrupted or aborted for any reason, the REST and RANGE commands allow you to resume the process quickly and easily. GNU Wget uses NLS-based message files, making it suitable for a wide array of languages, and can utilize wildcard file names.
GNU Wget was developed in the C programming language and is intended for use on Linux servers (but it also runs on other UNIX-like operating systems, as well as on Windows).
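For HTTP, resuming works by asking the server for only the bytes that are still missing, via a Range request header. The sketch below shows that idea in Python rather than Wget's own C code; `resume_headers` is a hypothetical helper and the partial file is simulated.

```python
import os
import tempfile


def resume_headers(partial_path):
    """Return the Range header that requests only the bytes not yet saved."""
    saved = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # With nothing saved yet, no Range header is needed: fetch the whole file.
    return {"Range": f"bytes={saved}-"} if saved else {}


# Simulate a download that stopped after 1,024 bytes.
with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "archive.zip.part")
    with open(partial, "wb") as handle:
        handle.write(b"\0" * 1024)
    headers = resume_headers(partial)
    fresh = resume_headers(os.path.join(tmp, "missing.part"))
```

A server that supports ranges answers such a request with only the remaining bytes, which the client appends to the partial file.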
16. Grub Next Generation
Designed as website crawling software for clients and servers, Grub Next Generation assists in creating and updating search engine indexes. This makes it a viable option for anyone developing their own search engine platform, as well as those looking to discover how well existing search engines can crawl and index their site.
It’s also operating system independent, making it a cross-platform program, and can be implemented in coding schemes using Perl, Python, C, and C# alike. The program also translates into several languages, namely Dutch, Galician, German, French, Spanish, Polish, and Finnish.
The most recent update included two new features, allowing users to alter admin upload server settings as well as adding more control over client usage. Admittedly, this update was as far back as mid-June 2011, and Freecode (the underlying source of the Grub Next Generation platform) stopped providing updates three years later. However, it’s still a reliable web crawling tool worth the mention.
17. HTTrack Website Copier
The HTTrack Website Copier is a free, easy-to-use offline website crawler developed with C and C++. Available as WinHTTrack for Windows 2000 and up, as well as WebHTTrack for Linux, UNIX, and BSD, HTTrack is one of the most flexible cross-platform software programs on the market.
HTTrack lets you download websites to a local directory, rebuilding all the directories recursively and sourcing HTML, images, and other files. Because it arranges the site’s link structure relatively, you’ll have the freedom of opening the mirrored version in your browser and navigating the site offline.
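Relative link structure is what makes a mirrored copy work from any folder on your disk. The sketch below illustrates the general idea in Python; it is not HTTrack's implementation, and `relative_link` is a hypothetical helper that assumes both URLs belong to the same site.

```python
import posixpath
from urllib.parse import urlparse


def relative_link(from_page, to_resource):
    """Rewrite an absolute same-site URL as a path relative to the linking page."""
    from_dir = posixpath.dirname(urlparse(from_page).path) or "/"
    to_path = urlparse(to_resource).path or "/"
    return posixpath.relpath(to_path, start=from_dir)


# A blog post two folders deep linking to an image near the site root.
link = relative_link(
    "https://example.com/blog/2024/post.html",
    "https://example.com/images/logo.png",
)
```

Because the rewritten link no longer mentions the domain, the browser resolves it against the local copy instead of going back online.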
Furthermore, if the original site is updated, HTTrack will pick up on the modifications and update your offline copy. If the download is interrupted at any point for any reason, the program is also able to resume the process automatically.
HTTrack has an impressive help system integrated as well, allowing you to mirror and crawl sites without having to worry if anything goes wrong.
18. Norconex Collectors
Available as an HTTP Collector and a Filesystem Collector, the Norconex Collectors are probably the best open source website crawling solutions available for download.
Although designed for developers, the programs are often extended by integrators and (while still being easily modifiable) can be used comfortably by anyone with limited developing experience too. Using one of their readily available Committers, or building your own, Norconex Collectors allow you to make submissions to any search engine you please. And if there’s a server crash, the Collector will resume its processes where it left off.
The HTTP Collector is designed for crawling website content for building your search engine index (which can also help you to determine how well your site is performing), while the Filesystem Collector is geared toward collecting, parsing, and modifying information on local hard drives and network locations.
19. OpenSearchServer
While OpenSearchServer offers cloud-based hosting solutions (starting at $228 a year on monthly billing and ranging up to $1428 for the Pro package), they also provide enterprise-class open source search engine software, including search functions and indexation.
You can opt for one of six downloadable scripts. The Search code, made for building your search engine, allows for full text, Boolean, and phonetic queries, as well as filtered searches and relevance optimization. The index includes seventeen