How to: Blocking Bad Bot Traffic | Sprk’d Blog
By christine|Last Updated: March 18, 2016|4 min read|
When it comes to bots crawling your website, not all bots are created equal. There are good bots, like Google’s spider, that crawl your website and index the website for search engines. On the other hand, there are bad, bad bots that hurt your website in different ways, including:
Extracting Your Website Images
Skewing Your Website Traffic Statistics
Stripping Your Content for Republishing
Downloading Site Archives
As a result, you need a good way to block the bad robots while letting the good robots do their thing. Fortunately, you can use the robots.txt file to do much of that.
How to Use Robots.txt
To create and use a robots.txt file, you first need to access the root of your domain. If you don’t know how to access the domain’s root, reach out to your web hosting provider for help.
Once you have access to the root, you can then create, edit, and tweak your robots.txt file. Before making any changes, though, you need to understand the language of a robots.txt file.
Two keywords sit at the heart of the robots.txt file: “User-agent” and “Disallow.”
User-agents are essentially search engine robots and/or web crawler software. The majority of user-agents can be found in the Web Robots Database, and many of these are pretty common bots to block.
Disallow is a command that tells the named user-agent not to access a given URL. If you want a user-agent to have access to a child directory located inside a disallowed parent directory, you can add an Allow keyword.
The syntax for these keywords is quite simple, and it looks like this:
User-agent: [the name of the robot that the rules will apply to]
Disallow: [the path of the URL you wish to block]
Allow: [if applicable, this will be the URL subdirectory path in a blocked parent directory that you want to unblock]
These lines and commands operate as a single entry. So, the Disallow rule will only apply to the named user-agent above the rule.
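To see how a single entry behaves, the sketch below feeds a small robots.txt to Python's standard urllib.robotparser module and asks whether two crawlers may fetch a URL. The bot name BadBot, the /photos/ paths, and example.com are made up for illustration. Allow is listed before Disallow because this particular parser applies the first rule that matches; other crawlers may resolve overlapping rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one entry that applies only to "BadBot".
ROBOTS_TXT = """\
User-agent: BadBot
Allow: /photos/public/
Disallow: /photos/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "https://example.com"  # placeholder domain
    for agent in ("BadBot", "Googlebot"):
        for path in ("/photos/cat.jpg", "/photos/public/cat.jpg"):
            print(agent, path, may_fetch(ROBOTS_TXT, agent, base + path))
```

Because the entry names only BadBot, any other crawler falls outside it and is allowed everywhere.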
There is no limit to the number of entries you can create in your robots.txt file, and you can also create an entry that applies to all web crawlers. To do this, use an asterisk as the User-agent, which looks like this:
User-agent: *
There are also some Disallow commands worth noting. If you wish to block an entire site, use a single forward slash:
Disallow: /
A specific web page can be blocked by including the webpage’s name after the forward slash. A directory and its subsequent contents can be blocked with the following:
Disallow: /[directory name]/
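The three Disallow patterns can be compared side by side with a short script. This is only a sketch using Python's standard urllib.robotparser; the page name, the directory name, the sample paths, and the AnyBot user-agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths on an imaginary site.
PATHS = ["/", "/old-page.html", "/archive/2015.zip", "/blog/post"]

# Each file applies to every crawler and uses one Disallow pattern.
BLOCK_SITE = "User-agent: *\nDisallow: /\n"
BLOCK_PAGE = "User-agent: *\nDisallow: /old-page.html\n"
BLOCK_DIR = "User-agent: *\nDisallow: /archive/\n"


def blocked_paths(robots_txt, paths, user_agent="AnyBot"):
    """Return the subset of paths that robots_txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


if __name__ == "__main__":
    for label, rules in (("site", BLOCK_SITE), ("page", BLOCK_PAGE), ("dir", BLOCK_DIR)):
        print(label, blocked_paths(rules, PATHS))
```

Rules match by path prefix, which is why blocking a directory also blocks everything stored beneath it.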
All images on your site can be blocked from Google Images with the following entry:
User-agent: Googlebot-Image
Disallow: /
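Google's image crawler identifies itself as Googlebot-Image, so an entry aimed at that user-agent leaves Google's regular web crawler untouched. The sketch below checks this with Python's standard urllib.robotparser; the /images/logo.png path is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# Entry that targets only Google's image crawler.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical image path used only for this check.
image_bot_allowed = parser.can_fetch("Googlebot-Image", "/images/logo.png")
search_bot_allowed = parser.can_fetch("Googlebot", "/images/logo.png")

if __name__ == "__main__":
    print("Googlebot-Image allowed:", image_bot_allowed)
    print("Googlebot allowed:", search_bot_allowed)
```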
While this is far from a comprehensive list of the commands and strategies available in a robots.txt file, it provides a solid launching point.
Your robots.txt rules must be saved in a file named robots.txt, and the file must be placed in the highest-level directory of the site (the root of the domain). That way, Googlebot and other crawlers will be able to find and read the file.
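Crawlers look for the file at one fixed location per site, regardless of which page they are about to visit. As a rough illustration, the hypothetical helper below derives that location from any page URL using Python's standard urllib.parse; the example.com addresses are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("https://example.com/blog/post?id=1"))
```

A robots.txt saved in a subdirectory, such as /blog/robots.txt, is never requested, so its rules would have no effect.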
Thankfully, you can then test whether or not your changes took effect with the robots.txt tester.
Before relying on robots.txt, keep in mind that the strategy has limits. The instructions in a robots.txt file are only directives: they ask crawlers to stay away, but they cannot enforce anything. Well-behaved crawlers will obey the directives, while some of the really naughty bad bots will simply ignore them. In such instances, other blocking methods are the right strategy.
Other Blocking Methods to Consider
The robots.txt file is not the only blocking strategy to take on. Password-protecting server files and inserting meta tags into the HTML are two viable blocking strategies.
When your website has private or confidential data that you want to keep out of Google search results, you can store those files in a password-protected directory on the web server. This prevents Googlebot and other web crawlers from accessing any content stored in that directory.
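You can also prevent a page from appearing in Google searches with a “noindex” meta tag, written as <meta name="robots" content="noindex"> in the page’s <head>. This tag drops the page from Google’s results entirely, no matter which other sites link to it. For the tag to work, the page must not be disallowed in robots.txt, because a crawler has to fetch the page in order to see the tag.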
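If you want to confirm that a page actually carries the tag, a small script can scan its HTML. The sketch below uses Python's standard html.parser and assumes the common form <meta name="robots" content="noindex">; the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # The content attribute may list several comma-separated directives.
        content = attr.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(has_noindex(page))
```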
While each of these blocking strategies serves a different purpose, they all work toward the goal of limiting what bots can reach. By using these options in tandem with your robots.txt file, you can build a robust blocking strategy that keeps bad bots from harming your website.