Let us start with the basics. All search engines use bots to crawl websites. A quick Google search for ‘crawling and indexing’ returns a straightforward explanation. When a search engine bot such as Googlebot or Bingbot reaches your site, either through a link from another page or through a sitemap you submitted in the search engine’s webmaster dashboard, it follows the links on your site. From there, it crawls and indexes your site or blog.
Crawlers and bots (a.k.a. spiders) scan your entire site and index everything they can. This happens whether you like it or not. It can also cause sensitive generated pages (internal search results, for example) to show up on Google.
There is a file that’s considered a must-have in your website’s root folder. It is the first file that search engine crawlers request when they visit. It sets the restrictions within which the crawlers and bots operate. From there, the crawlers and bots scan and index your website. It’s called the robots.txt file.
If you want to keep search engines away from sensitive areas or stop them from scouring all your content and scripts, robots.txt works for you. Crawlers check for this file at the root of the site, and well-behaved crawlers such as Googlebot and Bingbot follow the instructions inside. The file is advisory, though: it cannot force a badly behaved bot to comply, and anyone can read it, so it is not a substitute for password protection. If you don’t have a robots.txt file, the crawlers will assume that the whole site may be crawled and indexed.
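As a rough illustration of that default, the sketch below uses Python's standard urllib.robotparser module, which applies the same convention as a compliant crawler. The empty rule list stands in for a site with no robots.txt rules; the crawler name and the example.com URL are placeholders assumed for illustration.

```python
from urllib.robotparser import RobotFileParser

# An empty rule set stands in for a site that publishes no robots.txt rules.
parser = RobotFileParser()
parser.parse([])

# Placeholder crawler name and URL, assumed for illustration only.
url = "https://example.com/private/report.html"
allowed = parser.can_fetch("Googlebot", url)

# With no rules to consult, a compliant parser treats every URL as crawlable.
print(allowed)
```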
Robots.txt also improves your search results by keeping internal data out of them, so that what searchers find from your site is the content you actually want them to see. Cleaner, more relevant results in turn bring you better-targeted traffic.
You can use cPanel to add robots.txt to your site with these few steps.
In the Disallow line, after the forward slash, add the name of the folder you want to hide from the crawlers. In this example, we used the WordPress admin files in the ‘wp-admin’ folder.
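To see how a compliant crawler would read such a rule, here is a minimal sketch that feeds a two-line robots.txt to Python's standard urllib.robotparser module. The example.com URLs and the blog post path are assumptions made for illustration, not paths from any real site.

```python
from urllib.robotparser import RobotFileParser

# The rule described above: hide the wp-admin folder from every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URLs, assumed for illustration only.
admin_allowed = parser.can_fetch("Googlebot", "https://example.com/wp-admin/options.php")
post_allowed = parser.can_fetch("Googlebot", "https://example.com/blog/hello-world/")

# The admin URL falls under the Disallow prefix; the blog post does not.
print(admin_allowed, post_allowed)
```

Anything whose path starts with /wp-admin/ is off limits to a crawler that honors the file, while the rest of the site stays crawlable.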
Preventing duplicate content using robots.txt
To avoid duplicate content in a WordPress site, the basic structure of your robots.txt file should look like this:
This robots.txt will stop robots from crawling your admin folder, feeds, trackbacks, comment feeds, pages and comments. Remember that robots.txt prevents crawling, not indexing: a blocked URL can still appear in search results if other pages link to it.
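A minimal sketch of what such a rule set might look like, checked with Python's standard urllib.robotparser module. The Disallow paths are assumptions based on WordPress's default URL layout and are illustrative rather than a definitive list; the helper name is_crawlable and the example.com host are likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the paths assume WordPress's default URL layout.
DUPLICATE_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /feed/
Disallow: /trackback/
Disallow: /comments/
Disallow: /page/
"""


def is_crawlable(path, agent="Googlebot"):
    """Return True if a compliant crawler may fetch `path` under the rules above."""
    parser = RobotFileParser()
    parser.parse(DUPLICATE_RULES.splitlines())
    return parser.can_fetch(agent, "https://example.com" + path)


# Report the verdict for a few sample paths.
for path in ["/feed/", "/comments/feed/", "/page/2/", "/2020/01/hello-world/"]:
    print(path, is_crawlable(path))
```

These rules match by path prefix only, so a per-post feed such as /2020/01/hello-world/feed/ would still be crawlable. Google and Bing also accept wildcard patterns for cases like that, but urllib.robotparser does not interpret them, so this sketch sticks to plain prefixes.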
A helpful tool
Always double check your robots.txt file, and make sure that you haven’t disallowed anything that you want search engines to find. Most of all, keep in mind that improving your content is the ultimate key to improving your site’s traffic statistics. The robots.txt file only puts your site in the best position to get a lot of hits.
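One way to automate that double check is a small script that lists which of your important URLs a rule set would block. The sketch below uses Python's standard urllib.robotparser module; the function name find_blocked, the sample rules and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def find_blocked(robots_txt, urls, agent="Googlebot"):
    """Return the URLs from `urls` that the given rules would block for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]


# Hypothetical rules containing an overly broad Disallow line.
rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /blog
"""

# Pages that search engines should be able to find (placeholder URLs).
must_be_found = [
    "https://example.com/",
    "https://example.com/blog/hello-world/",
    "https://example.com/about/",
]

blocked = find_blocked(rules, must_be_found)
print(blocked)
```

An empty list means none of the pages you care about are disallowed; here the stray "Disallow: /blog" line is flagged because it hides a post that should be findable.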