Have you ever wondered how to add a robots.txt file to your site? Not many webmasters know this, but you have a good deal of control over which crawlers visit your website and what they crawl, right down to individual pages. The robots.txt file, which implements the robots exclusion protocol (or standard), is a tiny text file found on most websites. Yet many people don’t even know it exists.
The robots.txt file is designed specifically to work with search engine crawlers. Used well, this tiny file can also be a source of SEO juice that is waiting to be unlocked and tapped.
If you want to enhance your SEO without bending over backward, you need to know how to add a robots.txt file to your site.
But before that, let’s start with the following question: what does a robots.txt file mean?
What Does Robots.txt File Mean?
A robots.txt file is a plain text (ASCII) file that a webmaster can place on their website. This simple text file sits in the root directory of your website. It tells search engine robots, also known as spiders, which pages on your website to crawl and which to ignore.
This matters because, without it, crawlers are free to visit every publicly reachable file and page on your website, and any of them may end up in search engine results. This is why you need to learn how to create a robots.txt file. It gives you considerable control over how Google, Bing, Yahoo, and other search engines see your website.
SEO gurus still debate whether having a robots.txt file on a website attracts search engine spiders. Many of them claim that it does and that this, in turn, improves search engine positioning.
Either way, a robots.txt file must sit in your root directory to have any effect, and a well-written one can improve how your site is crawled, which affects SEO. For that to happen, you need to know how to create an effective robots.txt file.
A Brief History of Robots.txt File
During the early days of the internet, engineers and programmers creatively invented ‘spiders’ or ‘robots’ to crawl and then index pages on the web. These robots were referred to as ‘User-agents.’
Occasionally, these robots or spiders found their way onto pages that website owners did not want indexed. For instance, private websites or those under construction got indexed. It was a problem that needed an urgent solution.
That was when Martijn Koster, a Dutch engineer and the creator of Aliweb, one of the earliest web search engines, came into the picture. He proposed a well-defined set of standards that every robot would be expected to follow. These standards were first proposed in February 1994.
By June 30, 1994, early web pioneers and several robot authors reached a consensus on the proposed standards. The standards were adopted then as the REP (‘Robots Exclusion Protocol’).
The robots.txt file is the implementation of this protocol. The Robots Exclusion Protocol defines the procedures that each legitimate spider or crawler is expected to follow.
If the robots.txt file instructs bots not to crawl a particular web page, every legitimate robot, from Googlebot to MSNbot, is expected to follow the instruction.
Bear in mind that some rogue robots (e.g., spyware, malware, and email harvesters) may not follow these protocols. This is why you may end up seeing bot traffic on pages that you have already blocked via the robots.txt file.
Some other robots also ignore the REP standards even though they are not used for anything questionable.
To see any website’s robots.txt file, add /robots.txt to the end of its root domain in your browser’s address bar.
How Robots.txt Works
As mentioned earlier, your robots.txt file tells search engines which webpages on your website they may crawl.
Robots.txt files come with 2 major components, and they are:
- User-agent: This component defines the web bot or search engine that a rule applies to. You can use an asterisk (*) as a wildcard with User-agent to cover all search engines.
- Disallow: This component advises a search engine not to crawl a page, file, or directory.
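To see how these two components work together, here is a minimal sketch using Python’s standard-library urllib.robotparser module. The /private/ path, the ExampleBot name, and the example.com URLs are illustrative assumptions, not values from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every crawler ("*") is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit a crawl.
private_ok = parser.can_fetch("ExampleBot", "https://example.com/private/notes.html")
blog_ok = parser.can_fetch("ExampleBot", "https://example.com/blog/post.html")
print(private_ok, blog_ok)
```

A well-behaved crawler would skip the first URL and fetch the second.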
Here’s something to note: a robots.txt file only governs URLs on the domain where it is hosted. To block a particular file in your file manager, make sure the file is served from one of your own domains, then add its URL path to a Disallow line in that domain’s robots.txt file.
With that out of the way, here is how robots.txt works:
All search engines have 2 primary jobs:
- Crawl the web to discover highly relevant content.
- Index the content found so that it can readily be served up to online searchers who are looking for specific information.
To crawl websites, search engines follow links from one website to another. In the end, they crawl across billions of sites and links. This crawling behavior is sometimes referred to as ‘spidering.’
The first thing a search crawler does when it arrives at a website, before spidering it, is look for a robots.txt file. If it finds one, it reads that file before continuing through the website.
Since the robots.txt file contains specific information about how the search engine should crawl the site, its contents direct what the crawler does next on that website.
If the robots.txt file contains no directives that disallow the user-agent’s activity, or if the website does not have a robots.txt file, the crawler will proceed to crawl the rest of the website.
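The decision flow described above can be sketched roughly as follows. This is a simplified illustration, not how any particular search engine is implemented; the function name crawlable, the ExampleBot agent, and the sample rules and pages are all hypothetical.

```python
from typing import Iterable, List, Optional
from urllib.robotparser import RobotFileParser


def crawlable(robots_txt: Optional[str], agent: str, urls: Iterable[str]) -> List[str]:
    """Return the URLs that `agent` is permitted to crawl."""
    if robots_txt is None:
        # The site has no robots.txt, so nothing is off limits.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # read the rules before crawling
    return [url for url in urls if parser.can_fetch(agent, url)]


# Hypothetical site content, used only for illustration.
RULES = "User-agent: *\nDisallow: /cart/\n"
PAGES = ["https://example.com/", "https://example.com/cart/checkout"]

with_rules = crawlable(RULES, "ExampleBot", PAGES)
without_file = crawlable(None, "ExampleBot", PAGES)
print(with_rules)
print(without_file)
```

With the rules in place only the home page remains; with no file at all, both pages are open to the crawler.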
Why the Robots.txt File is Vitally Important
Strictly speaking, robots.txt is not an essential document for every website. In fact, your website can still rank and grow well without this file in your root directory.
However, making use of robots.txt comes with several benefits which you should leverage as a website owner:
Controls resource usage
Each time a bot crawls your website, it drains some of your bandwidth as well as server resources. These are resources that should be better spent on real human visitors.
For websites with lots of content, this can easily increase costs and give real users or visitors a poor browsing experience.
But you can utilize the robots.txt file to block off access to unimportant images, scripts, etc., in order to conserve resources.
Prioritize essential pages
Your primary goal is to ensure search engine spiders crawl all the crucial pages on your website, such as your content pages, and do not waste resources on useless pages such as the results of internal search queries.
By blocking these useless pages, you can prioritize which pages search engine bots should focus on.
Prevent bots from crawling private folders
If you disallow bots from crawling the private folders on your website, search engine spiders will find those folders a bit harder to index.
What Can You Hide with Robots.txt?
By now, you already know that robots.txt files are generally used to exclude specific categories, directories, or pages from the search engine result pages (SERPs).
You exclude them using the ‘Disallow’ directive. Some common pages you can hide using a robots.txt file include:
- Admin pages
- Pagination pages
- Shopping cart
- Pages with duplicate, often printer-friendly content.
- Dynamic service and product pages
- Account pages
- Thank you pages
For instance, let’s say you want to disallow a ‘Thank You’ page; this is how you go about it:
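As a rough sketch, assuming the page lives at a hypothetical path such as /thank-you/ (substitute the real path on your own site), the rule and a quick check with Python’s standard-library parser might look like this:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical path; substitute the real location of your thank-you page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /thank-you/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

thank_you_ok = parser.can_fetch("ExampleBot", "https://www.example.com/thank-you/")
home_ok = parser.can_fetch("ExampleBot", "https://www.example.com/")
print(thank_you_ok, home_ok)
```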
It must be mentioned here that not all search engine crawlers will follow your robots.txt file. Bad bots can ignore it entirely. Therefore, make sure you don’t keep any highly sensitive data on blocked pages.
How to Add a Robots.txt File to Your Site?
In this section, we’ll take a look at how to add a robots.txt file to your site. SEO gurus always recommend adding a robots.txt file to your primary domain as well as all sub-domains on your website.
To add a robots.txt file to your website, you have to, first of all, create it. Follow the step-by-step process outlined below:
Step 1: Open Notepad or Microsoft Word on your computer and save the file as “robots”, written entirely in lowercase. Choose .txt as the file extension; if you use Microsoft Word as your text editor, choose ‘Plain Text’ as the file type.
Step 2: Then, add these 2 lines of text to your file: ‘User-agent: *’ on the first line and ‘Disallow:’ (with nothing after the colon) on the second.
‘User-agent’ is another word for search engine spiders/crawlers or robots. That asterisk (*) signifies that this line applies to all the search engine spiders. As you can see, there is no folder or file listed in the ‘Disallow’ line.
This implies that every directory on your website will be accessed. This is the basic robots.txt file.
Step 3: Another robots.txt option is to block the spiders from accessing every inch of your website. You can do this by using the following lines in the robots.txt file instead: ‘User-agent: *’ followed by ‘Disallow: /’.
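The files in Steps 2 and 3 differ by a single slash, which is easy to get wrong. A minimal sketch for checking the difference before uploading, assuming a hypothetical helper named may_crawl and a made-up page URL:

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and report whether `agent` may crawl `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:\n"    # Step 2: empty Disallow value
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # Step 3: a single slash

PAGE = "https://www.mydomainname.com/about.html"  # hypothetical page
print(may_crawl(ALLOW_ALL, PAGE))
print(may_crawl(BLOCK_ALL, PAGE))
```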
Step 4: If you would like to block off the spiders from specific areas of your website, your robots.txt may look like this:
User-agent: *
Disallow: /scripts/
Disallow: /database/
The 3 lines outlined above inform all robots that they cannot access anything within the scripts and database directories or their sub-directories.
Bear in mind that only 1 folder or file can be used per ‘Disallow’ line. You can add as many ‘Disallow’ lines as required.
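If you maintain the list of blocked folders elsewhere, the one-path-per-line rule is easy to apply programmatically. The sketch below is one possible approach; build_robots_txt is a hypothetical helper, and the domain and file names in the checks are made up for illustration.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_paths: Iterable[str]) -> str:
    """Write one Disallow line per path under a catch-all User-agent."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in blocked_paths]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(["/scripts/", "/database/"])
print(robots_txt)

# Confirm the generated rules behave as intended.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
scripts_ok = parser.can_fetch("ExampleBot", "https://www.mydomainname.com/scripts/run.cgi")
backup_ok = parser.can_fetch("ExampleBot", "https://www.mydomainname.com/database/backups/old.sql")
blog_ok = parser.can_fetch("ExampleBot", "https://www.mydomainname.com/blog/")
```

Because rules are matched as path prefixes, the sub-directory under /database/ is blocked along with the directory itself.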
Step 5: Make sure you also add your search engine-friendly XML sitemap file to the robots.txt file. This ensures the robots can easily find your sitemap and quickly index all of your website pages.
To add your XML sitemap, include a line that begins with ‘Sitemap:’ followed by the full URL of your sitemap file.
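To confirm that a crawler can actually pick up the sitemap line, you can read it back with Python’s standard-library parser. This sketch assumes Python 3.8 or later (where site_maps() is available) and a hypothetical sitemap URL.

```python
from urllib.robotparser import RobotFileParser

# The sitemap URL is a hypothetical example.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://www.mydomainname.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

sitemaps = parser.site_maps()  # list of URLs, or None if no Sitemap line
print(sitemaps)
```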
Step 6: As soon as everything is complete, save your robots.txt file and then upload it to your website’s root directory.
For instance, if your domain is ‘www.mydomainname.com,’ you will place the robots.txt file at www.mydomainname.com/robots.txt.
And that is how to add a robots.txt file to your site!
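Crawlers only look for the file at the root of the host, so it can help to work out that location from any page address on your site. A small sketch, where robots_url is a hypothetical helper and the page address is made up:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.mydomainname.com/blog/post?id=7"))
```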
Common Robots.txt Directives You Should Know and Use
User-agent: * – This is usually the first line in your robots.txt file. It tells search engine spiders that the rules which follow apply to all of them.
Disallow: / – This tells spiders not to crawl any part of your website.
Disallow: – This tells all spiders to crawl your entire website.
Disallow: /ebooks/*.pdf – This tells spiders to ignore all the PDF files in the ebooks directory, which may cause duplicate content issues.
Disallow: /staging/ – This tells search engine crawlers to ignore your staging site.
Disallow: /images/ – When placed under a ‘User-agent: Googlebot’ line, this tells only the Googlebot spider to ignore everything in the images directory of your website.
* – This is considered a wildcard that represents any sequence of characters.
$ – This character is used to match the end of the URL.
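Python’s standard-library parser matches plain path prefixes and does not interpret * or $, so the sketch below hand-rolls a small matcher to illustrate how these two characters are commonly interpreted. The function name rule_matches and the sample paths are hypothetical, and real crawlers may differ in edge cases.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a URL path against a rule that may use * and a trailing $."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let * stand for any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # With $, the rule must cover the whole path; otherwise a prefix is enough.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


print(rule_matches("/ebooks/*.pdf", "/ebooks/guide.pdf?download=1"))
print(rule_matches("/ebooks/*.pdf$", "/ebooks/guide.pdf?download=1"))
print(rule_matches("/ebooks/*.pdf$", "/ebooks/guide.pdf"))
```

Without the $, the rule also catches URLs that merely start with a matching path, such as a PDF link carrying a query string.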
When You Should Not Use Robots.txt
The robots.txt file can be a beneficial tool when used smartly. However, it is not always the best option or solution. Here are some examples of when you should not use robots.txt to control search engine crawling:
- Blocking URL parameters
- Getting already indexed pages deindexed
- Blocking URLs with backlinks
- Setting specific rules which ignore social network crawlers.
- Blocking access to dev or staging sites, etc.
The robots.txt file remains a valuable ally as it shapes the way search engine bots or crawlers interact with your website.
When used the right way, robots.txt can make your website easier to crawl, which can in turn have a positive impact on your rankings.
Using this guide, you already have the answer to the question: ‘what does robots.txt file mean?’ Hopefully, you also understand how to add a robots.txt file to your site as well as how to avoid some mistakes when using this simple text file.
Therefore, use the robots.txt file wisely, and get as much SEO juice as you can to your website!