What Is a Robots.txt File?
A robots.txt file tells search engine crawlers which pages on your site they can and cannot crawl. It manages crawler access, but it should not be relied on to keep pages out of Google's index.
A robots.txt file looks like this:
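Here's a simple illustration (the blocked directory and the sitemap URL below are placeholders):

# Illustrative example only; use your own paths and sitemap URL
User-agent: *
Disallow: /admin/

Sitemap: https://www.yourwebsite.com/sitemap.xml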
Robots.txt syntax might seem intimidating at first, but it's actually quite simple. We'll cover the specifics later.
In this article we’ll cover:
- Why robots.txt files are important
- How robots.txt files work
- How to create a robots.txt file
- Robots.txt best practices
Why Is Robots.txt Important?
You can use a robots.txt file for the following reasons:
1. Optimize Crawl Budget
Crawl budget is the number of pages on your site that Google will crawl within a given time frame.
The number can vary based on your site's size, health, and backlink profile.
If your site has more pages than its crawl budget, some of those pages may never get indexed.
Unindexed pages won't rank, so you'd be wasting time creating pages users will never see.
Blocking unimportant pages with robots.txt lets Googlebot (Google's web crawler) spend more of your crawl budget on the pages that matter.
Note: Google says most website owners don't need to worry much about crawl budget. It's mainly a concern for larger sites with thousands of URLs.
2. Block Duplicate and Non-Public Pages
Crawl bots don't need to visit every page on your site, because not every page is meant to appear in search engine results pages (SERPs).
Examples include login pages, staging sites, duplicate pages, and internal search results pages.
Some content management systems handle these internal pages for you.
WordPress, for example, automatically disallows all crawlers from the /wp-admin/ login page.
You can use robots.txt to block crawlers from pages like these yourself.
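For instance, a block like this (the paths are hypothetical) would keep all crawlers out of a staging area and your internal search results:

# Hypothetical paths for illustration
User-agent: *
Disallow: /staging/
Disallow: /search/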
3. Hide Resources
Sometimes you'll want to keep resources like PDFs, videos, and images out of search results, either to keep them private or to focus Google's attention on more important content.
In either case, robots.txt keeps those resources from being crawled (and therefore indexed).
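For example, a sketch like this (the directory names are hypothetical) would block crawlers from a folder of private PDFs and a folder of internal images:

# Hypothetical directories for illustration
User-agent: *
Disallow: /private-pdfs/
Disallow: /internal-images/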
How Does a Robots.txt File Work?
Robots.txt files tell search engine bots which URLs they can crawl and, more importantly, which URLs to ignore.
Search engines have two main jobs:
- Crawling the web to discover content
- Indexing and delivering content to searchers looking for information
To crawl the web, search engine bots discover and follow links. This process takes them from site A to site B to site C, across millions of links, pages, and websites.
But if a bot finds a robots.txt file on a site, it reads that file before doing anything else.
The syntax is simple to understand.
You identify the user-agent (the search engine bot) and then list the directives (the rules) that apply to it.
Using the asterisk (*) wildcard as the user-agent applies the rules to all bots.
For instance, the following block allows every crawler except DuckDuckGo to access your site:
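One way to write that block, using DuckDuckGo's DuckDuckBot user-agent:

User-agent: *
Allow: /

# DuckDuckBot is DuckDuckGo's crawler
User-agent: DuckDuckBot
Disallow: /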
Note: A robots.txt file provides instructions, but it can't enforce them. Think of it as a code of conduct: good bots (like search engine bots) will follow the rules, while bad bots (like spam bots) will ignore them.
How to Find a Robots.txt File
Like any other file on your website, the robots.txt file is stored on your server.
You can view the robots.txt file of any website by typing the full URL of its homepage and adding “/robots.txt” at the end.
Like this: https://seofeatures.com/robots.txt
Note: A robots.txt file should always live at the root domain level. For www.example.com, the robots.txt file sits at www.example.com/robots.txt. Place it anywhere else, and crawlers may assume you don't have one.
Let’s look at the syntax of a robots.txt file before learning how to make one.
Robots.txt Syntax
A robots.txt file is made up of:
- One or more blocks of “directives” (rules)
- Each with a specified “user-agent” (search engine bot)
- And an “allow” or “disallow” instruction
User-agent: Googlebot
Disallow: /not-for-google

User-agent: DuckDuckBot
Disallow: /not-for-duckduckgo

Sitemap: https://www.yourwebsite.com/sitemap.xml
The User-Agent Directive
The user-agent, which identifies the crawler, appears on the first line of each directives block.
For example, if you want to tell Googlebot not to crawl your WordPress admin page, your directive will start with:
User-agent: Googlebot
Disallow: /wp-admin/
Note that most search engines have multiple crawlers: one for standard indexing, plus separate crawlers for images, videos, and so on.
When several blocks of directives are present, a bot follows the most specific block that applies to it.
Say you have three blocks of directives: one for *, one for Googlebot, and one for Googlebot-Image.
If the Googlebot-News user agent crawls your site, it will follow the Googlebot directives.
The Googlebot-Image user agent, on the other hand, will follow the more specific Googlebot-Image directives.
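To illustrate, those three blocks might be laid out like this (the disallowed paths are placeholders):

# Placeholder paths for illustration
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Googlebot-Image
Disallow: /private-images/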
The Disallow Robots.txt Directive
The second line in a block of directives is the “Disallow” line.
You can list multiple disallow directives to restrict which parts of your site a crawler can access.
An empty “Disallow” line means you're not disallowing anything, so a crawler can access every section of your site.
For example, if you want to allow all search engines to crawl your entire site, your block would look like this:
User-agent: *
Allow: /
And if you want to block all search engines from crawling your site, your block would look like this:
User-agent: *
Disallow: /
Note: Directives like “Allow” and “Disallow” are not case-sensitive, but the values you assign to them are.
For example, /photo/ is not the same as /Photo/.
Still, you'll often see “Allow” and “Disallow” capitalized because it makes the file easier for humans to read.
The Allow Directive
The “Allow” directive lets search engines crawl a subdirectory or a specific page inside an otherwise disallowed directory.
For example, if you want to prevent Googlebot from accessing every post on your blog except one, your directives might look like this:
User-agent: Googlebot
Disallow: /blog
Allow: /blog/example-post
Note: Not all search engines recognize this directive, but Google and Bing do support it.
The Sitemap Directive
The Sitemap directive tells search engines, specifically Google, Bing, and Yandex, where to find your XML sitemap.
Sitemaps generally list the pages you want search engines to crawl and index.
This directive sits at the top or bottom of a robots.txt file and looks like this:
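(The URL below is a placeholder; point it at your own sitemap.)

Sitemap: https://www.yourwebsite.com/sitemap.xml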
Adding a Sitemap directive to your robots.txt file is a quick fix, but you can (and should) also submit your XML sitemap to each search engine through its webmaster tools.
Search engines will crawl your site eventually either way; submitting a sitemap simply speeds up the process.
Crawl-Delay Directive
The crawl-delay directive tells crawlers to slow down their crawl rate so they don't overload a server and slow down your site.
Google no longer supports the crawl-delay directive; to adjust Googlebot's crawl rate, you have to do so in Search Console.
Bing and Yandex, on the other hand, do support crawl-delay. Here's how to use it.
Say you want a crawler to wait 10 seconds between each crawl action. Set the delay to 10, like this:
User-agent: *
Crawl-delay: 10
Noindex Directive
A robots.txt file tells a bot what it can and cannot crawl, but it can't tell a search engine which URLs to keep out of its index and out of search results.
A blocked page may still show up in search results; the bot just can't read its content, so the listing appears without a description.
Google never officially supported a noindex directive in robots.txt, and on September 1, 2019, Google announced that this directive is not supported at all.
If you want to reliably keep a page or file out of search results, use a meta robots noindex tag instead.
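For reference, a meta robots noindex tag is a snippet of HTML placed in a page's <head> section, like this:

<meta name="robots" content="noindex">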
How to Create a Robots.txt File
Use a robots.txt generator tool or create one yourself.
Here’s how:
1. Create a File and Name It Robots.txt
To get started, open a .txt document in a text editor or web browser.
Note: Avoid using a word processor, since word processors often save files in a proprietary format that can add random characters.
Then give the file the name robots.txt.
You can now begin typing directives.
2. Add Directives to the Robots.txt File
A robots.txt file consists of one or more groups of directives, and each group includes multiple lines of instructions.
Each group begins with a user-agent and then specifies:
- Which user-agent (bot) the group applies to
- Which directories (folders), pages, or files the agent can access
- Which directories, pages, or files the agent can't access
- A sitemap (optional) that tells search engines which pages and files you consider important
Crawlers ignore lines that don't match any of these directives.
For example, say you don't want Google to crawl your /clients/ directory because it's used only internally.
The first group would look like this:
User-agent: Googlebot
Disallow: /clients/
You can then add more instructions on separate lines below, like this:
User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google
Once you're done with Google's specific instructions, hit enter twice to start a new group of directives.
Let's make this one apply to all search engines, since your /archive/ and /support/ directories are for internal use only.
It would look like this:
User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google

User-agent: *
Disallow: /archive/
Disallow: /support/
When you're finished, add your sitemap.
Your final robots.txt file would look something like this:
User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google

User-agent: *
Disallow: /archive/
Disallow: /support/

Sitemap: https://www.yourwebsite.com/sitemap.xml
Save your robots.txt file. Remember, it must be named robots.txt.
Note: Crawlers read robots.txt from top to bottom and match the first, most specific group of rules. So list your specific user agents first, then finish with the more general wildcard (*) that matches all crawlers.
3. Upload the Robots.txt File
After saving the robots.txt file to your computer, upload it to your site so that search engines can crawl it.
Unfortunately, there's no universal tool for this step.
How you upload the file depends on your site's file structure and web hosting.
If you need help uploading your robots.txt file, search online or contact your hosting provider.
For example, you could search for "upload robots.txt file to WordPress."
The articles below explain how to upload your robots.txt file on the most popular platforms:
- Robots.txt file in WordPress
- Robots.txt file in Wix
- Robots.txt file in Joomla
- Robots.txt file in Shopify
- Robots.txt file in BigCommerce
After uploading, check whether the file is publicly visible and whether Google can read it.
4. Test Your Robots.txt
First, check whether your robots.txt file is publicly accessible (that is, whether it was uploaded correctly).
Open a private window in your browser and navigate to your robots.txt file.
For example: https://seofeatures.com/robots.txt
If you can see your robots.txt file with the content you added, you're ready to test the markup.
Google provides two ways to evaluate robots.txt markup:
- The robots.txt Tester in Search Console
- Google’s open-source robots.txt library (advanced)
Because the second option is aimed at advanced developers, let's test your robots.txt file in Search Console.
Note: To test your robots.txt file, you must have an account set up with Search Console.
Go to the robots.txt Tester and click "Open robots.txt Tester."
If you haven't linked your website to your Google Search Console account yet, you'll need to add a property first.
Then, verify that you're the site's real owner.
Note: Google plans to phase out this setup wizard, so going forward you'll need to verify your property directly in Search Console.
If you already have verified properties, choose one from the drop-down list on the Tester's homepage.
The Tester will detect any syntax warnings or logic errors and show the total number of warnings and errors below the editor.
Fix the errors or warnings directly on the page and retest as you go.
Keep in mind that changes made on this page aren't saved to your site. The tool doesn't change the actual file on your website; it only tests the copy hosted in the tool.
To apply any changes, copy and paste the edited test copy into the robots.txt file on your site.
Robots.txt Best Practices
Use New Lines for Each Directive
Each directive should sit on its own line.
Otherwise, search engines won't be able to read them, and your instructions will be ignored.
Incorrect:
User-agent: * Disallow: /admin/ Disallow: /directory/
Correct:
User-agent: *
Disallow: /admin/
Disallow: /directory/
Use Each User-Agent Once
Bots don't mind if you list the same user-agent multiple times.
But referencing each user-agent only once keeps things neat and simple, and it reduces the chance of human error.
Confusing:
User-agent: Googlebot
Disallow: /example-page

User-agent: Googlebot
Disallow: /example-page-2
Notice how the Googlebot user-agent is listed twice.
Clear:
User-agent: Googlebot
Disallow: /example-page
Disallow: /example-page-2
In the first example, Google would still follow the instructions and avoid crawling both pages.
But writing all directives under a single user-agent is cleaner and helps you stay organized.
Use Wildcards to Clarify Directions
Use wildcards (*) to apply a directive to all user-agents and to match URL patterns.
For example, to prevent search engines from accessing URLs with parameters, you could technically list them out one by one.
But that's inefficient. You can simplify your directives with a wildcard.
Inefficient:
User-agent: *
Disallow: /shoes/vans?
Disallow: /shoes/nike?
Disallow: /shoes/adidas?
Efficient:
User-agent: *
Disallow: /shoes/*?
The example above blocks all search engine crawlers from crawling any URL under the /shoes/ subfolder that contains a question mark.
Use ‘$’ to Indicate the End of a URL
Adding "$" marks the end of a URL.
For example, to keep search engines from crawling every .jpg file on your site, you could list each file individually.
But that would be inefficient.
Inefficient:
User-agent: *
Disallow: /photo-a.jpg
Disallow: /photo-b.jpg
Disallow: /photo-c.jpg
Instead, use the "$" character, like this:
Efficient:
User-agent: *
Disallow: /*.jpg$
Note: In this example, /dog.jpg can't be crawled, but /dog.jpg?p=32414 can, because that URL doesn't end in ".jpg."
The "$" expression is a handy feature in specific circumstances like this one, but it can also be risky.
Use it carefully, because it's easy to end up unblocking things you actually meant to block.
Use the Hash (#) to Add Comments
Crawlers ignore everything that starts with a hash (#).
Because of this, developers often use a hash to add comments to a robots.txt file and keep it organized and readable.
To include a comment, start the line with a hash (#).
User-agent: *

#Landing Pages
Disallow: /landing/
Disallow: /lp/

#Files
Disallow: /files/
Disallow: /private-files/

#Websites
Allow: /website/*
Disallow: /website/search/*
Developers sometimes include funny messages in robots.txt files because they know people rarely read them.
For example, YouTube's robots.txt file reads: "Created in the far future (the year 2000) after the robotic uprising of the mid-90s which wiped out all humans."
And Nike's robots.txt includes the line "just crawl it," a nod to its "Just do it" slogan and logo.
Use Separate Robots.txt Files for Different Subdomains
A robots.txt file only controls crawling behavior on the subdomain where it's hosted.
If you want to control crawling on a different subdomain, you'll need a separate robots.txt file.
So if your main site lives on domain.com and your blog lives on blog.domain.com, you need two robots.txt files:
one in the root directory of the main domain and one in the root directory of the blog.
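Using those example domains, the two files would live at:

https://domain.com/robots.txt
https://blog.domain.com/robots.txt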
The Most Common User Agents for Search Engine Spiders

| Search engine | Field | User-agent |
|---|---|---|
| Baidu | General | baiduspider |
| Baidu | Images | baiduspider-image |
| Baidu | Mobile | baiduspider-mobile |
| Baidu | News | baiduspider-news |
| Baidu | Video | baiduspider-video |
| Bing | General | bingbot |
| Bing | General | msnbot |
| Bing | Images & Video | msnbot-media |
| Bing | Ads | adidxbot |
| Google | General | Googlebot |
| Google | Images | Googlebot-Image |
| Google | Mobile | Googlebot-Mobile |
| Google | News | Googlebot-News |
| Google | Video | Googlebot-Video |
| Google | Ecommerce | Storebot-Google |
| Google | AdSense | Mediapartners-Google |
| Google | AdWords | AdsBot-Google |
| Yahoo! | General | slurp |
| Yandex | General | yandex |