Meta Robots Tag 101: Blocking Spiders, Cached Pages & More

Mar 5, 2007 at 8:48pm ET by Danny Sullivan

Last week, I covered a new command for the meta robots tag — one to prevent search engines from using Yahoo titles and descriptions. In doing that, a number of questions came up about the meta robots tag syntax itself. Google Webmaster Central has now posted “Using the robots meta tag,” providing some clarity from Google. In addition, both Yahoo and Microsoft have also sent me information on using the tag. I’ll run through what everyone says below, complete with charts for easy at-a-glance comparisons.

The meta robots tag was an open standard created over a decade ago and designed initially to allow page authors to prevent page indexing. Over the years, various search engines have added additional support to the tag.

Let me start off by saying that if you DO want your pages in search engines, then DO NOT use the tag. By default, the major search engines will index any page they find. Yes, there is a form of the meta robots tag you can use to explicitly tell search engines to index your pages. It looks like this:

<meta name="robots" content="index">

There’s also a form you can use that adds the command “follow,” which tells the search engines to index your page and also follow any links they find on that page to other pages, which they can then index. It looks like this:

<meta name="robots" content="index,follow">

You do NOT need to use either form if you DO want your pages in the search engines. Without either form, they’ll naturally index your pages and follow your links. That’s what they do.

I always joke that putting these forms of the meta robots tag on your web pages is like putting a Post-It note on your chest that says “breathe.” Hey, if you forget to look at that note, you’ll still breathe. That’s what you do, by default. And that’s what the major search engines do. By default, they inhale web pages without you putting up a meta tag telling them to do so.

Now if you DO NOT want your pages in a search engine, then it’s time to perhaps break out the meta robots tag, if for some reason the robots.txt alternative isn’t suitable. Want to keep a particular page out? Then put this on that page:

<meta name="robots" content="noindex">

See the “noindex” value? That tells the search engines that see this page not to include it in their listings. Remember, as I explained before, this will not prevent the page from being spidered. That’s because search engines have to keep revisiting the page in order to see if the tag has been removed. The tag only keeps the page out of the index. Here’s my earlier chart on that topic.

System                  | Robots.txt                       | Meta Robots              | Yahoo Delete URL Option
Stops Crawling          | Yes                              | No                       | No
Stops Index Inclusion   | Yes                              | Yes                      | Yes
Stops Link Only Listing | No                               | No (Yes, for Google)     | Yes
Why Use?                | Easy to block many pages at once | Can’t access root domain | Don’t even want URL to appear or need page out fast

What if you don’t want links followed? Sure, you can do this:

<meta name="robots" content="noindex,nofollow">

That extra command, “nofollow,” tells the search engines not to follow any links on that page. Google recently covered this more as an option. But as Google also explained, links from a page with this tag might still get crawled. That’s because if anyone else links to a particular page WITHOUT a nofollow value, then the search engine will follow that link.

So far, I’ve covered all the commands that were originally created with the tag back in May 1996. Since then, more commands (also called values or attributes) have been added. For example, Google writes today to summarize several options you can use. Quoting Google:

  • NOINDEX – prevents the page from being included in the index.
  • NOFOLLOW – prevents Googlebot from following any links on the page. (Note that this is different from the link-level NOFOLLOW attribute, which prevents Googlebot from following an individual link.)
  • NOARCHIVE – prevents a cached copy of this page from being available in the search results.
  • NOSNIPPET – prevents a description from appearing below the page in the search results, as well as prevents caching of the page.
  • NOODP – blocks the Open Directory Project description of the page from being used in the description that appears below the page in the search results.

At times, you may want to use more than one of these commands. I’ll get back to that. But first, how about another chart? I’ll cover the major commands you may want to use below:

COMMAND                                          | Ask   | Google    | Microsoft | Yahoo
NOINDEX                                          | Yes   | Yes       | Yes       | Yes
NOFOLLOW                                         | Yes   | Yes       | Yes       | Yes
NOARCHIVE                                        | Yes   | Yes       | Yes       | Yes
NOODP                                            | No    | Yes       | Yes       | Yes
NOYDIR                                           | No    | No        | No        | Yes
NOSNIPPET                                        | No    | Yes       | No        | No
Robot Name                                       | TEOMA | GOOGLEBOT | MSNBOT    | SLURP
Does Robot Specific Tag Override All Robots Tag? | ???   | No        | No        | No

Several of these are already explained above, in what I quoted from Google. They work the same way for the other major search engines. I’ve also linked to help information from each search engine for more specific advice.

The NOYDIR command is fully explained in my previous Yahoo Provides NOYDIR Opt-Out Of Yahoo Directory Titles & Descriptions post. Only Yahoo supports this, but none of the other major search engines use Yahoo titles and descriptions for listings, so it doesn’t really matter for them.
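For anyone who wants to check support programmatically, the chart above can be written down as a small lookup table. This is only a sketch in Python: the names SUPPORT and engines_honoring are made up for illustration, and the data is nothing more than a copy of the chart as it stands in this article.

```python
# The support chart from this article, encoded as a dictionary.
# SUPPORT and engines_honoring are hypothetical names used for illustration;
# the data simply mirrors the chart and is not an official source.
SUPPORT = {
    "NOINDEX":   {"Ask", "Google", "Microsoft", "Yahoo"},
    "NOFOLLOW":  {"Ask", "Google", "Microsoft", "Yahoo"},
    "NOARCHIVE": {"Ask", "Google", "Microsoft", "Yahoo"},
    "NOODP":     {"Google", "Microsoft", "Yahoo"},
    "NOYDIR":    {"Yahoo"},
    "NOSNIPPET": {"Google"},
}


def engines_honoring(command):
    """Return a sorted list of engines the chart lists as supporting a command."""
    return sorted(SUPPORT.get(command.strip().upper(), set()))


print(engines_honoring("noodp"))   # ['Google', 'Microsoft', 'Yahoo']
print(engines_honoring("NOYDIR"))  # ['Yahoo']
```

A command that is not in the chart simply returns an empty list.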

Now on to the topic of a meta robots tag having multiple values. What if you wanted to keep a page from being cached by all the major search engines and also ensure that neither Open Directory nor Yahoo Directory descriptions are used? First, you need the values of the commands to say this. From the table above, they are:

  • NOARCHIVE
  • NOODP
  • NOYDIR

Next, you need to decide what robots to target. We’ll keep it simple for now. To target ALL robots, you use this value:

  • ROBOTS

Now to the meta robots format. Without the values, it looks like this:

<meta name="NAME-OF-ROBOTS-TO-TARGET" content="COMMANDS">

We replace that NAME-OF-ROBOTS-TO-TARGET part with the name of the robots we’re, well, targeting. As explained, that’s ROBOTS, in order to target them all. I’ll put it in bold below:

<meta name="ROBOTS" content="COMMANDS">

Now we put in the commands we want to tell the robots, each separated by a comma. The order doesn’t matter. Again, I’ll bold the commands:

<meta name="ROBOTS" content="NOARCHIVE,NOODP,NOYDIR">

Voila! Put that tag ANYWHERE inside the header area of a web page like this:

<HEAD>
<meta name="ROBOTS" content="NOARCHIVE,NOODP,NOYDIR">
</HEAD>

Then you will be telling all major search engines not to cache the page, nor to use Open Directory or Yahoo Directory titles or descriptions for your page listings.
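If you generate pages from templates, the same recipe (pick a robot name, join the commands with commas) can be scripted. Here is a minimal Python sketch; build_robots_meta is a hypothetical helper name, not part of any library, and it assumes you want the commands uppercased with no spaces.

```python
from html import escape


def build_robots_meta(commands, robot="ROBOTS"):
    """Build a meta robots tag from a list of commands.

    build_robots_meta is a hypothetical helper for illustration only.
    Commands are uppercased and joined with commas, with no spaces.
    """
    content = ",".join(c.strip().upper() for c in commands if c.strip())
    return '<meta name="{}" content="{}">'.format(
        escape(robot, quote=True), escape(content, quote=True)
    )


print(build_robots_meta(["noarchive", "noodp", "noydir"]))
# <meta name="ROBOTS" content="NOARCHIVE,NOODP,NOYDIR">
```

Passing a different robot argument, such as "GOOGLEBOT", targets a single search engine instead of all of them.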

Notice that in the tag above, there are no spaces between the commands. What if I did this?

<meta name="ROBOTS" content="NOARCHIVE, NOODP, NOYDIR">

Google writes today that spaces make no difference. Use them if you want or not, the tag means the same thing. Microsoft tells me the same thing, as does Yahoo.

What if you did this, with no commas:

<meta name="ROBOTS" content="NOARCHIVE NOODP NOYDIR">

Microsoft tells me this is fine. I didn’t ask Yahoo about this, and Google says commas MUST be used. So use commas and don’t be a pain.
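To see why commas are the safe choice, consider a parser that splits only on commas. The Python sketch below is my own illustration of such a parser, not how any search engine actually reads the tag; parse_commands is a hypothetical name.

```python
def parse_commands(content):
    """Split a content value on commas, ignoring surrounding spaces and case.

    A hypothetical comma-only parser for illustration; real search engines
    may behave differently.
    """
    return sorted(part.strip().upper() for part in content.split(",") if part.strip())


# With or without spaces after the commas, the result is the same.
print(parse_commands("NOARCHIVE,NOODP,NOYDIR"))
print(parse_commands("NOARCHIVE, NOODP, NOYDIR"))

# Without commas, a comma-only parser sees one long unrecognized token.
print(parse_commands("NOARCHIVE NOODP NOYDIR"))
```

The first two calls give the same three commands. The third gives a single item, "NOARCHIVE NOODP NOYDIR", which matches none of the known commands.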

Now what if you want to tell search engines different things? Maybe you want Microsoft not to use the ODP descriptions, Google not to cache pages, Yahoo not to follow links on a page and Ask not to index the page at all. Maybe you want to get your head examined for being so strange, too. But aside from your mental health, it is possible to do all this.

You need to have a robots tag for each particular search engine you want to target. See that chart above? At the bottom there’s a “Robot Name” row. That shows you the name of each search engine’s “robot” or “spider” that you’ll issue a command to. With the robot names, we then give each of them their specific commands:

<meta name="TEOMA" content="NOINDEX">
<meta name="GOOGLEBOT" content="NOARCHIVE">
<meta name="MSNBOT" content="NOODP">
<meta name="SLURP" content="NOFOLLOW">

You could also tell all robots to do one thing — say not to follow links — while also issuing a second robots-specific command such as telling only Google not to cache the page:

<meta name="ROBOTS" content="NOFOLLOW">
<meta name="GOOGLEBOT" content="NOARCHIVE">

But wouldn’t a search engine only follow the specific tag written for it? In other words, if you target Google with a specific command in the “GOOGLEBOT” tag, then wouldn’t it follow only that tag and ignore the other?

Google, Microsoft and Yahoo say they will honor them both. I don’t know about Ask. That’s why you see “???” in that “Does Robot Specific Tag Override All Robots Tag?” section of the chart above. I’ll try to get that answered.
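The “honor them both” behavior that Google, Microsoft and Yahoo describe can be pictured as taking the union of the all-robots tag and the robot-specific tag. Here is a rough Python sketch using the standard library’s html.parser. The class and function names are hypothetical, and this models only the behavior described above. It says nothing about Ask.

```python
from html.parser import HTMLParser


class MetaRobotsCollector(HTMLParser):
    """Collect (name, commands) pairs from every meta tag that has a name."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").strip().upper()
        content = attr.get("content") or ""
        if name:
            commands = {c.strip().upper() for c in content.split(",") if c.strip()}
            self.tags.append((name, commands))


def commands_for(html, robot):
    """Union of the commands in the ROBOTS tags and the tags naming this robot."""
    collector = MetaRobotsCollector()
    collector.feed(html)
    collector.close()
    wanted = {"ROBOTS", robot.strip().upper()}
    result = set()
    for name, commands in collector.tags:
        if name in wanted:
            result |= commands
    return sorted(result)


page = '<meta name="ROBOTS" content="NOFOLLOW"> <meta name="GOOGLEBOT" content="NOARCHIVE">'
print(commands_for(page, "GOOGLEBOT"))  # ['NOARCHIVE', 'NOFOLLOW']
print(commands_for(page, "SLURP"))      # ['NOFOLLOW']
```

Under this model, Google sees both commands while Yahoo sees only the one aimed at all robots.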

What if you had more than one “all” robots tag like this:

<meta name="ROBOTS" content="NOFOLLOW">
<meta name="ROBOTS" content="NOODP">

As explained, you could easily do this instead:

<meta name="ROBOTS" content="NOFOLLOW,NOODP">

But if for some reason you did do it the other way, Microsoft and Yahoo have told me that’s just fine. They honor the information in BOTH of the robots tags. Google’s post today says the same thing.
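If you would rather tidy up and fold several tags into one, the merge is mechanical. A small Python sketch follows. The function name merge_contents is hypothetical, and the sketch assumes you want duplicates dropped and the original order kept.

```python
def merge_contents(*contents):
    """Merge several content values into one comma-separated value.

    Hypothetical helper for illustration: commands are uppercased,
    duplicates are dropped, and first-seen order is preserved.
    """
    merged = []
    for content in contents:
        for part in content.split(","):
            command = part.strip().upper()
            if command and command not in merged:
                merged.append(command)
    return ",".join(merged)


print(merge_contents("NOFOLLOW", "NOODP"))             # NOFOLLOW,NOODP
print(merge_contents("nofollow, noodp", "NOFOLLOW"))   # NOFOLLOW,NOODP
```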

Finally, the Google post provides reassurance that capitalization doesn’t make a difference. I’ve shown things in various ways above, sometimes the commands in ALL CAPS, sometimes in lowercase. As Google says, case makes no difference. To quote their post:

Googlebot understands any combination of lowercase and uppercase. So each of these meta tags is interpreted in exactly the same way:

<meta name="ROBOTS" content="NOODP">
<meta name="robots" content="noodp">
<meta name="Robots" content="NoOdp">

Ah, but what about something like this:

<MeTa nAMe="RoBots" conTEnt="NooDP">

Well, Google didn’t go that far. But my experience over the past decade has been that meta tags are not case sensitive at all with the major search engines. So I think you’re safe in whatever case, for all the major search engines.
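For what it’s worth, ordinary HTML parsers treat tag and attribute names the same way. Python’s standard html.parser, for example, lowercases tag and attribute names before handing them over, so only the attribute values need normalizing. The sketch below shows that. The class and function names are hypothetical, and it shows nothing about any search engine’s own parser.

```python
from html.parser import HTMLParser


class RobotsTagReader(HTMLParser):
    """Record the content of every meta robots tag, regardless of case."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased the tag and attribute names here,
        # so only the attribute values need to be normalized by hand.
        if tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "robots":
                self.found.append((attr.get("content") or "").lower())


def read_robots(html):
    """Return the lowercased content values of all meta robots tags."""
    reader = RobotsTagReader()
    reader.feed(html)
    reader.close()
    return reader.found


print(read_robots('<MeTa nAMe="RoBots" conTEnt="NooDP">'))  # ['noodp']
```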




About The Author: Danny Sullivan is editor-in-chief of Search Engine Land. He’s a widely cited authority on search engines and search marketing issues who has covered the space since 1996. Danny also oversees Search Engine Land’s SMX: Search Marketing Expo conference series. He maintains a personal blog called Daggle (and maintains his disclosures page there). He can be found on Facebook, Google Buzz and microblogs on Twitter as @dannysullivan. See more articles by Danny Sullivan


9 Comments on Meta Robots Tag 101: Blocking Spiders, Cached Pages & More

kching, March 6th, 2007 at 4:15 am ET:

maybe I haven’t had enough coffee :)
but don’t you mean in your second example?
meta name=”robots” content=”index,follow



Danny Sullivan, March 6th, 2007 at 6:38 am ET:

Thanks for the catch! Got it fixed.



pratt, March 6th, 2007 at 12:23 pm ET:

I completely agree that it is silly to put:
meta name=”robots” content=”index,follow” in your code. But if it is already in there, are there any benefits to going in and taking it out? Will it speed up load times or anything like that?

Please excuse my lack of programming knowledge.



Danny Sullivan, March 6th, 2007 at 1:40 pm ET:

You can safely leave it in :)



Sean Carlos, March 6th, 2007 at 3:48 pm ET:

As always, a very informative resource.

Just a quick comment to note that Microsoft does support “noarchive”, so the Microsoft only “nocache” is not needed.

Reference:

search.live.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm

- Sean Carlos



Danny Sullivan, March 6th, 2007 at 6:05 pm ET:

Thanks, Sean, much appreciated and have it updated.



Robert_Charlton, March 7th, 2007 at 3:20 pm ET:

How does the noindex command on AdWords landing pages affect AdsBot and quality score spidering? Google indicates that it will ignore robots.txt so it can spider these pages, but will it spider them if you’re using the robots meta tag?

There are many reasons, I feel, why using the robots meta to block indexing of a landing page might be preferable.



Robert_Charlton, March 10th, 2007 at 4:40 pm ET:

Danny – Another comment… related to my comment above, and possibly something to mention if you talk to responsible people at MSN.

In the past several days, a client site has been plagued by MSN indexing (and prominently ranking) url-only listings to “blocked” pages. These pages have all used the robots meta tag, with the following syntax…

They include AdWords landing pages, test pages, etc. In the past, I’ve encountered problems with Google indexing such references, but that is because the pages were blocked by robots.txt and the “link references” to these pages were exposed. That’s why I ultimately switched to using the robots meta tag and dropped the robots.txt.

As we’ve discussed at SEWF, all the engines need to get on the same page (so to speak) about this.



Igor, May 7th, 2007 at 4:54 pm ET:

Microsoft’s Livesearch is ignoring the NOINDEX tag on my pages. I have been e-mailing back and forth with them for a few weeks and things just get curiouser and curiouser. If you do this search for pages at my site -
search.msn.com/results.aspx?q=www.gadgetduck.com&form=QBRE
- you’ll see some pages with a link labeled “cached page”. Those pages carry the NOINDEX metatag. (Also, those pages carry NOARCHIVE. Livesearch actually does not archive the pages – the “cached page” link comes up blank, which is fine with me. But, this seems to show another system glitch for them.)

Has anyone else seen this problem? Is it possible that my tags are malformed? Google and Yahoo seem to have no problem following them correctly.


