Search results in search results

by Matt Cutts on March 10, 2007

in Google/SEO

I was reading an interesting question on Google’s webmaster help group that was posted a few weeks ago. The question was:

Is there any official Google statement regarding that search result on
one’s own site ought to be disallowed from indexing (e.g. via
robots.txt)?

and the questioner went on to mention that YouTube’s search results were showing up in Google. Vanessa Fox showed up to tackle the answer:

Typically, web search results don’t add value to users, and since our
core goal is to provide the best search results possible, we generally
exclude search results from our web search index. (Not all URLs that
contains things like “/results” or “/search” are search results, of
course.)

I’ll take a look at the YouTube example. Thanks.

As a result of that question, YouTube added a “Disallow: /results” line in its robots.txt file. That’s good because as Google recrawls web pages, we’ll see that and begin to drop those search results.
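To see what a rule like that does in practice, here is a minimal sketch using Python's standard urllib.robotparser module. It assumes a robots.txt that contains only a wildcard group with the "Disallow: /results" line; the www.example.com URLs and the is_crawlable helper are hypothetical and are not taken from YouTube's actual file.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: a single wildcard group with one rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /results
"""

# Parse the rules from a string instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(user_agent, url):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: one search results page and one ordinary page.
search_page = "http://www.example.com/results?search_query=cats"
normal_page = "http://www.example.com/watch?v=abc"

print(is_crawlable("Googlebot", search_page))  # search results page
print(is_crawlable("Googlebot", normal_page))  # ordinary content page
```

Under these assumptions the search results URL is reported as blocked and the ordinary page as crawlable. The match is a simple path prefix, so any other path that begins with /results would be blocked as well.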

Google already does similar things with our web search results, Froogle, etc. to try to prevent our web search results from causing problems for any other engines’ index. In general, we’ve seen that users usually don’t want to see search results (or copies of websites via proxies) in their search results. Proxied copies of websites and search results that don’t add much value already fall under our quality guidelines (e.g. “Don’t create multiple pages, subdomains, or domains with substantially duplicate content.” and “Avoid “doorway” pages created just for search engines, or other “cookie cutter” approaches…”), so Google does take action to reduce the impact of those pages in our index.

But just to close the loop on the original question on that thread and clarify that Google reserves the right to reduce the impact of search results and proxied copies of web sites on users, Vanessa also had someone add a line to the quality guidelines page. The new webmaster guideline that you’ll see on that page says “Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.”

This hasn’t been a burning issue for many people, and for people that pay attention to search I’m sure it’s a well-known fact (e.g. see here where someone asked me about a particular site copied via a proxy, and my reply later that day), but it’s still good to clarify that Google does reserve the right to take action to reduce search results (and proxied copies of websites) in our own search results.

Philipp, thanks for asking the question originally. It was good that you pointed out we had some of our own web search results showing (so that we could correct that), and it’s also good to make sure that site owners get clear guidance.


Aaron Pratt March 10, 2007 at 2:20 pm: Google surely reacts to everything “Philipp” says.

Just an observation.
Tony Ruscoe March 10, 2007 at 2:59 pm: One proxy that I’ve seen popping up in Google’s search results is zta.net – just search for [site:zta.net] to see how many sites they’re copying. Since they seem to be hosting (live?) duplicate content on subdomains, I was worried they could rank higher than official sites and perhaps get some sites penalized for hosting duplicate content at multiple domains.

Should those pages be included in Google’s search results?
Sergi March 10, 2007 at 3:59 pm: That’s a good move to ask webmasters to disallow search URLs, but I guess many of them using CMS don’t even know about this issue : www.google.com/search?q=inurl:/index.php/?q

But what about webmasters who create hundreds of thousands of ‘fake’ search pages that contain only AdSense and on-the-fly links to the same search on their other sites?

Does Google plan to clean those pages from the index too? That would be great, because spam reports don’t seem to be very effective on this issue.
Luis Alberto Barandiaran March 10, 2007 at 4:49 pm: Here is an example of how it’s done, so that everyone can implement it right away:

Step 1 : Edit your robots.txt (if you don’t have one, create it)
Step 2: The file must say:

User-agent: *
Disallow: /directory-not-for-robots-or-spiders

Step 3: Save the file and upload it to your server.

Replace “directory-not-for-robots-or-spiders” with whatever directory you need
———-
Say you want to only prohibit Google, but allow the rest, then change the * for the term Googlebot:

User-Agent: Googlebot
Disallow: /directory-not-for-robots-or-spiders
———-
Say you want to prohibit multiple directories:

User-agent: *
Disallow: /directory1
Disallow: /directory2/subdirectory/
Disallow: /directory3
———-
And if you want to test that everything went smoothly, log in to Google’s Webmaster Tools; under the Diagnostics tab there’s a robots.txt tester.

Happy surfing!

Luis Alberto
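The variations in the comment above can also be checked locally. The following is a rough sketch using Python's standard urllib.robotparser module; the combined robots.txt, the "OtherBot" crawler name, the www.example.com host, and the allowed helper are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining the variations above: one group that
# applies only to Googlebot and one wildcard group for every other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /directory1

User-agent: *
Disallow: /directory2/subdirectory/
Disallow: /directory3
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if agent may fetch path on the (hypothetical) host."""
    return rules.can_fetch(agent, "http://www.example.com" + path)


checks = [
    ("Googlebot", "/directory1/page.html"),
    ("Googlebot", "/directory3/page.html"),
    ("OtherBot", "/directory1/page.html"),
    ("OtherBot", "/directory2/subdirectory/page.html"),
    ("OtherBot", "/directory2/other/page.html"),
    ("OtherBot", "/directory3/page.html"),
]

for agent, path in checks:
    status = "allowed" if allowed(agent, path) else "blocked"
    print(f"{agent:10} {path:36} {status}")
```

One thing this sketch makes visible: in this parser, a crawler that matches a named group follows only that group. Googlebot is therefore blocked from /directory1 but not from /directory3, which appears only in the wildcard group. To block a directory for every crawler, the Disallow line would have to be repeated in each group.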
Luis Alberto Barandiaran March 10, 2007 at 4:59 pm: Hmmm Matt, so my next question would be: if we know of certain sites using this technique to inflate their page counts, do we report them as spam? How do we go about this?

Let me give a common example, Technorati. They have search results (254,000 counts) and tag results (964,000 counts). Both are indexed in Google and often come up in results.

Thanks!

Luis Alberto
SearcH EngineS WeB March 10, 2007 at 9:05 pm: Search Results are an asset to the SERPs, and should be included. They potentially offer the same impact as the ‘Search Suggestions’ now at the bottom of Google’s organic SERPs.

Clicking on RELEVANT RELATED search results can open an entire new window.

Perhaps an algo combination that decides, based on TRUSTRANK and Click Popularity, which search results should be allowed to rank higher in organic SERPs should be considered.

But there is potential with this resource and they should NOT be automatically excluded.

All Information has some validity!
meng xiang March 10, 2007 at 9:12 pm: as for Tony’s question:

“One proxy that I’ve seen popping up in Google’s search results is zta.net – just search for [site:zta.net] to see how many sites they’re copying. Since they seem to be hosting (live?) duplicate content on subdomains, I was worried they could rank higher than official sites and perhaps get some sites penalized for hosting duplicate content at multiple domains.

Should those pages be included in Google’s search results? ”

_______

I’ve clicked on some of the copies of these sites (url appended with a zta.net), most of the time I went to a zfs.org, which seems to be associated with zta.net.
Should this zta.net as well as zfs.org be investigated?

thanks
Matt Cutts March 10, 2007 at 10:56 pm: Aaron Pratt, Philipp often asks good questions, but I don’t respond to everything he says.

Tony Ruscoe and meng xiang, I’ll ask someone to look into it.

Sergi, bear in mind Vanessa’s point that “q=” or “/search” doesn’t mean that something is necessarily a search results page. As far as your other point, I’m always happy to get spam reports of any ‘fake’ search engines. That stuff is right down my webspam alley.
meng xiang March 10, 2007 at 11:21 pm: thanks Matt .
I’ve been reading your blog for a while, it’s really a good place to learn about anti-spam and search quality and everything.
Multi-Worded Adam March 11, 2007 at 12:07 am: Stupid side question in all of this: what about approaching the issue of fake search engine/auto-generated page spam from a slightly different angle and trying to auto-detect/ban from Adsense and possibly even Adwords? A lot of these guys are making money off of one or both of these programs, money that could justify their efforts even if they get banned from the organic SERPs (which they quite often don’t).

Maybe by cutting off the money flow, the spam doesn’t flow as freely into organic SERPs (I’m not sure on this because a lot of them will probably turn to Yahoo! or Adbrite or something like that, unless you guys can get together on issues like this.) It wouldn’t become worth it for at least some spammers if there were reduced or no revenues to collect (in theory).

Anyway, just a random thought.
Mike Schinkel March 11, 2007 at 12:13 am: Matt:

There are a lot of little initiatives I can envision that would help Google index content better. I am planning to slowly “propose” them on my Well Designed URLs blog [1]. However, I’d like to present them to people at Google and cultivate an open discussion about them, because without your involvement and that of others in the community they will be nothing more than metaphorical hot air.

Is there any kind of forum to engage people from Google who have the power to recognize new webmaster techniques when indexing in order to help Google and make a better web?

[1] blog.welldesignedurls.org
Mike Schinkel March 11, 2007 at 12:15 am: More on topic, how does Google view web pages like the following? (this is a page I wrote for my _former_ company; they have since updated it just slightly):

www.xtras.net/ComponentsAndToolsForDotNet.asp
Steve March 11, 2007 at 3:47 pm: Is this not a little too vague?
“Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.”
Would this not be better:
“Use robots.txt to prevent crawling of all search results pages if any pages link directly to a specific results page. Also use robots.txt to prevent crawling of auto-generated pages that don’t add much value for users coming from search engines.”
NextRef March 11, 2007 at 4:41 pm: I think that’s spam and has to be removed from Google’s index…
matt March 12, 2007 at 8:54 am: Will this update filter out websites that just point to a URL that has their own overture powered feed?
Philipp Lenssen March 12, 2007 at 9:00 am: Cool info thanks Matt & Vanessa.

> Google surely reacts to everything “Philipp” says.

Note I often just “forward” questions that pop in the Google Blogoscoped forums or elsewhere…
Jeremy CEO @ Yelp March 12, 2007 at 10:16 am: While I understand your guideline for general web search, in local that seems a little short sighted (and probably anywhere category exploration is critical – so I’m not sure if killing YouTube search/browse pages was a good idea either).

If I search for “tailor san francisco” I probably want a listing of tailors. Google onebox shows a few results from local, but then shows a single tailor from Yelp (albeit a decent one) below. Since the user is effectively doing a category search/browse, isn’t another list of tailors what the user wants? By listing the CitySearch “best of” tailors next you’re effectively doing this. Yelp has essentially the same page, except that instead of editorial it’s user-gen with search, and it gets at the same thing, a list of excellent tailors:

www.yelp.com/search?find_desc=tailor&find_loc=San+Francisco%2C+CA&action_search.x=0&action_search.y=0

You really want to ban this?
Mario March 12, 2007 at 11:09 am: Very interesting post.

What happens to vertical search sites like Indeed or Oodle? Is Google going to ban their search results pages from the Google search results? (In a way, this is very similar to what Jeremy says.)
Mikkel deMib Svendsen March 12, 2007 at 11:32 am: Matt, the guidelines have not been updated in the Danish version – and probably not in most others either. And when we (the Danes) follow your link (or type it in) to the guidelines we end up on the Danish page, not the English one you are seeing and linking to (yes, that’s what we call cloaking … no sorry, personalization, no, I mean geotargeting … well whatever hehe)
Jeremy CEO @ Yelp March 12, 2007 at 11:46 am: Another good one…

www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLG,GGLG:2006-09,GGLG:en&q=burger+San+Francisco

Google Onebox: burger king
CitySearch: editorial list
About.com: editorial list
Yelp: “search” over user-gen
Nico March 12, 2007 at 1:34 pm: Hi Matt,

Just one question (sorry for my English, I’m French): I submitted an exclusion request to Google for one of my websites (ambv.free.fr), but it still remains in Google’s index in spite of my robots.txt.
For a little while it disappeared from Google, but it came back later.
Does Google Search always respect robots.txt, or does it display results from excluded sites when they are the web reference?
rumblepup March 12, 2007 at 1:57 pm: Matt,

I’ve got to tell you that this opens up, for me, a can of worms. If I understand correctly, dynamic search results might be in danger of being removed from Google’s SERPs. My obvious problem with this is that I have sites using dynamic content such as /SearchResult.aspx?CategoryID=4 to produce a page for a particular category of products. Basically, this page (as the “no doi” url implies) is a search result of a query against my database. This was, and is, the promise of dynamic content.

So this particular stance from Google is very concerning. If you’re not going to penalize sites like mine, and millions of others, in the way that Vanessa Fox suggests, then I apologize for my concerns. But does this also endanger social bookmarking sites that use user-generated “tags”? These are also “search results” that provide a lot of content that I feel should be in Google’s SERPs. How about blog sites that “categorize” or “tag” content so that it provides links to modified search results, thus generating a page of content highly relevant to a particular term? Your own blog, for example, uses this technology. It’s a search result, no matter how you dress it. Is the difference in the URL? We can’t all implement human-friendly URLs, and we can’t make thousands of pages by hand.

And in all honesty, I think that some of the mentioned websites, like Yelp, in my estimation, are doing nothing wrong. If a “search results” page is highly relevant to a search term, and ultimately serves the Google user, then what is the harm? Doesn’t the Google algo identify relevancy to a term? The YouTube search results page, I think, served the searcher what they wanted, a page relevant to their term.
Joris Verwater March 13, 2007 at 12:04 am: Hi Matt,

I wonder about the phrase:

“Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines”.

When is a page of no value to a user coming from a search engine?

Let’s take the example on searchengineland (searchengineland.com/070312-104201.php) about the shopping.com dvd players. Those results are very useful for someone interested in buying a dvd player or someone just seeking information. Is it really such a good idea to ban those kinds of pages from the index?

Grtz
Joris
frank March 13, 2007 at 3:46 am: Luis Alberto says;

User-agent: *
Disallow: /name of directory/

If I want to block a specific directory this way, does it work for all SE robots?
Is the above text all I need to write in robots.txt?

Where can I test my robots.txt file?

regards

Frank
Brandon March 13, 2007 at 5:36 am: Hi Matt,

I just have a quick question that is in somewhat of correlation with what JeremyCEO@Yelp has suggested. I help run a real estate web site, and for obvious reasons this might have a rather dramatic effect on what we are trying to do. We have used mod_rewrite to make a clean URL that automates our searches.

“Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines”.

To us and to our end users this does add value, much in the way of how yelp does.

Any ideas?

Thank you,

Brandon
Deb March 13, 2007 at 6:39 am: I searched [site:zta.net] in Google, and all the subdomains go to this URL > www.zfs.org/home.html

Isn’t that spam? Matt, what do you think?

Deb
Nacho Hernandez March 13, 2007 at 9:40 am: Quality guidelines are good to have. However, keep in mind that there are many platforms out there (such as Yahoo! Stores) that do not allow you to have control of your robots.txt. In my opinion, Google should filter these results but not penalize a site for them.

Saludos,

Nacho
Reik March 14, 2007 at 6:39 am: Hey Matt, that’s a nice topic which I had recently thought about for our page. We did a recent relaunch and added a so-called “faceted meta data navigation”, which is simply a navigational search that lets the user narrow down his search result by selecting different facets (attributes). Looking at my logfiles, I saw that Googlebot is now crawling every possible permutation of our navigational search. I thought of adding a “noindex” to all of our search result pages, but besides the fact that I am lazy, isn’t that something that I would only do for a search engine and not for the user? There is value being added for the user, which is a convenient way of “exploring” search results just by navigating. Shouldn’t Google just ignore these similar pages instead of putting a penalty on the webmaster?
Vanessa Fox March 14, 2007 at 8:45 am: Hi Mikkel,

We are working to get the different language versions updated, and you should see the latest version shortly.
NetMidWest March 15, 2007 at 8:26 am: There are Amazon search results being cached under the imdb.com domain:
www.google.com/search?q=site:imdb.com/r/
These are 302 status code producing redirect links from imdb.com to Amazon, and Google is caching Amazon search results pages under these urls. These 302 redirect links are a valid method of click tracking, but if Google would not follow them (and cache them) these “search results in search results” would not exist… nor would other problems associated with Google’s handling of such redirects.

Physician, heal thyself.
Wheel March 15, 2007 at 12:56 pm: Hi Matt,
While you’re cleaning up stuff on other sites, you might want to pass along that the whois information for Google.ca is incorrect. At the very least the phone number is wrong. I called and the lady was pleasant but distressed to be getting 30-40 phone calls for the Toronto office of Google every day.

gipoco.com is neither affiliated with the authors of this page nor responsible for its contents. This is a safe-cache copy of the original web site.