Q & A thread: March 27, 2006
by Matt Cutts on March 28, 2006
in Google/SEO
Okay, let’s try tackling a few questions from the Grab bag thread. Just a hint for next time: if your question takes three paragraphs to ask, your odds of getting an answer go down.
Q: “Is Bigdaddy fully deployed?”
A: Yes, I believe every data center now has the Bigdaddy upgrade in software infrastructure, as of this weekend.
Q: “What’s the story on the Mozilla Googlebot? Is that what Bigdaddy sends out?”
A: Yes, I believe so. You will probably see less crawling by the older Googlebot, which has a User-Agent of “Googlebot/2.1 (+http://www.google.com/bot.html)”. I believe crawling from the Bigdaddy infrastructure uses a new User-Agent: “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”.
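If you want to tell the two crawlers apart in your access logs, a small sketch like the following works. The two User-Agent strings are the ones quoted above; the helper name and return labels are made up for illustration:

```python
# Sketch: classifying a crawler User-Agent string from access logs.
OLD_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
NEW_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def classify(ua: str) -> str:
    if "Googlebot/2.1" not in ua:
        return "not-googlebot"
    # The Bigdaddy crawler announces itself with a Mozilla/5.0 prefix.
    return "bigdaddy" if ua.startswith("Mozilla/5.0") else "classic"

print(classify(NEW_UA))  # bigdaddy
print(classify(OLD_UA))  # classic
```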
Q: “Do you take Emmy with you to San Francisco?”
A: Nope, Emmy is a true indoors cat; she doesn’t like to travel.
Q: “Any new word on sites that were showing more supplemental results?”
A: An additional crawling change to show more pages from those sites was checked in late last week, but it may still take a little bit of time (another few days) for that to show up in the index. I’ll keep an eye on sites that people have given as examples to see how those sites are showing up.
Q: “Is the RK parameter turned off, or should we expect to see it again?”
A: I wouldn’t expect to see the RK parameter have a non-zero value again.
Q: “What’s an RK parameter?”
A: It’s a parameter that you could see in a Google toolbar query. Some people outside of Google had speculated that it was live PageRank, that PageRank differed between Bigdaddy and the older infrastructure, etc.
Q: “Now that Bigdaddy is out, will there be a new export of PageRank anytime soon?” and “Will the deployment of BigDaddy stabilise the rolling PR issues we are experiencing at present?”
A: I’ll ask around about that. If there aren’t any logistical obstacles, I’ll ask if we could make a new set of PageRanks visible within the next couple weeks. I’d expect that as Bigdaddy stabilizes everywhere, the variation in toolbar PR for individual urls is more likely to settle down too.
Q: “This datacentre 64.233.185.104/ works differently to all of the others. Noticed just a few hours ago. . . . . Where does that DC fit into the scheme of things? Is it mainly made from newly spidered data?”
A: Sharp eyes, g1smd. That wouldn’t surprise me. As Bigdaddy cools down, that frees us up to do new/other things.
Q: “Not so much a question… GET A PSP!”
A: I got one today, TallTroll. I picked up Me and My Katamari (MAMK) and a PSP that turned out to have firmware v1.52 on it. So I could upgrade to 2.0, then downgrade to 1.5 so I could run homebrew programs. But I think MAMK requires firmware 2.5 or 2.6 to play, which means a one-way upgrade or maybe using RunUMD or a similar program. Suffice it to say I’m having fun just geeking around.
Q: “Can you give us a general way of getting a good idea in front of Google?”
A: If it’s bizdev, there’s a bizdev dept. at Google you could contact. If it’s not a business/patent/proprietary idea, I’d mention it here or blog about it somewhere. Writing a snail mail letter could work well too.
Q: “Did you check out the guys all painted in silver doing the robot on milk crates in San Fran?”
A: Nope, that’s down by Fisherman’s Wharf. We’re hanging near Union Square.
Q: “Why do you focus your attention so much on SEOs and not at webmasters who make actual quality websites?”
A: I think that’s an issue I have personally, because I spend so much of my time looking at spam. Lots of other people focus on helping general webmasters, like the Sitemaps team, for example. I have started to do “SEO Advice” posts instead of just “SEO Mistakes” posts, but you’re right: I personally could use a reminder to keep focusing on the sites that make quality content and how to pull those sites up, not just how to counter sites that cheat. Thanks for bringing that up.
Q: “My sitemap has about 1350 urls in it. . . . . it’s been around for 2+ years, but I cannot seem to get all the pages indexed. Am I missing something here?”
A: One of the classic crawling strategies that Google has used is the amount of PageRank on your pages. So just because your site has been around for a couple years (or that you submit a sitemap), that doesn’t mean that we’ll automatically crawl every page on your site. In general, getting good quality links would probably help us know to crawl your site more deeply. You might also want to look at the remaining unindexed urls; do they have a ton of parameters (we typically prefer urls with 1-2 parameters)? Is there a robots.txt? Is it possible to reach the unindexed urls easily by following static text links (no Flash, JavaScript, AJAX, cookies, frames, etc. in the way)? That’s what I would recommend looking at.
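A quick way to audit the “ton of parameters” point is to count query parameters per URL. A throwaway sketch (the URLs are hypothetical, and the 1–2 parameter rule of thumb is the one stated above):

```python
from urllib.parse import urlparse, parse_qs

# Count the distinct query parameters in a URL.
def parameter_count(url: str) -> int:
    return len(parse_qs(urlparse(url).query))

print(parameter_count("http://example.com/item?id=5&page=2"))         # 2
print(parameter_count("http://example.com/item?id=5&sid=abc&ref=x"))  # 3
```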
Q: “When I change a robots.txt to exclude more existing files from being crawled, how long does it take for them to be removed from the index? Perhaps the answer is a function of how often the site is crawled and its PR?”
A: It is a function of how often the site is crawled. I believe in the past that every several hundred page fetches or several days, the bot would re-check the robots.txt. Note that for supplemental results, you need recrawling to happen by the supplemental Googlebot in order for the robots.txt file to take effect on those pages. If you’re really sure you never want those pages to be seen, you can use our url removal tool to remove urls for six months at a time. But I’d be very careful with the url removal tool unless you’re an expert. If you make a mistake and (for example) remove your entire site, that’s your responsibility. Google can sometimes clear out self-removals, but we don’t guarantee it.
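If you want to sanity-check an exclusion before the bot re-fetches your file, Python’s standard robots.txt parser can simulate the check. A sketch with made-up rules and URLs:

```python
from urllib import robotparser

# Hypothetical robots.txt rules blocking one directory for Googlebot.
rules = [
    "User-agent: Googlebot",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))         # True
```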
Q: “I would love to be able to search for html code and see how that ranks.”
A: I would like that too. Indexing non-visible things like punctuation, JavaScript, and HTML would be great, but it would also bulk up the size of the index. Any time you’re considering a new feature (e.g. our numrange search), you have to trade off how much the index would get bigger versus the utility of the feature. My guess is that we wouldn’t offer this any time soon.
Q: “Seriously, How do you plan on picking which of these questions to answer?”
A: I’m tackling the ones that looked interesting, short, and general enough that more than one person would be interested.
Q: “I am seeing a lot of sites with “%09” (tab) and “%20” (space) in front of the URL in Google’s index.”
A: I’ll ask someone about that.
Q: (paraphrasing) The sitemaps validation fetch seems to happen with a User-Agent of “-”? My auto-reject rules reject that user agent.
A: I’ll ask someone about that. You could whitelist the IP range that Googlebot comes from in the mean time.
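As a sketch of the whitelisting idea: the 66.249.64.0/19 block covers the crawl-66-249-*.googlebot.com hostnames people commonly quote, but treat the exact range as an assumption and verify it against your own logs before relying on it:

```python
import ipaddress

# Candidate Googlebot range (an assumption inferred from observed
# crawl-66-249-*.googlebot.com hostnames; confirm before trusting it).
GOOGLEBOT_RANGE = ipaddress.ip_network("66.249.64.0/19")

def is_whitelisted(ip: str) -> bool:
    return ipaddress.ip_address(ip) in GOOGLEBOT_RANGE

print(is_whitelisted("66.249.65.225"))  # True
print(is_whitelisted("203.0.113.7"))    # False
```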
Q: “If one were to offer to sell space on their site (or consider purchasing it on another), would it be a good idea to offer to add a NOFOLLOW tag so as to generate the traffic from the advertisement, but not have the appearance of artificial PR manipulation through purchasing of links?”
A: Yes, if you sell links, you should mark them with the nofollow tag. Not doing so can affect your reputation in Google.
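To be precise, “nofollow” is a value of the rel attribute on the link itself rather than a standalone tag. A minimal example (URL hypothetical):

```html
<!-- A paid/advertising link marked so it carries traffic, not endorsement -->
<a href="http://advertiser.example.com/" rel="nofollow">Visit our sponsor</a>
```

Since rel sits on the a element, the same markup applies whether the anchor contains text or an image.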
Q: “On sites directed to international audiences with the same (high quality) content in several languages is it better to do several TLDs like mydomain.com, mydomain.de, mydomain.fr, mydomain.eu and so on or do subdomains like en.mydomain.eu, de.mydomain.eu, fr.mydomain.eu or something else like mydomain.com/en, mydomain.com/de, mydomain.com/fr?”
A: Good question. If you’ve only got a small number of pages, I might start out with subdomains, e.g. de.mydomain.eu or de.mydomain.com. Once you develop a substantial presence or number of pages in each language, that’s where it often makes sense to start developing separate domains.
Q: “Any results on why IDN Domains don’t show pagerank?”
A: I’ve seen a couple that do, but I’ll check into why most don’t. My guess is that there’s a normalization issue somewhere in the toolbar PageRank pathway.
Q: “Would it be possible to add a date range to queries? I might get 91,000,000 results, but the first 200 are 2-3 years old. I would like to limit results to items no more than 6-12 months old.”
A: Check out our advanced search page for this option. Tara Calishain also did some really interesting digging into this, e.g. this info she uncovered. Google Hacks is a pretty solid book if you’d like to read more fun Google hacks.
Q: “What about the problem of directories and shopping comparison spam overriding real pages?”
A: Fair feedback. I heard that recently from a Googler, too. Sometimes we think of spam as strictly things like hidden text, cloaking, etc. But users think of spam as noise: things that they don’t want. If they’re trying to get information, fix a problem, read reviews, etc., then sites like that aren’t as helpful.
Q: “Are you planning to visit/speak in the UK at all in the near future?”
A: Sadly not. I’m hitting the Boston Pubcon and SES San Jose, but I can only do 4-5 conferences a year.
Q: “The one thing that seems to be getting to people generally, is what are the post Big Daddy intentions? Fixes, spam issues, regeneration of ‘pure’ indices, supp. issues, PR and BL update, etc.”
A: I can’t give a timeline (e.g. “scaling up communication in April, more work on canonicalization in May”) because priorities can change, esp. depending on machine issues, deployments of new binaries, webspam developments, etc. Short-term, I wouldn’t be surprised to see some refreshing in supplemental results relatively soon, and potentially different PageRanks visible in the next couple weeks.
Q: “Even Matt is afraid to use a redirect from www.mattcutts.com/ to www.mattcutts.com/blog/ because Google might penalize his website and put it into supplemental hell.”
A: Heh. No, that’s not it. I’m deliberately leaving them separate as a test case to see how we do now and down the road.
Q: “Just like you told me a couple of months ago, the Supplemental Googlebot (SG) got around to my site and things got sorted out. Thanks. . . . . If you are in San Fran and want to check out the Monterey Aquarium, could you please write a short review? I’ve been thinking of visiting and wondering if it is worth the trip.”
A: I would definitely recommend the Monterey Bay Aquarium, especially if you can find a coupon or other good deal. I highly recommend the otters, the kelp forest, and the jellyfish area.
110 comments:
Hey Matt,
That is some pretty impressive posting
I have noticed that a couple of sites that I believe had canonical problems have come back – but only sites that have been sent to your engineers.
Not sure if this is a coincidence or whether a correction is starting to roll out. If it is a correction then cool – will it hit some sites before others, depending on crawl cycle etc.? If it is an engineer’s intervention, then when would you want reports of these?
Cheers.
Oops – just to clarify what I would call a correction for these sites.
E.g.: site:domain.com – domain.com is first.
“domain.com” as a phrase – domain.com is first.
Etc. – i.e. the homepage returns to its true value; the rest of the site seems to follow.
Matt, some great answers there, thanks.
This will help put to bed some of the crap that floats around about the Google mystique LOL.
I know that supplemental hell and the lack of deep crawling are especially important to some people.
I’m pretty sure that MAMK only requires firmware 2.0 to run, so you should be able to go back and forth as required. You need 2.0 for the browser though – depends how much surfing you want to do. AFAIK, the only game that requires 2.5 is EXIT, so you should be able to wait until a downgrade from the 2+ firmwares is available before going there.
I find that Soulseek, a USB cable and a PSP is a memory-hungry combo though… need to get a 2GB card soon.
Matt, that was a fair amount of time spent on writing answers this night. Thanks.
Apart from addressing supps, canonicals, pagerank re-calculation etc, will there be an imminent change in ranks as a result of these corrections?
Hi Matt,
As part of your review of the supplemental problem, are you also monitoring any sites whose pages have simply vanished (rather than gone supplemental)? I think the BD bug is responsible for both types of errant behaviour – sometimes it just refuses to index tens of thousands of pages, despite crawling them over and over again. That’s what we see anyway. None of the supplemental tweaks have yet made any difference to the missing pages problem.
Well well, you can answer questions about Google and SEO very well, but you didn’t answer my “why are there no blue foods in nature” question?! I shan’t be picking you as my phone-a-friend on Millionaire any time soon, Mr Cutts… well, unless they start asking SERP questions in the next few shows!
P.S. Saw a mobile dog-grooming van drive past our office the other day, called “Mutt Cutts” – I had a little chuckle.
Cheers for all these answers..
I do have one question though: with so many different sources of PageRank (live PageRank, future PageRank, etc.), what would you suggest we use to see an accurate measurement?
Please answer this:
From: www.mattcutts.com/blog/miscellaneous-monday-march-27-2006/#comment-19408
«For accessibility purposes, my site has ‘skip navigation’ etc… to allow screen readers to get straight to the content. [..] so I have ‘hidden’ these accessibility links using display:none in the stylesheet. [..] Will Google regard this as hidden text and penalise my site?»
On TLDs and international audiences: When a site is in one language how should it be expressed to Google that it is for a global audience?
For example restaurant reviews and shopping could be seen as local and localised respectively; but product reviews (where the product is available globally), encyclopaedia entries and reference material are more for a global audience.
There are suggestions the site be duplicated at the various TLDs e.g. .com, .co.uk, .ca, .au, etc. But this wastes bandwidth for the site and the google bots, encourages link splitting and can confuse the users.
The geo of the IP doesn’t always work as for example 1and1.co.uk gives out German IP addresses, and many other websites use US hosting for cheaper costs.
Just asking for clarification on how this issue should be tackled, as the various Google SERPs are becoming more and more local even if the user is not requesting pages only from their country (google.com vs. google.co.uk, or even, it seems, google.com used from a US IP vs. google.com used from a UK IP).
P.S. Keep up the good work!
Matt, thank you for taking the time to answer all these questions. What you are doing here says a lot about your character and commitment to the webmaster community.
I didn’t get to ask a question, but let me try now. If I agree to buy you Starbucks every morning, could you place my website at the top of the results? Since my new site isn’t ranked yet, one cup per morning is all I can afford.
Thanks for your time, Matt.
Very generous of you. Much appreciated.
Very disappointed no comments on expired domains.
Looks like we will continue to see domains such as
macalstr.edu/
astronomy-national-public-observatory.org/
rarestonemuseum.com
iasicongress2005.org
papyrusinternational.org/
and many others in the adult serps.
Seems like it’s all too hard for the webspam team, and this reflects badly both on Google and the adult internet industry.
So how long does it take for 301s to take effect across all the DCs? Even Y*hoo and M*N don’t seem to have a problem with it.
Matt, Firstly many thanks for both your time and efforts. I appreciate that you cannot be specific on certain points, due to the nature of privacy at Google.
Is it within your power to explain exactly what the following GoogleBots do? [You already answered 5.0 above ] – thanks
crawl-66-249-65-225.googlebot.com
Mozilla/4.0 compatible ZyBorg/1.0
Mozilla/4.0 compatible ZyBorg/1.0 Dead Link Checker
Mozilla/5.0
Googlebot/2.1
Thanks for answering these questions! Great information.
The URL removal tool has been broken for weeks. For example, I’ve tried to remove directory.sysice.com from the index because I took it down a few months ago, but I just get a Page Not Found when I try to submit it.
The biggest problem that I’ve seen many worry about here, and that Google is way behind in addressing, is 301 redirects with domain moves from domain1 to domain2. Matt seems to forever be ignoring this question, even though it was asked three or four times in the list of questions here and in many other comments by viewers. Matt and Google continue to ignore it or give vague answers about how or when Google plans to address this.
Matt, can you please once and for all address the question, and webmasters’ concerns, of how and when we can expect to see Google/Bigdaddy properly handle domain name moves using 301 redirects?
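For reference, the kind of domain move being asked about is usually set up like this on Apache (domain names hypothetical; a sketch of the common pattern, not Google guidance):

```apache
# .htaccess on the old domain: permanent (301) redirect of every URL
# to the same path on the new domain.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www\.)?old-domain\.example$ [NC]
RewriteRule ^(.*)$ http://www.new-domain.example/$1 [R=301,L]
```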
One comment that you may not publish, but I hope will read… WHAT is going on at Blogger? It is Google’s worst product by a country mile: regularly unreliable, and I can’t recall a single new feature that has been added since you brought it on board. It is dreadful, and if I hadn’t been unfortunate enough to *start* using it I wouldn’t still be using it. I try to warn everyone away, and it makes me sad.
Thank you Matt for taking the time to answer questions or even to look into the IDN Domain issue with the pagerank. These domains will truly advance the international internet experience.
Hi Matt,
Great effort answering so many questions, thank you.
One thing I’m still curious about (so are many others):
“A: Yes, if you sell links, you should mark them with the nofollow tag. Not doing so can affect your reputation in Google.”
Does this include linked images?
Damn…. if there are 2 choices I always make the wrong one – lol – sorry about the
Hi,
If BD is out now, then how come SERPs are showing pages that haven’t existed for 9+ months and return 404s?
Cheers,
K
Good job in answering so many questions, and I know you can’t answer every single one. But, it’s a shame you didn’t answer one of the most popular questions, about the loss of pages. Did you not want to answer it, or did you just miss it?
Thanks
No answer……………
It has been three months since spam has taken over the majority of adult search results in Google.
It’s strange to see “somewhat” relevant results one day, Dec 26th; then on Dec 27th just about the whole adult white-hat community was wiped out. A filter, maybe?
I believe that the adult serp problem is bigger than the supplementals – I just hope that it’s not being ignored.
What I am saying here applies to the entire adult industry in Google not just my little ole site.
========
Q: “Now that Bigdaddy is out, will there be a new export of PageRank anytime soon?” and “Will the deployment of BigDaddy stabilise the rolling PR issues we are experiencing at present?”
A: I’ll ask around about that. If there aren’t any logistical obstacles, I’ll ask if we could make a new set of PageRanks visible within the next couple weeks. I’d expect that as Bigdaddy stabilizes everywhere, the variation in toolbar PR for individual urls is more likely to settle down too.
========
Hi Matt,
I think it would be better if PageRank were not visible in the toolbar.
Hi, can you post some photos of Emmy? We are cat lovers.
I have reported several sites that use different spamming techniques, but nothing happens. For example, look at this site www.kickoff-konferens.se/rw/ and go to the bottom of the page. They mention Mirror1, Mirror2, Mirror3 and Mirror4. Why doesn’t Google exclude them? It feels like it’s OK to spam in Sweden and get top positions…
// Not so fun being a white hat SEO in Sweden.
Ahh.. that’s why so many people say bad things about porn… expired domains. Here I was thinking it was some sort of morals issue.
Mike, I agree with that… Take visible PageRank out of everything. People put way more faith in it and dependence on it than they should, and it’s still easy to fake.
Give a site a PR higher than 4 and its owners instantly think they’re worth millions and have hit the big time.
>You could whitelist the IP range that Googlebot comes from in the mean time.
Do you have a listing of all the Googlebot IP addresses?
Thanks.
To balance that feedback: we maintain a niche B2B directory, and customer feedback and high listing CTR seem to indicate that a large number of visitors are indeed looking to “buy” products when they type in a product keyword, and that the directory is indeed relevant.
Google has to make an educated/algorithmic guess about the searcher’s intent (information or purchase). If an action keyword complementing the product keyword is not specified in a search, the type of product itself can be used to yield a decent intent relevancy.
SERPs should not be flooded with directories, but there is always bound to be more negative feedback on directories, since there are a lot more individual-site webmasters than there are directories!
I have a question about one of your answers.
Yo