The SEOmoz Linkscape Ghost
News Source: ekstreme.com
Published: Oct 09, 2008 - 11:14 am
Story Found By: eKstreme 2222 Days ago
Category: Sphinn Zone
There is a way to build SEOmoz's Linkscape tool without crawling, and I show you how. If I'm wrong, SEOmoz must disclose the Linkscape User Agent, and I explain why.
Comments
Interesting theory. Either way, they do need to come forward with full disclosure.
I think you missed my post on WMW where I clearly called out the user agent used to crawl my site, which neatly ties back to them, and it certainly wasn't Yahoo; it was DotBot, also based in Seattle. There could be more things in play, but I didn't look any further.
IncrediBILL: I didn't miss your post. DotBot could be the seed index. On the DotBot website (www.dotnetdotcom.org/), they state their index has 10,355,148 pages so far, which is a far cry from the claim SEOmoz makes of 30+ billion. The domains count is also much, much smaller. So DotBot could be part of the story still.
From Yahoo the concept is "potentially unreliable." From Google, it would be potentially reliable. Even if you were interested in the metrics of a particular website, for their proprietary information, it still is only going to give you a bunch of numbers.

In the stock market, this is called Technical Analysis. TA dazzles you with very impressive-looking graphs and charts. The only problem is that TA doesn't WORK. TA is no more effective than charting your horoscope is at helping you live your life better.

Just because you've got a bunch of charts and numbers documenting what your competition is doing doesn't magically transform your website into a winner.

I would still say that this SEOmoz buzz is mostly about being able to snowjob your SEO clients with a bunch of impressive-looking numbers, charts, and graphs. But, in the end, it don't mean a hoot of beans no matter how you spy on your competition.
@JohnHGohde - I usually give you a hard time, but the line "TA is no more effective than charting your horoscope is at helping you live your life better." is absolute gold!
@DazzlinDonna - They only need to come forward with full disclosure if they're using a bot to do the crawling -- a bot is taking something from our pages, so we have a right to know. If they're using some other method, such as an API somewhere else, then they don't need to disclose anything. Their secrets of how they build the index aren't ours to demand.

What if SEOMoz was able to work a deal with Google to get access to the linking data? Doubtful, but it's quite possible they've figured out some other way to go through an aggregator to get at the data that doesn't require that they crawl the sites directly.

Here are some back-of-the-envelope calculations: to crawl 30 billion pages every 60 days, you'd need 1500 processors crawling 4 pages/second. The problem with crawling isn't the speed of your machine, it's the performance over on the other end. You can increase the speed and decrease the processors, but I'm guessing that $1M in venture capital wouldn't be enough to cover that and also do the software development and pay for the bandwidth. They must be doing something else -- perhaps their VC has a portfolio company that was already doing a crawl. Or perhaps the index isn't going to get refreshed very often. Either way, they've done something pretty cool.
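LtDraper's envelope math, spelled out as a quick sketch (all inputs are the figures from the comment above):

```python
# Back-of-the-envelope crawl capacity, using the figures quoted above.
pages = 30e9            # pages to crawl
refresh_days = 60       # full refresh window
rate_per_proc = 4       # pages/second one crawler process sustains

aggregate_rate = pages / (refresh_days * 24 * 3600)   # pages/second needed overall
processors = aggregate_rate / rate_per_proc

print(f"aggregate rate: {aggregate_rate:,.0f} pages/s")   # ~5,787 pages/s
print(f"processors needed: {processors:,.0f}")            # ~1,447, roughly the 1500 quoted
```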
@LtDraper, this quote is from the original blog post announcing their new tool: "Our crawl biases towards having pages and data from as many domains as possible". Now "crawl" to me implies a bot, but if that's not the case, fine - but in that case, we have the right to know that it's NOT a bot, so that we don't fret about it, I would think. Certainly, we don't have to be given proprietary info that has no bearing on us, I agree.

Added: And here - www.seomoz.org/linkscape/help/crawler - they specifically say they have a bot. That first graphic says "Linkscape bot crawls sites and page".
There's a bot all right.
DomainTools (also here in Seattle, acquired last year for $16m) maintains a database of every domain registered since the mid-1990s. Inside their user interface to that database, they present a thumbnail of the site (scraped by a bot) and provide an "SEO score" (don't bother). That's on-demand scraping (with caching, obviously) of every domain ever registered. No reason to mention DomainTools here... really... none at all. I'm just noting how "easy" it is to scrape a few billion domains at webmaster expense and sell the info back to webmasters for a profit. And, of course, to find domain-scraping technology and databases here in Seattle. SEOMoz LinkScrape - you left the door open, so we assumed we were welcome!
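For readers who haven't seen the pattern, on-demand scraping with a cache is simple to sketch. The snippet below is a toy illustration of the shape being described, not DomainTools' actual pipeline:

```python
import time
import urllib.request

_cache: dict[str, tuple[float, str]] = {}   # domain -> (fetched_at, html)
TTL = 30 * 24 * 3600                        # serve cached copies for 30 days

def fetch_on_demand(domain: str) -> str:
    """Scrape a domain only when a user asks for it; serve the cache otherwise."""
    hit = _cache.get(domain)
    if hit and time.time() - hit[0] < TTL:
        return hit[1]   # cache hit: no request lands on the webmaster's server
    with urllib.request.urlopen(f"http://{domain}/", timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    _cache[domain] = (time.time(), html)
    return html
```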
Sorry - commented on the other post, but not on this one. We do have bots, and we are going to provide ways to say "don't show me in your list of links" to Linkscape. We're a little overwhelmed at the moment (and much of the team is here with me in NYC for SMX East), but we'll be back in Seattle next week and making this a high-priority issue.

BTW - Besides DomainTools and SEOmoz, there are several dozen other companies that maintain large-scale web crawls similar in size to the search engines for a wide variety of reasons. Most of these have monetization, so as John correctly noted, we're certainly not alone in this, but I do understand the issue and want to address it ASAP.
Ah, yes, the magic word: Monetized. I guess that I'm just going to have to be more aggressive about blocking monetized bots from accessing my websites. Monetized bots need not apply.
@JohnHGohde - so I suppose you'll be blocking the googlebot, the granddaddy of all monetized bots?

These aggregators require serious horsepower to accomplish what they do, and the data is valuable. Are the vendors supposed to provide it for free?
@LtDraper - Google shares the wealth in terms of traffic that you can monetize and AdSense revenue sharing, which is far from this parasitic behavior.
@LtDraper - No, I don't expect the vendors to supply it for free. Nor do I expect to supply it to the vendors for free. If they send me traffic I can convert, fine; if not, I have to make a decision whether to bar the door or not.
Even if all the savvy SEOs block this tool from spidering their sites, the vast majority of sites won't even notice it exists. In my niche, I'm betting most of my clueless competitors are not even aware of it, and that makes it a potentially valuable tool for me. By the time they wise up, I may have already taken advantage of their data and blocked my sites from showing up.
I guess the idea that this is some great competitive advantage is lost on me. I suppose that's because I still subscribe to the idea that if you want to know what the most powerful links are for driving traffic for a particular term, you search for that term on the search engines and see what comes up.

Of course, maybe I'll change my mind after I play with it more. If it really does show you the most important links pointing back to a particular site, especially in terms of driving for particular anchor text, then I suppose that's a nice link building tool to have. When I looked at the beta, I did think it would be nice if it could show me the most important links that my competitors have, so that I could understand what "link gap" might exist.

I suppose if you're really, really worried, you'd want to block it. But then again, you have to trust that all those mozRank and other figures are really showing what Google's doing -- and they obviously aren't, right? I mean, people can only guess at what Google's doing.

Personally, I'm not too worried. You want to compete with me and get links in places where I'm listed? We get listed in places where editorial rules. So just knowing where we're at doesn't get you in the door -- you have to be good enough to walk in. And if you are good enough, well, good, I guess.

But some people don't want to be crawled for various reasons, and I totally respect that. The tool should have announced a way for blocking to be enabled the minute they started spidering. Plenty of other "stealth" projects have done this. You just leave a user agent with a URL to more info that shows up in logs. Good web citizens that run spiders respect robots.txt and respect it from the start. I've seen Rand say elsewhere that this will now be enabled, so at least they're getting into the right place.
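The "user agent with a URL to more info" convention mentioned above costs a crawler about one line. A minimal sketch of a polite fetch; the bot name and info page are hypothetical, and any real crawler would substitute its own:

```python
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

# Hypothetical bot name and info URL, purely for illustration.
UA = "ExampleLinkBot/1.0 (+http://example.com/bot.html)"

def polite_fetch(url: str) -> bytes:
    """Identify yourself in the logs and respect robots.txt from the start."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(UA, url):
        raise PermissionError(f"robots.txt disallows {url}")
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()
```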
As I commented on www.seomoz.org/blog/announcing-seomozs-index-of-the-web-and-the-launch-of-our-linkscape-tool:

"Is Linkscape an ad hoc development from SEOmoz, or is there a relationship - or even a kind of agreement - between Linkscape and Majestic SEO? On www.majesticseo.com/ I've got these figures: Index stats: 32,690,802,864 crawled pages (211,051,271,656 unique urls in total) and 81,502,004 unique domains (685,461,105 with subdomains), ~1.5 trillion (1.5*10^12) linking relationships. Quite the same figures as the ones quoted above."

I also wrote an email to SEOmoz asking for explanations, but here is the answer:

"Unfortunately, I cannot give you a detailed response. At this time, SEOmoz is not revealing the source of our crawl or the operation of Linkscape. Thus, we are neither confirming or denying relationships with other companies."

Jean-Marie
MajesticSEO has had an API since August 4 this year (www.majesticseo.com/api.php), so every SEO company can resell their data, but I don't think SEOmoz uses their data.

In my test I see 4-8 week old data in the Linkscape tool against 4 month old data in the MajesticSEO tool. I also get on average 10 times more links in the MajesticSEO tool; for example, if you try phpbb.com you get over 1 billion links at MajesticSEO (www.majesticseo.com/research/top-world-domains-by-backlinks.php) against 25 million at Linkscape.

Looking at www.majesticseo.com/research/anchor-index-quality.php, I think Linkscape has at most 30 billion links in their database, and for that you only need to crawl between 1.5 and 3 billion pages, as on average a page contains 10-20 unique links; when you crawl deeper, as MajesticSEO does, this ratio goes down.
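Statsfreak's pages-to-links arithmetic, spelled out (inputs are the figures from the comment above):

```python
links_claimed = 30e9          # upper bound on links in the Linkscape index
links_per_page = (10, 20)     # average unique links found per crawled page

low = links_claimed / links_per_page[1]    # 1.5 billion pages
high = links_claimed / links_per_page[0]   # 3.0 billion pages
print(f"pages needed: {low/1e9:.1f} to {high/1e9:.1f} billion")
```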
I'm 100% positive that the attempts to access my site came from DotBot because I feed back tracking data.

Did all of Linkscape's data come from DotBot? No clue, I can only comment on my tracking data from my own sites.

MajesticSEO on their site shows different tracking data, which indicates Majestic12 did the crawling for my data fed back to their site.
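One common way to confirm which bot actually hit you is forward-confirmed reverse DNS on the requesting IP. A sketch of the idea, using one of the dotnetdotcom.org crawl hosts listed in a later comment (the exact method behind IncrediBILL's tracking is his own):

```python
import socket

def verified_host(ip: str):
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if ip in socket.gethostbyname_ex(host)[2]:
            return host
    except OSError:
        return None
    return None

host = verified_host("208.115.111.242")   # one of the crawl hosts listed below
if host and host.endswith(".dotnetdotcom.org"):
    print(f"verified DotBot crawler: {host}")
```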
@LtDraper Google provides a FREE search engine, as well as other valuable services. Nor does Google sell my proprietary information for cold hard cash.

This is a wake-up call to block most bots from accessing your website for security reasons. WordPress has plugins that block evil bots from accessing your blog. It will be interesting to see if SEOmoz is eventually added to their evil-bot lists.
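If you'd rather not depend on a plugin, user-agent blocking is a few lines at the application layer. A minimal sketch as WSGI middleware; the "dotbot" token matches the user agent IncrediBILL reported, and whether Linkscape's own bot announces a blockable token was still an open question at the time:

```python
BLOCKED_TOKENS = ("dotbot",)   # extend with other bot tokens you decide to bar

def block_bots(app):
    """WSGI middleware that answers blocked crawlers with a 403."""
    def wrapper(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in BLOCKED_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Monetized bots need not apply.\n"]
        return app(environ, start_response)
    return wrapper
```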
DotBot appeared in my log files on July 28, 2008 and was active for 2 months. It looks like they use 10 servers to crawl:

208.115.111.242 crawl1.dotnetdotcom.org
208.115.111.243 crawl2.dotnetdotcom.org
208.115.111.244 crawl3.dotnetdotcom.org
208.115.111.245 crawl4.dotnetdotcom.org
208.115.111.246 crawl5.dotnetdotcom.org
208.115.111.247 crawl6.dotnetdotcom.org
208.115.111.248 crawl7.dotnetdotcom.org
208.115.111.249 crawl8.dotnetdotcom.org
208.115.111.250 crawl9.dotnetdotcom.org
208.115.111.251 crawl10.dotnetdotcom.org

Analyzing their 133GB data set shows it contains 3,533,635 URLs crawled over 10 days. This suggests a speed of 355,363 URLs a day per server, and 355,363 x 10 servers x 60 days = 213,217,800 URLs in 2 months. On average they stored 38,737 bytes of data for each URL; that's about 1 Mbit of bandwidth per server, and 10 x 1 Mbit = 10 Mbit of bandwidth across 10 servers.

If their crawlers were 10 times faster (as the MJ12 crawlers are), they could crawl 2 billion URLs in two months and would need 100 Mbit of bandwidth; if they had 100 servers instead of 10 at that same high speed, they could crawl 20 billion URLs in two months but would need 1000 Mbit (1 Gbit) of bandwidth.

Maybe DotBot provided their data, but more likely they provided only some of the data, because I have only talked about the crawl effort; there is also CPU needed to analyze the data.

Only a few search engines have reached the billion-URLs-indexed mark: Yahoo, Google, Cuil, Live, Yanga, Ask, Viola, Exalead, Abacho, Gigablast and some Chinese and Eastern European search engines. Some of them sell their crawler technology and do custom analyses of their data sets for a lot of dollars. They could have used Nutch, but the only search engine I know of that managed to scale Nutch above 1 billion pages was sproose.com.
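Statsfreak's scaling estimate, redone as a script (inputs are the figures quoted in the comment; the exact totals land slightly below the rounded numbers above):

```python
urls_sampled = 3_533_635    # URLs in the published 133 GB data set
sample_days = 10
servers = 10
bytes_per_url = 38_737

per_server_per_day = urls_sampled / sample_days              # ~353k URLs/day
two_month_total = per_server_per_day * servers * 60          # ~212M URLs
mbit_per_server = per_server_per_day * bytes_per_url * 8 / 86_400 / 1e6

print(f"{two_month_total:,.0f} URLs in 60 days at ~{mbit_per_server:.1f} Mbit/s per server")
# 10x faster crawlers on the same 10 servers -> ~2.1B URLs per 60 days (~130 Mbit/s
# aggregate); 100 such servers -> ~21B URLs at roughly 1.3 Gbit/s, matching the
# comment's rounded 2B / 20B / 100 Mbit / 1 Gbit figures.
```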
It seems pretty clear to me that SEOMoz's tools are aimed at a bulk of webmasters not as sophisticated as the ones engaged in these discussions. That's how I always viewed SEOMoz anyway, and why I have never paid to be a "pro" member. So no big surprise there.

However, blocking the bot is not about blocking some competitive info from leaking out. That sort of thing requires cloaking and such. Blocking the bot is to save money and avoid funding future annoyances (like junior webmasters piggy-backing on your linking partners, or selling reports on my websites to their clients as "competitive intelligence"). I don't need to dedicate a portion of MY resources to funding Rand's commercial enterprise.

Do the math... a group of us competitive webmasters control a large portion of the web's traffic (and pay for the bandwidth). As Statsfreak so clearly demonstrated above, someone is consuming GBs of bandwidth to do this crawl, and that river of data flows bi-directionally through two accounts: the crawler's ISP and my hosting pipe. For every dollar SEOMoz spends on crawling, we spend a dollar on serving pages to that crawler. That can only mean that when Rand says he needs to charge $799/year to fund this expensive crawler project, we, too, are seeing proportionally expensive bandwidth bills as a consequence. If you ALSO pay Rand that $799/year, then you are paying TWICE for that portion of bandwidth you contributed to his crawl (while Rand is recovering his costs, and then making a profit).

For the SEOMoz pro members on $5 virtual hosting accounts it's meaningless math - they are funders of everyone's profit margins. But when I get my S3 bill I see a charge for every bit of data served by my sites, and it all goes against revenues to calculate ROI. Serving data to Google is part of the web game. Serving pages to SEOMoz is not.
There's been a lot of confusion about Linkscape, and I'd like to clear up some of the vagaries. Unless you pay per page view, and you only get 5 page views/month, you won't notice the SEOmoz bot any more than you'll notice Googlebot, Slurp, or any other robot. I'm assuming SEOmoz's bot is respecting everyone's robots.txt; if not, then shame on them.

As for the reasoning behind SEOmoz's crawling, there's no other way to assess the relationship between web pages than to crawl the whole web. You cannot reproduce Linkscape using the Yahoo API. I don't know if the Linkscape data will be useful, or even if it's an accurate representation of Google's data, but it is not something you could easily reproduce without the same crawl data that Google has. I've been playing with the Linkscape reports, and while the reports are new and exciting, I'm not completely sold yet. I do hope that they are eventually able to produce numbers that are useful for SEO and that it will become a useful tool for developing SEO strategy for our clients.
Any updates to this user agent thing?
I suppose I have to comment, as my silence regarding Linkscape has been mistaken by several people as meaning I don't like it. Quite the contrary. I'll post here what I've posted elsewhere, almost verbatim. Over the years I have lost count of the 3rd party link building and link analysis tools and software I've tried out, many of which are long gone. What is most telling to me is that I abandon them when it comes time for heavy lifting: deep vertical link target ID and evaluation. I won't go so far as to say "All you need is Google and your brain", but it's darn close to true, at least for the type of client content I work with. Linkscape is outstanding and useful for a very specific set of metrics and measures, and for a certain type of link builder it will be quite helpful. I commend Rand for it and will use it to augment my own personal approach to the link building research process. On the other hand, as much as I want and look forward to every new tool, I keep thinking about Rocky IV, where Ivan Drago was using every cutting-edge tool and training method available, while Rocky ran around in the snow with a log on his back. The savviest link builders will use tools and logs.
Yes, I would like to know what the user agent is.