So lets be a bit more explicit. What are we trying to achieve here? I've always been curious as to where the hot spots are for Venture Capital (VC) investing so I could discover how Austin stacks up against Boston or Portland in terms of the amount of VC invested. Also, in late 2010 and 2011 there was a lot of talk about the US startup tech scene being in a bubble. Unfortunately, all the information that I tended to come across about the private equity market was simply subjective commentary from various investors based on the deal flow they and their personal networks were seeing. The most quantitative work I had come across was the article in Fortune Magazine by David Kaplan, "Don't call it the next tech bubble-yet" but the data largely focused on the public market rather than the private one. So I thought to myself, with the right data you should be able to ask some high level questions of the private market to see how much was invested this year vs. previous years and the distributions and frequency of those investments to determine if 2011 was showing signs of a bubble?
DISCLAIMER: I am not a Venture Capitalist or in Finance. I am however interested in understanding the technological ecosystem better, which includes its economy. This article is merely an example of how curious people can look to the web to explore data for whatever intrigues them. |
The first step to exploring these questions involves finding the data. Data needs to be in a structured format to be queried, so before you embark on the process of working with data of arbitrary structure on the web, you should see if you can find the data in structured format first. There are some great data marketplaces (follow the icons/links below) which provide searchable repositories of already structured data. I actually think its pretty neat that we already have so many companies providing this service. There are also some institutional repositories, like data.gov, but I don't know how much longer that site will be around.
Lastly, if you've exhausted those alternatives, you can web crawl (or spider). Web crawlers typically start by pulling down one or more web pages, these are known as the seeds of the crawl and crawl depth 1. The crawler then extracts the links from those pages and then builds a fetch list of pages to get and pulls down these pages. This is known as crawl depth 2. The process repeats for however many crawl depths you have specified for the crawl. As you can imagine, you are greatly increasing the amount of pages retrieved for each consecutive crawl depth. One can also specify crawl url filters where each link in the fetch list is matched and discarded if it doesn't meet the filter criteria. This allows one to constrain crawls to just one website, or just a part of a website.
Identifying the Website and Seeding the Crawl
After spending some time searching the web, I found that Crunchbase.com provided a website that was rich with crowd-sourced private equity data specifically related to tech. The site was set up so that each company had a page, and on that page it contained the sector (or category), the Location, the Name and details of each round of funding. This was perfect! However, if you're planning on crawling a site, you need to *carefully* read the website's terms of service and the license for its content. Otherwise, you might get sued.
Crunchbase has an awesome and flexible license for folks that are not gaining financially from the use of their content. One last thing to keep in mind. Websites have a "gentlemens agreement" with web crawlers in that they specify which parts of the site are allowed to be crawled and which parts are not, via a ROBOTS.TXT file. This is accessible at websitedomainname.com/robots.txt and is observed by most web crawlers although it is possible to modify the source code of Open Source crawlers and remove the compliance. You need to check the allowed crawling scheme in this file to make sure you can actually get to the content you want to retrieve. | |
www.crunchbase.com/companies?c=a&q=privately_held www.crunchbase.com/companies?c=b&q=privately_held www.crunchbase.com/companies?c=c&q=privately_held www.crunchbase.com/companies?c=d&q=privately_held www.crunchbase.com/companies?c=e&q=privately_held www.crunchbase.com/companies?c=f&q=privately_held www.crunchbase.com/companies?c=g&q=privately_held www.crunchbase.com/companies?c=h&q=privately_held www.crunchbase.com/companies?c=i&q=privately_held www.crunchbase.com/companies?c=j&q=privately_held www.crunchbase.com/companies?c=k&q=privately_held www.crunchbase.com/companies?c=l&q=privately_held www.crunchbase.com/companies?c=m&q=privately_held www.crunchbase.com/companies?c=n&q=privately_held www.crunchbase.com/companies?c=o&q=privately_held www.crunchbase.com/companies?c=p&q=privately_held www.crunchbase.com/companies?c=q&q=privately_held www.crunchbase.com/companies?c=r&q=privately_held www.crunchbase.com/companies?c=s&q=privately_held www.crunchbase.com/companies?c=t&q=privately_held www.crunchbase.com/companies?c=u&q=privately_held www.crunchbase.com/companies?c=v&q=privately_held www.crunchbase.com/companies?c=w&q=privately_held www.crunchbase.com/companies?c=x&q=privately_held www.crunchbase.com/companies?c=y&q=privately_held www.crunchbase.com/companies?c=z&q=privately_held www.crunchbase.com/companies?c=other&q=privately_held |
An ideal crawl for content retrieval should be at a crawl depth of 2. This means the crawl is specific and highly targeted and requires some thought before hand as to exactly what URLs the crawl will be seeded with. After spending some time with the site, I was able to determine that it could provide alphabetical indexes of just private companies. I was able to change which index was rendered by simply changing a parameter (such as from "c=A" to "c=B") in the querystring. This let me create a seed list such that the next crawl depth would only pull down the individual company pages that had the data I sought. Now, there are a couple of options for a choice of a web crawler. If you don't want to do the hardware/software set up of a crawler, you can use a service like 80legs (also the creators of DataFiniti), otherwise I recommend you use the Apache Hadoop based web crawler, Apache Nutch. Nutch is awesome and we'll be using it as our crawler from the perspective of this post.
The setup information for Apache Nutch is available on the project Wiki. Setting up Nutch is outside the scope of this post, however there are a few things that are non-obvious. If you are going to use Nutch for content retrieval rather than to build a search engine you want to make sure you modify the conf/nutch-site.xml to make sure the file.content.limit, http.content.limit, db.max.outlinks.per.page and db.max.inlinks properties are all set to -1. This makes sure neither the page content or the fetch lists are being truncated. In addition make sure your crawl conf/*-urlfilter.txt files comment out the regular expression that skips URLs that contains queries. i.e. It should look like this: # skip URLs containing certain characters as probable queries, etc. #-.*[?*!@=].*This isn't actually a terribly big crawl and you can actually run this in single process local mode if you wanted. Once Nutch is set up, you create a seeds directory and copy in a seeds.txt file comprised of the seed URLs I provided to the left. |
Then you launch your crawl with the following command: bin/nutch crawl seeds -dir content -depth 2
Nutch will then run the crawl from the seeds specified in the "seeds" directory and store the results of the crawl in a "content" directory. For each crawl depth, the content directory will contain a unique segments directory corresponding to that depth. We can ignore the segments directory for the first crawl depth and just focus on the other one which contains all the data. Nutch by default runs quite a few jobs that build indexes for each segment and then merge them. All of this is unnecessary if you're using Nutch for content retrieval. I've embedded a talk I did on Nutch at the beginning of the year where I step through the jobs that are run for a crawl, how to run Nutch in distributed mode & how to orchestrate Nutch to simply run the jobs you need for content retrieval. I also spend a bit of time talking about processing Nutch Content Objects in Nutch Sequence files. Extracting the data out of the sequence files in the segments directory and doing the analytics will be covered in the next blog post in this series. |
0 comments:
Post a Comment