June 18, 2011

Houston Code Camp: Converting the Internet into a Single Database


Houston's first-ever Code Camp will be upon us in August. We're pretty excited about it at 80legs, since our CEO has been calling for a stronger hacker culture in Houston. Hopefully this is the first of many quality hacker- and developer-oriented events in Houston.

We've submitted a session idea entitled "Converting the Internet into a Single Database: Technologies Used & Lessons Learned" and thought it would be a good idea to provide some more details here on what this session will be about.

The Internet as a Database: What Does That Mean?

Consider what happens when you run a Google search for, say, "houston restaurants". You, as an individual, are trying to find a single data point, most likely advice on where to eat in the next few hours. Google is very good at delivering an answer from the Internet to individuals, but it's not good at delivering answers to commercial organizations or at handling more complex queries.

Let's say what you really want is "all Houston restaurants that may need menu consultation" (e.g., if you are a kitchen consultant). You might want to run a query like "Find all Houston businesses that are restaurants where overall rating is < 3.0 out of 5 stars and reviews contain complaints about menu items". This is a much more complicated query, but the data is out there. We just need a way of structuring and querying it.
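To make the restaurant query concrete, here is a minimal Python sketch of what it could look like once the data has been structured. Everything in it is an assumption made for illustration: the `Business` record, its field names, the sample businesses, and especially the keyword check standing in for "reviews contain complaints about menu items". It is not how the 80legs platform expresses or runs queries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Business:
    # Hypothetical structured record; field names are assumptions.
    name: str
    city: str
    category: str
    rating: float  # overall rating out of 5 stars
    reviews: List[str] = field(default_factory=list)


# Naive keyword lists standing in for real complaint detection (assumption).
MENU_TERMS = ("menu", "dish", "entree", "portion")
NEGATIVE_TERMS = ("bland", "overpriced", "limited", "disappointing", "cold")


def mentions_menu_complaint(review: str) -> bool:
    """True if the review mentions a menu term and a negative term."""
    text = review.lower()
    has_menu_term = any(term in text for term in MENU_TERMS)
    has_negative_term = any(term in text for term in NEGATIVE_TERMS)
    return has_menu_term and has_negative_term


def needs_menu_consultation(business: Business) -> bool:
    """The query from the text, written as a predicate over one record."""
    return (
        business.city == "Houston"
        and business.category == "restaurant"
        and business.rating < 3.0
        and any(mentions_menu_complaint(r) for r in business.reviews)
    )


# Made-up sample data, for illustration only.
businesses = [
    Business("Example Grill", "Houston", "restaurant", 2.4,
             ["The menu is limited and the entrees were bland."]),
    Business("Sample Bistro", "Houston", "restaurant", 4.5,
             ["Great menu, loved every dish."]),
    Business("Placeholder Cafe", "Austin", "restaurant", 2.0,
             ["Overpriced menu."]),
    Business("Demo Hardware", "Houston", "hardware store", 2.1,
             ["Limited stock."]),
]

matches = [b.name for b in businesses if needs_menu_consultation(b)]
print(matches)  # ['Example Grill']
```

The predicate itself is short. The hard part is getting billions of pages into records like `Business` in the first place, which is what the rest of this post is about.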

Enter the Platform

Let's break down how we would build a platform that could serve our restaurant query and many more like it. Here's what we'd need:

The ability to collect all relevant data on the web (quickly and at-scale)
A standard format for structuring data from different sources
A storage system for all the data (which will probably be several billion records)
A query language for retrieving data from storage
A processing layer for running the query
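As a rough illustration of the "standard format" item in the list above, the sketch below maps raw records from two imaginary sources into one common shape. The source names, raw field names, and rating scales are all hypothetical, and the in-memory list is only a stand-in for a real storage system. It shows the idea, not our implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Record:
    # Hypothetical common format shared by every source.
    source: str
    name: str
    city: str
    rating: float  # normalized to a 0-5 star scale


def from_site_a(raw: Dict[str, Any]) -> Record:
    """Imaginary source A: nested location, rating already in stars."""
    return Record(
        source="site_a",
        name=raw["title"].strip(),
        city=raw["location"]["city"],
        rating=float(raw["stars"]),
    )


def from_site_b(raw: Dict[str, Any]) -> Record:
    """Imaginary source B: flat fields, score out of 100."""
    return Record(
        source="site_b",
        name=raw["business_name"].strip(),
        city=raw["city"],
        rating=raw["score"] / 100 * 5,  # rescale 0-100 to 0-5
    )


# Made-up raw payloads, as a crawler might hand them over.
raw_a = {"title": " Example Grill ", "location": {"city": "Houston"}, "stars": "2.5"}
raw_b = {"business_name": "Sample Bistro", "city": "Houston", "score": 50}

# In-memory stand-in for the storage layer.
store: List[Record] = [from_site_a(raw_a), from_site_b(raw_b)]

# Once everything shares one format, a query no longer cares about the source.
low_rated = [r.name for r in store if r.city == "Houston" and r.rating < 3.0]
print(low_rated)  # ['Example Grill', 'Sample Bistro']
```

Keeping one small adapter per source means adding a new site touches a single function, while storage and querying stay unchanged.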

If you look at these steps, you can start to conceptualize how a technology stack for "the Internet as a database" might look. During our talk, we'll cover how we addressed and implemented each part of our stack, with a focus on the following questions:

Should we choose to build this component in-house or use an open-source tool?
How did we evaluate open-source tools for our use-case?
How did we keep development of the platform on a rapid iteration cycle?
What did we learn about our technology and business during the development of each component?

We hope folks will come away with a better understanding of how to break down large technology goals into smaller, more manageable components as well as how to evaluate different technologies as they relate to the goal (business or otherwise) at hand.

Hopefully this provides more insight into our proposed talk! If you have any feedback, please let us know :) If you'd like
