Natural Language Processing and Machine Learning, some pointers

October 14th, 2012  |  Published in data

I’ve been doing a lot of natural-language machine-learning work recently, both for clients and in side projects. Mark Needham asked me on Twitter for some pointers to good introductory material. Here’s what I wrote for him:

Nearly all text processing starts by transforming text into vectors:
en.wikipedia.org/wiki/Vector_space_model

These vectors are often transformed with a weighting such as TF-IDF to normalise the data and control for outliers (words that are too frequent or too rare confuse the algorithms):
en.wikipedia.org/wiki/Tf%E2%80%93idf
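
As a minimal sketch of both steps, here’s how you might build bag-of-words vectors and reweight them with TF-IDF in Gensim (one of the tools listed below); the two toy documents are placeholders:

from gensim import corpora, models

docs = [doc.lower().split() for doc in
        ["the cat sat on the mat", "the cat ate the canary"]]

dictionary = corpora.Dictionary(docs)              # maps each word to an integer id
bows = [dictionary.doc2bow(doc) for doc in docs]   # sparse term-count vectors
tfidf = models.TfidfModel(bows)                    # learn TF-IDF weights from the corpus

print(tfidf[bows[0]])  # the first document as a TF-IDF-weighted vector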

Collocation detection is a technique for finding words that occur together more often than they occur apart (e.g. “wishy-washy” in English). I use it to group words into n-gram tokens, because many NLP techniques treat each word as if it were independent of all the others in a document, ignoring order:
matpalm.com/blog/2011/10/22/collocations_1/
matpalm.com/blog/2011/11/05/collocations_2/
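
Here’s a hypothetical sketch of the same idea using NLTK (not one of my listed tools), scoring word pairs by pointwise mutual information; the corpus file name is an assumption:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Assumes corpus.txt is a plain-text file of your documents.
tokens = open('corpus.txt').read().lower().split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)  # ignore pairs seen fewer than 5 times

# The top-scoring pairs occur together far more often than chance;
# these are candidates to join into single n-gram tokens.
print(finder.nbest(measures.pmi, 20))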

When you’ve got a lot of text and you don’t know what the patterns in it are, you can run “unsupervised” clustering using Latent Dirichlet allocation:
www.cs.princeton.edu/~blei/papers/Blei2012.pdf
www.youtube.com/watch?v=5mkJcxTK1sQ
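
A minimal sketch of unsupervised topic discovery with Gensim’s LDA implementation; the file name and parameter values are placeholder choices:

from gensim import corpora, models

# Assumes one whitespace-tokenised document per line in corpus.txt.
docs = [line.lower().split() for line in open('corpus.txt')]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit 20 latent topics; each topic is a weighted list of words.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20, passes=5)
for topic in lda.show_topics(num_topics=5, num_words=8):
    print(topic)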

Or if you already know how your data divides into topics – in other words you have “labeled data” – then you can use “supervised” techniques, such as training a classifier to predict the labels of new, similar data. I can’t find a really good page on this; I picked up a lot over IM from my friend Ben, who is writing a book coming out next year: blog.someben.com/2012/07/sequential-learning-book/
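
To make the supervised case concrete, here’s a hypothetical sketch with scikit-learn (again, not one of my listed tools): train a classifier on a toy set of labeled documents, then predict labels for new text.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data: each document comes with a known topic label.
texts = ["interest rates rose again today",
         "the striker scored twice",
         "markets fell on the earnings news",
         "a late goal won the match"]
labels = ["finance", "sport", "finance", "sport"]

# Vectorise with TF-IDF, then fit a linear classifier on the labels.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["shares in the football club rose"]))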

Here are the tools I’ve mostly been using:

  • Vowpal Wabbit (classification and LDA, poor documentation, high-performance C++): https://github.com/JohnLangford/vowpal_wabbit/wiki
  • Gensim (LDA, vector similarity, text processing, python): radimrehurek.com/gensim/index.html
  • Mallet (classification and LDA, java): mallet.cs.umass.edu/
  • Lingpipe (text analysis, clustering, classification, linguistics, java, commercial open-source): alias-i.com/lingpipe/demos/tutorial/read-me.html
  • Mahout (Hadoop, classification, clustering, LDA, collaborative filtering, java): mahout.apache.org/
  • Langdetect (language detection, java): code.google.com/p/language-detection/

Some blogs I like:

  • matpalm.com/blog/
  • blog.echen.me/
  • thedatachef.blogspot.co.uk/
  • www.machinedlearnings.com

MetaOptimize Q+A is the Stack Overflow of ML: metaoptimize.com/qa

The Mahout In Action book is quite good and practical: manning.com/owen/

Extracting a social graph from Wikipedia people pages

April 5th, 2012  |  Published in data, graphs  |  1 Comment

I’ve been in San Francisco this week giving a workshop called Prototyping Location Apps With Big Data at the Where Conference. You can read the full slides for the workshop on Slideshare and get the full code and sample data on Github.

The key message of the workshop is that there are plenty of open datasets available on the web which can be used to prototype new applications by acting as proxies for the kind of data you expect to have later in the product lifecycle. You just have to do a bit of lateral thinking and some data processing. For example, wouldn’t it be great if you were working on a social site and could test your designs, your algorithms and your scalability using a realistic social graph of 300,000 people with over 2 million connections between them? It’d be much better than entering a test dataset by hand using just a few examples from people you know or your family, and it’d make for a much better demo if you took it to an investor or a product board. No more lorem ipsum!

We can generate such a dataset using Wikipedia. Consider the Wikipedia page for Bill Clinton. In just the first three paragraphs there are mentions of people highly related to the former US President: Hillary Clinton, George H.W. Bush and Franklin D. Roosevelt. If we were to consider these intra-wiki links as connections in the social graph (“Bill Clinton knows Hillary Clinton”) and perform this extraction over all of Wikipedia then we’d have a pretty convincing graph. It would have lots of connections, a good mix of communities (politicians, historical figures, television personalities) and a nice mix of well-connected and less-connected people.

Raw Wikipedia text is openly available for download, but parsing it is difficult and doesn’t give us the kind of structured and typed data that we’re looking for. Luckily the DBpedia project has already tackled this problem. They have extracted page types, images, geocoded coordinates, intra-wiki links and many other things, and made them all downloadable. For this hack we’ll need the “Ontology Infobox Types” and the “Wikipedia Pagelinks” datasets.

The types file has one or more lines for each Wikipedia page. For example, the page for Autism is listed as a Thing and a Disease. We’ll filter this file down to just the Person pages. Then we’ll take the links file and filter it down to just the links from one Person to another (using the filtered types file we just made). We can do all of this with 18 lines of Apache Pig code and then run it through a Hadoop cluster. You can see sample results in the Github project. If we convert the output to GraphML format with a JRuby script (using the JUNG library) and load it into Gephi to detect the communities and create a force-directed layout, we get a pleasant and interesting social graph with all the kinds of clusters we’d expect:

[Figure: the extracted social graph, clustered and laid out with Gephi]

You can also explore a simplified version of this graph in PDF format for your zooming pleasure.
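
The real filtering is done in Pig on Hadoop (see the Github project for the actual code); purely as an illustration, here’s the same two-pass logic as a single-machine Python sketch, with the DBpedia dump file names as assumptions:

# Pass 1: collect the URIs of all pages typed as Person.
PERSON = '<http://dbpedia.org/ontology/Person>'
people = set()
for line in open('instance_types_en.nt'):      # "Ontology Infobox Types" dump
    parts = line.split()
    if len(parts) > 2 and parts[2] == PERSON:
        people.add(parts[0])

# Pass 2: keep only the intra-wiki links where both ends are people.
with open('person_links.tsv', 'w') as out:
    for line in open('page_links_en.nt'):      # "Wikipedia Pagelinks" dump
        parts = line.split()
        if len(parts) > 2 and parts[0] in people and parts[2] in people:
            out.write(parts[0] + '\t' + parts[2] + '\n')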

On graphs

September 22nd, 2011  |  Published in graphs

I’ve been working on an in-depth post for this blog about graph data and how to analyse it. That post is still unfinished but I’ve been posting pieces of work-in-progress on other sites during the process. Here are some pointers to bring them together:

  • A Flickr set of Gephi graph visualisations that analyse, cluster and visualise datasets from Delicious, Twitter, Dopplr and Wikipedia
  • A zoomable exploration of the topics in the 2012 SXSW panel picker and some notes on how it was made.
  • Visualising the whole world of conference topics with Lanyrd data
  • Slides from an Ignite talk at Strata Summit NYC that uses Nokia geo data to draw a new map of the world: Place graphs are the new social graphs

Algorithmic recruitment with GitHub

February 10th, 2010  |  Published in web  |  20 Comments

In my new job in Berlin I’ve been asked to hire some people to help prototype new, secret projects. Berlin has a superb tech scene but as I’m new in town it’s taking me a little time to get to know everyone. While that’s going on, I wrote some code to help me explore Berlin’s developer community.

When I’m hiring, one of the things I always want to see is evidence of personal projects. Over the last two years, GitHub has become an amazing treasure trove of code, with the best social infrastructure I’ve ever seen on a developer site. GitHub profiles let the user set their location, so I started with a few web searches for Berlin developers. This finds hundreds of interesting people, but how do I prioritise them?

Another thing that I look for when building a good team is someone’s personal network. I’ve always believed strongly in spending lots of time at conferences meeting passionate people who are smarter than me. A good developer can make themselves even more productive by knowing who to email, IM or DM to answer a question when they’re stuck.

A recent article by Stowe Boyd on centrality and influence in social networks reminded me of some of the network analysis we use behind the scenes calculating recommendations for the Dopplr Social Atlas. So I wrote some code to query the GitHub API and analyse the social graph of the Berlin subset of their users.

The JRuby code uses Yahoo BOSS to do the web search. After querying the GitHub API for each user’s followers, it builds an in-memory graph using the Java Universal Network/Graph Framework. Then it ranks each user node in the graph using the betweenness centrality algorithm. You can see the simple source code on my GitHub.
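
For illustration, here’s the same ranking idea sketched in Python with NetworkX rather than JRuby and JUNG; the follower data is a hypothetical stand-in for what the GitHub API returns:

import networkx as nx

# Hypothetical stand-in for data fetched from the GitHub API:
# each user maps to the list of users they follow.
follows = {
    'alice': ['bob', 'carol'],
    'bob':   ['carol', 'dave'],
    'carol': ['alice'],
    'dave':  ['carol'],
}

g = nx.DiGraph()
for user, others in follows.items():
    for other in others:
        g.add_edge(user, other)

# Rank users by how often they sit on shortest paths between others.
ranking = nx.betweenness_centrality(g)
for user, score in sorted(ranking.items(), key=lambda kv: -kv[1])[:5]:
    print(user, round(score, 3))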

To sanity-check the results I ran it for a couple of cities I already know well: London and San Francisco. Here are the top 5 for each, which seem quite plausible to me:

San Francisco

  1. Chris Wanstrath, GitHub
  2. Tatsuhiko Miyagawa, Six Apart
  3. Leah Culver, Six Apart
  4. Square Inc
  5. Aman Gupta, ruby eventmachine maintainer

London

  1. James Darling
  2. London Ruby User Group
  3. Mark Norman Francis
  4. Dan Webb (recently moved to Twitter in SF)
  5. Carlos Villela, Thoughtworks

My choice of metric biases these lists towards connectedness and influence — it can’t measure ability. It’s only measuring GitHub users, and they are biased towards Ruby, Perl and Javascript. But seeing names there that I trust gives me confidence that it’ll help me find interesting people in Berlin.

Hopefully some of those people are reading this blog post right now. Others outside Berlin might be interested to know that Nokia does a superb job of relocating people, with everything taken care of by shipping companies and local agents. If you love the web, Javascript, mobile, user experience, social networks, location, enormous datasets and currywurst, you should get in touch.

Scripting “Find My iPhone” from Ruby

July 23rd, 2009  |  Published in Uncategorized

When iPhone OS 3.0 came out with new MobileMe features allowing you to remotely discover the location of your iPhone and send it a message and an alarm, I hoped that there’d be an API. While there’s no official way to access it, the enterprising Tyler Hall and Sam Pullara dug out their HTTP sniffers and figured out how the javascript on me.com talks to its backend service.

Their code is written in PHP and Java respectively, two languages I’m not particularly comfortable in. Translating from their source code, I’ve produced a ruby version and packaged it as a very simple gem. It lacks real documentation or elegant error handling, but it’s easy to figure out.

Use it like this to locate your phone:

$ sudo gem install mattb-findmyiphone --source gems.github.com

>> require 'rubygems' ; require 'findmyiphone'
>> i = FindMyIphone.new(username,password)
>> i.locateMe
=> {"status"=>1, "latitude"=>51.546544, "time"=>"8:06 AM", "date"=>"July 23, 2009", "accuracy"=>162.957953, "isLocationAvailable"=>true, "isRecent"=>true, "isLocateFinished"=>true, "statusString"=>"locate status available", "isAccurate"=>false, "isOldLocationResult"=>true, "longitude"=>-0.05744}

And to send a message:

>> i.sendMessage("Unimportant message")
=> {"status"=>1, "time"=>"8:17 AM", "date"=>"July 23, 2009", "unacknowledgedMessagePending"=>true, "statusString"=>"message sent"}

Finally, if you look in the examples directory you’ll find a short script that uses the location data to update Fire Eagle via its API. Fill in the example YAML files with the appropriate credentials and it’ll do the rest.

Of course the code’s all open source and contributions via Github are very welcome.

iPhone coding for web developers

March 28th, 2009  |  Published in iphone, talks, Uncategorized  |  1 Comment

This week the London Flash Platform User Group ran an evening of iPhone developer talks. My talk, “iPhone Coding For Web Developers” seemed to go down well. As a web developer, I’ve found the iPhone development environment exciting in its power and possibilities, but also perplexing in its lack of basic facilities that I’d take for granted in a modern dynamic language.

This talk (based on a previous blog post here) goes into some detail about how I use HTTP, JSON and other web-oriented tech in my iPhone work.


Switching from scripting languages to Objective C and iPhone: useful libraries

January 26th, 2009  |  Published in iphone  |  8 Comments

For the last few months I’ve been spending much of my spare hacking time learning to code iPhone applications. I’ve found Objective C to be a surprisingly pleasant language, and Cocoa is one of the best frameworks I’ve ever worked with. I’ve reached a point where I feel I can go fairly quickly from simple app ideas to sketching in real code.

I’m a web developer at heart, and a scripting language user by preference. Coding for the iPhone doesn’t feel as fluid in text handling or HTTP access as the environments I’m used to. Fortunately I’ve been able to find some fantastic open-source libraries and wrappers that make up the difference. Here are my favourites so far:

GTMHTTPFetcher from Google Toolbox for Mac

The iPhone’s native HTTP handling is capable, but low-level and verbose. Rather than handling the many callbacks, NSData objects and options myself, I prefer this wrapper. It has a ton of convenience methods allowing you to specify POST data and basic auth, follow redirects automatically, keep cookies over a session, set headers, and register two simple callbacks for success and error handling. In many ways it’s comparable to jQuery’s $.ajax() one-hit function.

JSON framework

Once you’ve got some data over HTTP from a web API, chances are it’s available in JSON format. This simple framework extends NSString with a JSONValue method to convert any legal JSON string into nested NSDictionaries and NSArrays. To go the other way, dictionaries and arrays gain a JSONRepresentation method.

libxml2 wrappers for XPath over XML and HTML

Perhaps your web API returns XML, or perhaps you’re getting your data by screenscraping HTML. Did you know that the iPhone ships with libxml2, which has high-performance XML and HTML parsing and a high-quality XPath implementation? Don’t struggle with Cocoa’s NSXMLParser or get bogged down in the complex libxml2 docs; use these two simple wrapper functions, PerformXMLXPathQuery and PerformHTMLXPathQuery, to pull out the structured data you need in a Cocoa-friendly representation.

RegexKitLite for regular expressions

Where would scripting be without regular expressions? Luckily they’re available on the iPhone, but buried deep within the ICU libraries. RegexKitLite extends NSString with core regex string handling, including ‘split’ (known as componentsSeparatedByRegex) and a search-and-replace operator (stringByReplacingOccurrencesOfRegex and replaceOccurrencesOfRegex).

FMDB, an Objective C wrapper for sqlite

Every scripting language has convenient database driver wrappers. I was very happy to find that sqlite is available on the iPhone, but unfortunately its interface is all bare-metal C. The simplest wrapper I’ve found so far is FMDB. Apparently somewhat inspired by JDBC, it gives you connection and resultset objects, along with one-liner convenience functions allowing code like [db intForQuery:@"SELECT COUNT(*) FROM things"].

And there’s more…

I’ve used all of the above in a real project, but I’ve got yet more things to explore on my todo list. These include Matt Gemmell’s web-style templating framework MGTemplateEngine, ActorKit for Erlang-style messaging and thread management, and the LLVM/Clang Static Analyzer for automatic bug detection. What else do you use?

Google map of London with Flickr shape data overlaid

November 16th, 2008  |  Published in javascript


Flickr place info now includes shape data for many places. See the Flickr code blog for more.

We’ve correlated most of Dopplr’s places with Yahoo WOE IDs using Flickr’s reverse geocoder, so we can use this data too. As an experiment, I wrote some clientside code to overlay this shape data onto the maps we use on Dopplr. Help yourself to the code if you want it: gist.github.com/25502

Conference season 2008

February 7th, 2008  |  Published in Uncategorized  |  1 Comment


The March 2008 US conference season is nearly upon us. I’m just on my way back from representing Dopplr at Social Graph Foo Camp (find out more by listening to the Citizen Garden Podcast I participated in after the camp), but I’ll be back here again in three weeks.


Last call for XTech

January 25th, 2008  |  Published in Uncategorized

It’s that time of year again – today is your last chance to put in a proposal for XTech 2008 in Dublin. You can read all about it in the Call for Participation. This year, along with the traditional core Web and XML technologies of XTech, we’re focusing on “The Web on the Move” – the emerging portability of data, applications and identity on the internet.

I’m writing my proposal today – I’m planning on pulling the very loose ramble I presented at Barcamp London on messaging architectures into a proper talk. For 2008 I’m very excited about Erlang, XMPP, message brokers such as ActiveMQ and clientside messaging with Comet. The future’s asynchronous and highly concurrent.

