Why it took me five months to write @whensmytube

6 March 2012

(or, open data is not always 100% open)

Five months ago I wrote a Twitter bot called @whensmybus. It took me a fortnight to code up and test the first version, which was pretty simple to begin with – it took a single bus number and a geotag from a Tweet, and worked out when the next bus would arrive for you. And then people started using it, and really liking it. And because they liked it, they found ways of making it better (curse those users!). So I had to add features like understanding English placenames, being able to specify a destination, handling direct messages and multiple bus routes, and tolerating the many, many ways it’s been possible to break or confuse it, and this took up a lot of my time. And it was fun, to be honest.

At the same time, those bloody users also asked me when I was going to do a version for the Tube. But I was too busy adding features to @whensmybus, and that’s one reason why it took me five months to write its counterpart, @whensmytube, which I launched last week. But there’s a stack of other reasons why it took so long. It didn’t seem too difficult to begin with. Just like with buses, Transport for London have made their Tube departure data openly available (via a system called TrackerNet), as well as the locations of all their stations. It would be pretty simple to do the same for Tube data as for bus data, right?

Wrong.

So, for anyone interested in open data, software development, or just with a lay interest in why software doesn’t get new features quickly, here’s a run-down of why:

1. The Tube data isn’t complete

TfL helpfully provide details of all their Tube stations in a format called KML, from which it’s reasonably easy to extract the names and locations of every station. Well, they say “all”. That’s a bit of a lie. The file hasn’t been updated in a while; according to it, the East London Line is still part of the Tube network, and Heathrow Terminal 5 and Wood Lane stations don’t exist; neither do the stations on the new Woolwich Arsenal and Stratford International branches of the DLR. Other developers have griped about this, but no update has been forthcoming. So it took time to do the ballache task of manually adding the data that hadn’t been included in the first place.

To make things more annoying, certain stations are left out of the TrackerNet system entirely. If you want live updates from Chesham, Preston Road, or anywhere between Latimer Road and Goldhawk Road on the Hammersmith & City, you’re plain out of luck. Sorry, this is TfL’s fault and not mine. None of this is documented anywhere, either; the stations are simply missing from the system documentation.

2. The Tube data isn’t built for passengers

To be fair to TfL, they do say what the TrackerNet service is meant for – it is built on their internal systems, it was designed for non-critical monitoring of the service by their staff, and there is a disclaimer saying so. The public version is useful, but unlike its bus counterpart there is a lot of data in there which is not for public consumption. If anything, it’s too useful, as it contains irrelevant information such as:

  • Trains that are out of service, or “Specials”
  • Trains that are parked in sidings
  • Trains on National Rail services, such as Chiltern Railways, that run over Tube lines
  • Trains that are scheduled to go to a depot after their journey
  • Trains that don’t yet know their final destination, and are just labelled “Unknown”

And none of these special cases are documented in the system. So I had to spend a lot of time working out these odd edge cases and filtering out the chaff. And the code is by no means complete – I have to wait until irrelevant information shows up before I can filter it, because TfL don’t provide a list of possible values anywhere. This is annoying – so much so that I have even taken the step of submitting a Freedom of Information request to find out all the possible destinations a train can be given on the system, but I’m still waiting on it.
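For the curious, the filtering ends up looking something like this sketch. The field names and the blacklist values here are illustrative guesses – precisely because TfL publish no definitive list, the real filter has to grow as new junk values turn up:

```python
# Minimal sketch of filtering TrackerNet train data down to genuine
# passenger services. Field names and blacklist values are illustrative.
IGNORED_DESTINATIONS = {"Unknown", "Special", "Out Of Service", "Depot"}

def is_relevant(train):
    """Return True if a train record looks like a genuine passenger service."""
    destination = train.get("destination", "").strip()
    if destination in IGNORED_DESTINATIONS:
        return False
    if train.get("in_siding"):                        # parked in sidings
        return False
    if train.get("operator") == "Chiltern Railways":  # National Rail over Tube lines
        return False
    return True

trains = [
    {"destination": "Epping"},
    {"destination": "Unknown"},
    {"destination": "Amersham", "operator": "Chiltern Railways"},
]
print([t["destination"] for t in trains if is_relevant(t)])  # ['Epping']
```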

The documentation also falls down on reusability. For example, each station has a name (e.g. “King’s Cross St. Pancras”) and a code (e.g. “KXX”). Because spellings can vary, it’s easier to use the three-letter code when doing lookups, for consistency. But the list of codes, and the station names they correspond to, were locked in a bunch of tables in a write-protected PDF, so it was impossible for me to extract a code-to-station-name lookup table. In the end I was glad to find that someone had already done the hard work for me, saving me from having to type them all out manually.
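To give a flavour, the recovered lookup just ends up as a plain table in code, usable in either direction. “KXX” is the example from above; the other codes here are made up for illustration:

```python
# Illustrative fragment of a code-to-station-name lookup table; only "KXX"
# comes from the text above, the other codes are assumptions.
STATION_CODES = {
    "KXX": "King's Cross St. Pancras",
    "LIV": "Liverpool Street",
    "OXC": "Oxford Circus",
}
# Build the reverse lookup once, so either form can be used as the key
NAMES_TO_CODES = {name: code for code, name in STATION_CODES.items()}

print(STATION_CODES["KXX"])                # King's Cross St. Pancras
print(NAMES_TO_CODES["Oxford Circus"])     # OXC
```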

On top of that, the system uses terminology more suited to insiders. For example, most stations have platforms labelled Eastbound/Westbound or Northbound/Southbound, which is fine. But the Circle Line and the Central Line’s Hainault Loop have designations “Inner Rail” and “Outer Rail”. And then to make my life even worse, some edge cases like White City and Edgware Road have platforms that take trains in both directions. This is confusing as hell, and so I had to spend a bit of time dealing with these cases and converting them to more familiar terms, or degrading gracefully.
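A sketch of that translation step is below. The Inner/Outer mappings are purely illustrative – which compass direction each one corresponds to depends on the line and the loop – and unrecognised designations are passed through untouched, the “degrading gracefully” case:

```python
# Sketch: translate insider platform designations into passenger-friendly
# directions. The mapping values are assumptions for illustration.
PLATFORM_TRANSLATIONS = {
    "Inner Rail": "Eastbound",   # illustrative only; varies by line and loop
    "Outer Rail": "Westbound",
}

def friendly_direction(platform_name):
    for insider, friendly in PLATFORM_TRANSLATIONS.items():
        if insider in platform_name:
            return platform_name.replace(insider, friendly)
    return platform_name  # already Eastbound/Northbound etc.; degrade gracefully

print(friendly_direction("Inner Rail Platform 1"))   # Eastbound Platform 1
print(friendly_direction("Northbound Platform 2"))   # Northbound Platform 2
```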

This is a pain, but worth it. As far as I’m aware, no other Tube live data app (including TfL’s website, or the otherwise excellent iPhone app by Malcolm Barclay, which I regard as the gold standard of useful transport apps) takes this amount of care in cleaning up the output presented to the user.

3. Humans are marvellous, ambiguous, inconsistent creatures

And then on top of that there are the usual complications of ambiguity. There are 40,000 bus stops in London, and typically you search for one by the surrounding area or the road it’s on, because you don’t know its exact name; the app can look up roughly where you are and give an approximate answer. But there are fewer than 300 Tube stations, so you’re more likely to know the name of the exact one you want. There are still variations in spelling and usage, though. Typically, a user is more likely to ask for “Kings Cross” than the full name “King’s Cross St. Pancras” – punctuation and all. This all needs dealing with gracefully and without fuss.
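One way of coping with variant spellings is to normalise punctuation away and then fall back to fuzzy matching. This is only a sketch of the idea, not the bot’s actual matching rules, and the station list is a stand-in:

```python
# Sketch: match a user's loose station name against canonical names by
# normalising punctuation, then fuzzy-matching. Station list is illustrative.
import difflib
import re

STATIONS = ["King's Cross St. Pancras", "Kingsbury", "Liverpool Street"]

def normalise(name):
    # lower-case and strip punctuation so "Kings Cross" ~ "King's Cross"
    return re.sub(r"[^a-z0-9 ]", "", name.lower())

def match_station(query):
    lookup = {normalise(s): s for s in STATIONS}
    q = normalise(query)
    # try a prefix match first, then the closest fuzzy match
    for norm, original in lookup.items():
        if norm.startswith(q):
            return original
    close = difflib.get_close_matches(q, lookup.keys(), n=1, cutoff=0.6)
    return lookup[close[0]] if close else None

print(match_station("Kings Cross"))  # King's Cross St. Pancras
```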

4. Despite all my work, it’s still in beta

There’s plenty @whensmytube doesn’t yet do. It only accepts requests “from” a station and doesn’t yet accept filtering by destination “to”. This is because, unlike bus routes, most Tube lines are not linear (and some even have loops). Calculating this is tricky, and TfL don’t provide an open network graph of the Tube (i.e. data telling us which station connects to which), and I haven’t yet had the time to write one manually.

5. But I’m still glad I did it

Despite all my problems with wrangling TfL’s data, I’m still pleased with the resulting app. Not least because, hey, it shipped, and that’s to be proud of in its own right. But more because everything I learned from it has kept me keen, and it’s had some pleasant side effects. The refactoring of the code I had to do has made @whensmybus a better product, and everything I learned about dealing with the Tube network meant I was able to code and release a sister product, @whensmyDLR, with only a few days’ extra coding. Not bad.

But, here’s some quick conclusions from wrangling with this beast for the past five months:

  1. Open data is not the same as useful data. If it’s badly annotated, or incomplete, then an open data project is far less useful. Releasing an API to the public is a great thing, but please don’t just leave it at that; make sure the data is as clean as possible, and that updates are made to it when needed.
  2. Open documentation is as important as open data. It’s great having the data, but unless there’s documentation, in an open format, on how that data should be interpreted and parsed, it’s a struggle. All the features should be documented and all possible data values provided.
  3. Make your code as modular as possible. If you’re having to deal with dirty or incomplete datasets, or undocumented features, break your code up into as modular a form as you can get away with. The string-cleaning function or convenience data structure you wrote once will almost certainly be needed again for something else down the line, and in any case it shouldn’t clutter your core code.
  4. In the end, it’s worth it. Or: ignore all my moaning. Yes, it can be a pain, and annoying, to deal with cleaning up – or even writing your own – data; but in the end, a cleanly-coded, working product you can look on with pride is its own reward.
  5. Thank you, TfL. Despite all my bitching above, I’m still really grateful that TfL have opened their datasets, even if there are flaws in how they’re distributed and documented. Better something than nothing at all – so thank you, and please keep doing more.

When’s My… Anything

27 February 2012

Last year I introduced a service called @whensmybus, a Twitter bot that you could ask for real-time bus times from anywhere in London. It proved to be a bit of a cult hit, and in time I’ve expanded it from a simple “one bus please” service to handle natural language parsing, multiple routes, direct messages and the like.

But people don’t just take buses in London. They also take the Tube. And so it only seems fair to build a sister service to @whensmybus for the subterranean-inclined. So, introducing… @whensmytube. It does the exact same thing – taking advantage of Twitter’s realtime and geolocation capabilities and mashing them up with TfL’s open APIs to give you live Tube departure times for nearly any station on the Underground. Just Tweet:

@whensmytube Central Line

with a GPS-enabled Tweet, or:

@whensmytube Central Line from Liverpool Street

with an ordinary Tweet. More information and a full description of its abilities and how to use it are available here. Please use it! And break it! It’s still in beta, and any feedback would be much appreciated, thank you.

But, hang on. That’s not all! There’s not just the Tube in London. There’s also my beloved Docklands Light Railway. And it would be cruel to leave it out. So have two for the price of one – if you’re a DLR lover, please try @whensmyDLR for size as well:

@whensmyDLR

with a GPS-enabled Tweet, or:

@whensmyDLR from Poplar

with an ordinary Tweet. Like its Tube and bus counterparts, it’s reasonably flexible, so please check out the help page. And please give any feedback you can, thank you!


Why it’s not just about teaching kids to code

10 January 2012

The Guardian have launched a Digital Literacy Campaign, led by an article entitled “Britain’s computer science courses failing to give workers digital skills”:

In higher education, although universities such as Bournemouth are praised by employers for working closely with industry, other universities and colleges have been criticised by businesses for running a significant number of “dead-end” courses in computer science, with poor prospects of employment for those enrolled.

And from my own anecdotal experience, that’s correct. For one reason or another, I’ve been reviewing CVs and interviewing candidates for developer roles at work over the last couple of months, and some of them were awful. They tended to have degrees or other qualifications from mid- and lower-tier universities and colleges, but had trouble telling the difference between PHP and JavaScript code, or were unable to provide even stock answers to well-worn problems such as sorting.

(Feel free to call me out as a snob on this one; I read Computer Science at Cambridge, one of the few universities in this country where the majority of the course is spent not coding.)

Anecdotal though my own experience and many of the quotes in the article are, the Guardian’s campaign is laudable and I back teaching kids to code in schools. But there are two issues I have with the campaign – it’s not just teaching that matters, and not just code that needs to be taught (or learned).

Firstly, “digital literacy” is as broad a term as “literacy” or “numeracy”, and there are a range of different issues at stake. Take this complaint in the above article:

Ian Wright, the chief engineer for vehicle dynamics with the Mercedes AMG Petronas Formula One team, said: “There’s definitely a shortage of the right people. What we’ve found is that somebody spot on in terms of the maths can’t do the software; if they’re spot on in terms of the software, they can’t do the maths.”

versus:

Kim Blake, the events and education co-ordinator for Blitz Games Studios, said: “We do really struggle to recruit in some areas; the problem is often not the number of people applying, which can be quite high, but the quality of their work. We accept that it might take a while to find a really good Android programmer or motion graphics artist, as these are specialist roles which have emerged relatively recently – but this year it took us several months to recruit a front-end web developer. Surely those sorts of skills have been around for nearly a decade now?”

versus:

In a highly critical report last month, school inspectors warned that too many information and communication technology (ICT) teachers had limited knowledge of key skills such as computer programming. In half of all secondary schools, the level many school leavers reach in ICT is so low they would not be able to go on to advanced study, Ofsted said.

Computer Science is not Programming, and Programming is not Web Development, and Web Development is not ICT. What we have is a whole spectrum of different demands and of different roles, all of which have technology in common but often little else. Producing computer models for a Formula One team or a CGI studio is going to demand a PhD-level (or near) grasp of maths or physics, combined with knowledge of highly specialised programming. Developing a front-end for a website still demands a reasonable degree of intelligence, but also a wider knowledge of languages and coding, and a better appreciation of more subjective issues such as usability, browser standards (or the lack of them) and aesthetics. Meanwhile, being adept with ICT doesn’t mean you have to be a genius or an expert in code, but it needs to be more than knowing how to make a PowerPoint presentation: how to use a computer properly and not just by rote, how to be confident in manipulating and understanding data, how to automate tedious tasks, how to creatively solve a problem.

Today technology is integrated into our lives to a quite frankly frightening degree. Should that mean everyone has to learn how to code? No. Should it mean everyone has an understanding of the basics, an appreciation of what computers can and can’t do, and the ability to use that knowledge to solve problems by themselves? Yes. But making everyone code is not the answer, and to me the Guardian is taking a bit of an “if it looks like a nail” approach to the problem of digital illiteracy.

That said, from my experience of the graduate CVs I read, the teaching of coding, as a practice, does need to improve. University courses should be better assessed and monitored and the “sausage factories” closed. Teaching how to code should be integrated into related subjects such as maths and physics wherever possible (and it’s worth noting many places do this well already). It shouldn’t just be coding that is taught, but how to define a problem, to break it down, and solve it. If anything, that’s more important – programming languages and technologies change all the time (e.g. how many Flash developers do you think will be about in five years’ time?) but the problems usually remain the same.

Secondly, there’s a spectrum of challenges, but there’s also a spectrum of solutions. It’s not just schools and universities that need to bear the burden. As I said, coding is a practice. There’s only so much that can be taught; an incredible amount of my knowledge comes from experience. Practical projects and exercises in school or university are essential, but from my experience, none of that can beat having to do it for real. Whether it’s for a living, or in your spare time (coding your own site, or taking part in an Open Source project), the moment your code is being used in the real world and real people are bitching about it or praising it, you get a better appreciation of what the task involves.

So it’s not just universities and schools that need to improve their schooling if we want to produce better coders. Employers should take a more open-minded approach to training staff to code – those that are keen and capable – even if it’s not part of their core competence. Technology providers should make it easier to code on their computers and operating systems out-of-the-box. Geeks need to be more open-minded and accommodating to interested beginners, and to build more approachable tools like Codecademy. Culturally, we need to treat coding less like some dark art or the preserve of a select few.

On that last point, the Guardian is to be applauded for barrier-breaking, for making the topic a little less mysterious and for engaging with it in a way I’ve seen precious little of from any other media outlet. And the page on how to teach code is a great start – it should really be called how to learn code, because it’s a collection of really useful resources. For what it’s worth, I wrote a blog post nearly three years ago on things to get started on – though if I wrote it today I would probably drop the tip on regular expressions (what was I thinking?).

If I had one last thing to add, it’s that all of the Guardian’s campaign, and the support from Government, is framed around coding for work. Which is important – we are in the economic doldrums and the UK cannot afford to fall behind other nations. But, at the same time, the first code a beginner writes is going to be crap, and not very useful. Even when they get to a moderately competent level, it won’t be very useful beyond the unique task it was built for. Making really good code that is reusable and resilient is bloody hard work, and it would be off-putting to make the beginner judge themselves against that standard.

We need to talk a lot more about why we code as well as how we code. I don’t code for coding’s sake, or just because I can make a living out of it. I code because it’s fun solving problems, it’s fun making broken things work, it’s fun creating new things. Taking the fun out of it, making it merely a “transferable skill” for economic advantage, will suck the joy out of it just like management-speak sucks the joy out of writing. It doesn’t have to be like that. Emphasise the fun, emphasise the joy of making the infernal machine do something you didn’t think it was possible to do, encourage the “Isn’t that cool?” or “Doesn’t that make life easier?”. Get the fun bit right first, and the useful bit will follow right after.


@whensmybus gets a whole lot better

13 October 2011

Wow. It’s been nine days since @whensmybus was released and the feedback has by and large been positive. It’s not all been plain sailing – the odd bug or two made it past my initial testing, and a database update I tried inadvertently corrupted it all. My thanks go to @LicenceToGil, @randallmurrow and @christiane who were all unlucky enough to manage to break it. As a result, I’ve ironed out some of the bugs, and even put in some unit testing to make sure new deployments don’t explode. I now feel this is A Proper Software Project and not a plaything.

Bugfixes are all very well, but… by far the most requested feature was to allow people to get bus times without needing a GPS fix, to allow use on Twitter via the web, a desktop app or a not-so-smartphone. And although using GPS is easier, and cool and proof-of-concepty, it’s plain to see that making access to the app as wide as possible is what makes it really useful. So, from now on you can check the time of a London bus by specifying the location name in the Tweet, such as:

@whensmybus 55 from Clerkenwell

This will try and find the nearest bus stop in Clerkenwell for your bus – in this case, the stops on Clerkenwell Road, which are probably what you’d want. The more precise the location given, the better; place names are OK, street names are better. It works great on postcodes and TfL’s SMS bus stop codes as well.
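Under the hood, once the place name has been geocoded, picking a stop is essentially a nearest-neighbour search. Here is a rough sketch, using OS-style easting/northing coordinates (metres); the stop names and coordinates are made up for illustration:

```python
# Sketch: nearest-stop lookup by squared distance in easting/northing space.
# Stop data below is illustrative, not TfL's actual dataset.
def nearest_stop(easting, northing, stops):
    """stops: list of (name, easting, northing); return the closest stop."""
    return min(stops, key=lambda s: (s[1] - easting) ** 2 + (s[2] - northing) ** 2)

stops = [
    ("Clerkenwell Road / St John Street", 531700, 182100),
    ("Old Street Station", 532700, 182500),
]
print(nearest_stop(531650, 182050, stops)[0])  # Clerkenwell Road / St John Street
```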

The geocoding that makes this possible is thanks to the Yahoo! PlaceFinder API, so my thanks go to them for making the service free for low-volume use. (Aside: you may ask, why not use Google Maps? Because Google Maps’ API terms only allow it to be used to generate a map, not for other geo applications like this.)

So, play away, and let me know what you think. Of course, it may not always work – geocoding is tricky and not foolproof; if it doesn’t, please let me know in the comments here, or just ping me at @qwghlm on Twitter.

More information and FAQs can be found on the about page, and the technically-minded of you might want to check out the code on github.


Introducing @whensmybus

3 October 2011

A few weeks ago TfL put all their information from Countdown, the service they use to provide bus arrival times, online. There’s a TfL Countdown website and you can enter a bus stop name, or ID number, and find out the latest buses from the stop.

But, it’s a bit fiddly. The main website doesn’t automatically redirect you to the mobile version if you are on a phone. If you type in a location (e.g. my local Tube station, “Limehouse Station”), you have to pick a match for the location first (from two identically-named options), then click through a second screen asking you to find a bus stop, and only then do you get the relevant times. On a phone, it just feels fiddly and frustrating, especially when I know my phone has GPS in it and knows my location anyway.

Update/correction There is, as it turns out, the ability to find stops by geolocation on the mobile site; it’s just that on a mobile browser I get the main website and don’t get redirected to the special mobile site, which means I never knew about it (thanks to Ade in the comments for pointing this out).

If only there was a mobile-friendly, geolocation-aware, real-time way of fetching information. Oh wait. There is. It’s called Twitter. Twitter allows geolocation on Tweets (if you opt in) and has an API to fetch and send messages, so we have a system already in place for our needs.

I owe a big debt of gratitude to Adrian Short, who wrote a Ruby script to pull bus times from TfL. TfL have not officially released an API for Countdown just yet, but Adrian found it, and it’s there and accessible – providing the data in JSON format for each stop. That got me thinking – if that data is available and can be parsed quickly and easily, why not make a Twitter bot for it?

With that, @whensmybus was born, and is now in beta. Try it out now if you like. Make sure your Tweet has geolocation turned on (for which you’ll need a GPS-capable smartphone), and send a message like:

@whensmybus 135

Or whatever bus you are looking for. Within 60 seconds, you’ll get a Tweet back with the times of the next buses for that route, in each direction, from the stops closest to your location.

Why each direction? Specifying a direction is fiddly and ambiguous; bus routes wind and twist, and some of them are even circular, so “northbound” and “southbound” are not easy things to parse. The name of your destination can have ambiguous spellings, and I haven’t yet got round to tying it in with a geocoding service like Google Maps. So, at the moment the bot simply tells you buses in both directions from the stops nearest to you. I might change this in future, once I’ve got my head around geolocation services and fuzzy string matching and all that.

It’s still beta (thanks to an early unveiling by Sian ;) ) and I plan in future to add enhancements such as the ability to use it without GPS. I also need to write some proper documentation for it. Update: the source code is now available on github, but do bear in mind the codebase is a bit unstable right now. So, if you are a Londoner, please do use it and tell me what you think, either in the comments below or on Twitter. @ me, don’t @ the bot – it will think it’s a request for a bus service and get confused. :) All suggestions are welcome.

(And now, some tech stuff for the more interested)

The bot is a Python script, run every minute via a cronjob. It’s quite short – 350 lines including comments for the main part. As well as the live data API, the service also uses two databases officially provided for free by TfL’s syndication service: one of all the routes, and one of all the bus stop locations. I converted these from CSV format to sqlite so the bot can make SQL queries on the data. TfL use OS easting and northing locations for the bus stops, so I have to convert from GPS longitude and latitude; I am indebted to Chris Veness and his lat/lng to OS conversion script, which I translated from JavaScript to Python; I am also now much more educated on subtleties like the difference between OSGB36 and WGS84. I use the Tweepy library to receive and send the Tweets, which is really rather excellent and saves a lot of faff. Finally, the whole project would not be possible without the ideals of open data and open source software behind it, so if you’ve written even a single line of free software, then thank you as well.


Some thoughts on quitting Facebook

23 September 2011

I did an odd thing last night, for a social media webponce. I disabled my Facebook account, perhaps for good (at least that’s the intention).

Although this was not solely due to what came out of the latest Facebook f8 conference, it probably was some sort of straw that broke a proverbial camel’s back. At f8, Mark Zuckerberg announced the Facebook Timeline, a way of not just showing what you are up to right now, but your whole life as Facebook saw it, digitised and shown to all. And my reaction was along the lines of:

Fucking hell, I’m going to be spending the rest of my life tagging photographs of myself

I joined Facebook early in 2007 when they let ordinary civilians in, and at first I quite liked it. It was a cute way of tying in and aggregating one’s content, thoughts and photos, and keeping up with people I knew, or used to know. What a nice service. And for free! But over time, the fun faded. Facebook kept on quietly changing privacy settings and made a landgrab for copyright of uploaded photos (later rescinded).

So, I harrumphed, tightened my privacy (a tedious task), removed a lot of personal info and content (photos, imported blog posts) and despite my misgivings, carried on with a stripped-down profile to keep in touch with friends. But as Facebook matured, and my profile accrued information over time, another unwelcome feature came about.

The practice of “Friending” someone just because you met them at a party, or went to school ten years ago with them, or you work with them, seemed a good idea at the time; it’s nice, who doesn’t want more friends? Even if they are just Facebook friends. But these are people I do not see every day, for whatever reason; as sad as that may be, over time those social ties would normally fade. C’est la vie.

But Facebook ossifies these previously ephemeral social ties; they are there forever, reminding us of the past. Whereas before we would be able to let these ties fade passively, now we have to actively “unfriend” people we no longer associate with. That’s not very nice, is it – after all, isn’t the opposite of a friend an enemy? So out of politeness, we accumulate these ossified ties, even after we change jobs, cities, relationships, as a form of digital clutter.

This was as bad as it got, until now. While social ties lingered, other content on Facebook would gradually drop off your timeline and fade away. Indeed, as online archiving extraordinaire Jason Scott observed in an excoriating critique of Facebook:

So asking me about the archiving-ness or containering or long-term prospect of Facebook for anything, the answer is: none. None. Not a whit or a jot or a tiddle. It is like an ever-burning fire of our memories, gleefully growing as we toss endless amounts of information and self and knowledge into it, only to have it added to columns of advertiser-related facts we do not see and do not control and do not understand.

Be careful what you wish for. Now our Facebook profiles will have everything we’ve ever done, dished up by default (and while Facebook’s UI has got easier to customise recently, I bet the default will still be everything). Now it’s impossible to escape your past. Everything you have ever done that has been digitally logged by you, or your friends, can now be potentially dished up as your very own digital This Is Your Life. There is, on Facebook, a photograph of me in my early twenties, passed out after drinking too much tequila on Mexican Independence Day (any excuse, my younger self would say). That’d be on my Timeline by default, no doubt.

But it’s not because of embarrassing photos that I’m off Facebook (far more cringeworthy ones exist, thankfully on analogue prints). It’s the sense that Facebook is very much about the past. The people you have known. The relationships you were in. The things you have done. And these hang around your neck and tie you down.

Whereas what’s really exciting about the web is the things you are going to do. The new fact you’re going to find out idly browsing W
