Having Fun: Python and Elasticsearch, Part 2

November 7th, 2014
11:10 a.m.

  • Programming

In my earlier post on Elasticsearch and Python, we did a huge pile of work: we learned a bit about how to use Elasticsearch, we learned how to use Gmvault to back up all of our Gmail messages with full metadata, we learned how to index the metadata, and we learned how to query the data naïvely. While that’s all well and good, what we really want to do is to index the whole text of each email. That’s what we’re going to do today.

It turns out that nearly all of the steps involved in doing this don’t involve Elasticsearch; they involve parsing emails. So let’s take a quick time-out to talk a little bit about emails.

A Little Bit About Emails

It’s easy to think of emails as simple text documents. And they kind of are, to a point. But there’s a lot of nuance to the exact format, and while Python has libraries that will help us deal with them, we’re going to need to be aware of what’s going on to get useful data out.

To start, let’s take another, more complete look at the raw email source from last time:

$ gzcat /Users/bp/src/vaults/personal/db/2005-09/11814848380322622.eml.gz
X-Gmail-Received: 887d27e7d009160966b15a5d86b579679
Delivered-To: benjamin.pollack@gmail.com
Received: by 10.36.96.7 with SMTP id t7cs86717nzb;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Received: by 10.70.13.4 with SMTP id 4mr150611wxm;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Return-Path: <probablyadefunctaddressbutjustincase@duke.edu>
[...more of the same...]
Message-ID: <4328DDFA.4050903@duke.edu>
Date: Wed, 14 Sep 2005 22:35:38 -0400
From: No Longer a Student <probablyadefunctaddressbutjustincase@duke.edu>
Reply-To: probablyadefunctaddressbutjustincase@duke.edu
User-Agent: Mozilla Thunderbird 1.0.6 (Macintosh/20050716)
MIME-Version: 1.0
To: benjamin.pollack@gmail.com
Subject: Celebrating 21 years
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

    It's my birthday, Blue Devils!  At least it will be in a few days, so I
am opening my apartment, porch, and environs this Friday, Sept. 16th
to all of you for some celebration.  Come dressed up, come drunk, or
whatever...just come.  There will be plenty to drink, and for those of
you that are a wine connessiouers, Cheerwine is the closest you'll get.
Kickoff is at 10:30pm.  Pass-out is at <early Saturday morning>.  If you
have some drink preferences, let me know and we'll see what we can
snag.  In addition to that, let me know if you think you can make it,
even if only for a while, so we can judge the amount of booze that we'll
be stocking.

All you really need to know:
Friday, Sept. 16th
10:30pm-late
alcohol

This is about the simplest form of an email you can have. At the top, we have a bunch of metadata about the email itself. Notably, while these look kind of like key/value pairs, we can see that at least some headers (such as Received) are allowed to appear more than once. That said, we’d like to try to merge this with the existing metadata we’ve got if we can.

There’s also the, you know, actual content of the email. In this particular case, that’s clearly just a blob of plain text, but let’s be honest: we know from experience that some emails have a lot of other things—attachments, HTML, and so on.1 Emails that have formatting or attachments are called multipart messages: each chunk corresponds to a different piece of the email, like an attachment, or a formatted version, or an encryption signature. For a toy tool, we don’t really need to do something special with all of the attachments and whatnot; we just want to grab as much as we can from the email itself. Since, in real life, even multipart emails have a plain text part, it’ll be good enough if we can just grab that.

Let’s make that the goal: we do care about the header values, and we’ll extract any plain text in the email, but the rest can wait for another day.

Parsing Emails in Python

So we know what we want to do. How do we do it in Python?

Well, we’ll need two things: we’ll need to decompress the .eml.gz files, and we’ll need to parse the emails. Thankfully, both pieces are pretty easy.

Python has a gzip module that trivially handles reading compressed data. Basically, wherever you’d otherwise write open(path_name, mode), you instead write gzip.open(path_name, mode). That’s really all there is to that part.

For parsing the emails, Python provides a built-in library, email, which does this tolerably well. For one thing, it allows us to easily grab out all of those header values without figuring out how to parse them. (We’ll see shortly that it also provides us a good way to get the raw text part of an email, but let’s hold that thought for a moment.)

There’s unfortunately one extra little thing: emails are not in a reliable encoding. Sure, they might claim they’re in something consistent, like UTF-7, but you know they aren’t. This is a bit of a problem, because Elasticsearch is going to want to be handed nothing but pure, clean Unicode text.

For the purposes of a toy, it’ll be enough if we just make a good-faith effort to grab useful information out, even if it’s lossy. Since most emails are sent in Latin-1 or ASCII encoding, we can be really, really lazy about this by introducing a utility function that tries to decode strings as Latin-1, and just replaces anything it doesn’t recognize with the Unicode unknown character symbol, �.

def unicodish(s):
    return s.decode('latin-1', errors='replace')
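
A quick sanity check in the REPL, just to see what that buys us (a throwaway example, not part of the loader):

print repr(unicodish('Caf\xe9 menu'))   # u'Caf\xe9 menu', i.e. u'Café menu'
print repr(unicodish('plain ASCII'))    # plain ASCII is a subset of Latin-1, so it passes through untouched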

With that in mind, we can start playing with these modules immediately. In your Python REPL, try something like this:

import email
import gzip

with gzip.open('/path/to/an/email.eml.gz', 'r') as fp:
    message = email.message_from_file(fp)
print '%r' % (message.items(),)

This looks awesome. The call to email.message_from_file() gives us back a Message object, and all we have to do to get all the header values is to call message.items().
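
While we’re in the REPL, a couple of other Message niceties are worth a quick poke: header lookup is case-insensitive and gives back the first value for a header, and get_all() gives back every value of a repeated header like Received. (The values in the comments assume you opened the birthday email from earlier.)

print message['subject']           # 'Celebrating 21 years'; lookup is case-insensitive
print message.get_all('received')  # a list with one entry per Received header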

All that’s left for this part is to merge the email headers with the Gmail metadata, so let’s do that now. While headers can repeat, we don’t actually care: the fields we care about, like From and To, don’t, and if we accidentally end up with only one Received field when we should have fifteen, we don’t care. This is, after all, something we’re hacking together for fun, and I’ve never in my life cared to query the Received field. That suggests a quick way to handle things: build a dict from the email headers, then layer our existing Gmail metadata on top.

So, ultimately, we’re really just changing our original metadata loading code from

with open(path.join(base, name), 'r') as fp:
    meta = json.load(fp)

to

with gzip.open(path.join(base, name.rsplit('.', 1)[0] + '.eml.gz'), 'r') as fp:
    message = email.message_from_file(fp)
meta = {unicodish(k).lower(): unicodish(v) for k, v in message.items()}
with open(path.join(base, name), 'r') as fp:
    meta.update(json.load(fp))

Not bad for a huge feature upgrade.

Note that this prioritizes Gmail metadata over email headers, which we want: if some email has an extra, non-standard Label header, we don’t want it to trample our Gmail labels. We’re also normalizing the header keys, making them all lowercase, so we don’t have to deal with email clients that secretly write from and to instead of From and To.

That’s it for headers. Give it a shot: run your modified loader script, and then query the index using the query tool we wrote last time with its --raw-result flag. We’re not printing anything user-friendly with the new data yet, but it’s already searchable.
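
Concretely, assuming you saved the query tool as something like query_email.py (a made-up name; use whatever you called yours), a search against the freshly indexed mail looks like:

$ python query_email.py 'celebrating' --raw-result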

In fact, you know what? Sure, this is a toy, but it honestly isn’t hard to make it print out at least a little more useful data. Just having From and To would be helpful, so let’s quickly tweak the tool to do that by altering the final click.echo() call:

#!/usr/bin/env python

import json

import click
import elasticsearch


@click.command()
@click.argument('query', required=True)
@click.option('--raw-result/--no-raw-result', default=False)
def search(query, raw_result):
    es = elasticsearch.Elasticsearch()
    matches = es.search('mail', q=query)
    hits = matches['hits']['hits']
    if not hits:
        click.echo('No matches found')
    else:
        if raw_result:
            click.echo(json.dumps(matches, indent=4))
        for hit in hits:
            # This next line and the two after it are the only changes
            click.echo('To:{}\nFrom:{}\nSubject:{}\nPath: {}\n\n'.format(
                hit['_source']['to'],
                hit['_source']['from'],
                hit['_source']['subject'],
                hit['_source']['path']
            ))

if __name__ == '__main__':
    search()

Bingo, done. Not bad for a three-line edit.

For the body itself, we need to do something a little bit more complicated. As we discussed earlier, emails can be simple or multipart, and Python’s email module unfortunately exposes that difference to the user. For simple emails, we’ll just grab the body, which will likely be plain text. For multipart, we’ll grab any parts that are plain text, smash them all together, and use that for the body of the email.

So let’s give it a shot. I’m going to pull out the io module so we can access StringIO for efficient string building, but you could also just do straight-up string concatenation here and get something that would perform just fine. Our body reader then is going to look something like this:

content = io.StringIO()
if message.is_multipart():
    for part in message.get_payload():
        if part.get_content_type() == 'text/plain':
            content.write(unicodish(part.get_payload()))
else:
    content.write(unicodish(message.get_payload()))

This code simply looks for anything labeled plain text and builds a giant blob of it, handling the plain case and the multipart case differently.2
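
One caveat worth knowing about: multipart emails can nest (a multipart/alternative tucked inside a multipart/mixed, say), and the loop above only looks at the top-level parts. If you wanted to be slightly more thorough, the email module’s walk() method iterates over every part in the tree; a sketch of that variant, keeping the same lazy attitude about everything else, would be:

content = io.StringIO()
if message.is_multipart():
    # walk() visits the message and every nested part, depth-first
    for part in message.walk():
        if part.get_content_type() == 'text/plain':
            content.write(unicodish(part.get_payload()))
else:
    content.write(unicodish(message.get_payload()))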

Well, if you think about it, we’ve done all the actual parsing we need to do. That just leaves Elasticsearch integration. We want to combine this with the metadata parsing we already had, so our final code for indexing will look like:

def parse_and_store(es, root, email_path):
    gm_id = path.split(email_path)[-1]
    with gzip.open(email_path + '.eml.gz', 'r') as fp:
        message = email.message_from_file(fp)
    meta = {unicodish(k).lower(): unicodish(v) for k, v in message.items()}
    with open(email_path + '.meta', 'r') as fp:
        meta.update(json.load(fp))

    content = io.StringIO()
    if message.is_multipart():
        for part in message.get_payload():
            if part.get_content_type() == 'text/plain':
                content.write(unicodish(part.get_payload()))
    else:
        content.write(unicodish(message.get_payload()))

    meta['account'] = path.split(root)[-1]
    meta['path'] = email_path

    body = meta.copy()
    body['contents'] = content.getvalue()
    es.index(index='mail', doc_type='message', id=gm_id, body=body)

That’s it. On my system, this can index every last one of the tens of thousands of emails I’ve got in only a minute or so, and the old query tool we wrote can easily search through all of them in tens of milliseconds.

Making a Real Script

Last time, we used click to make our little one-off query tool have a nice UI. Let’s do that for the data loader, too. All we really need to do is give that ad-hoc parse_and_store function a proper main function to drive it. The result will look like this:

#!/usr/bin/env python

import email
import json
import gzip
import io
import os
from os import path

import click
import elasticsearch


def unicodish(s):
    return s.decode('latin-1', errors='replace')


def parse_and_store(es, root, email_path):
    gm_id = path.split(email_path)[-1]

    with gzip.open(email_path + '.eml.gz', 'r') as fp:
        message = email.message_from_file(fp)
    meta = {unicodish(k).lower(): unicodish(v) for k, v in message.items()}
    with open(email_path + '.meta', 'r') as fp:
        meta.update(json.load(fp))

    content = io.StringIO()
    if message.is_multipart():
        for part in message.get_payload():
            if part.get_content_type() == 'text/plain':
                content.write(unicodish(part.get_payload()))
    else:
        content.write(unicodish(message.get_payload()))

    meta['account'] = path.split(root)[-1]
    meta['path'] = email_path

    body = meta.copy()
    body['contents'] = content.getvalue()

    es.index(index='mail', doc_type='meta', id=gm_id, body=meta)
    es.index(index='mail', doc_type='message', id=gm_id, body=body)


@click.command()
@click.argument('root', required=True, type=click.Path(exists=True))
def index(root):
    """imports all gmvault emails at ROOT into Elasticsearch"""
    es = elasticsearch.Elasticsearch()
    root = path.abspath(root)
    for base, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.meta'):
                parse_and_store(es, root, path.join(base, name.split('.')[0]))

if __name__ == '__main__':
    index()

Until Next Time

For now, you can see that what we’ve got works by using the old query tool with the --raw-result flag, and you can use it to do queries across all of your stored email. But the query tool is lacking in multiple ways: it doesn’t output everything we care about (specifically, a useful bit of the message bodies), and it doesn’t treat some fields (like labels) as the exact matches we’d want. We’ll fix these next time, but for now, we can rest knowing that we’re successfully storing everything we care about. Everything else is going to be UI.


  1. After all, if you can’t attach a Word document containing your cover letter to a blank email saying “Job Application”, what’s the point of email? ↩

  2. I actually think the Python library messes this up: simple emails and multipart emails really ought to look the same to the developer, but unfortunately, that’s the way the cookie crumbled. ↩

Having Fun: Python and Elasticsearch, Part 1

November 4th, 2014
9:03 a.m.

  • Programming

I find it all too easy to forget how fun programming used to be when I was first starting out. It’s not that a lot of my day-to-day isn’t fun and rewarding; if it weren’t, I’d do something else. But it’s a different kind of rewarding: the rewarding feeling you get when you patch a leaky roof or silence a squeaky axle. It’s all too easy to get into a groove where you’re dealing with yet another encoding bug that you can fix with that same library you used the last ten times. Yet another pile of multithreading issues that you can fix by rewriting the code into shared-nothing CSP-style. Yet another performance issue you can fix by ripping out code that was too clever by half with Guava.

As an experienced developer, it’s great to have such a rich toolbox available to deal with issues, and I certainly feel like I’ve had a great day when I’ve fixed a pile of issues and shipped them to production. But it just doesn’t feel the same as when I was just starting out. I don’t get the same kind of daily brain-hurt as I did when everything was new,1 and, sometimes, when I just want to do something “for fun”, all those best practices designed to keep you from shooting yourself (or anyone else) in the foot just get in the way.

Over the past several months, Julia Evans has been publishing a series of blog posts about just having fun with programming. Sometimes these are “easy” topics, but sometimes they’re quite deep (e.g., You Can Be a Kernel Hacker). Using her work as inspiration, I’m going to do a series of blog posts over the next couple of months that just have fun with programming. They won’t demonstrate best practices, except incidentally. They won’t always use the best tools for the job. They won’t always be pretty. But they’ll be fun, and show how much you can get done with quick hacks when you really want to.

So, what’ll we do as our first project? Well, for a while, I’ve wanted super-fast offline search through my Gmail messages for when I’m traveling. The quickest solution I know for getting incredibly fast full-text search is to whip out Elasticsearch, a really excellent full-text search engine that I used to great effect on Kiln at Fog Creek.2

We’ll also need a programming language. For this part of the series, I’ll choose Python, because it strikes a decent balance between being flexible and being sane.3

I figure we can probably put together most of this in a couple of hours spread over the course of a week. So for our first day, let’s gently put best practices on the curb, and see if we can’t at least get storage and maybe some querying done.

Enough Elasticsearch to Make Bad Decisions

I don’t want to spend this post on Elasticsearch; that’s really well handled elsewhere. What you should do is read the first chapter or two of Elasticsearch: the Definitive Guide. And if you actually do that, skip ahead to the Python bit. But if you’re not going to do that, here’s all you need to know about Elasticsearch to follow along.

Elasticsearch is a full-text search database, powered by Lucene. You feed it JSON documents, and then you can ask Elasticsearch to find those documents based on the full-text data within them. A given Elasticsearch instance can have lots of indexes, which is what every other database on earth calls a database, and each index can have different document types, which every other database on earth calls a table. And that’s about it.

“Indexing” (storing) a document is really simple. In fact, it’s so simple, let’s just do it.

First, if you haven’t already, install the Python library for Elasticsearch using pip via a simple pip install elasticsearch, and then launch a Python interpreter. I like bpython for this purpose, since it’s very lightweight and provides great tab completion and as-you-type help, but you could also use IPython or something else. Next, if you haven’t already, grab a copy of Elasticsearch and fire it up. This involves the very complicated steps of

  1. Downloading Elasticsearch;
  2. Extracting it; and
  3. Launching it by firing up the bin/elasticsearch script in a terminal.

That’s it. You can make sure it’s running by hitting localhost:9200/ in a web browser. If things are looking good, you should get back something like

{
  "status" : 200,
  "name" : "Gigantus",
  "version" : {
    "number" : "1.3.4",
    "build_hash" : "a70f3ccb52200f8f2c87e9c370c6597448eb3e45",
    "build_timestamp" : "2014-09-30T09:07:17Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

Then, assuming you are just running a vanilla Elasticsearch instance, give this a try in your Python shell:

import elasticsearch
es = elasticsearch.Elasticsearch()  # use default of localhost, port 9200
es.index(index='posts', doc_type='blog', id=1, body={
    'author': 'Santa Clause',
    'blog': 'Slave Based Shippers of the North',
    'title': 'Using Celery for distributing gift dispatch',
    'topics': ['slave labor', 'elves', 'python',
               'celery', 'antigravity reindeer'],
    'awesomeness': 0.2
})

That’s it. You didn’t have to create a posts index; Elasticsearch made it when you tried storing the first document there. You likewise didn’t hav
