Fun with ones and zeros



PostgreSQL full text search with Django

PostgreSQL 8.3 is coming out soon with full text search integrated into the core database system. It's pretty well documented in chapter 12 of the PostgreSQL docs. The docs are a bit intimidating, but it turns out to be pretty easy to use with Django.

Let's say you're doing a stereotypical blog application, named 'blog', and have a model for entries such as:

from django.db import models

class Entry(models.Model):
    title = models.CharField(max_length=128)
    body = models.TextField()

after adding your app, and doing a syncdb, you should have a PostgreSQL table named blog_entry. You'll need to connect to the database with psql, or probably more conveniently with manage.py dbshell to execute a few SQL commands to setup full text searching.

PostgreSQL Setup

First, you'll want to add a column to hold a tsvector for the blog entry, which is a preprocessed version of text optimized for searching.

ALTER TABLE blog_entry ADD COLUMN body_tsv tsvector;

Next, you'll want to add a trigger to update the body_tsv column whenever a record is inserted or updated:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON blog_entry 
    FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(body_tsv, 'pg_catalog.english', body);

You'll probably also want an index on that field, to make searches more efficient:

CREATE INDEX blog_entry_tsv ON blog_entry USING gin(body_tsv);

Lastly, if you already have records in the blog_entry table, you'll want to update the body_tsv column for those records (any new records will be automatically taken care of by the trigger).

UPDATE blog_entry SET body_tsv=to_tsvector(body);

As a test, to search for the words 'hello world' for example, you could try:

SELECT title FROM blog_entry WHERE body_tsv @@ plainto_tsquery('hello world');

and get back a list of entries with those words in the body.

Full text searching in Django

Full text searching through the Django ORM is possible using the .extra() queryset modifier. To fetch the same entries with 'hello world' we searched for in raw SQL, you'd do something like:

q = 'hello world'
queryset = Entry.objects.extra(
    where=['body_tsv @@ plainto_tsquery(%s)'], 
    params=[q])
for entry in queryset:
    print entry.title

The full text features in PostgreSQL also allow for ranking the search results using the PostgreSQL ts_rank_cd() function, and generating fragments of documents with search terms highlighted using the ts_headline() function. An example of raw SQL for doing this might be:

SELECT title, ts_headline(body, query) AS snippet, ts_rank_cd(body_tsv, query, 32) AS rank
  FROM blog_entry, plainto_tsquery('hello world') AS query
  WHERE body_tsv @@ query
  ORDER BY rank DESC;

The Django ORM can generate an equivalent query with

q = 'hello world'
queryset = Entry.objects.extra(
    select={
        'snippet': "ts_headline(body, query)",
        'rank': "ts_rank_cd(body_tsv, query, 32)",
        },
    tables=["plainto_tsquery(%s) as query"],
    where=["body_tsv @@ query"],
    params=[q]
    ).order_by('-rank')

for entry in queryset:
    print entry.title, entry.snippet, entry.rank

Addendum, 2008-04-30

With the new Queryset Refactor (QS-RF) branch of Django, one of the few backwards-incompatible changes is that you can't use an order_by method with a field generated by an extra method. Instead, ordering by that field should be specified using an order_by parameter in the extra method, as in (the change is only in the last two lines)

Also, having a function call in the 'tables' parameter no longer seems to work, so as a workaround for now I've been repeating plainto_tsquery() 3 times:

q = 'hello world'
queryset = Entry.objects.extra(
    select={
        'snippet': "ts_headline(body, plainto_tsquery(%s))",
        'rank': "ts_rank_cd(body_tsv, plainto_tsquery(%s), 32)",
        },
    where=["body_tsv @@ plainto_tsquery(%s)"],
    params=[q],
    select_params=[q, q],
    order_by=('-rank',)
    )

There is a lot more that can be done with the PostgreSQL full text search feature, check the docs for more info. But hopefully this is enough to get started with.

Addendum, 2009-01-28

Ross Poulton has a writup on Full-text searching in Django with PostgreSQL and tsearch2 - using PostgreSQL 8.1 (on a Debian system) where tsearch2 was a contibuted addon module.

posted: Wednesday, 02 January 2008 07:00 | tags: django postgresql


Comments

#1 Adrian commented, on January 29, 2008 at 1:24 p.m.:

Excelent post!
Thank you very much for the information. I'm really looking forward to the release of the 8.3 version of postgresql.

#2 Charlie commented, on February 21, 2008 at 11:46 a.m.:

Hi,

I'm just starting to look into PostgreSQL (I'm already using Django), so thanks for teaching me some stuff.

I noticed you use ts_headline() in your code snippets, but in your text you refer to the highlighting function as ts_highlight(). I believe the snippets are correct and the text is wrong, right?

#3 Barry commented, on February 21, 2008 at 11:56 a.m.:

Charlie: You're right, thanks for pointing that out - I've corrected the text.

#4 Sajal Kayan commented, on December 22, 2008 at 10:34 a.m.:

Excellent post, been digging into postgres and full text search for last couple of days, however i havent been able to figure something out yet.

This is not specific to full text search in django, but more of a postgres question..

from your example:

SELECT title FROM blog_entry WHERE body_tsv @@ plainto_tsquery('hello world');

This would match documents only if 'hello world' appears exactly as it is. how do i make it also match 'hello stupid world' ?

My intention not to make a search engine but rather to send in an entire article as the search term and get a list of 10 most similar articles with some sort of scoring capability.

kinda do what this plugin does for wordpress rmarsh.com/plugins/similar-posts/

Would be very grateful for some insights into this.

#5 Rachel commented, on June 30, 2009 at 3:39 a.m.:

A really useful post - thanks for sharing!

blog comments powered by Disqus

spacer


gipoco.com is neither affiliated with the authors of this page nor responsible for its contents. This is a safe-cache copy of the original web site.