Sharding your database

Nov 30th, 2012 | Comments

I’m increasingly encountering users on Heroku that are encountering the need to shard their data. For most users this is something you delay as long as possible as you can generally go for sometime before you have to worry about it. Additionally scaling up your database is often a reasonable approach early on and something I encourage as a starting point as scaling up is easy to do with regards to databases. However, for the 1% of users that do need to shard when the time comes many are left wondering where to start, hence the following guide.

How I Write SQL

Nov 17th, 2012 | Comments

I recently got asked by a friend and former co-worker how I write SQL. At first this caught me by surprise and I assumed there was nothing different, but after a few additional comments on it, it became clear most people have no concept for creating clean readable SQL. So without further adieu here’s how I write SQL, with a built up example query.

Using Postgres Arrays in Django

Nov 6th, 2012 | Comments

A few weeks back I did a brief feature highlight on Postgres arrays. Since that time I’ve found myself using them with increasing regularity on small side projects. Much of this time I’m using Django and of course not opting to write raw SQL to be able to use arrays. Django actually makes it quite simple to work with Arrays in Postgres with a package by Andrey Antukh. Lets get started by installing two libraries:

pip install djorm-ext-pgarray
pip install djorm-ext-expressions

The first library is for support for the array field type, the second allows us to more easily mix bits of SQL within the Django ORM.

Redis in my Postgres

Oct 18th, 2012 | Comments

Yesterday there was a post which hit Hacker News that talked about using SQL to access Mongo. While this is powerful I think much of the true value was entirely missed within the post.

SQL is an expressive language, though people are often okay with accessing Mongo data through its own ORM. The real value is that you could actually query the data from within Postgres then join across your data stores, without having to do some ETL process to move data around. Think… joining sales data from Postgres with user reviews stored in Mongo or searching for visits to a website (retained in redis) against purchases by user in Postgres.

The mechanism pointed out was a MongoDB Foreign Data Wrapper. A Foreign Data Wrapper or FDW essentially lets you connect to an external datastore from within a Postgres database. In addition to the Mongo FDW released the other day there’s many others. For example Postgres 9.0 and up ships with one called db_link, which lets you query and join across two different Postgres databases. Beyond that there’s support for a variety of other data stores including some you may have never expected:

Redis
Textfile
MySQL
Oracle
ODBC
LDAP
Twitter
More

Lets look at actually getting the Redis one running then see what some of the power of it really looks like. First we have to get the code then build it:

Understanding Postgres Performance

Oct 1st, 2012 | Comments

For many application developers their database is a black box. Data goes in, comes back out and in between there developers hope its a pretty short time span. Without becoming a DBA there’s a few pieces of data that most application developers can easily grok which will help them understand if their database is performing adequately. This post will provide some quick tips that allow you to determine whether your database performance is slowing down your app, and if so what you can do about it.

Understanding your Cache and its Hit Rate

The typical rule for most applications is that only a fraction of its data is regularly accessed. As with many other things data can tend to follow the 80/20 rule with 20% of your data accounting for 80% of the reads and often times its higher than this. Postgres itself actually tracks access patterns of your data and will on its own keep frequently accessed data in cache. Generally you want your database to have a cache hit rate of about 99%. You can find your cache hit rate with:

SELECT 
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  (sum(heap_blks_hit) - sum(heap_blks_read)) / sum(heap_blks_hit) as ratio
FROM 
  pg_statio_user_tables;

Arrays in Postgres

Aug 20th, 2012 | Comments

Postgres out of the box has an abundance of datatypes, from standard numeric datatypes to geometric or even network datatypes. With extensions you can get even more out of it as earlier discussed with hStore. Though with all of the datatypes its easy to miss out on some of them that are there, in fact one of my favorites is often missed entirely. The Array datatype lets you do just as you’d expect, store an array inside Postgres. With this you can often get some of the functionality you’d want in a single table when you might traditionally have expanded to multiple tables.

The broader question may be why you’d actually want to use an array. One good reason may be if you’re an application developer its how you think of your data, so why not model it the same way. As you’ll see below it can be easier than joining and aggregating across a set of rows. Also depending on your case you performance could be improved, though mileage may vary here as it does depend on the data you’re storing.

First a bit of a hacky example… Lets say you have a basic website that sells stuff, and instead of having a purchase ID and a total you want to include the quantity, id, and price of each item in a single row. With a bit of a messy foreign key (using a decimal) you could store all of this within a single row:

CREATE TABLE purchases (
    id integer NOT NULL,
    user_id integer,
    items decimal(10,2) [100][1],
    occurred_at timestamp
);

With this table I could have an array that holds multiple records of:

The item purchased
The quantity
The price

Rapid API Prototyping with Heroku Postgres Dataclips

Jul 19th, 2012 | Comments

For small and large applications there often comes a time where you’re busy creating an API. The API creation process usually takes the form of something like: Design your API, Implement your API, Test and Evaluate, Rinse and Repeat. Historically with implementing the API fully you can’t see how you truly feel about the result, causing this cycle to take longer than it should. Heroku Postgres has Dataclips, which (among other things) can be used for quickly prototyping APIs. Dataclips allows you to easily share data, but more importantly consume it in a form much like you would a restful API. Lets take a look at how this would work:

Given a schema

We can see from the screen shot of the schema above we can see we have a few tables. These tables are the complete works of Shakespeare thanks to opensourceshakespeare. Lets take a couple of hypothetical endpoints we’ve decided on that we’d like to expose for users and test as an API.

The number of works per year
Drone factory (this is a fun one courtesy of Richard Morrison - @mozz100 essentially who has the longest paragraphs on average in his works.

Protips for Conference Talks

Jun 19th, 2012 | Comments

A few weeks ago I was sitting at the hotel in Zurich with Jacob Kaplan Moss prior to DjangoCon EU enjoying a beer, talking about Django, and discussing a bit about our upcoming talks for the conference. He talked briefly about his upcoming keynote and how he was doing something different, including essentially 5 mini-talks. This seemed interesting enough, but the part that surprised me was when Jacob said, “I’m among friends here so it’ll be a good place to test this format.” Many if not all in the community know who Jacob is as one of the creators of Django, though still to be “among friends” at a roughly 300 person conference surprised me. However, as someone thats keynoted several times, spoken at conferences for many years, and familiar with many people in the community; for the 150-200 people there he had not met before, he was still truly among friends. While giving a keynote is never an easy feat, it seems to ease the worry ahead of time of doing such.

Saturday night there was a bit of conversation on twitter that had some related discussion. In the last minute rush for DjangoCon US talk submissions a few that have been involved in the community for some time discussed submitting their first talk proposals. In parallel to that was some discussion around diversity, I volunteered the idea of not including presenters name’s in the list when reviewing and voting on talks. While both of the above are controversial topics alone, I hope that can be left to another later time. The key idea that emerged that can be helpful to anyone looking to submit a talk to a conference is how the “pro’s do it”, as Jeremy Dunck put it.

So without further adieu, hopefully without speaking too much for him here’s likely why Jacob viewed his 300 person keynote as being among friends:

1. Start small

Whether its practicing the talk itself or writing the abstract for a proposal practicing each step lets you refine this well ahead of time. In my experience, providing a talk description for a meetup can often be far harder than for a conference. For a meetup I feel confined to 2-3 sentences, versus an abstract a solid paragraph or two. Yet, I still have to make it as exciting, because of course I don’t want 4 people to show up to the meetup because it sounds uninteresting. In the case I’m most familiar with DjangoCon and DjangoCon.eu both happen once a year, though many smaller regional conferences related to Python exist and especially meetup groups:

Schemaless Postgres in Django

Jun 11th, 2012 | Comments

Earlier this week while I was at DjangoCon EU there seemed to be a surprising amount of talk about MongoDB. My problem with this isn’t with MongoDB, but in the assumption that only Mongo can solve what you’re looking for. By and far the most common feature is people want schemaless. It gives them flexibility in their data model and lets them iterate quickly. While I still opt for relational models that map cleanly to a relational database, there are cases where developers may want schemaless. I gave a quick lightning talk on this with slides here, but it is worth recapping.

The example given by pydanny was a product catalog. You may have different items you want to store for a catalog. Lets take an example below:

django_pony = {'name': 'Django Pony', 'rating': '5'}
pink_pony = {'name': 'Pink Pony', 'rating': '4', 'color': 'pink'}

In the case of a product catalog it could be understandable you don’t want to normalize every possible spec for the product. The argument for Mongo is so commonly that you can easily work with this data model. Admittedly it is quite simple:

from pymongo import Connection
connection = Connection()
django_pony = {'name': 'Django Pony', 'rating': '5'}
connection.product.insert(django_pony)

The problem is that this assumes other schemaless options don’t exist or are inferior.

Enter hStore

hStore is a column type within Postgres. It is a key value store that allows you to store a dictionary, with text values. It alone is not a full document store replacement, but allows for flexibility in your data model where you need it while letting you use relational models elsewhere. Its not exactly new within Postgres either, as its been available since 8.4, however its recently become easier to work with and is supported in some form or another by more frameworks.

To do the same as above we only need to do a few steps:

← Older Blog Archives

Craig Kerstiens

More on Postgres Performance

It all starts with stats