Batch Processing Millions and Millions of Images

Posted by Mike Brittain on July 9, 2010

I joined Etsy back in February and knew immediately that there would be no shortage of technical challenges.  Many of our job postings for Engineering positions describe the company as a place “where the word ‘millions’ is used frequently and in many contexts”.  I got a taste of that within my first weeks on the job.

We are in the process of redesigning a few of the major sections around etsy.com.  Every item being sold on the site can have up to five photos posted with it.  When a seller uploads a new photo, it’s resized automatically into six different sizes that are displayed throughout the site.  As we redesigned some pages we realized we would need to replace a few of the existing image sizes.

When I started this project, there were 135 million images for items being sold on Etsy, and that number increases every minute as sellers list new items for sale.  To provide a sense of scale, let’s consider how long it would take me to resize these images by hand in Photoshop.  It takes about 40 seconds for me to find and open a file, scale it to a smaller size, and save it to a new location.  With a bit of practice, I could probably shave a couple of seconds off of that.  But at this rate it would take 170 years to resize all of those images.

But here’s the spoiler… We did it in nine days.  Every single image.

Images resized over a seven day period using four machines.

We spent a number of days laying out this project.  We planned for how to move these images to the cloud and resize the whole batch using EC2.  We investigated resizing each photo on-demand as it was displayed on the site.  Both of these options had drawbacks for us.  Instead, we proceeded with what seemed like the simplest approach: batch process the images on our own servers.

Our Weapons of Choice

There are three tools that made up the majority of this project:

GraphicsMagick

If you’re not familiar with it, GraphicsMagick is a fork of ImageMagick with somewhat better performance, due in large part to its multiprocessor support.  Its flexible command-line parameters (almost the same as ImageMagick’s) provided good opportunities for performance tuning, which I’ll talk about shortly.

Perl

It is the “Swiss army knife”, right?  I didn’t use the GraphicsMagick-Perl library, though.  All of the resizing tasks were executed as shell commands from the Perl script I wrote.  So this really could have been written in any scripting language.
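
Just to make that concrete, here’s a minimal sketch of shelling out to GraphicsMagick from a Perl script.  The paths, geometry, and quality value are made up for illustration; they aren’t our production settings.

#!/usr/bin/perl
# Minimal sketch of shelling out to GraphicsMagick from Perl.
# The paths, geometry, and quality value are illustrative only.
use strict;
use warnings;

sub resize_image {
    my ($src, $dst, $geometry) = @_;
    my @cmd = ('/usr/local/bin/gm', 'convert', $src,
               '-resize', $geometry, '-quality', '90', $dst);
    return system(@cmd) == 0;
}

resize_image('/tmp/original.jpg', '/tmp/medium.jpg', '570x570')
    or warn "resize failed\n";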

Ganglia

Nearly all of our projects at Etsy include metrics of one sort or another.  When testing the resizing process, I saw a decent amount of variability in speeds.  By measuring each part of the image conversion process (inspecting original images, copying images to a local ram disk, resizing, comparing to the original, and copying back to the filer), we were able to determine which steps were limiting the overall processing speed.  Another benefit to graphing all of this is that with a simple dashboard you can keep everyone on the project up to date with your day-to-day progress.
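
To give a rough idea of what that instrumentation can look like, each stage gets wrapped in a timer and the elapsed time is pushed to Ganglia with the gmetric command-line tool.  This is only a sketch; the metric name and file paths are invented, not the ones we actually used.

# Sketch: time one stage of the pipeline and report it to Ganglia
# via gmetric.  Metric name and file paths are invented.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

sub timed_stage {
    my ($metric, $code) = @_;
    my $t0      = [gettimeofday];
    my $result  = $code->();
    my $elapsed = tv_interval($t0);
    system('gmetric', "--name=$metric", "--value=$elapsed",
           '--type=float', '--units=seconds');
    return $result;
}

my $src   = '/mnt/filer/images/original.jpg';
my $local = '/dev/shm/original.jpg';
timed_stage('img_copy_to_ramdisk', sub { system('cp', $src, $local) });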

Tuning GraphicsMagick

GraphicsMagick has a lot of command line options.  There is no shortage of dials to turn, and plenty of opportunities for balancing between file size, speed, and quality.  I started testing locally on a MacBook Pro, which was ample for working out the quality and file size settings we would use.

Two hundred images from our data set were copied to local disk and used for visual comparisons as we tweaked the settings in GraphicsMagick.  Image quality is a very high priority for Etsy and our sellers, so it was difficult to trade off even a small amount of quality for faster rendering or smaller file sizes.  I provided test pages of results throughout our benchmarking for internal review — imagine an HTML page with 200 sample images displayed in rows, with seven columns showing each image at varying quality settings.  (Here’s a tip for when you do this yourself: don’t label the images with their quality settings, file sizes, processing times, or any other data about them.  In fact, don’t order the images according to these values.  Don’t even name the files according to any of these values.  Force your judges to make unbiased decisions.)
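
If you want to build that kind of blind comparison page yourself, the idea is simple to sketch: pre-render each sample at every candidate setting under an opaque label, then shuffle the column order on every row so the layout gives nothing away.  The directory layout and labels below are hypothetical.

# Sketch of a "blind" comparison page.  Assumes each sample has been
# pre-rendered once per candidate setting as out/<label>_<name>, where
# the labels A..G are opaque.  Shuffling the columns per row keeps
# reviewers from associating a column with a setting.
use strict;
use warnings;
use List::Util qw(shuffle);

my @samples = glob('samples/*.jpg');   # the sample set
my @labels  = ('A' .. 'G');            # seven candidate settings, anonymized

open my $out, '>', 'comparison.html' or die "comparison.html: $!";
print {$out} "<table>\n";
for my $img (@samples) {
    (my $name = $img) =~ s{.*/}{};
    print {$out} '<tr>';
    print {$out} qq{<td><img src="out/${_}_$name"></td>} for shuffle @labels;
    print {$out} "</tr>\n";
}
print {$out} "</table>\n";
close $out;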

One option I didn’t expect to be fiddling with until we got deep into testing was resampling filters.  There are a number of these in GraphicsMagick, and you should test and choose the one that best suits your needs – speed, file size, and quality.  We found seven filters that provided acceptable quality for our images: Blackman, Catrom, Hamming, Hanning, Hermite, Mitchell, and Triangle.  I tested each of these against our sample set to determine the optimal speed and file size resulting from each filter.  Even a few seconds’ difference in speed when testing 200 images can equate to days, or weeks, when you’re processing millions of images.

Filter      File size (KB)   Time (sec)
Blackman         969             24
Catrom           978             29
Hamming          915             24
Hanning          939             24
Hermite          937             23
Mitchell         922             29
Triangle         909             23
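
The benchmark behind those numbers doesn’t need to be anything fancy: resize the whole sample set once per filter, and record the wall-clock time and total output size.  Something along these lines, where the output geometry is just an example:

# Sketch of the filter benchmark: one pass over the sample set per
# filter, recording wall-clock time and total output size.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my @filters = qw(Blackman Catrom Hamming Hanning Hermite Mitchell Triangle);
my @samples = glob('samples/*.jpg');

for my $filter (@filters) {
    my $t0    = [gettimeofday];
    my $bytes = 0;
    for my $src (@samples) {
        (my $name = $src) =~ s{.*/}{};
        my $dst = "out/${filter}_$name";
        system('gm', 'convert', $src, '-filter', $filter,
               '-resize', '170x135', $dst) == 0 or warn "failed: $src\n";
        $bytes += (-s $dst) || 0;
    }
    printf "%-9s %6d KB  %5.1f s\n",
           $filter, $bytes / 1024, tv_interval($t0);
}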

Start with the right image file when you’re down-sizing images.  We keep full-size original images that are uploaded to the site by our sellers.  We cut these into the various image sizes we display on the site.  Let’s say these sizes are “large,” “small,” and “extra-small.”  In this project, we needed to create a “medium” size image and it seemed to make sense that we would want to cut this image directly from the original (full-size) image.  We found out, almost by accident, that using the previously down-sized “large” images resulted in better quality and faster processing than starting with the original full-size images.

Compare your performance-tuning findings with a professional.  In my case, I went to the source, Bob Friesenhahn, who writes and maintains GraphicsMagick.  Bob was kind enough to provide some additional advice that improved performance for this project even more.

Tuning Everything Else

Armed with preliminary testing results, I moved to our production network to test some more under “real world” conditions.  There were fairly dramatic differences in the environment, specifically the machine specs and the NFS-mounted storage we use for images.

I was expecting CPU to be the bottleneck, but at this point my problem was NFS.  With a primed NFS cache, performance can be snappy.  But touching un-cached inodes is downright sluggish.  I checked out top while running a batch of 10,000 resize operations and saw that the CPU was barely working.  I wanted it to be pegged around 95%, but it was chilling out around one percent.  When I looked through some Ganglia metrics, it was clear we were bound by NFS seek time.  The fastest I was able to process images was five images per second.

Fork to the rescue!  I rewrote the portion of the script that handled the read/resize/write operations so that it would be forked from the parent script, which spent its time looping through file names, spawning children, and reaping them when they exited.  (When you do this, make it a command-line option so you can tune it easily, e.g. "--max-children=20".)  This made a big difference.  Lots of NFS seeks could be made in parallel.  There were enough processes running that a steady queue built up waiting for NFS to return files from disk, and another queue built up waiting for processor time.  Neither spent any time twiddling their thumbs.  The resizing speed improved to about 15 images per second.  At this rate the total working set would take 2500 hours, or 104 days, to resize.  Still not good enough.
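
For anyone who wants the shape of that parent/child split, here’s a stripped-down sketch of the idea.  The real script did more bookkeeping (metrics, ram disk copies, comparisons against the originals), and the gm command, geometry, and output naming here are placeholders.

# Stripped-down sketch of the forking loop: the parent reads file names,
# keeps at most --max-children workers running, and reaps them as they
# exit.  The gm command and output naming are placeholders.
use strict;
use warnings;
use Getopt::Long;

my $max_children = 20;
GetOptions('max-children=i' => \$max_children);

my %kids;
while (my $file = <STDIN>) {              # file names piped in, one per line
    chomp $file;
    while ((keys %kids) >= $max_children) {
        my $pid = waitpid(-1, 0);         # block until any child finishes
        delete $kids{$pid};
    }
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                      # child: one read/resize/write, then exit
        exec('gm', 'convert', $file, '-resize', '570x570', "$file.medium.jpg");
        exit 1;                           # only reached if exec fails
    }
    $kids{$pid} = 1;                      # parent: remember the child
}
while (keys %kids) {                      # reap the stragglers
    my $pid = waitpid(-1, 0);
    delete $kids{$pid};
}

Feeding it is then just a matter of piping in file names, e.g. "find /mnt/filer/images -name '*.jpg' | perl resize.pl --max-children=20" (the script name and path are invented for the example).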

Now that we could feed enough images from NFS, we reached for more CPU — a 16 core (hyperthreaded) Nehalem server.  Problem solved, right?  Wrong.  The resizing speed on this box was actually worse, around 10 images per second.  Here’s why…

Given the opportunity to use additional processors, GraphicsMagick used all of them.  To resize an original image (approx. 1.5 MB) to a thumbnail (approx. 3 KB), GraphicsMagick split up the work across 16 processors, executed each job, and reassembled the results.  This was simply too much overhead for the relatively small amount of work actually being done.  This can be fixed by tuning the OpenMP threads environment variable when running GraphicsMagick, for example:

env OMP_NUM_THREADS=2 /usr/local/bin/gm ...
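
The same cap can be set from inside a wrapper script by exporting the variable before the command runs; here’s the Perl equivalent, with placeholder file names:

# Equivalent from inside a Perl wrapper: cap GraphicsMagick's OpenMP
# threads for the child process.  File names are placeholders.
$ENV{OMP_NUM_THREADS} = 2;
system('/usr/local/bin/gm', 'convert', 'in.jpg', '-resize', '170x135', 'out.jpg');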

This showed an immediate improvement, but I needed to find the sweet spot.  I had knobs for both maximum number of children (Perl script) and number of threads (GraphicsMagick) used for resize operations.  I ran a number of tests tuning each of these parameters.
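
A sweep like that is just nested loops over the two knobs, timing a fixed batch of local files for each combination.  Roughly like this, where run_batch() stands in for the forking resize loop shown earlier and the specific knob values are illustrative:

# Sketch of the two-knob sweep: for each (threads, children) combination,
# time a fixed batch of local files.  run_batch() is a placeholder for the
# forking resize loop shown earlier; this stub just resizes serially.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

sub run_batch {
    my ($files, $max_children) = @_;      # $max_children unused in this stub
    system('gm', 'convert', $_, '-resize', '170x135', 'jpg:/dev/null')
        for @$files;
}

my @files = glob('/var/tmp/bench/*.jpg'); # local copies, to keep NFS out of it

for my $threads (1, 2, 4, 8) {
    for my $children (5, 10, 15, 20, 30) {
        local $ENV{OMP_NUM_THREADS} = $threads;
        my $t0 = [gettimeofday];
        run_batch(\@files, $children);
        printf "threads=%-2d children=%-2d  %.1f images/sec\n",
               $threads, $children, scalar(@files) / tv_interval($t0);
    }
}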

Using two processors per resize operation and running a maximum of 15 children yielded the best results.  Note that while tuning these parameters, I tested with local files to exclude variability introduced by NFS.  We’re now closer to 262 hours (11 days) for the entire working set.  At this point the process looked sufficiently optimized, and we could simply add some more iron.  Four 16-core Nehalems were used for resizing the production working set.  This may be the point where you are asking, “who has four of these boxes just lying around?”  But if you actually have 135 million images to resize, you probably have some spare hardware around, too.

In production, we didn’t see the amazing rates of 140 images per second for each machine.  We still had to contend with cold seeks across NFS.  By applying everything we had learned, we managed to get a fairly consistent resize rate on each running machine.

Resizing images at 180 per second across four machines

Summary

We needed to resize about 135 million images right on our production servers.  We accomplished this using GraphicsMagick, Ganglia, Perl, and a very healthy dose of research.  In fact, the research phase of this project took longer than the batch processing itself.  That was clearly time well spent.  This well-tuned resizing process (if you missed the spoiler at the beginning of the article) took only nine days to complete.  And since first running the process, we have re-used it two more times on the same data set.

By the way, I can’t end this post without also acknowledging our operations team, who worked with me and helped out on this project.  They are amazing.  But don’t just take my word for it; come find out for yourself.

Category: engineering

30 Comments

Jochen Wersdörfer •
5 years ago

Do you really have to batch-process all those images?

I mean, usually, only a small fraction of your stored images will ever be shown to users. We also have millions of images and did batch-processing for some time, but then switched to online resizing and caching.

We now use a combination of squid, online resizing and varnish, which seems to be simpler and faster (changed images are online after the varnish cache times out, no state to hold, no wasted cpu-cycles for images no one will ever see etc. …).

skhan •
5 years ago

This is excellent information! I am currently working on an online catalog of magazines for a magazine distribution firm. Tons of magazines, a cover image for every issue.

Gaurav Kalra •
5 years ago

You mentioned that “We planned for how to move these images to the cloud and resize the whole batch using EC2,” and that this approach had its drawbacks.

Was the drawback limited to the time taken to move the images to the cloud? Or was there something else as well?

Dave •
5 years ago

Did you look at libgd?

Alex •
5 years ago

Great article, thanks. Did you try running GraphicsMagick with only a single thread? There is a clear trend in your benchmarks showing that fewer threads and more processes give better results.

links for 2010-07-10 « Uncle Joe's House of Crazy •
5 years ago

[…] Code as Craft » Batch Processing Millions and Millions of Images (tags: scalability programming optimization development images engineering) […]

Bob Friesenhahn •
5 years ago

This is a nice article. Since NFS overhead is a factor, make sure to look at the MAGICK_IOBUF_SIZE environment variable. This may be used to adjust the I/O buffer size used when accessing files. The GraphicsMagick default is 16K but a somewhat different value may be more appropriate for NFS, depending on how the NFS buffering is tuned. Some NFS implementations can be tuned to buffer 32K at a time. The NFS client may then perform more read-ahead on the file, or buffer more data prior to writing.

A large buffer size is not always ideal since the larger size may hinder performance for formats which need a lot of seeking (e.g. TIFF) and might therefore do more I/O than is actually needed.

Bob

Mike Brittain •
5 years ago

The biggest concern about using the cloud was moving all of the files there and then getting them back into the right place on our filers afterward. My findings on processing speed from local disk vs. NFS mounted filers show that the filers were more of a hindrance to us than availability of CPU. I.e. resizing from local disk was about 3x faster.

There are clearly many ways to skin this cat. I won’t argue that this is the most elegant solution, but a benefit to batch processing (in this case) is that it is completely de-coupled from the rest of our application stack. Right now, that is A Good Thing for us.

There are some pieces of our hardware and software architecture that were (when I started this project) temperamental. Our operations team at Etsy has done some amazing work over the last five months to tighten those up and we have much more flexibility now. It’s quite possible that if I had started this project today, it would have taken a completely different path.

Paul •
5 years ago

mod_rewrite + dynamic image finder/resizer script = on-demand resize + caching. One of the commenters asked if it really needed to be done all at once. I agree with his sentiment. This type of thing can be done on the fly as needed. I do this with scientific data all the time, and the level of processing is significantly more complicated. You mention 5 images/second… I highly doubt that a visitor will complain if an image takes 0.2 seconds to load. Then, once it’s done, you’ll never have to generate it again.
# mod_rewrite rules:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule . resizer.php [PT,L]

// resizer.php
if ( !file_exists($cached_resized_image) ) {
    system("convert $source_image -resize {whatever} -crop {whatever} $cached_resized_image");
}
// Header stuff here (optionally).
readfile($cached_resized_image);

Batch Processing Millions and Millions of Images « Joebackward's Blog •
5 years ago

[…] Batch Processing Millions and Millions of Images Filed under: Uncategorized — joebackward @ 2:05 pm Code as Craft » Batch Processing Millions and Millions of Images. […]

Julian •
5 years ago

Thanks for the interesting info, Mike. We’ve noticed some quality issues on the 170×135 thumbnails produced for old listings. According to my business partner Karena, who is a graphics expert, it appears the 170x135s may have been upscaled from 155×125? 170x135s have been made for all older listings, but if you relist (not renew, not unexpire – relist) the item, Etsy appears to make new thumbnails for the listing. The new 170×135 is much sharper than the one produced for the original old listing. Please see our comparison at craftcult.com/images/upscaling-example-2.jpg to see what I mean. The left image is the 155×125 from a listing that was active in 2008. The middle image is the 170×135 with the same image ID. The right image is the new 170×135 made when we relisted that item. The new one is of considerably higher quality than the middle image. This is important because while most old images will disappear in the next 4 months, many people renew/unexpire older items, and they will be shown with the poorer quality 170x135s.

Julian •
5 years ago

For another example, here’s a 170×135 that belongs to a sold original listing from 2008:
ny-image0.etsy.com/il_170x135.7370724.jpg

Here is the same image, created when that listing was relisted:
ny-image1.etsy.com/il_170x135.157388933.jpg

The quality difference is pretty significant. The first is clearly lacking in sharpness, while the second looks great.

Oren •
5 years ago

Did you end up using an optimized version of libjpeg (as asked in the forum post)?

Ole Tange •
5 years ago

How fast would it have gone if you instead had used GNU Parallel www.gnu.org/software/parallel/? With -j+0 it will run one job per core.

find images | parallel -j+0 -S server1,server2,: do_stuff

Or if you do not have shared storage (NFS):

find images | parallel -j+0 --trc {.}.out -S server1,server2,: "do_stuff {} > {.}.out"

Watch the intro video for GNU Parallel: www.youtube.com/watch?v=OpaiGYxkSuQ

Laurent Raufaste •
5 years ago

We faced the same problem with millions of images served; we also chose the dynamic resize + caching route.

What were the drawbacks you encountered?

Robert Eisele •
5 years ago

I recently published a solution to scale images inside of lighttpd using mod_magnet and ImageMagick. I would also prefer online generation of images over a batch-processed solution, because you bypass many problems:
– you do not have to find a good time to generate the images (at night, in the morning, …)
– pages load quickly since no operations are performed on the server
– you save money
– and as Jochen already said: only a small fraction of the generated images will be delivered to the user, so why waste the disk space?

www.xarg.org/2010/04/dynamic-thumbnail-generation-on-the-static-server/

Robert Eisele

Vaibhav S Puranik •
5 years ago

Why do you mention the cloud? What is that? Are you talking about cloud computing technology?

Mike Brittain •
5 years ago

@Oren We did not end up using an optimized version of libjpeg for this project.

EastZoneSoupCube - links for 2010-07-26 •
5 years ago

[…] Code as Craft » Batch Processing Millions and Millions of Images How to process 135 million images in 9 days. (tags: performance processing image optimization batch) […]

Batch processing millions of images •
5 years ago

[…] just want to share one link about processing millions of images. Here it is LINK Posted by admin Tips & Tricks Subscribe to RSS […]

10 IT management tips from Etsy CTO Chad Dickerson | 10 Things | TechRepublic.com •
5 years ago

[…] Etsy maintains an engineering blog called Code as Craft. At the blog, the public can read about Etsy’s deployments and the technologies it uses. The benefits? Outsiders have provided tips to Etsy on managing the MongoDB as well as helping with other chores, such as resizing 135 million images in one pop. […]

Boris Kuzmanovic •
3 years ago

I’m kinda late to this post, but anyhow…

I faced the same problem in one of my projects. Instead of going the either/or route of preprocessing all thumbnails or generating them on the fly, I chose a mix of the two.

I wrote a shell script which identified all of the images accessed in the last 4 days and ran a thumbnail regenera