BAMFCSV - BAMF, your data's here!

Posted by Venerable High Pope Swanage I, Cogent Animal of Our Lady of Discord 06 May 2011 at 10:02AM

Rivaling the amazing transitive powers of Nightcrawler, Jon Distad and I decided to tackle the problem of parsing CSV rapidly under Ruby 1.9. "Aha!" you might be saying, "FasterCSV was already rolled into the stdlib in Ruby 1.9! Why would I need a gem to handle it?" Well, you have a very good point there: FasterCSV was a good response to the poor performance of the old 1.8 stdlib CSV parser. However, there are still cases where it isn't quite fast enough.

In our specific use case for a client project, we needed to parse large result sets coming in as CSV. How large? About 25 megs large, 200,000 records large. Decently, but not outrageously, large. When we went the FasterCSV route, we could consume the data in about 8 seconds on our staging environment. I groused about how it could be better with a C extension. Jon rose to the challenge and started working on one. I dutifully obliged by contributing patches, and we spent a couple of Fridays pairing on it. I liked our results.

It's fast. How fast? Well, we took our problem CSV input and started benchmarking with it. We then tuned the extension the only way that makes any sense: profiling, identifying hotspots, and circumventing them. Here's a quick benchmarking session I just ran on my MBP:

alexs-MacBook-Pro:bamfcsv alex$ irb
ruby-1.9.2-p136 :001 > require 'benchmark'
 => true 
ruby-1.9.2-p136 :002 > require 'bamfcsv'
 => true 
ruby-1.9.2-p136 :003 > require 'csv'
 => true 
ruby-1.9.2-p136 :004 > Benchmark.measure { CSV.read "observations.csv" }
 =>   2.050000   0.040000   2.090000 (  2.085173)
 
ruby-1.9.2-p136 :005 > Benchmark.measure { CSV.read "observations.csv" }
 =>   2.190000   0.050000   2.240000 (  2.230679)
 
ruby-1.9.2-p136 :006 > Benchmark.measure { CSV.read "observations.csv" }
 =>   2.190000   0.020000   2.210000 (  2.215040)
 
ruby-1.9.2-p136 :007 > Benchmark.measure { CSV.read "observations.csv" }
 =>   2.140000   0.050000   2.190000 (  2.180277)
 
ruby-1.9.2-p136 :008 > Benchmark.measure { CSV.read "observations.csv" }
 =>   2.170000   0.040000   2.210000 (  2.208252)
 
ruby-1.9.2-p136 :009 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 =>   0.270000   0.040000   0.310000 (  0.301174)
 
ruby-1.9.2-p136 :010 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 =>   0.210000   0.020000   0.230000 (  0.233012)
 
ruby-1.9.2-p136 :011 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 =>   0.220000   0.020000   0.240000 (  0.239818)
 
ruby-1.9.2-p136 :012 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 =>   0.210000   0.020000   0.230000 (  0.224568)
 
ruby-1.9.2-p136 :013 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 =>   0.220000   0.020000   0.240000 (  0.240832)

That's roughly a 9x speedup (about 2.2 seconds down to about 0.24) over the CSV library, formerly FasterCSV, that you get out of the box with 1.9.
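If you'd rather script the comparison than type it into irb, here is a minimal sketch. It only uses the two calls from the session above (CSV.read and BAMFCSV.read). The helper names, the generated sample file, and the run count are all made up for illustration, and the BAMFCSV timing is skipped when the gem isn't installed.

```ruby
require 'benchmark'
require 'csv'
require 'tempfile'

# True when the bamfcsv gem can be loaded; the script still runs without it.
def bamfcsv_available?
  require 'bamfcsv'
  true
rescue LoadError
  false
end

# Hypothetical stand-in for observations.csv: a header plus `rows` simple records.
def write_sample_csv(rows)
  file = Tempfile.new(['observations', '.csv'])
  file.puts 'id,name,value'
  rows.times { |i| file.puts "#{i},record #{i},#{i * 1.5}" }
  file.close
  file
end

# Returns the best wall-clock time (in seconds) per parser over `runs` attempts.
def time_parsers(path, runs = 3)
  timings = {}
  timings['CSV'] = Array.new(runs) { Benchmark.realtime { CSV.read(path) } }.min
  if bamfcsv_available?
    timings['BAMFCSV'] = Array.new(runs) { Benchmark.realtime { BAMFCSV.read(path) } }.min
  end
  timings
end

sample = write_sample_csv(20_000)
time_parsers(sample.path).each do |name, secs|
  puts format('%-8s %.4fs', name, secs)
end
```

Swap the generated file for your own problem input; a small synthetic file like this one won't necessarily show the same ratio as our 25 meg result set.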

We haven't tried to match FasterCSV feature for feature. We haven't tried to implement good Windows support. But it's really fast, and if you're hurting on CSV parsing performance, it can help you out in 1.9.
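Since we don't promise feature parity, it's worth checking that a faster parser returns the same rows as the stdlib on your own data before you swap it in. The sketch below is one way to do that. The parser_mismatches helper is hypothetical, not part of either library, and takes the candidate parser as a lambda so nothing here assumes more of BAMFCSV than the read call shown above.

```ruby
require 'csv'

# Returns the indexes of rows where the candidate parser disagrees with the
# stdlib CSV parser. `candidate` is any callable that takes a path and
# returns an array of rows.
def parser_mismatches(path, candidate)
  expected = CSV.read(path)
  actual = candidate.call(path)
  mismatches = []
  [expected.length, actual.length].max.times do |i|
    mismatches << i unless expected[i] == actual[i]
  end
  mismatches
end
```

With the gem installed, a call would look like parser_mismatches("observations.csv", ->(path) { BAMFCSV.read(path) }). An empty array means both parsers agree on every row; anything else tells you which rows to look at.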

There are many 1.8 CSV libraries built as native extensions out there. Some report ridiculously fast performance, orders of magnitude faster than BAMFCSV. But when we tried to use them under 1.9, they all blew up. So here's an effort to solve the same problem for Ruby 1.9.

