BAMFCSV - BAMF, your data's here!
Rivaling the amazing transitive powers of Nightcrawler, Jon Distad and I decided to tackle the problem of parsing CSV rapidly under Ruby 1.9. "Aha!" you might be saying, "FasterCSV was already rolled into the stdlib in Ruby 1.9! Why would I need a gem to handle it?" Well, you have a very good point there: FasterCSV was a good response to the performance of the old 1.8 stdlib CSV parser. However, there are still cases where it doesn't quite go fast enough.
In our specific use case for a client project, we needed to parse large result sets arriving in CSV format. How large? About 25 megs large, 200,000 records large. Decently large, but not outrageously so. When we used the FasterCSV route, we could consume the data in about 8 seconds or so on our staging environment. I groused about how it could be better with a C extension. Jon rose to the challenge and started working on one, I dutifully obliged by contributing patches, and we spent a couple of Fridays pairing on it. I liked our results.
It's fast. How fast? Well, we took our problem CSV input and started benchmarking with it. We then tuned the parser the only way that makes any sense: profiling, identifying hotspots, and circumventing them. Here's a quick benchmarking session I just ran on my MBP:
alexs-MacBook-Pro:bamfcsv alex$ irb
ruby-1.9.2-p136 :001 > require 'benchmark'
 => true
ruby-1.9.2-p136 :002 > require 'bamfcsv'
 => true
ruby-1.9.2-p136 :003 > require 'csv'
 => true
ruby-1.9.2-p136 :004 > Benchmark.measure { CSV.read "observations.csv" }
 => 2.050000 0.040000 2.090000 ( 2.085173)
ruby-1.9.2-p136 :005 > Benchmark.measure { CSV.read "observations.csv" }
 => 2.190000 0.050000 2.240000 ( 2.230679)
ruby-1.9.2-p136 :006 > Benchmark.measure { CSV.read "observations.csv" }
 => 2.190000 0.020000 2.210000 ( 2.215040)
ruby-1.9.2-p136 :007 > Benchmark.measure { CSV.read "observations.csv" }
 => 2.140000 0.050000 2.190000 ( 2.180277)
ruby-1.9.2-p136 :008 > Benchmark.measure { CSV.read "observations.csv" }
 => 2.170000 0.040000 2.210000 ( 2.208252)
ruby-1.9.2-p136 :009 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 => 0.270000 0.040000 0.310000 ( 0.301174)
ruby-1.9.2-p136 :010 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 => 0.210000 0.020000 0.230000 ( 0.233012)
ruby-1.9.2-p136 :011 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 => 0.220000 0.020000 0.240000 ( 0.239818)
ruby-1.9.2-p136 :012 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 => 0.210000 0.020000 0.230000 ( 0.224568)
ruby-1.9.2-p136 :013 > Benchmark.measure { BAMFCSV.read "observations.csv" }
 => 0.220000 0.020000 0.240000 ( 0.240832)
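If you'd rather rerun that comparison as a script than paste lines into irb, something like the following rough sketch might do. It assumes nothing about BAMFCSV beyond the BAMFCSV.read call used in the session above, and it skips BAMFCSV entirely when the gem isn't installed. The helpers write_sample_csv and compare_parsers are hypothetical names made up for this example, and the generated file is a synthetic stand-in, not our real observations.csv.

```ruby
require 'benchmark'
require 'csv'
require 'tempfile'

# BAMFCSV is treated as optional so the script still runs where the gem
# (a C extension) is not installed.
begin
  require 'bamfcsv'
  HAVE_BAMFCSV = true
rescue LoadError
  HAVE_BAMFCSV = false
end

# Hypothetical helper: write a synthetic CSV with a header line plus
# `rows` records. Returns the Tempfile so it is not cleaned up early.
def write_sample_csv(rows)
  file = Tempfile.new(['sample', '.csv'])
  file.puts 'id,name,value'
  rows.times { |i| file.puts "#{i},record#{i},#{i * 0.5}" }
  file.close
  file
end

# Hypothetical helper: read the file `runs` times with each available
# parser. Returns a hash of parser name => array of wall-clock seconds.
def compare_parsers(path, runs = 3)
  parsers = { 'CSV' => lambda { CSV.read(path) } }
  parsers['BAMFCSV'] = lambda { BAMFCSV.read(path) } if HAVE_BAMFCSV

  results = {}
  parsers.each do |name, parser|
    results[name] = (1..runs).map { Benchmark.realtime { parser.call } }
  end
  results
end

sample = write_sample_csv(10_000)
compare_parsers(sample.path).each do |name, times|
  puts format('%-8s best of %d runs: %.4fs', name, times.size, times.min)
end
```

Point compare_parsers at your own problem file instead of the generated sample to get numbers that mean something for your data. Several runs per parser are taken because, as the session above shows, individual timings wobble a bit.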
We haven't tried to match FasterCSV feature for feature. We haven't tried to implement good Windows support. But it's really fast (roughly nine times faster than the stdlib CSV in the session above), and if you're hurting on CSV parsing performance, it can help you out in 1.9.
There are many 1.8 CSV libraries built as native extensions out there. Some report ridiculously fast performance, orders of magnitude faster than BAMFCSV. But when we tried to use them under 1.9, they all blew up. So here's an effort to solve the same problem for Ruby 1.9.