The adventures of scaling, Stage 1

March 13th

What is this series about?

While a couple of high-traffic sites are powered by Rails, and while the Rails book has a handful of instructions for scaling your application, it was apparent to us that at a certain point you’re on your own. This series of articles is meant to serve as a case study rather than a generic “How To Scale Your Rails Application” piece, which may or may not be possible to write. I’m outlining what we did to improve our application’s performance; your mileage may obviously vary.

Our journey is broken up into four separate articles, each covering a milestone in scaling the eins.de codebase. The articles are scheduled for posting a week apart.

Facts

Our mission was to rewrite the codebase behind the online community network eins.de, since the former PHP-based codebase was both bloated and poorly architected. Being an online community site, eins.de has everything you’d expect of one: user forums, galleries with comments, user profiles, personal messaging, editorial content, and more. Additionally, eins.de has local partners who are the driving forces behind the available sub-communities, mostly formed around the bigger German cities. User interaction is possible globally, so there’s a single dataset behind everything.

The old codebase consisted of roughly 50,000 lines of PHP code (plus a closed-source CMS that’s not included in this count). We rewrote most of it (some features were left out on purpose) in about 5,000 lines of Rails code.

eins.de serves about 1.2 million dynamic page impressions on a good day. The new incarnation serves the 25 sub-communities on different domains from a single Rails application. It was, however, not before February of this year that our iterative optimizations of both system configuration and application code led to a point where we were able to deal with this amount of traffic.

The site largely lives through dynamic pages, with information rendered based on user preferences or things like online or relationship status. This kept us from taking the easy way out by simply using the page or fragment caching Rails provides.

The application servers are dual Xeon 3.06GHz, 2GB RAM, SCSI U320 HDDs RAID-1. The database servers are dual Xeon 3.06GHz, 4GB RAM, SCSI U320 HDDs RAID-1. The proxy server is a single P4 3.0GHz, 2GB RAM, SCSI U320 HDDs RAID-1.

Without changing the hardware, we were able to improve the performance of our setup through configuration optimization and changes to the application code, all while still adding features.

In numbers: we maxed out at about 750,000 page impressions per day in November (about 60GB of traffic) and now easily handle 1,200,000 page impressions per day (about 100GB of traffic) in March. That is a 1.6x improvement!

At peak times about 20Mbit/s leave the proxy server’s ethernet interface.


Update March 18: A follow-up article addressing reader comments is now available here.

Update March 20: Stage 2 is online.

Update March 27: Stage 3 is online.

Update April 03: Stage 4 is online.

So, what did you start out with?

[Diagram: the original server setup]

Well, you cannot change history; that’s what our configuration was back then. Here’s a bit more version detail for the diagram above:

  • Debian 3.1
  • Kernel 2.4.27
  • lighttpd 1.4.6
  • Ruby 1.8.3 from Debian packages
  • MySQL 5.0.16 from Debian packages
  • Rails 0.14.3 from RubyGems
  • Ruby-MySQL 2.7 from RubyGems
  • Ruby-MemCache 0.0.4

We were using ActiveRecordStore for session management, a token-based single sign-on mechanism, and memcached on both database servers to store the results of database-heavy calculations in memory.
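
For reference, the session-store setting for a Rails application of that vintage looked roughly like this. Option names shifted between early Rails versions, so treat it as a sketch rather than the exact configuration we ran:

```ruby
# config/environment.rb (sketch; early-Rails option names varied by version)
# Keep sessions in the database via ActiveRecordStore instead of PStore files,
# so all application servers can share session state.
ActionController::Base.session_store = :active_record_store
```

The backing `sessions` table is just a `session_id` column plus a `data` blob, which every application server reads and writes through the shared database.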

The two database servers were replicated in a master-master setup, spacing the auto increment generation apart through auto_increment_increment and auto_increment_offset (see the MySQL manual for more information).
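
Concretely, spacing the auto-increment sequences apart can be sketched in my.cnf like this (values are illustrative; consult the MySQL manual for your own setup):

```ini
# Master A -- generates IDs 1, 3, 5, ...
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 1

# Master B -- generates IDs 2, 4, 6, ...
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 2
```

With both masters stepping by two from different offsets, inserts on either side can never produce colliding primary keys.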

haproxy was used to balance both the external FastCGI listeners sitting on the application servers and the database connections from the dispatchers to the MySQL servers.

Basically, as outlined in the introduction above, the relaunch performance was a disaster. The old and crufty PHP-based site was able to handle about 900,000 page impressions before it collapsed (granted, it also had only half the number of application servers), and the newly architected one fell over at a whopping 150,000 page impressions less. Not the turnaround you’d have hoped for, even less so after spending days and nights programming. Good thing the “I cannot deal with change” mob had different things to worry about.

The emergency plan

Yes, we pondered cashing our checks and flying to the Bahamas. We stayed.

As a first measure, the number of FastCGI listeners was decreased from 20 to 10. To be honest, with the old setting the site was truly unusable: pages would start to load but stall every once in a while, leaving boatloads of disappointed and grumpy users hitting reload on us and making things even worse. With the new setting things calmed down a bit; pages loaded, albeit anything but quickly.

Over the next few days after the relaunch we took additional measures to improve performance and fix little issues that hadn’t cropped up in private testing. Sleep was a rare commodity.

A couple of things we did to put out the fire, with varying degrees of success:

  • Rip out haproxy, since it introduced yet another variable to tweak and the immediate benefit of using it wasn’t obvious. The MySQL connections of all application servers were statically configured to point at a single MySQL host, and distribution of the FastCGI connections was handed back to lighttpd. Tip: we found that to get equally loaded application servers you should order your fastcgi.server directives by port rather than by host, like so:

"http-1-01" => ( "host" => "10.10.1.10", "port" => 7000 ),
"http-2-01" => ( "host" => "10.10.1.11", "port" => 7000 ),
"http-3-01" => ( "host" => "10.10.1.12", "port" => 7000 ),
"http-4-01" => ( "host" => "10.10.1.13", "port" => 7000 ),
"http-1-02" => ( "host" => "10.10.1.10", "port" => 7001 ),
"http-2-02" => ( "host" => "10.10.1.11", "port" => 7001 ),
"http-3-02" => ( "host" => "10.10.1.12", "port" => 7001 ),
"http-4-02" => ( "host" => "10.10.1.13", "port" => 7001 ),

  • Play with fragment caching, although it introduced inconveniences for the users (stale data, no longer personalized). No improvement; the changes were reverted later.
  • Back out of the idea of using two memcached hosts simultaneously, as the Ruby-MemCache library apparently doesn’t handle that too well: things got distributed not on a per-key basis but randomly, giving us headaches about the distributed expiration of dirty keys.
  • Refactor the sidebar code, which was originally written as a component. Talking to bitsweat revealed that components are a performance killer: you basically set up yet another full controller environment for each sidebar you render. Yes, that one was obvious in hindsight. (See RailsExpress if you need more convincing.)
  • Add gzip compression as an after_filter (based on the examples in the Rails book).
  • Identify various slow queries via the MySQL slow query log and refactor the culprits by eliminating joins, optimizing indexed columns, etc. (This is obviously not Rails-specific.)
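
The gzip after_filter mentioned above can be sketched roughly like this. This is a minimal sketch in the spirit of the Rails book examples; the module and method names are our own, not from the book:

```ruby
require 'zlib'
require 'stringio'

# Sketch of gzip response compression for an after_filter.
# In a controller you would register something along the lines of:
#   after_filter { |c| c.response.body = CompressionFilter.gzip(c.response.body) }
# (guarded by a check that the client sent "gzip" in Accept-Encoding and
# that a Content-Encoding: gzip header is set on the response).
module CompressionFilter
  def self.gzip(body)
    io = StringIO.new
    gz = Zlib::GzipWriter.new(io)
    gz.write(body)
    gz.close
    io.string
  end
end
```

Compressing on the dispatcher trades a little CPU for a large cut in bytes on the wire, which mattered with 20Mbit/s leaving the proxy at peak times.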

This got us into December at least, to the point where we were able to handle 850,000 page impressions a day; still hardly something you’d put a sticker labeled “easily” on, though.

Our new, simplified setup was as follows:

[Diagram: the new, simplified setup]

Stay tuned for the second part of the scaling series due for posting on Monday, March 20th containing MySQL tuning tips, tuning of FastCGI dispatchers, and further system optimization techniques.

Filed Under: Rails

27 comments

  1. ph 03.13.06 / 23PM

    Woa, I need to check that out, thank you!

  2. Jon Tirsen 03.14.06 / 04AM

    Fantastic article! Truly great of you to share your experiences!

  3. times 03.14.06 / 18PM

    Much appreciated. Thanks for sharing.

  4. Richard 03.15.06 / 10AM

    Fantastic flowcharts!

  5. David Jones 03.15.06 / 10AM

    Nothing beats a shiny chart.

  6. Nico 03.15.06 / 15PM

    Yey! Thank you very much for sharing your knowledge. I’m getting into rails at the moment and want to develop and switch some high-traffic sites too.

  7. Eleanor McHugh 03.15.06 / 16PM

    A very timely and interesting article. I’m currently looking into scaling issues with a (non-Rails) Ruby web application and the lack of anecdotal data online is very frustrating. I look forward to next week’s installment.

  8. goyaves 03.15.06 / 19PM

    Great article

    What app did you use for the grpahics on this page ?

    They are precise consise and awesome !!! Gret job

  9. scoop 03.15.06 / 22PM

    The charts are done in OmniGraffle.

  10. Hone Watson 03.16.06 / 00AM

    Thanks for the heads up man. I must admit I had never heard about Rails until I read this post.

  11. Henry 03.16.06 / 06AM

    Please of please go into more detail as to HOW you clustered your web server.

    - What software did you use. - How do you set it up. - How did you setup MySQL to have a failover. - How do you get your proxy lighttpd to send request to your application servers. - What are your application servers? Are they just web servers running lighttpd?

    I would greatly appreciate a more detailed explaination.

    Thanks in advance

    Henry


  12. john 03.16.06 / 08AM

    so it looks like the majority of gains were in a re-architecting of the back-end, and not so much from using Rails-specific features ?

  13. Richard N 03.16.06 / 10AM

    Very interesting article Patrick – thank you for posting. I look forward to the next installment!

    Regards,

    Richard

  14. Dick Davies 03.16.06 / 11AM

    Just checking:

    • is that really only 1 lighttpd in the diagram?
    • it’s not (http) proxying, it’s just using remote fcgis, right?

    Nice article, thanks for it.

  15. Mike Judge 03.16.06 / 21PM

    Clever as hell. I’m looking forward to future articles.

  16. Sean 03.16.06 / 23PM

    Excellent info! Thanks for taking the time to write this up. Could you go into more detail about the refactoring you did on your sidebar code so that it is no longer component based? I’d like to do the same thing for Typo (as it’s really, really slow) but I don’t know where to start … i.e. did you just make the sidebars partials? How do you avoid the multiple render issue?

    Thanks again!

  17. gumi 03.17.06 / 02AM

    What did all this cost?

  18. eliott 03.17.06 / 05AM

    You might also consider using scgi. I have read that it seems to scale a bit better than plain fcgi. It might be worth testing, at the very least.

    Nice article, and it is great to see people working on performance and scaling. :)

  19. craig 03.17.06 / 08AM

    Excellent article. Answers many of the scaling theories that are bandied around with real facts and figures.

  20. malcontent 03.17.06 / 09AM

    How much of this was the fault of mysql. Could your application better handle the load with postgres or even oracle.

  21. Greg 03.17.06 / 18PM

    Please provide more information. Please oh please oh please!

    Such things as HOW things interconnect. What software is used. HOW YOU LOAD BALANCE. How you manage syncronizing data.

    I could write many more questions, but I’m sure you get the idea.

    I hope to see this talked about in the second article

    THANKS Greg

  22. Chris 03.17.06 / 18PM

    This a great first article but I want to say that I completely agree with Wayne, Greg and Henry’s comments that more detailed information desparetly wanted.

    You have now made all of us wanting to know how exactly you load balance, etc. your web farm.

    I’m looking forward to the second article and hope it is written soon.

  23. Billy 03.17.06 / 19PM

    I looks like a lot of people over at rubyonrails.com would also like article 2 to be more detailed.

    I think I can speak for everyone here to say this is a GREAT article and our expectations for article 2 are very high now.

    Thanks for the article – I look forward to the second edition

  24. bathow 03.17.06 / 20PM

    Wow. This is awesome!! Thanks for this excellent article. We are starting to experience scalability issues on www.crispynews.com and we were just getting nervous about this stuff.

    Right now this is pure gold to us. (esp the component stuff. No wonder some pages were so slow.) Thanks a ton!

  25. Justin Dossey 04.06.06 / 02AM

    Hey, what did you use to create that diagram?

  26. easyMobile 05.03.06 / 00AM

    “I could write many more questions, but I’m sure you get the idea.” A bulletin board on this website would be a nice idea :)

  27. prepaid 05.12.06 / 00AM

    Nice diagramm.

    I start to try it out now.

