The adventures of scaling, Stage 2 March 20th

What is this series about?

While a couple of high-traffic sites are powered by Rails, and while the Rails book has a handful of instructions for scaling your application, it was apparent to us that at a certain point you're on your own. This series of articles is meant to serve as a case study rather than a generic “How To Scale Your Rails Application” piece, which may or may not be possible to write at all. I'm outlining what we did to improve our application's performance; your mileage may obviously vary.

Our journey is broken up into 4 separate articles, each covering a milestone in scaling the eins.de codebase. The articles are scheduled for posting a week apart.

Stage 2 contains MySQL tuning tips, tuning of FastCGI dispatchers, and further system optimization techniques.

See also:
  • The adventures of scaling, Stage 1
  • Questions and answers for Stage 1
  • The adventures of scaling, Stage 3
  • The adventures of scaling, Stage 4


Christmas Season ahead

In December, still behind the figures of the old application, we tried several things somewhat in despair as creativity was slowly running out.

As it was still impossible to spot a single point of failure (or slowness, for that matter), we tried to remove a couple of things from the equation.

First of all, we compiled Ruby from source instead of using the Debian-supplied binaries. Debian tends to factor in non-standard patches, which is, by all means, a good thing. However, since we aimed for a common denominator with most other Rails installations, we jumped through the hoops of compiling Ruby from source, reinstalled all gems, and even installed the i686-optimized libc6 packages along the way.

In the same breath we switched from the Debian-supplied MySQL binaries to the MySQL.com-supplied binaries, also known as the “official” binaries. MySQL AB does some secret voodoo to their binaries that has earned them a reputation for being a little speedier and sometimes a little more stable. This also upgraded our installs from MySQL 5.0.16 to MySQL 5.0.17 along the way.

In general, MySQL 5 has proven really stable for a product that has only recently been tagged with the “it's stable!” sticker. We've had a few replication issues here and there (mostly resulting from duplicate auto-increment keys, something which has since been fixed in MySQL 5.0.19), but the daemon process only crashed once or twice over the course of several months with a constant load of between 2,000 and 3,000 queries per second.

While we are talking about MySQL

A database may obviously cause a whole application to slow down if configured incorrectly or optimized the wrong way. There are whole books and webinars devoted to this topic so I’ll keep this brief.

eins.de makes use of MySQL's FULLTEXT search functionality, which is, unfortunately, only available in the MyISAM storage engine, which in turn doesn't support transactions. Plus, you have to optimize your MySQL server to handle both InnoDB and MyISAM tables equally well.

Your only options around this are:

  1. Buy another database machine that transparently transforms the table type to MyISAM while replicating from an InnoDB master (which is what Flickr does, I've heard). This, however, requires careful architecting around Rails' limitation of not being able to differentiate between a reader and a writer database, if it's possible at all. We didn't go that route as we didn't have the budget for a third database machine.
  2. Don't use FULLTEXT indices at all; instead, bolt something onto your database, such as building your own search engine on top of Ruby-Odeum. Thinking about keeping those indices current with database changes, as well as the time needed to perform a full index rebuild, gave me headaches, so we didn't go that route either.

So we were left with our mixed setup.
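For the record, a FULLTEXT query from Rails 1.0-era code is plain SQL. The sketch below is a minimal, hypothetical example — the `articles` table, its indexed columns, and the `Article` model are assumptions for illustration, not taken from the eins.de schema:

```ruby
# Build a parameterized FULLTEXT query against a hypothetical MyISAM
# `articles` table with a FULLTEXT index on (title, body).
def fulltext_search_sql(terms)
  ["SELECT * FROM articles " \
   "WHERE MATCH (title, body) AGAINST (? IN BOOLEAN MODE) " \
   "ORDER BY id DESC LIMIT 20",
   terms]
end

# In a model or controller, one would then run something like:
#   Article.find_by_sql(fulltext_search_sql("scaling rails"))
```

The conditions array keeps the search terms out of the SQL string, so quoting is left to the database adapter.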

Currently, the available memory (4GB) is split between InnoDB and MyISAM in about two thirds for InnoDB (since it’s the majority of tables and also uses memory to cache actual row data as opposed to MyISAM’s caching of index data only) and the remaining third for MyISAM. Additionally, we use a relatively large query cache (which has proven useful as per MySQL’s SHOW STATUS command).

In configuration variables:


key_buffer               = 700M
myisam_sort_buffer_size  = 128M
query_cache_size         = 64M
innodb_buffer_pool_size  = 1600M

Tuning connection related settings in MySQL should not be necessary as your dispatchers use a single, persistent connection to your database.
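As a rough sanity check, the values above should leave ample headroom in the 4GB for per-connection buffers, temporary tables, and the OS page cache. A back-of-the-envelope in Ruby, using the my.cnf figures from the snippet above:

```ruby
# Rough check of the my.cnf buffer sizes above against 4GB of RAM.
total_ram_mb            = 4096
key_buffer              = 700   # MyISAM index cache
myisam_sort_buffer_size = 128
query_cache_size        = 64
innodb_buffer_pool_size = 1600  # caches InnoDB index *and* row data

global_buffers = key_buffer + myisam_sort_buffer_size +
                 query_cache_size + innodb_buffer_pool_size
# 2492 MB in total, with roughly two thirds going to InnoDB

headroom = total_ram_mb - global_buffers
# ~1.6GB left for per-connection buffers, temp tables and the OS
```

This confirms the “about two thirds for InnoDB” split mentioned above: 1600 of the 2492 MB of global buffers.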

Resuming application optimization

Finding out how many dispatchers make sense for an application is a trial-and-error process; there's no one-size-fits-all solution. Basically, you have to cover demand in the peak hours while not clogging a single machine with so many parallel processes that it grinds to a halt because processes block each other at the CPU level. This is what your system load tells you.

With our original setting of 20 dispatchers per application server this was clearly the case; a system load of 30 or more was very common. Things went more smoothly once we lowered the number of dispatchers to 10, but even that was apparently too much.

A simple calculation based on the number of page impressions you serve on a given day should give you a good indication of the total number of listeners that makes sense for your application. eins.de got about a million page impressions a day at that point. Since its visitors all sit in the same timezone, this million isn't evenly distributed across the 24 hours of a day but spread over about 14 hours (from 9 in the morning to 11 at night). With a little abstraction this boils down to:


1M page requests / 14 hours = 20 requests per second

Presuming that processing an average page request should take no longer than a second, we need 20 listeners to support that load. To be able to handle peaks, we therefore reduced the listeners per application server further from 10 to 7, giving a total of 28 listeners to serve our application.
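The sizing above can be reproduced in a few lines of Ruby. The one-request-per-second-per-listener figure is, of course, a simplification:

```ruby
# Dispatcher sizing as described above: a day's page impressions spread
# over the active hours, assuming one listener handles ~1 request/second.
page_impressions = 1_000_000
active_hours     = 14

requests_per_second = page_impressions / (active_hours * 3600.0)
# ~19.8 requests per second on average

listeners_needed = requests_per_second.ceil
# 20 listeners to cover an average day

# Four application servers at 7 listeners each give 28 listeners,
# leaving headroom above the 20 needed for peaks.
total_listeners = 4 * 7
```

In practice, faster average response times raise the effective capacity per listener, which is why the headroom matters more than the exact average.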

Additionally, the Rails installation on each system was successively upgraded to Rails 0.14.4 and later to the final 1.0 release. This was, however, a purely evolutionary step; nothing broke, and nothing improved performance-wise.

Driven by reports of generally better performance and improved memory and process management in the Linux kernel 2.6 series, we upgraded all machines from 2.4.27 to 2.6.14 around mid-December. Believe it or not, according to the Cacti graphs it did indeed make a difference. Not so much for the database servers, but the load on the application servers dropped significantly: records show the system load went down from 8 to 5 in busy hours. Just by upgrading the kernel, that is.

Getting a little enthusiastic about the achievements of the kernel upgrade, we went ahead and opened the floodgates of the application servers to both database machines again. Up to this point all requests had been served by a single machine; the other just sat there silently replicating, waiting for the primary to fail.

As we were not keen to get haproxy back on board, we simply pointed two application servers each at a single database machine, in a 2:1 fashion.

Occasionally, though, replication was not able to keep up with the number of writes done on one side of the master-master setup. If, in such an unlucky situation of being a couple of seconds apart, a user bounced from one application server to another that was, by coincidence, connected to a different database server, things got awkward. The worst examples were AJAX requests fired off by one dispatcher while the result of the operation was queried by another dispatcher on the other database, which then returned out-of-sync results. Simple toggles that usually return instantly still returned, but looked as if the request hadn't gone through, as if the information hadn't been changed. Bad!
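One conceivable guard against this failure mode — which, to be clear, we did not have in place at the time — is to check a replica's lag before trusting its answers. A minimal sketch, assuming `slave_status` stands in for the row a MySQL client library returns for `SHOW SLAVE STATUS`, and with an arbitrary lag threshold:

```ruby
# Sketch: decide whether a replica is fresh enough to serve reads.
# `slave_status` stands in for the SHOW SLAVE STATUS result row.
MAX_LAG_SECONDS = 2  # arbitrary threshold; an assumption, not our config

def replica_safe_to_read?(slave_status)
  lag = slave_status["Seconds_Behind_Master"]
  return false if lag.nil?  # replication stopped or broken
  lag.to_i <= MAX_LAG_SECONDS
end
```

A user who just wrote data would then be routed back to the master until the replica had caught up, at the cost of extra load on the master.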

To lighten the number of write requests, we ripped out our custom-made token authentication system, as each page request not only had to update session data in the ActiveRecordStore but also update a users-online table and the token data. That's at least 3 write requests without the user touching any real data.

Thing is, while multiple MySQL threads can update multiple tables (or even multiple rows in a single table, thanks to InnoDB's row-level locking) at once, you only get a single thread applying the writes replicated from the other master. By the numbers, that's a clear win for the local database threads over the remote replication process. And that got us in trouble. More than once.

As an aside, we moved memcached between the database boxes to better distribute the load, as the other machine was handling database connectivity for the phpAdsNew instance.

By the end of December we were finally up in numbers. We even reached a million page impressions a couple of times. Traffic was up to about 85GB a day.

The setup now looked as follows:

[Diagram: the server setup at the end of stage 2]

Stay tuned for the third part of the scaling series due for posting on Monday, March 27th containing more memcached best practices, session optimization, and further system optimization techniques.

Filed Under: Rails

7 comments

  1. atmos 03.20.06 / 16PM

    I’m really digging these articles man. :)

  2. Bradley Taylor 03.20.06 / 17PM

    Great post!

    A couple of questions…

    If your application servers are only running 7 dispatchers each, how much of the 2GB of RAM per server is actually being used? 7 dispatchers on Debian should only need a small portion of that.

    Did you test with fewer dispatchers and find that the application server CPU was underutilized? Did you retest the number of required dispatchers after the upgrade to 2.6?

    Thanks for sharing!

  3. Ingo 03.20.06 / 19PM

    I didn't get how you solved the issue of the slow database replication: “looked as if the request didn't go through”.

    Please explain more about how you solved it. Is memcached caching database queries or partial views? Are you accessing it directly through the memcached API or through Rails' caching system?

    I've bumped into www.continuent.org but haven't tried it yet; I don't know its write performance.

  4. Guido Sohne 03.22.06 / 16PM

    I can’t thank you enough for writing about this topic!

    I’m not just looking at scaling Rails but also at high availability and redundancy. Your articles are a great help to me now since we’re not yet at the implementation stage and we can draw on your experiences.

    You ditched haproxy in order to streamline the situation and remove an additional variable. I’m curious as to what impact this had on availability in face of a server failure and how you designed around not using haproxy …

    In other words, some information about not just scaling but availability would be very much appreciated here.

  5. Trey 03.24.06 / 08AM

    I also can't thank you enough for these articles and am anticipating the next one.

    One quick question: does having 7 dispatchers on an application server indicate that the server can only handle 7 requests per second?

  6. Wrighty 03.24.06 / 12PM

    Trey, that would only be true if all of your controllers / actions took exactly 1 second to respond to requests. Which is unlikely with beefy servers in this example and the use of memcached.

    However, I think you're right that this setup would only be able to handle 7 requests at once, though I guess there'd be some queuing to buffer requests.

  7. Wayne 03.24.06 / 19PM

    I’m another person who thinks these articles are great.

    Following up on the previous comment by Wrighty, what kind of requests/sec can someone expect Ruby on Rails to handle (let's assume a 1-server config)?

