Stack Overflow running short on space

by Nick Craver on Feb 07, 2012 under Growing Pains

Stack Overflow handles a lot of traffic.  Quantcast ranks us (at the time of this writing) as the 274th largest website in the US, and that’s rising.  That means everything related to that traffic grows as well.  With growth, I like to split problems into 2 areas of concern: technical and non-technical.

Non-technical would be things like community management, flag queues, moderator counts, spam protection, etc.  These might (and often do) end up with technical solutions (that at least help, if not solve the problem), but I tend to think of them as “people problems.”  I won’t cover those here for the most part, unless there’s some technical solution that we can expose that may help others.

So what about the technical?  Now those are more interesting to programmers.  We have lots of things that grow along with traffic, to name a few:

  • Bandwidth (and by-proxy, CDN dependency)
  • Traffic logs (HAProxy logs)
  • CPU/Memory utilization (more processing/cache involved for more users/content)
  • Performance (inefficient things take more time/space, so we have to constantly look for wins in order to stay on the same hardware)
  • Database Size

I’ll probably get to all of the above in the coming months, but today let’s focus on our current issue: Stack Overflow is running out of space.  This isn’t news, and it isn’t shocking; anything that grows is going to run out of room eventually.

 

This story starts a little over a year ago, right after I was hired.  One of the first things I was asked to do was a growth analysis of the Stack Overflow DB. You can grab a copy of it here (excel format).  Brent Ozar, our go-to DBA expert/database magician, told us you can’t predict growth that accurately…and he was right.  There are so many unknowns: traffic from new sources, new features, changes in usage patterns (e.g. more editing caused a great deal more space usage than ever before).  I projected that right now we’d be at about 90GB of Stack Overflow data; we are at 113GB.  That’s not a lot, right?  Unfortunately, it’s all relative…and it is a lot.  Also, we currently have a transaction log of 41GB.  Yes, we could shrink it…but on the next re-org of PK_Posts (and, as a result, all the other indexes on the Posts table) it’ll just grow back to the same size.
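
A minimal sketch of the kind of check behind those numbers (standard SQL Server commands, not our exact maintenance scripts): DBCC SQLPERF(LOGSPACE) shows how big each log is and how much of it is in use, and it’s the next rebuild of PK_Posts, fully logged under the FULL recovery model that database mirroring requires, that pushes the log right back up after any shrink:

    -- How big is each transaction log, and how much of it is actually in use?
    DBCC SQLPERF(LOGSPACE);

    -- The kind of operation that makes the log balloon again after a shrink (example only):
    -- ALTER INDEX PK_Posts ON dbo.Posts REBUILD;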

 

These projections influenced our direction on scale: up vs. out.  We did a bit of both.  First, let’s look at the original problems we faced when the entire network ran off one SQL Server box:
  • Memory usage (even at 96GB, we were over-crammed, we couldn’t fit everything in memory)
  • CPU (to be fair, this had other factors like Full Text search eating most of the CPU)
  • Disk IO (this is the big one)

What happens when you have lots of databases is that all your sequential performance goes to crap, because it isn’t sequential anymore.  For disk hardware, we had one array for the DB data files: a RAID 10, 6-drive array of magnetic disks.  When dozens of DBs are competing in the disk queue, all performance is effectively random performance.  That means our read/write stalls were way higher than we liked.  We tuned our indexing and trimmed as much as we could (you should always do this before looking at hardware), but it wasn’t enough.  Even if it had been enough, there were still the CPU/memory issues of the shared box.
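
For a concrete picture of those stalls, here’s a rough sketch (a generic query against the file-stats DMV, not our exact monitoring) of average read/write latency per database file since the last restart:

    -- Average I/O stall per read and per write, per database file
    SELECT DB_NAME(vfs.database_id)                                   AS DatabaseName,
           mf.physical_name                                           AS DbFile,
           vfs.io_stall_read_ms  * 1.0 / NULLIF(vfs.num_of_reads, 0)  AS AvgReadStallMs,
           vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0) AS AvgWriteStallMs
    FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN   sys.master_files AS mf
           ON  mf.database_id = vfs.database_id
           AND mf.file_id     = vfs.file_id
    ORDER BY AvgReadStallMs DESC;

The io_stall counters are cumulative since startup, so sample twice and take the difference if you want a recent window rather than an all-time average.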

Ok, so we’ve outgrown a single box, now what?  We got a new one specifically for the purpose of giving Stack Overflow its own hardware.  At the time this decision was made, Stack Overflow was a few orders of magnitude larger than any other site we have.  Performance-wise, it’s still the 800 lb. gorilla.  A very tangible problem here was that Stack Overflow was so large and “hot,” it was a bully in terms of memory, forcing lesser sites out of memory and causing slow disk loads for queries after idle periods.  Seconds to load a home page? Ouch. Unacceptable.  It wasn’t just a hardware decision though, it had a psychological component.  Many people on our team just felt that Stack Overflow, being the huge central site in the network that it is, deserved its own hardware…that’s the best I can describe it.

Now, how does that new box solve our problems?  Let’s go down the list:

  • Memory (we have another 96GB of memory just for SO, and it’s not using massive amounts on the original box, win)
  • CPU (fairly straightforward: it’s now split and we have 12 new cores to share the load, win)
  • Disk IO (what’s this? SSDs have come out, game. on.)

We looked at a lot of storage options to solve that IO problem.  In the end, we went with the fastest SSDs money could buy.  The configuration on that new server is a RAID 1 for the OS (magnetic) and a RAID 10 of 6x Intel X-25E 64GB drives, giving us 177 GB of usable space.  Now let’s do the math on what’s on that new box as of today (a quick sketch for pulling these numbers yourself follows the list):

  • 114 GB – StackOverflow.mdf
  • 41 GB – StackOverflow.ldf
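
Those are the file sizes as SQL Server reports them; a minimal sketch for pulling them yourself (sys.master_files stores size in 8 KB pages) might look like:

    -- Data and log size per database, in GB
    SELECT DB_NAME(database_id)                                  AS DatabaseName,
           type_desc                                             AS FileType,
           CAST(SUM(size) * 8 / 1024.0 / 1024 AS DECIMAL(10, 2)) AS SizeGB
    FROM   sys.master_files
    GROUP BY database_id, type_desc
    ORDER BY SizeGB DESC;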

With a few other miscellaneous files on there, we’re up to 156 GB.  That leaves 21 of 177 GB free, roughly 12%.  Time to panic? Not yet.  Time to plan? Absolutely.  So what is the plan?

We’re going to be replacing these 64GB X-25E drives with 200GB Intel 710 drives.  We’re going with the 710 series mainly for the endurance they offer.  And we’re going with 200GB and not 300GB because the price difference just isn’t worth it, not with the high likelihood of rebuilding the entire server when we move to SQL Server 2012 (and possibly into a cage at that data center).  We simply don’t think we’ll need that space before we stop using these drives 12-18 months from now.

Since we’re eating an outage to do this upgrade (unknown date, those 710 drives are on back-order at the moment), why don’t we do some other upgrades?  Memory of the large-capacity DIMM variety is getting cheap, crazy cheap.  As the database grows, less and less of it fits into memory, percentage-wise.  Also, the server goes to 288GB (16GB x 18 DIMMs)…so why not?  For less than $3,000 we can take this server from 6x16GB to 18x16GB and just not worry about memory for the life of the server.  This also has the advantage of balancing all 3 memory channels on both processors, but that’s secondary.  Do we feel silly putting that much memory in a single server? Yes, we do…but it’s so cheap compared to, say, a single SQL Server license that it seems silly not to do it.

I’ll do a follow-up on this after the upgrade (on the Server Fault main blog, with a stub here).

growth, stack overflow
  • 8ctopus

    Great read.

  • Adam Berkan

    I work on an OSS project, bcache, that aims to use SSDs as a cache in front of your disks.  It’s 

  • Typecastexception.com John Atten

    Fascinating hearing about the “under-the-hood” issues we don’t see when we use the SO site. Keep up the great posts. Very informative!

  • Anonymous

    Tried going to your blog’s homepage and got hit with:
    i.imgur.com/rYMB3.png

    Ironic? Maybe. Comedic? For sure.

    • Anonymous

      Oops, it looks like this blog got *way* more traffic than I expected.  Caching is now cranked up, shouldn’t happen again – thanks for alerting me!

  • Andrew Armstrong

    Is it worth considering whether all data is equal?

    Can you archive closed questions that won’t ever get a response (to conserve space etc) to smaller rows or alternate storage?

    Or move entire questions and their comments to separate db servers, but still keep user information etc. on a single main server, so it’s only some things that need to hit a secondary server.

    Just off the top of my head, the main server could just list subjects and a pointer column to tell it to fetch the remaining data from server #2?

    • Anonymous

      Worth considering? Absolutely.  Worth doing? Not quite yet…*hopefully* never.  So far, hardware advances (combined with our continual performance and efficiency improvements) have completely outpaced our hunger for it. Who knows what kind of growth the next year will bring; if we continue to grow like crazy this may come up…and this kind of sharding will certainly be on the table for discussion.

      Before long, much faster processors for these database-level boxes are coming out: 10-core Xeons based on the Sandy Bridge-E architecture, IIRC.  Those are a tremendous increase in performance over what we have, and SSDs continue to grow as well (and there’s always the SAN option).  That means in short order (and relatively inexpensively in the grand scheme of things) we could get a box with low, single-digit CPU utilization and a crazy amount of space.  Because we can still greatly scale the current single-DB-per-site model (though not necessarily single-server, as SO now has its own), we’ll stay on that track, at least for the foreseeable future.

  • Me

    Why would you put a RAID 10 on SSD?

    • Anonymous

      The RAID 10 combo gives us a single volume for the data (only 177GB, so you wouldn’t want to split that up) and SSDs do fail…so you want redundancy there, especially on something as important as Stack Overflow.

      • Ryan

        but SSD deaths occur due to limited writes; RAID *mirroring* on SSD seems absolutely bizarre logic, surely?

        • Anonymous

          It may seem so at a high level, but in practice each SSD internally makes its own provisioning decisions, and failure times are mean numbers (since we’re ultimately talking about physical hardware failure, albeit at a tiny level).  That means the chance of both drives in a mirrored pair dying at the same time is very low; even within a narrow window it’s very, very low.  We also have George and Pete at the office just 5 minutes away, ready to go slap a replacement in ASAP were anything to happen.

          • Me

            The only reason you can rationally give for RAID mirroring SSD is “I want to have less capacity and spend more money”.  All else is handwaving and bull-something-smelly.

          • Mike

             What?  You

          • Ian

            The main reason, aside from the extra failsafe that Nick mentions, is performance. Multiple drives, properly RAIDed, can give even the Fusion IO a run for its money.

  • twitter.com/fs111 fs111

    I know you are a Windows-centric shop, but have you considered using HBase instead of SQL Server? It is made for exactly this kind of use case: hbase.apache.org

    • Anonymous

      We’ve looked at just about everything at one time or another…but there’s simply no compelling reason to switch.  Look at the investment we’re talking about above: less than $3k in memory and about $7,800 in drives (those numbers may be slightly off, I don’t have them in front of me, but they’re very close).

      In practical terms, even if one developer takes 4 weeks totally dedicated to this…that’s already more expensive. In reality it’d take much, much longer and many more resources to move.  The sites themselves aren’t the only thing running against those databases; the entire network would have to make a shift…or we’d have to context switch every day, which also isn’t a null cost.

  • Yuhong Bao

    BTW, can you add the names for stackoverflow.com and serverfault.com etc to the SSL certificate?

    • Anonymous

      We’ll be rolling out SSL across the network in the future…but it’s not a simple ordeal (there’s a huge amount of load SSL adds on our side).  Here’s a post by George (one of our SysAdmins) with a few examples: meta.stackoverflow.com/a/69177/135201

      In the near-term we’ll likely start redirecting stackoverflow.com and other sites from https:// to http://, but we need to chart everything out.  Since we don’t know at the time of the request which site you’re asking for (this is part of how SSL works), we either have to a) have a cert with all sites hitting that IP (likely) or b) point DNS for each site at a separate IP (clean…but messy too, and harder to maintain/scale).

      There are lots of issues with “just turning on” SSL, especially with as many domains as we have and with as much traffic as they handle.  I think this would warrant a blog.serverfault.com post…I’ll ping George/Kyle/Pete and see if we can put something together to explain all the details/decisions we’re currently making.

      • Malcolm Haak

        It’s called Pound and it’s awesome. One IP, one wildcard cert for your domain. All sub-domains covered with SSL, all on different back-ends if you want, but all via one (or more) front-facing IP addresses. So sub-domain vhosts for SSL, if you will.

        • Anonymous

          Yes indeed, we’re already looking at a shared multi-domain cert for all our specific domain names routed to a single IP, with *.stackexchange.com (the current primary cert) on another.  That does seem like the best direction for us to go from a management standpoint.

  • GMc

    Why don’t you just segment and horizontally scale the db?

    • twitter.com/tim594 Tim Hon

      i think its cuz sql server licensing is too expensive to go that route. one of the problems of picking microsoft i guess?

    • Anonymous

      In short: we don’t need to; cost-wise, it just doesn’t make sense at this point.  The SO database server *peaks* at 25% CPU, the average is well below that, and memory is dirt cheap.  Also, those numbers may drop coming up due to some other investments/changes.  The change in architecture to another DB solution is a huge one vs. a relatively very minor hardware investment.  Since we’re absolutely anal about performance, hardware continues to outpace our needs…so much so that a fundamental scaling change isn’t even on the radar at this point.

  • Invitadoi

    OMG full text search into db! get rid of this, use Sphinx! sphinxsearch.com/

    • twitter.com/lhaussknecht Louis Haußknecht

      This _had_ other factors. AFAIK they switched to a dedicated Lucene machine.

      • Anonymous

        Almost right: Lucene.Net is currently running across our web tier, redundant across many servers.  This will change coming up, to a dedicated service architecture (still on redundant machines of course). I’ll be sure to blog this if Marc doesn’t (I’ll link to it here either way).

        • blog.dp.cx dp

          Any reason to go Lucene (.Net or otherwise) instead of Sphinx? We’ve been using Sphinx happily for a couple of years, just curious about the deciding factors. 

          • Anonymous

            It’s currently running in the same .Net app domain as the Q&A sites themselves (they’re all the same application underneath)…that, plus “stick with what you know/are used to,” has both tangible and intangible benefits.

  • twitter.com/tim594 Tim Hon

    so wait, S.O. has no redundant/hot-backup database server?!

    • Anonymous

      I didn’t mention it here, but yes we do indeed have an identical hot backup running in a mirror configuration that’s just a few minutes behind at all times.  It’ll get the same upgrades.

  • Юрий Соколов

    PostgreSQL? Linux, full text search included, partitioning (by what?), several master-slave replication solutions.

  • Neerdis

    huh…! this is the challenge of science…
    we will….

  • Bob

    Did you look at Fusion IO? I’m a SysAdmin not DBA, but the performance on these PCIe cards is awesome, even compared to SSDs. They destroy SAN on cost and performance.

    • Anonymous

      Indeed we did, check out the linked Server Fault blog post for details on our research: blog.serverfault.com/2011/02/09/our-storage-decision/

      The single point of failure with Fusion IO was a deal-breaker there.  If/when they fail they’re supposed to go into a read-only state…unfortunately we’ve heard quite a few occurrences of hard, complete, unrecoverable death instead.  Also, our issue is space, not performance (since we can fit it all in memory thus far).  Cost-wise, Fusion IO doesn’t get us there.  SSDs on the other hand get us much closer, for a lower price and more redundancy.

      • Stas Antropov

        You can use Windows software RAID 1 with FusionIO cards. From what I hear, the performance impact is negligible.

        Also, RamSan 70 is a direct competitor to FusionIO and, at least comparing the sales pitches, it’s not worse, maybe better. The cost is around $20K. I also hear that RamSan can lend you one of those cards for 1-2 weeks to do your testing at no cost/no obligations.

        Disclaimer: I’m not associated with RamSan in any way. Simply doing similar research to solve a problem of our customer.

        • nickcraver.com/blog Nick Craver

          If we went with Fusion IO we’d be doing the software raid for sure…but that also means double the cost (let’s face it, Fusion IO is not exactly cheap), plus the backup server needs a similar configuration of sorts.  That’s quite a chunk of money considering we’re not waiting on disks *at all* right now.  Once we outgrow memory it may be an entirely different ballgame, but that’s not in the foreseeable future.

  • Anonymous

    Did you hear that Microsoft’s Hadoop efforts are already showing results, like a Hadoop-to-SQL Server connector for importing and exporting, and an ODBC driver for Hive for real-time querying into Hadoop from business intelligence tools?

  • twitter.com/aCraigPfeifer craigpfeifer

    Seems like a good short term solution to put in place while you mature your long term solution. Don’t spend too much time and energy rearranging the deck chairs on the Titanic. 

  • Tim Gilbert

    You are vertically scaling which is a game you can’t possibly win.  Eventually, you will lose.  Partitioning/Sharding SO from your sites duplicates your schema and only helps if your sites grow at a similar rate, which you clearly said they do not.

    I don’t fully understand why you aren’t using different technologies…  For example, Cassandra would be an interesting model for you due to its write optimization, timestamping for edits, excellent horizontal scalability, partition tolerance (i.e. no single point of failure), near-unlimited scaling, multiple datacenter support, etc.  Redis is another great option for web-server-based caching to relieve some load from the database.  Solr/Lucene might work for your full-text search, particularly with the NRT plugin.

    I’d really like to see another blog post explaining your decision to use a single database server for something like SO.  Until I read and understand your architectural decisions I will say this simply: “You’re doing it wrong.” :)

    • Anonymous

      Who said we’re not? :) Redis *is* our shared caching layer for the web tier (meta.stackoverflow.com/a/69172/135201) and we do use Lucene.Net for searching (blog.stackoverflow.com/2011/01/stack-overflow-search-now-81-less-crappy/; it runs on the web tier, not on these DB servers).  It’s just that those technologies/services/layers weren’t the topic of this particular post.

      • Tim Gilbert

        Thanks for the link to the other articles.  I’ll check them out.

  • Codey Whitt

    Interesting read!  We’ve done a similar setup here, and I’m just curious as to why you went with the Intel 710s.  We decided to go with 4 Crucial C300s in RAID 10 on a SATA III setup and we’re getting amazing speeds (770MB/sec reads, couldn’t test the writes since the array was mounted).  I will say we kind of took a little bit of a leap of faith on going with SATA III, since it was still a fairly new standard.

    • nickcraver.com/blog Nick Craver

      Part of it is that the Intels have proven to be very reliable for us, and the 710s have incredible write endurance (1.0PB-1.9PB for 0-20% over-provisioning: www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-710-series.html).  After a lot of searching just now, I still cannot find write endurance numbers for the C300 series; that’s a bit of a deal-breaker for us since we’re 40% writes/60% reads at the DB level.

  • Ryan Guill

    > Memory usage (even at 96GB, we were over-crammed, we couldn’t fit everything in memory)

    Can you talk more about this? It sounds like you are storing the DB in memory and writing to disk later? Is that right and if so can you point to some resources about how you do this in MSSQL?

    • Anonymous

      The DB is in memory, but this is handled by SQL Server itself (if you give it RAM, it’ll use it)…we’re doing nothing extraordinary there other than regular index tuning to keep things tidy.  Several layers of caching for various areas of the site on top also make restarts actually very painless, at least for now.
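
      For the curious, a rough sketch of how to see that in action (the standard buffer-pool DMV, not a query we necessarily run day to day): each row in sys.dm_os_buffer_descriptors is one 8 KB page SQL Server is holding in memory, so counting rows per database shows how much of each DB is cached:

          -- How much of each database currently sits in the buffer pool, in GB
          SELECT DB_NAME(database_id)                                 AS DatabaseName,
                 CAST(COUNT(*) * 8 / 1024.0 / 1024 AS DECIMAL(10, 2)) AS BufferPoolGB
          FROM   sys.dm_os_buffer_descriptors
          GROUP BY database_id
          ORDER BY BufferPoolGB DESC;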

      • Ryan Guill

         Ah I see.  Thanks for the information.

  • www.facebook.com/people/Nicolai-Søndergaard/683674066 Nicolai Søndergaard

    Any specific reasons for going with SQL server, and not a competitor?
    Also, what about Ramsan systems? Are they not worth it?

    • John Hoover

      Have you priced Ramsan systems at any point? They’re significantly cheaper than they used to be, but still silly expensive. Also, reading through, it seems that performance isn’t the limiting factor the way actual disk space is. Getting a TB of Ramsan is going to be a $x00K+ proposition. Not worth it when they can put almost 300GB of RAM in the box and have a similar situation, running most if not all of the DB in memory, then writing out to an array of fast SSDs for non-volatile storage.

  • Anonymous

    delete junk responses.  99% of space saved overnight!

    • Mohsen

      and how would you categorize junk?  
      not to mention all the stats that go with those ‘junk’ responses.

    • Guest

       better yet, delete junk questions

  • Ian

    Why not consider Intel 320s overprovisioned? You could purchase 600gb versions and provision them to 300-400gb for more space and less money.

    • nickcraver.com/blog Nick Craver

      There’s quite an endurance difference between the 320s (60TB: www.intel.com/content/www/us/en/solid-state-drives/ssd-320-enterprise-server-storage-application-specification-addendum.html) and the 710s (1.9PB: www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-710-series.html), so the 710s are quite a bit safer…and give us enough space (we hope, I could be proven wrong here) for the duration we want to use them.

      • Ian

        I can understand the desire for safety at the level of ops that SO does. 320s in RAID 10 have served us well in essentially the same setup as you have for SO (same servers, same RAID, just 320s instead of the X-25Es). At their cost we can afford to have a hot spare in our server as well. Just a more pragmatic way to go for those who don’t necessarily need top-cost endurance guarantees.

        • nickcraver.com/blog Nick Craver

          I absolutely agree, in fact the backup database server in Oregon is on 320s where we don’t need the same level of endurance.

          • Ian

            On this topic, Anandtech just came out with an analysis of Intel SSDs in Enterprise use. Excellent read, as usual:
            www.anandtech.com/show/5518/a-look-at-enterprise-performance-of-intel-ssds/

  • Badger

    110gb is NOT a lot of data, you can buy a box for a few thousand dollars with triple that much RAM and your problems go away for a year or so…

    • nickcraver.com/blog Nick Craver

      …did you read the article?  Make sure you read to the end before making assumptions :)

  • Jeff Johnson