Handle 1 Billion Events Per Day Using a Memory Grid

Monday, February 16, 2009 at 1:58AM

Moshe Kaplan of RockeTier shows the life cycle of an affiliate marketing system that starts off as a cub handling one million events per day and grows into a lion handling 200 million, and eventually up to one billion, events per day. The resulting system uses ten commodity servers at a cost of $35,000.

Mr. Kaplan's paper is especially interesting because it documents a system architecture evolution we may see a lot more of in the future: database centric --> cache centric --> memory grid.

As scaling and performance requirements for complicated operations increase, leaving the entire system in memory starts to make a great deal of sense. Why use cache at all? Why shouldn't your system be all in memory from the start?

General Approach to Evolving the System to Scale

  • Analyze the system architecture and the main business processes. Detect the main hardware bottlenecks and the related business process causing them. Focus efforts on points of greatest return.
  • Rate the bottlenecks by importance and provide immediate, practical recommendations to improve performance.
  • Implement the recommendations to provide immediate relief to problems. Risk is reduced by avoiding a full rewrite and spending a fortune on more resources.
  • Plan a road map toward next generation solutions.
  • Scale up and scale out when redesign is necessary.

One Million Events Per Day System

  • The events are common advertising system operations like: ad impressions, clicks, and sales.
  • A typical two-tier system: impressions and banner sales are written directly to the database (a sketch of this direct-write path follows this list).
  • The immediate goal was to process 2.5 million events per day, so something needed to be done.
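
To make the starting point concrete, here is a minimal sketch of what such a database-centric event path typically looks like: every impression, click, or sale triggers a synchronous insert before the response is returned. The connection details, table, and column names are assumptions for illustration, not taken from the paper.

    # Hypothetical sketch of the original database-centric path: every ad event
    # performs a synchronous INSERT before the HTTP response can be returned.
    import pymysql

    conn = pymysql.connect(host="db", user="ads", database="ads")

    def record_event(event_type, banner_id, affiliate_id):
        """Write a single impression/click/sale straight to the database."""
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO events (event_type, banner_id, affiliate_id) "
                "VALUES (%s, %s, %s)",
                (event_type, banner_id, affiliate_id),
            )
        conn.commit()  # one round trip and one commit per event

At one million events per day this is manageable; as volume grows, paying a full database round trip per event becomes the bottleneck.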

2.5 Million Events Per Day System

  • PerfMon was used to check web server and database performance counters. CPU usage hit 100% at peak load.
  • Immediate fixes included tuning SQL queries, implementing stored procedures, using a PHP compiler, removing include files, and fixing other programming errors (a small before/after for the stored-procedure change follows this list).
  • The changes successfully doubled the performance of the system within 3 months. The next goal was to handle 20 million events per day.
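
As a small illustration of the stored-procedure fix, an ad-hoc insert can be replaced with a single call to a pre-compiled procedure, cutting per-request SQL parsing and keeping validation logic server-side. The procedure name and arguments below are hypothetical; the paper does not list the actual queries.

    # Hypothetical before/after for the stored-procedure change (DB-API style).
    def record_event_adhoc(cur, event_type, banner_id, affiliate_id):
        # Before: SQL text is assembled and parsed on every request.
        cur.execute(
            "INSERT INTO events (event_type, banner_id, affiliate_id) "
            "VALUES (%s, %s, %s)",
            (event_type, banner_id, affiliate_id),
        )

    def record_event_proc(cur, event_type, banner_id, affiliate_id):
        # After: a pre-compiled stored procedure does the insert (and any
        # validation) server-side in one call.
        cur.callproc("sp_record_event", (event_type, banner_id, affiliate_id))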

20 Million Events Per Day System

  • Making this scaling leap required rethinking how the system worked.
  • The main load on the system came from validating inputs in order to prevent forgery.
  • A cache was maintained in the application servers to cut unnecessary database access. The result was a 50% reduction in CPU utilization.
  • An in-memory database was used to accumulate transactions over time (impression counting, clicks, sales recording).
  • A periodic process was used to write transactions from the in-memory database to the database server (a sketch of this accumulate-and-flush pattern follows this list).
  • This architecture could handle 20 million events using existing hardware.
  • Business projections required a system that could handle 200 million events per day.
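
The accumulate-and-flush pattern can be sketched in a few lines: counters are kept in memory on the application server (standing in here for the in-memory database) and a background job periodically writes the aggregated totals to the database in one batch. The class, flush interval, and SQL below are assumptions for illustration.

    # Minimal sketch of accumulate-and-flush, assuming an in-process counter
    # store; the real system used an in-memory database in the application tier.
    import threading
    from collections import Counter

    class EventAccumulator:
        def __init__(self, db_conn, flush_interval=60):
            self.db_conn = db_conn
            self.flush_interval = flush_interval
            self.counts = Counter()          # (banner_id, event_type) -> count
            self.lock = threading.Lock()

        def record(self, banner_id, event_type):
            """Hot path: count the event in memory, no database round trip."""
            with self.lock:
                self.counts[(banner_id, event_type)] += 1

        def flush(self):
            """Periodic job: push the aggregated counts to the database."""
            with self.lock:
                batch, self.counts = self.counts, Counter()
            if batch:
                with self.db_conn.cursor() as cur:
                    cur.executemany(
                        "INSERT INTO event_counts (banner_id, event_type, cnt) "
                        "VALUES (%s, %s, %s)",
                        [(b, t, c) for (b, t), c in batch.items()],
                    )
                self.db_conn.commit()
            threading.Timer(self.flush_interval, self.flush).start()

Calling flush() once starts the periodic cycle; each cycle replaces thousands of per-event inserts with a single batched write, which is what takes the load off the database at this stage.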

200 Million Events Per Day System

  • The next architectural evolution was to a scale out grid product. It's not mentioned in the paper but I think GigaSpaces was used.
  • A Layer 7 load balancer is used to route requests to sharded application servers. Each app server supports a different set of banners (a sketch of this banner-based routing follows this list).
  • Data is still stored in the database as the data is used for statistics, reports, billing, fraud detection and so on.
  • Latency was slashed because logic was separated out of the HTTP request/response loop into a separate process, and database persistence was moved offline.
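
The routing decision can be illustrated with a simple mapping from banner ID to application server, so that all events for a given banner land on the server holding that banner's in-memory state. The balancer's actual configuration is not described in the paper; the scheme below is only an assumed illustration.

    # Hypothetical illustration of banner-based sharding behind a Layer 7
    # balancer: the banner_id in the request determines the target app server.
    APP_SERVERS = [
        "app-0.internal", "app-1.internal", "app-2.internal",
        "app-3.internal", "app-4.internal",
    ]

    def route(banner_id: int) -> str:
        """Map a banner to the app server that owns its slice of the grid."""
        return APP_SERVERS[banner_id % len(APP_SERVERS)]

    # All events for banner 12345 always reach the same server, so its
    # counters never need cross-server coordination.
    assert route(12345) == route(12345)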

At this point the architecture supports near-linear scaling, and it's projected that it can easily scale to a billion events per day.

Related Articles

  • GridGain: One Compute Grid, Many Data Grids
Todd Hoff | 17 Comments | in Strategy, memory-grid

Reader Comments (17)

Dear Todd,

Thank you for the great post.
I'll be glad to answer any reader comments and clarify any issues here, via my email moshe.kaplan @ rocketier.com or on my blog: top-performance.blogspot.com/

Best,
Moshe

November 29, 1990 | Moshe Kaplan

Let me see if I have this right.

You have a case study that purportedly shows 200 million txns/day. Let's assume we believe that this is true.

Then you suggest that based on this a 5x increase should be trivial. And therefore yo
