December 4th, 2012

Architecture behind our new Search and Explore experience

Search is front-and-center in the new SoundCloud, key to the consumer experience. We’ve made the search box one of the first things you see, and beefed it up with suggestions that let you jump directly to people, sounds, groups, and sets of interest. We’ve added a brand-new Explore section that guides you through the huge and dynamic landscape of sounds on SoundCloud. And behind the scenes, we’ve completely overhauled our search infrastructure, which helps us provide more relevant results, scale with ease, and experiment quickly with new features and models.

In the beginning

In SoundCloud’s startup days, when we were growing our product at super speed, we spent just two days implementing a straightforward search feature, integrating it directly into our main app. We used Apache Solr, as was the fashion at the time, and built a master-slave cluster with the common semantics: writes to the master, reads from the slaves. Besides a separate component for processing deletes, our indexing logic was pull-based: the master Solr instance used a Solr DataImportHandler to poll our database for changes, and the slaves polled the master.

At first, this worked well. But as time went on, our data grew, our entities became more complex, and our business rules multiplied. We began to see problems.

The problems

  • The index was only getting updated on the read slaves about every fifteen minutes. Being real-time is crucial in the world of sound: our users’ sounds need to be searchable immediately after upload.
  • A complete re-index of the master node from the database took up to 24 hours, which turned a simple schema change for a new feature or bugfix into a Sisyphean task.
  • Worse, during those complete-reindex periods, we couldn’t do incremental indexing because of limitations in the DataImportHandler. That meant the search index was essentially frozen in time: not a great user experience.
  • Because the search infrastructure was tightly coupled to the main application, low-level Solr and Lucene knowledge leaked everywhere. If we wanted to make a slight change, such as to a filter, we had to touch 25 parts of the main application’s codebase.
  • Because the application wrote directly to the Solr master, the master was a single point of failure: if it went down, we lost updates, and our only reliable recovery was a complete reindex, which took up to 24 hours.
  • Since Solr uses segment or bulk replication, our network was flooded when the slaves pulled updates from the master, which caused additional operational issues.
  • Solr felt like it was built in a different age, for a different use-case. When we had performance problems, it was opaque and difficult to diagnose.

In early 2012, we knew we couldn’t go forward with many new features and enhancements on the existing infrastructure. We formed a new team and decided we should rebuild the infrastructure from scratch, learning from past lessons. In that sense, the Solr setup was an asset: it could continue to serve site traffic, while we were free to build and experiment with a parallel universe of search. Green-field development, unburdened by legacy constraints: every engineer’s dream.

The goals

We had a few explicit goals with the new infrastructure.

  • Velocity: we want high velocity when working on schema-affecting bugs and features, so a complete reindex should take on the order of an hour.
  • Reliability: we want to grow to the next order of magnitude and beyond, so scaling the infrastructure—horizontally and vertically—should require minimal effort.
  • Maintainability: we don’t want to leak implementation details beyond our borders, so we should strictly control access through APIs that we design and control.

After a survey of the state of the art in search, we decided to abandon Solr in favor of ElasticSearch, which would, in theory, address our reliability concerns in their entirety. That left velocity and maintainability to be solved in how we integrated with the rest of the SoundCloud universe. To the whiteboard!

The plan

On the read side, everything was pretty straightforward. We’d already built a simple API service between Solr and the main application, but had hit some roadblocks when integrating it. We pivoted and rewrote the backend of that API to query ElasticSearch instead: an advantage of a lightweight service-oriented architecture is that refactoring a service in this way is fast and satisfying.
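
To make the read path concrete, here is a minimal sketch of what such an API backend boils down to. It is an illustration only, not our service code: the host, index name, and field are assumptions, and it talks plain HTTP to ElasticSearch’s _search endpoint.

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumption: a local ElasticSearch node
INDEX = "sounds"                  # hypothetical index name

def search_sounds(query, limit=10):
    """Translate an API-level search request into an ElasticSearch query
    and return the matching documents."""
    body = {"query": {"match": {"title": query}}, "size": limit}
    resp = requests.post(
        "%s/%s/_search" % (ES_URL, INDEX),
        data=json.dumps(body),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```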

On the write side—conceptually, at least—we had a straightforward task. Search is relatively decoupled from other pieces of business and application logic. To maintain a consistent index, we only need an authoritative source of events (entity creates, updates, and deletes) and some method of enriching those events with their searchable metadata. We had that in the form of our message broker. Every event that search cared about was being broadcast on a well-defined set of exchanges: all we had to do was listen in.

But depending on broker events to materialize a search index is dangerous. We would be susceptible to schema changes: the producers of those events could change the message content, and thereby break search without knowing, potentially in very subtle ways. The simplest, most robust solution was to ignore the content of the messages, and perform our own hydration. That carries a runtime cost, but it would let us better exercise control over our domain of responsibility. We believed the benefits would outweigh the costs.
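
As a rough sketch of that approach, the consumer below treats each broker message purely as a signal and re-reads the entity from the database. It assumes an AMQP broker consumed with the pika client; the exchange name, routing-key layout, and helper functions are hypothetical stand-ins, not our actual code.

```python
import pika  # assumption: the broker speaks AMQP and we consume it with pika

def hydrate_from_db(entity_type, entity_id):
    """Hypothetical stand-in for the database lookup that re-reads the
    entity's searchable metadata; the event payload itself is never trusted."""
    raise NotImplementedError

def index_document(entity_type, entity_id, document):
    """Hypothetical stand-in for the ElasticSearch write (see the
    versioning sketch further down)."""
    raise NotImplementedError

def handle_event(channel, method, properties, body):
    # Treat the message purely as "entity X changed"; ignore its payload
    # and hydrate the document from the database, our ground truth.
    entity_type, entity_id = method.routing_key.split(".")[:2]  # hypothetical key layout
    index_document(entity_type, entity_id, hydrate_from_db(entity_type, entity_id))

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
queue = channel.queue_declare(queue="", exclusive=True).method.queue
channel.queue_bind(exchange="sound", queue=queue, routing_key="#")  # hypothetical exchange
channel.basic_consume(queue=queue, on_message_callback=handle_event, auto_ack=True)
channel.start_consuming()
```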

Using the broker as our event source, and the database as our ground truth, we sketched an ingestion pipeline for the common case. Soon, we realized we had to deal with the problem of synchronization: what should we do if an event hadn’t yet been propagated to the database slave we were querying? It turned out that if we used the timestamp of the event as the “external” version of the document in ElasticSearch, we could lean on ElasticSearch’s consistency guarantees to detect and resolve synchronization problems.
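
A minimal sketch of that versioning trick, using ElasticSearch’s external version type over its REST API (the index name, document type, and host are assumptions):

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumption: a local ElasticSearch node

def index_with_external_version(index, doc_type, doc_id, document, event_timestamp):
    """Index a document using the event timestamp as ElasticSearch's
    'external' version; ES rejects writes whose version is not greater
    than the one already stored, which surfaces out-of-sync reads."""
    resp = requests.put(
        "%s/%s/%s/%s" % (ES_URL, index, doc_type, doc_id),
        params={"version": int(event_timestamp), "version_type": "external"},
        data=json.dumps(document),
        headers={"Content-Type": "application/json"},
    )
    if resp.status_code == 409:
        # Version conflict: the database slave we hydrated from has not yet
        # caught up with this event, so schedule the event for a retry.
        return False
    resp.raise_for_status()
    return True
```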

Once it hit the whiteboard, it became clear we could re-use the common-case pipeline for the bulk-index workflow; we just had to synthesize creation events for every entity in the database. But could we make it fast enough? It seemed at least possible. The final box-diagram looked like this:

[Figure: the final box-diagram of the new indexing pipeline]

Implementation and optimization began. Over a couple of weeks, we iterated over several designs for the indexing service. We eked out a bit more performance in each round, and in the end, we got where we wanted: indexing our entire corpus against a live cluster took around 2 hours, there was no search coupling left in the main application, and ElasticSearch itself gave us a path for scaling.
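
For the bulk-index workflow mentioned above, synthesizing creation events is conceptually as simple as walking the database and feeding the same pipeline. A sketch, with hypothetical table and helper names:

```python
def bulk_reindex(db_cursor, enqueue_event):
    """Drive a full reindex through the ordinary pipeline by synthesizing
    a creation event for every entity in the database. `db_cursor` and
    `enqueue_event` are hypothetical stand-ins for a DB-API cursor and the
    function that feeds events into the indexing pipeline."""
    db_cursor.execute("SELECT id, updated_at FROM sounds")  # hypothetical table
    for entity_id, updated_at in db_cursor:
        enqueue_event({
            "action": "create",          # create-or-update semantics downstream
            "type": "sound",
            "id": entity_id,
            "timestamp": updated_at,     # becomes the external version
        })
```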

The benefits

We rolled out the new infrastructure in a dark launch on our beta site. Positive feedback on the time-to-searchability was immediate: newly-posted sounds were discoverable in about 3 seconds (post some sounds and try it out for yourself!). But the true tests came when we started implementing features that had been sitting in our backlog. We would start support for a new feature in the morning, make schema changes into a new index over lunch, and run A/B tests in the afternoon. If all lights were green, we could make an immediate live swap. And if we had any problems, we could roll back to the previous index just as fast.
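
The post doesn’t spell out the swap mechanism, but ElasticSearch’s index aliases are one way to get exactly this kind of atomic promote-and-rollback; a sketch with hypothetical index and alias names:

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumption: a local ElasticSearch node

def swap_alias(alias, old_index, new_index):
    """Atomically re-point an alias at a new index (and back again for a
    rollback) via ElasticSearch's _aliases endpoint."""
    actions = {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
    resp = requests.post(
        "%s/_aliases" % ES_URL,
        data=json.dumps(actions),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

# Promote the freshly built index, or roll back by swapping the arguments:
# swap_alias("sounds", "sounds_v1", "sounds_v2")
```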

There’s no onerous, manual work involved in bootstrapping or maintaining nodes: everything is set up with ElasticSearch’s REST API. Our index definitions are specified in flat JSON files. And we have a single dashboard with all the metrics we need to know how search is performing, both in terms of load and search quality.
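
As an illustration of that REST-driven setup, creating a new index from one of those flat JSON definition files can be as small as this (the file path, index name, and host are hypothetical):

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumption: a local ElasticSearch node

def create_index_from_file(index_name, path):
    """Create an index whose settings and mappings live in a flat JSON
    file, as described above."""
    with open(path) as f:
        definition = json.load(f)
    resp = requests.put(
        "%s/%s" % (ES_URL, index_name),
        data=json.dumps(definition),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

# create_index_from_file("sounds_v2", "indices/sounds.json")
```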

In short, the new infrastructure has been an unqualified success. Our velocity is through the roof, we have a clear path for orders-of-magnitude growth, and we have a stable platform for a long time to come.

Enter DiscoRank

It’s not the number of degrees of separation between you and John Travolta: DiscoRank, short for Discovery Rank, is our modification of the PageRank algorithm to provide more relevant search results. We match your search against our corpus and use DiscoRank to rank the results.

The DiscoRank algorithm involves creating a graph of searchable entities, in which each activity on SoundCloud is an edge connecting two nodes. It then iteratively computes the ranking of each node using the weighted graph of entities and activities. Our graph has millions and millions of nodes and a lot more edges. So, we did a lot of optimizations to adapt PageRank to our use case, to be able to keep the graph in memory, and to recalculate the DiscoRank quickly, using results from previous runs of the algorithm as priors. We keep versioned copies of the DiscoRank so that we can swap between them when testing things out.
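
To make the idea concrete, here is a toy, warm-startable PageRank over a weighted edge list. It is only an illustration of the general technique, not the DiscoRank implementation; the node naming and activity weights are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=20, priors=None):
    """Toy weighted PageRank by power iteration over (source, target, weight)
    tuples. `priors` lets a previous run's ranks seed the iteration, in the
    spirit of the warm-started DiscoRank recalculation described above
    (dangling nodes are ignored for brevity)."""
    nodes = {n for src, dst, _ in edges for n in (src, dst)}
    out_weight = {n: 0.0 for n in nodes}
    for src, _, weight in edges:
        out_weight[src] += weight
    base = 1.0 / len(nodes)
    ranks = {n: (priors or {}).get(n, base) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst, weight in edges:
            if out_weight[src]:
                incoming[dst] += ranks[src] * weight / out_weight[src]
        ranks = {n: (1 - damping) * base + damping * incoming[n] for n in nodes}
    return ranks

# ranks = pagerank([("user:1", "sound:9", 1.0), ("user:2", "sound:9", 2.0)],
#                  priors=previous_ranks)  # previous_ranks from the last run
```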

How do we know we’re doing better?

Evaluating the relevance of search results is a challenge. You need to know what people are looking for (which is not always apparent from their query) and you need to get a sample that’s representative enough to judge improvement. Before we started work on the new infrastructure, we did research and user testing to better understand how and why people were using search on SoundCloud.

Our evaluation baseline is a set of manually-tagged queries. To gather the first iteration of that data, we built an internal evaluation framework for SoundCloud employees: thumbs up, thumbs down on results. In addition, we have metrics like click positions, click engagement, precision, and recall, all constantly updating in our live dashboard. This allows us to compare ranking algorithms and other changes to the way we construct and execute queries. And we have mountains of click log data that we need to analyse.
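
For example, precision and recall at a cutoff can be computed per query against those manually-tagged judgments; a minimal sketch (the metric definitions are standard, the function is not our dashboard code):

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k=10):
    """Precision and recall at rank k for a single query, given the ids
    returned by the engine and a hand-labelled set of relevant ids."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / float(len(top_k)) if top_k else 0.0
    recall = hits / float(len(relevant_ids)) if relevant_ids else 0.0
    return precision, recall
```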

We have some good improvements so far, but there’s still a lot of tuning that can be done.

Search as navigation

The new Suggest feature lets you jump straight to the sound, profile, group, or set you’re looking for. We’ll jog your memory if you don’t remember how something is spelled, or exactly what it’s called. Since we know the types of the results returned, we can send you straight to the source. This ability to make assumptions and customizations comes from knowing the structure and semantics of our data, and it gives us a huge advantage in applications like this.

Architecture-wise, we decided to separate the suggest infrastructure from the main search cluster, and build a custom engine based on Apache Lucene’s Finite State Transducers. As ever, delivering results fast is crucial. Suggestions need to show up right after the keystroke. It turned out we are competing with the speed of light for this particular use-case.
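
The real engine sits on Lucene’s finite state transducers (in Java); purely to illustrate the prefix-lookup-plus-ranking idea, here is a much simplified Python sketch using a sorted term list. The terms and scores are invented examples.

```python
import bisect

class PrefixSuggester(object):
    """A much simplified prefix suggester: a sorted term list searched with
    bisect, with candidates ordered by a score. This only illustrates the
    idea; it is not how the FST-based engine works internally."""

    def __init__(self, scored_terms):
        # scored_terms: iterable of (term, score) pairs
        self._terms = sorted(scored_terms)
        self._keys = [term for term, _ in self._terms]

    def suggest(self, prefix, limit=5):
        # All terms sharing the prefix form a contiguous slice of the sorted list.
        lo = bisect.bisect_left(self._keys, prefix)
        hi = bisect.bisect_right(self._keys, prefix + u"\uffff")
        matches = self._terms[lo:hi]
        return [term for term, _ in sorted(matches, key=lambda ts: -ts[1])[:limit]]

# suggester = PrefixSuggester([(u"skrillex", 9.1), (u"skream", 7.4), (u"sbtrkt", 6.8)])
# suggester.suggest(u"sk")  # -> [u"skrillex", u"skream"]
```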

The fact that ElasticSearch doesn’t come with a suggest engine turned out to be a non-issue; if anything, it forced us to build this feature in isolation. That separation proved to be a wise decision, since the update rate, request patterns, and customizations are totally different and would have made a built-in solution hard to maintain.

Search as a crystal ball

Like Suggest, Explore is based on our new search infrastructure, but with a twist: our main goal for Explore is to showcase the sounds in our community at this very moment. That led to a time-sensitive ranking that puts more emphasis on the newness of a sound.
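
As an illustration of that kind of time-sensitive ranking, a relevance score can be multiplied by an exponential freshness decay. The decay shape and half-life below are assumptions, not our actual scoring formula.

```python
import math
import time

def freshness_boosted_score(relevance, created_at, half_life_hours=24.0):
    """Combine a relevance score with an exponential time decay so that
    newer sounds rank higher. `created_at` is a Unix timestamp; the
    half-life is an arbitrary example value."""
    age_hours = max(0.0, (time.time() - created_at) / 3600.0)
    decay = math.exp(-math.log(2) * age_hours / half_life_hours)
    return relevance * decay
```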

So far, so good

We hope you enjoy using the new search features of SoundCloud as much as we enjoyed building them. Staying focused on a re-architecture is painful when you see your existing infrastructure failing more and more each day, but we haven’t looked back—especially since our new infrastructure allows us to deliver a better user experience much faster. We can now roll out new functionality with ease, and the new Suggest and Explore, along with an improved Search itself, are just the first results.

Keep an eye out for more improvements to come!

Apache Lucene and Solr are trademarks of the Apache Software Foundation

Petar Djekic
  • DOING_IT_WRONG

    I think the title of the post is wrong. It must be: “How to build a search engine by stealing code from interviews”. Assholes!!

  • Vivek

    LOL !

  • Nik0

    Hi Team

    Have you made a comparison with Sphinx search? I’m interested in your opinion.

  • Peter Bourgon

    We didn’t evaluate Sphinx. It felt like a step sideways or even backwards from Solr, and had too-tight coupling with an RDBMS. We also had a lot of experience, mostly positive, with the Lucene engine, which we wanted to capitalize on. And the feature-set of ElasticSearch was unmatched, especially re: horizontal and vertical scaling.

  • Pingback: IT Operations News Roundup — Dec 3rd to 9th | Web Performance Monitoring and Optimization

  • www.facebook.com/lehndorff Peter Lehndorff

    Granted the search part of the new Sound Cloud is better. BUT. When will the comments be fixed? Pretty awesome that you think “So far so good.” while those of us with paying accounts are all still on Classic so that we can continue working. I’m sorry because I love SC and just had two musician friends leave because of the new SC.

  • www.facebook.com/profile.php?id=716891408 Michael V Papa

    Hey guys just to let you guys know I found a website called zippytune. com and they totally ripped off the design of their website from you guys. I hope they take their site down because soundcloud is awesome!

  • www.facebook.com/profile.php?id=716891408 Michael V Papa

    I dont know what happen to my other comment but, just to let you guys know zippytune copied soundclouds website design…

  • njam

    Interesting! Two questions remain unclear for me:

    SCHEMA CHANGE:
    I understand you’re doing a reindex in 2h if you change the schema, and then switch over to the new index. What happens to CUD-events during those 2h? Do you still update the active index, but re-apply the events to the new index once created?

    STALE META DATA IN SLAVE DB:
    So the Broker is reading the documents’ meta data from a slave database server. If the propagation to the broker is faster than db-replication, the document will be updated with old data. Now you’re using the update event’s timestamp as the version number for the document in elasticsearch? This way you will never overwrite newer documents with old data. But you still have the stale information in your search index until the document gets updated again, right?

    Thanks

  • www.facebook.com/people/Mark-Grant/100001427513331 Mark Grant

    I’m finding the comments very bad at the moment.

  • Bod

    Guys the Comment system is broken beyond belief! How did this get approved internally? It’s blatantly obvious that it doesn’t work in the slightest. If a message is longer than something like 20 characters, I have to go on some kind of ridiculous pilgrimage to the song’s dedicated Comments page, which, by the way, is incredibly difficult to find. Then I have to find the comment in that vertical list. Are you joking? If you want to seem like you’re passionate about your website and your goals, don’t release features that make it look like you couldn’t care less.

  • Peter Bourgon

    To your first question, the event stream is applied to all indices, even ones that are being built up from zero. This works because we’ve defined update to have create-or-update semantics.

    To your second question, we detect when the broker event has “beaten” DB replication through the version conflict. When that happens, we can put those events into a sort of retry-queue in the index workflow, with an exponential retry-backoff (and an eventual failure).

  • Nathan Leclaire

    This is so cool! Creating a search engine at this kind of scale would be a challenging but fun task to work on IMO. Keep up the good work Soundcloud devs! I don’t care what the haters say, I love the new interface.

  • njam

    Thanks for clarification!

  • Dantas

    When you guys publish a CUD event to the broker, you attach the updated_at timestamp, right?

    This timestamp will be used as “external versioning” in ElasticSearch, and when you have sync problems you “schedule” the event to the future with exponential retry-backoff.

    What do you mean by ‘version conflict’? Am I missing anything?

    btw, thanks for the blogpost

  • Dantas

    Hi Peter Bourgon,
    reading your ‘clarification’ answer to @njam I guess I missed something.

    When you guys publish a CUD event to the broker, you attach the updated_at timestamp, right?
    This timestamp will be used as “external versioning” in ElasticSearch, and when you have sync problems you “schedule” the event to the future with exponential retry-backoff. What do you mean by ‘version conflict’? Am I missing anything?

    thanks for the blog post

  • Peter Bourgon

    > When you guys publish a CUD event to the broker, you attach the update_at timestamp, right ?

    Yes, but the search infrastructure ignores it. We treat the event merely as a signal for a specific entity, and go to the DB [slave] for all the important data, including updated_at, which becomes the ES external version.

    > What do you mean for ‘version conflict’ ?

    ES will reject documents whose external versions aren’t *greater than* the version currently stored in the index. If (for example) a “create” event goes through our indexing pipeline before it’s been replicated to our slave, we’ll pick up the ‘old’ version—including, crucially, the old updated_at—which ES already knows about. When we try to index, ES will reject it as a version conflict, and we know we need to retry.

