Apr 5, 2010. Facets, multi-queries, and searching 3x faster

A number of Sphinx features are frequently overlooked, and multi-queries is one of them. I’ve heard a myth that Sphinx does not have faceted searching. In fact, it does, and does it well. Moreover, the multi-query mechanism is more powerful and flexible than just that; it lets you do more than facets alone. But the documentation never actually mentions the word “facets”, and voila, a myth is born. Well, time to debunk it!

So what are those multi-queries, and how can they make your searches 3x faster?

Multi-queries are just a mechanism that lets you send several search queries to searchd in one batch. That, in turn, lets searchd internally optimize common parts between the queries. And that’s where the savings come from.

Note that you don’t need to do anything special about those queries to batch them. Any queries can be batched. Any combination that you send is perfectly valid. There is a limit on the number of queries (controlled by the max_batch_queries directive since 1.10, and fixed at 32 before that), but that’s just a sanity check.

Multi-queries are currently only accessible via SphinxAPI. Support for multi-queries via SphinxQL might be added later, provided that MySQL server protocol client libraries allow that; at a glance, they do. The relevant API methods are AddQuery() and RunQueries(). (Fun fact: the regular Query() API method actually uses these two internally.) AddQuery() saves a snapshot of the current query settings resulting from all previous API calls, plus the query itself. So once you call AddQuery(), the settings attached to that query can no longer be changed; no subsequent API calls will affect the saved settings. And thus, you can use any other settings for your next queries (different sorting modes, different filters, etc). Then RunQueries() takes all the saved snapshots, sends them to searchd in one batch, receives the results from searchd (in one batch as well), and returns an array of multiple result sets.

Why use multi-queries? Generally, it’s all about performance. First, by sending queries and fetching results in one go, we always save a tiny bit on network roundtrips. Second, and much more importantly, searchd gets a chance to find common query parts across the batch and optimize for those. We add optimizations over time, so it makes sense to batch queries where you can: that way, once we implement new batch optimizations, they will automatically apply as you upgrade. When no known optimizations can be applied, queries are simply processed one by one, as usual. But in any case, optimizations or not, the results are guaranteed to be exactly the same as if you had run the queries one by one. So it’s entirely transparent to the application.

Why (or rather when) not use multi-queries? Queries in a batch naturally have to be independent. That is, query B cannot depend on the results of query A. That’s not always the case. For instance, you might want to display results from an ancillary index if and only if no search results were returned from the main one. Or use a different offset into the 2nd result set depending on the amount of matches in the 1st one. Separate queries (or separate batches) would be required in such cases.

As of this writing, there are two major batch optimizations that you should be aware of: common query optimization (available since 0.9.8) and common subquery optimization (available since 1.10).

Common query optimization is exactly how you implement facets, and it works as follows. In a batch, searchd identifies those queries where only the sorting and grouping settings differ, while the full-text part, filters, and other settings coincide. For such subsets of queries, it runs the expensive full-text searching operation only once, and builds multiple, differently sorted (or grouped) result sets off the found matches. For instance, if a batch contains 3 queries for “ipod nano”, where the 1st query picks the 10 cheapest matches, the 2nd groups the matches by store ID and then sorts the stores by rating, and the 3rd picks the max price, the full-text search for “ipod nano” will only be performed once, but 3 differently sorted and grouped result sets will be built from it.

Now, so-called faceted search is just a particular case where common query optimization applies. Indeed, you can implement it by sending a number of queries in one batch: one for the main search results, and a few others with different grouping and sorting settings (top-3 authors, top-5 vendors, top-20 sub-category counts, etc). And because only the sorting and grouping settings would differ, the batch gets optimized.
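A quick sketch of such a facet batch in PHP, using the standard SphinxAPI calls; note that the products index and the vendor_id and category_id attributes are made-up names for illustration, and running this requires sphinxapi.php and a live searchd:

```php
<?php
require ( "sphinxapi.php" );

$cl = new SphinxClient ();
$cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );

// 1st query: main result set, top-20 matches by relevance
$cl->SetLimits ( 0, 20 );
$cl->AddQuery ( "ipod nano", "products" );

// 2nd query: facet, top-5 vendors by match count
$cl->SetLimits ( 0, 5 );
$cl->SetGroupBy ( "vendor_id", SPH_GROUPBY_ATTR, "@count desc" );
$cl->AddQuery ( "ipod nano", "products" );

// 3rd query: facet, top-20 sub-categories by match count
$cl->SetLimits ( 0, 20 );
$cl->SetGroupBy ( "category_id", SPH_GROUPBY_ATTR, "@count desc" );
$cl->AddQuery ( "ipod nano", "products" );

// the full-text part is identical across all 3 queries,
// so searchd gets to run the search itself only once
$results = $cl->RunQueries ();
```
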

Common subquery optimization is even more interesting. It lets searchd identify common full-text parts within the queries. Even when the queries themselves differ, as long as some parts are common, the intermediate search results for those parts can be cached and shared between the queries. For instance, in this batch

barack obama president
barack obama john mccain
barack obama speech

there’s a common 2-word part (“barack obama”) that needs to be evaluated only once, then cached. That’s exactly what the common subquery (aka subtree) optimization does. The maximum cache size for every batch is strictly controlled by the subtree_docs_cache and subtree_hits_cache directives, so that even if a common part like “i am” matches 100 million documents, the server does not unexpectedly run out of RAM.
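Both cache caps go into the searchd section of the config file; a minimal sketch (the sizes here are arbitrary examples, not recommendations):

```
searchd
{
    # ... other searchd settings ...

    # per-batch caps for the common subtree cache (1.10+)
    subtree_docs_cache = 8M
    subtree_hits_cache = 16M
}
```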

But let’s go back to the common query optimization. Here’s sample code that runs the same query with three different sorting modes:

require ( "sphinxapi.php" );

$cl = new SphinxClient ();
$cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );

// 1st query: sort by relevance
$cl->SetSortMode ( SPH_SORT_RELEVANCE );
$cl->AddQuery ( "the", "lj" );

// 2nd query: sort by the "published" attribute, newest first
$cl->SetSortMode ( SPH_SORT_EXTENDED, "published desc" );
$cl->AddQuery ( "the", "lj" );

// 3rd query: same attribute, oldest first
$cl->SetSortMode ( SPH_SORT_EXTENDED, "published asc" );
$cl->AddQuery ( "the", "lj" );

// send all 3 queries to searchd in one batch
$res = $cl->RunQueries ();

How does one tell whether the optimization actually triggered and worked? If it did, the respective query log entries will have an extra “multiplier” field that tells how many queries were combined and processed together:

[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/ext 747541 (0,20)] [lj] the

Notice that “x3” part? It’s the multiplier field, and it means that this query was optimized and processed as part of a 3-query subset (including this one). For the sake of comparison, here’s the log when the same queries were sent one by one:

[Sun Jul 12 15:18:17.062 2009] 0.059 sec [ext2/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.156 2009] 0.091 sec [ext2/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.250 2009] 0.092 sec [ext2/0/ext 747541 (0,20)] [lj] the

We can see that with a batch, per-query time improved 1.5x to 2.3x depending on the sorting mode. And that sample improvement is definitely not the limit; maybe even an understatement. Cases are known, for both optimizations, where query speed improved over 3x, and those are production cases, not synthetic tests! Common query optimization maps very well to vertical product search engines and online shops, and the common subquery cache maps nicely to data mining queries, but of course those aren’t the only eligible search types. You can, for one, skip full-text searching entirely and compute multiple reports (with different sorting and grouping) in one batch.
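Such a report-only batch could look like this sketch: an empty query string makes Sphinx match all documents with no full-text search, and each batched query just groups that same set differently (again, the index and attribute names are made up for illustration):

```php
<?php
require ( "sphinxapi.php" );

$cl = new SphinxClient ();

// empty query string = match all documents, no full-text search

// 1st report: match counts per store, biggest stores first
$cl->SetGroupBy ( "store_id", SPH_GROUPBY_ATTR, "@count desc" );
$cl->AddQuery ( "", "products" );

// 2nd report over the very same matches, grouped by vendor instead
$cl->SetGroupBy ( "vendor_id", SPH_GROUPBY_ATTR, "@count desc" );
$cl->AddQuery ( "", "products" );

$reports = $cl->RunQueries ();
```
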

What kind of batch optimizations can you expect in the future? That depends on you. Our current long-term plan includes an optimization for identical queries with different sets of filters. Know another frequent querying pattern that just might be subject to a clever optimization? Let us know!



This entry was posted on Monday, April 5th, 2010 at 13:17 and is filed under General.

11 Responses to “Facets, multi-queries, and searching 3x faster”

  1. Jon says:
    April 5, 2010 at 16:47

    Please make a build containing real time updates for us! My new site needs it badly.

  2. Hengjie says:
    April 6, 2010 at 17:40

    You say that “(available since 1.10).” in your article but I cannot find version 1.10 anywhere on the site. The latest version is 0.9.9 with a 0.9.10 teaser.

    Where can I find version 1.10?

    Thanks.

  3. shodan says:
    April 6, 2010 at 17:49

    It’s available from public SVN mirror at code.google.com/p/sphinxsearch

  4. Sujith says:
    April 7, 2010 at 16:04

    Hi Shodan,
    I have been using Sphinx for the last year, but I am still not using multi-queries, even though I am getting results very fast. We have a very huge database and search happens very fast. But I am using multiple indexes. Do I really need to use multiple queries?

    Sujith

  5. shodan says:
    April 7, 2010 at 17:44

    Sujith, it’s as explained in the post. You might want to. You are not required to. Your call :)

  6. mluggy says:
    April 11, 2010 at 12:27

    If I’m running only a single query (say “bill clinton”) but the filters are changing between requests, will multiple queries help?

  7. shodan says:
    April 17, 2010 at 12:38

    mluggy, right now they will not, but that optimization might be added in the future.

  8. automoto dev says:
    April 20, 2010 at 11:59

    Sphinx multi-queries (for facets) in action:

    www.automoto.gr/search?System.item_type=auto

  9. Jaimie Sirovich says:
    June 19, 2010 at 13:04

    Shodan,

    A few questions about multiqueries and how they actually work if you have a moment. It might help to clarify for a few people reading here as well :) They’re basically a win no matter what, but a few things are worth wondering about:

    0. Even if it does nothing beyond not fetch the data twice, it’s a win, as it’s silly to basically run the same query over and over. But …

    A. This sounds like you might have to materialize the result in some temporary structure. I can imagine this can be optimized away for certain cases, but not all. If you *can* actually return results and tabulate some sort of aggregate or grouped resultset simultaneously that would be ideal.

    So can that happen? I’m guessing this might have something to do with the limitations on filters getting optimized in this way.

    B. If not, does Sphinx use multiple threads to tabulate the groups after materializing it? Or does it execute them in series?

    MySQL’s multiqueries run in series, but it’s completely possible to open multiple connections and use mysqlnd’s async setting to “parallelize” multiqueries. I’ve tested this, so the above comes to mind. Of course the best would be if you can return&scan (without building a temp) the results as you’re building the various smaller structures for aggregates and return those afterward . . .

    J.

  10. shodan says:
    June 19, 2010 at 13:54

    Jaimie, we don’t materialize any intermediate set, all result sets are incrementally updated with every found match as we go. And yes, GROUP BY is supported by this, that’s the whole point that makes facets (for one) so efficient with MQ.

  11. Alex K says:
    May 18, 2011 at 18:20

    Thank you for the post! I just built a layer on top Sphinx to make this process easier. Feel free to have a look at fSphinx (Faceted Search with Sphinx):

    https://github.com/alexksikes/fSphinx

    Cheers,

    Alex
