Flexible ranking in Lucene 4
Posted by robert.muir

Over the summer I served as a Google Summer of Code mentor for David Nemeskey, PhD student at Eötvös Loránd University. David proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models with the new framework.

These improvements are now committed to Lucene’s trunk: you can use these models in tandem with all of Lucene’s features (boosts, slops, explanations, etc.) and queries (term, phrase, spans, etc.). A JIRA issue has been created to make it easy to use these models from Solr’s schema.xml.
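
As a quick illustration of how little plumbing is involved (a sketch, not code from the patch; the reader and query are assumed to come from elsewhere), switching the search side over to BM25 is just a matter of handing the new Similarity to the IndexSearcher:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similarities.BM25Similarity;

    public class BM25SearchExample {
      // Rank a query with Okapi BM25 instead of the default vector space model.
      // k1 controls term-frequency saturation, b controls length normalization;
      // 1.2 and 0.75 are the usual defaults.
      public static TopDocs searchWithBM25(IndexReader reader, Query query) throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));
        return searcher.search(query, 10);
      }
    }

Keep in mind that the Similarity also controls how length normalization is encoded at index time, so the same implementation should be set on the IndexWriterConfig when building the index.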

Relevance ranking is the heart of a search engine, and I hope the additional models and flexibility will improve the experience for Lucene users: whether you have been frustrated tuning TF/IDF weights and find that an alternative model works better for your case, have found it difficult to integrate the custom scoring logic your application needs, or simply want to experiment.

I’ll be giving a talk about how you can practically apply some of the upcoming Lucene 4 search features at Lucene Eurocon in October, and at the SFBay Apache Lucene/Solr Meetup later this month.

Highlights of the new scoring features:

  • New ranking algorithms, in addition to Lucene’s Vector Space Model:
    • Okapi BM25 Model
    • Language Models
    • Divergence from Randomness Models
    • Information-based Models
  • Added key statistics to the index format to support additional scoring models.
    • Term- and field-level statistics for collection frequencies and deriving averages.
    • Additional document-level statistics for computing normalization factors.
  • Decoupled matching from ranking in Lucene’s core search classes:
    • Customize scoring without digging into the “guts”.
    • Customize explanations: essential for debugging relevance issues.
  • Powerful low-level Similarity API, supporting advanced use cases:
    • Incorporate per-document values from Column Stride Fields into the score.
    • Use different scoring parameters or algorithms for different fields.
    • Fuse multiple scoring algorithms into a combined score.
  • Convenient high-level SimilarityBase for everything else:
    • Write your own scoring function in one Java method (see the sketch after this list).
    • Easy access to available index statistics.
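
To give a feel for SimilarityBase, here is a toy length-normalized TF/IDF written as a single score() method. This is a sketch against the trunk API, not committed code; in particular, check the exact BasicStats getter names against the javadocs:

    import org.apache.lucene.search.similarities.BasicStats;
    import org.apache.lucene.search.similarities.SimilarityBase;

    public class SimpleTFIDFSimilarity extends SimilarityBase {
      @Override
      protected float score(BasicStats stats, float freq, float docLen) {
        // idf computed from the collection-level statistics gathered for the field
        double idf = Math.log(1.0 + (double) stats.getNumberOfDocuments() / (stats.getDocFreq() + 1));
        // crude length normalization: raw term frequency divided by document length
        float tf = freq / docLen;
        return (float) (tf * idf);
      }

      @Override
      public String toString() {
        return "SimpleTFIDF";
      }
    }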

For more information about this GSoC project, take a look at its wiki page.

 

This entry was posted in Lucene, Relevancy, Solr by robert.muir.
