Lodel + Solr

Posted by Jean-André Santoni on Monday, October 3, 2011

A few months ago, we started a complete rewrite of Calenda; the new version is powered by Lodel. Programmatically speaking, the biggest part of the public side of Calenda is its search engine: it is used everywhere, not only on the result pages but also on the front page and the event page.

Our search engine of choice is Lucene/Solr. We already use it to index, and provide faceted search on, all the content from Revues.org, Hypotheses.org and the current Calenda. Solr is fast, powerful, easy to use and maintain, and supported by a strong community. It also provides a very interesting feature: dynamic fields.

Dynamic fields

Here is a detailed explanation of dynamic fields.

The Solr configuration file is a simple, almost flat XML file. It is called schema.xml.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.4">

    <types>

        <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer/>
                <filter/>
            </analyzer>
            <analyzer type="query">
                <tokenizer/>
                <filter/>
            </analyzer>
        </fieldType>

    </types>

    <fields>

        <field name="id" type="string" indexed="true" stored="true" required="true"/>
        <field name="nature" type="string" indexed="true" stored="true" required="true"/>
        <field name="class" type="string" indexed="true" stored="true" required="true"/>
        <field name="type" type="string" indexed="true" stored="true" required="true"/>
        <field name="idtype" type="int" indexed="true" stored="true"/>
        <field name="idparent" type="int" indexed="true" stored="true"/>
        <field name="status" type="int" indexed="true" stored="true"/>
        <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

    </fields>

    <uniqueKey>id</uniqueKey>

    <defaultSearchField>text</defaultSearchField>

    <solrQueryParser defaultOperator="OR"/>

    <copyField source="*" dest="text"/>

</schema>

It lets you define the names and types of the fields that describe your data: for example the unique ID, the title, the subtitle, etc. This is fine for simple problems, but sometimes we don't know in advance what our data will be composed of. This is what dynamic fields are for: they make your schema.xml less hard-coded.
As you may already know, Lodel's main feature is its customizable Editorial Model: you can define your database structure directly from the back office. Thanks to dynamic fields, we wrote a schema.xml able to handle any data from Lodel, whatever your Editorial Model looks like!
This means that our work with Solr can be useful to any Lodel user, so we plan to distribute it in the next release.

Our schema.xml

Here is the schema.xml so far:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.4">

    <types>

        <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer/>
                <filter/>
            </analyzer>
            <analyzer type="query">
                <tokenizer/>
                <filter/>
            </analyzer>
        </fieldType>

    </types>

    <fields>

        <field name="id" type="string" indexed="true" stored="true" required="true"/>
        <field name="nature" type="string" indexed="true" stored="true" required="true"/>
        <field name="class" type="string" indexed="true" stored="true" required="true"/>
        <field name="type" type="string" indexed="true" stored="true" required="true"/>
        <field name="idtype" type="int" indexed="true" stored="true"/>
        <field name="idparent" type="int" indexed="true" stored="true"/>
        <field name="status" type="int" indexed="true" stored="true"/>
        <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

        <dynamicField name="*_int" type="int" indexed="true" stored="true"/>
        <dynamicField name="*_entries" type="int" indexed="true" stored="true"/>
        <dynamicField name="*_persons" type="int" indexed="true" stored="true"/>
        <dynamicField name="*_file" type="int" indexed="true" stored="true"/>
        <dynamicField name="*_boolean" type="int" indexed="true" stored="true"/>
        <dynamicField name="*_tinytext" type="string" indexed="true" stored="true"/>
        <dynamicField name="*_email" type="string" indexed="true" stored="true"/>
        <dynamicField name="*_date" type="string" indexed="true" stored="true"/>
        <dynamicField name="*_url" type="string" indexed="true" stored="true"/>
        <dynamicField name="*_text" type="text" indexed="true" stored="true"/>
        <dynamicField name="*_mltext" type="text" indexed="true" stored="true"/>

        <dynamicField name="*_int_m" type="int" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_entries_m" type="int" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_persons_m" type="int" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_file_m" type="int" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_boolean_m" type="int" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_tinytext_m" type="string" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_email_m" type="string" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_date_m" type="string" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_url_m" type="string" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_text_m" type="text" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_mltext_m" type="text" indexed="true" stored="true" multiValued="true"/>

    </fields>

    <uniqueKey>id</uniqueKey>

    <defaultSearchField>text</defaultSearchField>

    <solrQueryParser defaultOperator="OR"/>

    <copyField source="*" dest="text"/>

</schema>
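
To illustrate how a schema like the one above resolves an incoming field name to a type, here is a minimal sketch in PHP. The function name matchDynamicField and the shortened pattern list are my own assumptions for illustration; it only mimics suffix matching with a "longest pattern wins" tie-break, and Solr's own resolution rules may differ in detail.

```php
<?php

// Hypothetical helper: resolve a field name to a type using suffix patterns,
// in the spirit of the <dynamicField name="*_xxx" .../> declarations.
function matchDynamicField($fieldName, array $patterns) {
    $best = null;
    foreach ($patterns as $pattern => $type) {
        $suffix = substr($pattern, 1); // drop the leading "*"
        $matches = substr($fieldName, -strlen($suffix)) === $suffix;
        // Simple tie-break: prefer the longest matching pattern
        if ($matches && ($best === null || strlen($pattern) > strlen($best))) {
            $best = $pattern;
        }
    }
    return $best === null ? null : $patterns[$best];
}

// A shortened copy of the patterns declared in the schema
$patterns = array(
    '*_int'    => 'int',
    '*_int_m'  => 'int (multiValued)',
    '*_date'   => 'string',
    '*_mltext' => 'text',
);

echo matchDynamicField('title_mltext', $patterns), "\n";
echo matchDynamicField('R_E_places_int_m', $patterns), "\n";
```

A name that matches no pattern makes the helper return null. In Solr, the equivalent situation is a field that matches neither an explicit field nor a dynamic one, so it is worth keeping the patterns and the indexing code in sync.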

Indexing content

So I started writing a simple PHP script to index content from the Lodel SQL database.

Solr provides a web service that receives data through POST requests. You can generate XML and use cURL to send it to Solr and get your content indexed, but the easiest way is to use a library which takes care of the boring part. The best library I found for PHP is called Solarium. It is full-featured, well documented, and similar to other Solr libraries in Python and Perl.
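
For comparison, here is roughly what the "boring part" looks like when done by hand. This is only a sketch: buildAddXml and postToSolr are hypothetical helper names, and the update URL in the final comment is an assumption that depends on how your Solr instance is set up.

```php
<?php

// Hypothetical helper: build a Solr XML <add> message for one document.
// $fields maps field names to a scalar value or a list of values.
function buildAddXml(array $fields) {
    $xml = '<add><doc>';
    foreach ($fields as $name => $values) {
        foreach ((array) $values as $value) {
            $xml .= sprintf(
                '<field name="%s">%s</field>',
                htmlspecialchars($name, ENT_QUOTES, 'UTF-8'),
                htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8')
            );
        }
    }
    return $xml . '</doc></add>';
}

// Hypothetical helper: POST an XML message to a Solr update handler with cURL.
function postToSolr($url, $xml) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml; charset=utf-8'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch);
}

$xml = buildAddXml(array(
    'id'               => '619209',
    'class'            => 'event',
    'R_E_places_int_m' => array(1434398),
));
echo $xml, "\n";

// Assumed URL; adjust host, port and path to your Solr instance:
// postToSolr('http://localhost:8989/solr/update?commit=true', $xml);
```

Escaping, batching and error handling all have to be written by hand this way, which is why a client library is the more comfortable option.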

The indexing script is only a few lines of PHP and is pretty self-explanatory:

<?php

require('Solarium/Autoloader.php');
Solarium_Autoloader::register();

$client = new Solarium_Client();

$adapter = $client->getAdapter();
$adapter->setPort(8989);

// Delete all documents
$update = $client->createUpdate();
$update->addDeleteQuery('*:*');
$update->addCommit();
$result = $client->update($update);

// Connect to the database
$dbh = new PDO('mysql:host=localhost;dbname=xxxx', 'xxxx', 'xxxx', array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8") );

$docs = array();

// Send the buffered documents to Solr, commit, then empty the buffer
function update($client, &$docs) {
    if (!$docs)
        return;
    $update = $client->createUpdate();
    $update->addDocuments($docs);
    $update->addCommit();
    $result = $client->update($update);
    $docs = array();
    echo ".";
}

$objects = $dbh->prepare("SELECT * FROM objects WHERE class IN ('entities')");
$objects->execute();
while ( $object = $objects->fetch() ) {

    $doc = new Solarium_Document_ReadWrite();
    $doc->id = $object['id'];
    $doc->nature = $object['class'];

    switch ($object['class']) {
        case 'entities':
            $entities = $dbh->prepare("SELECT e.*, t.type, t.class FROM entities e, types t WHERE e.id = ? AND e.idtype = t.id");
            $entities->execute(array($object['id']));
            $entity = $entities->fetch();

            foreach (array('idtype','type','class','idparent','status') as $k) {
                $doc->addField($k, $entity[$k]);
            }

            // The table name comes from the editorial model and cannot be bound as a parameter
            $things = $dbh->prepare("SELECT * FROM {$entity['class']} WHERE identity = ?");
            $things->execute(array($object['id']));
            $thing = $things->fetch();

            $tablefields = $dbh->prepare("SELECT * FROM tablefields WHERE class = ?");
            $tablefields->execute(array($entity['class']));
            while ($f = $tablefields->fetch()) {
                $raw = isset($thing[$f['name']]) ? $thing[$f['name']] : '';
                // Strip control characters; the u modifier keeps multi-byte UTF-8 sequences intact
                $v = preg_replace('/[\p{Cc}]/u', '', $raw);
                if ($v)
                    $doc->addField($f['name'].'_'.$f['type'], $v);
            }

            $relations = $dbh->prepare("SELECT r.id2, r.nature, et.type FROM relations r, entries e, entrytypes et WHERE id1 = ? AND r.id2 = e.id AND e.idtype = et.id");
            $relations->execute(array($object['id']));
            while ($r = $relations->fetch()) {
                $doc->addField('R_'.$r['nature'].'_'.$r['type'].'_int_m', $r['id2']);
            }

            break;
    }

    // Send documents to Solr in batches of 1000
    $docs[] = $doc;
    if ( count($docs) == 1000 )
        update($client, $docs);
}

// Send the last, partial batch
update($client, $docs);

?>

There is only one big loop, on the 'objects' table. An object can be an entity, an entry or a person. For now, the script can only index entities, but we plan to support entries and persons in the next step.
The most interesting part, in simplified form, is here:

while ($f = $tablefields->fetch()) {
    $doc->addField($f['name'].'_'.$f['type'], $thing[$f['name']]);
}

As you can see, the name of the added Solr field results from the concatenation of its name and type in Lodel. If you defined a multilingual field named ‘title’ in your editorial model, the script will send it to Solr under the name ‘title_mltext’, and Solr will store it using the dynamic field ‘*_mltext’.
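
The naming convention can be summed up in a couple of small functions. This is a sketch only: solrFieldName and solrRelationFieldName are hypothetical names I use here for illustration, they are not part of the script above.

```php
<?php

// Hypothetical helper: Lodel field name + Lodel field type = Solr field name.
// The optional "_m" suffix selects the multiValued dynamic fields.
function solrFieldName($name, $type, $multiValued = false) {
    return $name . '_' . $type . ($multiValued ? '_m' : '');
}

// Hypothetical helper: relations are stored as multi-valued integer fields
// named after the relation nature and the entry type.
function solrRelationFieldName($nature, $entryType) {
    return solrFieldName('R_' . $nature . '_' . $entryType, 'int', true);
}

echo solrFieldName('title', 'mltext'), "\n";     // title_mltext
echo solrRelationFieldName('E', 'places'), "\n"; // R_E_places_int_m
```

Keeping the convention in one place like this would make it harder for the indexing code and the dynamic field patterns in schema.xml to drift apart.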

To better understand how Solr stores our documents, here is the result of a select query:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
        <lst name="params">
            <str name="q">*:*</str>
        </lst>
    </lst>
    <result name="response" numFound="19573" start="0">
        <doc>...</doc>
        <doc>...</doc>
        <doc>...</doc>
        <doc>...</doc>
        <doc>
            <str name="nature">entities</str>
            <str name="class">event</str>
            <str name="type">event</str>
            <str name="dates_date">2010-12-08</str>
            <str name="id">619209</str>
            <int name="idparent">0</int>
            <int name="idtype">57</int>
            <int name="status">-1</int>
            <str name="title_mltext">
                <r2r:ml lang="fr">Les compagnons de l'Espace</r2r:ml>
            </str>
            <str name="subtitle_mltext">
                <r2r:ml lang="fr">Une exposition de l'Observatoire de l'Espace du CNES</r2r:ml>
            </str>
            <str name="summary_mltext"/>
            <str name="content_mltext">
                <r2r:ml lang="fr"><p>A l'occasion de la 28e &eacute;dition des Journ&eacute;es europ&eacute;ennes du patrimoine, organis&eacute;e par le minist&egrave;re de la Culture et de la communication sur le th&egrave;me<i> Le voyage du patrimoine,</i> l'Observatoire de l'Espace invite le public au si&egrave;ge parisien du CNES, le samedi 17 et le dimanche 18 septembre 2011 de 11h &agrave; 19h pour l'exposition <i>L<b>es compagnons de l'Espace</b>.</i> Chacun&nbsp;pourra&nbsp;venir y&nbsp;d&eacute;couvrir des pi&egrave;ces&nbsp;&eacute;tonnantes issues du patimoine culturel de l'Espace et rencontrer les t&eacute;moins et acteurs de l'aventure spatiale. Cette&nbsp;&eacute;v&egrave;nement in&eacute;dit&nbsp;r&eacute;v&egrave;lera la multiplicit&eacute; des liens que l'Homme tisse avec l'Espace, &agrave; travers les compagnons qui le pr&eacute;c&egrave;dent, l'assistent et le r&eacute;confortent; qu'ils soient fictifs ou r&eacute;els, animaux ou robots, ils sont les partenaires de son exploration.</p><p>Entr&eacute;e libre et gratuite</p><p>Centre National d'Etudes Spatiales&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2, place Maurice Quentin Paris 1er</p><p>M&eacute;tro Chatelet-Les halles/sortie place carr&eacute;e</p><p>Renseignements : 01 44 76 76 18&nbsp; / <a class="mailto:observatoire.espace@cnes.fr">observatoire.espace@cnes.fr</a></p></r2r:ml>
            </str>
            <arr name="R_E_partners_int_m">
                <int>1387273</int>
            </arr>
            <arr name="R_E_places_int_m">
                <int>1434398</int>
            </arr>
            <arr name="R_E_subjects_int_m">
                <int>1387363</int>
            </arr>
            <arr name="R_E_type_int_m">
                <int>1387268</int>
            </arr>
            <arr name="R_E_websites_int_m">
                <int>1396551</int>
            </arr>
        </doc>
        <doc>...</doc>
        <doc>...</doc>
        <doc>...</doc>
        <doc>...</doc>
        <doc>...</doc>
    </result>
</response>

Querying Solr from Lodel

Now that our data is indexed, we want to query Solr from inside Lodel. To ease the process, I started to code a very simple LodelScript loop, and placed it in the dedicated file loops_local.php:

<?php

function loop_solr($context, $funcname, $arguments) {

    $localcontext = $context;

    if (function_exists("code_before_$funcname"))
        call_user_func("code_before_$funcname", $localcontext);

    $client = new Solarium_Client();

    $adapter = $client->getAdapter();
    if (isset($arguments['port']))
        $adapter->setPort($arguments['port']);

    $query = $client->createSelect();
    if (isset($arguments['query']))
        $query->setQuery($arguments['query']);

    $resultset = $client->select($query);

    $context['nbresults'] = $resultset->getNumFound();

    $count = 0;
    foreach ($resultset as $document) {

        $localcontext = $context;
        $localcontext['count'] = ++$count;

        foreach($document AS $field => $value) {
            $localcontext[$field] = $value;
        }

        call_user_func("code_do_$funcname", $localcontext);
    }

    if (function_exists("code_after_$funcname"))
        call_user_func("code_after_$funcname", $localcontext);
}

?>

This loop is very similar to the SQL loop and can be used like this:

<h1>À la une</h1>
<LOOP NAME="solr" PORT="8989" QUERY="class:event AND status:1 AND frontpage_boolean:1">
    <BEFORE><ul></BEFORE>
    <DO>
        <li><a href="/[#ID]">[#TITLE_MLTEXT]</a></li>
    </DO>
    <AFTER></ul></AFTER>
</LOOP>

More Like This

Solr provides another nice feature called MoreLikeThis (MLT) queries. Enable MLT during a query and each result comes with a list of similar documents. The similarity is computed on the fields you choose, and you can even weight them.
So I developed another LodelScript sub-loop, similar to the RSS loop, to support MLT:

<LOOP NAME="solr" PORT="8989" QUERY="id:[#ID]" MLT="1" MLTFL="class,status,title_mltext,body_mltext" COUNT="10">
    <BEFORE><ul></BEFORE>
    <DO>
        <LOOP NAME="morelikethis">
            <BEFORE><ul></BEFORE>
            <DO><li><a href="[#ID]">[#TITLE_MLTEXT]</a></li></DO>
            <AFTER></ul></AFTER>
        </LOOP>
    </DO>
    <AFTER></ul></AFTER>
</LOOP>

This code will list the 10 most similar documents for the current ID, based on the class, status, title_mltext and body_mltext fields.
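
Behind the scenes, these loop attributes have to end up as Solr request parameters. The sketch below shows one plausible mapping onto the mlt, mlt.fl and mlt.count parameters of Solr's MoreLikeThis component. The function name mltQueryString and the lowercase argument keys are assumptions; this is not the actual loop code.

```php
<?php

// Hypothetical helper: turn LodelScript loop arguments into a Solr query string.
function mltQueryString(array $arguments) {
    $params = array('q' => $arguments['query']);
    if (!empty($arguments['mlt'])) {
        $params['mlt'] = 'true';
        // Comma-separated list of fields used to compute similarity
        $params['mlt.fl'] = $arguments['mltfl'];
        // Number of similar documents returned for each result
        if (isset($arguments['count']))
            $params['mlt.count'] = (int) $arguments['count'];
    }
    return http_build_query($params, '', '&');
}

echo mltQueryString(array(
    'query' => 'id:619209',
    'mlt'   => '1',
    'mltfl' => 'class,status,title_mltext,body_mltext',
    'count' => '10',
)), "\n";
```

The resulting string can be appended to the select URL of your Solr instance. Field weighting is also available through additional MoreLikeThis parameters, which are left out here to keep the sketch short.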

What next?

This code is still a proof of concept: it needs refactoring, polishing, documentation and integration.

The first step is to move the code that indexes a document into a separate function, so that it can be called from an editing hook: this way, any updated object will be re-indexed on the fly.

We also plan to clone the hook system for the indexing function. This would allow us to customize, per site, how a Solr document is built before it is sent to Solr, without forking the script. A good example of such a hook would be querying the GeoNames web service at indexing time to geolocate your content.
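
Nothing is written yet, but the idea could look something like the sketch below. Every name in it (addIndexHook, applyIndexHooks, the site_tinytext field) is hypothetical, and the document is represented as a plain associative array to keep the example self-contained.

```php
<?php

// Hypothetical hook registry: each hook receives the document as an
// associative array and returns the (possibly modified) document.
$indexHooks = array();

function addIndexHook(array &$hooks, $callback) {
    $hooks[] = $callback;
}

function applyIndexHooks(array $hooks, array $doc) {
    foreach ($hooks as $hook) {
        $doc = call_user_func($hook, $doc);
    }
    return $doc;
}

// Example per-site hook: tag every document with the name of the site.
// A geolocation hook would follow the same shape.
addIndexHook($indexHooks, function (array $doc) {
    $doc['site_tinytext'] = 'calenda';
    return $doc;
});

$doc = applyIndexHooks($indexHooks, array('id' => '619209', 'class' => 'event'));
print_r($doc);
```

The indexing function would call applyIndexHooks just before handing the document over to Solr, so each site can register its own hooks without touching the shared script.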

Solr also provides an XSLT processor. Writing some simple stylesheets to transform Solr results into other formats would be a very neat way to add plenty of web services to Lodel.
