elasticsearch - Open Source, Distributed, RESTful, Search Engine

You know, for Search

So we build a web site or an application and want to add search to it, and then it hits us: getting search working is hard. We want our search solution to be fast, we want a painless setup and a completely free search schema, we want to be able to index data simply using JSON over HTTP, we want our search server to be always available, we want to be able to start with one machine and scale to hundreds, we want real-time search, we want simple multi-tenancy, and we want a solution that is built for the cloud.

"This should be easier", we declared, "and cool, bonsai cool".

elasticsearch aims to solve all these problems and more. It is an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Apache Lucene.

Schema Free & Document Oriented

The data model of search engines has its roots in schema-free, document-oriented databases, and as the #nosql movement has shown, this model proves very effective for building applications.

elasticsearch's model is JSON, which is the de-facto standard for representing data these days. Moreover, JSON makes it simple to represent semi-structured data with complex entities, and it is programming-language neutral with first-class parser support.


$ curl -XPUT localhost:9200/twitter/user/kimchy -d '{
    "name" : "Shay Banon"
}'

$ curl -XPUT localhost:9200/twitter/tweet/1 -d '{
    "user": "kimchy",
    "post_date": "2009-11-15T13:12:00",
    "message": "Trying out elasticsearch, so far so good?"
}'

$ curl -XPUT localhost:9200/twitter/tweet/2 -d '{
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "You know, for Search"
}'

Schema Mapping

elasticsearch is schema-less. Just toss it a typed JSON document and it will automatically index it. Types such as numbers and dates are automatically detected and treated accordingly.

But, as we all know, Search Engines are quite sophisticated. Fields in documents can have boost levels that affect scoring, analyzers can be used to control how text gets tokenized into terms, certain fields should not be analyzed at all, and so on. elasticsearch allows you to completely control how a JSON document gets mapped into the search engine at the per-type and per-index level.

$ curl -XPUT localhost:9200/twitter

$ curl -XPUT localhost:9200/twitter/user/_mapping -d '{
    "user" : {
        "properties" : {
            "name" : { "type" : "string" }
        }
    }
}'
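
As a rough sketch of that per-type control, the mapping below marks `user` as not analyzed (so it is indexed as a single exact term) and gives `message` a higher boost. It assumes the legacy `string` field options `index` and `boost` are supported by the version in use; the `TWEET_MAPPING` variable and the `put_tweet_mapping` helper are names invented for this example.

```shell
# Hypothetical mapping for the "tweet" type in the "twitter" index.
# Assumes the legacy string options "index" and "boost" are available.
TWEET_MAPPING='{
    "tweet" : {
        "properties" : {
            "user" : { "type" : "string", "index" : "not_analyzed" },
            "message" : { "type" : "string", "boost" : 2.0 },
            "post_date" : { "type" : "date" }
        }
    }
}'

# Helper (name is illustrative): send the mapping to a local node.
put_tweet_mapping() {
    curl -XPUT localhost:9200/twitter/tweet/_mapping -d "$TWEET_MAPPING"
}
```

Calling `put_tweet_mapping` once the `twitter` index exists would apply the mapping to the `tweet` type.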

GETting Some Data

Indexing data is always done using a unique identifier (at the type level). This is very handy since many times we wish to update or delete the actual indexed data, or just GET it. Getting data could not be simpler and all that is needed is the index name, the type and the id. What we get back is the actual JSON document used to index the specific data, but please, keep it secret and don't tell any other distributed Key/Value storage systems...

$ curl -XPUT localhost:9200/twitter/tweet/2 -d '{
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "You know, for Search"
}'

$ curl -XGET localhost:9200/twitter/tweet/2

Search

It's what it all boils down to in the end: being able to search. And search could never be simpler. Issuing queries is a simple call that hides away the sophisticated distributed search support elasticsearch provides. Search can be executed either using a simple, Lucene-based query string or using an extensive JSON-based search query DSL.

Search, though, does not end with just queries. Facets, highlighting, custom scripts and more are all there to be used when needed.

$ curl -XPUT localhost:9200/twitter/tweet/2 -d '{
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "You know, for Search"
}'

$ curl -XGET 'localhost:9200/twitter/tweet/_search?q=user:kimchy'

$ curl -XGET localhost:9200/twitter/tweet/_search -d '{
    "query" : {
        "term" : { "user": "kimchy" }
    }
}'

$ curl -XGET 'localhost:9200/twitter/_search?pretty=true' -d '{
    "query" : {
        "range" : {
            "post_date" : {
                "from" : "2009-11-15T13:00:00",
                "to" : "2009-11-15T14:30:00"
            }
        }
    }
}'

Multi Tenancy

A single index is already a major step forward, but what happens when we need more than one index? There are many cases for using multiple indices. One example is storing an index per week of log files; another is having different indices with different settings (one with memory storage, and one with file system storage).

When we do that, though, we would like to be able to search across multiple indices (among other operations).

$ curl -XPUT localhost:9200/kimchy

$ curl -XPUT localhost:9200/elasticsearch

$ curl -XPUT localhost:9200/elasticsearch/tweet/1 -d '{
    "post_date": "2009-11-15T14:12:12",
    "message": "Zug Zug",
    "tag": "warcraft"
}'

$ curl -XPUT localhost:9200/kimchy/tweet/1 -d '{
    "post_date": "2009-11-15T14:12:12",
    "message": "Whatyouwant?",
    "tag": "warcraft"
}'

$ curl -XGET 'localhost:9200/kimchy,elasticsearch/tweet/_search?q=tag:warcraft'

$ curl -XGET 'localhost:9200/_all/tweet/_search?q=tag:warcraft'
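
When the list of indices is not fixed (the index-per-week case, for example), the comma-separated URL can be assembled by a small helper. This is only a sketch: the `multi_index_search_url` function and the `ES_HOST` variable are invented here, and the query string is passed through as-is without URL encoding.

```shell
# Assumed default host; override ES_HOST to point elsewhere.
ES_HOST="${ES_HOST:-localhost:9200}"

# Usage: multi_index_search_url TYPE QUERY INDEX [INDEX...]
# Joins the index names with commas and prints the search URL.
multi_index_search_url() {
    type="$1"
    query="$2"
    shift 2
    indices=""
    for index in "$@"; do
        if [ -z "$indices" ]; then
            indices="$index"
        else
            indices="$indices,$index"
        fi
    done
    printf '%s/%s/%s/_search?q=%s\n' "$ES_HOST" "$indices" "$type" "$query"
}
```

For example, `curl -XGET "$(multi_index_search_url tweet tag:warcraft kimchy elasticsearch)"` would issue the same request as the two-index search above.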

Settings

The ability to configure is a double-edged sword. We want the ability to start working with the system as fast as possible, with no configuration, and still be able to control almost every aspect of the application if need be.

elasticsearch is built with this notion in mind. Almost everything is configurable and pluggable. Moreover, each index can have its own settings, which can override the master settings. For example, one index can be configured with memory storage and have 10 shards with 1 replica each, and another index can have file-based storage with 1 shard and 10 replicas. All index-level settings can be controlled when creating an index, using either YAML or JSON.

$ curl -XPUT localhost:9200/elasticsearch/ -d '{
    "settings" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 3
    }
}'

Distributed

One of the main features of elasticsearch is its distributed nature. Indices are broken down into shards, each shard with 0 or more replicas. Each data node within the cluster hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically and behind the scenes.
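
To make the shard and replica numbers concrete: an index with N shards and R replicas per shard has N * (1 + R) shard copies in total for the cluster to place on data nodes. A minimal sketch of that arithmetic follows; the function name `total_shard_copies` is made up for illustration.

```shell
# Usage: total_shard_copies NUMBER_OF_SHARDS NUMBER_OF_REPLICAS
# Prints the total number of shard copies (each shard plus its replicas).
total_shard_copies() {
    shards="$1"
    replicas="$2"
    echo $(( shards * (1 + replicas) ))
}
```

With the values from the Settings example (2 shards, 3 replicas), `total_shard_copies 2 3` prints 8.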

Gateway

Sometimes the whole cluster crashes or needs to be taken down. In such cases, we usually want to restore the latest state of the cluster when it comes back up again. elasticsearch provides the gateway module, which allows you to do just that: think Time Machine for search.

The state of the cluster (including the transaction log) can be recreated either from each node's local storage (the default) or from shared storage (like NFS or Amazon S3). When using shared storage, the state is asynchronously replicated to it.

Moreover, when using shared storage for long-term persistence, the index can be kept completely in memory while still being able to perform full recovery in the event of a cluster shutdown.

Apache Lucene and the logo is a trademark of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.