LucidWorks Blog


Getting Started with LucidWorks Big Data

Posted by Grant Ingersoll on Tue, Jan 29, 2013 @ 12:43 PM
  
LucidWorks Big Data (LWBD) is LucidWorks' newest, developer-focused platform. It combines the power of search, via LucidWorks Search, with the big data processing capabilities of the Hadoop ecosystem and the machine learning and Natural Language Processing (NLP) capabilities of tools like Mahout, OpenNLP and UIMA, without the pain of figuring out how to wire it all together or make it scale. It is designed to help companies deliver deeper insight into their data by bringing Search, Discovery and Analytics together in a single platform.  At LucidWorks, we firmly believe that you must take a multi-faceted approach to understanding your data, and we think search is a key part of that approach, as it is one of the most ubiquitous user interfaces on the planet.  When dealing with big data, it is not enough to have just Hadoop, or just Hive, or just search; you often need them all, or at least the ability to add them easily when you are ready.  

Moreover, working with data isn't just about the raw content, nor just about the logs the system produces.  You need a platform that ties the two together, because that leads to a deeper understanding of both the raw content and the users who interact with it.  For more details on LWBD's features, please refer to the product description page and the LWBD documentation.  If you'd like to go deeper, contact us and we can show you a demo and discuss how we can help you shave months off your next big data implementation.  

I'll use the rest of this post to focus on getting started with the actual product, walking through a simple example of ingesting web content, making it searchable, and then doing some aggregate analysis on that data using the platform.  Finally, I'll finish up with some ideas on where to go next. 

Getting Started

First off, note that this is a developer platform, so I am assuming you are comfortable on the command line. Second, the VM you are downloading is designed to help developers get started, not to be a production system. If you are interested in installing LWBD on your own cluster or in Amazon AWS, please contact our sales department.

To get started, here's what you will need to do first:

  1. Download the LWBD Virtual Machine image (you'll have to fill out a form).  It's a 3 GB file, so get a cup of coffee.
  2. See the VM prerequisites and start the VM using the install instructions.
  3. SSH into the machine or work from the command line in the VM (I find SSH easier to copy and paste from).  In this case, I did ssh ubuntu@192.168.1.84 to access the system.
  4. Consider creating a JSON formatting script for convenience (a minimal sketch follows this list).  This is not a requirement to make the examples work.
  5. Verify your system is running by executing the following on the command line in the VM (after you've logged in):
    curl -u administrator:foo localhost:8341/sda/v1/client/collections
    You should see something like (abbreviated for space):
    [
        {
            "status": "EXISTS",
            "createTime": 1359071751934,
            "collection": "collection1",
            "id": "collection1",
            "throwable": null,
            "children": [
                {
                    "status": "EXISTS",
                    "createTime": 1359071751934,
                    "collection": "collection1",
                    "children": [],
                    "id": "collection1",
                    "throwable": null,
                    "properties": {
                        "service-impl": "LucidWorksDataManagementService"
                    }
                },
                {
                    "status": "EXISTS",
                    "createTime": 1359071751937,
                    "collection": "collection1",
                    "children": [],
                    "id": "collection1",
                    "throwable": null,
                    "properties": {
                        "path": "hdfs://localhost:50001/data/collections/collection1",
                        "service-impl": "HadoopDataManagementService"
                    }
                }
            ]
        }
    ]
  6. If the last step does not work, please refer to support.lucidworks.com for help or to ask a question.
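
A quick aside on step 4: the JSON formatting script can be anything you like, and none of the examples below depend on it. One minimal sketch, assuming the VM's stock Python is on the PATH (the script name jsonpp is just an illustration, not something that ships with LWBD):

    #!/bin/sh
    # jsonpp: pretty-print JSON read from standard input.
    # Uses only the Python standard library's json.tool module.
    python -m json.tool

Make it executable (chmod +x jsonpp), drop it somewhere on your PATH, and pipe any of the curl commands in this post through it, for example:

    curl -s -u administrator:foo localhost:8341/sda/v1/client/collections | jsonpp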

Content Acquisition and Search

Now that the prerequisites are out of the way, let's get started bringing some content into the system.  I'll use curl for the examples here, but you can use any REST client you feel comfortable with, as LWBD speaks JSON over REST.
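If you are going to run more than a handful of these calls, it can be convenient to stash the base URL and credentials in shell variables; the commands in this post spell everything out in full, so this is purely optional, and the values below are just the VM defaults used throughout:

    # Optional convenience variables; administrator:foo and port 8341 are the VM defaults.
    LWBD_BASE=localhost:8341/sda/v1/client
    LWBD_AUTH=administrator:foo

    # The collection listing from the previous section, rewritten with the variables:
    curl -s -u "$LWBD_AUTH" "$LWBD_BASE/collections"
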
  1. Let's create a collection where I can organize all the data:
    curl -s -u administrator:foo -X POST -H 'Content-type: application/json' -d '{"collection":"searchhub"}' localhost:8341/sda/v1/client/collections
    The results should look like:
    {
        "status": "CREATED",
        "createTime": 1359200703437,
        "collection": "searchhub",
        "children": [
            {
                "status": "CREATED",
                "properties": {
                    "name": "searchhub",
                    "instance_dir": "searchhub_3"
                },
                "collection": "searchhub",
                "id": "searchhub",
                "throwable": null,
                "children": []
            },
            {
                "status": "CREATED",
                "properties": {
                    "path": "hdfs://localhost:50001/data/collections/searchhub",
                    "service-impl": "HadoopDataManagementService"
                },
                "collection": "searchhub",
                "id": "searchhub",
                "throwable": null,
                "children": []
            },
            {
                "status": "CREATED",
                "createTime": 1359200717458,
                "collection": "searchhub",
                "children": [],
                "id": "searchhub",
                "throwable": null,
                "properties": {
                    "service-impl": "HBaseDataManagementService"
                }
            }
        ],
        "id": "searchhub",
        "throwable": null,
        "properties": {}
    }
  2. Next, let's create a Web Data source to crawl a web site:
    curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"crawler":"lucid.aperture","type":"web","url":"searchhub.org","crawl_depth":-1,"name":"SearchHub", "bounds":"tree", "output_type":"com.lucid.crawl.impl.HBaseUpdateController", "output_args":"localhost:2181", "mapping":{"original_content":"true"}}' localhost:8341/sda/v1/client/collections/searchhub/datasources
    The results should look like:
    {
        "status": "CREATED",
        "createTime": 1359200742420,
        "collection": "searchhub",
        "children": [],
        "id": "14fd8c7ad12346b1a058d0f5e342d98b",
        "throwable": null,
        "properties": {
            "proxy_password": "",
            "parsing": true,
            "ignore_robots": false,
            "commit_on_finish": true,
            "max_bytes": 10485760,
            "id": "14fd8c7ad12346b1a058d0f5e342d98b",
            "add_failed_docs": false,
            "proxy_host": "",
            "verify_access": true,
            "log_extra_detail": false,
                    "pagecount": "pageCount",
                    "title": "title",
                    "fullname": "author",
                    "filelastmodified": "lastModified",
                    "content-type": "mimeType"
                },
                "verify_schema": true,
                "dynamic_field": "attr",
                "unique_key": "id",
                "lucidworks_fields": true,
                "multi_val": {
                    "body": false,
                    "mimeType": false,
                    "description": false,
                    "title": false,
                    "author": true,
                    "acl": true,
                    "fileSize": false,
                    "dateCreated": false
                },
                "datasource_field": "data_source",
                "default_field": null,
                "types": {
                    "date": "DATE",
                    "lastmodified": "DATE",
                    "filesize": "LONG",
                    "datecreated": "DATE"
                }
            },
            "output_args": "localhost:2181",
            "crawl_depth": -1,
            "commit_within": 900000,
            "include_paths": [],
            "collection": "searchhub",
            "fail_unsupported_file_types": false,
            "proxy_port": -1,
            "name": "SearchHub",
            "exclude_paths": [],
            "url": "searchhub.org/",
            "max_docs": -1,
            "bounds": "tree",
            "proxy_username": "",
            "caching": false,
            "output_type": "com.lucid.crawl.impl.HBaseUpdateController",
            "auth": [],
            "crawler": "lucid.aperture"
        }
    }
    Make a note of the "id" value, as we will use it in the next step (the sketch after this list shows one way to capture it programmatically).
  3. Next, kick off the ingestion of the data: 
    curl -u administrator:foo -X POST  localhost:8341/sda/v1/client/collections/searchhub/datasources/14fd8c7ad12346b1a058d0f5e342d98b
    The last bit of that URL is the ID from the previous step.  Your ID will be different. The result should be something like:
    {
        "id": "14fd8c7ad12346b1a058d0f5e342d98b",
        "createTime": 1359200823711,
        "status": "RUNNING",
        "collection": "searchhub",
        "children": [],
        "throwable": null
    }
  4. Let that run for a bit so there is data in the system, or log into the LucidWorks Search admin (HOST:8989/, username: admin, password: admin) and watch the documents flow in. I let mine run for a while before moving on to the next step.
  5. Next, we need to run a workflow to extract text from the raw HTML in order to make it indexable:
    curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"parentWfId":"searchhub","workingDir":"/data/collections/searchhub-subwf/tmp/","oozie.wf.application.path":"hdfs://localhost:50001/oozie/apps/_etl/sub_wf/extract","collection":"searchhub","zkConnect":"localhost:2181","tikaProcessorClass":"com.digitalpebble.behemoth.tika.TikaProcessor"}' localhost:8341/sda/v1/client/workflows/extract
    This will kick off a Hadoop job that processes all of the raw content through Tika. Since this can be a long-running job when you have a lot of content, the API simply returns a job ID that you can use to check on the status of the results, something like:
    {
        "id": "0000006-130123050323975-oozie-hado-W",
        "workflowId": "extract",
        "createTime": 1359201918000,
        "status": "RUNNING",
        "children": [],
        "throwable": null
    }
  6. We should now have searchable content. You can search via curl or via the admin UI. Since I've been using the APIs, I'll continue here with the command:
    curl -u administrator:foo -X POST -H 'Content-type: application/json' -d '{"query":{"q":"*:*","rows":1, "fl":"id,title,score"}}' localhost:8341/sda/v1/client/collections/searchhub/documents/retrieval
    The results look like:
    {
        "QUERY": {
            "json": {
                "responseHeader": {
                    "status": 0,
                    "QTime": 7,
                    "params": {
                        "rows": "1",
                        "version": "2.2",
                        "collection": "searchhub",
                        "q": "*:*",
                        "wt": "json",
                        "fl": [
                            "id,title,score",
                            "id"
                        ]
                    }
                },
                "response": {
                    "start": 0,
                    "maxScore": 1,
                    "numFound": 2287,
                    "docs": [
                        {
                            "score": [
                                1
                            ],
                            "id": "searchhub.org/2013/01/24/apache-solr-4-1-is-here/",
                            "title": [
                                "Apache Lucene/Solr 4.1 is here!"
                            ]
                        }
                    ]
                },
                "requestToken": "SDA_USER~779187014188fc3a"
            }
        }
    }
    See the documentation for more details on how to write queries and process the results.
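
To tie the pieces together, here is a rough end-to-end sketch of the ingest-and-search flow above as a single shell script. It uses only the endpoints already shown; the Python one-liner for pulling the data source ID out of the response is merely a convenience (any JSON-aware tool would do), and the final query swaps the *:* match-all for an ordinary keyword search, so treat the whole thing as a starting point rather than a canonical recipe:

    #!/bin/sh
    # Rough sketch of the flow above: create the collection, register the web
    # data source, start the crawl, and (once the extract workflow from step 5
    # has finished) issue a keyword query. administrator:foo, port 8341 and the
    # "searchhub" collection name are the VM defaults used throughout this post.
    set -e

    BASE=localhost:8341/sda/v1/client
    AUTH=administrator:foo

    # 1. Create the collection.
    curl -s -u $AUTH -X POST -H 'Content-type: application/json' \
         -d '{"collection":"searchhub"}' $BASE/collections

    # 2. Create the web data source and capture the "id" field of the response.
    DS_ID=$(curl -s -u $AUTH -X POST -H 'Content-type: application/json' \
         -d '{"crawler":"lucid.aperture","type":"web","url":"searchhub.org","crawl_depth":-1,"name":"SearchHub","bounds":"tree","output_type":"com.lucid.crawl.impl.HBaseUpdateController","output_args":"localhost:2181","mapping":{"original_content":"true"}}' \
         $BASE/collections/searchhub/datasources \
      | python -c 'import json,sys; sys.stdout.write(json.load(sys.stdin)["id"])')
    echo "data source id: $DS_ID"

    # 3. Kick off the crawl for that data source.
    curl -s -u $AUTH -X POST $BASE/collections/searchhub/datasources/$DS_ID

    # ... let the crawl run for a while, then run the extract workflow from step 5 ...

    # 4. Query for an actual term instead of the *:* match-all.
    curl -s -u $AUTH -X POST -H 'Content-type: application/json' \
         -d '{"query":{"q":"solr","rows":5,"fl":"id,title,score"}}' \
         $BASE/collections/searchhub/documents/retrieval

If the crawl or the extract workflow is still running, the final query will simply return fewer (or zero) documents; re-run it once the jobs have finished.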

Digging Deeper

So far, we've covered the basics. Now let's try running a workflow to automatically extract Statistically Interesting Phrases (SIPs) from the content.  What's a SIP?  It is a phrase whose words co-occur more often than one would expect given a random distribution of words.  SIPs are often useful for exploring new data sets, as they let you discover potentially important word combinations that you might not think of on your own.  Keep in mind that, when dealing with SIPs, you will likely spend some time tuning your SIP process to improve the quality of the results.  This usually involves stopword analysis, data cleansing and more.  For the sake of the example here, I'm only doing a few basic things to clean up the data.  
To familiarize yourself with the workflows available in LWBD, run the following command, which will tell you all of the current workflows and the parameters they accept:
curl -u administrator:foo localhost:8341/sda/v1/client/workflows
Due to the length of the output, I will not include the results here, so please refer to the documentation.  For this example, I will be using a few of the ETL subworkflows.  The steps to run are:
  1. The ETL Vectorize Subworkflow:
    curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"workingDir": "/data/collections/searchhub-subwf/tmp/","documentsAsText": "hdfs://localhost:50001/data/collections/searchhub-subwf/tmp/document-text","documentsAsVectors": "hdfs://localhost:50001/data/collections/searchhub-subwf/tmp/document-vectors","vec_nGrams": "2","vec_analyzer": "com.lucid.sda.hadoop.analysis.StandardStopwordAnalyzer","collection": "searchhub","zkConnect": "localhost:2181", "parentWfId":"searchhub","oozie.wf.application.path":"hdfs://localhost:50001/oozie/apps/_etl/sub_wf/vectorize"}' localhost:8341/sda/v1/client/workflows/vectorize
    The output should look something like:
    {"id":"0000000-130123050323975-oozie-hado-W","workflowId":"vectorize","createTime":
    
    
    