LucidWorks Big Data (LWBD) is LucidWorks' newest, developer-focused platform. It combines the power of search, via LucidWorks Search, with the big data processing capabilities of the Hadoop ecosystem and the machine learning and Natural Language Processing (NLP) capabilities of tools like Mahout, OpenNLP and UIMA, without the pain of figuring out how to wire it all together or make it scale. It is designed to help companies deliver deeper insight into data by bringing Search, Discovery and Analytics together in a single platform. At LucidWorks, we firmly believe that one must take a multi-faceted approach to understanding data, and we think search is a key part of that approach, as it is one of the most ubiquitous user interfaces on the planet. When dealing with big data, it is not enough to just have Hadoop, or just Hive, or just search; you often need them all, or you at least need the ability to easily add them when you are ready.
Moreover, when dealing with data, it isn't just about the raw content or just about the logs produced by the system. You need a platform that can tie the two together, as that leads to a deeper understanding of both the raw content and the users who interact with it. For more details on LWBD's features, please refer to the product description page and to the LWBD documentation. If you're interested in learning more, contact us and we can show you a demo and discuss how we can save you months of time on your next big data implementation.
I'll use the rest of this post to focus on getting started with the actual product by looking at a simple example of ingesting web content, making it searchable, and then doing some aggregate analysis on that data using the platform. Finally, I'll finish up with some ideas on where to proceed next.
Getting Started
First off, note that this is a developer platform, so I am assuming you are comfortable on the command line. Second, the VM you are downloading is designed to help developers get started, not to be a production system. If you are interested in installing LWBD on your own cluster or in Amazon AWS, please contact our sales department.
To get started, once you have the VM up and running, first check that everything came up correctly by listing the collections:
curl -u administrator:foo localhost:8341/sda/v1/client/collections

You should see something like (abbreviated for space):
[ { "status": "EXISTS", "createTime": 1359071751934, "collection": "collection1", "id": "collection1", "throwable": null, "children": [ { "status": "EXISTS", "createTime": 1359071751934, "collection": "collection1", "children": [], "id": "collection1", "throwable": null, "properties": { "service-impl": "LucidWorksDataManagementService" } }, { "status": "EXISTS", "createTime": 1359071751937, "collection": "collection1", "children": [], "id": "collection1", "throwable": null, "properties": { "path": "hdfs://localhost:50001/data/collections/collection1", "service-impl": "HadoopDataManagementService" } } ] } ]
Content Acquisition and Search
First, create a collection named "searchhub" to hold the content we are going to crawl:

curl -s -u administrator:foo -X POST -H 'Content-type: application/json' -d '{"collection":"searchhub"}' localhost:8341/sda/v1/client/collections

The results should look like:
{ "status": "CREATED", "createTime": 1359200703437, "collection": "searchhub", "children": [ { "status": "CREATED", "properties": { "name": "searchhub", "instance_dir": "searchhub_3" }, "collection": "searchhub", "id": "searchhub", "throwable": null, "children": [] }, { "status": "CREATED", "properties": { "path": "hdfs://localhost:50001/data/collections/searchhub", "service-impl": "HadoopDataManagementService" }, "collection": "searchhub", "id": "searchhub", "throwable": null, "children": [] }, { "status": "CREATED", "createTime": 1359200717458, "collection": "searchhub", "children": [], "id": "searchhub", "throwable": null, "properties": { "service-impl": "HBaseDataManagementService" } } ], "id": "searchhub", "throwable": null, "properties": {} }
curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"crawler":"lucid.aperture","type":"web","url":"searchhub.org","crawl_depth":-1,"name":"SearchHub", "bounds":"tree", "output_type":"com.lucid.crawl.impl.HBaseUpdateController", "output_args":"localhost:2181", "mapping":{"original_content":"true"}}' localhost:8341/sda/v1/client/collections/searchhub/datasourcesThe results should look like:
{ "status": "CREATED", "createTime": 1359200742420, "collection": "searchhub", "children": [], "id": "14fd8c7ad12346b1a058d0f5e342d98b", "throwable": null, "properties": { "proxy_password": "", "parsing": true, "ignore_robots": false, "commit_on_finish": true, "max_bytes": 10485760, "id": "14fd8c7ad12346b1a058d0f5e342d98b", "add_failed_docs": false, "proxy_host": "", "verify_access": true, "log_extra_detail": false, "pagecount": "pageCount", "title": "title", "fullname": "author", "filelastmodified": "lastModified", "content-type": "mimeType" }, "verify_schema": true, "dynamic_field": "attr", "unique_key": "id", "lucidworks_fields": true, "multi_val": { "body": false, "mimeType": false, "description": false, "title": false, "author": true, "acl": true, "fileSize": false, "dateCreated": false }, "datasource_field": "data_source", "default_field": null, "types": { "date": "DATE", "lastmodified": "DATE", "filesize": "LONG", "datecreated": "DATE" } }, "output_args": "localhost:2181", "crawl_depth": -1, "commit_within": 900000, "include_paths": [], "collection": "searchhub", "fail_unsupported_file_types": false, "proxy_port": -1, "name": "SearchHub", "exclude_paths": [], "url": "searchhub.org/", "max_docs": -1, "bounds": "tree", "proxy_username": "", "caching": false, "output_type": "com.lucid.crawl.impl.HBaseUpdateController", "auth": [], "crawler": "lucid.aperture" } }Make a note of the "id" value, as I will use it later.
Now kick off the crawl by POSTing to the data source:

curl -u administrator:foo -X POST localhost:8341/sda/v1/client/collections/searchhub/datasources/14fd8c7ad12346b1a058d0f5e342d98b

The last bit of that URL is the ID from the previous step. Your ID will be different. The result should be something like:
{ "id": "14fd8c7ad12346b1a058d0f5e342d98b", "createTime": 1359200823711, "status": "RUNNING", "collection": "searchhub", "children": [], "throwable": null }
curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"parentWfId":"searchhub","workingDir":"/data/collections/searchhub-subwf/tmp/","oozie.wf.application.path":"hdfs://localhost:50001/oozie/apps/_etl/sub_wf/extract","collection":"searchhub","zkConnect":"localhost:2181","tikaProcessorClass":"com.digitalpebble.behemoth.tika.TikaProcessor"}' localhost:8341/sda/v1/client/workflows/extractThis will kick off a Hadoop job that processes all of the raw content through Tika. Since this can be a long running job when you have a lot of content, we simply return you a JobID that you can use to check the status of the results, something like:
{ "id": "0000006-130123050323975-oozie-hado-W", "workflowId": "extract", "createTime": 1359201918000, "status": "RUNNING", "children": [], "throwable": null }
To verify that content is searchable, issue a simple query against the collection:

curl -u administrator:foo -X POST -H 'Content-type: application/json' -d '{"query":{"q":"*:*","rows":1, "fl":"id,title,score"}}' localhost:8341/sda/v1/client/collections/searchhub/documents/retrieval

The results look like:
{ "QUERY": { "json": { "responseHeader": { "status": 0, "QTime": 7, "params": { "rows": "1", "version": "2.2", "collection": "searchhub", "q": "*:*", "wt": "json", "fl": [ "id,title,score", "id" ] } }, "response": { "start": 0, "maxScore": 1, "numFound": 2287, "docs": [ { "score": [ 1 ], "id": "searchhub.org/2013/01/24/apache-solr-4-1-is-here/", "title": [ "Apache Lucene/Solr 4.1 is here!" ] } ] }, "requestToken": "SDA_USER~779187014188fc3a" } } }See the documentation for more details on how to write queries and process the results.
Digging Deeper
The platform's analysis capabilities are exposed as workflows. To see which workflows are available, ask the workflows endpoint:

curl -u administrator:foo localhost:8341/sda/v1/client/workflows
curl -X POST -H 'Content-type:application/json' -u administrator:foo -d '{"workingDir": "/data/collections/searchhub-subwf/tmp/","documentsAsText": "hdfs://localhost:50001/data/collections/searchhub-subwf/tmp/document-text","documentsAsVectors": "hdfs://localhost:50001/data/collections/searchhub-subwf/tmp/document-vectors","vec_nGrams": "2","vec_analyzer": "com.lucid.sda.hadoop.analysis.StandardStopwordAnalyzer","collection": "searchhub","zkConnect": "localhost:2181", "parentWfId":"searchhub","oozie.wf.application.path":"hdfs://localhost:50001/oozie/apps/_etl/sub_wf/vectorize"}' localhost:8341/sda/v1/client/workflows/vectorizeThe output should look something like:
{"id":"0000000-130123050323975-oozie-hado-W","workflowId":"vectorize","createTime":