Find, Discover and Analyze the Value in Big Data

  • Posted by Paul Doscher on March 1, 2013 at 9:00am

From Google to Facebook to Amazon, major web properties are deriving real and significant value from the huge amounts of structured and unstructured data they collect. Whether it’s presenting related searches, uncovering trending topics, or recommending similar products, these sites recognize that empowering users and organizations to take full advantage of all that data requires a combination of search technology, discovery capabilities, and large-scale analytics.

But you don’t have to be one of the world’s top web properties to reap the benefits of big data. Enterprises can now cost-effectively combine search, discovery and analytics to offer employees and customers fast, efficient access to a wealth of answers and insights that until now were completely out of reach. In fact, companies that aren’t exploring how to develop these capabilities today risk finding themselves uncompetitive and desperately playing catch-up tomorrow.

Whether or not you believe the estimate of 2.7 zettabytes of data in the digital universe today, there’s no doubt that the massive amount of information being created is affecting businesses. But it’s not just the amount that’s important. Much of the growth has come in the form of what many people call unstructured data, i.e., the text and text-like information that packs our hard drives, flows into and out of social media sites, and fills emails, websites and more. It’s thought of as unstructured, which I find odd, because everyone who works on text understands that it is actually highly structured. It’s just that traditional databases aren’t very good at dealing with the structure inherent in text.

If you think about it, truly unstructured data is just a bunch of random bits, and clearly there’s really very little that’s random about the information stored in most documents, PowerPoint presentations, and social media posts. Whether it’s sales pitches, 140-character tweets around a hashtag, or comments on a Facebook update, there is usually some element of structure to the content. There’s also a great deal of structure in the language itself—ask any linguist. This structure is just much harder to exploit because grammar rules are not universally agreed upon, and words and phrases can have multiple meanings (which is also why the grammar checker in your word processor is so unreliable).

Still, once we recognize that all this information does contain structure—I prefer to call it “multi-structured data”—we can then try to understand the best tools and applications for deriving meaning and value from it depending on the particular purpose. Sometimes search is the best approach—the ability to ask a straightforward question and get a relevant response. But often, we don’t know how to frame the right questions or really aren’t sure of what we’re looking for. In these cases, we might benefit from suggestions or recommendations from the system or the ability to simply browse the data.

Finding the right solution requires building a framework around the needs of users and the business. Key user needs, for example, include:

  • Real-time, ad hoc access to content—e.g., “How are our competitors reacting to our latest product announcement?”
  • Aggressive prioritization of information based on importance—e.g., “What topics are trending following our latest press release?” or “Which emails should I read right now?”
  • Serendipity—e.g., “Wow, there’s a technology company in Chile doing research on a type of power source that would enable a device on our roadmap to last 10 times longer and weigh half as much.”
  • Feedback/learning from past searches—e.g., the capability of the system to better determine the relevance and priority of search results based on the analysis of previous searches.

Key business needs include:

  • Deeper insights into users—e.g., the ability of management to recognize the types of questions users struggle with in order to offer new information resources that can make their teams more productive.
  • Leveraging existing internal knowledge—e.g., providing users and developers interfaces and tools that can leverage their current knowledge; these tools may include a search and navigation UI and best-of-breed, well-documented open source tools.
  • Cost-effectiveness—e.g., highly reliable tools and platforms that simplify and accelerate information ingestion and analysis while delivering a clear and significant return on investment, preferably based on open source tools to avoid expensive licensing costs and vendor lock-in.

Search, Discovery and Analytics
Satisfying these needs and delivering the required capabilities involves three distinct approaches to information: search, discovery and analytics (SDA).

Search: Search is the most familiar of the three approaches, as most of us now perform at least one search every day. Search is highly effective when users know what they are looking for, but only if the technology lets them express questions in a natural, familiar way, as is possible with Google. On a deeper level, the search application may incorporate Natural Language Processing (NLP) to overlay useful structures on the stored information and make it easier for people to describe what they are looking for.

Discovery: Discovery is useful when users struggle to state what they want to find, or prefer to browse the data. The complexity and ambiguity of language mean that users often have a hard time framing the right question in a search, or simply don’t know how to spell it out. They may, for instance, use an older or less common term, such as “management information systems” instead of “information technology,” or use the wrong spelling, such as “Apache Soler” instead of “Apache Solr.” In such cases, the system should be able to return relevant information by looking for patterns and related terms, and by checking for spelling errors, homonyms and synonyms. The system should also be able to make suggestions and provide guided navigation for browsing content based on a variety of factors, including but not limited to:

  • Related keywords
  • What other people on the network have searched for
  • Grouping or clustering content
  • Labeling content according to a taxonomy, ontology, or folksonomy
  • User recommendations and much more
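
The two examples above, an outdated term and a misspelled product name, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary, the synonym table, and the `normalize_query` helper are hypothetical, and a production system would build them from its own index and query logs rather than hard-code them. The fuzzy matching uses `difflib.get_close_matches` from the Python standard library.

```python
from difflib import get_close_matches

# Hypothetical vocabulary and synonym table, hard-coded for illustration.
VOCABULARY = ["apache", "solr", "hadoop", "mahout", "information", "technology"]
SYNONYMS = {
    "management information systems": "information technology",
}


def normalize_query(query: str) -> str:
    """Map an older phrase to its preferred term, then fix likely misspellings."""
    text = query.lower().strip()
    # Whole-phrase synonym lookup; unknown phrases pass through unchanged.
    text = SYNONYMS.get(text, text)

    corrected = []
    for word in text.split():
        if word in VOCABULARY:
            corrected.append(word)
            continue
        # Take the closest known term, if any is similar enough.
        matches = get_close_matches(word, VOCABULARY, n=1, cutoff=0.75)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


print(normalize_query("Apache Soler"))
print(normalize_query("management information systems"))
```

With this toy vocabulary, the first query is rewritten to "apache solr" and the second to "information technology" before either is sent to the search engine.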

Analytics: Analytics can reveal how users interact with the system, and this knowledge can be fed back to improve it. Examples include refining the ability to make recommendations to users, changing how the system scores results, and changing how the discovery mechanisms are handled. Analytics can also provide business insight around information utilization and reveal new opportunities, such as how to monetize the data or reduce information costs.
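
One way to picture "changing how the system scores results" is a re-ranking step that boosts documents users clicked on in the past. The sketch below is an assumption-laden illustration, not a description of any particular product: the `rerank` function, the `weight` parameter, and the sample data are all hypothetical, and the logarithmic boost is just one simple choice for damping very popular documents.

```python
import math
from collections import Counter


def rerank(results, click_log, weight=0.5):
    """Re-rank (doc_id, base_score) pairs using past click counts.

    results:   list of (doc_id, base_score) from the search engine
    click_log: iterable of doc_ids that users clicked for this query
    weight:    hypothetical tuning knob for how much clicks matter
    """
    clicks = Counter(click_log)

    def boosted(item):
        doc_id, score = item
        # log1p keeps heavily clicked documents from dominating completely.
        return score + weight * math.log1p(clicks[doc_id])

    return sorted(results, key=boosted, reverse=True)


# Hypothetical sample data.
results = [("doc_a", 1.0), ("doc_b", 0.9), ("doc_c", 0.4)]
click_log = ["doc_b", "doc_b", "doc_c", "doc_b"]
print(rerank(results, click_log))
```

Here doc_b overtakes doc_a because users clicked it three times, while doc_c's single click is not enough to lift it past either of them.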

SDA in the Wild
As noted above, major web properties already use various forms of SDA to benefit from big data. Google ranks search results, suggests related searches, and collects and analyzes vast amounts of information about tendencies and trends. Amazon offers basic search capabilities to users, but also has an increasingly robust discovery capability for making recommendations on related books and products. Facebook and Twitter also rely on discovery capabilities to recommend friends.

But it isn’t just these powerhouses that are using SDA. Health care systems are using it to help them personalize care. For example, by combining genetic information, medical research, and the results of clinical trials, doctors at a major health care provider are better able to navigate their huge volumes of collected data and more quickly tailor their treatment decisions to the needs of each patient. Scientists are also applying SDA to DNA sequencing data in order to make accurate predictions about whether a particular patient is a good candidate for a new drug.

In other areas, organizations in the intelligence and defense industry are using SDA to identify suspicious people and activities even when users don’t know precisely what to search for. Using search, natural language processing, and advanced tools, the system is able to establish connections among disparate documents and information sources that would otherwise seem completely unrelated.

And in ecommerce, even smaller retailers can now combine structured and multi-structured data (for example, product codes and text descriptions) and, using search and advanced discovery techniques, recommend relevant products to customers visiting their websites. With analytics, they can also continually monitor the effectiveness of the system: they can see which strategies are most effective and run experiments on selling strategies to determine where to invest further in improving the system.
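
As a rough sketch of combining the two kinds of data, the following Python scores candidate products by word overlap between text descriptions (Jaccard similarity) and adds a small bonus when the structured category field matches. The catalog, the field names, and the `category_bonus` value are hypothetical; a real retailer would tune or replace this scoring with whatever its analytics show to work.

```python
# Hypothetical catalog mixing structured fields with free text.
CATALOG = [
    {"code": "P100", "category": "footwear", "description": "lightweight trail running shoes"},
    {"code": "P101", "category": "footwear", "description": "waterproof trail hiking boots"},
    {"code": "P200", "category": "apparel", "description": "lightweight running jacket"},
    {"code": "P300", "category": "kitchen", "description": "stainless steel kettle"},
]


def tokens(text):
    """Split a description into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two descriptions have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, catalog, top_n=2, category_bonus=0.25):
    """Return product codes related to `target`, best match first."""
    scored = []
    for item in catalog:
        if item["code"] == target["code"]:
            continue  # never recommend the product being viewed
        score = jaccard(tokens(target["description"]), tokens(item["description"]))
        if item["category"] == target["category"]:
            score += category_bonus  # structured signal
        scored.append((score, item["code"]))
    # Highest score first; ties broken by product code for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for score, code in scored[:top_n] if score > 0]


print(recommend(CATALOG[0], CATALOG))
```

For the trail running shoes, the running jacket ranks first on text overlap alone, followed by the hiking boots, which share one word and the category; the kettle scores zero and is never recommended.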

Getting There
To create an SDA platform, enterprise developers need the following key capabilities:

  • Fast, efficient, scalable search that includes bulk and near-real-time indexing and that is capable of handling billions of records with sub-second search and faceting times
  • Large-scale, cost-effective storage and processing capabilities, including whole data consumption and analysis, experimentation and sampling tools, and, where appropriate, the ability to be distributed in memory
  • Natural language processing and machine learning tools that scale to enhance discovery and analysis on massive data stores
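
For readers unfamiliar with faceting, the toy example below shows the idea at a very small scale: an inverted index answers the keyword query, and a count per field value (the facet) tells the user how the matches break down. This is a teaching sketch with hypothetical records and function names, not how Apache Solr or any other engine is implemented, and it makes no attempt at the scale described above.

```python
from collections import Counter, defaultdict

# Hypothetical product records.
DOCS = [
    {"id": 1, "text": "red running shoes", "brand": "Acme"},
    {"id": 2, "text": "blue running jacket", "brand": "Acme"},
    {"id": 3, "text": "red rain jacket", "brand": "Globex"},
]


def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].lower().split():
            index[term].add(doc["id"])
    return index


def search(index, docs, query, facet_field):
    """Return documents matching all query terms, plus facet counts."""
    terms = query.lower().split()
    if not terms:
        return [], Counter()
    # A document must contain every term (AND semantics).
    ids = set.intersection(*(index.get(term, set()) for term in terms))
    hits = [doc for doc in docs if doc["id"] in ids]
    facets = Counter(doc[facet_field] for doc in hits)
    return hits, facets


index = build_index(DOCS)
hits, facets = search(index, DOCS, "jacket", "brand")
print([doc["id"] for doc in hits], dict(facets))
```

A query for "jacket" matches documents 2 and 3, and the brand facet reports one hit each for Acme and Globex, which is the kind of breakdown a guided-navigation sidebar would display.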

All these capabilities are now readily available, and new tools and interfaces are making it easier for developers to rapidly design, create and deploy prototype applications. Best of all, thanks to open source solutions, such as Apache Solr, Hadoop and Mahout, enterprises can begin the journey without major investment or risk.

Few would argue against the benefits of gaining insight from all the data enterprises collect today. The only questions are what capabilities are needed, how quickly you can obtain them, and what they will cost. The answers are now clear. Finding the value hidden in huge volumes of multi-structured data requires multiple approaches, including search, discovery and analytics. The tools to develop these approaches are now available and relatively easy to use. And unlike major information processing capabilities of the past, you can now achieve SDA with a low-risk investment strategy that aligns with your business goals and delivers a solid and consistent return on investment.

As CEO of LucidWorks, Paul Doscher is responsible for the company's vision and success.
