What is Hadoop?

November 19, 2012

By Rachel Ramsey, TMCnet Web Editor

Smartphones, tablets, MP3 players, SMS messaging, YouTube, Facebook, online banking and Wi-Fi are all examples of everyday technologies that people use frequently and that help create data. The big data phenomenon came about because of the sheer amount of data now generated by mobile devices, social networks, Internet searches, e-commerce, video archives and other advances in technology. Big data refers to collections of data sets so large and complex that they are impractical to process with conventional databases and tools.

Companies pursue big data because it can reveal business trends, improve research quality and yield insights in fields ranging from IT to medicine to law enforcement.

Hadoop was born out of a need to process big data as the amount of generated data continued to increase rapidly. As the Web produced more and more information, indexing its content became challenging, so Google introduced MapReduce in 2004. Hadoop was then created as a way to implement MapReduce, with much of its early development taking place at Yahoo. MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of computers. Google MapReduce and Hadoop are two different implementations of the MapReduce model. Hadoop is now an open-source Apache project for storing and processing data.
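To make the MapReduce programming model concrete, here is a minimal single-machine sketch in plain Python that counts words. It is an illustration only and does not use Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for this example. In a real cluster, the framework would run the map and reduce steps on many machines and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    docs = ["big data is big", "hadoop processes big data"]
    print(word_count(docs))
```

The point of the split is that each map call looks at only one document and each reduce call looks at only one key, so both steps can be spread across many machines without the programmer writing any coordination code.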

Overall, Hadoop enables applications to work with huge amounts of data stored across many servers. Hadoop lets data be drawn from wherever it resides (data is now often distributed across many locations, for example in the cloud, rather than centralized) and uses MapReduce to push the query code out to that data, run the analysis and return the desired results.

As for the more specific functions, Hadoop has a large-scale file system (Hadoop Distributed File System, or HDFS) for storing the data, along with a MapReduce layer in which users write programs. Hadoop manages the distribution of those programs across the servers, then accepts their results and generates a data result set.
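The flow described above, in which the program is sent to the servers that hold the data and the partial results are then collected into one result set, can be sketched as a toy simulation. This is not Hadoop code: the `CLUSTER` dictionary, the server names and the functions `local_query`, `run_on_cluster` and `merge` are all hypothetical, and Python threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each "server" holds its own slice of the records.
CLUSTER = {
    "server-a": [120, 45, 300],
    "server-b": [80, 15],
    "server-c": [210, 95, 60, 5],
}


def local_query(records):
    """The 'query code' shipped to a server: summarize only the local records."""
    return {"count": len(records), "total": sum(records)}


def run_on_cluster(cluster, query):
    """Run the query against every server's data in parallel; collect partial results."""
    with ThreadPoolExecutor(max_workers=len(cluster)) as pool:
        futures = {name: pool.submit(query, data) for name, data in cluster.items()}
        return {name: future.result() for name, future in futures.items()}


def merge(partials):
    """Combine the per-server summaries into one final result set."""
    count = sum(p["count"] for p in partials.values())
    total = sum(p["total"] for p in partials.values())
    return {"count": count, "total": total, "mean": total / count}


if __name__ == "__main__":
    partial_results = run_on_cluster(CLUSTER, local_query)
    print(merge(partial_results))
```

Only the small summaries travel back to the caller, never the raw records, which is the main reason for moving the code to the data rather than the data to the code.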

Developed by Doug Cutting, Cloudera's chief architect and the chairman of the Apache Software Foundation, Apache Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data. It has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data. Hadoop was named after Cutting’s son’s toy elephant.

Image: cdn.blog-sap.com/innovation/files/2012/09/hadoop-elephant.jpg (via SAP)

Instead of relying on expensive, proprietary hardware and separate systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it scales out as more servers are added. In today’s hyper-connected world, where more and more data is created every day, Hadoop’s advantages mean that businesses and organizations can now find value in data that until recently was considered useless.

Edited by Rachel Ramsey
