Guest blog post by Vincent Granville
by Ravi Kalakota
Keep hearing about Big Data and Hadoop? Having a hard time understanding what is behind the curtain?
Hadoop is an emerging framework for Web 2.0 and enterprise businesses that face data-deluge challenges: storing, processing, and analyzing large amounts of data as part of their business requirements.
The continuous challenge online is improving site relevance and performance, understanding user behavior, and generating predictive insight. This is a never-ending arms race as each firm tries to become the portal of choice in a fast-changing world. Take, for instance, the competitive world of travel. Every site has to get better at analytics and machine learning because the contextual data changes by the second: inventory, pricing, recommendations, economic conditions, natural disasters, and so on.
Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. This will be a huge shift in how IT apps are engineered.
Hadoop Quick Overview
Traditional relational databases and data warehouse products excel at OLAP and OLTP workloads over structured data. These form the underpinnings of most IT applications.
Hadoop is designed to solve a different problem: the fast, reliable analysis of structured, unstructured, and complex data. Hadoop and related software are designed for the three V's:
- Volume – Commodity hardware and open source software lowers cost and increases capacity
- Velocity – Data ingest speed aided by append-only and schema-on-read design
- Variety – Multiple tools to structure, process, and access data
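The "schema-on-read" design mentioned under Velocity is worth a concrete illustration. In a traditional warehouse, data must match a schema before it can be written; in Hadoop-style systems, raw records are appended as-is and structure is imposed only when the data is read. A minimal sketch in Python (the field names and defaults here are invented for illustration, not taken from any real system):

```python
import json
import io

# Schema-on-read: raw events are stored exactly as they arrived (here,
# JSON lines in an in-memory stream standing in for HDFS files), and a
# schema is applied only at read time.
raw_log = io.StringIO(
    '{"user": "alice", "action": "view", "page": "/home"}\n'
    '{"user": "bob", "action": "click"}\n'  # missing "page" is fine at write time
)

def read_events(stream, schema):
    """Impose a schema (field -> default value) while reading, not while writing."""
    for line in stream:
        record = json.loads(line)
        yield {field: record.get(field, default) for field, default in schema.items()}

schema = {"user": None, "action": None, "page": "(unknown)"}
events = list(read_events(raw_log, schema))
# Every record now conforms to the schema, even though the raw store never enforced it.
```

Because nothing is validated on write, ingest stays fast and append-only; the cost of interpretation is paid by each reader, which can choose its own schema.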
As a result, many IT Engineering teams are deploying the Hadoop ecosystem alongside their legacy IT applications, which allows them to combine old data and new data sets in powerful new ways. It also allows them to offload analysis from the data warehouse.
Technically, Hadoop consists of two elements: reliable data storage using the Hadoop Distributed File System (HDFS) and a high-performance parallel, distributed data-processing framework called MapReduce. For more, see our primer on Big Data, Hadoop and in-memory Analytics.
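The MapReduce model is easy to see in miniature. A toy single-process sketch of the classic word-count job (this simulates the map, shuffle, and reduce phases in plain Python; a real Hadoop job would distribute each phase across the cluster):

```python
from collections import defaultdict
from itertools import chain

# Map: emit (key, value) pairs from each input record.
def map_phase(document):
    for word in document.split():
        yield (word.lower(), 1)

# Shuffle: the framework groups all values by key across mappers.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: aggregate each key's values into a final result.
def reduce_phase(key, values):
    return (key, sum(values))

documents = ["Hadoop stores data", "Hadoop processes data in parallel"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["hadoop"] == 2 and counts["data"] == 2
```

Because the map and reduce functions are independent per record and per key, the framework can run thousands of them in parallel on shared-nothing servers, which is what makes the model scale.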
Hadoop runs on a collection/cluster of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop is self-healing. It can deliver data — and can run large-scale, high-performance processing jobs — in spite of system changes or failures.
The Hadoop Stack
It’s important to differentiate Hadoop itself from the broader Hadoop stack. Firms like Cloudera sell a set of capabilities around Hadoop called Cloudera’s Distribution for Hadoop (CDH).
This is a set of projects and management tools designed to lower the cost and complexity of administration and production support services; this includes 24/7 problem resolution support, consultative support, and support for certified integrations.
The introduction of the Hadoop stack is changing the business intelligence market (reporting, analytics, data mining), which has been dominated by very expensive relational database and data warehouse appliance products.
What is Hadoop good for?
Searching, log processing, recommendation systems, data warehousing, video and image analysis, and archiving appear to be the initial uses. One prominent space where Hadoop is playing a big role is data-driven websites. The three primary areas include:
1) To aggregate “data exhaust” — messages, posts, blog entries, video clips, maps, web graph….
2) To give data context — friends networks, social graphs, recommendations, collaborative filtering….
3) To keep apps running — web logs, system logs, system metrics, database query logs….
Let’s look at a few real-world examples from LinkedIn, CBS Interactive, Explorys and FourSquare. Walt Disney, Wal-Mart, General Electric, Nokia, and Bank of America are also applying Hadoop to a variety of tasks, including marketing, advertising, and sentiment and risk analysis. IBM used the software as the engine for its Watson computer, which competed with the champions of the TV game show Jeopardy!
Hadoop @ LinkedIn
LinkedIn is a massive data hoard whose value is connections. It currently computes more than 100 billion personalized recommendations every week, powering an ever-growing assortment of products, including Jobs You May Be Interested In, Groups You May Like, News Relevance, and Ad Targeting. LinkedIn leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn’s 100-million-member base. LinkedIn then uses Lucene to serve real-time recommendations, and also runs Lucene on Hadoop to bridge offline analysis with user-facing services.
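The offline-batch-plus-online-serving pattern described here is common: a Hadoop job precomputes recommendation data overnight, and a serving layer looks it up at request time. A purely illustrative sketch of the batch half (this is not LinkedIn's actual pipeline; the session data and item names are invented) using item co-occurrence counts:

```python
from collections import Counter
from itertools import combinations

# Hypothetical user sessions: each list is the set of items one user interacted with.
sessions = [
    ["job_a", "job_b", "job_c"],
    ["job_a", "job_b"],
    ["job_b", "job_c"],
]

# Batch phase (the part a Hadoop job would do offline): count how often
# each pair of items appears together in a session.
cooccur = Counter()
for items in sessions:
    for x, y in combinations(sorted(set(items)), 2):
        cooccur[(x, y)] += 1
        cooccur[(y, x)] += 1

# Serving phase: "users who viewed this also viewed..." is now a cheap lookup.
def recommend(item, k=2):
    scored = [(other, n) for (a, other), n in cooccur.items() if a == item]
    return [other for other, _ in sorted(scored, key=lambda p: -p[1])[:k]]

# recommend("job_a") returns ["job_b", "job_c"]: job_b co-occurs twice, job_c once.
```

The design point is the split: the expensive aggregation runs offline at Hadoop scale, while the user-facing service only performs fast lookups against the precomputed results.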