Apache Hadoop has addressed two of the growing pains organizations face as they attempt to make sense of larger and larger sets of data in order to out-innovate their competition, but four more remain unaddressed. The lesson: Just because you can avoid designing an end-to-end data supply chain when you start storing data doesn’t mean you should. Architecture matters. Having a plan to reduce the cost of getting answers and simultaneously scale its utility to the broader organization means adding new elements to Hadoop. Fortunately, the market is addressing this need.
The Need for a Single Repository for Big Data
Traditional databases were not the repository we needed for big data, 80 percent of which is unstructured. Hadoop offered us, for the first time, the ability to keep all the data in a single repository, addressing the first big data growing pain. Hadoop is becoming a bit bucket that can store absolutely everything: tabular data, machine data, documents, whatever. In most ways, this is a great thing because data becomes more valuable when it is combined with other data, just like an alloy of two metals can create a substance that is stronger and more resilient. Having lots of different types of data in one repository is a huge long-term win.
No Standard Way to Create Applications to Leverage Big Data
Once we have all the big data in a single repository, the next trick is to create applications that leverage that data. Here again Hadoop has done a fine, if complicated, job. MapReduce provides the plumbing for applications that can benefit from a parallel divide-and-conquer strategy to analyze and distill data spread over the massive Hadoop Distributed File System (HDFS). YARN, introduced in Hadoop 2 as part of a refactoring of Hadoop 1, formalizes resource management APIs and allows more parallelization models to be used against a given cluster. This is all great, but it leaves one huge problem: Most of the people who have questions have no way to access the data in Hadoop. A financial analyst at Thomson Reuters or a buyer at Bloomingdale’s wants answers that their data can provide, but in the main these folks don’t know how to write programs that access Hadoop; it’s not their expertise.
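To make the divide-and-conquer idea concrete, here is a minimal sketch of the map, shuffle, and reduce phases in plain Python. It is not the Hadoop API and runs in a single process; the function names (map_phase, shuffle, reduce_phase, word_count) are invented for illustration. Even for a question as simple as "how often does each word appear?", the person asking has to think in terms of key/value pairs and phases, which is exactly the expertise most analysts lack.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: turn one input record into a list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in sequence over an iterable of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

On a real cluster the map and reduce calls would run in parallel on many machines and the shuffle would move data across the network; the sketch only shows the shape of the program the developer has to write.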
What this leaves us with is a situation that is mighty familiar to people who have worked in large organizations that have implemented traditional Business Intelligence stacks. Questions are plentiful and programs that can help answer those questions are scarce. Pretty soon, there’s a huge backlog that only programmers can help with. All that wonderful stored data is longing to answer questions. All those analysts and businesspeople want answers. The complexity of creating applications hinders progress.
Most companies get stuck at this point, which I call the “dark valley of Hadoop.” The way out is to find a way to address the remaining growing pains.
The Gap between Analysts and Big Data
The third growing pain is to close the gap between the analysts and the data. Splunk’s Hunk is perhaps the most promising technology to deliver a true interactive experience. Especially powerful are Splunk’s capabilities for discovering the structure of machine data and other unstructured data on the fly. My view is that Drill and Impala are addressing a similar need. But it is a mistake to think of this interactivity as closing the entire gap because it’s really the missing productivity that must be addressed.
To accelerate the pace of creating Hadoop applications, comfortable and powerful abstractions must be delivered to power users, analysts, and developers that allow them to express the work of the application at as high a level as possible. For that abstraction to work, it must mean that you can create applications without being a Hadoop expert. For power users and analysts, domain-specific languages like Cascalog allow an application to be built by expressing a series of constraints on the data. Concurrent’s Lingual project allows applications to be expressed as SQL and then translated into Hadoop jobs. For developers, Concurrent’s Cascading library allows the application to be expressed in terms of APIs that hide the details of Hadoop from power users. In addition to Hunk, Pentaho, Alpine Data, Teradata Aster, and Pivotal all offer higher-level ways to design and build applications. Effective higher-level abstractions are crucial to productivity.
In other words, the way to address this growing pain is to hide the complexity of Hadoop so that analysts can get work done without having to become Hadoop experts.
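As a rough illustration of what "expressing the work at a high level" means, the sketch below lets the caller state what they want (which rows, grouped by which field, summing which field) while the grouping and aggregation mechanics stay hidden. This is plain Python, not Cascalog, Lingual, or Cascading; run_query, its parameters, and the sample field names are all hypothetical.

```python
from collections import defaultdict


def run_query(rows, where, group_by, total):
    """Filter rows, group them by one field, and sum another field per group.

    rows     -- iterable of dicts (one dict per record)
    where    -- predicate deciding which rows to keep
    group_by -- name of the field to group on
    total    -- name of the numeric field to sum within each group
    """
    sums = defaultdict(float)
    for row in rows:
        if where(row):
            sums[row[group_by]] += row[total]
    return dict(sums)


if __name__ == "__main__":
    # Hypothetical sales records, invented for this example.
    sales = [
        {"region": "east", "store": "A", "amount": 10.0},
        {"region": "east", "store": "A", "amount": 5.0},
        {"region": "west", "store": "B", "amount": 7.0},
    ]
    east_totals = run_query(
        sales,
        where=lambda r: r["region"] == "east",
        group_by="store",
        total="amount",
    )
    print(east_totals)
```

The caller never mentions keys, shuffles, or jobs. A real abstraction layer would go further and plan distributed jobs from a description like this, but the division of labor is the same.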
Problems in Processing Big Data
Big data is almost always turned from a raw metal to a valuable alloy through a process that involves many steps. To continue the metaphor, raw metal must be refined before it is forged. This is costly and it is where Hadoop excels. Analysts can ask and answer certain questions using an interactive system, but the data must be cleansed beforehand, resulting in a complex upstream workflow. Typically these workflows are built in a brittle fashion and are difficult to test and debug. Most frequently, the workflow itself is applied to a new dataset where the results are a machine-learning model or a set of analytics. Any benefit to loading and subsequent indexing in interactive tools is lost in these cases.
The fourth growing pain is the ability to manage the execution of a cascading chain of workloads or applications. This can be tricky for many reasons. If you cobble together lots of applications in a complex chain using a wildly varying set of tools within arm’s reach, something data scientists love to do, you end up with a complicated workflow, or, to use a better phrase, a business data process. If something goes wrong in one of the intermediate applications, it is often impossible to figure out how to fix the process. You have to start the whole thing over. Tools such as Pig, Sqoop, and Oozie offer alternative ways to express the problem, but ultimately do not fix the underlying issue.
The right way to manage the cascading chain of data-oriented applications is, again, to hide the problem through abstraction. The higher-level abstractions mentioned in growing pain 3 should be generating the complex set of Hadoop jobs and managing their execution. Then, through indirection, it’s okay to combine a number of Hadoop applications or workloads in a chain, but when your business process depends on dozens of steps, all knit together by hand, something is bound to go wrong.
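One thing "managing execution" can mean in practice is remembering which steps in a chain have already succeeded, so a failure in the middle does not force a restart from the beginning. The following is a minimal sketch of that idea in plain Python, under the assumption that each step is a function taking the previous step's output. The names run_chain, steps, and checkpoints are hypothetical, and this is not how any particular Hadoop tool is implemented.

```python
def run_chain(steps, data, checkpoints):
    """Run named steps in order, reusing results already saved in checkpoints.

    steps       -- list of (name, function) pairs, executed in order
    data        -- input to the first step
    checkpoints -- dict mapping step name to its saved output; it is
                   updated in place so a later call can resume
    """
    for name, func in steps:
        if name in checkpoints:
            # This step succeeded on an earlier run: reuse its output.
            data = checkpoints[name]
            continue
        data = func(data)          # may raise; earlier checkpoints survive
        checkpoints[name] = data   # record success before moving on
    return data


if __name__ == "__main__":
    def clean(rows):
        return [r.strip() for r in rows if r.strip()]

    def count(rows):
        return len(rows)

    saved = {}
    print(run_chain([("clean", clean), ("count", count)], [" a ", "", "b"], saved))
    print(sorted(saved))
```

If the second step raises an error, the output of the first is still in the checkpoints dict, so calling run_chain again with the same dict skips straight to the failed step. A production system would persist checkpoints to durable storage instead of an in-memory dict.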
One Size Does Not Fit All
Today Hadoop is virtually synonymous with big data. In reality, Hadoop is not all things to all people, nor is it all things to big data. Business requirements for more timely insight have introduced lower latency requirements as well as newer, less mature technologies to solve these problems and glean insight from sources such as real-time streaming data. As they face these requirements, users are cast back into the darkness, trying once again to make sense of when they should use these new tools and how they should glue them together to form a coherent big data architecture.
From Data for Business to the Business of Data
Looking down the road a bit, there is a progression in maturity for using data. Many organizations derive business insights from their data. Once you see that your data is actually transformative to your business, you have hit a new level of maturity. More and more organizations are realizing that they are actually in the business of data. At this point, the growing pain becomes acute, particularly if you see in horror that the tools you have chosen to assemble your data workflows are nothing but a house of cards.
The most sophisticated companies using big data, and you know who I mean, have been aggressive about solving all these growing pains. These companies, like yours, don’t want talented people struggling with Hadoop. Instead, they want them focused on the alchemy of data, seeing what the data has to say and putting those insights to work, and inserting them into their business processes.
The masters of big data have created a coherent architecture that allows the integration of new tools. They have created robust big data workflows and applications that support the transition from deriving insights from big data to being in the business of data. That’s the end game, the real value from big data that makes all of the growing pains worthwhile.
Gary Nakamura is Chief Executive Officer of Concurrent.