Guest blog post by Alice Xiong
Finally I got some time to share what I learned at the Hadoop Summit this year. Three days, more than 170 talks, a lot to digest, and an overwhelming number of new technologies and ideas. I hope I can briefly and clearly describe the vision I gathered at the summit for you below:
- We learn from history, so let's begin with quiz 1:
When the first big data challenge hit, enterprises adopted RDBMSs (relational databases like MySQL, MS SQL, ...). Now, what percentage of large enterprises does Forrester estimate will adopt Hadoop?
a. 34% b. 50% c. 88% d. 100%
- Big data is the fuel, machine learning is the engine, and Hadoop can speed up the process.
# Data lake: beware of getting drunk on data.
# Loyalty: treating your customers to a tailored, personalized experience is the best way to form a personal relationship.
- Hadoop doesn’t refer to just one tool; it means the Hadoop ecosystem, with HDFS as its foundation. The other free, open-source Apache data tools (like Hive, Pig, PySpark, Zeppelin, HBase, Airflow, ...) are built on top of HDFS. Together they enrich the query functions and modeling libraries for structured, semi-structured, and unstructured big data, making it enterprise ready. Note that the big data challenge is not only about larger volume; it also means complicated data structures.
- The coming age: digital disruption has become a way of life. (Geoffrey Moore)
Data comes in three waves:
Supply chain coordination (1990-2000): PCs + ERP + Internet = Systems of Record
Customer satisfaction (2010-2020): Smartphones + SaaS + Public Cloud = Systems of Engagement
Trend detection (future): Smart sensors + Analytics + Private Cloud = Systems of Intelligence
- One community member (David, a VP of security) shared his story with us: “My company adopted Hadoop 2 years ago. Cool technology developed by those smart people, but what am I going to do with it?” He told himself, “Kill the fear, just do it. You need to step out of your comfort zone to embrace change.”
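The ecosystem point above is easier to see concretely: most of those tools ultimately layer query or workflow abstractions over Hadoop's MapReduce model. A toy word count in plain Python (a conceptual sketch only, no Hadoop involved) shows the map and reduce phases:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts per word (in real Hadoop, a shuffle
    step groups the pairs by key before reducers run)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is the fuel", "machine learning is the engine"]
print(reduce_phase(map_phase(lines)))
# 'is' and 'the' each appear twice; every other word once
```

Hive and Pig essentially compile SQL-like or scripting-style queries down to jobs of this shape, which is what makes them approachable for analysts.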
Some open source tools demonstrated there:
- Arun Murthy, cofounder of Hortonworks, demoed a new version of the PySpark notebook with plotting features, which will be pushed to the sandbox a few weeks later. In addition to Java and Scala, it now has solid Python integration. Overall, it provides a unified data science environment.
- Airflow, developed by Airbnb, which analyzes clickstream data to help travelers decide where to go. It models workflows as DAGs (directed acyclic graphs) and ships with a rich web UI; 5 companies are using it now (www.nerds.airbnb.com).
- Zeppelin: a notebook for PySpark.
- Giraph, developed at Facebook: compared with Hive it is 10X faster, since it keeps all data in memory.
- HBase: not particularly friendly to enterprise users so far, but it enables the shift from “batch analysis” to “real-time analysis”. (Improving medical documentation demands real-time data: real-time alerts to physicians or nurses.)
- Kafka: it is publish-subscribe messaging rethought as a distributed commit log.
- Presto: a distributed SQL query engine for running interactive analytic queries against big data sources. (Netflix runs interactive queries at petabyte scale.)
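Kafka’s “distributed commit log” description above can be illustrated in miniature: producers append records to an ordered log, and each consumer tracks its own offset into it. A single-process toy version (nothing here is Kafka’s actual API; it is only the concept):

```python
class CommitLog:
    """Toy single-partition log: an append-only, ordered list of records."""
    def __init__(self):
        self.records = []

    def publish(self, record):
        self.records.append(record)          # append-only, order preserved

    def poll(self, offset, max_records=10):
        """Read records starting at `offset`; the consumer, not the log,
        owns and commits its position."""
        batch = self.records[offset:offset + max_records]
        return batch, offset + len(batch)    # records + new offset to commit

log = CommitLog()
for event in ["click:home", "click:search", "purchase:123"]:
    log.publish(event)

# Independent subscribers each keep their own position in the log.
batch, offset = log.poll(0)
print(batch)   # ['click:home', 'click:search', 'purchase:123']
batch, _ = log.poll(offset)
print(batch)   # [] -- caught up; new records would appear here
```

Because consumers only move a pointer instead of deleting messages, many subscribers can replay the same stream at their own pace, which is what makes the commit-log framing different from a traditional message queue.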