Guest blog post by Alice Xiong
- We learn from history, so let’s begin with quiz 1:
- a. 34% b. 50% c. 88% d. 100%
- Big data is the fuel, machine learning is the engine, and Hadoop can speed up the process.
- Hadoop doesn’t refer to just one tool; it means the Hadoop ecosystem, with HDFS as its foundation. The other free, open source Apache data tools (like Hive, Pig, PySpark, Zeppelin, HBase, Airflow, …) are built on top of HDFS. Together they enrich the query functions and modeling libraries available for structured, semi-structured, and unstructured big data, making it enterprise ready. Note that the big data challenge is not only about larger volume but also about more complicated data structures.
- Coming of age: digital disruption has become a way of life. (Geoffrey Moore)
- One community member (David, a VP of security) shared his story with us: “My company adopted Hadoop 2 years ago. It’s cool technology developed by smart people, but what am I going to do with it…?” He told himself, “Kill the fear, just do it. You need to step out of your comfort zone to embrace change.”
Some open source tools demonstrated there:
- Arun Murthy, cofounder of Hortonworks, demonstrated a new version of the PySpark notebook with plotting features, which will be added to the sandbox in a few weeks. In addition to Java and Scala, it now has solid Python integration. Overall, it provides a unified data science environment.
- Airflow: developed by Airbnb, whose team analyzes clickstream data to help travelers decide where to go. It models workflows as DAGs (directed acyclic graphs) and has a rich web UI; 5 companies are using it now. www.nerds.airbnb.com
- Zeppelin: PySpark notebook
- Giraph: developed by Facebook. Reportedly 10X faster than Hive, since it processes all data in memory.
- HBase: not particularly friendly to enterprise users so far, but it enables the move from “batch analysis” to “real-time analysis”. (For example, improving medical documentation demands real-time data, such as real-time alerts to physicians or nurses.)
- Kafka: publish-subscribe messaging rethought as a distributed commit log.
- Presto: a distributed SQL query engine for running interactive analytic queries against big data sources. (Netflix: interactive queries at petabyte scale
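To make the Kafka bullet's "commit log" idea a bit more concrete, here is a minimal Python sketch. It is a toy, in-memory, single-process stand-in and not Kafka's actual API; the `CommitLog` class and its `publish`, `poll`, and `seek` methods are hypothetical names chosen for illustration. The point it tries to show is that publishers only ever append, while each subscriber keeps its own read position, so subscribers read independently and can replay old messages.

```python
from collections import defaultdict


class CommitLog:
    """Toy append-only log (hypothetical; not the Kafka client API)."""

    def __init__(self):
        self._records = []                # messages, in arrival order
        self._offsets = defaultdict(int)  # subscriber name -> next offset to read

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, subscriber, max_records=10):
        """Return the subscriber's unread messages and advance its offset."""
        start = self._offsets[subscriber]
        batch = self._records[start:start + max_records]
        self._offsets[subscriber] = start + len(batch)
        return batch

    def seek(self, subscriber, offset):
        """Move a subscriber's read position; the log itself is never modified."""
        self._offsets[subscriber] = offset


log = CommitLog()
for event in ["click:home", "click:search", "click:listing"]:
    log.publish(event)

alerts_seen = log.poll("alerts")           # reads all three events
dashboard_seen = log.poll("dashboard", 2)  # separate offset, reads only two
log.seek("alerts", 0)                      # rewind the "alerts" subscriber
replayed = log.poll("alerts")              # same three events again
```

Because reading never removes anything from the log, a new subscriber can be added later and still see every earlier message. A real distributed log would also have to handle partitioning, replication, and retention, which this sketch leaves out.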