Introduction to Apache Spark

Guest blog post by Zygimantas Jacikevicius

New technologies continue to emerge, enabling faster data processing and more advanced analytics. The Hadoop platform was a breakthrough in this space: it solved many of the storage and retrieval challenges for very large and varied datasets by dividing the data and the processing across multiple machines. This was faster, more cost-effective and less prone to failure than a traditional RDBMS. Although Hadoop was a big step forward and made it easier to store, process and retrieve data in a schemaless environment, it is already 10 years old and is not well suited to multi-pass computations. With Hadoop, the output of each step in a job has to be written to storage before the next step can start, and the replication and disk I/O involved slow things down. Apache Spark addresses this problem by supporting multi-step data pipelines and allowing jobs to run in memory.
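The difference between staging every step on disk and chaining steps in memory can be sketched in a few lines of plain Python. This is only a toy model, not Hadoop or Spark code: the function names (`tokenize`, `count_words`, `run_disk_staged`, `run_in_memory`) and the sample data are made up for illustration, and a local temporary file stands in for replicated cluster storage.

```python
import json
import os
import tempfile


def tokenize(lines):
    """Step 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count_words(words):
    """Step 2: count how often each word occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def run_disk_staged(lines, workdir):
    """Disk-staged style: write the output of step 1, then reload it for step 2."""
    stage_path = os.path.join(workdir, "stage1.json")
    with open(stage_path, "w") as f:
        json.dump(tokenize(lines), f)
    with open(stage_path) as f:
        words = json.load(f)
    return count_words(words)


def run_in_memory(lines):
    """In-memory style: pass the intermediate result straight to the next step."""
    return count_words(tokenize(lines))


# Hypothetical sample input.
lines = ["Spark runs in memory", "Hadoop writes to disk", "Spark is fast"]

with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_disk_staged(lines, workdir)
memory_result = run_in_memory(lines)
```

Both variants produce the same counts; the only difference is the extra write and read between steps, which is the overhead that grows with every additional pass over the data.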

Apache Spark is reported to run programs up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk. As with many Apache projects, it prides itself on simplicity and compatibility. It offers developers concise APIs in Java, Scala and Python. Spark is also not limited to running on top of Hadoop: it can run on other platforms such as Mesos and EC2, or as a standalone cluster.

Apache Spark has some great features that complement its “Lightning-fast cluster computing”. Its high-level libraries currently include Spark SQL, Spark Streaming, MLlib and GraphX. Spark SQL lets users extract, transform and load (ETL) data from formats such as JSON or Parquet and query it via SQL or HiveQL. Spark Streaming builds on Spark’s speed to let users process data in near real time; it represents a stream as a sequence of resilient distributed datasets (RDDs) and processes them in small batches. MLlib is a machine learning library offering algorithms for tasks such as classification, regression and clustering. GraphX is a library for graph processing and graph-parallel computation, so it can be used to analyse relationships in the data.
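The idea behind Spark Streaming, treating a stream as a series of small batches that each go through the same computation, can be illustrated with a minimal plain-Python sketch. It does not use the Spark API: `micro_batches`, `process_batch` and the event list are hypothetical, and batches are cut by record count here for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Group an incoming record stream into batches of at most batch_size."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Per-batch computation: count the error events in one batch."""
    return sum(1 for event in batch if event == "error")


# Hypothetical event stream.
events = ["ok", "error", "ok", "error", "error", "ok", "ok"]

# The same function is applied to every batch as it arrives.
error_counts = [process_batch(b) for b in micro_batches(events, batch_size=3)]
```

Because each batch is an ordinary collection, the same code that analyses stored data can be reused for live data, which is the main appeal of the micro-batch design.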

All in all, Apache Spark is one of the fastest big data analytics engines on the market: it is widely compatible, easy to use and packs a lot of features into one solution.

For more data management solutions and news please visit our website.
