Subscribe to our Newsletter

Guest blog post by Marc Borowczak

Moving legacy data to modern big data platform can be daunting at times. It doesn’t have to be. In this short tutorial, we’ll briefly review an approach and demonstrate on my preferred data set: This isn’t a ML repository nor a Kaggle competition data set, simply the data I accumulated over decades to keep track of my plastic model collection, and as such definitely meets the legacy standard!

We’ll describe steps followed on a laptop VirtualBox machine running Ubuntu 16.04.1 LTS Gnome. The following steps are then required:

  1. Import the .csv file in MySQL, and optionally backup a compressed MySQL database file.
  2. Connect to MySQL database in Spark 2.0.1 and then access the data: we’ll demonstrate an interactive Python approach using Jupyter PySpark in this post and leave an Rstudio Sparkyl access based on existing methods for another post.

There’s really no need to abandon legacy data: Migrating data to new platform will enable businesses to extract and analyze data on a broader time scale, and open new ways to leverage ML techniques, analyze results and act on findings.

Additional routes methods to import CSV data will be discussed in a forthcoming post.

E-mail me when people leave their comments –

You need to be a member of Hadoop360 to add comments!

Join Hadoop360

Resources

Research