By now, you have probably heard of the Hadoop Distributed File System (HDFS), especially if you are data analyst or someone who is responsible for moving data from one system to another. However, what are the benefits that HDFS has over relational databases?
HDFS is a scalable, open source solution for storing and processing large volumes of data. HDFS has been proven to be reliable and efficient across many modern data centers.
HDFS utilizes commodity hardware along with open source software to reduce the overall cost per byte of storage.
With its built-in replication and resilience to disk failures, HDFS is an ideal system for storing and processing data for analytics. It does not require the underpinnings and overhead to support transaction atomicity, consistency, isolation, and durability (ACID) as is necessary with traditional relational database systems.
Moreover, when compared with enterprise and commercial databases, such as Oracle, utilizing Hadoop as the analytics platform avoids any extra licensing costs.
One of the questions many people ask when first learning about HDFS is: How do I get my existing data into the HDFS?
In this article, we will examine how to import data from a PostgreSQL database into HDFS. We will use Apache Sqoop, which is currently the most efficient, open source solution to transfer data between HDFS and relational database systems. Apache Sqoop is designed to bulk-load data from a relational database to the HDFS (import) and to bulk-write data from the HDFS to a relational database (export).
The steps in this tutorial are written for someone with a basic knowledge of executing SQL queries and an elementary knowledge of HDFS commands.
The database system used is PostgreSQL 9.5 for Windows, and the HDFS version is Cloudera Hadoop 2.5.0-cdh5.2.0 on a Centos 6.4 Linux virtual machine.
Apache Sqoop relies on the JDBC driver JAR files that are specific to the relational database vendor and database version.
To execute the steps shown in this article, the user will need permissions to connect remotely to the PostgreSQL database,
SELECT permissions on the relational database, write permissions on the HDFS, and execute permissions on the Sqoop executable.
For the purpose of this tutorial, we created a PostgreSQL database, named it Toptal, and made it accessible through port 5432.
PostgreSQL Data Source
To get started, in our PostgreSQL
Toptal database, we will create a test data table named
sales. We will assume that the OpenSSL certificate and private key files already exist on the PostgreSQL server.
Next, we will insert 20 rows into the table:
Import into the HDFS using Sqoop
With the data source defined, we are now ready to import the data into the HDFS. The
sqoop command we will examine is listed below, and we will break down each argument in the bullet points that follow. Note that the command is supposed to be on one complete line or, as shown below, with the backslash (the Linux command line continuation character) at the end of each line except the last.
sqoop import- The executable is named
sqoop, and we are instructing it to import the data from a table or view from a database to the HDFS.
--connect- With the
--connectargument, we are passing in the JDBC connect string for PostgreSQL. In this case, we use the IP address, port number, and database name. We also need to specify that SSL is being utilized and need to supply the
SSLSocketFactoryclass to be used.
--username- In this example, the username is a PostgreSQL login, not a Windows login. The user must have permissions to connect to the specified database and to select from the specified table.
-P- This will prompt the command line user for the password. If Sqoop is rarely executed, this might be a good option. There are multiple other ways to pass the password to the command automatically, but we are trying to keep it simple for this article.
--table- This is where we pass in the name of the PostgreSQL table.
--target-dir- This argument specifies the HDFS directory where the data is to be stored.
--split-by- We must provide Sqoop with a unique identifier to help it distribute the workload. Later in the job output, we will see where Sqoop selects the minimum and maximum values to help set split boundaries.
It is a good idea to put the command in a script for repeatability and editing purposes, as shown below:
Now, it is time to execute the above Sqoop command script. The output from the Sqoop command is shown below.
After executing the Sqoop command, we can execute the
hdfs dfs -ls command to see the directory that was created by default with the table name on the HDFS.
We can use the
hdfs dfs -ls command again to list the contents of the
sales directory. If you look on the HDFS, you can notice the data is partitioned and spread across four files by default, not just contained in one.
hdfs dfs -cat command will display all of the records in the first partition of the sales data on the HDFS.
Notice that the default file delimiter is a comma. Also, notice that there are only five rows in each partition, because the 20 rows in the source have been equally distributed across the four partitions.
To limit the number of rows that are output to the screen, we can pipe the output of the
cat command to the
head command as shown below, to check the contents of the other three partitions.
-n 5 argument to the
head command limits the screen output to the first five rows.
(Note that in our case, this is unnecessary since there are only five rows in each partition to begin with. In practice, though, you will probably have many more rows than this in each partition and will want to just check the first few to make sure that they look right, so this shows you how to do so.)
If you need to run a query to extract data from multiple tables in the PostgreSQL database, this can be accomplished with the following command:
In the above command, we use some of the same arguments to the Sqoop command, but they take on differing importance when used with a SQL command.
--target-dir- The target directory tells Sqoop in which directory on the HDFS to store the selected data. This argument is required by Sqoop when using a free-form query.
--split-by- Even though we are selecting the primary key of the sales table, we still have to provide Sqoop with a unique identifier to help it distribute the workload.
--query- This is the argument in which we supply the SQL query. The query above is enclosed in double quotes. Notice that there is not a backslash (the line continuation character) in the multiple lines containing the query. Also notice the
and \$CONDITIONSat the end of the
WHEREclause. This is required by Sqoop because Sqoop will automatically replace the
$CONDITIONStoken with a unique expression.
Issues or No Issues: You Should Consider HDFS
HDFS has many advantages over the relational databases. If you are doing data analysis, you should consider migrating your data to HDFS, today.
With the skills learned here, importing data from a relational database system into HDFS is a simple and straightforward process that can be accomplished with a single command. While these examples have a small number of rows, the mechanics of importing large volumes of data to HDFS from a PostgreSQL database table remains the same.
You can even experiment with importing large tables and varying storage delimiters. Using Apache Sqoop is more efficient than exporting the database data to a file, transferring the file from the database server to the HDFS, and then loading the file to the HDFS.