Guest blog post by Rahul Patodi
How to use s3 (s3 native) as input / output for hadoop MapReduce job. In this tutorial we will first try to understand what is s3, difference between s3 and s3n and how to set s3n as Input and output for hadoop map reduce job. Configuring s3n as I/O may be useful for local map reduce jobs (ie MR run on local cluster), But It has significant importance when we run elastic map reduce job (ie when we run job on cloud). When we run job on cloud we need to specify storage location for input as well as output, which is available for storage as well as retrieval. In this tutorial we will learn how to specify s3 for input / output.
What is S3: Amazon s3 (Simple Storage Service) is a data storage service. Amazon s3 is storage for the Internet. It is designed to make web-scale computing easier for developers.
Amazon s3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers. You are billed monthly for storage and data transfer. Transfer between s3 and Amazon EC2 is free. This makes use of s3 attractive for Hadoop users who run clusters on EC2.
Hadoop provides two filesystems that use S3:
S3 Native FileSystem (URI scheme: s3n)
A native filesystem for reading and writing regular files on s3. The advantage of this filesystem is that you can access files on s3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by s3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).
S3 Block FileSystem (URI scheme: s3)
A block-based filesystem backed by s3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other s3 tools.
There are two ways that s3 can be used with hadoop's Map/Reduce, either as a replacement for HDFS using the s3 block filesystem (i.e. using it as a reliable distributed filesystem with support for very large files) or as a convenient repository for data input to and output from MapReduce, using either s3 filesystem. In the second case HDFS is still used for the Map/Reduce phase. Note also, that by using s3 as an input to MapReduce you lose the data locality optimization, which may be significant.
In this tutorial we will configure S3n as input / output for hadoop MR jobs, as we can upload any file (file of any format ) for input.
Using S3n as Input / Output for Hadoop MapReduce job:
Now add the following entry to hdfs-site.xml
$ vi HADOOP_INSTALL_DIR/conf/hdfs-site.xml
Now restart Hadoop Daemons:
Now submit a MapReduce job:
$ HADOOP_INSTALL_DIR/bin/hadoop jar hadoop-*-examples.jar wordcount s3n://BUCKET-NAME/ s3n://BUCKET-NAME/DIRECTORY-NAME
Note that for input you must create bucket manually and upload files in that bucket, for output you must create bucket and directory specified for output must not already exist
You can submit MapReduce job, without setting configuration parameters:
Note that since the secret access key can contain slashes, you must remember to escape them by replacing each slash / with the string %2F.
YOUR-AWS-ACCESS-ID: In the web browser click “account >>security credentials” under heading “access credentials >> access keys”
YOUR-AWS-ACCESS-KEY: In the web browser click “account >>security credentials” under heading “access credentials >> access keys