
Guest blog post.

Regularly crunching large amounts of public and proprietary data to make pinpointed predictions is a challenging task. Hadoop provides a very useful data-processing infrastructure layer for that task. However, Python is the programming language of choice for many data scientists, so merging the best of Hadoop's and Python's capabilities becomes imperative. Below, we outline a case study on using Pydoop to crunch data from the American Community Survey (ACS) and home insurance data from vHomeInsurance.

About Pydoop

Pydoop is a Python package for Hadoop, MapReduce, and HDFS. Pydoop has several advantages over Hadoop’s built-in solutions for Python programming. As a CPython package, it allows you to access all standard library and third party modules, some of which may not be available for other Python implementations.

Installing Pydoop

  • Before installing Pydoop, we first need to install Python 2.7 and Apache Hadoop.
  • Pydoop also works with Python 2.6, but in that case we need to install two extra modules:
    • importlib, and
    • argparse (both can be installed with pip)
  • Pydoop itself can be installed with pip using the following command:
    • sudo pip install pydoop

Example for Pydoop: Dividing the ACS Data File into Multiple Files Based on LOGRECNO

LOGRECNO stands for Logical Record Number, which maps to geographic information.

What is ACS?

The American Community Survey (ACS) provides statistics on various population- and housing-related topics at the national, state, and even community scale.

The annual data release of ACS is available in the following format:

ACS FIELDS

Field Name        Description                                      Field Size
FILEID            File Identification                              6 Characters
FILETYPE          File Type                                        6 Characters
STUSAB            State/U.S. Abbreviation (USPS)                   2 Characters
CHARITER          Character Iteration                              3 Characters
SEQUENCE          Sequence Number                                  4 Characters
LOGRECNO          Logical Record Number                            7 Characters
Field #7 and up   Estimates, such as the home insurance and home property value data appended from vHomeInsurance

Sample File: (http://www2.census.gov/acs2011_5yr/summaryfile/2007-2011_ACSSF_By_State_All_Tables/Alabama_Tracts_Block_Groups_Only.zip)

ACSSF,2011e5,al,000,0001,0004634,180,70

ACSSF,2011e5,al,000,0001,0004635,200,71

ACSSF,2011e5,al,000,0001,0004636,279,111

ACSSF,2011e5,al,000,0001,0004637,905,382

ACSSF,2011e5,al,000,0001,0004638,590,228

ACSSF,2011e5,al,000,0001,0004637,259,104

ACSSF,2011e5,al,000,0001,0004637,226,108

ACSSF,2011e5,al,000,0001,0004641,171,58

ACSSF,2011e5,al,000,0001,0004642,580,223

ACSSF,2011e5,al,000,0001,0004643,341,131

ACSSF,2011e5,al,000,0001,0004644,245,112

ACSSF,2011e5,al,000,0001,0004645,381,175

ACSSF,2011e5,al,000,0001,0004646,206,109

ACSSF,2011e5,al,000,0001,0004647,217,89

ACSSF,2011e5,al,000,0001,0004637,386,160

ACSSF,2011e5,al,000,0001,0004649,354,148

ACSSF,2011e5,al,000,0001,0004650,397,149

ACSSF,2011e5,al,000,0001,0004651,243,96

ACSSF,2011e5,al,000,0001,0004637,517,208

ACSSF,2011e5,al,000,0001,0004653,699,272

ACSSF,2011e5,al,000,0001,0004654,389,159
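Based on the field layout above, each sample record can be split into its named components. A minimal sketch in plain Python (the field names follow the ACS table; `parse_acs_record` is an illustrative helper, not part of Pydoop):

```python
# The six standard ACS summary-file fields, in order; everything
# past LOGRECNO is treated as estimate data.
ACS_FIELDS = ["FILEID", "FILETYPE", "STUSAB", "CHARITER", "SEQUENCE", "LOGRECNO"]

def parse_acs_record(line):
    # Records are comma-separated, as in the sample file above.
    parts = line.strip().split(",")
    record = dict(zip(ACS_FIELDS, parts[:6]))
    record["ESTIMATES"] = parts[6:]  # e.g. home insurance / property values
    return record

record = parse_acs_record("ACSSF,2011e5,al,000,0001,0004637,905,382")
print(record["LOGRECNO"])   # -> 0004637
print(record["ESTIMATES"])  # -> ['905', '382']
```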

When we have multiple text files based on the sequence number, we first combine them into a single file. We then create one file per place (LOGRECNO). Using the LOGRECNO, we can map each record to its geographic location. (http://www2.census.gov/acs2011_5yr/summaryfile/UserTools/Geography/al.xls)
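The per-place split described above can be sketched locally without Hadoop; the following is an illustrative grouping step (not the Pydoop job itself), assuming comma-separated records with LOGRECNO in the sixth column:

```python
from collections import defaultdict

def split_by_logrecno(lines):
    # Group records by LOGRECNO (6th comma-separated field);
    # each group would become one per-place file.
    places = defaultdict(list)
    for line in lines:
        logrecno = line.split(",")[5]
        places[logrecno].append(line)
    return places

lines = [
    "ACSSF,2011e5,al,000,0001,0004634,180,70",
    "ACSSF,2011e5,al,000,0001,0004637,905,382",
    "ACSSF,2011e5,al,000,0001,0004637,259,104",
]
places = split_by_logrecno(lines)
print(len(places["0004637"]))  # -> 2
```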

Example of Creating a Place File:

Mapper Function:

  • The mapper function is called once for each line of the input file, and it receives 3 parameters:
    • key: the byte offset of the line within the current input file (in most cases, we may ignore it);
    • value: the line of text to be processed;
    • writer object: a Python object used to write output and update counters.

Reducer Function:

  • The reducer function is called for each unique key produced by the map function. It also receives 3 parameters:
    • key: the key produced by the map function;
    • values iterable: iterate over this parameter to traverse all the values emitted for the current key;
    • writer object: identical to the one given to the map function.

About Writer Object:

  • The writer object is passed to both the mapper and the reducer. It has the following methods:
    • emit(k, v): pass a (k, v) key-value pair to the framework;
    • count(what, how_many): add how_many to the counter named what; if the counter doesn't already exist, it is created dynamically.

mapper.py

def mapper(_, text, writer):
    # ACS summary-file records are comma-separated;
    # LOGRECNO is the sixth field (index 5).
    row = text.split(",")
    logrecno = row[5]
    if logrecno == '0004637':
        writer.emit(logrecno, ",".join(row))

def reducer(key, values, writer):
    # Re-emit each record collected for this LOGRECNO.
    for value in values:
        writer.emit(key, value)
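Before submitting the job to the cluster, the mapper can be exercised with a stub writer that simply records emitted pairs. This is a local sanity check only; `StubWriter` is an illustrative stand-in, not part of the Pydoop API:

```python
class StubWriter:
    # Minimal stand-in for Pydoop's writer object: records what
    # emit() and count() are called with, instead of writing to HDFS.
    def __init__(self):
        self.emitted = []
        self.counters = {}

    def emit(self, k, v):
        self.emitted.append((k, v))

    def count(self, what, how_many):
        self.counters[what] = self.counters.get(what, 0) + how_many

def mapper(_, text, writer):
    # Same logic as mapper.py: keep only records for LOGRECNO 0004637.
    row = text.split(",")
    if row[5] == '0004637':
        writer.emit(row[5], ",".join(row))

writer = StubWriter()
for line in ["ACSSF,2011e5,al,000,0001,0004634,180,70",
             "ACSSF,2011e5,al,000,0001,0004637,905,382"]:
    mapper(None, line, writer)
print(writer.emitted)  # -> [('0004637', 'ACSSF,2011e5,al,000,0001,0004637,905,382')]
```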

How to Execute:

Run the command: pydoop script mapper.py acsfile hdfs_output

Output file of execution:

ACSSF,2011e5,al,000,0001,0004637,905,382

ACSSF,2011e5,al,000,0001,0004637,259,104

ACSSF,2011e5,al,000,0001,0004637,226,108

ACSSF,2011e5,al,000,0001,0004637,386,160

ACSSF,2011e5,al,000,0001,0004637,517,208

===

Post from vHomeInsurance.com
