How to get started with Hadoop?

As a Perl, R and Python guy, what is the easiest way to get started with Hadoop? A few specific questions:

  1. Can you install Hadoop on Windows (on my laptop)? The procedure described here is a bit complicated. Some say you can even run Hadoop from your iPhone (I guess via browser-based versions, if they exist).
  2. Does it make sense to use Hadoop on just one machine, at least initially? What are the benefits of using Hadoop on a single machine (is this synonymous with a single node?) over just my home-made file management system (basically, UNIX commands in the Cygwin console)?
  3. What are the optimal Hadoop configurations, based on the type, size and velocity of the data you process?
  4. Any benchmark studies comparing Hadoop to other solutions?
  5. Do you need to know Java to get started?
  6. How do you simulate multiple clusters/nodes on one machine? Can you measure the benefits of parallel computation on just one machine? A while back I saw significant gains with a web crawler split across 25 processes on a single machine (essentially map-reduce). But if the tasks are purely computational and algorithmic (no data transfers in or out of your machine, such as HTTP requests, database access or API calls), would there be any potential gains? See the sketch after this list for the kind of measurement I have in mind.
  7. When using multiple machines, can data transfers reduce the benefits of Hadoop or similar architectures? My guess is no, because usually (in most data processing) the output is small, compared to the input.
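
To make question 6 concrete, here is a rough sketch of the kind of measurement I have in mind (the workload function is just a placeholder I made up): time a purely computational task serially, then across several worker processes on the same machine with Python's multiprocessing.

    # Rough sketch: serial vs. parallel runtime for a CPU-bound task
    # on a single machine (no network or disk IO involved).
    import time
    from multiprocessing import Pool

    def cpu_bound_task(n):
        # Placeholder workload: purely computational.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        jobs = [2000000] * 16          # 16 identical chunks of work

        start = time.time()
        serial = [cpu_bound_task(n) for n in jobs]
        print("serial:   %.2f s" % (time.time() - start))

        start = time.time()
        pool = Pool(processes=4)       # 4 worker processes, same machine
        parallel = pool.map(cpu_bound_task, jobs)
        pool.close()
        pool.join()
        print("parallel: %.2f s" % (time.time() - start))

        assert serial == parallel      # same results, hopefully less wall-clock time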

Thanks for your help! I'd really like to get started.

Comments

  • Here are a few answers from members:

    ------

    LIBLINEAR: fast algorithm for big data:

    http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf

    -----

    Another great comment, from Rahul Singh:

    This describes common problems in applying machine learning to big datasets, even with the Hadoop stack:

    http://www.csie.ntu.edu.tw/~cjlin/talks/icmr2012.pdf

    The following papers are also a good read in terms of new research developments and implementation approaches:

    Distributed SVM: 
    http://www.ece.umn.edu/users/alfonso/pubs/jmlr2010.pdf

    MLlib Architecture:

    http://www.cs.berkeley.edu/~ameet/mlbase_cidr.pdf

    ------

    Posted by Andrew Peterson:

    Here are the original Google research papers:

    Mapreduce: 
    http://research.google.com/archive/mapreduce.html

    The Google file system
    http://research.google.com/archive/gfs.html

    BigTable:
    http://static.googleusercontent.com/media/research.google.com/en/us...

    -----

    Another answer posted by Andrew Peterson on our LinkedIn group:

    I’m not a Hadoop expert, but I agree that the very nature of Hadoop is a bit confusing. If you download any of the VMs from Hortonworks or Cloudera, you can configure as much as you want. Or you can do a clean install from Apache, Cloudera or Hortonworks. About 13 years ago, Google published two white papers: one outlining their data infrastructure and the second on map-reduce. Some of the Yahoo! staff took to these approaches and decided to create an open-source version, which was eventually called Hadoop. And by the way, many of those staff are now at Hortonworks. I was doing a database project for Yahoo! back in 2008 when I first heard about Hadoop, and I was fairly sure it was something special.

    Anyway, from my perspective, Hadoop is first and foremost a massively parallel and redundant file system, where files are written only once. On top of that, map-reduce allows users to overlay a schema on the file data, then search the data and output it into new files and/or tables. I like to think of map-reduce as making organization out of clutter, whereas a typical database design is to organize the data first and then load it.

    Hadoop: load first, then organize 
    Database: organize first, then load 
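
    A toy sketch of that difference (the log format and field names below are made up for illustration, not from any real system): with the database approach you declare the schema before loading; with the Hadoop approach you dump the raw lines first and impose a schema only when you read.

        # Sketch: "organize first, then load" vs. "load first, then organize".
        import sqlite3

        raw_lines = [
            "2014-01-05 17:32:01 user42 /index.html",
            "2014-01-05 17:32:09 user17 /pricing.html",
        ]

        # Database style: schema up front, then load rows into it.
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE clicks (ts TEXT, user TEXT, page TEXT)")
        for line in raw_lines:
            date, time_, user, page = line.split()
            db.execute("INSERT INTO clicks VALUES (?, ?, ?)",
                       (date + " " + time_, user, page))

        # Hadoop style: keep the raw lines as-is; apply the schema at read time.
        def read_clicks(lines):
            for line in lines:
                date, time_, user, page = line.split()
                yield {"ts": date + " " + time_, "user": user, "page": page}

        print([row["page"] for row in read_clicks(raw_lines)])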

    -----

    This might sound like a naive question, but are there any platforms (UNIX-like environments if possible) where Hadoop comes pre-installed, but the end user can still fine-tune or customize some Hadoop parameters depending on the tasks being run, to optimize computing resources (I guess I mean optimizing the number of nodes or clusters, load balancing, etc.)?

    I guess I'm talking about Hadoop in the cloud, or maybe not; I'm still a bit confused and trying to understand the Hadoop ecosystem (I've developed home-made file management and distributed systems, so I'm familiar with the theoretical concepts, but not yet with the Hadoop product). Is Hadoop available by default on AWS, or do you need to install it on your own mini-cloud on AWS, or sign up for a more expensive plan to have it pre-installed?

    -----

    Hi Vincent,

    As someone still learning the Hadoop ecosystem, my answer is not authoritative, but I can take a stab at your questions.

    1. No production implementation will be easy. They all require a lot of configuration. If you just want to learn the Hadoop ecosystem for a bit, use the sandbox VMs provided by Hortonworks or Cloudera.

    2. It does if the point is just learning, but if you're dealing with a lot of data and hoping to realize performance benefits, you need more spindles spinning.

    3. This has a lot of variables.

    4. Not sure.  What solutions?

    5. Absolutely not. Python mappers and reducers can be written using stdin/stdout; see the sketch after this list.

    6. Any measurements on a single node wouldn't be particularly helpful in predicting performance on a multi-node cluster. A lot of Hadoop's magic seems to be in having multiple/many cores with multiple/many disks and tweaking the ratio between them.

    7. I'm certainly not ready to say that bandwidth couldn't be an issue for some workloads, but your point probably covers a lot of cases. It certainly seems to cover yours!
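
    To illustrate point 5, here is a minimal word-count sketch in the Hadoop Streaming style (the file names mapper.py and reducer.py are just my choice; this is a learning sketch, not a tuned job). The mapper reads raw lines from stdin and emits tab-separated key/value pairs; the reducer receives lines already sorted by key and sums the counts.

        # mapper.py -- emits "word<TAB>1" for every word read from stdin.
        import sys

        for line in sys.stdin:
            for word in line.strip().split():
                print("%s\t%d" % (word, 1))

        # reducer.py -- input arrives sorted by key, so counts can be summed per word.
        import sys

        current_word, current_count = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word == current_word:
                current_count += int(count)
            else:
                if current_word is not None:
                    print("%s\t%d" % (current_word, current_count))
                current_word, current_count = word, int(count)
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))

    On a cluster these would typically be wired in through the streaming jar's -mapper and -reducer options; locally you can chain them with a shell pipe and sort, as another commenter notes further down.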

    -----------------

    If the intention is to eventually move up to an enterprise-grade Hadoop application, be aware that the storage configuration does indeed matter and should be chosen carefully. The de facto direct-attached-storage (DAS) configuration with a “share-nothing” approach may show unacceptable limitations on several dimensions when compared to a network-attached-storage (NAS)/share-everything configuration.

    Such limitations can show up in the time to ingest data into the cluster under DAS, the cost to scale out storage, security, backups and disaster recovery. Briefly, a NAS with the right features and setup can be a far better choice than DAS, especially when dealing with truly big data and operating in a production enterprise environment where things like security, resilience and infrastructure TCO matter a great deal.

    --------------------

    Here's an interesting answer posted on Google+ by Eric Schubert:

    I'm not the super Hadoop expert, but here are some of my experiences so far:
    Running Hadoop on a single machine does not make a lot of sense, unless you need to, e.g., develop and debug map-reduce jobs offline (say, on a plane). Even for regular development, I would suggest submitting small jobs to the real cluster.
    Performance of a single node will be mediocre compared to a well-written local application. Hadoop really only shines when your data no longer fits into memory, and it is designed for 100+ nodes where you must expect node failures. For some benchmarks, see http://www.vitavonni.de/blog/201309/2013092701-big-data-madness-and...
    You can write map-reduce jobs easily in Python or Perl, but then you'll probably have to live with the higher overhead of serializing everything to and from text repeatedly. With the Java API you can easily use binary encodings and better control memory consumption. I mostly used this for prototyping (the Python jobs can easily be run without Hadoop, just by doing "< testinput mapper.py | sort | reducer.py | less"), then rewriting the job in Java for the full data set to reduce IO.
    Don't underestimate the administration cost of Hadoop. It's far from being a click-and-run thing; there are dozens of configuration variables to tune and metrics to monitor. You need to set up monitoring and actively look for bottlenecks. With most jobs, CPU and main memory aren't the limiting factors (most jobs are trivial maps and aggregations); the limits are data partitioning, HDFS IO, local temporary storage IO or network IO. Which is why you need to pay attention to data volume more than ever. The key to map-reduce is also neither the mapper nor the reducer; the key is to make optimal use of the shuffle step, where the big IO happens. That is, unless you are performing a simple data extraction job, where the mappers discard 99% of the data and the reducer is the identity.
    On a badly configured cluster (in particular one with a lot of RAM in each node) you can also suffer from CPU-scheduling and IO-scheduling slowdowns (which is why monitoring the load is important). Much of this tuning is non-intuitive. For example, the minimum memory allocation size is what you need to adjust to optimize load (to prevent overallocation). So even for WordCount it can pay off to allocate 4 GB to each mapper, just to ensure each one has a disk and a CPU of its own.
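
    As a purely illustrative sketch of where the shuffle sits (my own toy code, not part of Hadoop), here is a tiny local simulation of map -> shuffle -> reduce in Python; the sort-and-group step in the middle stands in for the shuffle, which is where the big IO happens on a real cluster.

        # Toy simulation of the map -> shuffle -> reduce pipeline, run locally.
        from itertools import groupby
        from operator import itemgetter

        def mapper(line):
            for word in line.split():
                yield (word, 1)

        def reducer(word, counts):
            yield (word, sum(counts))

        def run_local(lines):
            # Map phase
            pairs = [kv for line in lines for kv in mapper(line)]
            # Shuffle phase: sort by key and group; on a real cluster this is
            # the network- and disk-heavy step between mappers and reducers.
            pairs.sort(key=itemgetter(0))
            for word, group in groupby(pairs, key=itemgetter(0)):
                for result in reducer(word, (count for _, count in group)):
                    yield result

        if __name__ == "__main__":
            lines = ["to be or not to be", "to map or to reduce"]
            print(dict(run_local(lines)))
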
    -----------------------

    Hadoop is useful even if you work on just one machine: it manages the data and files, optimizing memory usage transparently for you. Without Hadoop, you need to split your data sets into multiple pieces small enough to fit in memory, then aggregate the computations. Even if you process a very large data set sequentially (say 10 billion clicks), your hash tables will grow too large to be handled in memory. The procedure needs to be split, with temporary hash tables saved as files on the hard disk, followed by an output merging step. Hadoop takes care of this. A rough sketch of that manual procedure follows.
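
    Here is that manual split-and-merge procedure as a rough sketch (the chunk size, file names and tab-separated key format are arbitrary illustration choices): keys are counted in chunks that fit in memory, each partial hash table is spilled to disk as a file, and the spills are merged at the end.

        # Sketch: manual out-of-core aggregation (what Hadoop would otherwise handle).
        # Count occurrences per key over a file too large for one in-memory hash table.
        import json
        import os
        from collections import Counter

        CHUNK_SIZE = 1000000            # lines per chunk; pick so one chunk fits in RAM
        SPILL_PREFIX = "partial_counts_"

        def spill(counter, index):
            # Write one partial hash table to disk and free its memory.
            with open("%s%d.json" % (SPILL_PREFIX, index), "w") as f:
                json.dump(counter, f)

        def aggregate(path):
            counter, spills, n = Counter(), 0, 0
            with open(path) as f:
                for line in f:
                    key = line.rstrip("\n").split("\t")[0]   # assume key = first field
                    counter[key] += 1
                    n += 1
                    if n % CHUNK_SIZE == 0:
                        spill(counter, spills)
                        counter, spills = Counter(), spills + 1
            spill(counter, spills)

            # Merge step: combine all partial tables into the final result.
            total = Counter()
            for i in range(spills + 1):
                fname = "%s%d.json" % (SPILL_PREFIX, i)
                with open(fname) as f:
                    total.update(json.load(f))
                os.remove(fname)
            return total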
