As a Perl, R, and Python guy, what is the easiest way to get started with Hadoop? A few specific questions:
- Can you install Hadoop on Windows (on my laptop)? The procedure described here is a bit complicated. Some say you can even run Hadoop from an iPhone (I guess via browser-based versions, if they exist).
- Does it make sense to use Hadoop on just one machine, at least initially? What are the benefits of using Hadoop on a single machine (is this synonymous with a single node?) over my home-made file management system (basically, UNIX commands in the Cygwin console)?
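To be concrete about what I mean by a "home-made" setup: a hypothetical sketch of the kind of single-process job I run today (a word count, which I understand is also Hadoop's canonical MapReduce example — function and file names here are made up for illustration):

```python
# Plain single-machine word count over local text files -- the sort of
# job my current UNIX/Cygwin pipeline handles without Hadoop.
from collections import Counter
from pathlib import Path

def word_count(paths):
    """Count word frequencies across the given text files."""
    counts = Counter()
    for p in paths:
        counts.update(Path(p).read_text().split())
    return counts
```

My question is essentially whether wrapping this same logic in Hadoop on a single node buys anything over running it directly.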
- What are the optimum Hadoop configurations for the type, size, and velocity of the data being processed?
- Any benchmark studies comparing Hadoop to other solutions?
- Do you need to know Java to get started?
- How can I simulate multiple clusters/nodes on one machine? Can you measure the benefits of parallel computation on just one machine? A while back I saw significant gains with a web crawler split across 25 processes on a single machine (MapReduce-style). But if the tasks are purely computational and algorithmic (no data transfers in or out of the machine, such as HTTP requests, database access, or API calls), would there still be potential gains?
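To illustrate the "purely computational" case I'm asking about, here is a hypothetical sketch (function names and the prime-counting task are my own invention, not anything from Hadoop) of fanning a CPU-bound job across worker processes on one machine with Python's multiprocessing:

```python
# Purely computational task fanned out across local worker processes.
# Unlike my earlier crawler (where the gain came from overlapping
# network I/O across 25 processes), any speedup here must come from
# using multiple CPU cores.
from multiprocessing import Pool

def count_primes(bound):
    """CPU-bound task: count primes below `bound` by trial division."""
    count = 0
    for n in range(2, bound):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def parallel_prime_counts(bounds, workers=4):
    """Map count_primes over several bounds in parallel."""
    with Pool(workers) as pool:
        return pool.map(count_primes, bounds)

if __name__ == "__main__":
    print(parallel_prime_counts([10_000, 20_000, 30_000]))
```

So the question is whether a single-node Hadoop setup would add anything over this kind of process-level parallelism when no data leaves the machine.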
- When using multiple machines, can data transfers erode the benefits of Hadoop or similar architectures? My guess is no, because in most data processing the output is small compared to the input.
Thanks for your help! I'd really like to get started.