
Vincent Granville's Posts (34)

Invitation to Join Data Science Central

Join the largest community of machine learning (ML), deep learning, AI, data science, business analytics, BI, operations research, mathematical and statistical professionals: Sign up here. If instead, you are only interested in receiving our newsletter, you can subscribe here. There is no cost.

The full membership includes, in addition to the newsletter subscription:

  • Access to member-only pages, our free data science eBooks, data sets, code snippets, and solutions to data science / machine learning / mathematical challenges.
  • Support for all your questions regarding our community.
  • Data sets, projects, cheat sheets, tutorials, programming tips, summarized information easy to digest, DSC webinars, data science events (conferences, workshops), new books, and news. 
  • Ability to post blogs and forum questions, as well as comments, and get answers from experts in their field. 

You can easily unsubscribe at any time. Our weekly digest features selected discussions, articles written by experts, forum questions and announcements aimed at machine learning, AI, IoT, analytics, data science, BI, operations research and big data practitioners.

It covers topics such as deep learning, AI, blockchain, visualization, automated machine learning, Hadoop, data integration and engineering, statistical science, computational statistics, analytics, pure data science, data security, and even computer-intensive methods in number theory. It includes:

  • Exclusive content for subscribers only: our upcoming book on automated data science (coming soon), detailed research reports about the data science community (for instance, best cities for data scientists, with growth trends), APIs (top Twitter accounts, various forecasting apps) and more
  • New book and new journal announcements
  • Salary surveys (for instance, how much a Facebook data scientist makes)
  • Workshops, webinars and conference announcements 
  • Programs and certifications for data scientists
  • Case studies, success stories, benchmarks
  • New analytic companies/products announcements
  • Sample source code, questions about coding and algorithms

Click here to sign up and start receiving our newsletter. We respect your privacy: member information (email address etc.) is kept confidential and never shared.

Read more…

8 Hadoop articles that you should read

Here's a selection of Hadoop-related articles worth checking out. Enjoy the reading!

What other articles and resources do you recommend?

Read more…

This tutorial is provided by Guru99. Originally posted here

Apache HADOOP is a framework used to develop data processing applications which are executed in a distributed computing environment.

In this tutorial, we will learn:

  • Components of Hadoop
  • Features of Hadoop
  • Network Topology In Hadoop

Just as data resides in the local file system of a personal computer, in Hadoop data resides in a distributed file system called the Hadoop Distributed File System (HDFS).

The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data. This computational logic is simply a compiled version of a program written in a high-level language such as Java. Such a program processes data stored in HDFS.

HADOOP is an open source software framework. Applications built using HADOOP run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, and are mainly useful for achieving greater computational power at low cost.

Do you know? A computer cluster consists of a set of multiple processing units (storage disk + processor) which are connected to each other and act as a single system.

Components of Hadoop

The diagram below shows the various components of the Hadoop ecosystem.

Apache Hadoop consists of two sub-projects:

  1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous amounts of data in parallel on large clusters of computation nodes.
  2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them across the compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.

Although Hadoop is best known for MapReduce and its distributed file system, HDFS, the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume and ZooKeeper.

Features of Hadoop

Suitable for Big Data Analysis - As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are well suited to analysing it. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept is called data locality, and it helps increase the efficiency of Hadoop-based applications.

Scalability - HADOOP clusters can easily be scaled by adding cluster nodes, and thus allow for the growth of Big Data. Also, scaling does not require modifications to application logic.

Fault Tolerance - The HADOOP ecosystem has a provision to replicate the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed by using data stored on another cluster node.
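The fault-tolerance idea can be illustrated with a toy model in plain Python. This is only a sketch: the node names, block names and the round-robin placement rule are assumptions made for the illustration, and real HDFS placement is more sophisticated (for example, it takes racks into account). The replication factor of 3 mirrors the usual HDFS default.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement rule for illustration only; it assumes
    replication <= len(nodes) so that replicas land on distinct nodes.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3", "blk_4"]
placement = place_blocks(blocks, nodes)

# With one node down, every block is still readable from another replica.
after_one_failure = readable_blocks(placement, failed={"node2"})
print(after_one_failure == blocks)
```

Only when every node holding a given block fails at the same time does that block become unavailable, which is why a higher replication factor trades storage for resilience.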

Network Topology In Hadoop

The topology (arrangement) of the network affects the performance of a Hadoop cluster as the cluster grows. In addition to performance, one also needs to care about high availability and the handling of failures. To achieve this, Hadoop cluster formation makes use of network topology.

Typically, network bandwidth is an important factor to consider while forming any network. However, as measuring bandwidth can be difficult, Hadoop represents the network as a tree, and the distance between nodes of this tree (number of hops) is treated as an important factor in forming a Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor.

A Hadoop cluster consists of data centers, racks, and the nodes that actually execute jobs. Here, a data center consists of racks and a rack consists of nodes. The network bandwidth available to processes varies depending on their location. That is, the available bandwidth decreases as we move down the following list:

  • Processes on the same node
  • Different nodes on the same rack
  • Nodes on different racks of the same data center
  • Nodes in different data centers
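The distance rule and the four scenarios above can be sketched in a few lines of Python. The '/datacenter/rack/node' path notation and the names d1, r1, n1 are assumptions chosen for the illustration; the function simply counts the hops from each node up to the closest common ancestor.

```python
def distance(a, b):
    """Hops between two nodes given as '/datacenter/rack/node' paths.

    The distance is the sum of each node's distance to the closest
    common ancestor in the tree.
    """
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # processes on the same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # different nodes on the same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # different racks, same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers
```

The four calls give distances of 0, 2, 4 and 6 hops respectively, matching the order in which available bandwidth decreases.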

DSC Resources

Popular Articles

Read more…

Originally posted here by Bernard Marr.

When you learn about Big Data you will sooner or later come across this odd sounding word: Hadoop - but what exactly is it?

Put simply, Hadoop can be thought of as a set of open source programs and procedures (meaning essentially they are free for anyone to use or modify, with a few exceptions) which anyone can use as the "backbone" of their big data operations.

I'll try to keep things simple as I know a lot of people reading this aren't software engineers, so I hope I don't over-simplify anything - think of this as a brief guide for someone who wants to know a bit more about the nuts and bolts that make big data analysis possible.

The 4 Modules of Hadoop

Hadoop is made up of "modules", each of which carries out a particular task essential for a computer system designed for big data analytics.

1. Distributed File-System

The two most important modules are the Distributed File System, which allows data to be stored in an easily accessible format across a large number of linked storage devices, and MapReduce, which provides the basic tools for poking around in the data.

(A "file system" is the method used by a computer to store data, so it can be found and used. Normally this is determined by the computer's operating system; however, a Hadoop system uses its own file system which sits "above" the file system of the host computer, meaning it can be accessed using any computer running any supported OS.)

2. MapReduce

MapReduce is named after the two basic operations this module carries out: reading data from the database and putting it into a format suitable for analysis (map), and performing mathematical operations, e.g. counting the number of males aged 30+ in a customer database (reduce).
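For readers who do want to see the idea in code, here is a minimal sketch of the "males aged 30+" count in plain Python. It does not use Hadoop at all; the customer records and function names are made up for the illustration, and the grouping step in the middle stands in for the work the framework does between map and reduce.

```python
from collections import defaultdict

# Made-up customer records, for illustration only.
customers = [
    {"name": "Al", "sex": "M", "age": 34},
    {"name": "Bea", "sex": "F", "age": 41},
    {"name": "Cy", "sex": "M", "age": 27},
    {"name": "Dan", "sex": "M", "age": 52},
]


def map_record(record):
    """Map: turn one record into zero or more (key, value) pairs."""
    if record["sex"] == "M" and record["age"] >= 30:
        yield ("males_30_plus", 1)


def reduce_counts(key, values):
    """Reduce: combine all the values that share a key."""
    return key, sum(values)


# Group the mapped values by key (the framework normally does this).
grouped = defaultdict(list)
for record in customers:
    for key, value in map_record(record):
        grouped[key].append(value)

results = dict(reduce_counts(k, v) for k, v in grouped.items())
print(results)
```

Because each record is mapped independently, the map step can be spread across many machines, and only the small (key, value) pairs need to be brought together for the reduce step.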

3. Hadoop Common

The other module is Hadoop Common, which provides the tools (in Java) needed for the user's computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.


4. YARN

The final module is YARN, which manages resources of the systems storing the data and running the analysis.

Various other procedures, libraries or features have come to be considered part of the Hadoop "framework" over recent years, but Hadoop Distributed File System, Hadoop MapReduce, Hadoop Common and Hadoop YARN are the principal four.

How Hadoop Came About

Development of Hadoop began when forward-thinking software engineers realised that it was quickly becoming useful for anybody to be able to store and analyze datasets far larger than can practically be stored and accessed on one physical storage device (such as a hard disk).

This is partly because as physical storage devices become bigger it takes longer for the component that reads the data from the disk (which in a hard disk, would be the "head") to move to a specified segment. Instead, many smaller devices working in parallel are more efficient than one large one.

Hadoop was released in 2005 by the Apache Software Foundation, a non-profit organization that produces open source software powering much of the Internet behind the scenes. And if you're wondering where the odd name came from, it was the name given to a toy elephant belonging to the son of one of the original creators!

The Usage of Hadoop

The flexible nature of a Hadoop system means companies can add to or modify their data system as their needs change, using cheap and readily-available parts from any IT vendor.

Today, it is the most widely used system for providing data storage and processing across "commodity" hardware - relatively inexpensive, off-the-shelf systems linked together, as opposed to expensive, bespoke systems custom-made for the job in hand. In fact it is claimed that more than half of the companies in the Fortune 500 make use of it.

Just about all of the big online names use it, and as anyone is free to alter it for their own purposes, modifications made to the software by expert engineers at, for example, Amazon and Google, are fed back to the development community, where they are often used to improve the "official" product. This form of collaborative development between volunteer and commercial users is a key feature of open source software.

In its "raw" state, using the basic modules supplied by Apache, it can be very complex, even for IT professionals. This is why various commercial versions, such as Cloudera, have been developed to simplify the task of installing and running a Hadoop system, as well as offering training and support services.

So that, in a (fairly large) nutshell, is Hadoop. Thanks to the flexible nature of the system, companies can expand and adjust their data analysis operations as their business expands. And the support and enthusiasm of the open source community behind it has led to great strides towards making big data analysis more accessible for everyone.

You might also want to read:

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Originally posted here by Bernard Marr.

One question I get asked a lot by my clients is: Should we go for Hadoop or Spark as our big data framework? Spark has overtaken Hadoop as the most active open source Big Data project. While they are not directly comparable products, they both have many of the same uses.

To shed some light on the issue of “Spark vs. Hadoop”, I thought an article explaining the essential differences and similarities of each might be useful. As always, I have tried to keep it accessible to anyone, including those without a background in computer science.

Source for picture: click here (Quora)

Hadoop and Spark are both Big Data frameworks: they provide some of the most popular tools used to carry out common Big Data-related tasks.

Hadoop, for many years, was the leading open source Big Data framework, but recently the newer and more advanced Spark has become the more popular of the two Apache Software Foundation tools.

However, they do not perform exactly the same tasks, and they are not mutually exclusive, as they are able to work together. Although Spark is reported to work up to 100 times faster than Hadoop in certain circumstances, it does not provide its own distributed storage system.

Distributed storage is fundamental to many of today’s Big Data projects as it allows vast multi-petabyte datasets to be stored across an almost infinite number of everyday computer hard drives, rather than involving hugely costly custom machinery which would hold it all on one device. These systems are scalable, meaning that more drives can be added to the network as the dataset grows in size.

As I mentioned, Spark does not include its own system for organizing files in a distributed way (the file system) so it requires one provided by a third-party. For this reason many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).

What really gives Spark the edge over Hadoop is speed. Spark handles most of its operations “in memory”, copying the data from the distributed physical storage into far faster RAM. This reduces the amount of time-consuming writing and reading to and from slow, clunky mechanical hard drives that needs to be done under Hadoop’s MapReduce system.

MapReduce writes all of the data back to the physical storage medium after each operation. This was originally done to ensure a full recovery could be made in case something goes wrong, as data held electronically in RAM is more volatile than that stored magnetically on disks. However, Spark arranges data in what are known as Resilient Distributed Datasets, which can be recovered following failure.
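The recovery idea can be shown with a toy model in plain Python. This is not Spark's API: the ToyDataset class and its methods are hypothetical names invented for this sketch. It only illustrates the principle that a dataset which remembers how it was derived (its "lineage") can rebuild results lost from memory by re-running those steps on the original input.

```python
class ToyDataset:
    """Toy stand-in for a resilient dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, transforms=()):
        self.source = source                  # stable input, e.g. data on disk
        self.transforms = tuple(transforms)   # lineage: functions applied in order
        self._cache = None                    # in-memory result, may be lost

    def map(self, fn):
        # Record the step instead of running it straight away.
        return ToyDataset(self.source, self.transforms + (fn,))

    def collect(self):
        # Recompute from the source whenever the in-memory copy is missing.
        if self._cache is None:
            data = list(self.source)
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self._cache = data
        return self._cache

    def lose_memory(self):
        # Simulate a failure that wipes the in-memory result.
        self._cache = None


ds = ToyDataset([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
first = ds.collect()
ds.lose_memory()
second = ds.collect()   # rebuilt from the recorded steps
print(first == second)
```

The trade-off is that nothing needs to be written to disk after each step, at the cost of some recomputation if a failure does occur.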

Spark’s functionality for handling advanced data processing tasks such as real time stream processing and machine learning is way ahead of what is possible with Hadoop alone. This, along with the gain in speed provided by in-memory operations, is the real reason, in my opinion, for its growth in popularity. Real-time processing means that data can be fed into an analytical application the moment it is captured, and insights immediately fed back to the user through a dashboard, to allow action to be taken. This sort of processing is increasingly being used in all sorts of Big Data applications, for example recommendation engines used by retailers, or monitoring the performance of industrial machinery in the manufacturing industry.

Machine learning, that is, creating algorithms which can “think” for themselves, allowing them to improve and “learn” through a process of statistical modelling and simulation until an ideal solution to a proposed problem is found, is an area of analytics which is well suited to the Spark platform, thanks to its speed and ability to handle streaming data. This sort of technology lies at the heart of the latest advanced manufacturing systems used in industry, which can predict when parts will go wrong and when to order replacements, and will also lie at the heart of the driverless cars and ships of the near future. Spark includes its own machine learning library, called MLlib, whereas Hadoop systems must be interfaced with a third-party machine learning library, for example Apache Mahout.

The reality is, although the existence of the two Big Data frameworks is often pitched as a battle for dominance, that isn’t really the case. There is some crossover of function, but both are non-commercial products so it isn’t really “competition” as such, and the corporate entities which do make money from providing support and installation of these free-to-use systems will often offer both services, allowing the buyer to pick and choose which functionality they require from each framework.

Many of the big vendors (e.g. Cloudera) now offer Spark as well as Hadoop, so will be in a good position to advise companies on which they will find most suitable, on a job-by-job basis. For example, if your Big Data simply consists of a huge amount of very structured data (e.g. customer names and addresses), you may have no need for the advanced streaming analytics and machine learning functionality provided by Spark. This means you would be wasting time, and probably money, having it installed as a separate layer over your Hadoop storage. Spark, although developing very quickly, is still in its infancy, and the security and support infrastructure is not as advanced.

The increasing amount of Spark activity taking place (when compared to Hadoop activity) in the open source community is, in my opinion, a further sign that everyday business users are finding increasingly innovative uses for their stored data. The open source principle is a great thing, in many ways, and one of them is how it enables seemingly similar products to exist alongside each other – vendors can sell both (or rather, provide installation and support services for both, based on what their customers actually need in order to extract maximum value from their data).


Read more…

Including NoSQL, Map-Reduce, Spark, big data, and more. This resource includes technical articles, books, training and general reading. Enjoy the reading!

Source for picture: click here

Here's the list (new additions, more than 30 articles marked with *):

  1. Hadoop: What It Is And Why It’s Such A Big Deal *
  2. The Big 'Big Data' Question: Hadoop or Spark? *
  3. NoSQL and RDBMS are on a Collision Course *
  4. Machine Learning at Scale with Spark *
  5. NoSQL & NewSQL Database Adoption 2014 *
  6. Big Data = 3 data issues *
  7. Embrace Relationships with Neo4J, R & Java *
  8. Hadoop Falcon and Data Lifecycle Management *
  9. Comparing MongoDB with MySQL *
  10. RethinkDB for Advanced Analytics *
  11. THE PAST (Entity-Attribute-Value) vs THE FUTURE (Sign, Signifier, Signified) *
  12. 18 Open Source NoSQL Databases *
  13. S3 instead of HDFS with Hadoop *
  14. Which one is best: R, SAS or Python, for data science? *
  15. Hadoop:- A soft Introduction *
  16. Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives *
  17. Ten top noSQL Databases *
  18. Survey: 'Big Data Paralysis' Is Holding Companies Back *
  19. Predictive Analytics and Data Science: Same or Different? *
  20. How Experian Is Using Big Data *
  21. 7 Amazing Big Data Myths *
  22. Deploy Hadoop Cluster *
  23. Why NoSQL became MORE SQL *
  24. The Big Data Tidal Wave - How Technology is Shaping the way Financial Services Companies Operate *
  25. Solving the Data Growth Crisis with Hadoop *
  26. Google F1 Database: One Step Closer To Discovering The DB Holy Grail *
  27. The Best Of Open Source For Big Data *
  28. The Emerging Data Stack and Mobile Access *
  29. Harnessing Big Data for Security: Intelligence in the Era of Cyber Warfare *
  30. New, fast Excel to process billions of rows via the cloud *
  31. Fake data science *
  32. Basic Understanding of Big Data. What is this and How it is going to solve complex problems *
  33. Lesson 9: Making Your Selection - Final Considerations *
  34. Great list of resources - NoSQL, Big Data, Machine Learning and more | GitHub 
  35. Implementing a Distributed Deep Learning Network over Spark 
  36. Correlation and R-Squared for Big Data 
  37. [Book] Big Data - Principles and best practices of scalable realtime data systems 
  38. 9 Lessons: Picking the Right NoSQL Tools 
  39. Lesson 2: NoSQL Databases are Good for Everything – Except Maybe this One Thing 
  40. 16 resources to learn and understand Hadoop 
  41. 8 Hadoop articles that you should read 
  42. Fast clustering algorithms for massive datasets 
  43. Hadoop – Whose to Choose 
  44. 11 Features any database, SQL or NoSQL, should have 
  45. Big Data: The 4 Layers Everyone Must Know 
  46. The Book: Big Data, NoSQL, Cloud A Paradigm Shift 
  47. Lesson 8: Graph Databases 
  48. How to get started with Hadoop? 
  49. Optimizing care gaps and outreach programs in Healthcare 
  50. Business Intelligence Architecture 
  51. Lesson 4: Features Common to (Most) NoSQL/NewSQL Databases 
  52. Get started with Hadoop and Spark in 10 minutes 
  53. Lesson 5: Key Value Stores (AKA 'Tuple' Stores) 
  54. How to score data in Hadoop/Hive in a flash 
  55. Interesting database questions 
  56. Lesson 3: Open Source, Distribution, or Suite 
  57. Big Data Applications Scaling Using Java Architecture in the Cloud 
  58. Lesson 7: Column Oriented Databases (aka Big Table or Wide Column) 
  59. Big Data Analytics Infrastructure 
  60. Hadoop Technology Stack 
  61. Lesson 6: Document Oriented Databases 
  62. A synthetic variance designed for Hadoop and big data 
  63. Practical illustration of Map-Reduce (Hadoop-style), on real data 
  64. Old SQL, New SQL, NoSQL - Making Sense of the Five Major Classes of Database Technology 
  65. How NoSQL Fundamentally Changed Machine Learning 
  66. eBook: Getting Started With Hadoop 
  67. Salaries for Hadoop professionals 
  68. Modern BI Architecture & Analytical Ecosystems 
  69. Wiley's Hadoop Book Bundle -- A Free 113 Page Sampler 
  70. Earthwatch to Look at Climate Change in Acadia National Park 
  71. Polyglot Persistence? 
  72. 50+ Open Source Tools for Big Data (See Anything Missing?) 
  73. Implementing a Distributed Deep Learning Network over Spark 
  74. Which one is best: R, SAS or Python, for data science? 
  75. 15+ Great Books for Hadoop 
  76. Clustering Similar Images Using MapReduce Style Feature Extraction with C# and R 
  77. A Comparison of NoSQL Offerings 
  78. How To Avoid The Big Data Quicksand 
  79. Deploy Hadoop Cluster 
  80. SQL to NoSQL translator 
  81. Programming for Data Science the Polyglot approach: Python + R + SQL 
  82. Seek the grail up the Knowledge Pyramid, not down 
  83. Big Data Logistics: data transfer using Apache Sqoop from RDBMS 


Read more…

Guest blog post.

Regularly crunching large amounts of public and proprietary data to make pinpointed predictions is a challenging task. Hadoop data processing is a very useful infrastructure layer to help in that process. However, Python is the programming language of choice for many data scientists, so merging the best of Hadoop's and Python's capabilities becomes imperative. We have outlined a case study on using Pydoop to crunch data from the American Community Survey (ACS) and home insurance data from vHomeInsurance.

About Pydoop

Pydoop is a Python package for Hadoop, MapReduce, and HDFS. Pydoop has several advantages over Hadoop’s built-in solutions for Python programming. As a CPython package, it allows you to access all standard library and third party modules, some of which may not be available for other Python implementations.

Installing Pydoop

  • Before installing Pydoop, we first need to install Python 2.7 and Apache Hadoop.
  • Python 2.6 may be used as well, but for Pydoop to work with Python 2.6, we need to install the following modules:
    • importlib, and
    • argparse (which can be installed with pip)
  • Pydoop can also be installed with pip using the following command:
    • sudo pip install pydoop

Example for Pydoop: Dividing the ACS Data File into Multiple Files Based on LOGRECNO

LOGRECNO stands for Logical Record Number, which is mapped to the geography information.

What is ACS?

The American Community Survey (ACS) provides statistics on various population- and housing-related topics at a national, state, and even community scale.

The annual data release of ACS is available in the following format:

  Field Name                        Field Size
  File Identification               6 Characters
  File Type                         6 Characters
  State/U.S.-Abbreviation (USPS)    2 Characters
  Character Iteration               3 Characters
  Sequence Number                   4 Characters
  Logical Record Number             7 Characters
  Field # 7 and up                  Estimates such as home insurance and home property value data, appended from vHomeInsurance

[Figure: sample file]
When we have multiple text files split by sequence number, we combine them all into a single file. We then create files for individual places (LOGRECNO). Using the LOGRECNO, we can map each record to its geo location.

Example of Creating a Place File:

Mapper Function:

  • The mapper function is called for each line of the input file. It receives 3 parameters:
    • key: the byte offset with respect to the current input file. In most cases, we may ignore it;
    • value: the line of text to be processed;
    • writer object: a Python object to write output and count values.

Reducer Function:

  • The reducer function is called for each unique key produced by your map function. It also receives 3 parameters:
    • key: the key produced by your map function;
    • values iterable: iterate over this parameter to traverse all the values emitted for the current key;
    • writer object: this is identical to the one given to the map function.

About Writer Object:

  • The writer object is one of the parameters of the mapper and the reducer. It has the following functions:
    • emit(k, v): pass a (k, v) key-value pair to the framework;
    • count(what, how_many): add how_many to the counter named what. If the counter doesn’t already exist, it will be created dynamically.

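def mapper(_, text, writer):
    # Each input line is one tab-separated ACS record.
    row = text.split("\t")
    # The sixth field (index 5) is the Logical Record Number.
    logrecno = row[5]
    # Keep only the records for the place of interest.
    if logrecno == '0004637':
        writer.emit(logrecno, text)


def reducer(key, values, writer):
    # Join all records emitted for this LOGRECNO into one output value.
    writer.emit(key, "\t".join(str(v) for v in values))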
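Before submitting a job to the cluster, it can help to dry-run the mapper and reducer on a few lines locally. The sketch below does this in plain Python without Hadoop or Pydoop. FakeWriter and run_locally are hypothetical helpers written for this illustration (they are not part of Pydoop), the sample records are made-up placeholders rather than real ACS data, and the mapper is assumed to emit each matching line keyed by its LOGRECNO.

```python
from collections import defaultdict


class FakeWriter:
    """Hypothetical stand-in for the Pydoop writer object; records calls."""

    def __init__(self):
        self.pairs = []
        self.counters = defaultdict(int)

    def emit(self, k, v):
        self.pairs.append((k, v))

    def count(self, what, how_many):
        self.counters[what] += how_many


def mapper(_, text, writer):
    row = text.split("\t")
    if row[5] == "0004637":          # sixth field is LOGRECNO
        writer.emit(row[5], text)
        writer.count("matched records", 1)


def reducer(key, values, writer):
    writer.emit(key, "\t".join(str(v) for v in values))


def run_locally(lines):
    """Run map, group by key, then reduce, all in one process."""
    map_out = FakeWriter()
    for index, line in enumerate(lines):
        mapper(index, line, map_out)  # line index stands in for the byte offset
    grouped = defaultdict(list)
    for k, v in map_out.pairs:
        grouped[k].append(v)
    reduce_out = FakeWriter()
    for k in sorted(grouped):
        reducer(k, grouped[k], reduce_out)
    return reduce_out.pairs, dict(map_out.counters)


# Made-up placeholder records in the six-field-plus-estimates layout.
lines = [
    "FILEID\tFTYPE\tus\t000\t0001\t0004637\t100",
    "FILEID\tFTYPE\tus\t000\t0001\t0000001\t200",
    "FILEID\tFTYPE\tus\t000\t0002\t0004637\t300",
]
pairs, counters = run_locally(lines)
print(pairs)
print(counters)
```

Only the two records whose LOGRECNO is 0004637 reach the reducer, which joins them into a single output value for that key.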

How to Execute:

Run the command “pydoop script acsfile hdfs_output”

[Figure: output file of execution]


Read more…

Hadoop has been the foundation for data programmes since Big Data hit the big time. It has been the launching point for almost every company that is serious about its data offerings.

However, as we predicted, the rise of in-memory databases has created a need for companies to adopt frameworks that harness this power effectively.

It was therefore no surprise that Apache launched Spark, a new framework that uses in-memory primitives to deliver performance up to 100 times faster, for some workloads, than Hadoop’s two-stage disk-based MapReduce.

This kind of product has become increasingly important as we move into a world where the amount and speed of data are increasing exponentially.

So is Spark going to be the Hadoop beater that it seems to be?


Technology that allows us to make decisions more quickly and with larger amounts of data is going to be something that companies clamour for.

It is not simply in principle that this platform will bring about change, either. As an open source platform, it has the most developers working on it of any Apache project.

This suggests that people support the idea through their willingness to dedicate their time to it. It is common knowledge that many of the data scientists working on Apache products are the same people who will use them in their day-to-day roles at different companies, which suggests that they may adopt this system in the future.


One of the main reasons for Hadoop's success in the last few years has been not only its ease of use, but also that companies can get it for nothing. A company can run the basics of Hadoop on a regular system and will only need to upgrade when it ramps up its data programmes.

Spark runs on in-memory systems, which require high-performance hardware, something that companies new to data initiatives are unlikely to invest in.

So which is it more likely to be?

In my opinion, Hadoop will always be the foundation of data programmes and with more companies looking at adopting it as the basis for their implementations, this is unlikely to change.

Spark may well become the upgrade adopted by companies that reach a stage where they want, or need, improved performance. As Spark can work alongside Hadoop, this also seems to have been in the minds of the people at Apache when coming up with the product in the first place.

Therefore, it is unlikely to be a Hadoop beater, but will instead become more like its big brother. It is capable of doing more, but at increased cost, and it is only necessary for certain data volumes and velocities, so it is not going to be a replacement.

Originally posted here.

Read more…

Top 10 Hadoop Blog Posts - by Pivotal

Apache Hadoop has become the dominant platform for Big Data analytics in recent years, thanks to its flexibility, reliability, scalability, and ability to suit the needs of developers, web startups, and enterprise IT. Hadoop is capable of real-time analytics and is cost-effective given its ability to run on commodity hardware, and boasts a robust developer ecosystem supporting the platform's continued development and extensibility.

Pivotal is backed by the world's largest Hadoop support organization and tested at scale in the 1,000-node Pivotal Analytics Workbench. The Apache Hadoop platform has been ascendant in recent years and we've tracked our top ten Hadoop blog posts here.

Hadoop 101: Programming MapReduce with Native Libraries, Hive, Pig, and Cascading
Apache Hadoop and all its flavors of distributions are the hottest technologies on the market. It's fundamentally changing how we store, use and share data...Read more

Pivotal's New Big Data Suite Redefines the Economics of Big Data Including UNLIMITED Hadoop to Enterprises 
The Big Data battleground has been an interesting market to watch lately. Centered around Apache Hadoop, and surrounded by an active ecosystem of commercial ventures and open source projects, this space is unfolding fast...Read more

In 45 Min, Set Up Hadoop (Pivotal HD) on a Multi-VM Cluster & Run Test Data
Getting started with Hadoop can take up a lot of time, but it doesn't have to...Read more

Large-Scale Video Analytics on Hadoop
Big Data is no longer a new term, it's a fact, and it's one of the fastest growing areas in IT...Read more

6 Easy Steps: Deploy Pivotal's Hadoop on Docker
While Hadoop is becoming more and more mainstream, many development leaders want to speed up and reduce errors in their development and deployment processes (i.e. devops) by using platforms like PaaS and lightweight runtime containers... Read more

In-Memory Data Grid + Hadoop: Integrated Real-Time Big Data Platform Previewed at SpringOne 2GX 2013
Apache Hadoop is gaining tremendous momentum, as it is becoming the ubiquitous answer to managing massive amounts of data from disparate sources at a very low cost...Read more

Using Hadoop MapReduce for Distributed Video Transcoding 
Surveillance cameras installed in enterprise facilities and public places produce lots of video data every day...Read more

Exploring Big Data Solutions: When To Use Hadoop vs In-Memory vs MPP
In the past, customers constrained by licenses have had to make architectural choices that are a bad fit, based solely on what licenses they currently own...Read more

JSON on Hadoop Example for Extending HAWQ Data Formats Using Pivotal eXtension Framework (PXF)
In my last post, I demonstrated a simple data workflow using Flume to pull raw JSON tweets from Twitter and store them on your HDFS cluster. These tweets were then analyzed using HAWQ and the Pivotal Xtension Framework (PXF)...Read more

Transform Your Skills: Simple Steps to Set Up SQL on Hadoop
As technologists, we don't have to look any further than a couple of job trend sites to realize Hadoop skills are growing when compared to SQL...Read more

Learn more about Pivotal HD - Hadoop for the enterprise here.

Read more…

A diverse collection of Hadoop tips, tricks, and information from some of today's leading authors. This Wiley e-book bundle includes selected materials from 5 recently published titles in Wiley's expansive catalog. The material included for each selection is the book's full Table of Contents as well as a full sample chapter for your enjoyment. Titles Include:

  • Hadoop For Dummies
  • Google BigQuery Analytics
  • Professional Hadoop Solutions
  • Professional NoSQL
  • Hybrid Cloud For Dummies

Whether you're a seasoned veteran of Hadoop or a newcomer, there are valuable lessons and advice in these pages for you. Download today and receive your FREE sampler filled with thought-provoking insights from today's leading authors.

Click here to see more.

Read more…
-- (BOOK) "Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2" by Arun Murthy, Vinod Kumar Vavilapalli, et al. (Pearson/Addison-Wesley Professional, March 2014, ISBN 9780321934505); **SUMMARY: This book is the comprehensive guide to building distributed, big data applications with Apache Hadoop™ YARN. YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli demonstrate how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk through the entire YARN project lifecycle, from installation through deployment, providing examples drawn from their experience—first as Hadoop’s earliest developers and implementers at Yahoo! and now as Hortonworks developers moving the platform forward. 
-- (BOOK) "R for Everyone: Advanced Analytics and Graphics" by Jared Lander (Pearson/Addison-Wesley Professional, Dec. 2013, ISBN 9780321888037); **SUMMARY: Drawing on his unsurpassed experience teaching new users, professional data scientist Jared P. Lander has written the perfect tutorial for anyone new to statistical programming and modeling. Organized to make learning easy and intuitive, this guide focuses on the 20% of R functionality most needed to accomplish 80% of modern data tasks. Lander’s self-contained chapters start with the absolute basics, offering extensive hands-on practice and sample code. Readers will download and install R; navigate and use the R environment; master basic program control, data import, and manipulation; and walk through several essential tests. Then, building on this foundation, Lander shows how to construct several complete models, both linear and nonlinear, and use some data mining techniques. 
Publisher page | Sample Chapter #12, "Data Reshaping"
- (DIGITAL VIDEO) "R Programming LiveLessons: Fundamentals to Advanced", presented by Author Jared Lander (Pearson/Addison-Wesley Professional, Dec. 2013, ISBN 9780133743272); **SUMMARY: In 16+ hours of video instruction, Author Jared Lander provides a tour through the most important parts of R, from the very basics to complex modeling. He covers reading data, programming basics, visualization, data munging, regression, classification, clustering, modern machine learning and more. The video is based on Lander's corresponding book, "R for Everyone", and is a condensed version of the course he teaches at Columbia University. 
- (BOOK) "Data Just Right: Introduction to Large-Scale Data & Analytics" by Michael Manoochehri (Pearson/Addison-Wesley Professional, Dec. 2013, ISBN 9780321898654);  **SUMMARY: This book is for professionals who need practical solutions based on limited resources and time. Manoochehri helps readers to focus on building applications, rather than infrastructure, and to address each of today’s key Big Data use cases in a cost-effective way by combining technologies into hybrid solutions. He provides approaches to: managing massive datasets; visualizing data; building data pipelines and dashboards; choosing tools for statistical analysis; and more. Throughout, Manoochehri demonstrates techniques using many leading data analysis tools, including Hadoop, Hive, Shark, R, Apache Pig, Mahout, and Google BigQuery. The book is organized in parts that describe data challenges and successful solutions in the context of common use cases. 
Publisher page | Sample Chapter #1, "Four Rules for Data Success": 
- (DIGITAL VIDEO) "Data Just Right LiveLessons" presented by Author Michael Manoochehri (Pearson/Addison-Wesley Professional, Dec. 2013, ISBN 9780133807141);  **SUMMARY: In 7 hours of video instruction, Author Manoochehri provides a practical introduction to solving common data challenges, such as managing massive datasets, visualizing data, building data pipelines and dashboards, and choosing tools for statistical analysis. The course does not assume any previous experience in large scale data analytics technology, and includes detailed, practical examples. 
- (BOOK) "Practical Cassandra: A Developer's Approach" by Russell Bradberry, Eric Lubow (Pearson/Addison-Wesley Professional, Dec. 2013, ISBN 9780321933942); **SUMMARY: Practical Cassandra is the first hands-on developer’s guide to building Cassandra systems and applications that deliver breakthrough speed, scalability, reliability, and performance. It reflects the latest versions of Cassandra–including Cassandra Query Language (CQL), which dramatically lowers the learning curve for Cassandra developers. Bradberry and Lubow walk readers through every step of building a real production application that can store enormous amounts of structured, semi-structured, and unstructured data. Drawing on their exceptional expertise, they share practical insights into issues ranging from querying to deployment, management, maintenance, monitoring, and troubleshooting. They cover key issues, from architecture to migration, and decision-making on crucial issues such as configuration and data modeling. They provide tested sample code, detailed explanations of how Cassandra works “under the covers,” and new case studies. 
Publisher page | Sample Chapter #8, "Monitoring"
Read more…

How to get started with Hadoop?

As a Perl, R and Python guy, what is the easiest way to get started with Hadoop? A few specific questions:

  1. Could you install Hadoop on Windows (on my laptop)? The procedure described here is a bit complicated. Some say you can even run Hadoop from your iPhone (I guess browser-based versions, if they exist).
  2. Does it make sense to use Hadoop on just one machine, at least initially? What are the benefits of using Hadoop on a single machine (is this synonymous with single node?) over using just my home-made file management system (basically, UNIX commands on the Cygwin console)?
  3. What are the optimum Hadoop configurations, based on the type/ size and velocity of data that you process?
  4. Any benchmark studies comparing Hadoop to other solutions?
  5. Do you need to know Java to get started?
  6. How to simulate multiple clusters/nodes on one machine? Can you measure the benefits of parallel computations on just one machine? A while back, I was able to see significant gains with a web crawler split into 25 processes running on a single machine (Map Reduce). But if the tasks are purely computational and algorithmic (no data transfers in and out of your machine, such as HTTP requests, database access or API calls), would there be any potential gains?
  7. When using multiple machines, can data transfers reduce the benefits of Hadoop or similar architectures? My guess is no, because usually (in most data processing) the output is small, compared to the input.
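
To make question 6 concrete, here is a minimal sketch of the map/reduce pattern on a single machine, written in Python with only the standard library. It is not Hadoop and does not use any Hadoop API; the function names (`map_chunk`, `split_into_chunks`, `word_count`) and the word-count task are assumptions chosen for illustration. The input is split into chunks, each chunk is counted independently (map), and the partial counts are merged (reduce).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def word_count(lines, executor, n_chunks=4):
    """Run the map step on each chunk via the executor, then merge the results."""
    chunks = split_into_chunks(lines, n_chunks)
    total = Counter()
    for partial in executor.map(map_chunk, chunks):
        total.update(partial)  # reduce step: add the partial counts together
    return total


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines read from files.
    sample = ["big data on hadoop", "hadoop on one machine", "one big machine"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(word_count(sample, pool).most_common(3))
```

The executor is passed in so the same code can be timed with different back ends. In standard CPython, threads mostly help when the work waits on I/O (as with the web crawler mentioned above); for purely computational tasks, swapping in `concurrent.futures.ProcessPoolExecutor` with one worker per core is the usual way to look for a speed-up on a single machine.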

Thanks for your help! I'd really like to get started.

Read more…

15+ Great Books for Hadoop

Books for Hadoop & Map Reduce

  • Hadoop: The Definitive Guide by Tom White

    The Definitive Guide is in some ways the ‘Hadoop bible’, and can be an excellent reference when working on Hadoop, but do not expect it to provide a simple getting started tutorial for writing a MapReduce job. This book is great for really understanding how everything works and how all the systems fit together.

  • Hadoop Operations by Eric Sammer

    This is the book if you need to know the ins and outs of prototyping, deploying, configuring, optimizing, and tweaking a production Hadoop system. Eric Sammer is a very knowledgeable engineer, so this book is chock full of goodies.

  • MapReduce Design Patterns by Donald Miner and Adam Shook

    Design Patterns is a great resource to get some insight into how to do non-trivial things with Hadoop. This book goes into useful detail on how to design specific types of algorithms, outlines why they should be designed that way, and provides examples.

  • Hadoop in Action by Chuck Lam

    One of the few non-O’Reilly books in this list, Hadoop in Action is similar to the definitive guide in that it provides a good reference for what Hadoop is and how to use it. It seems like this book provides a more gentle introduction to Hadoop compared to the other books in this list.

  • Hadoop in Practice by Alex Holmes

    A slightly more advanced guide to running Hadoop. It includes chapters that detail how to best move data around, how to think in Map Reduce, and (importantly) how to debug and optimize your jobs.

View full list. Also, check this list for machine learning books. For other data science books, click here.

Read more…

Five are related to Hadoop, and four to analytics. Enjoy the reading!

MapReduce NextGen Architecture

Illustration of YARN (from first article below)

Articles from external publishers and bloggers:

The starred article is very popular.

Our past selections of external articles

Read more…

Enjoy the reading and learning. Add your own suggestions in the comment section.

The Hadoop Ecosystem by Datameer


  1. Free videos - MapR Academia
  2. Udacity course
  3. Hortonworks Sandbox
  4. Hadoop Ecosystem
  5. Running Hadoop Map-Reduce
  6. Hadoop Screencasts
  7. Reza Shiftehfar's blog I
  8. Reza Shiftehfar's blog II
  9. Reza Shiftehfar's blog III
  10. Reza Shiftehfar's blog IV
  11. Reza Shiftehfar's blog V
  12. Reza Shiftehfar's blog VI
  13. Reza Shiftehfar's blog VII
  14. Deploying Storm on Hadoop for Advertising Analysis
  15. Hadoop classes by Cloudera
  16. EMC classes: Big Data, Analytics, Data Science
  17. Simulated Hadoop

Other links

Read more…

eBook: Getting Started With Hadoop

This ebook will guide you through:

  • Why Hadoop? Comparison of SQL databases and Hadoop
  • Understanding Distributed Systems, MapReduce and Hadoop
  • Installing and Setting Up Hadoop
  • Working with HDFS and the JobTracker web interface
  • Getting Started with MapReduce Programming on a Single Cluster
Read more…

Pivotal, an EMC/VMware spin-off that has big plans to deliver big data analytics through platform as a service, has whisked the drapes off Pivotal HD 2.0, its commercially supported enterprise-grade distribution of Hadoop.

But Pivotal's ambitions for HD don't simply involve delivering Hadoop as a free-form building block, albeit one that's professionally supported. Rather, HD is intended to be the data fabric of the company's own Pivotal One, a PaaS offering where companies can develop apps that siphon data in real time from a variety of sources and transform them into actionable information.

HD 2.0 is built on top of Apache Hadoop 2.2, but adds a good deal of proprietary technology -- a move that will likely leave open source purists wincing -- to make Hadoop the substrate of what Pivotal calls a "business data lake" architecture. One of those proprietary pieces is HAWQ, a SQL query engine designed to perform parallelized queries on data stored in HDFS; another component, GemFire XD, is an in-memory database service designed more for processing of incoming data in real time, as opposed to long-running SQL queries. HD 2.0 also includes GraphLab, a graph analytics algorithms package, and tools to allow programmers using R, Python, and Java to "enable business logic and procedures otherwise cumbersome with SQL."

Other distributions have done little more than package up Hadoop for easier delivery and provisioning under the assumption that the deploying parties would know best how to make the most of it -- an attitude that's persisted with Red Hat and Hortonworks joining forces for the sake of supporting Hadoop in Red Hat Enterprise Linux. There, the application and data-access sides have largely consisted of the likes of Red Hat's JBoss data layer. Enterprise developers still have to fit many more of the pieces together themselves.

Pivotal, on the other hand, is using Hadoop as an underlying stratum on which to build its PaaS. To that end, Pivotal One is meant to be directly useful to enterprises needing big data analytics by allowing them to leverage more of the data-access paradigms they're already familiar with (such as SQL) instead of forcing them to scrap everything and learn the Hadoop way. Again, Hadoop purists aren't going to be happy with this news, but Pivotal most wants to satisfy its enterprise customers with big data needs.

When InfoWorld's Eric Knorr pondered the launch of Pivotal back in April 2013, he considered the possibility that Pivotal One was being built as much for Pivotal itself as it was anyone else -- that Pivotal Labs (one of the acquisitions used to form Pivotal) would be "developing the bulk of those next-gen big data applications on Pivotal One for its enterprise customers, rather than enterprises developers using Pivotal One themselves."

The long-term vision, as Knorr found in his discussion later in 2013 with CEO Paul Maritz, involves not just having the ability to generate a given insight with a large data set or even to run arbitrarily large software on top of it. Rather, as he put it, "It's about how you use that in the context of some application that's going to drive a transaction or cause some interaction with the user.... We're not just in the big data business. We're in the applications and data business."

Much of what has held back Hadoop is its status as a technology rather than a product -- as cited by Facebook analytics chief Ken Rudin when talking about how "big data is about business needs." If Pivotal One and Pivotal HD make Hadoop into the kind of useful and even transformative business product that Red Hat was able to craft from Linux, odds are it'll be at least as big a win for Pivotal -- maybe even bigger than it will be for Hadoop.

This story, "Pivotal juices Hadoop with in-memory database and SQL querying," was originally published at InfoWorld.com. Get the first word on what the important tech news really means with the InfoWorld Tech Watch blog. For the latest developments in business technology news, follow InfoWorld.com on Twitter.

Read more…

The Hadoop ecosystem is going through the same stages of explosive maturation as the social networking space did before it, with the top players now gearing up for public offerings to fuel their growth ambitions. Cloudera fired the opening shot last week with the closing of a $160 million round that bumped up its valuation to $1.8 billion, and today Hortonworks announced fresh funding of its own.

Rob Bearden at Structure Data 2014.

The company bagged $100 million in Series D financing from asset management powerhouse BlackRock, Passport Capital and all of its existing backers, which include several other high-profile institutional investors as well as Yahoo. Hortonworks spun off from the Internet giant in 2011, taking with it most of the original Hadoop development team and their vision for open source data analytics. The firm’s commitment to the community is what sets it apart from rivals Cloudera and MapR, which incorporate varying proportions of proprietary technology into their offerings. Hortonworks, on the other hand, offers its software for free and provides premium services on the side, a business model that has won prominent customers like eBay, Samsung and Cardinal Health.

“The Hadoop market is heating up, that’s clear. This raise from Hortonworks, which comes just a week after a massive raise by Cloudera, indicates to me that the race is on to be the first Hadoop pure-play vendor to go public,” said Jeff Kelly, a senior analyst at Wikibon, where he covers the Big Data market. “Hortonworks’ maturation as a company in just two-plus years of life has been remarkable to watch. With this investment and a slew of important reseller arrangements starting to kick into high-gear, 2014 has the potential to be a breakout year for the company.”

Hortonworks said that the new funding, which brings its total raised to a relatively humble $148 million, will be used to extend its product roadmap and scale global operations. Meanwhile, CEO Rob Bearden told media that the round also sets the stage for a public offering that will take place in the second half of this year or early 2015. That likely positions the company for an earlier IPO than MapR, but the jury is still out on Cloudera: As the oldest and largest distributor in the industry, the firm is poised to take center stage when the Hadoop gold rush reaches Wall Street. However, it remains to be seen whether it will seize the moment.

Other links

Read more…
This article describes methods for machine learning using bootstrap samples and parallel processing to model very large volumes of data in short periods of time. The R programming language includes many packages for machine learning on different types of data. Three of these packages implement Support Vector Machines (SVM) [1], Generalized Linear Models (GLM) [2], and Adaptive Boosting (AdaBoost) [3]. While all three packages can be highly accurate for various types of classification problems, each package performs very differently when modeling (i.e. learning) different volumes of input data. In particular, model fitting for Generalized Linear Models executes in much shorter periods of time than either Support Vector Machines or Adaptive Boosting.
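
The general idea of fitting models on bootstrap samples in parallel can be sketched as follows. This is not the article's R code: it is a minimal Python illustration using only the standard library, with ordinary least squares on one predictor standing in for a GLM (it corresponds to the Gaussian family with identity link). The function names, the choice to average the coefficients across samples, and the synthetic data are all assumptions made for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx  # assumes the sample has at least two distinct x values
    return y_bar - slope * x_bar, slope


def fit_bootstrap_sample(task):
    """Draw one bootstrap sample (with replacement) and fit a line to it."""
    points, sample_size, seed = task
    rng = random.Random(seed)  # one seeded generator per sample, for repeatability
    sample = [rng.choice(points) for _ in range(sample_size)]
    return fit_line(sample)


def bootstrap_fit(points, n_samples, sample_size, executor, seed=0):
    """Fit n_samples bootstrap models via the executor and average the coefficients."""
    tasks = [(points, sample_size, seed + i) for i in range(n_samples)]
    fits = list(executor.map(fit_bootstrap_sample, tasks))
    return mean(a for a, _ in fits), mean(b for _, b in fits)


if __name__ == "__main__":
    # Synthetic data: y = 2 + 3x plus Gaussian noise (hypothetical example).
    noise = random.Random(42)
    data = [(x, 2 + 3 * x + noise.gauss(0, 1)) for x in range(1000)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(bootstrap_fit(data, n_samples=20, sample_size=100, executor=pool))
```

Each bootstrap fit only sees `sample_size` rows, so the cost per model stays small even when the full data set is large, and the fits are independent of one another. For CPU-bound fitting in standard CPython, passing a `concurrent.futures.ProcessPoolExecutor` instead of the thread pool is the usual way to spread the work across cores.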
Read more…