Featured Posts (352)

Hadoop's MapReduce is a widely used parallel computing framework. However, its code reuse mechanism is inconvenient, and passing parameters is quite cumbersome. Unlike the usual experience of simply calling a library function, I found that both the coder and the caller must keep a sizable number of precautions in mind when writing even a short piece of code meant to be called by others.
Read more…

Book: Big Data Analytics with R and Hadoop

Set up an integrated infrastructure of R and Hadoop to turn your data analytics into Big Data analytics


  • Write Hadoop MapReduce within R
  • Learn data analytics with R and the Hadoop platform
  • Handle HDFS data within R
  • Understand Hadoop streaming with R
  • Encode and enrich datasets into R


In Detail

Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing.

Big Data Analytics with R and Hadoop is focused on the techniques of integrating R and Hadoop by various tools such as…

Read more…

The new variance introduced in this article fixes two big data problems associated with the traditional variance and with the way it is computed in Hadoop, which relies on a numerically unstable formula.
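To make the stability concern concrete, here is a minimal Python sketch. It assumes the unstable formula in question is the textbook "mean of squares minus square of the mean" shortcut, which the teaser does not confirm. It is not the new variance proposed in the article. The names naive_variance, summarize, merge and mapreduce_variance are hypothetical. The stable version uses the well-known pairwise merge of (count, mean, sum of squared deviations), which suits a map/reduce split because each chunk can be summarized independently.

```python
from functools import reduce


def naive_variance(xs):
    """Population variance via E[X^2] - E[X]^2 (assumed unstable shortcut)."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    # Subtracting two huge, nearly equal numbers loses most significant digits.
    return mean_sq - mean ** 2


def summarize(chunk):
    """Map step: reduce one chunk to (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge(a, b):
    """Reduce step: combine two chunk summaries without revisiting the data."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2


def mapreduce_variance(chunks):
    """Population variance from independently summarized chunks."""
    n, _, m2 = reduce(merge, map(summarize, chunks))
    return m2 / n


if __name__ == "__main__":
    # Large offset, small spread: the exact population variance is 22.5.
    data = [1e9 + x for x in (4, 7, 13, 16)]
    print("naive :", naive_variance(data))
    print("merged:", mapreduce_variance([data[:2], data[2:]]))
```

On data with a large mean and a small spread, as in the example, the shortcut cannot return the exact value, while the merged summaries do.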


Synthetic Metrics

This new metric is synthetic: unlike the variance taught in any statistics 101 course, or the variance currently implemented in Hadoop (see above picture), it was not derived naturally from mathematics. By synthetic, I mean that it was built to address issues with big data (outliers) and with the way many big data computations are now done: the MapReduce framework, of which Hadoop is an implementation. It is a top-down approach to metric design, from data to theory, rather than the traditional bottom-up approach, from theory to data.

Other synthetic metrics designed in our research laboratory include:…

Read more…

By 2015, 65 percent of applications with advanced analytics will come embedded with Hadoop. There's never been a better time to unlock the power of your data.

The Hadoop Innovation Summit returns to San Diego at the Marriott Marquis & Marina, on February 19 & 20, 2014.

View the schedule.

Speakers include

- Technical Director, AOL
- Director, BI Platforms, Netflix
- CDO & EVP, Data Science, Live Nation
- Senior Data Scientist, LinkedIn
- Engineering Manager, Analytics Infrastructure, Twitter
- Senior Software Engineer, TripAdvisor
- Data Engineer, Spotify
- Senior Software Development Manager, eBay
- Principal Architect, Yahoo!

Hadoop is…

Read more…

Originally posted by Manish Bhoge on DataScienceCentral.

A few days ago I attended a good webinar conducted by Metascale on the topic “Are You Still Moving Data? Is ETL Still Relevant in the Era of Had... This post is about that webinar.

In summary, the webinar nicely explained how an enterprise can use Hadoop as a data hub alongside its existing data warehouse setup. The phrase “Hadoop as a Data Hub” itself raised a lot of questions in my mind:

  1. When we project Hadoop as a data hub and at the same time maintain the data warehouse as another data…
Read more…

Hadoop Training

Course Description: This training course is designed for developers who want to better understand how to create Apache Hadoop solutions. The 35-hour course gives Java programmers the training they need to create enterprise solutions using Apache Hadoop. It consists of a prudent combination of interactive lectures and extensive hands-on lab exercises.…

Read more…

Guest blog post by Francesca Krihely.

Here’s a prediction and a challenge, rolled into one. Whatever the level of your present understanding of Hadoop, in short, you’re going to hear a lot more about Hadoop in future.

And the challenge? Well, it’s this: whatever the level of your present understanding of Hadoop, you’re also likely to be missing critical pieces of the jigsaw. Which pieces? Read on.

Hadoop, let’s first of all remind ourselves, is an open source data platform which performs a very neat trick. Simply put, Hadoop is a tool for tying together multiple servers into single, easily-scalable clusters, ideal for distributed data storage and processing.

So it’s not too…

Read more…

Datameer is a browser-based BI platform that makes Hadoop accessible to all users of an organization. This demo video features a multi-channel retail enterprise that wants to build and maintain a 360-degree view of its customers using data from sources such as tweets, clickstream data, and its own structured customer databases, blended with public datasets.


 Get Started

  1. Watch the Demo Video
  2. Download VirtualBox (required for Datameer Playground)
  3. Download the Datameer – Hortonworks ‘Playground’ (requires…
Read more…

In this article, Dr. Granville proposes a simple metric to measure predictive power. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction scoring, in order to optimize predictive models. This is about rather big data, and we would like to see a Hadoop methodology for the technology proposed here. It can easily be implemented in a MapReduce framework. It was developed by the author in the context of credit card fraud detection and click/keyword scoring. This material will be part of our data science apprenticeship, and included in our Wiley book.…


Read more…

[Book] Hadoop: The Definitive Guide

Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
Read more…