
Originally posted on Data Science Central

Cloud giants like Amazon, Google, Microsoft, and IBM have rushed into the big data analytics cloud market. They claim their tools will make developers' tasks simple. For machine learning, they say their cloud products will free data scientists and developers from implementation details so they can focus on business logic.

The big companies have kicked off a race between machine learning platforms. Amazon ML, Azure ML, IBM Watson and Google Cloud Prediction are striving to fold data science workflows into their existing ecosystems. They want to drive the adoption of machine learning algorithms across software development teams and expand data science throughout the business.

But data scientists and big data platform engineers do not always want or need this one-size-fits-all approach. They understand firsthand how powerful and flexible Apache Spark, R, and Python are when it comes to machine learning. These people are experts who cannot be constr

Read more…

Making data science accessible – HDFS

Originally posted on Analytic Bridge

By Dan Kellett, Director of Data Science, Capital One UK


Disclaimer: This is my attempt to explain some of the ‘Big Data’ concepts using basic analogies. There are inevitably nuances my analogy misses.


What is HDFS?

When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. MapReduce is a framework for efficient processing using a parallel, distributed algorithm (see my previous blog here). The standard approach to reliable, scalable data storage in Hadoop is through the use of HDFS (Hadoop Distributed File System).
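To make the HDFS idea concrete, here is a minimal plain-Python sketch of the storage scheme it uses: a file is split into fixed-size blocks, and each block is copied to several "datanodes" so the data survives a node failure. The block size, replication factor, and node names below are made-up values for illustration only; this is not the real HDFS API.

```python
# Toy sketch of HDFS-style storage: split a file into blocks and
# replicate each block across several datanodes (round-robin here;
# real HDFS uses 128 MB blocks and rack-aware placement).

BLOCK_SIZE = 8          # bytes, tiny for illustration; HDFS default is 128 MB
REPLICATION = 3         # HDFS default replication factor
DATANODES = ["node1", "node2", "node3", "node4"]

def store_file(data: bytes):
    """Split data into blocks and assign each block to REPLICATION nodes."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        nodes = [DATANODES[(idx + r) % len(DATANODES)] for r in range(REPLICATION)]
        placement[idx] = (block, nodes)
    return placement

placement = store_file(b"how much food does each animal eat per day?")
print(len(placement))       # number of blocks the file was split into
print(placement[0][1])      # the three nodes holding the first block
```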


Imagine you wanted to find out how much food each type of animal eats in a day. How would you do it?

The gigantic warehouse

One approach would be to buy or rent out a huge warehouse and store some of every type of animal in the world. Then you could study each type one at a time to get the information you needed – presumably starting with aardvark

Read more…

Originally posted on Data Science Central

Thousands of articles and tutorials have been written about data science and machine learning. Hundreds of books, courses and conferences are available. You could spend months just figuring out what to do to get started, even to understand what data science is about.

In this short contribution, I share what I believe to be the most valuable resources - a small list of top resources and starting points. This will be most valuable to any data practitioner who has very little free time. 



These resources cover data sets, algorithms, case studies, tutorials, cheat sheets, and material to learn the most popular data science languages: R and Python. Some non-standard techniques used in machine-to-machine communications and automated data science, even though technically simpler and more robust, are not included here as their use is not widespread, with one exception: turning unstructured into structured data. We will inc

Read more…

5 Big Data Myths Businesses Should Know

Guest blog post by Larry Alton

Big data is seeping into every facet of our lives. Smart home gadgets are becoming part of the nerve systems of new and remodeled homes, and many renters are demanding these interconnected gadgets from landlords.

But nowhere has Big Data created a bigger buzz than in business. Companies of all sizes are collecting data at a staggering rate. Big data is larger than ever before.

We’ve collected more data in the past two years than in the entire history of the human race. It’s also continuing to grow at an incredible rate: By 2020, analysts believe we’ll be generating about 1.7 megabytes of information per second for every human being.

This information can be useful for businesses in a wide array of mediums, from cloud computing to data processing speeds and customer relations. But just because businesses can collect all this information doesn’t mean they know what to do with it or have the resources to analyze it.

In fact, many businesses are

Read more…

Originally posted on Data Science Central

We just started in this article to provide answers to one of the largest collection of data science job interview questions ever published, and we will continue to add answers to most of these questions. Some answers link to solutions offered in my Wiley data science book: you can find this book here. The 91 job interview questions were originally published here with no answers, and we recently added 50 questions to identify a true data scientist, in this article. Over time, we will add more questions and more answers. So, bookmark this page!

Other lists of Q&A for data science job interviews (including about R) can be found here


Technical Questions

  1. What are lift, KPI, robustness, model fitting, design of experiments, and the 80/20 rule? Answer: KPI stands for Key Performance Indicator, or metric, sometimes called feature. A robust model is one that is not sensitive to changes in the data. Design of experiments or experimental desi
Read more…

Guest blog post by Syed Danish Ali


The challenges of big data can be captured succinctly as follows[1],[2]:

  • Volume: ever-increasing volumes of data break down traditional data-holding capacity.
  • Variety: more and more heterogeneous data, in many formats and types, is bombarding the data environment.
  • Velocity: more and more data is time-sensitive; frequent updates replace reliance on old historical data, and real-time data is now generated by the Internet of Things, amongst others.
  • Veracity: how valid and reliable is the data? Now that we have so much data, any point of view can be supported by selective adaptation of data.

For volume, MapReduce[3] works to harness the potential of billions of items of data. First, the data is mapped into key/value pairs; the reduce job then combines the mapped data into a smaller set of data by eliminating repetition and redundancy, amongst others. Hadoop is open-source for handling bi
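The map-then-reduce flow described above can be sketched as a word count in plain Python. This is only an illustration of the two phases; a real MapReduce job runs distributed across a Hadoop cluster, with the framework shuffling the key/value pairs between the phases.

```python
# Minimal word-count sketch of the map/reduce flow.
from collections import defaultdict

def map_phase(lines):
    """Map: emit one (key, value) pair -- here (word, 1) -- per word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: combine values sharing a key, eliminating repetition."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big volume", "big variety"]
print(reduce_phase(map_phase(lines)))
# {'big': 3, 'data': 1, 'volume': 1, 'variety': 1}
```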

Read more…

Originally posted on Data Science Central

This article introduces Mahout, a library for scalable machine learning, and studies potential applications through two Mahout projects. It was written by Linda Terlouw. Linda is a computer scientist who works on Data Science (Data Analysis, Data Visualization, Process Mining).


Apache Mahout is a library for scalable machine learning. Originally a subproject of Apache Lucene (a high-performance text search engine library), Mahout has progressed to be a top-level Apache project. 

While Mahout has only been around for a few years, it has established itself as a frontrunner in the field of machine learning technologies. Mahout has currently been adopted by: Foursquare, which uses Mahout with Apache Hadoop and Apache Hive to power its recommendation engine; Twitter, which creates user interest models using Mahout; and Yahoo!, which uses Mahout in its anti-spam analytic platform. Other commercial and academic uses of Mahout have been catalog

Read more…

Intro to Bigdata Architecture

Originally posted on Data Science Central


In 2000, Seisint Inc. (now LexisNexis Group) developed a C++-based distributed file-sharing framework for data storage and query. The system stores and distributes structured, semi-structured, and unstructured data across multiple servers. Users can build queries in a C++ dialect called ECL. ECL uses an "apply schema on read" method to infer the structure of stored data when it is queried, instead of when it is stored. In 2004, LexisNexis acquired Seisint Inc. and in 2008 acquired ChoicePoint, Inc. and their high-speed parallel processing platform. The two platforms were merged into HPCC (or High-Performance Computing Cluster) Systems and in 2011, HPCC was open-sourced under the Apache v2.0 License. Currently, HPCC and Quantcast File System are the only publicly available platforms capable of analyzing multiple exabytes of data.
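The "apply schema on read" idea mentioned above can be illustrated with a small plain-Python sketch: raw records are stored as-is, and a schema (field names and types) is applied only when the data is queried. The sample rows and schema here are invented for illustration; ECL itself is a declarative C++-dialect language and is not shown.

```python
# Toy illustration of schema-on-read: store raw lines, parse at query time.

raw_storage = [
    "alice,34,London",
    "bob,29,Leeds",
]

def read_with_schema(rows, schema):
    """Parse each stored line using the schema supplied at query time."""
    out = []
    for row in rows:
        fields = row.split(",")
        out.append({name: cast(value)
                    for (name, cast), value in zip(schema, fields)})
    return out

# The same raw bytes can be read with different schemas at query time.
schema = [("name", str), ("age", int), ("city", str)]
print(read_with_schema(raw_storage, schema)[0])
# {'name': 'alice', 'age': 34, 'city': 'London'}
```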

In 2004, Google published a paper on a process called MapReduce that uses a similar architecture. The

Read more…

Guest blog post by Ankit Jain

Hadoop is a leading platform for big data analytics, and enterprises have been using Hadoop-based applications to perform a host of operations, including the management, evaluation, and storage of massive amounts of data. By ensuring optimal use of time, Hadoop-as-a-Service enables companies to function in an extremely cost-effective manner.

The report reviews the global Hadoop-as-a-service market on the basis of geography and divides the international market into Europe, Asia Pacific, North America, and Rest of the World. North America is the largest market for HaaS and is anticipated to retain its lead through 2023. Europe follows next in line with a competitive HaaS market. The Asia Pacific Hadoop-as-a-service market is projected to witness rapid growth in the coming years fueled by strong support from several governments for the effective installation of necessary infrastructure to achieve real-time optimization and for the reconfiguration of netw

Read more…

Originally posted on Data Science Central

The era of big data has witnessed a paradigm shift in analytics. Today, it's no longer sufficient to simply gather data from social media, IoT, and wearable devices and then be unable to manage or filter it. It is more about delivering the right data to the right person, at the right time.

This trend is growing crucial as data is multiplying every day and pouring in from various devices and smart machines including wearables, electronic gadgets, and other devices. Such factors call for the treatment of vast pools of structured and unstructured data with care and precision. This is precisely where invisible analytics come in.

Big Data was the Past; 2015 is the Start Point to Take Analytics to the Next Level

By far, big data has remained an enabler of the new wave of analytics solutions. However, the challenge for big data analytics lies in traditional hardware storage capacity and processing rates that severely lag during operations, thus

Read more…

Crime Analysis with Zeppelin, R & Spark

Guest blog post by Raghavan Madabusi

Apache Zeppelin, a web-based notebook, enables interactive data analytics, including data ingestion, data discovery, and data visualization, all in one place. The Zeppelin interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. Currently, Zeppelin supports many interpreters, such as Spark (Scala, Python, R, SparkSQL), Hive, JDBC, and others. Zeppelin can be configured with an existing Spark ecosystem and share a SparkContext across Scala, Python, and R.

Check out the Use Case that demonstrates some of Zeppelin's capabilities...

Crime Analysis Zeppelin Dashboard:


Read more…

Hadoop and In-Cluster Analytics: Whitepaper

Recent technology advances within the Apache Hadoop ecosystem have provided a big boost to Hadoop’s viability as an analytics environment. It is likely that you have a significant amount of data in Hadoop and a need to accurately analyze it. In this whitepaper we outline 3 points to help you avoid bumping up against the limits of the database you’re moving the data into—or how much of it you can afford to use. Thanks to the improvements with SQL-in-Hadoop technologies you can now leverage the power of the cluster you already have in place, expanding and accelerating what you can do while saving you time and resources.

Whitepaper offered courtesy of Looker


>> Download Now

Read more…

Hadoop Security Issues and Best Practices

Originally posted on Analytic Bridge

The big data blast has given rise to a host of information technology software, tools, and capabilities that enable companies to capture, manage, and analyze large sets of structured and unstructured data for result-oriented insights and competitive success. But with this latest technology comes the challenge of keeping confidential information secure and private.

Big data residing within a Hadoop environment contains sensitive confidential data such as bank account details, financial information such as credit card numbers, corporate business data, property information, personal confidential information, and clients' security information.

Due to the confidential nature of this data and the damage that could be done should it fall into the wrong hands, it is mandatory that it be protected from unauthorized access.

Let's look at some general Hadoop security issues, along with best practices to keep sensitive data protected and secure.

Security con

Read more…

Guest blog post by skumar T

YARN ResourceManager (the YARN master service component):

1) Controls the total resource capacity of the cluster.

2) Sets the minimum container size for any container requested in the cluster, controlled by a YARN configuration property:

--> yarn.scheduler.minimum-allocation-mb  1024 (this value changes based on cluster RAM capacity)

Description: The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this won't take effect, and the specified value will get allocated at minimum.

Similarly, the maximum container size:

--> yarn.scheduler.maximum-allocation-mb  8192 (this value changes based on cluster RAM capacity)

Description: The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this won't take effect, and will get capped to this value.

3) In the same way, the number of cores to assign for each job:

--> yarn.scheduler.minimum-allocation-vcores  1 (value)

Description: The minimum a
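As a sketch, the properties quoted above would sit in the cluster's yarn-site.xml like this; the values shown are the defaults mentioned in the text and should be tuned to the cluster's RAM and core capacity.

```xml
<!-- Sketch of yarn-site.xml entries for the scheduler properties above;
     tune the values to the cluster's actual RAM and vcore capacity. -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
```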

Read more…

Guest blog post by Ankit Jain

Since its inception in 2008, the global Hadoop market has grown at a tremendous pace. This market, valued at US$1.5 billion in 2012, is estimated to grow at a CAGR of 54.7% from 2012 to 2018. By the end of 2018, this market could amass a net worth of US$20.9 billion. With the massive amount of data generated every day across major industries, the global Hadoop market is anticipated to see significant growth in the future as well.

Why Hadoop?

Quite naturally, the mounting scales of unstructured data generated every single day by data-intensive industries such as telecommunications, banking and finance, social media, research, healthcare, and defence have led to the rising adoption of Hadoop solutions.

The major factors driving the need to adopt Hadoop are its cost-effective and scalable methodologies of data handling. Hadoop has taken the big data market by storm, levelling all other data management technologies that ruled the market be

Read more…

Guest blog post by Tanmay Bhandari

Originally posted on Data Science Central

In the book Hadoop: The Definitive Guide, Tom White quotes Grace Hopper: “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” Hadoop has long been the data analytics system preferred by businesses all over. The recent entry of the Spark engine, however, has given businesses an option other than Hadoop for data analytics purposes.

Much discussion among experts in the field of big data analytics concerns which of the two data analytics engines, Hadoop or Spark, is the better performer when it comes to applications in business. While Hadoop has been around for a long time, Spark is a new data analytics system released just a couple of months ago. Both systems have been developed by Apache, and both are open-source platforms.

Both Hadoop

Read more…


In any hiring process, a candidate with a professional certification always gets extra attention. Here are a few of the certifications in data science.

IBM Certified Data Architect -- Big Data

Through this training, you will master your skills in handling big data. The data architect will have knowledge of different big data technologies, know their differences, and finally integrate them to find solutions to any business problem. The certification holder will be able to plan big data processes and help with hardware and software architecture planning. This course, certified by IBM as IBM Big Data, is an added advantage for getting your resume shortlisted in interviews.

EMC Data Scientist Associate (EMCDSA) – EMC

Through this certification, the candidate will gain the ability to work as part of a team on projects dealing with big data. Once you complete this certification, you will be able to deploy the data analytics lifecycle, rebuilding an analytics challe

Read more…
Choice is a good thing, but too much choice can lead to confusion and to buyers taking a “wait-and-see” approach until the market coalesces around the eventual winners. Lack of choice was an important factor in how quickly and readily companies bought into the RDBMS movement 30 or so years ago. I believe that too much choice is holding companies back from buying into the Hadoop / NoSQL movement.
Read more…

Guest blog post by Bill Vorhies

Summary:  What happens after you make those critical discoveries in the Data Lake and need to make that new data and its insights operational?

Image source: EMC

Data Lakes are a new paradigm in data storage and retrieval and are clearly here to stay. As a concept, they are an inexpensive way to rapidly store and retrieve very large quantities of data that we think we want to save but aren’t yet sure what we want to do with. As a bonus, they can hold unstructured or semi-structured data, streaming data, or very large quantities of data, covering all three “Vs” of Big Data. The great majority of these are Hadoop key-value stores, which many technical reviewers report to have unstoppable momentum.

What brought these into existence is of course NoSQL technology, which arose originally to solve the pain point of not being able to store and retrieve the volume, variety, and velocity of Big Data. But that was just the start. Once the technology was

Read more…

70 MongoDB Interview Questions and Answers

Guest blog post by Laetitia Van Cauwenberge

According to Wikipedia, MongoDB is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB avoids the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. MongoDB is developed by MongoDB Inc. and is published as free and open-source software. MongoDB is the fourth most popular type of database management system, and the most popular for document stores.
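To see what "JSON-like documents with dynamic schemas" means in practice, here is a plain-Python sketch: two documents in the same collection need not share the same fields. The field names and values are invented for illustration, and the collection is modeled as a list of dicts; a real client would use a MongoDB driver such as pymongo, which is not shown here.

```python
# Illustration of dynamic schemas: documents in one collection can differ.
import json

collection = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    # A later document can add fields no earlier document had --
    # no ALTER TABLE step, unlike a relational schema.
    {"_id": 2, "name": "Bob", "tags": ["nosql", "mongodb"], "age": 29},
]

# Query: find documents that have a "tags" field at all.
tagged = [doc for doc in collection if "tags" in doc]
print(json.dumps(tagged[0]["tags"]))   # ["nosql", "mongodb"]
```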


MongoDB is definitely a great skill to have on your resume if you are a data scientist. To install MongoDB on your computer, click here. Below, you will find dozens of job interview questions about MongoDB. For other related job interview questions (R, Python, Data Science, Hadoop, etc.) click here. The above picture compares the performance of four NoSQL databases / file systems. Click here 

Read more…
