Featured Posts (353)

New Books and Resources for DSC Members

We are in the process of writing and adding new material (compact eBooks) exclusively available to our members, written in simple English by world-leading experts in AI, data science, and machine learning. We invite you to sign up here so you don't miss these free books.



Currently, the following content is available:

1. Statistics: New Foundations, Toolbox, and Machine Learning Recipes

By Vincent Granville. This book is intended for busy professionals working with data of any kind: engineers, BI analysts, statisticians, operations research analysts, AI and machine learning professionals, economists, data scientists, biologists, and quants, ranging from beginners to executives. In about 300 pages and 28 chapters it covers many new topics, offering a fresh perspective on the subject, including rules of thumb and recipes that are easy to automate or integrate into black-box systems, as well as new model-free, data-driven foundations for statistical science and predictive analytics. The approach f

Read more…
Comments: 0

Today, two of the most popular technologies are Big Data and Cloud Computing, which are fundamentally different. On one hand, Big Data is all about dealing with huge amounts of data. On the other hand, cloud computing is all about handling enterprise infrastructure. Together, these technologies have simplified many business operations. One good example of their merger is Amazon's "Elastic MapReduce," which leverages the power of both Big Data processing and Cloud Computing.

The combination of the two technologies yields beneficial outcomes for organizations. Although both are still evolving, together they can deliver cost-effective and scalable business solutions.

So, can we assume that cloud computing and big data are a perfect combination? Well, this article is all about the effect of big data and cloud computing technologies on business operations.

The Rapport of Big Data and Cloud Comput

Read more…

Invitation to Join Data Science Central

Join the largest community of machine learning (ML), deep learning, AI, data science, business analytics, BI, operations research, mathematical and statistical professionals: Sign up here. If instead, you are only interested in receiving our newsletter, you can subscribe here. There is no cost.


The full membership includes, in addition to the newsletter subscription:

  • Access to member-only pages, our free data science eBooks, data sets, code snippets, and solutions to data science / machine learning / mathematical challenges.
  • Support for all your questions regarding our community.
  • Data sets, projects, cheat sheets, tutorials, programming tips, summarized information easy to digest, DSC webinars, data science events (conferences, workshops), new books, and news. 
  • Ability to post blogs and forum questions, as well as comments, and get answers from experts in their field. 

You can easily unsubscribe at any time. Our weekly digest features selected discussions, articles writt

Read more…
Comments: 0

Apache Hadoop Admin Tips and Tricks

This article was written by Renata Ghisloti Duarte Souza Gra.

In this post I will share some tips I learned after using the Apache Hadoop environment for several years and teaching many workshops and courses. The information here applies to Apache Hadoop around version 2.9, but it can probably be extended to other similar versions.

These are considerations for building or using a Hadoop cluster; some apply specifically to the Cloudera distribution. I hope they help!

  • Don't use Hadoop for millions of small files: they overload the namenode and slow it down. It is not difficult to overload the namenode, so always check capacity versus the number of files. Files on Hadoop should usually be larger than 100 MB.
  • Plan for about 1 GB of namenode memory for every million files.
  • Nodes usually fail after about 5 years, and node failure is one of the most frequent problems in Hadoop. Big companies like Facebook and Google likely see node failures every minute.
  • The MySQL on Cloudera Manager
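The namenode sizing rule of thumb above (roughly 1 GB of heap per million files) is easy to turn into a quick capacity check. A minimal sketch, assuming that ratio holds and ignoring block-count and directory overhead:

```python
def namenode_heap_gb(num_files, gb_per_million=1.0):
    """Rough namenode heap estimate: ~1 GB per million files (rule of thumb)."""
    return num_files / 1_000_000 * gb_per_million

# A cluster tracking 50 million files needs on the order of 50 GB of heap.
print(namenode_heap_gb(50_000_000))  # → 50.0
```

This is only a sanity check for planning; real namenode memory also depends on block counts and replication metadata.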
Read more…
Comments: 0

This article was written by Venkatesan M.



Imagine there are two girls standing in front of you. The first girl is cute, beautiful, interesting, and has a smile that any guy would die for. The other girl is average-looking, quiet, not-so-impressive, no different from the ones you usually see at the restaurant cash counter. Which girl will you ask out on a date? If you're like me, you will choose the attractive girl. You see, life is full of options, and making the right choice is what matters most.

If you’re a Java developer, then you probably have more choices to make – like the switch from Java to Hadoop.

Big data and Hadoop are the two most popular buzzwords in the industry. Chances are that you have come across these two terms on the Java payscale forums or seen your senior colleagues making the switch to get bigger paychecks. I’ll tell you what, the upgrade from Java to Hadoop is not just about staying updated with the latest technology or getting appraisals – it’s ab

Read more…
Comments: 0

The Know-How of Blockchain Technology

Considered the fourth industrial revolution, Blockchain technology is making news everywhere. Created to support cryptocurrencies, it has found its way into industries like finance, commerce, the judiciary, and so on.

In this article, I will discuss how Blockchain technology was built, how it works, and some of its features. It is a must-read for those who have been hearing about Blockchain technology here and there and are curious to learn more.


Let us begin!

Background of Blockchain Technology

Invented by Satoshi Nakamoto in 2008, Blockchain technology served as the public transaction ledger of Bitcoin (a cryptocurrency). Many scientists now project that it will have a great impact on institutional functions, national governance, education, business operations, and our day-to-day lives in the twenty-first century.

Moreover, Blockchain technology has the power to convert today's internet of 'sharing informa

Read more…

A Look at Big Data and Its Certifications

The success of the internet has largely been reliant on technology and how it has evolved over the years, getting faster and more secure as enterprises continue to grow with analytics and innovation. Big Data is one of these major breakthroughs, serving huge functions at modern organizations. Concepts like data-driven machine learning and the ability to store huge amounts of information to gain context have been hugely beneficial to the strategic aspects of running international businesses. Organizations have been making huge changes to their own infrastructure and workplace processes to accommodate the latest data science systems. But as the industry expands, traditional educational foundations may not cut it, and a dynamic approach toward professional development is recommended. To that end, certification programs are ideal for professionals in the industry who seek to hone their skillsets in particular ways depending on the

Read more…

The SAP engineers and software professionals at Sapphire Now have of late been discussing the possible restrictions on, and likelihood of, SAP S/4HANA acting as a robust platform for managing enterprise strategies for the Internet of Things and Big Data.

The general notion is that the Internet of Things (IoT) is capable of sending a tidal wave of data through any enterprise. It has been projected that billions of controllers and sensors, all capable of connecting to other machines for instruction and analysis, are poised to create trillions of data events and transactions. The SAP managers and engineers shared their views in a series of conversations when probed on how an organization could reconcile huge data quantities with in-memory, built-for-speed computing.

Initially the SAP engineers admitted that sea

Read more…

8 Hadoop articles that you should read

Here's a selection of Hadoop-related articles worth checking out. Enjoy the reading!


What other articles and resources do you recommend?

Read more…
Comments: 0

This tutorial is provided by Guru99. Originally posted here

Apache HADOOP is a framework used to develop data processing applications that are executed in a distributed computing environment.

In this tutorial we will learn,

  • Components of Hadoop
  • Features Of 'Hadoop'
  • Network Topology In Hadoop

Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system, called the Hadoop Distributed File System.

The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data. This computational logic is simply a compiled version of a program written in a high-level language such as Java, which processes data stored in Hadoop HDFS.
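The 'Data Locality' idea, shipping the computation to the data rather than the data to the computation, can be illustrated with a toy sketch. The node names and shard layout here are hypothetical, not part of Hadoop's API; only the small partial results cross the network:

```python
from collections import Counter

# Hypothetical cluster: each node holds a local shard of the data set.
shards = {
    "node1": ["big", "data", "hadoop"],
    "node2": ["hadoop", "data"],
}

def map_on_node(local_records):
    """Runs where the data lives: emit a partial word count per node."""
    return Counter(local_records)

def reduce_counts(partials):
    """Merge the small partial results that traveled over the network."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

partials = [map_on_node(records) for records in shards.values()]
totals = reduce_counts(partials)  # global word count, e.g. 'data' → 2
```

In real Hadoop, the scheduler tries to place each map task on a node (or at least a rack) that already stores the input block; this sketch only conveys the shape of that idea.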

HADOOP is an open-source software framework. Applications built using HADOOP run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available. These a

Read more…
Comments: 0

Google Spanner: The Future of NoSQL

Guest blog post by Mohammad Tariq Iqbal

Quite often, while working with HBase, I used to think how cool it would be to have a database that could consistently replicate my data to datacenters across the world, so that I could enjoy global availability and geographic locality; that would preserve my data even in case of a catastrophe or natural disaster; that supported general-purpose transactions and provided a SQL-based query language; and that had the features of a SQL database as well. It was only recently that I found out this is no longer just imagination.

I was sitting with a senior colleague and friend of mine at a nearby Cafe Coffee Day, having a casual chat about Big Data. During the discussion he told me about something called SPANNER.
(You might be wondering why I have emphasized the word Spanner so much. Believe me, you will do the same after reading this post.)

After that meeting I almost forgot about that incident. Out of the blue, th

Read more…
Comments: 1

Originally posted here by Bernard Marr.

When you learn about Big Data you will sooner or later come across this odd-sounding word: Hadoop. But what exactly is it?

Put simply, Hadoop can be thought of as a set of open source programs and procedures (meaning essentially they are free for anyone to use or modify, with a few exceptions) which anyone can use as the "backbone" of their big data operations.


I'll try to keep things simple, as I know a lot of people reading this aren't software engineers, and I hope I don't over-simplify anything. Think of this as a brief guide for someone who wants to know a bit more about the nuts and bolts that make big data analysis possible.

The 4 Modules of Hadoop

Hadoop is made up of "modules", each of which carries out a particular task essential for a computer system designed for big data analytics.

1. Distributed File-System

The two most important are the Distributed File System, which allows data to be stored in an easily accessible format,

Read more…
Comments: 0

Top 10 Commercial Hadoop Platforms

Guest blog post by Bernard Marr

Hadoop – the software framework which provides the necessary tools to carry out Big Data analysis – is widely used in industry and commerce for many Big Data related tasks.

It is open source, essentially meaning that it is free for anyone to use for any purpose, and can be modified for any use. While designed to be user-friendly, in its “raw” state it still needs considerable specialist knowledge to set up and run.

Because of this a large number of commercial versions have come onto the market in recent years, as vendors have created their own versions designed to be more easily used, or supplied alongside consultancy services to get you crunching through your data in no time.


These days, this is often provided in the form of "Hadoop-as-a-service": all of the installation takes place within the vendor's own cloud, with customers paying a subscription to access the services.

Here’s a run-down, in no particular order, of 10 of the mos

Read more…
Comments: 0

Batch vs. Real Time Data Processing

Guest blog post by Michael Walker


Batch data processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time. Data is collected, entered, and processed, and then the batch results are produced (Hadoop is focused on batch data processing). Batch processing requires separate programs for input, processing, and output. Payroll and billing systems are examples.

In contrast, real-time data processing involves a continual input, processing, and output of data. Data must be processed within a small time period (in or near real time). Radar systems, customer services, and bank ATMs are examples.
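The contrast above can be sketched in a few lines: a batch job accumulates all records and processes them in one pass, while a stream processor produces a result per incoming record. A toy illustration, with a made-up transaction list:

```python
def batch_total(transactions):
    """Batch: collect everything first, then produce one result at the end."""
    return sum(transactions)

def stream_totals(transactions):
    """Streaming: emit an updated running total per incoming record."""
    total = 0
    for t in transactions:
        total += t
        yield total  # available immediately, not only at end of the batch

txns = [100, 250, 50]
print(batch_total(txns))          # one answer after the whole batch: 400
print(list(stream_totals(txns)))  # an answer per event: [100, 350, 400]
```

The batch version is simpler and cheaper per record; the streaming version is what you need when acting within seconds matters.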


While most organizations use batch data processing, sometimes an organization needs real-time data processing. Real-time data processing and analytics give an organization the ability to take immediate action when acting within seconds or minutes is significant. The goal is to obtain the insight required to act prudently at th

Read more…
Comments: 0

Is Big Data Harmful or Good?

Data is an asset for industries, as it helps them make informed choices. Data is being produced at an extraordinary rate, and organizations are hoarding it like there's no tomorrow, generating the enormous data sets we call big data. But is big data serving these businesses, or is it just obscuring the decision-making process? We will find out.

Big data has numerous applications and, combined with analytics, is used to find answers to problems in a variety of businesses. For organizations, it can help them understand customer behavior and get the most out of business processes, all of which, in theory, should help managers make sound choices to drive business growth. But like so many things that sound good in theory, it's not exactly working out for many organizations. In a worldwide survey of over 300 C-level enterprise executives by the Chartered Global Management Accountant (CGMA), which was complemented by in-depth interviews wit

Read more…
Comments: 0

Lambda Architecture for Big Data Systems

Guest blog post by Michael Walker


Big data analytical ecosystem architecture is in early stages of development. Unlike traditional data warehouse / business intelligence (DW/BI) architecture which is designed for structured, internal data, big data systems work with raw unstructured and semi-structured data as well as internal and external data sources. Additionally, organizations may need both batch and (near) real-time data processing capabilities from big data systems.

Lambda architecture, developed by Nathan Marz, provides a clear set of architecture principles that allows batch and real-time (stream) data processing to work together while building immutability and recomputation into the system. Batch processing handles high volumes of data, where a group of transactions is collected over a period of time: data is collected, entered, and processed, and then batch results are produced. Batch processing requires separate programs for input, processing, and output. An example is payroll
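The batch/speed split that Lambda architecture prescribes can be sketched as follows: a batch view periodically recomputed from the immutable master data set, a speed layer covering only events since the last batch run, and a query that merges the two. All names and data here are illustrative, not taken from Marz's implementation:

```python
from collections import Counter

# Immutable master data set: all events ever received (append-only).
master = ["click", "view", "click"]

def batch_view(events):
    """Batch layer: periodically recompute a view from ALL historical data."""
    return Counter(events)

# Speed layer: incremental counts for events since the last batch recompute.
recent = ["view", "click"]
realtime_view = Counter(recent)

def query(key):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch_view(master)[key] + realtime_view[key]

print(query("click"))  # 2 from the batch view + 1 from the speed layer = 3
```

Immutability and recomputation show up here directly: the batch view can always be rebuilt from `master`, so a bug in the view logic is fixed by recomputing rather than by patching mutable state.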

Read more…
Comments: 0


Price discrimination and the downward demand spiral were widely used analytical concepts/practices in the airline and hospitality industries, respectively, long before the term Big Data Analytics was even coined. Indeed, these concepts have been taught in elite global business schools for decades. So how did analytics, which has been practiced for decades, suddenly experience a meteoric rise? To answer this question, we need to see the big picture. Below are the key factors that led to the huge buzz around analytics today.

  • Proliferation of data sources – Every day we create 5 quintillion bytes of data. This comes from digital footprints left on social media platforms, IoT sensors, wearables, and transactions, to name a few. An interesting fact is that only 1% of the data collected is ever analyzed. To put this in perspective, all the innovation and insight driven by analytics comes from analyzing just 1% of the data collected globally.
  • Change in Customers’ expectations – Today connected
Read more…


The advent of the sharing economy has brought a sea change in the way urban populations commute locally. Uber, Lyft, and many other local players have made taxi rides convenient, affordable, and safe. These rides have emerged as a strong alternative to public transport, clocking millions of rides per month in some cities. The emergence of hyper-local delivery models to optimize the supply chain has also led to a large number of daily trips by these vehicles.

These developments have mandated the installation of either standalone or smartphone-app-based GPS devices to track and better regulate these rides and fleets of taxis. These GPS systems spew out a ton of data, generating up to gigabytes of data per second. With automobile and technology experts predicting that self-driving cars will replace human-driven cars in no more than a decade, the volume and velocity of GPS data are only set to increase. With that context in mind, it becomes imperative to understand the GPS data and

Read more…

HDFS vs. HBase: All you need to know


The sudden increase in the volume of data from the order of gigabytes to zettabytes has created the need for a more organized file system for storing and processing data. Demand from the data market has brought Hadoop into the limelight, making it one of the biggest players in the industry. The Hadoop Distributed File System (HDFS), Hadoop's commonly known file system, and HBase (Hadoop's database) are among the most topical and advanced data storage and management systems available.

What are HDFS and HBase?

HDFS is fault-tolerant by design and supports rapid data transfer between nodes even during system failures. HBase is an open-source, non-relational (NoSQL) database that runs on top of Hadoop. In terms of the CAP (Consistency, Availability, Partition tolerance) theorem, HBase is a CP system.

HDFS is most suitable for performing batch analytics. However, one of its biggest drawbacks is its inability to perform real-time analysis, the trending requirement of

Read more…


By now, you have probably heard of the Hadoop Distributed File System (HDFS), especially if you are a data analyst or someone responsible for moving data from one system to another. However, what benefits does HDFS have over relational databases?

HDFS is a scalable, open source solution for storing and processing large volumes of data. HDFS has been proven to be reliable and efficient across many modern data centers.

HDFS utilizes commodity hardware along with open source software to reduce the overall cost per byte of storage.

With its built-in replication and resilience to disk failures, HDFS is an ideal system for storing and processing data for analytics. It does not require the underpinnings and overhead to support transaction atomicity, consistency, isolation, and durability (ACID) as is necessary with traditional relational database systems.
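HDFS's built-in replication trades raw capacity for resilience: with the common default replication factor of 3, each logical byte occupies three physical bytes across different datanodes. A quick sketch of that overhead, assuming the default factor and ignoring filesystem and metadata overhead:

```python
def raw_capacity_needed(logical_tb, replication_factor=3):
    """Physical storage needed to hold `logical_tb` of data in HDFS.

    With replication factor 3 (the usual HDFS default), each block is
    stored on three datanodes so that disk failures lose no data.
    """
    return logical_tb * replication_factor

# Storing 100 TB of data needs ~300 TB of raw disk at replication 3.
print(raw_capacity_needed(100))  # → 300
```

This simple multiplier is part of why HDFS leans on cheap commodity disks: tripling storage is affordable there in a way it is not on enterprise arrays.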

Moreover, when compared with enterprise and commercial databases, such as Oracle, utilizing Hadoop as the analytic

Read more…
Comments: 0
