
25 Predictions About The Future Of Big Data

Guest blog post by Robert J. Abate.

In the past, I have published on the value of information, big data, advanced analytics, and the Abate Information Triangle, and I have recently been asked to give my humble opinion on the future of Big Data.

I have been fortunate to have been on three panels recently at industry conferences which discussed this very question with such industry thought leaders as: Bill Franks (CTO, Teradata), Louis DiModugno (CDAO, AXA US), Zhongcai Zhang, (CAO, NY Community Bank), Dewey Murdick, (CAO, Department Of Homeland Security), Dr. Pamela Bonifay Peele (CAO, UPMC Insurance Services), Dr. Len Usvyat (VP Integrated Care Analytics, FMCNA), Jeffrey Bohn (Chief Science Officer, State Street), Kenneth Viciana (Business Analytics Leader, Equifax) and others.

Each brought a unique perspective to the challenges of Big Data and their insights into their “premonitions” as to the future of the field. I would like to summarize their thoughts, adding color to the discussion.

Read more…

The Phoenix framework has been growing in popularity at a quick pace, offering the productivity of frameworks like Ruby on Rails while also being one of the fastest frameworks available. It breaks the myth that you have to sacrifice performance in order to increase productivity.

So what exactly is Phoenix?

Phoenix is a web framework built with the Elixir programming language. Elixir, built on the Erlang VM, is used for building low-latency, fault-tolerant, distributed systems, which are increasingly necessary qualities of modern web applications. You can learn more about Elixir from this blog post or their official guide.

If you are a Ruby on Rails developer, you should definitely take an interest in Phoenix because of the performance gains it promises. Developers of other frameworks can also follow along to see how Phoenix approaches web development.

Meet Phoenix on Elixir: A Rails-like Framework for Modern Web Apps

In this article we will learn some of the things in Phoenix you should keep in mind if you are coming from the world of Ruby on Rails.

Read more…

A Guide to Managing Webpack Dependencies

The concept of modularization is an inherent part of most modern programming languages. JavaScript, though, lacked any formal approach to modularization until the arrival of ES6, the latest version of ECMAScript.

In the ecosystem of Node.js, one of today’s most popular JavaScript platforms, module bundlers allow loading NPM modules in web browsers, and component-oriented libraries (like React) encourage and facilitate modularization of JavaScript code.

Webpack is one of the available module bundlers that processes JavaScript code, as well as all static assets, such as stylesheets, images, and fonts, into a bundled file. Processing can include all the necessary tasks for managing and optimizing code dependencies, such as compilation, concatenation, minification, and compression.

Webpack: A Beginner's Tutorial

However, configuring Webpack and its dependencies can be stressful and is not always a straightforward process, especially for beginners.

This blog post provides

Read more…

Where & Why Do You Keep Big Data & Hadoop?

Guest blog post by Manish Bhoge

I am back! Yes, I am back on my learning track. Sometimes it is really necessary to take a break and ask why we learn before learning. It was a nine-month safe refuge to learn how Big Data & Analytics can contribute to a Data Product.


Data strategy has always been expected to generate revenue. As Big Data and Hadoop enter the enterprise data strategy, the big data infrastructure is also expected to add revenue. This is a tough expectation for a new entrant (Hadoop) when the established candidate (the Data Warehouse & BI) itself often struggles to justify its existence. So it is very pertinent for solution architects to ask WHERE and WHY to bring Big Data (and, obviously, Hadoop) into the data strategy. The safe option for this new entrant is the place where it supports and strengthens the EXISTING data analysis strategy. Yes: that’s the DATA LAKE.

I hope you have already understood by

Read more…

Guide To Budget Friendly Data Mining

Unlike traditional application programming, where API functions change every day, database programming basically remains the same. The first version of Microsoft Visual Studio .NET was released in February 2002, with a new version released about every two years, not including Service Pack releases. This rapid pace of change forces IT personnel to re-evaluate their corporation’s applications every couple of years, leaving the functionality of each application intact but rewriting its source code completely in order to stay current with the latest techniques and technology.

The same cannot be said about your database source code. A standard query of SELECT/FROM/WHERE/GROUP BY, written back in the early days of SQL, still works today. Of course, this doesn’t mean there haven’t been any advancements in relational database programming; there were, and they’ve been more logical than technical.
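That longevity is easy to demonstrate. As a quick sketch using Python’s built-in sqlite3 module (the table and column names are invented for illustration), the classic SELECT/FROM/WHERE/GROUP BY shape runs unchanged on a modern engine:

```python
import sqlite3

# In-memory database with a tiny, hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)

# The same SELECT/FROM/WHERE/GROUP BY shape written in the early
# days of SQL still works today.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE amount > 0 GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 75.0)]
```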

Data warehouse design hasn’t changed much over the years. However, the way we extract and employ data is evolving and creating new possibilities.

Read more…

Top 30 people in Big Data and Analytics

Originally posted on Data Science Central

Innovation Enterprise has compiled a list of the top 30 individuals in big data who have had a large impact on the development or popularity of the industry.

Unlike other lists, this one is not based on Twitter or social media presence but on direct contributions to the industry, and it focuses on those who played important parts in its growth and sustained popularity.
  1. Doug Cutting and Mike Cafarella, for creating Hadoop
  2. Sergey Brin and Larry Page, founders of Google
  3. Edward Snowden, NSA Whistleblower
  4. Rob Bearden, founder of Hortonworks
  5. Kirk D. Borne, professor and co-creator of the field of astroinformatics
  6. Stephen Wolfram, creator of Mathematica and Wolfram Alpha
  7. Rich Miner, co-founder of Android and a pioneer in the mobile space.
  8. Jamie Miller, CIO at GE
  9. DJ Patil, a data science pioneer, who coined the term “data scientist”
Read more…

Associative Data Modeling Demystified - Part2

Guest blog post by Athanassios Hatzis

Association in Topic Map Data Model


In the previous article of this series we examined the association construct from the perspective of the Entity-Relationship data model. In this post we demonstrate how the Topic Map data model represents associations. In order to link the two, we continue with another SQL query from our relational database:

SELECT suppliers.sid, catalog.catcost
FROM suppliers
INNER JOIN (parts
INNER JOIN [catalog]
ON parts.pid = catalog.catpid)
ON suppliers.sid = catalog.catsid
WHERE (((parts.pid) = 998))
ORDER BY catalog.catcost;


This will fetch all the rows of a result set where we are looking for the minimum catalogue price of a Red Fire Hydrant Cap and for the supplier that manufactures this part. The reader will notice that, apart from the deficiency of the nested JOINs (see here), we had to formalize our search

Read more…

Associative Data Modeling Demystified - Part1

Guest blog post by Athanassios Hatzis

Relation, Relationship and Association

While most players in the IT sector adopted graph or document databases and Hadoop-based solutions (Hadoop being an enabler of the HBase column store), it went almost unnoticed that several new DBMSs based on associative technology appeared on the scene: AtomicDB, the previous database engine of X10SYS, and Sentences. We have introduced and discussed the data modelling architecture and the atomic information resource unit (AIR) of AtomicDB. Similar technology has powered QlikView, a very popular Business Intelligence and Analytics product, since 1993. Perhaps it is less known to the reader that the association construct is a first-class citizen in the Topic Map semantic web standard, and that it can be translated to RDF, the other semantic web standard. In other posts of this series we will see how it is possible to implement Associative Technology in multi-model graph databases such as OrientDB, in object

Read more…

Originally posted on Data Science Central


Recently, in a previous post, we reviewed a path to leverage legacy Excel data and import CSV files through MySQL into Spark 2.0.1. This may apply frequently in businesses where data retention did not always take the database route. However, we demonstrate here that the same result can be achieved in a more direct fashion. We’ll illustrate this on the same platform that we used last time (Ubuntu 16.04.1 LTS running in a Windows VirtualBox, with Hadoop 2.7.2 and Spark 2.0.1) and on the same dataset (my legacy model collection, Unbuilt.CSV). Our objective is to show how to migrate data to Hadoop HDFS and analyze it directly and interactively using the latest ML tools with PySpark 2.0.1 in a Jupyter Notebook.

A number of interesting facts can be deduced through the combination of sub-setting, filtering and aggregating this data, and they are documented in the notebook. For example, with a one-liner, we can rank the most popular scale, the most numerous model,
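The post performs this kind of ranking with PySpark DataFrames in the notebook; as a self-contained stand-in, the same sub-setting and aggregation can be sketched in plain Python (the column names and rows below are invented, not the actual Unbuilt.CSV):

```python
import csv
import io
from collections import Counter

# Hypothetical rows standing in for the legacy model-collection CSV.
data = io.StringIO(
    "kit,scale\n"
    "Spitfire,1/72\n"
    "Mustang,1/48\n"
    "Corsair,1/72\n"
)

# "One-liner" style ranking of the most popular scale.
scales = Counter(row["scale"] for row in csv.DictReader(data))
print(scales.most_common(1))  # [('1/72', 2)]
```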

Read more…

Guest blog post by Marc Borowczak


Moving legacy data to a modern big data platform can be daunting at times. It doesn’t have to be. In this short tutorial, we’ll briefly review an approach and demonstrate it on my preferred data set. This isn’t an ML repository nor a Kaggle competition data set, simply the data I accumulated over decades to keep track of my plastic model collection, and as such it definitely meets the legacy standard!

We’ll describe the steps followed on a laptop VirtualBox machine running Ubuntu 16.04.1 LTS with Gnome. The following steps are required:

  1. Import the .csv file in MySQL, and optionally backup a compressed MySQL database file.
  2. Connect to the MySQL database in Spark 2.0.1 and then access the data: we’ll demonstrate an interactive Python approach using Jupyter PySpark in this post and leave RStudio sparklyr access based on existing methods for another post.
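As a rough, self-contained sketch of the two steps above (SQLite stands in for MySQL so it runs anywhere, and the column names are invented; the actual post uses a Spark JDBC read for step 2):

```python
import csv
import io
import sqlite3

# Step 1: import the .csv file into a database. SQLite stands in
# for MySQL here to keep the sketch self-contained.
csv_file = io.StringIO("kit,scale\nSpitfire,1/72\nMustang,1/48\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unbuilt (kit TEXT, scale TEXT)")
conn.executemany(
    "INSERT INTO unbuilt VALUES (?, ?)",
    [(r["kit"], r["scale"]) for r in csv.DictReader(csv_file)],
)

# Step 2: in the post this is a Spark JDBC read, along the lines of
#   spark.read.format("jdbc").option("url", ...).option("dbtable", "unbuilt").load()
# Here we simply query to confirm the data landed.
count = conn.execute("SELECT COUNT(*) FROM unbuilt").fetchone()[0]
print(count)  # 2
```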

There’s really no need to abandon legacy data: migrating data to a new platform will e

Read more…

I first heard of Spark in late 2013 when I became interested in Scala, the language in which Spark is written. Some time later, I did a fun data science project trying to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming. I highly recommend it for any aspiring Spark developers looking for a place to get started.

Today, Spark is being adopted by major players like Amazon, eBay, and Yahoo! Many organizations run Spark on clusters with thousands of nodes. According to the Spark FAQ, the largest known cluster has over 8000 nodes. Indeed, Spark is a technology well worth taking note of and learning about.


This article provides an introduction to Spark including use cases and examples. It contains information from the Apache Spark website as well as the book Learning Spark - Lightning-Fast Big Data Analysis.
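As a flavor of what such examples look like, the canonical Spark demo is a word count; its map/reduce shape can be sketched in plain Python (in PySpark this would be flatMap, map, and reduceByKey on an RDD):

```python
from collections import Counter

lines = ["spark is fast", "spark is fun"]

# flatMap: split every line into words.
words = (word for line in lines for word in line.split())
# map + reduceByKey: count occurrences per word.
counts = Counter(words)
print(counts["spark"], counts["fast"])  # 2 1
```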

What is Apache Spark? An Introduction

Spark is an Apache project advertised as “lightning fast cluster computing

Read more…

Ember Data (a.k.a. ember-data) is a library for robustly managing model data in Ember.js applications. The developers of Ember Data state that it is designed to be agnostic to the underlying persistence mechanism, so it works just as well with JSON APIs over HTTP as it does with streaming WebSockets or local IndexedDB storage. It provides many of the facilities you’d find in server-side object relational mappings (ORMs) like ActiveRecord, but is designed specifically for the unique environment of JavaScript in the browser.

While Ember Data may take some time to grok, once you’ve done so, you will likely find it to have been well worth the investment. It will ultimately make development, enhancement, and maintenance of your system that much easier.

When an API is represented using Ember Data models, adapters and serializers, each association simply becomes a field name. This encapsulates the internal details of each association, thereby insulating the rest of your code from

Read more…

Fast Forward transformation with SPARK

Fast forward transformation process in data science with Apache Spark

Data Curation:

Curation is a critical process in data science that helps prepare data for feature extraction before running machine learning algorithms. Curation generally involves extracting, organising, and integrating data from different sources. It may be a difficult and time-consuming process depending on the complexity and volume of the data involved.

Most of the time data won’t be readily available for the feature extraction process; it may be hidden in unstructured and complex data sources and has to undergo multiple transformation processes before feature extraction.

When the volume of data is huge, curation itself can be very time consuming and can become a bottleneck for the whole machine learning pipeline.
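As a toy illustration of that kind of multi-step transformation (all field names and records below are invented), curation might normalize and merge two inconsistent sources before any feature extraction:

```python
# Two hypothetical sources with inconsistent formats.
source_a = [{"name": " Alice ", "age": "34"}]   # padded strings, ages as text
source_b = [("BOB", 29)]                        # tuples, upper-case names

def curate():
    # Extract and organise both sources into one clean, uniform shape.
    records = [{"name": r["name"].strip().lower(), "age": int(r["age"])}
               for r in source_a]
    records += [{"name": name.lower(), "age": age} for name, age in source_b]
    return records

print(curate())
# [{'name': 'alice', 'age': 34}, {'name': 'bob', 'age': 29}]
```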

General Tools Used in Data Science:
  • R Language - Widely adopted in data science, with many supporting libraries
  • MATLAB - Commercial tool with many built-in libraries for data science
  • Apache
Read more…

11 Great Hadoop, Spark and Map-Reduce Articles

This reference is part of a new series of DSC articles offering selected tutorials, references/resources, and interesting articles on subjects such as deep learning, machine learning, data science, deep data science, artificial intelligence, Internet of Things, algorithms, and related topics. It is designed for the busy reader who does not have a lot of time to dig into long lists of advanced publications.



Read more…
Google formally announced Android 7.0 a few weeks ago, but as usual, you’ll have to wait for it. Thanks to the Android update model, most users won’t get their Android 7.0 over-the-air (OTA) updates for months. However, this does not mean developers can afford to ignore Android Nougat. In this article, Toptal Technical Editor Nermin Hajdarbegovic takes a closer look at Android 7.0, outlining new features and changes. While Android 7.0 is by no means revolutionary, the introduction of a new graphics API, a new JIT compiler, and a range of UI and performance tweaks will undoubtedly unlock more potential and generate a few new possibilities.
Read more…

Java versus Python

Originally posted on Data Science Central

Here is an interesting picture that went viral on Facebook. We've had plenty of discussions about Python versus R on DSC. This picture is trying to convince us that Python is superior to Java. It is about a tiny piece of code to draw a pyramid.


This raises several questions:

  • Is Java faster than Python? If yes, under what circumstances? And by how much? 
  • Does the speed of an algorithm depend more on the design (quick sort versus naive sort) or on the architecture (Map-Reduce) than on the programming language used to code it?
  • For data science, does Python offer better libraries (including for visualization), easier to install, than Java? What about the learning curve?
  • Is Java more popular than Python for data science, mathematics, string processing, or NLP?
  • Is it better to write simple code (like the Java example above) or compact, hard-to-read code (like the Python example)? You can write Python code that is much longer for this pyramid
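For readers who have not seen the picture, the compact Python version under debate is typically something like the following (the exact code in the image may differ):

```python
n = 5  # number of rows in the pyramid

# Each row is padding plus an odd number of asterisks, centered.
rows = [" " * (n - i) + "*" * (2 * i - 1) for i in range(1, n + 1)]
print("\n".join(rows))  # prints a centered 5-row pyramid of asterisks
```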
Read more…

Originally posted on Data Science Central

These are the findings from a CrowdFlower survey. Data preparation accounts for about 80% of the work of data scientists. Cleaning data is the least enjoyable and most time-consuming data science task, according to the survey. Interestingly, when we asked our data scientist the question, his answer was:

Automating the task of cleaning data is the most time consuming aspect of data science, though once done, it applies to most data sets; it is also the most enjoyable because as you automate more and more, it frees a lot of time to focus on other things.
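In the spirit of that answer, a cleaning step written once as a function can be reapplied to most data sets; a minimal sketch (the cleaning rules here are invented examples):

```python
def clean(values):
    """Reusable cleaning rules: trim, lowercase, drop empties and duplicates."""
    seen, out = set(), []
    for v in values:
        v = v.strip().lower()
        if v and v not in seen:
            seen.add(v)
            out.append(v)
    return out

print(clean([" NYC", "nyc ", "", "Boston"]))  # ['nyc', 'boston']
```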

Below are the three charts published in the Forbes article regarding the survey in question. The one at the bottom lists the most frequent skills found in data scientist job ads.




DSC Resources

Read more…

Why Not So Hadoop?

Guest blog post by Kashif Saiyed

Does Big Data mean Hadoop? Not really; however, when one thinks of the term Big Data, the first thing that comes to mind is Hadoop, along with heaps of unstructured data. It is an exceptional lure: data scientists get the opportunity to work with large amounts of data to train their models, and businesses gain knowledge previously never imagined. But has it lived up to the hype? In this article, we will look at a brief history of Hadoop and see how it stands today.

2015 Hype Cycle – Gartner


Some key takeaways from the Hype cycle of 2015:

  1. ‘Big Data’ was at the Trough of Disillusionment stage in 2014 but is not seen in the 2015 Hype Cycle.
  2. Another interesting point is that the ‘Internet of Things’, which suggests a network of interconnected devices around us, has been at the peak for two consecutive years now.

Just to check on the relevance of the Hype Cycle while sitting in India, I checked Google Trends for the terms ‘Big Data’ and ‘Hadoop’, and here are the results:



Read more…

Originally posted on Data Science Central


Introducing Data Science teaches you how to accomplish the fundamental tasks that occupy data scientists. Using the Python language and common Python libraries, you'll experience firsthand the challenges of dealing with data at scale and gain a solid foundation in data science.

About the Technology

Many companies need developers with data science skills to work on projects ranging from social media marketing to machine learning. Discovering what you need to learn to begin a career as a data scientist can seem bewildering. This book is designed to help you get started.

About the Book

Introducing Data Science explains vital data science concepts and teaches you how to accomplish the fundamental tasks that occupy data scientists. You’ll explore data visualization, graph databases, the use of NoSQL, and the data science process. You’ll use the Python language and common Python libraries as you experience firsthand

Read more…

Originally posted on Data Science Central

Summary:  This is the first in a series of articles aimed at providing a complete foundation and broad understanding of the technical issues surrounding an IoT or streaming system so that the reader can make intelligent decisions and ask informed questions when planning their IoT system. 

In This Article
  • Is it IoT or Streaming?
  • Basics of IoT Architecture – Open Source
  • Data Capture – Open Source with Options
  • Storage – Open Source with Options
  • Query – Open Source with Options

In Lesson 2
  • Stream Processing – Open Source
  • What Can Stream Processors Do?
  • Open Source Options for Stream Processors
  • Spark Streaming and Storm
  • Lambda Architecture

In Lesson 3
  • Three Data Handling Paradigms – Spark versus Storm
  • Streaming and Real Time Analytics
  • Beyond Open Source for Streaming
  • Competitors to Consider

Read more…
