
Guest blog post by Bill Vorhies

Summary:  The shortage of data scientists is driving a growing number of developers to fully Automated Predictive Analytic platforms.  Some of these offer true One-Click Data-In-Model-Out capability, playing to Citizen Data Scientists with limited or no data science expertise.  Who are these players and what does it mean for the profession of data science?


In a recent poll the question was raised “Will Data Scientists be replaced by software, and if so, when?”  The consensus answer:

Data Scientists automated and unemployed by 2025.

Are we really just grist for the AI mill?  Will robots replace us?

As part of the broader digital technology revolution, we data scientists regard ourselves as part of the solution, not part of the problem.  But in a fast-moving industry built on identifying and removing pain points, it's possible to see that we are actually part of the problem.

Seen as a good news / bad news story it goes like this.  The good news is that advanced predictive analytics are gaining acceptance and penetration at an ever expanding rate.  The bad news is that there are not enough well trained data scientists to go around meaning we’re hard to find and expensive once you find us.  That’s the pain.

A fair number of advanced analytic platform developers see this too and the result is a rising number of Automated Predictive Analytic platforms that actually offer One-Click Data-In-Model-Out.

While close-in trends are easy to see, those that may fundamentally remake our professional environment over the next three to five years can be a little more difficult to spot.  I think Automated Predictive Analytics is one of those.

This topic is too broad for one article so I’ll devote this blog to illustrating what these platforms claim they can do and give you a short list of participants you can check out for yourself.  You may have thought that the broader issue is whether or not predictive analytics can actually be automated.  It’s not.  When you examine these companies you’ll see that boat has sailed.

The broader issues for future discussion are:

  • Is this a good or bad thing,
  • How can we integrate it into the reality of the practice of our day-to-day data science lives, and
  • How will this impact our profession over the next three to five years?

What Exactly Is Automated Predictive Analytics

Automated Predictive Analytics are services that allow a data owner to upload data and rapidly build predictive or descriptive models with a minimum of data science knowledge.

Some will say that this is the benign automation of our overly complex toolkit, simplifying tasks like data cleaning and transformation that don’t require much creativity, or the simultaneous parallel generation of multiple ML models to rapidly arrive at a champion model.  This would be akin to the evolution from the hand saw to the power saw to the CNC cutting machine.  These are enhancements that make data scientists more productive so that’s a good thing.

However, ever since Gartner seized on the term Citizen Data Scientist and projected that this group would grow 5X more quickly than data scientists, analytic platform developers have seen this group possessing a minimum of data science knowledge as a key market for expansion.

Whether this is good or bad I’ll leave for later.  For right now we need to acknowledge that the direction of development is toward systems so simplified that only a minimum expertise with data science is required.

A Little History

The history of trying to automate our tool kit is actually quite long.  In his excellent 2014 blog, Thomas Dinsmore traces about a dozen of these events all the way back to UNICA in 1995.  Their Pattern Recognition Workbench used automated trial and error to optimize a predictive model.

He tracks the history through MarketSwitch in the late 1990’s (You Can Fire All Your SAS Programmers), to KXEN (later purchased by SAP), through efforts by SAS and SPSS (now IBM), ultimately to the open source MLBase project and the ML Optimizer by the consortium of UC Berkeley and Brown University to create a scalable ML platform on Spark.  All of these in one form or another took on the automation of either data prep or model tuning and selection or both.

What characterized this period ending just a few years ago is that all of these efforts were primarily aimed at simplifying and making efficient the work of the data scientist.

As far back as about 2011 though, and with many more entrants since 2014, there is a cadre of platform developers who now seek One-Click Data-In-Model-Out simplicity for the non-data scientist.

Sorting Out the Market

As you might expect there is a continuum of strategies and capabilities present in these companies.  These range from highly simplified UIs that still require the user to go through the steps of cleaning, discovery, transformation, model creation, and model selection all the way through to true One-Click Data-In-Model-Out. 

On the highly simplified end of the scale are companies like BigML, targeting non-data scientists.  BigML leads the user through the classical steps in preparing data and building models using a very simplified graphical UI.  There’s a free developer mode and very inexpensive per-model pricing.

Similarly, Logical Glue also targets non-data scientists using the theme ‘Data Science is not Rocket Science’.  Like BigML, it still requires the user to execute five simplified data modeling steps using a graphical UI.

But to keep our focus on the true One-Click Data-In-Model-Out Platforms we’ll focus on these five:

(This is not intended to be an exhaustive list but drawn from platforms I’ve looked at over the last few months.)

  1. PurePredictive
  2. DataRPM
  3. DataRobot
  4. Xpanse Analytics
  5. ForecastThis

Essentially all of these are cloud based, though a few can also be implemented on-prem or even on a workstation.

While all are true one-clicks their strategies and capabilities reflect different go-to-market strategies.

DataRPM and Xpanse Analytics have well developed front end data blending capabilities while the others start with analytic flat files.

PurePredictive and DataRPM make no bones about pitching directly to the non-data scientist while DataRobot and Xpanse Analytics have expert modes trying to appeal to both amateurs and professionals.  ForecastThis presents as a platform purely for data scientists.

Claims and Capabilities

As to accuracy, I’ve personally tested only one, PurePredictive, where I ran about a dozen datasets that I had previously scored on other analytic platforms.  The results were surprisingly good, with a few coming in slightly more accurate than my previous efforts and a few slightly less so, but with no great discrepancies.  Some of these datasets I intentionally left ‘dirty’ to test the data cleansing function.  The claim of one-click simplicity, however, was absolutely true, and each model completed in only two or three minutes.

Some Detail

PurePredictive

Target:  Non-data scientist.  One-Click MPP system runs over 9,000 ML algorithms in parallel, selecting the champion model automatically. (Note:  Their figure of 9,000 different models is believed to be based on variations in tuning parameters and ensembles using a large number of native ML algorithms.)

  1. Blending: no, starts with analytic flat file.
  2. Cleanse:  yes
  3. Impute and Transform:  yes
  4. Select ML Algorithms to be utilized:  runs over 9,000 simultaneously including many variations on regression, classification, decision trees, neural nets, SVMs, BDMs, and a large number of ensembles.
  5. Run Algorithms in Parallel: yes
  6. Adjust Algorithm Tuning Parameters during model development: yes
  7. Select and deploy:  User selects.  Currently only by API.
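The "run many models, keep the champion" pattern these platforms automate can be sketched in a few lines.  This is a minimal illustration using scikit-learn, not PurePredictive's actual internals; the candidate list and the accuracy metric are our own assumptions:

```python
# Minimal sketch of automated champion-model selection: score several
# candidate models by cross-validation and keep the best performer.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an uploaded analytic flat file.
X, y = make_classification(n_samples=500, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# Score every candidate and keep the champion (highest mean CV accuracy).
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
champion = max(scores, key=scores.get)
print(champion, round(scores[champion], 3))
```

Real platforms expand the candidate set with tuning-parameter variations and ensembles, which is how counts like 9,000 can arise from a much smaller set of native algorithms.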


DataRPM

Target:  Non-data scientist.  One-Click MPP system for recommendations and predictions.  UI based on ‘recipes’ for different types of DS problems that lead the non-data scientist through the process.

  1. Blending: yes.
  2. Cleanse:  yes
  3. Impute and Transform:  yes
  4. Select ML Algorithms to be utilized:  runs many but types not specified
  5. Run Algorithms in Parallel: yes
  6. Adjust Algorithm Tuning Parameters during model development: yes
  7. Select and deploy:  User selects.  Deploy via API.


DataRobot

Target:  Non-data scientist but with expert override controls for Data Scientists.  Theme: ‘Data science in the cloud with a copilot’.  Positioned as a high performance machine learning automation software platform and a practical data science education program that work together.

  1. Blending: no, starts with analytic flat file.
  2. Cleanse:  yes
  3. Impute and Transform:  yes
  4. Select ML Algorithms to be utilized:  Random Forests, Support Vector Machines, Gradient Boosted Trees, Elastic Nets, Extreme Gradient Boosting, ensembles, and many more.
  5. Run Algorithms in Parallel: yes
  6. Adjust Algorithm Tuning Parameters during model development: yes
  7. Select and deploy:  User selects.  Deploy via API or exports code in Python, C, or JAVA.


Xpanse Analytics

Target:  Both Data Scientist and non-data scientist.  Differentiates based on the ability to automatically generate and test thousands of variables from raw data using a proprietary AI based ‘deep feature’ engine.

  1. Blending: yes.
  2. Cleanse:  yes
  3. Impute and Transform:  yes
  4. Select ML Algorithms to be utilized:  yes – exact methods included not specified.
  5. Run Algorithms in Parallel: yes
  6. Adjust Algorithm Tuning Parameters during model development: yes
  7. Select and deploy:  User selects.  Believed to be via API.


ForecastThis, Inc.

Target:  Data Scientist.  The DSX platform is designed to make the data scientist more efficient by automating model building including many advanced algorithms and ensemble strategies. For modeling only, not data prep.

  1. Blending: no.
  2. Cleanse:  no
  3. Impute and Transform:  no
  4. Select ML Algorithms to be utilized:  A library of deployable algorithms, including Deep Neural Networks, Evolutionary Algorithms, Heterogeneous Ensembles, Natural Language Processing and many proprietary algorithms.
  5. Run Algorithms in Parallel: yes
  6. Adjust Algorithm Tuning Parameters during model development: yes
  7. Select and deploy:  User selects.  R, Python and Matlab plus API.


Since ForecastThis is a modeling platform only, with no prep or discovery capabilities, it’s worth noting that there are one-click data prep platforms out there.  One with a particularly good pedigree is Wolfram Mathematica.  Wolfram makes a real Swiss army knife of a data science platform, and while its ML capabilities are not one click, it claims to automatically preprocess data, including missing-values imputation, normalization, and feature selection, with built-in machine learning capabilities.
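To make that claim concrete, here is what automatic preprocessing (missing-values imputation, normalization, and feature selection) typically looks like, sketched with scikit-learn as a stand-in; this is our illustration, not Wolfram's implementation:

```python
# Typical automated preprocessing chain: impute, normalize, select features.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A tiny "dirty" dataset with one missing value.
X = np.array([[1.0, 200.0, 0.1],
              [2.0, np.nan, 0.2],
              [3.0, 180.0, 0.1],
              [4.0, 190.0, 0.3]])
y = np.array([0, 0, 1, 1])

prep = make_pipeline(
    SimpleImputer(strategy="mean"),   # missing-values imputation
    StandardScaler(),                 # normalization
    SelectKBest(f_classif, k=2),      # feature selection
)
Xt = prep.fit_transform(X, y)
print(Xt.shape)  # (4, 2)
```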

Next time, more about whether you should be comfortable adopting any of these and what the implications might be for the profession of data science.


About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]


Guest blog post by Alessandro Piva

The proliferation of data and the huge potential for companies to turn data into valuable insights are driving ever greater demand for Data Scientists.

But what skills and educational background must a Data Scientist have? What is their role within the organization? What tools and programming languages do they mostly use? These are some of the questions that the Observatory for Big Data Analytics of Politecnico di Milano is investigating through an international survey of Data Scientists: if you work with data in your company, please support our research by completing this totally anonymous survey.

Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if it is not the most relevant in terms of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, which has involved more than 200 Data Scientists worldwide to date, there isn’t a single prevailing choice among the programming languages used in data science activities. However, the choice appears to fall mainly on a limited set of alternatives: almost 96% of respondents report using at least one of R, SQL or Python.

In particular, at the top of the ranking in the current sample we find R, used by 53% of Data Scientists and supported by the R Foundation for Statistical Computing. Initially widespread mainly among statisticians and in academic environments, R has seen its use in data science grow substantially in recent years. Today it’s one of the most popular open source languages, supported by a large and helpful community.

Even though it was developed in the early 1970s, SQL still plays a key role today (second in the ranking with 49% of preferences). Although SQL is not designed for handling unstructured datasets (typical of Big Data), there is still a strong need to analyse structured data in organizations, and SQL is a very popular choice for the data-crunching stage.

In third position is Python (43%), which has become very popular in recent years because of its flexibility and relative ease of learning. Like R, it has a large community dedicated to improving the language and developing specific, focused packages.

The top five is rounded out by Unix Shell/AWK/Gawk (15%) and Java (8%).

If you are a Data Scientist and want to receive the main and final findings of the research in more detail, complete the questionnaire and leave us your email so we can send you the material.


25 Predictions About The Future Of Big Data

Guest blog post by Robert J. Abate.

In the past, I have published on the value of information, big data, advanced analytics and the Abate Information Triangle and have recently been asked to give my humble opinion on the future of Big Data.

I have been fortunate to have been on three panels recently at industry conferences which discussed this very question with such industry thought leaders as: Bill Franks (CTO, Teradata), Louis DiModugno (CDAO, AXA US), Zhongcai Zhang, (CAO, NY Community Bank), Dewey Murdick, (CAO, Department Of Homeland Security), Dr. Pamela Bonifay Peele (CAO, UPMC Insurance Services), Dr. Len Usvyat (VP Integrated Care Analytics, FMCNA), Jeffrey Bohn (Chief Science Officer, State Street), Kenneth Viciana (Business Analytics Leader, Equifax) and others.

Each brought their unique perspective to the challenges of Big Data and their insights into their “premonitions” as to the future of the field. I would like to summarize their thoughts, adding some color to the discussion.

Recent Article By Bernard Marr

If you haven’t had the opportunity, I believe that a recent article published by Bernard Marr entitled: 17 Predictions About Big Data was a great start (original version posted here). Many of the industry thought leaders that I mentioned above had hit on these points.

What Was Missing…

I agree with all of Bernard’s listing but I believe that he missed some predictions that the industry has called out. I would like to add the following:

18. Data Governance and Stewardship around Master Data and Reference Data is rapidly becoming the key area where focus is required as data volumes and in turn insights grow.

19. Data Visualization is the key to understanding the overwhelming V’s of Big Data (IBM data scientists break big data into four dimensions: volume, variety, velocity and veracity) and in turn the advanced analytics and is an area where much progress is being made with new toolsets.

20. Data Fabrics will become the key delivery mechanism to the enterprise by providing a “single source of the truth” with regard to the right data source. Today the enterprise is full of “spreadmarts” where people get their “trusted information” and this will have to change.

21. More than one human sensory input source (multiple screens, 3D, sound, etc.) is required to truly capture the information that is being conveyed by big data today. The human mind has so many ways to compare information sources that it requires more feeds today in order to find correlations and find clusters of knowledge.

22. Empowerment of business partners is the key to getting information into the hands of decision makers, and self-service cleansed and governed data sources and visualization toolsets (such as those provided by Tableau, QlikView, etc.) will become the norm of delivery. We have to provide a "single source of the truth" and eliminate the pervasive sharing of information from untrusted sources.

23. Considering Moore's Law (our computing power is increasing rapidly), and that the technology to look through vast quantities of data improves with each passing year, our analytical capabilities, and in turn insights, are starting to grow exponentially and will soon push organizations to become more data-driven and less "business instinct" driven.

24. Data is going to become the next global currency (late addition) and is already being globally monetized by corporations.

25. Data toolsets will become more widely used by corporations to discover, profile, and govern data assets within the confines of a data fabric or marketplace. Toolsets will include the management of metadata and automatic classification of assets and liabilities (i.e.: Global IDs, etc.).

The Four V’s Of Big Data

IBM uses a slide that discusses the myriad challenges of Big Data; it is mostly self-explanatory and hits many of the points mentioned in Bernard’s article. The Four V's Of Big Data.

What this infographic exemplifies is that there is a barrage of data coming at businesses today, and this has changed the information landscape for good.  No longer are enterprises (or even small businesses, for that matter) living with mostly internal data; the shift has happened, and data is now primarily coming from external sources, at a pace that would make any organization's head spin.

Today's Best Practice “Data Insights Process”

Today, external data sources (SFDC, POS, market share, consumer demographics, psychographics, census data, CDC, Bureau of Labor, etc.) provide much more than half of the information coming into the enterprise, with the norm being to create value in weeks. How is this done, you may ask? Let’s call this the Data Insights process. The best practice today has turned the development of business intelligence solutions upside down; the process is:

  • Identify a number of disparate data sources of interest to start the investigation
  • Connect them together (data integration using common keys)
  • Cleanse the data (as Data Governance has not been applied) creating your own master and reference data
  • Learn about what the data is saying and visualize it (what insight or trend has been uncovered)
  • Create a model that gives you answers
  • Formalize data source (cleanse and publish) to the myriad of enterprise data consumers with governance (if applicable)
  • Use the answers to change your business
  • Repeat (adding new sources, creating new models, etc.)

This process utilizes data experts to find data sources of value (1 to 2 weeks); quickly connects them together and scans to determine suitability, eliminating information that is incomplete or lacks value or connection to other sources (integrating and cleansing takes about 2 weeks); visualizes what value these sources provide using data visualization toolsets, finding interesting value statements or features of the data to pursue, like store clustering and customer segmentation (1 to 2 weeks); develops a model or advanced analytic to see what your value statement found, using a Data Scientist (2 weeks); and then presents to the business to determine next steps. The whole process happens in about 6-8 weeks and usually creates the "interest" in the business to invest in developing a data warehouse or BI solution.
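The core of the Data Insights loop, connect sources on a common key, cleanse, then model, can be sketched in a few lines.  This is a hedged illustration in pandas and scikit-learn; the sources, the `store_id` key, and all column names are invented for the example:

```python
# Sketch of the Data Insights loop: integrate, cleanse, model.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Two disparate sources of interest (step 1), e.g. POS and demographics.
pos = pd.DataFrame({"store_id": [1, 2, 3, 4],
                    "weekly_sales": [120.0, 95.0, None, 210.0]})
demo = pd.DataFrame({"store_id": [1, 2, 3, 4],
                     "median_income": [54, 48, 61, 75]})

# Connect them together using a common key (step 2).
df = pos.merge(demo, on="store_id")

# Cleanse: impute the missing sales figure, since no governance was applied
# upstream (step 3).
df["weekly_sales"] = df["weekly_sales"].fillna(df["weekly_sales"].mean())

# A first model that "gives you answers" (step 5).
model = LinearRegression().fit(df[["median_income"]], df["weekly_sales"])
print(round(float(model.coef_[0]), 2))
```

In practice each step is iterated with new sources and new models, which is exactly the "Repeat" at the end of the list above.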

Yes, the new process is completely reusable – as what is learned can be turned into a data source (governed data store or warehouse which is part of a data fabric) for future usage in BI and in turn for self-service; but what is important is that we now go from data to insights in weeks rather than months, and it forms the foundation for our business requirements – yes, I said that.

The long-term investment of a BI solution (often six months or more) is proven rapidly, and then the formal process of capturing the business requirements and rules (transformations in ETL language can be taken from rapid prototyping tools like Alteryx) has a head start, typically with the added advantage of cutting the BI process down to 3-4 months.

Recent Advances In Data Engineering

We can thank recent technological advancements for the changes in delivery of information with the advent of a number of toolsets providing self-service to tech-savvy business partners.

The recent tech and analytics advances in the past decade include but are not limited to:

  • Massively parallel processing data platforms
  • Advanced in-database analytical functions
  • Analytics on unstructured data sources (Hadoop, MapReduce)
  • Data visualizations across multiple mediums and devices
  • Linking structured and unstructured data using semantics and linking
  • Self-service BI toolsets and models of delivery
  • Data discovery, profiling, matching, ELT and data enrichment capabilities
  • Self-provisioning of analytics sandboxes enabling collaboration

But there is still a need for managing the information and this process is not going away. I will elaborate further in the paragraph below.

The Need For Enterprise Information Management

The myriad of data sources is changing the way we as business intelligence and analytics experts behave and likewise it has created a demand for data management and governance (with Master data and in turn Reference data) – so this element was added to the predictions. It's a very important piece of the puzzle and should not be overlooked or downplayed. It was even added to my latest information triangle (see my Linked-In page).

The role of enterprise data management in IT has been evolving from “A Single Source of Truth” into becoming “The Information Assurance Flexible Delivery Mechanism”. Back in March of 2008 I published at the DAMA International Symposium the needs for a flexible information delivery environment including:

  • Metadata management for compliance enforcement, audit support, analysis, and reporting
  • Master data integration and control
  • Near-real time business information
  • Source data management for controlling data quality at the transaction level
  • Effective governance for a successful managed data environment
  • Integration of analytics, reporting, and transaction control
  • Control of business processes and information usage

A flexible structure is just as important today as business needs are changing at an accelerating pace and it allows IT to be responsive in meeting new business requirements, hence the need for an information architecture for ingestion, storage, and consumption of data sources.

The Need For Knowing Where Your Data Is Coming From (And Going To)

One of the challenges facing enterprises today is that they have an ERP (like SAP, Oracle, etc.), internal data sources, external data sources and what ends up happening is that “spread-marts” (commonly referred to as Excel Spreadsheets) start proliferating data. Different resources download data from differing (and sometimes the same) sources creating dissimilar answers to the same question. This proliferation of data within the enterprise utilizes precious storage that is already overflowing - causing duplication and wasted resources without standardized or common business rules.

Not to mention that these end up being passed around as inputs to others’ work, without knowledge of the data lineage. This is where many organizations are today: many disparate data sets with little to no knowledge of whether a given one is a "trusted" data source.

Enterprise Data Fabric (or Data Marketplace)

An enterprise data fabric or marketplace (I've used both terms) is one location that everyone in the enterprise can go to get their data, providing quality, semantic consistency and security. This can be accomplished with data lakes, data virtualization or a number of integration technologies (like APIs, services, etc.). The point is to give the enterprise a common point of access for data that has been cleansed and is ready for use with master data. Here are a couple of reasons why you should consider this approach:

  • Business mandate to obtain more value out of the data (get answers)
  • Need to adapt and become agile to information and industry-wide changes
  • Variety of sources, amount and granularity of data that customers want to integrate is growing exponentially
  • Need to shrink the latency between the business event and the data availability for analysis and decision-making

Summation – Data Is The New Global Currency

In summation, consider that information is increasingly produced outside the enterprise, combined with information across a set of partners, and consumed by ever more participants. Data is the new global currency of the information age, and we all pass around currency, so let’s get cracking at delivering this to our enterprise (or it will go elsewhere to find it).

To the point, Big Data is an old term, and the new one is “Smart Data” if you ask me.

I would welcome any comments or input into the above and have posted this including pictures on my linked in page - let's start a dialog around best-practices in today's information age...

Robert J. Abate, CBIP, CDMP

About the Author

Credited as one of the first to publish on Services Oriented Architecture and the Abate Information Triangle, Robert is a respected IT thought leader. He is the author of the Big Data & Analytics Chapter for DAMA’s DMBoK Publication and on the technology advisory board for Nielson. He was on the governing body for ‘15 CDO Forums and an expert panelist at 2016 CAO Forum / Big Data Innovation Summit.

[email protected]


Where & Why Do You Keep Big Data & Hadoop?

Guest blog post by Manish Bhoge

I am back! Yes, I am back on my learning track. Sometimes it is really necessary to take a break and introspect on why we learn, before learning.  Ah! It was a 9-month safe refuge to learn how Big Data & Analytics can contribute to a Data Product.


Data strategy has always been expected to generate revenue. As big data and Hadoop enter the enterprise data strategy, the big data infrastructure is likewise expected to add revenue. This is a tough expectation for the new entrant (Hadoop) when the established candidate (Data Warehouse & BI) itself mostly struggles for its existence. So it is very pertinent for solution architects to ask WHERE and WHY to bring big data (obviously Hadoop) into the data strategy. And the safe option for this new entrant should be the place where it supports and strengthens the EXISTING data analysis strategy. Yeah! That’s the DATA LAKE.

I hope you have by now understood the 3 Ws (What: Data Lake, Who: Solution Architect, Where: Enterprise Data Strategy) of the Five Ws of information gathering. Now look at the diagram depicting WHERE and WHY.

Precisely, there are 3 major areas of opportunity for the new entrant (Hadoop):

  1. Semi-structured and/or unstructured data ingestion.
  2. Push down bleeding data integration problems to Hadoop Engine.
  3. Business need to build comprehensive analytical data stores.
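Opportunity 1 above rests on schema-on-read: parse semi-structured records at ingestion time and tolerate ragged ones, instead of forcing a schema up front as a warehouse would. A minimal sketch in plain Python (the event format and field names are invented for illustration):

```python
# Schema-on-read ingestion of semi-structured JSON events into a landing zone.
import json

raw_events = [
    '{"device": "sensor-1", "temp": 21.5}',
    '{"device": "sensor-2"}',  # ragged record: no temp field
]

# Parse at read time; missing fields become None instead of load failures.
parsed = [json.loads(e) for e in raw_events]
temps = [r.get("temp") for r in parsed]
print(temps)  # [21.5, None]
```

At data-lake scale the same pattern runs in Hadoop or Spark jobs, but the principle, defer the schema until analysis, is identical.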

The absence of any one of these 3 needs would weaken the case for Hadoop to enter the existing enterprise strategy. And this data lake approach is believed to align with business analysis outcomes without much disruption, so it will also create a comfortable path into the enterprise. We can further dig into Data Lake architecture and implementation strategy in detail.

Moreover, there are a lot of other supporting systems brewing in parallel with the Hadoop ecosystem, such as Apache Kylin. The opportunities on the data lake are immense.

Read the original blog on: DatumEngineering


Top 30 people in Big Data and Analytics

Originally posted on Data Science Central

Innovation Enterprise has compiled a top 30 list of individuals in big data who have had a large impact on the development or popularity of the industry.

Unlike other lists, this one is not based only on Twitter or social media, but also on direct contributions to the industry, and it focuses on those who played important parts in its growth and sustained popularity.
  1. Doug Cutting and Mike Cafarella, for creating Hadoop
  2. Sergey Brin and Larry Page, founders of Google
  3. Edward Snowden, NSA Whistleblower
  4. Rob Bearden, founder of Hortonworks
  5. Kirk D. Borne, professor and co-creator of the field of astroinformatics
  6. Stephen Wolfram, creator of Mathematica and Wolfram Alpha
  7. Rich Miner, co-founder of Android and a pioneer in the mobile space.
  8. Jamie Miller, CIO at GE
  9. DJ Patil, a data science pioneer, coined the term "data scientist" with Jeff Hammerbacher
  10. Monica Rogati, VP of Data at Jawbone
  11. Jeff Smith, CIO at IBM
  12. Jeff Bezos, founder and CEO of Amazon
  13. Andy Palmer, co-founder and CEO of TamR
  14. Gregory Piatetsky-Shapiro, co-founder of KDD and SIGKDD, KDnuggets President
  15. Vincent Granville, co-founder of DSC
  16. Sverre Jarp, ex-CTO at CERN openlab
  17. Tom Reilly, CEO at Cloudera
  18. Tom Davenport, thought leader and author in analytics and business process innovation
  19. John Schroeder and M. C. Srivas, co-founders of MapR
  20. Scott Howe, President and CEO at Acxiom
  21. Hilary Mason, was the Chief Scientist at Bitly, founder at Fast Forward Labs
  22. Edwina Dunn and Clive Humby, founders of Dunnhumby
  23. Anmol Modan, co-founder and CEO at
  24. Chris Towers, head of big data channel at Innovation Enterprise
  25. Billy Beane, baseball executive who inspired Moneyball
  26. Tim O’Reilly, owner of O'Reilly Media
  27. Vadim Kutsyy, head of Inc Data Lab at eBay
  28. Warren Buffett, renowned investor
  29. Arijit Sengupta, CEO at BeyondCore
  30. Paco Nathan, well-known data blogger




Associative Data Modeling Demystified - Part2

Guest blog post by Athanassios Hatzis

Association in Topic Map Data Model


In the previous article of this series we examined the association construct from the perspective of the Entity-Relationship data model. In this post we demonstrate how the Topic Map data model represents associations. To link the two, we continue with another SQL query against our relational database:

-- Reconstructed from a garbled listing: the selected columns, the parts
-- table, and the pid = 998 filter are assumed from the surrounding text.
SELECT suppliers.sid, suppliers.sname, catalog.catcost
FROM (parts
INNER JOIN [catalog]
ON parts.pid = catalog.catpid)
INNER JOIN suppliers
ON suppliers.sid = catalog.catsid
WHERE (((parts.pid) = 998))
ORDER BY catalog.catcost;

This will fetch all the rows of a result set in which we are looking for the minimum catalogue price of a Red Fire Hydrant Cap and for the supplier that manufactures this part. The reader will notice that, apart from the deficiency of the nested JOINs (see here), we had to formalize our search in the SQL language to get our result back. Wouldn’t it be nice if we could engage the user in a codeless style of search, independent of the business case? Let us see the difference with the Topic Map data model first.

Associations in Topic Map Data Model

Perhaps there is no better software tool out there to introduce you to Topic Maps than the Wandora information management application; see how.

Tuples to Associations

Our first step is to build a Topic Map data model from the SQL result set above. With Wandora this is easy thanks to its powerful set of extractors. Here we use an Excel adjacency list extractor to convert each spreadsheet row of this Excel file to a Topic Map Association.

Tuples of a Relation - Wandora Associations

In the right panel of the screen capture, you can see that we have four associations of type Tuple. They are all sorted by the catcost column; this is the role that cells of this column play in the Tuple association. In our example each Tuple is an instance of the Excel class with a maximum of 8 members, and each member plays a role in the association. You may agree that this Topic Map view of the data already looks very familiar to a user accustomed to tables.

But behind the scenes Topic Map associations are notably different from the n-ary tuples of the relational model. In the left panel of our screen capture you can see all the data that are extracted from the spreadsheet. Notice that no data value is repeated. Each association is constructed from single instance values and this also means that associations are sharing values among them. We can visualize the network of associations by switching to Wandora’s Graph topic panel. From the left panel, we select the minimum price of the part, which is 11.7 and then we expand this node on the Graph topic panel. This way the first association will be drawn that includes as members all the other values that this cell is associated with. One of them is USA and plays the scountry role. We can right click on the value and expand again the nodes (associated members). Two associations are displayed now on the graph that share four common values between them.

Two Associations in the Graph Topic Panel of Wandora. Brown labels indicate the type (Tuple) of association and the role (sname) of one of its members

SQL to Topic Map Filtering

Another important observation we should make at this point is that instead of writing a query to fetch the suppliers located in the USA, we simply filtered the table on this value. We can do this because the Topic Map data model works with single-instance values that are linked bidirectionally. Accordingly, the data is always normalized, and the main operations of relational algebra, such as set operations, selection and projection, can be performed. For instance, filtering associations that have USA as a member is equivalent to selecting rows in SQL. Moreover, the user can interactively traverse the graph starting from any value without writing a single line of code.
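A toy sketch of this filtering idea in Python, with associations represented as role-to-value mappings; the role names follow the article (scountry, sname, catcost) but the supplier values are made up for illustration:

```python
# Sketch: Topic Map-style associations as {role: value} mappings.
# Supplier names and prices here are illustrative, not from the Wandora file.
associations = [
    {"sid": "s1", "sname": "Acme", "scountry": "USA", "catcost": 11.7},
    {"sid": "s2", "sname": "Bolt", "scountry": "USA", "catcost": 15.3},
    {"sid": "s3", "sname": "Pipe", "scountry": "France", "catcost": 12.5},
]

# Filtering on a member value plays the part of an SQL WHERE clause:
usa_suppliers = [a for a in associations if a.get("scountry") == "USA"]
print([a["sname"] for a in usa_suppliers])
```

Selecting the associations whose scountry member is USA returns the same suppliers a SELECT … WHERE would, without writing a query.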

Topic Map Serialization

To better understand the underlying structure of data in the previous example, we have serialized a Topic Map in LTM format. Dropping this LTM file into an empty topic panel invokes Wandora's import function. Then we expand the topic tree and double-click on the 998 cell. The following screen capture looks pretty much the same as the one we generated by extracting the Excel spreadsheet above. The main difference is that now we have two association types, one for Catalogue tuples and another for Part tuples. Part 998 participates in five associations (tuples) in total: four from the Catalogue table and one from the Part table. We have also kept a minimum number of members, i.e. fields (columns), in our associations to keep it simple.

/* 1 Association of catalog part no 998 with "Red" and "Fire Hydrant Cap" */ 

Prt( prtName08:pname, prtID08:pid, prtColorRed:pcolor )

/* 4 Associations of catalog part no 998 with supplier Ids and catalog prices */

Cat( prtID08:catpid, supPrice18:catcost, supID18:catsid )

Cat( prtID08:catpid, supPrice14:catcost, supID14:catsid )

Cat( prtID08:catpid, supPrice16:catcost, supID16:catsid )

Cat( prtID08:catpid, supPrice12:catcost, supID12:catsid )


Associations of part no. 998

Because of the single-instance feature of Topic Maps, if we switch to Wandora's Graph topic panel we can visualize these associations.

Associations of part no. 998
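The five LTM associations listed above can also be modeled as typed, role-labeled tuples in a few lines of Python; the member identifiers follow the LTM listing, while the (type, {role: member}) structure itself is an illustrative choice:

```python
# The LTM associations of part no. 998 as (type, {role: member}) pairs.
associations = [
    ("Prt", {"pname": "prtName08", "pid": "prtID08", "pcolor": "prtColorRed"}),
    ("Cat", {"catpid": "prtID08", "catcost": "supPrice18", "catsid": "supID18"}),
    ("Cat", {"catpid": "prtID08", "catcost": "supPrice14", "catsid": "supID14"}),
    ("Cat", {"catpid": "prtID08", "catcost": "supPrice16", "catsid": "supID16"}),
    ("Cat", {"catpid": "prtID08", "catcost": "supPrice12", "catsid": "supID12"}),
]

# Part prtID08 participates in five associations: one Prt and four Cat.
part_assocs = [(t, m) for t, m in associations if "prtID08" in m.values()]
by_type = {}
for t, _ in part_assocs:
    by_type[t] = by_type.get(t, 0) + 1
print(len(part_assocs), by_type)
```

Because prtID08 is a single shared value, counting the associations it appears in recovers the "five associations in total" observation directly.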

R3DM Type System in Wandora

We expand our previous example with tuples from three tables and a rich type system.

Continue reading full article here

Read more…

Associative Data Modeling Demystified - Part1

Guest blog post by Athanassios Hatzis

Relation, Relationship and Association

While most players in the IT sector adopted Graph or Document databases and Hadoop-based solutions (Hadoop being an enabler of the HBase column store), it went almost unnoticed that several new DBMSs based on associative technology appeared on the scene: AtomicDB, the previous database engine of X10SYS, and Sentences. We have introduced and discussed the data modelling architecture and the atomic information resource unit (AIR) of AtomicDB. Similar technology has powered Qlikview, a very popular software package in Business Intelligence and Analytics, since 1993. Perhaps it is less known to the reader that the association construct is a first-class citizen in the Topic Map semantic web standard, and that it can be translated to RDF, the other semantic web standard. In other posts of this series we will see how it is possible to implement associative technology in multi-model graph databases such as OrientDB, in object-relational DBMSs such as Intersystems Cache and Oracle, or to build an engine for in-memory processing with Wolfram Mathematica. In this article, we introduce the concept of association from the perspective of the Entity-Relationship (ER) data model and illustrate it with the modelling of a toy dataset.


In this article we described several limitations of the ER model that we wish to overcome; in brief, these are:

  • Functional dependence of values
  • Data redundancy
  • Join operations

In the next article of our series we continue with an international industry standard for information management and interchange, the Topic Maps Data Model (TMDM). Associations in TMDM are similar to tuples, but they have types. Each member of an association plays a role that is defined explicitly. In fact this is in full agreement with Chen's Entity-Relationship diagrams (see Fig. 1 and Fig. 2 above). Chen discusses the role of an entity in a relationship and the role of an attribute in a relation, and he considers distinct constraints on allowable values for a value set and constraints on permitted values for a certain attribute.

The TMDM view is edifying because it divides the information space into two layers. At the conceptual level we have topics that can be associated and that represent any subject conceivable by a human being. At the occurrence level we have addressable information resources that describe those subjects.

Read full article here. 

Read more…

Originally posted on Data Science Central

Recently, in a previous post, we reviewed a path to leverage legacy Excel data by importing CSV files through MySQL into Spark 2.0.1. This applies frequently in businesses where data retention did not always take the database route. However, we demonstrate here that the same result can be achieved in a more direct fashion. We'll illustrate this on the same platform we used last time (Ubuntu 16.04.1 LTS running in a Windows VirtualBox, with Hadoop 2.7.2 and Spark 2.0.1) and on the same dataset (my legacy model collection, Unbuilt.CSV). Our objective is to show how to migrate data to Hadoop HDFS and analyze it directly and interactively using the latest ML tools with PySpark 2.0.1 in a Jupyter notebook.

A number of interesting facts can be deduced by combining subsetting, filtering and aggregating of this data, and they are documented in the notebook. For example, with a one-liner we can rank the most popular scale, the most numerous model and the most valued items, tally models by category, and generally process this legacy data with modern ML tools. Clustering is obtained just as easily, on scaled data, as illustrated here.
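The notebook itself isn't reproduced here, but the flavor of those one-liners can be sketched with the standard library; the column names and values below are made up for illustration, not taken from Unbuilt.CSV:

```python
import csv, io
from collections import Counter

# Illustrative stand-in for a legacy model-collection CSV.
raw = """scale,category,value
1/72,aircraft,25
1/72,aircraft,40
1/35,armor,30
1/72,ships,15
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# One-liners of the subsetting/aggregating kind described above:
most_popular_scale = Counter(r["scale"] for r in rows).most_common(1)[0][0]
total_by_category = {c: sum(int(r["value"]) for r in rows if r["category"] == c)
                     for c in {r["category"] for r in rows}}
print(most_popular_scale, total_by_category)
```

With Spark the same expressions scale out over HDFS partitions, but the logic reads much the same.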

This, again, should serve the purpose of demonstrating direct migration of legacy data. We reviewed how to access the data using PySpark from a Jupyter notebook and how to leverage the interactive interface provided by Toree with Spark 2.0.1.

A Jupyter notebook is also provided to help along your migration.

We can restate that there's really no need to abandon legacy data: migrating data, directly or indirectly, to a new platform will enable businesses to extract and analyze that data on a broader time scale, and will open new ways to leverage ML techniques, analyze results and act on findings.


Read more…

Guest blog post by Marc Borowczak

Moving legacy data to a modern big data platform can be daunting at times. It doesn't have to be. In this short tutorial, we'll briefly review an approach and demonstrate it on my preferred data set: this isn't an ML repository nor a Kaggle competition data set, simply the data I accumulated over decades to keep track of my plastic model collection, and as such it definitely meets the legacy standard!

We'll describe the steps followed on a laptop VirtualBox machine running Ubuntu 16.04.1 LTS (Gnome). The following steps are required:

  1. Import the .csv file into MySQL, and optionally back up a compressed MySQL database file.
  2. Connect to the MySQL database from Spark 2.0.1 and access the data: we'll demonstrate an interactive Python approach using Jupyter and PySpark in this post, and leave RStudio sparklyr access based on existing methods for another post.
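As a minimal, dependency-free sketch of the import-then-query pattern behind these two steps, using the stdlib sqlite3 module as a stand-in for MySQL (table and column names are illustrative, not from the original setup):

```python
import csv, io, sqlite3

# Illustrative CSV content standing in for the legacy file.
raw = "name,qty\nkit_a,2\nkit_b,5\n"

# Step 1 analogue: load the CSV rows into a database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT, qty INTEGER)")
for row in csv.DictReader(io.StringIO(raw)):
    conn.execute("INSERT INTO models VALUES (?, ?)", (row["name"], int(row["qty"])))
conn.commit()

# Step 2 analogue: connect and query the imported data.
total = conn.execute("SELECT SUM(qty) FROM models").fetchone()[0]
print(total)  # 7
```

With MySQL and Spark, the same shape appears as a LOAD DATA import followed by a JDBC read, but the pattern is the same: land the CSV in a database, then query it from the analysis tool.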

There's really no need to abandon legacy data: migrating data to a new platform will enable businesses to extract and analyze data on a broader time scale, and will open new ways to leverage ML techniques, analyze results and act on findings.

Additional methods to import CSV data will be discussed in a forthcoming post.

Read more…

Fast Forward transformation with SPARK

Fast forward transformation process in data science with Apache Spark

Data Curation:

Curation is a critical process in data science that helps prepare data for feature extraction and for machine learning algorithms. Curation generally involves extracting, organising and integrating data from different sources, and it may be a difficult and time-consuming process depending on the complexity and volume of the data involved.

Most of the time data won't be readily available for the feature extraction process; it may be hidden in unstructured and complex data sources and have to undergo multiple transformation processes before feature extraction.

Also, when the volume of data is huge, this becomes very time-consuming and can be a bottleneck for the whole machine learning pipeline.

General Tools Used in Data Science:
  • R language – widely adopted in data science, with a lot of supporting libraries
  • MATLAB – commercial tool with a lot of built-in libraries for data science
  • Apache Spark – new, powerful and gaining traction; Spark on Hadoop provides a distributed and resilient architecture that can speed up the curation process many times over.
Recent Study

One of my projects involved curating and extracting features from a huge volume of natural language conversation text. We started with the R programming language for the transformation process; R is simple, with a lot of functionality in the statistics and data science space, but it has limitations in computation and memory, and in turn in efficiency and speed. We migrated the transformation process to Apache Spark and observed a tremendous improvement in performance: we were able to bring the transformation time down from more than a day to about an hour for a huge volume of data.

Here are some of the benefits of Apache Spark over R that I would like to highlight.

  • Effective Utilization of resources:

By default R runs on a single core and is limited by the capabilities of that core and by memory usage. Even on a multi-core system, R uses only one core; for memory, a 32-bit R process is limited to a virtual memory user space of 3 GB, and a 64-bit R process is limited by the amount of RAM. R has some parallel packages that can help span processing across multiple cores.

Spark can run in distributed form, with processing carried out by executors, each running in its own process and utilizing its own CPU and memory. Spark introduces the concept of the RDD (Resilient Distributed Dataset) to achieve a distributed, resilient and scalable processing solution.
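The partition-process-combine shape behind executors can be sketched in plain Python; note the caveat that a thread pool only mimics the shape on one machine, whereas Spark runs separate executor processes across a cluster:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy analogue of partitioned processing: split the data, let several
# workers process the partitions, then combine the partial results.
data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]

def process(partition):
    # Per-partition work, like a map task on one executor.
    return sum(x * x for x in partition)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, partitions))

result = sum(partials)  # combine step, like a reduce
print(result)
```

The combined result equals the single-core `sum(x * x for x in data)`; the gain comes from doing the per-partition work on separate workers.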

  • Optimized transformation:

Spark distinguishes transformations from actions: transformations are evaluated lazily, with no job executed until an action is called. This brings optimization when multiple transformations are chained before an action, since only the action transfers results back to the driver program.
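Python generators give a rough, single-machine analogue of this lazy behavior: chained "transformations" do nothing until an "action" pulls results through the pipeline.

```python
# Lazy pipeline: nothing below executes until the action at the end.
nums = range(1_000_000)
doubled = (x * 2 for x in nums)               # "transformation": not evaluated yet
evens = (x for x in doubled if x % 4 == 0)    # chained transformation, still lazy

total = sum(evens)                            # "action": the whole pipeline runs once
print(total)
```

As with Spark, chaining the two steps before the action means the data is traversed only once instead of materializing an intermediate result.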

  • Integration to Hadoop Eco System

Spark integrates well into the Hadoop ecosystem with the YARN architecture, and can easily bind to HDFS and to multiple NoSQL databases such as HBase, Cassandra, etc.

  • Support for multiple languages:

Spark's APIs support multiple programming languages, including Scala, Java and Python.

Originally posted on Data Science Central

Read more…

11 Great Hadoop, Spark and Map-Reduce Articles

This reference is a part of a new series of DSC articles, offering selected tutorials, references/resources, and interesting articles on subjects such as deep learning, machine learning, data science, deep data science, artificial intelligence, Internet of Things, algorithms, and related topics. It is designed for the busy reader who does not have a lot of time digging into long lists of advanced publications.

11 Great Hadoop, Spark and Map-Reduce Articles

Previous topics covered in this series

Top DSC Resources

Read more…

Java versus Python

Originally posted on Data Science Central

Interesting picture that went viral on Facebook. We've had plenty of discussions about Python versus R on DSC. This picture is trying to convince us that Python is superior to Java. It is about a tiny piece of code to draw a pyramid.

This raises several questions:

  • Is Java faster than Python? If yes, under what circumstances? And by how much?
  • Does the speed of an algorithm depend more on its design (quicksort versus naive sort) or on the architecture (Map-Reduce) than on the programming language used to code it?
  • For data science, does Python offer better libraries (including for visualization) that are easier to install than Java's? What about the learning curve?
  • Is Java more popular than Python for data science, mathematics, string processing, or NLP?
  • Is it better to write simple code (like the Java example above) or compact, hard-to-read code (like the Python example)? You can write Python code for this pyramid project that is much longer (like the Java example) but far easier to read, yet executes just as fast. The converse is also true. What are the advantages of writing compact, hard-to-read code?
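Since the picture itself isn't reproduced here, a representative sketch of the two styles in Python (not the exact code from the image) makes the trade-off concrete:

```python
n = 5  # pyramid height

# Compact, harder to read:
compact = "\n".join(" " * (n - i) + "*" * (2 * i - 1) for i in range(1, n + 1))
print(compact)

# Longer, easier to read -- builds the same lines explicitly:
lines = []
for i in range(1, n + 1):
    spaces = " " * (n - i)      # left padding shrinks as rows widen
    stars = "*" * (2 * i - 1)   # odd number of stars per row
    lines.append(spaces + stars)
print("\n".join(lines))
```

Both versions produce an identical pyramid and run in the same time; the difference is purely one of readability versus concision.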

Related article:


Read more…

Originally posted on Data Science Central

These are the findings from a CrowdFlower survey. Data preparation accounts for about 80% of the work of data scientists. Cleaning data is the least enjoyable and most time-consuming data science task, according to the survey. Interestingly, when we asked the question of our own data scientist, his answer was:

Automating the task of cleaning data is the most time consuming aspect of data science, though once done, it applies to most data sets; it is also the most enjoyable because as you automate more and more, it frees a lot of time to focus on other things.
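As a tiny illustration of the kind of reusable cleaning step that quote describes, a minimal sketch follows; the specific rules (lower-cased headers, stripped whitespace, dropped empty rows) are example choices, not from the survey:

```python
def clean_rows(rows):
    """Normalize header names, trim string values, drop empty rows."""
    cleaned = []
    for row in rows:
        row = {k.strip().lower(): (v.strip() if isinstance(v, str) else v)
               for k, v in row.items()}
        if any(v not in ("", None) for v in row.values()):  # drop all-empty rows
            cleaned.append(row)
    return cleaned

raw = [{" Name ": " Ada ", "AGE": "36"}, {" Name ": "", "AGE": ""}]
print(clean_rows(raw))  # [{'name': 'Ada', 'age': '36'}]
```

Once a step like this is automated it applies unchanged to most tabular data sets, which is exactly the pay-off the quote points to.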

Below are the three charts published in the Forbes article, regarding the survey in question. The one at the bottom lists the most frequent skills found in data scientist job ads.   


Read more…

Why Not So Hadoop?

Guest blog post by Kashif Saiyed

Does Big Data mean Hadoop? Not really; however, when one thinks of the term Big Data, the first thing that comes to mind is Hadoop, along with heaps of unstructured data: an exceptional lure for data scientists, who get the opportunity to work with large amounts of data to train their models, and for businesses, which gain knowledge previously never imagined. But has it lived up to the hype? In this article, we will look at a brief history of Hadoop and see how it stands today.

2015 Hype Cycle – Gartner


Some key takeaways from the Hype cycle of 2015:

  1. ‘Big Data’ was at the Trough of Disillusionment stage in 2014, but does not appear in the 2015 Hype Cycle at all.
  2. Another interesting point is that ‘Internet of Things’, which suggests a network of interconnected devices around us, has now been at the peak for 2 consecutive years.

Just to check on the relevance of the Hype Cycle while sitting in India, I checked the Google Trends data for the terms ‘Big Data’ and ‘Hadoop’; here are the results:


So there is definitely a fall after a point of inflection in 2014-2015.

Brief History of Hadoop

Here is an excerpt from a recently published article by Alexey Grishchenko earlier this year:

  • Hadoop was born of Google’s ideas and Yahoo’s technology, to accommodate the needs of the biggest internet companies for distributed compute and storage frameworks. 2003–2008 were the early ages of Hadoop, when almost no one knew what it was, why it was, and how to use it;
  • In 2008, a group of enthusiasts formed a company called Cloudera to occupy the market niche of “cloud” and “data” by building a commercial product on top of open source Hadoop. Later they abandoned the “cloud” and focused solely on “data”. In March 2009 they released their first Cloudera Hadoop Distribution. You can see this moment on the trends diagram immediately after the 2009 mark: the rise of the Hadoop trend. This was a huge marketing push related to the first commercial distribution;
  • From 2009 to 2011, Cloudera was the one trying to heat up the “Hadoop” market, but it was still too small to create a notable buzz around the technology. But the first adopters proved the value of the Hadoop platform, and additional players joined the race: MapR and Hortonworks. Early adopters among startups and internet companies started to play with this technology at this time;
  • 2012–2014 were the years “Big Data” became a buzzword, a “must have” thing. This was caused by the massive marketing push by the companies noted above, plus the companies supporting this industry in general. In 2012 alone, major tech companies spent over $15b buying companies doing data processing and analytics. The demand for “big data” solutions kept growing, and analyst publications were heating the market very hard. Early adopters among enterprises started to play with the promising new technology at this time;
  • 2014–2015 were the years “Big Data” approached the hype peak. Intel invested $760m in Cloudera, giving it a valuation of $4.1b; Hortonworks went public with a valuation of $1b. Major new data technologies emerged, like Apache Spark, Apache Flink, Apache Kafka and others. IBM invested $300m in Apache Spark technology. This was the peak of the hype. In these years massive adoption of “Big Data” in enterprises began, and architecture concepts such as “Data Lake” / “Data Hub” / “Lambda Architecture” emerged to simplify integration of modern solutions into conventional enterprise infrastructures.


  • 2016 and beyond – this is an interesting time for “Big Data”. Cloudera’s valuation has dropped by 38%. Hortonworks’s valuation has dropped by almost 40%, forcing them to cut their professional services department. Pivotal has abandoned its Hadoop distribution, going to market jointly with Hortonworks. What happened and why? I think the main driver of this decline is the enterprise customers that started adopting the technology in 2014–2015. After a couple of years playing around with “Big Data” they finally understood that Hadoop is only an instrument for solving specific problems; it is not a turnkey solution to take over your competitors by leveraging the holy power of “Big Data”. Moreover, you don’t need Hadoop if you don’t really have a problem of huge data volumes in your enterprise, so hundreds of enterprises were hugely disappointed by their useless 2-to-10TB Hadoop clusters – Hadoop just doesn’t shine at this scale. All of this caused a big wave of priority re-evaluation by enterprises, which shrank their investments in “Big Data” and focused on solving specific business problems.

Prospect Dampeners

Gartner conducted this survey with 284 companies, of which only 125 stated that they had already invested in Hadoop or were looking to invest within the next 2 years. The release can be found here:


  • 54% of respondents have no plans to invest at this time.
  • Only 26% are deploying or piloting Hadoop.
  • 11% plan to invest within 12 months.
  • 7% plan to invest within 24 months.

Possible Factors

Skills Gap

57% of respondents in the survey cited the skills gap as the major reason for not adopting, and 49% were still trying to figure out how to derive value. While Gartner estimates that it will take 2 years to find the right number of people with the needed skills, Hadoop distribution providers are working on creating more user-friendly and integrated modules and interfaces. However, these still do not seem to be friendly enough for the average user.


RoI / Priority

As organizations have to get busy figuring out ways to incorporate new processes and hire skilled individuals, Hadoop or Big Data deployment is taking a back seat. Traditional database providers have evolved their products, be it with in-memory or Massively Parallel Processing systems, which remain good enough to get the job done most of the time, or are even better than Hadoop for some applications. For instance, an experiment conducted at Airbnb compared a 16-node Amazon Redshift cluster with a 44-node Hive/Hadoop EMR cluster, and the SQL-based Redshift outperformed the EMR cluster. The study was done in 2013, and Hadoop has evolved since then, with Hive on YARN, Apache Impala and so on; however, this does not change the fact that Hadoop wasn't built as a database for performance-optimized structured data querying, unless the data is in petabytes, of course.

When the need arises to create something like a data mart, the lure of 'complicating' things with Hadoop is something to be debated. Keep in mind that most enterprises have teams who take ownership of social media and other such unstructured information on the digital medium, and there are numerous excellent solutions for digital data tracking, brand monitoring and the like in real time. The point is that Hadoop would be low priority in all such scenarios.


Another important question, which would need a large qualitative analysis: can Hadoop coexist peacefully with an existing data warehouse? If so, how? Offloading processes (even in part) would carry million-dollar price tags for the enterprise.

After that, the dilemma exists of hosting a cluster without building any legacy systems versus going cloud. The majority of Cloudera's customers host their own clusters, with a minority on the cloud. The major options here are: (a) hosting on premise / leasing a data center; (b) IaaS (Infrastructure as a Service) such as Amazon Web Services, Google Cloud Platform, etc.; and (c) the now-emerging Hadoop as a Service.

Hosting on premise is expensive, and cloud-based providers such as Amazon EMR require not only Hadoop skills but also an understanding of the Amazon version of the ecosystem, which is hard to find. Hadoop-as-a-Service providers are still fine-tuning their products and will take some time to mature for the enterprise.

Emergence of the Cloud

In the list above, the skills gap and RoI/priority are factors that will evolve over time. They are related to each other, and as organizations see value, they will either hire or train people in these very attainable skills. But what will be interesting to see in the next few years is the emergence of cloud-based Hadoop solutions that fine-tune integration for the enterprise.

SAP is acquiring Altiscale, as reported in "SAP reportedly buying Altiscale to power big data services". Altiscale was one of the early providers of Hadoop on the cloud.

Another interesting recent development is that Cloudera asked Intel for $1 billion to build a cloud service. The Hadoop distribution market leader is pushing toward the cloud to gain market share of Big Data workloads from the current leaders in the Hadoop-on-the-cloud space: Amazon AWS, IBM BigInsights, Google Cloud Platform, and Microsoft Azure HDInsight.

In summary, even as organizations realize the value of Big Data, factors such as the skills shortage and integration lead to slow adoption rates. Moreover, traditional database providers have evolved their services with Massively Parallel Processing, in-memory and columnar database solutions, which has delayed the realization of the value related to Hadoop. The emergence of cloud-based Hadoop service providers offers an alternative way for organizations to incorporate Hadoop clusters for Big Data workloads in the future.

Originally posted here.

Read more…

Originally posted on Data Science Central


Introducing Data Science teaches you how to accomplish the fundamental tasks that occupy data scientists. Using the Python language and common Python libraries, you'll experience firsthand the challenges of dealing with data at scale and gain a solid foundation in data science.

About the Technology

Many companies need developers with data science skills to work on projects ranging from social media marketing to machine learning. Discovering what you need to learn to begin a career as a data scientist can seem bewildering. This book is designed to help you get started.

About the Book

Introducing Data Science explains vital data science concepts and teaches you how to accomplish the fundamental tasks that occupy data scientists. You’ll explore data visualization, graph databases, the use of NoSQL, and the data science process. You’ll use the Python language and common Python libraries as you experience firsthand the challenges of dealing with data at scale. Discover how Python allows you to gain insights from data sets so big that they need to be stored on multiple machines, or from data moving so quickly that no single machine can handle it. This book gives you hands-on experience with the most popular Python data science libraries, Scikit-learn and StatsModels. After reading this book, you’ll have the solid foundation you need to start a career in data science.

What’s Inside

  • Handling large data
  • Introduction to machine learning
  • Using Python to work with data
  • Writing data science algorithms

About the Reader

This book assumes you're comfortable reading code in Python or a similar language, such as C, Ruby, or JavaScript. No prior experience with data science is required.

About the Authors

Davy Cielen, Arno D. B. Meysman, and Mohamed Ali are the founders and managing partners of Optimately and Maiton, where they focus on developing data science projects and solutions in various sectors.

Table of Contents

  1. Data science in a big data world
  2. The data science process
  3. Machine learning
  4. Handling large data on a single computer
  5. First steps in big data
  6. Join the NoSQL movement
  7. The rise of graph databases
  8. Text mining and text analytics
  9. Data visualization to the end user

The book is available, here.


Read more…

Originally posted on Data Science Central

Summary:  This is the first in a series of articles aimed at providing a complete foundation and broad understanding of the technical issues surrounding an IoT or streaming system so that the reader can make intelligent decisions and ask informed questions when planning their IoT system. 

In This Article

  • Is it IoT or Streaming
  • Basics of IoT Architecture – Open Source
  • Data Capture – Open Source with Options
  • Storage – Open Source with Options
  • Query – Open Source with Options

In Lesson 2

  • Stream Processing – Open Source
  • What Can Stream Processors Do
  • Open Source Options for Stream Processors
  • Spark Streaming and Storm
  • Lambda Architecture – Speed plus Safety
  • Do You Really Need a Stream Processor

In Lesson 3

  • Three Data Handling Paradigms – Spark versus Storm
  • Streaming and Real Time Analytics
  • Beyond Open Source for Streaming
  • Competitors to Consider
  • Trends to Watch
  • Four Applications of Sensor Data

In talking to clients and prospects who are at the beginning of their IoT streaming projects it’s clear that there’s a lot of misunderstanding and gaps in their knowledge.  You can find hundreds of articles on IoT and inevitably they focus on some portion of the whole without an overall context or foundation.  This is understandable since the topic is big and far ranging not to mention changing fast. 

So our intent is to provide a broad foundation for folks who are starting to think about streaming and IoT.  We’ll start with the basics and move up through some of the more advanced topics, hopefully leaving you with enough information to then begin to start designing the details of your project or at least equipped to ask the right questions.

Since this is a large topic, we’ll spread it out over several articles with the goal of starting with the basics and adding detail in logical building blocks.


Is It IoT or Is It Streaming?

The very first thing we need to clear up for beginners is the nomenclature.  You will see the terms “IoT” and “Streaming” used to mean different things as well as parts of the same thing.  Here’s the core of the difference: if the signal derives from sensors, it’s IoT (Internet of Things).  The problem is that there are plenty of situations where the signal doesn’t come from sensors but is handled in essentially the same way.  Web logs, click streams, streams of text from social media, and streams of stock prices are examples of non-sensor streams that are therefore not “IoT”.

What they share, however, is that all are data-in-motion streams.  Streaming is really the core concept, and we could just as easily have called this “Event Stream Processing”, except that focusing on streaming leaves out several core elements of the architecture, such as how we capture the signal, store the data, and query it.

In terms of the architecture, the streaming part is only one of the four main elements we’ll discuss here.  Later we’ll also talk about the fact that although the data may be streaming, you may not need to process it as a stream depending on what you think of as real time.  It’s a little confusing but we promise to clear that up below.

The architecture needed to handle all types of streaming data is essentially the same regardless of whether the source is specifically a sensor or not, so throughout we’re going to refer to this as “IoT Architecture”.  And since this is going to be a discussion that focuses on architecture, if you’re still unclear about streaming in general you might start with these overviews: “Stream Processing – What Is It and Who Needs It” and “Stream Processing and Streaming Analytics – How It Works”.


Basics of IoT Architecture – Open Source

Open source in Big Data has become a huge driver of innovation.  So much so that probably 80% of the information available online deals with some element or package for data handling that is open source.  Open source is also almost completely synonymous with the Apache Software Foundation.  So to understand the basics of IoT architecture we’re going to start by focusing on open source tools and packages.

If you’re at all familiar with IoT you cannot have avoided learning something about Spark and Storm, two of the primary Apache open source streaming projects, but these are only part of the overall architecture.  Later in this series we’ll also turn our attention to the emerging proprietary, non-open-source options and why you may want to consider them.

Your IoT architecture will consist of four components: Data Capture, Stream Processing, Storage, and Query.  Depending on the specific packages you choose some of these may be combined but for this open source discussion we’ll assume they’re separate.


Data Capture – Open Source

Think of the Data Capture component as the catcher’s mitt for all your incoming sources, whether sensors, web streams, text, images, or social media.  The Data Capture application needs to:

  1. Capture all your data as fast as it arrives from all sources at the same time.  In digital advertising bidding, for example, this can easily be 1 million events per second.  There are applications where the rate is even higher, but it’s unlikely that yours will be.  However, if you have a million sensors each transmitting once per second, you’re already there.
  2. Lose no events.  Sensor data is notoriously dirty, whether from malfunction, age, signal drift, connectivity issues, or a variety of other network, software, and hardware problems.  Depending on your use case you may be able to stand some data loss, but our assumption is that you don’t want to lose any.
  3. Scale easily.  As your data grows, your data capture app needs to keep up.  This means it will be a distributed app running on a cluster, as will all the other components discussed here.

Streaming data is time-series data, so each event arrives with at least three pieces of information: 1.) the time stamp from its moment of origination, 2.) the sensor or source ID, and 3.) the value(s) being read at that moment.
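Concretely, that minimal three-field record can be sketched as a small Python class; the field names and JSON wire format here are illustrative, not a standard:

```python
import json
from dataclasses import dataclass

@dataclass
class SensorEvent:
    """One streaming record: origination time, source ID, and the reading."""
    timestamp: float   # epoch seconds at the moment of origination
    sensor_id: str     # sensor or source identifier
    value: float       # the value read at that moment

def parse_event(raw: str) -> SensorEvent:
    """Parse one JSON-encoded message off the wire into a typed record."""
    d = json.loads(raw)
    return SensorEvent(float(d["timestamp"]), str(d["sensor_id"]), float(d["value"]))
```

Whatever capture tool you choose, every downstream component will assume some version of this (timestamp, source, value) triple.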

Later you may combine your streaming data with static data, for example about your customer, but that happens in another component.


Why Do You Need a Message Collector At All?

Many of the Stream Processing apps, including Spark and Storm, can ingest messages directly without a separate Message Collector front end.  However, if a node in the cluster fails, they can’t guarantee that the data can be recovered.  Since we assume your business need demands that you save all the incoming data, a front-end Message Collector that can temporarily store and replay data in the event of failure is considered a safe architecture.
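That “temporarily store and replay” idea can be illustrated with a toy append-only log; this is a sketch of the concept, not any particular broker’s API:

```python
class ReplayableLog:
    """Append-only message log. Consumers read by offset and can rewind,
    so a failed downstream node can re-request the messages it lost."""

    def __init__(self):
        self._messages = []

    def append(self, msg) -> int:
        """Store a message and return the offset it was stored at."""
        self._messages.append(msg)
        return len(self._messages) - 1

    def read_from(self, offset: int):
        """Replay every message at or after `offset`."""
        return self._messages[offset:]
```

Kafka implements essentially this idea at scale: consumers track their own offsets, and after a failure they simply rewind and re-read.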


Open Source Options for Message Collectors

In open source you have a number of options.  Here are some of the better known Data Collectors.  This is not an exhaustive list.

  • FluentD – General purpose multi-source data collector.
  • Flume – Large scale log aggregation framework.  Part of the Hadoop ecosystem.
  • Message brokers such as RabbitMQ – There are a number of these lightweight message brokers.  The “MQ” naming traces back to IBM’s message-queuing middleware, and the related MQTT (MQ Telemetry Transport) protocol, which also originated at IBM, is widely used for sensor messaging.
  • AWS Kinesis – Not open source, but a popular managed alternative; the other major cloud services offer similar managed collectors.
  • Kafka – Distributed queue publish-subscribe system for large amounts of streaming data.


Kafka is Currently the Most Popular Choice

Kafka is not your only choice, but it is far and away today’s most common choice, used by LinkedIn, Netflix, Spotify, Uber, and Airbnb among others.

Kafka is a distributed messaging system designed to tolerate hardware, software, and network failures and to allow segments of failed data to be essentially rewound and replayed, providing the needed safety in your system.  Kafka came out of LinkedIn in 2011 and is known for its ability to handle very high throughput rates and to scale out.

If your stream of data needed no other processing, it could be passed directly through Kafka to a data store.
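As a sketch of what handing data to Kafka involves, here is an illustrative serializer for one reading.  Keying by sensor ID is a common convention (it keeps each sensor’s readings ordered within a single partition); the actual send would go through a producer client such as kafka-python’s KafkaProducer, which needs a running broker and is omitted here:

```python
import json

def to_kafka_message(timestamp: float, sensor_id: str, value: float):
    """Serialize one sensor reading for a publish-subscribe system like Kafka.

    Keying by sensor_id routes all of one sensor's readings to the same
    partition, preserving their time order within that partition.
    """
    key = sensor_id.encode("utf-8")
    payload = json.dumps(
        {"timestamp": timestamp, "sensor_id": sensor_id, "value": value},
        separators=(",", ":"),
    ).encode("utf-8")
    return key, payload
```

With kafka-python, for example, this pair would be handed to `KafkaProducer.send(topic, key=key, value=payload)`.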


Storage – Open Source

Here’s a quick way to do a back-of-envelope assessment of how much storage you’ll need.  For example:

  • Number of sensors: 1 million
  • Signal frequency: every 60 seconds
  • Data packet size: 1 KB
  • Events per sensor per day: 1,440
  • Total events per day: 1.44 billion
  • Events per second: ~16,667
  • Total data size per day: 1.44 TB
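The arithmetic behind those figures is worth spelling out; a few lines of Python make the back-of-envelope calculation explicit:

```python
SENSORS = 1_000_000        # number of sensors
INTERVAL_S = 60            # one reading per sensor every 60 seconds
PACKET_KB = 1              # 1 KB per reading

SECONDS_PER_DAY = 24 * 60 * 60

events_per_sensor_per_day = SECONDS_PER_DAY // INTERVAL_S       # 1,440
events_per_day = SENSORS * events_per_sensor_per_day            # 1.44 billion
events_per_second = events_per_day / SECONDS_PER_DAY            # ~16,667
tb_per_day = events_per_day * PACKET_KB / 1_000_000_000         # 1.44 TB (decimal)
```

Swap in your own sensor count, signal frequency, and packet size to size your system.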


Your system will need two types of storage, ‘Forever’ storage and ‘Fast’ storage.

Fast storage is for real-time lookup after the data has passed through your streaming platform, or even while it is still resident there.  You might need to query Fast storage within just a few milliseconds to add data and context to the stream flowing through your streaming platform, such as the min, max, or average readings for sensor X over the last 24 hours or the last month.  How long you hold data in Fast storage will depend on your specific business need.

Forever storage isn’t really forever but you’ll need to assess exactly how long you want to hold on to the data.  It could be forever or it could be a matter of months or years.  Forever storage will support your advanced analytics and the predictive models you’ll implement to create signals in your streaming platform, and for general ad hoc batch queries.

An RDBMS is not going to work for either of these needs, given its speed, cost, and scale limitations.  Both of these stores are going to be some flavor of NoSQL.


Cost Considerations

In selecting your storage platforms you’ll be concerned about scalability and reliability, but you’ll also be concerned about cost.  Consider this comparison drawn from Hortonworks:


For on-premises storage a Hadoop cluster will be both the low-cost and the best scalability/reliability option.  Cloud storage, also based on Hadoop, is now approaching 1¢ per GB per month from Google, Amazon, and Microsoft.


Open Source Options for Storage

Once again we have to pause to explain nomenclature, this time about “Hadoop”.  Most of the time when you read about “Hadoop” the author is speaking about the whole ecosystem of packages that are available to run on it.

Technically, however, Hadoop consists of three elements that are the minimum requirements for it to operate as a database.  Those are: HDFS (the Hadoop file system, which is how the data is stored), YARN (the scheduler), and Map/Reduce (the query and processing framework).  “Hadoop” (the three-component database) is good for batch queries but has recently been largely overtaken in new projects by Spark, which runs on HDFS and queries much faster.

What you should really focus on is the HDFS foundation.  There are alternatives to HDFS, such as Amazon S3 (a proprietary cloud service, but widely used) and MongoDB, and these are viable options.  However, almost universally what you will encounter are NoSQL database systems based on HDFS.  These options include:

  • HBase
  • Accumulo
  • Cassandra (which, strictly speaking, runs on its own storage engine rather than HDFS)
  • And many others.

We said earlier that RDBMS was non-competitive based on many factors, not the least of which is that the requirement for a schema-on-write is much less flexible than the NoSQL schema-on-read (late schema).  However, if you are committed to RDBMS you should examine the new entries in NewSQL which are RDBMS with most of the benefits of NoSQL.  If you’re not familiar, try one of these refresher articles here, here, or here.
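The schema-on-write versus schema-on-read distinction can be made concrete with a toy sketch (illustrative only): an RDBMS-style insert validates the record against a fixed schema up front, while a NoSQL-style store accepts raw documents and imposes structure only at query time:

```python
import json

# Schema-on-write (RDBMS style): the record must match the schema at insert time.
def insert_row(table: list, record: dict,
               columns=("timestamp", "sensor_id", "value")):
    if set(record) != set(columns):
        raise ValueError("record does not match the fixed schema")
    table.append(tuple(record[c] for c in columns))

# Schema-on-read (NoSQL style): store raw documents now, apply a schema when you query.
def query_raw(store: list, field: str):
    """Pull one field out of whatever raw JSON documents happen to contain it."""
    for raw in store:
        doc = json.loads(raw)
        if field in doc:
            yield doc[field]
```

The late-schema approach tolerates evolving or ragged data at the cost of pushing validation work onto every query.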


Query – Open Source

The goal of your IoT streaming system is to be able to flag certain events in real time that your customer/user will find valuable.  At any given moment your system will contain two types of data, 1.) Data-in-motion, as it passes through your stream processing platform, and 2.) Data-at-rest, some of which will be in fast storage and some in forever storage.

There are two types of activity that will require you to query your data:

Real-time outputs:  If your goal is to send an action message to a human or a machine, or to feed a dashboard that updates in real time, you may need to enhance your streaming data with stored information.  One common type is static user information: for example, adding static customer data to the stream while it is passing through the stream processor can enhance the predictive power of the signal.  A second type is signal enhancement: if your sensor is telling you the current reading from a machine, you may need to compare that to the average, min, max, or other statistical variations from that same sensor over time periods ranging from the last minute to the last month.

These data are going to be stored in your Fast storage and your query needs to be completed within a few milliseconds.
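As an illustration of that kind of Fast-storage lookup, here is a toy in-memory sliding-window store; a production system would use a distributed NoSQL store, but the shape of the query is the same.  Names are illustrative:

```python
from collections import defaultdict, deque

class FastStore:
    """In-memory stand-in for 'Fast' storage: keeps a sliding window of
    recent readings per sensor and answers min/max/avg lookups quickly."""

    def __init__(self, window_seconds: float = 24 * 3600):
        self.window = window_seconds
        self.readings = defaultdict(deque)   # sensor_id -> deque of (ts, value)

    def add(self, sensor_id: str, ts: float, value: float):
        q = self.readings[sensor_id]
        q.append((ts, value))
        while q and q[0][0] < ts - self.window:   # evict readings older than the window
            q.popleft()

    def stats(self, sensor_id: str):
        """Return (min, max, avg) of the readings still inside the window."""
        values = [v for _, v in self.readings[sensor_id]]
        return min(values), max(values), sum(values) / len(values)
```

The stream processor would call `stats()` on each passing event to enrich it with historical context for that sensor.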

Analysis Queries:  It’s likely that your IoT system will contain some sophisticated predictive models that score the data as it passes by to predict human or machine behavior.  In IoT, developing predictive analytics remains the classic two step data science process: first analyze and model known data to create the predictive model, and second, export that code (or API) into your stream processing system so that it can score data as it passes through based on the model.  Your Forever data is the basis on which those predictive analytic models will be developed.  You will extract that data for analysis using a batch query that is much less time sensitive.
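That two-step pattern can be sketched in miniature; the “model” here is just a 3-sigma threshold standing in for whatever predictive model you actually develop, and all names are illustrative:

```python
import statistics

def train_model(historical_values):
    """Step 1 (batch, on 'Forever' data): fit a simple anomaly threshold."""
    mu = statistics.mean(historical_values)
    sigma = statistics.pstdev(historical_values)
    return {"mean": mu, "threshold": mu + 3 * sigma}   # exported model parameters

def score(model, value):
    """Step 2 (in the stream processor): score each reading as it passes by."""
    return value > model["threshold"]
```

In practice step 1 runs as a batch job against Forever storage, and the exported parameters (or an API wrapping them) are deployed into the streaming platform.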


Open Source Options for Query

In the HDFS Apache ecosystem there are three broad categories of query options.

  1. Map/Reduce:  This method is one of the three legs of a Hadoop database implementation and has been around the longest.  It can be complex to code, though Apache projects like Pig and Hive seek to make this easier.  In batch mode, for analytic queries where time is not an issue, Map/Reduce on a traditional Hadoop cluster works perfectly well and can return results from large-scale queries in minutes or hours.
  2. Spark:  Based on HDFS, Spark has started to replace Hadoop Map/Reduce because it is 10X to 100X faster at queries (depending on whether the data is on disk or in memory).  Particularly if you have used Spark in your streaming platform, it will make sense to also use it for your real-time queries.  Latencies in the milliseconds range can be achieved, depending on memory and other hardware factors.
  3. SQL:  The NoSQL movement originally took its name from database designs, like Hadoop, that could not be queried by SQL.  However, so many people were fluent in SQL, and not in the more obscure Map/Reduce queries, that there has been a constant drumbeat of development aimed at allowing SQL queries.  Today SQL is so common on these HDFS databases that the name NoSQL is no longer accurate.  All of these SQL implementations require some sort of intermediate translation layer, however, so they are generally not suited to millisecond queries.  They do make your non-traditional data stores accessible to any analysts or data scientists with SQL skills.
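To see why raw Map/Reduce is considered harder to write than SQL, here is the classic word count expressed as explicit map, shuffle, and reduce phases (a single-machine simulation; a real framework distributes each phase across the cluster):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (key, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
```

The equivalent SQL is a one-line GROUP BY, which is exactly why SQL-on-Hadoop tooling has been in such demand.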

Watch for Lessons 2 and 3 in the next weeks.


About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]



Originally posted on Data Science Central

Cloud giants like Amazon, Google, Microsoft and IBM have rushed into the big data analytics cloud market.  They claim their tools will make developer tasks simple.  For machine learning, they say their cloud products will free data scientists and developers from implementation details so they can focus on business logic.

The big companies have kicked off a race between machine learning platforms. Amazon ML, Azure ML, IBM Watson and Google Cloud Prediction are striving to fold data science workflows into their existing ecosystems. They want to drive the adoption of machine learning algorithms across software development teams and expand data science throughout the business.

But data scientists and big data platform engineers do not always want or need this one-size-fits-all approach.  They understand firsthand how powerful and flexible Apache Spark, R and Python are when it comes to machine learning.  These people are experts who cannot be constrained by a cloud GUI.  They need access to the command line.  And they do not want to be tied to any one vendor.

We would like to cut through the marketing buzz coming from the cloud companies and home in on the facts.  So we ask you to help us gather actual statistics that show how much companies are using these clouds versus custom solutions they built themselves with open-source technologies and machine learning libraries.

Follow this link to participate in the survey.

You might also want to follow and comment on the discussions at Quora or Reddit.


Making data science accessible – HDFS

Originally posted on Analytic Bridge

By Dan Kellett, Director of Data Science, Capital One UK


Disclaimer: This is my attempt to explain some of the ‘Big Data’ concepts using basic analogies. There are inevitably nuances my analogy misses.


What is HDFS?

When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. MapReduce is a framework for efficient processing using a parallel, distributed algorithm (see my previous blog here). The standard approach to reliable, scalable data storage in Hadoop is through the use of HDFS (Hadoop Distributed File System).


Imagine you wanted to find out how much food each type of animal eats in a day. How would you do it?

The gigantic warehouse

One approach would be to buy or rent out a huge warehouse and store some of every type of animal in the world. Then you could study each type one at a time to get the information you needed – presumably starting with aardvarks and finishing with zebras.

The downside of this approach (other than the smell & the risk the lions would eat the antelopes) is that this would take a looooong time and would be very expensive (renting out a huge building for many years would add up).

This is similar to the approach we take with our data at the moment. Huge amounts of information about our customers are available and we are restricted to analyzing it through a fairly narrow pipeline.



An alternative approach would be to split up the animals and send them to lots of smaller centers where each could be assessed. Maybe the penguins go to Penzance, the lemurs to Leeds and the oxen to Oxford.

This would mean each center could study their animals and send the information back to a head office to be summarized. This would be much faster and a lot cheaper.

This is an example of parallel processing whereby you split up your data, analyze each part separately and bring it back together at the end.



The key drawback is that you are highly susceptible to failure in one of these mini-centers. What happens if there’s a break-in at the center in Lincoln and all the chickens escape? What if the systems go down in Dundee and all the information on sparrows is lost?

The solution to this is to still use mini-centers but to send animals to multiple centers. For example, you may send some rabbits to Birmingham, some to Edinburgh and some to Cardiff. You are then protected from individual failures and can still carry out the survey quickly.

This, from a very high level, is what HDFS (Hadoop Distributed File System) does. Data are split up and replicated so that the overall task is not impacted if an individual node fails. Each node carries out its designated task and then passes the results back to a central node to be aggregated.
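The replication idea (rabbits in Birmingham, Edinburgh and Cardiff) can be sketched in a few lines: place each record on several nodes so that losing any one node loses no data. HDFS defaults to three replicas; the placement rule below is purely illustrative, not how HDFS actually assigns blocks:

```python
def place(record_id: int, n_nodes: int = 5, replicas: int = 3):
    """Choose `replicas` distinct nodes for a record (consecutive from a hash)."""
    start = hash(record_id) % n_nodes
    return {(start + i) % n_nodes for i in range(replicas)}

def store_all(record_ids, n_nodes: int = 5):
    """Distribute every record across the cluster with replication."""
    nodes = {n: set() for n in range(n_nodes)}
    for rid in record_ids:
        for node in place(rid, n_nodes):
            nodes[node].add(rid)
    return nodes
```

With three copies of every record on a five-node cluster, any single node can fail outright and every record is still recoverable from the survivors.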


When would I use HDFS?

As with any technique related to data science, HDFS is one of many approaches you could take to solve a business problem using large amounts of data. The key is being able to pick and choose when to take HDFS off the shelf. At a high level: HDFS may help you if your problem is huge but not hard. If you can parallelize your problem, then HDFS coupled with MapReduce should help.


Read my previous blogs on: Text Mining, Logistic Regression, Markov Chains, Trees or Neural Networks




Originally posted on Data Science Central

Thousands of articles and tutorials have been written about data science and machine learning. Hundreds of books, courses and conferences are available. You could spend months just figuring out what to do to get started, even to understand what data science is about.

In this short contribution, I share what I believe to be the most valuable resources - a small list of top resources and starting points. This will be most valuable to any data practitioner who has very little free time. 

Map-Reduce Explained

These resources cover data sets, algorithms, case studies, tutorials, cheat sheets, and material to learn the most popular data science languages: R and Python. Some non-standard techniques used in machine-to-machine communications and automated data science, even though technically simpler and more robust, are not included here as their use is not widespread, with one exception: turning unstructured into structured data. We will include them, as well as Hadoop-based techniques (distributed algorithms, or Map-Reduce) in a future article. 

1. Technical Material

2. General Content

3. Additional Reading

Enjoy the reading!


5 Big Data Myths Businesses Should Know

Guest blog post by Larry Alton

Big data is seeping into every facet of our lives. Smart home gadgets are becoming part of the nerve systems of new and remodeled homes, and many renters are demanding these interconnected gadgets from landlords.

But nowhere has Big Data created a bigger buzz than in business. Companies of all sizes are collecting data at an unprecedented rate. Big data is larger than ever before.

We’ve collected more data in the past two years than in the entire history of the human race. It’s also continuing to grow at an incredible rate: By 2020, analysts believe we’ll be generating about 1.7 megabytes of information per second for every human being.

This information can be useful for businesses in a wide array of areas, from cloud computing to data processing and customer relations. But just because businesses can collect all this information doesn’t mean they know what to do with it or have the resources to analyze it.

In fact, many businesses are still struggling to understand what Big Data is all about. Much of this has to do with the vast complexity of data analysis, but it probably has a fair amount to do with some of the myths that permeate the industry as well.

Here are five of the biggest.

1. Big data is large

Big data is effectively just a name, and a somewhat misleading one. When people refer to big data, they’re talking about all the data in the world, but most businesses don’t collect all of it.

They focus more on individual transaction data, which is granular and specific. Big data is made up of a lot of very small chunks of data, most of which most businesses never see or collect.

The smaller size of typical data collection is highly beneficial for businesses. It’s much easier for executives to understand and control information when it's collected in small portions.

For that reason, organizations shouldn’t get intimidated when they’re advised to use big data. It’s not nearly as overwhelming as it might sound.

2. It’s expensive to analyze data

Small businesses in particular may be afraid to collect and analyze data because they think it will have a substantial impact on their bottom line.

This might have been a problem five years ago, but today there are so many free data tools available that anyone, even a one-person operation, can analyze a lot of data.

“Availability of inexpensive but advanced analytics tools, combined with the government releasing treasure troves of data — and the avalanche of ‘user exhaust’ data generated in social networks — enables these start-ups to bring innovative products and services to market with little funding,” says an article from The Enterprisers Project.

“They do not need millions of dollars and years of development work to actually achieve significant value — or become a disruptor in the industry along the way. It does not take a team of Ph.Ds to get there either, everything is much more accessible these days.”

3. All data is good data

Another major distinction is the difference between wholesome, useful data and garbage. The quality of data varies, and companies should recognize the difference before trying to use data that won’t be of any use.

Even though the process of collecting accurate, real-time data is improving, there is still a lot of error and superfluous detail. Photographs and videos can easily be tagged incorrectly, or sarcastic content can be taken seriously.

There can also be information about a customer base that’s missing key information, which renders the rest of the data useless. Being willing to throw away data that isn’t helpful will help your business make better sense of its collections.

4. You need clean data

Though we just spoke about throwing away useless data, it’s worthwhile to make certain it’s definitely useless first. Many companies believe that “dirty data,” data that’s clouded with useless and confusing details, isn’t worth their time.

But in reality, analyzing dirty data can potentially lead to great insights. Even when the data is not clean, a firm can employ analytics processing to illuminate useful strengths from the depths of the information.

At other times, the analytics may come back with nothing. This is how you distinguish good, dirty data from clean, useless information.

5. Big data will take your job

One of the primary arguments against big data is that it’s making way for machines to take over the jobs of human analysts. This is not the case, however.

“The World Economic Forum warned that robots and technological advances will take more than 5 million jobs from humans over the next five years,” says Ben Rossi, contributor for Information Age. “Machine learning has undoubtedly earned its place in the workforce, but machines don’t necessarily have to replace humans -- they can in fact enhance the work humans can do.”

Rossi goes on to explain how big data is actually paving the way for better jobs. It handles the grunt work that humans were once required to do and makes it possible for people to fill or grow into better positions. Big data is a boon for the job market, not a threat.
