Guest blog post by Bill Vorhies
Summary: Self Service Data Prep Platforms (SSDPPs) may offer some relief for BI and data workers who must deal with IT bottlenecks in getting data. But watch out for widely varying capabilities and the assumptions underlying some of their automated features.
I admit it, I’m confused. There is a category of analytic platforms that Gartner calls “Self Service Data Preparation for Analytics” that seems to group fish-AND-fowl and, simultaneously, fish-OR-fowl. Since I try to keep ahead of these things so that I can explain them to my clients, I’m struggling with what’s in and what’s out of this category, as well as exactly who should be considering it. There are multiple overlapping categories and exclusions, and I’ll try here to sort them out.
Let’s start with a common frame of reference, the major components of a full-fledged Advanced Analytics Platform. I think we could agree that it should have the following major capabilities:
- Blending: Capture, blend, and harmonize multiple data sources of different types.
- Exploration and Discovery: What does the data mean, where does it come from, and how do the data relate to one another?
- Transformation: Dealing with missing data, outliers, normalizing, the creation of synthetic features, and the like.
- Visualization and Reporting: These may not be required but they are so common they need to be included here.
- Modeling: The creation of multiple models and the selection of champion models that are most useful to the business at hand.
- Deployment: Some method of exporting the champion models in order to integrate them in operational systems.
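To make the Transformation step above concrete, here is a minimal sketch in plain Python of three of the manipulations it names: imputing missing values, clipping outliers, and normalizing. The function name, the mean-imputation choice, and the 3-sigma cutoff are illustrative assumptions, not any platform’s actual behavior.

```python
# A toy sketch of the "Transformation" step: impute, clip outliers, normalize.
# Uses only the standard library; all parameter choices are illustrative.
from statistics import mean, stdev

def transform(values, z_cutoff=3.0):
    """Impute missing values with the mean, clip outliers, then min-max normalize."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), stdev(present)
    # Impute missing entries (None) with the column mean.
    filled = [mu if v is None else v for v in values]
    # Clip values more than z_cutoff standard deviations from the mean.
    lo, hi = mu - z_cutoff * sigma, mu + z_cutoff * sigma
    clipped = [min(max(v, lo), hi) for v in filled]
    # Min-max normalize to the [0, 1] range.
    low, high = min(clipped), max(clipped)
    return [(v - low) / (high - low) for v in clipped]
```

Even in this toy version, notice how many judgment calls are baked in (why mean imputation and not median? why 3 sigma?). Those embedded choices are exactly the “underlying assumptions” that matter when a platform automates this step.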
Who’s In and Who’s Out
So here’s where this starts to get tricky. Some of these Self Service Data Prep Platforms (SSDPPs) do essentially all of these things like IBM’s SPSS which is included in this category. But Gartner specifically omits other full Advanced Analytic Platforms such as SAS, RapidMiner, and KNIME that also do all these things.
Others that are included in this group do at least the first three (through transformation), while several have some visualization and reporting, and others have at least lightweight modeling tools on board.
It’s fair to say that Gartner did not intend this category of SSDPPs to be a simple subset of Advanced Analytic Platforms, though from a data scientist’s standpoint that’s how it appears. So how does Gartner carve out this group and why?
From Gartner’s perspective they are looking at three distinct subgroups:
- Standalone applications (Teradata, Tamr)
- Integrated as part of a data science/analytics platform (Alpine, IBM SPSS)
- Integrated as part of a BI/data discovery platform (Looker, Platfora)
By the way, Gartner forecasts that within five years the standalone group will either evolve into full Advanced Analytics Platforms or combine forces with others to become one.
This market is aimed at relieving the pain point caused by the amount of time required to prepare data (blend, discover, transform) for either data science or BI tasks, and by the bottleneck that the current IT-centric process imposes on many organizations. Fair point.
Who Are SSDPPs For
Providers in this segment are looking at three distinct types of users:
- Data scientists. Capable of developing business models simply by having access to the available data, and generally commit errors only when the completeness of the data is not assured.
- Business analysts. Generally these users are able to ask significant analytic questions. They have some difficulty understanding the completeness, accuracy, and cardinality of data, and often commit moderate to serious errors in designing analysis models and in processing the data.
- Information workers. Information workers are individuals who employ information to assist in the decision-making process or carry out an action, or who create information that supports the decision-making process or drives resulting actions.
Gartner also introduces a somewhat more nuanced list of capabilities here than my simple six-part definition at the beginning, with many of these additional capabilities focusing on collaboration, data curation, and the creation of data catalogues and repositories. There are seven characteristics in all, and to be on the list an SSDPP must provide at least five:
- Data discovery. Functionalities such as searching, sampling, profiling, cataloging/inventorying data assets and tagging/annotating data for future exploration, discovering/suggesting sensitive attributes, identifying commonly used attributes (such as geo-data, product ID), discovering data lineage, and pattern detection.
- Data transformation. Functionalities such as data enrichment, data blending, filtering, user-defined calculations, and data augmentation.
- Data modeling/structuring. Functionalities such as support for logical models and logical data structures, and discovery of relationships among data source attributes.
- Data curation. Functionalities such as harmonizing disparate data sources to provide unified datasets for analysis, managing data life cycles to provision reuse and discovery, and maintaining data quality.
- User collaboration workspace. Functionalities such as sharing data source connections, sharing queries, sharing datasets, and publishing/sharing models.
- Metadata catalog/repository. Metadata catalog of data sources, data source attributes, data lineage, relationships and other relevant metadata.
- Interactive data preparation. Business-user-oriented, visual and interactive data preparation.
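To give a feel for what the “data discovery” profiling in the first item actually computes, here is a toy profiler that reports missing counts, cardinality, and sample values per column. The dict-of-records layout and field names are a hypothetical example, not any vendor’s format.

```python
# A toy illustration of "data discovery" profiling: for each column,
# report missing count, cardinality, and a few sample values.
# The record layout is a hypothetical example, not any vendor's format.
def profile(rows):
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "missing": len(values) - len(present),          # nulls to flag
            "cardinality": len(set(present)),               # distinct values
            "sample": sorted(set(present), key=str)[:3],    # quick preview
        }
    return report
```

Commercial platforms layer much more on top (lineage, sensitive-attribute detection, tagging), but the core of profiling is this kind of per-column summary.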
However, the ways in which the listed vendors mix and match these features is so varied that a potential customer would need to do a much more in-depth analysis. Here are some of the issues that struck me as I reviewed the details of each:
Observations and Issues
- If your company has an active data science program, then your data scientists can probably already do all of these things through their chosen Advanced Analytic Platforms. The possible exception: if you are a pure R or Python shop looking for some front-end efficiencies in data prep, these may help, provided your data scientists accept the embedded assumptions and techniques.
- Even in a data science shop, but much more likely in a BI or ‘data worker’ environment, there may be uses for the collaboration, curation, and repository features of SSDPPs. Clearly this indicates an environment in which there can be much misunderstanding about what the data means, and a group approach to clarification may be valuable.
- SSDPPs are not a replacement for well-defined ETL procedures with strong SLAs and curated data in a production environment.
- Some of these platforms are much more directive than others in suggesting, automating, or making easy the types of transforms that data scientists consider their personal domain. To the extent that these may be automated there needs to be a careful examination of the underlying assumptions and techniques. If you are creating a data viz or report that slightly misclassifies data that may not be very harmful. If you create a predictive model or recommender based on faulty automated assumptions and put it into your operating system, that may create substantial damage.
- Some of these platforms retain the source data in its original form alongside the transformed data, and others do not. It would be easy to lose the trail of which manipulations were done in what order, making it problematic to replicate results on future data. Some, like Alteryx, have drag-and-drop UIs and are more repeatable.
- Particularly, the standalone SSDPPs are designed to output a dataset to be used in other applications. Some have data viz and reporting capabilities, and some have at least lightweight modeling capabilities. An important part of selection will be ease of integration with the target applications and reconciling any overlapping capabilities.
- Finally, my observation is that you are much more likely to use these in a BI or ‘data worker’ environment than in a data science group. If you resolve a pain point by taking the load off IT and making data more rapidly available, is that ever a bad thing? You can probably think of several ways this might go south. Suppose the best understanding of data sources and their meaning lies with analysts in your IT department who are suddenly taken out of the loop. Suppose the ‘information workers’ who have this new capability fail to take care to understand the data’s provenance or meaning. Suppose, in the process of blending new data sources, you lose what you previously had in data that is ‘a single version of the truth’.
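The repeatability and lineage concerns above can be sketched in a few lines: record each manipulation as a named step so the same sequence can be replayed on future data, and so the trail of what was done in what order is never lost. The class and step names are illustrative, not any platform’s API.

```python
# A sketch of the lineage concern: record each manipulation as a named
# step so the pipeline is replayable and auditable. Illustrative only.
class Pipeline:
    def __init__(self):
        self.steps = []  # ordered (name, function) pairs

    def add(self, name, fn):
        """Register a named transformation step; returns self for chaining."""
        self.steps.append((name, fn))
        return self

    def run(self, data):
        """Apply every step, in order, to the input data."""
        for _, fn in self.steps:
            data = fn(data)
        return data

    def lineage(self):
        """The audit trail: which manipulations ran, in what order."""
        return [name for name, _ in self.steps]

# Hypothetical usage: the same pipeline can be rerun on next month's data.
pipe = (Pipeline()
        .add("drop_missing", lambda xs: [x for x in xs if x is not None])
        .add("scale_by_10", lambda xs: [x * 10 for x in xs]))
```

A platform that keeps this kind of ordered record (as the drag-and-drop tools effectively do) makes results reproducible; one that discards it leaves you unable to say how the transformed dataset was produced.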
In short, this should not be a launch-and-forget implementation. This will require more management emphasis than simply training analysts how to operate the new platform. There are disciplines around data that require attention, especially when automation may make it look like ‘the machine is taking care of it’.
If any of this fits your case, here’s the non-exhaustive list covered in Gartner’s review:
- Alpine Data Labs (Alpine Chorus)
- Alteryx (Alteryx Analytics)
- ClearStory Data
- IBM (IBM SPSS)
- Lavastorm Analytics
- Microsoft (Power Query for Excel)
- Paxata (Paxata Adaptive Data Preparation platform)
- SAP (SAP Lumira)
- Tamr (Tamr Platform)
- Teradata (Teradata Loom)
- Trifacta (Trifacta Data Transformation Platform)
- Waterline Data (Waterline Data Inventory)
Source: Gartner (March 2015)
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at: