
Why you need metadata for Big Data success

Guest blog post by John P. Stevens

I recently wrote an article entitled ‘First Big Data initiative – why you need Big Data governance now!’, and one of the comments I received was from metadata expert and noted industry speaker Bob Schork. I had the privilege of working with Bob in the past and have benefited from his extensive metadata insights over the years. What prompted this article was his comment that metadata “is and will be ignored by many working on a BD (Big Data) project, to their own detriment.” This resonated with me: metadata is often taken for granted within the scope of Big Data projects, and within the broader data management space as well. This article will highlight why metadata is crucial to your Big Data project’s overall success and to your enterprise data architecture…

Read more…

Maximizing Data Value with a Data Lake

Contributed by Chuck Currin of Mather Economics:

There’s tremendous value in corporate data, and some companies can maximize that value through the use of a data lake. This assumes that the adopting company has high-volume, unstructured data to contend with. The following article describes ways that a data lake can help companies maximize the value of their data. The term “data lake” has been credited to James Dixon, the CTO of Pentaho. He offered the following analogy:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in or take samples.”

Rapid Data…

Read more…

50+ Free Data Science Books

Guest blog post by Laetitia Van Cauwenberge

Very interesting compilation published here, with a strong machine learning flavor (maybe machine learning book authors - usually academics - are more prone to making their books available for free). Many are freely available O'Reilly books. Here we display those most relevant to data science. I haven't checked all the sources, but they seem legit. If you find an issue, let us know in the comment section below. Note that at DSC, we also have our own free books:

Read more…

Guest blog post by Irina Papuc

During the last few years, the hottest word on everyone’s lips has been “productivity.” In the rapidly evolving Internet world, getting something done fast always gets an upvote. Despite needing to implement real business logic quickly and accurately, as an experienced PHP developer I still spent hundreds of hours on other tasks, such as setting up databases and caches, deploying projects, monitoring online statistics, and so on. Many developers have struggled with these so-called miscellaneous tasks for years, wasting time instead of concentrating on the project logic.

My life changed when a friend mentioned Amazon Web Services (AWS) four years ago. It opened a new door and led to a tremendous boost in productivity and project quality. If you have not used AWS, please read this article, which I am sure you will find worth your…

Read more…

10 tools and platforms for data preparation

Guest blog post by Zygimantas Jacikevicius

Traditional approaches to enterprise reporting, analysis and Business Intelligence, such as Data Warehousing, upfront modelling and ETL, have given way to new, more agile tools and ideas. Within this landscape, Data Preparation tools have become very popular, for good reason. Data preparation has traditionally been a very manual task that consumed the bulk of most data projects’ time. Profiling, standardising and transforming data by hand is slow and error-prone. This has derailed many Data Warehousing and analysis projects as they become bogged down in infrastructure and consistency issues rather than focusing on the true value add: producing good-quality analysis.

Fortunately, the latest generation of tools, typically powered by NoSQL technologies, takes a lot of this pain…

Read more…

Data Lakes Still Need Governance Life Vests

Guest blog post by Gabriel Lowy


As a central repository and processing engine, data lakes hold great promise for raising return on data assets (RDA).  Bringing analytics directly to data in its native formats can accelerate time-to-value by providing data scientists and business users with increased flexibility and efficiency.

But to realize higher RDA, data lakes still need governance life vests.  Without data governance and integration, analytics projects risk drowning in unmanageable data that lacks proper definitions or security provisions. …

Read more…

Guest blog post by Randall V Shane

The figure titled "Data Pipeline" is from an article by Jeffrey T. Leek & Roger D. Peng titled "Statistics: P values are just the tip of the iceberg." Both are well-known scientists in the fields of statistics and data science, and for them there is no need to debate the importance of data integrity; it is a fundamental concept. Current terminology uses the term "tidy data", a phrase coined by Hadley Wickham in an article of the same name. Whatever you call it, as scientists, they understand the consequences of bad data. Business decisions today are frequently driven by results from data analysis, and as such, today's executives also need to understand these same consequences. Bad data leads to bad decisions. …
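
To make the "tidy data" idea concrete, here is a minimal sketch of my own (not taken from either article): reshaping a wide table in Python with pandas so that each variable is a column and each observation is a row.

    import pandas as pd

    # Untidy: the "year" variable is spread across column names
    wide = pd.DataFrame({
        "country": ["A", "B"],
        "2015": [100, 200],
        "2016": [110, 210],
    })

    # Tidy: one row per (country, year) observation, one column per variable
    tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
    print(tidy)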

Read more…

What pays most: R, Python, or SQL?

Guest blog post by Laetitia Van Cauwenberge

Salary mostly depends on experience, education, location, industry, and unfortunately, factors such as gender. Also, most data scientists have all three skills (R + Python + SQL) and more, so it is hard to assess which one is the most valuable.


Source for picture: click here (numbers are from 2014)

You could do a survey, asking data scientists which skills they were hired for, and break the results down into 8 categories:

  • no R, no Python, no SQL
  • R, no Python, no SQL  
  • no R,…
Read more…

The Art of Modeling Names

Guest blog post by Kurt Cagle

This is the first in a series about cross format data modeling principles.

Names are Simple, Right?

In the data modeling realm, there is perhaps no example that is as ubiquitous as modelling personal names. After all, things don’t get much simpler than a name:
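
A minimal stand-in record (the snippet from the original post is not reproduced here, and the field names below are hypothetical) could be as simple as:

    person = {"givenName": "Jane", "familyName": "Doe"}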

Simple, right? Well, not so fast. This isn’t really a model, but rather an instance of a model - an actual example that proves out the model. There are, however, a number of ways that such a model can be described. If, for instance, we use a language such as RDF, this would be modeled as follows:
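
The original post's RDF snippet is also not shown here; a rough sketch of the same idea, built with Python's rdflib rather than written directly in Turtle (the class and property names are my own assumptions), might look like this:

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/model#")   # hypothetical namespace
    g = Graph()
    g.bind("ex", EX)

    # Assert that a class named "Person" exists
    g.add((EX.Person, RDF.type, RDFS.Class))

    # Two properties, each with a domain of Person and a range of literal values
    for prop in (EX.givenName, EX.familyName):
        g.add((prop, RDF.type, RDF.Property))
        g.add((prop, RDFS.domain, EX.Person))
        g.add((prop, RDFS.range, RDFS.Literal))

    print(g.serialize(format="turtle"))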

What you see is a set of assertions that identify that there exists a class named “person”, and that this class has two properties. The domain and range assertions on each property are important, because they indicate what…

Read more…

This is the second article in a series. The first article is available here.

How to implement a temporal database

Not every database requires a temporal database implementation, but some do. We can help you get started. As discussed in our previous article, the SQL-2011 standard included clauses for the definition of temporal tables as part of the SQL/Foundation. However, this standard is very new and not yet widely adopted. For now, most of you will need to extend your current database tables to incorporate temporal concepts.
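
As a minimal sketch of that kind of extension (the table and column names below are invented for illustration, not taken from the article), a valid-time period can be represented by adding explicit period columns to an ordinary table, here using SQLite from Python:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE employee_salary (
            employee_id INTEGER NOT NULL,
            salary      NUMERIC NOT NULL,
            valid_from  DATE    NOT NULL,  -- start of the period the fact is true
            valid_to    DATE    NOT NULL,  -- end of the period (exclusive)
            PRIMARY KEY (employee_id, valid_from)
        )
    """)
    conn.execute("INSERT INTO employee_salary VALUES (1, 50000, '2014-01-01', '2015-07-01')")
    conn.execute("INSERT INTO employee_salary VALUES (1, 55000, '2015-07-01', '9999-12-31')")

    # What was employee 1's salary on 2015-03-15?
    print(conn.execute("""
        SELECT salary FROM employee_salary
        WHERE employee_id = 1
          AND valid_from <= '2015-03-15' AND '2015-03-15' < valid_to
    """).fetchone())  # (50000,)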

In this article we'll focus on temporal tables. These tables are the building blocks of temporal databases.

Temporal Tables – The Important Theories

Theory #1: Valid-Time State Tables
From Wikipedia: “Valid time is the time period during which a fact is true with…

Read more…

The 3Ms of Marketing Data Analytics

Data is everywhere and growing. For a marketer, this is a dream world. Your audience generates tons of data through your web visits and mobile apps, and partners give you even more. This is rich customer-journey data.

However, commonly used terms like Big Data and Analytics can mask the complexity of realizing ROI from analyzing all this data. The promise of visual tools and Hadoop-based platforms is great, but if you don't have a data strategy in place, all the tools are just shelfware.

Many smart CMOs and analysts I have worked with spend most of their time getting their data right to begin with, working closely with their data teams.

The key is to have a data strategy for your marketing data using the 3 Ms: Management, Metrics and Metadata.

1. Managing your data: Is all of your data organized so that it’s easy for your analysts to access? Oftentimes there are tons of SQL and NoSQL/Hadoop…

Read more…

What is a temporal database?

A temporal database is a database with built-in support for handling data involving time.

This definition is from Wikipedia. It is simple and straightforward. Since we expect every database to have some kind of support for time operations, we could say that, by this definition, all databases are temporal databases.

The reality is a lot more subtle and complex. Temporal databases enable you to see the data as it was seen in the past, while also enabling you to update even the past in the future. A temporal database will show you the actual value back then, as it was known back then, and the actual value back then, as it is known now. A temporal database allows you to know what your organization was forecasting for the future at a certain time in the past. Temporal databases support a…
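
A tiny sketch of that distinction (entirely illustrative, with invented data): each fact carries both the period it was true in the real world and the date on which the database recorded it, so you can ask what was known at any point in time.

    from datetime import date

    salary_history = [
        # As recorded on 2015-01-10: employee 1 earns 50000 from 2014-01-01 onward
        {"employee": 1, "salary": 50000, "valid_from": date(2014, 1, 1), "recorded_at": date(2015, 1, 10)},
        # A later correction, recorded 2016-03-01: the 2014 salary was actually 52000
        {"employee": 1, "salary": 52000, "valid_from": date(2014, 1, 1), "recorded_at": date(2016, 3, 1)},
    ]

    def salary_as_known_on(rows, employee, as_of):
        """The 2014 salary of an employee, as the database knew it on a given date."""
        known = [r for r in rows if r["employee"] == employee and r["recorded_at"] <= as_of]
        return max(known, key=lambda r: r["recorded_at"])["salary"] if known else None

    print(salary_as_known_on(salary_history, 1, date(2015, 6, 1)))  # 50000 (as known back then)
    print(salary_as_known_on(salary_history, 1, date(2016, 6, 1)))  # 52000 (as known now)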

Read more…

R Packages: A Healthcare Application

Guest blog post by Divya Parmar

Building off my last post, I want to use the same healthcare data to demonstrate the use of R packages. Packages in R are stored in libraries and are often pre-installed, but reaching the next level of skill requires knowing when to use new packages and what they contain. With that, let’s get to our example.

Useful function: gsub

When working with vectors and strings, especially when cleaning up data, gsub makes the job much simpler. In my healthcare data, I wanted to convert dollar values to integers (i.e. $21,000 to 21000), and I used gsub as seen below.

[Screenshots of the gsub code and its output appear in the original post.]
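
Those screenshots are not reproduced here. As a rough stand-in (my own sketch in Python, not the author's R code), the same kind of cleanup could be written as:

    import re

    # Convert dollar strings such as "$21,000" to the integer 21000
    values = ["$21,000", "$3,500"]
    cleaned = [int(re.sub(r"[$,]", "", v)) for v in values]
    print(cleaned)  # [21000, 3500]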

Package: reshape2

In…

Read more…

Why SQL Is Not Right for Querying JSON

Guest blog post by Kurt Cagle


Recently, creators of JSON databases have dealt with a fundamental problem. Simply storing and retrieving a given document by a specific key, while useful in any number of scenarios, is not so useful when people want to query that JSON content. This is not really that surprising: once you start gathering data into a database, one of the key requirements that emerges is the ability to query that data based upon specific properties.

Early on, the solution to this dilemma was to build indexes that would create specific look-up tables for certain key-value pairs. By ordering the indexes, you could also compare values for these keys, allowing you to sort within a range or to quickly order documents by their respective…

Read more…

Guest blog post by Vincent Granville

The comparison is performed on a data set where linear regression works well: salary offered to a candidate, based on the programming language requirements in the job ad (Python, R or SQL). This is a follow-up to the article on the highest-paying programming skills. The increased accuracy of the linear regression estimates is negligible, and well below the noise level present in the data set. The Jackknife method has the advantage of being more stable, easy to code, easy to understand (no need to know matrix algebra), and easy to interpret (meaningful coefficients).


Jackknife is not the…

Read more…

Guest blog post by Andreas Blumauer

Inspired by the development of semantic technologies in recent years, the traditional methodology of designing, publishing and consuming statistical datasets in the statistical analysis field is evolving into so-called “Linked Statistical Data”, which associates semantics with dimensions, attributes and observation values based on Linked Data design principles.

The representation of datasets is no longer a combination of magic words and numbers. Everything becomes meaningful when URIs, as dereferenceable resources, replace positional references, which further establishes the relations between resources implicitly and automatically. Different datasets are no longer isolated, and all datasets share a globally, uniquely and uniformly defined structure.

With “RDF Data Cube Vocabulary” (…

Read more…

3 Essential SQL Queries in 1.5 Minutes

Guest blog post by Matt Ritter

How often are you stuck waiting for someone else to pull data for you? There's nothing worse than missing a deadline because someone else didn't run a 30-second query. Never again - here are the basic queries that will get you numbers instantly for pivoting, graphing, and other applications of your analysis skills, as applied to an imaginary table of sales data:
 
Get It All

  select * from sales
 
In SQL, the simplest queries are often the most powerful. This grabs every row and every column. If the table is under 50,000 rows, you should have no problem opening it in Excel. If it's bigger, the program may slow down, depending on your RAM and what…

Read more…

Key-Object – A New Paradigm in Search?

Guest blog post by Bill Vorhies

Summary:  The premise of this new Key-Object architecture is that search is broken, at least as it applies to complex merchandise like computers, printers, and cameras.  An innovative and workable solution is described.  The question remains: is the pain sufficient to justify a switch?

 

As we are all fond of saying, innovation follows pain points.  If you’re reviewing the hundredth social media/instant-messaging/photo-sharing app you might conclude that those so-called pain points identified by some tech innovators hardly rise to the level of owwies.  But what about search?  Are we missing something in our uber-critical search capabilities that needs to be…

Read more…

k-nearest neighbor algorithm using Python

Guest blog post by Laetitia Van Cauwenberge

This article was written by Natasha Latysheva. Here we publish a short version, with references to the full source code in the original article.

Our internal data scientist had a few questions and comments about the article:

  • The example used to illustrate the method in the source code is the famous iris data set, consisting of 3 clusters, 150 observations, and 4 variables, first analysed in 1936. How does the methodology perform on large data sets with many variables, or on unstructured data?
  • Why was Python chosen to do this…
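
For readers who want to reproduce the basic setup on the iris data mentioned above, here is a minimal scikit-learn sketch (my own, not the code from the original article):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Iris: 150 observations, 4 variables, 3 classes
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # k-nearest neighbour classifier with k = 5 (an arbitrary choice here)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    print("Test accuracy:", knn.score(X_test, y_test))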
Read more…

Guest blog post by Vincent Granville

Originally posted here; this version is up to date.

We blended together the best of the best resources posted recently on DSC. It would be great to organize them by category, but for now they are organized by date. This is very useful too, since you are likely to have seen old entries already, and can focus on more recent stuff. Starred entries have interesting charts. 


March 14, 2016

  1. Spark Machine Learning Library Tutorial …
Read more…
