
Featured Posts (351)

50+ Free Data Science Books

Guest blog post by Laetitia Van Cauwenberge

Very interesting compilation published here, with a strong machine learning flavor (maybe machine learning book authors - usually academics - are more prone to making their books available for free). Many are freely available O'Reilly books. Here we display those most relevant to data science. I haven't checked all the sources, but they seem legit. If you find any issues, let us know in the comment section below. Note that at DSC, we also have our free books:

Read more…

Guest blog post by Irina Papuc

During the last few years, the hottest word on everyone's lips has been "productivity." In the rapidly evolving Internet world, getting something done fast always gets an upvote. Despite needing to implement real business logic quickly and accurately, as an experienced PHP developer I still spent hundreds of hours on other tasks, such as setting up databases or caches, deploying projects, monitoring online statistics, and so on. Many developers have struggled with these so-called miscellaneous tasks for years, wasting time instead of concentrating on the project logic.

My life changed when a friend mentioned Amazon Web Services (AWS) four years ago. It opened a new door, and led to a tremendous boost in productivity and project quality. For anyone who has not used AWS, please read this article, which I am sure you will find worth your…

Read more…

10 tools and platforms for data preparation

Guest blog post by Zygimantas Jacikevicius

Traditional approaches to enterprise reporting, analysis and Business Intelligence, such as Data Warehousing, upfront modelling and ETL, have given way to new, more agile tools and ideas. Within this landscape, Data Preparation tools have become very popular, for good reason. Data preparation has traditionally consumed the bulk of most data projects' time: profiling, standardising and transforming data has been a very manual, error-prone task. This has derailed many Data Warehousing and analysis projects, as they become bogged down with infrastructure and consistency issues rather than focusing on the true value add - producing good quality analysis.

Fortunately, the latest generation of tools, typically powered by NoSQL technologies, takes a lot of this pain…

Read more…

Data Lakes Still Need Governance Life Vests

Guest blog post by Gabriel Lowy

As a central repository and processing engine, data lakes hold great promise for raising return on data assets (RDA).  Bringing analytics directly to different data in its native formats can accelerate time-to-value by providing data scientists and business users with increased flexibility and efficiency. 

But to realize higher RDA, data lakes still need governance life vests.  Without data governance and integration, analytics projects risk drowning in unmanageable data that lacks proper definitions or security provisions. …

Read more…

Guest blog post by Randall V Shane

The figure titled "Data Pipeline" is from an article by Jeffrey T. Leek and Roger D. Peng titled "Statistics: P values are just the tip of the iceberg." Both are well-known scientists in the field of statistics and data science, and for them there is no need to debate the importance of data integrity; it is a fundamental concept. Current terminology uses the term "tidy data," a phrase coined by Hadley Wickham in an article of the same name. Whatever you call it, as scientists, they understand the consequences of bad data. Business decisions today are frequently driven by the results of data analysis, and as such, today's executives must also understand these same consequences. Bad data leads to bad decisions. …

Read more…

What pays most: R, Python, or SQL?

Guest blog post by Laetitia Van Cauwenberge

Salary mostly depends on experience, education, location, industry and, unfortunately, factors such as gender. Also, most data scientists have all three skills and more (R + Python + SQL), so it is hard to assess which one is the most valuable.

(The numbers in the picture are from 2014.)

You could do a survey, asking data scientists which skills they were hired for, and break the results down into 8 categories (all combinations of the three skills; see the enumeration sketch after the list):

  • no R, no Python, no SQL
  • R, no Python, no SQL  
  • no R,…
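The 8 categories are just the 2^3 combinations of three yes/no skills. As a quick sketch (mine, not from the article), they can be enumerated programmatically:

  # Enumerate the 2^3 = 8 skill combinations used as survey categories.
  from itertools import product

  skills = ["R", "Python", "SQL"]
  for combo in product([False, True], repeat=3):
      print(", ".join(s if has else "no " + s for s, has in zip(skills, combo)))
  # "no R, no Python, no SQL", "no R, no Python, SQL", ..., "R, Python, SQL"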
Read more…

The Art of Modeling Names

Guest blog post by Kurt Cagle

This is the first in a series about cross format data modeling principles.

Names are Simple, Right?

In the data modeling realm, there is perhaps no example as ubiquitous as modeling personal names. After all, things don't get much simpler than a name:
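The example image isn't reproduced here; a representative instance (my reconstruction, with made-up values) would be something like:

  Person
    first name: Jane
    last name:  Doe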

Simple, right? Well, not so fast. This isn’t really a model, but rather an instance of a model - an actual example that proves out the model. There are, however, a number of ways that such a model can be described. If, for instance, we use a language such as RDF, this would be modeled as follows:

What you see is a set of assertions stating that there exists a class named "person", and that this class has two properties. The domain and range assertions on each property are important, because they indicate what…
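The original RDF listing isn't reproduced; a minimal sketch consistent with that description (the prefixes, URIs and property names are my assumptions) might read:

  # Sketch only: a "person" class with two properties, each carrying
  # domain and range assertions.
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:   <http://example.com/ns#> .

  ex:Person a owl:Class .

  ex:firstName a owl:DatatypeProperty ;
      rdfs:domain ex:Person ;
      rdfs:range  xsd:string .

  ex:lastName a owl:DatatypeProperty ;
      rdfs:domain ex:Person ;
      rdfs:range  xsd:string .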

Read more…

This is the second article in a series. The first article is available here.

How to implement a temporal database

Not every database requires a temporal database implementation, but some do. We can help you get started. As discussed in our previous article, the SQL-2011 standard included clauses for the definition of temporal tables as part of the SQL/Foundation. However, this standard is very new and not yet widely adopted. For now, most of you will need to extend your current database tables to incorporate temporal concepts.

In this article we'll focus on temporal tables. These tables are the building blocks of temporal databases.

Temporal Tables – The Important Theories

Theory #1: Valid-Time State Tables
From Wikipedia: “Valid time is the time period during which a fact is true with…
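A minimal sketch of such a table (the table and column names are my assumptions, not from the article):

  -- Each row records a fact plus the period [valid_from, valid_to)
  -- during which that fact was true in the real world.
  CREATE TABLE employee_salary (
      employee_id INT            NOT NULL,
      salary      DECIMAL(10, 2) NOT NULL,
      valid_from  DATE           NOT NULL,
      valid_to    DATE           NOT NULL,  -- a sentinel such as '9999-12-31' means "still true"
      PRIMARY KEY (employee_id, valid_from)
  );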

Read more…

The 3Ms of Marketing Data Analytics

Data is everywhere and growing. For a marketer, this is a dream world. Your audience is generating tons of data from your web visits and mobile apps, and your partners are giving you data as well. This is rich customer journey data.

However, commonly used terms like Big Data and Analytics can mask the complexity of realizing ROI from analyzing all this data. The promise of visual tools and Hadoop-based platforms is all great, but if you don't have a data strategy in place, all the tools are just shelfware.

Many smart CMOs and analysts I have worked with spend most of their time on getting their data right to begin with, working closely with their data team.

The key is to have a data strategy for your marketing data using the 3 Ms: Management, Metrics and Metadata.

1. Managing your data: Is all of your data organized such that it's easy for your analysts to access? Oftentimes there are tons of SQL and NoSQL/Hadoop…

Read more…

What is a temporal database?

A temporal database is a database with built-in support for handling data involving time.

This definition is from Wikipedia. It is simple and straightforward. Since we expect every database to have some kind of support for time operations, we could say that, by this definition, all databases are temporal databases.

The reality is a lot more subtle and complex. Temporal databases enable you to see the data as it was seen in the past, while also enabling you to update even the past in the future. A temporal database will show you the actual value back then, as it was known back then, and the actual value back then, as it is known now. A temporal database allows you to know what your organization was forecasting for the future at a certain time in the past. Temporal databases support a…
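To make the "as it was known back then" versus "as it is known now" distinction concrete, here is a hedged sketch (table and column names are assumptions) of the kind of query a bitemporal table supports:

  -- What did we believe on 2015-06-01 about the salary in effect on 2015-01-15?
  SELECT salary
  FROM employee_salary
  WHERE employee_id = 42
    AND valid_from <= DATE '2015-01-15' AND valid_to > DATE '2015-01-15'        -- the fact's validity
    AND recorded_from <= DATE '2015-06-01' AND recorded_to > DATE '2015-06-01'; -- as known then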

Read more…

R Packages: A Healthcare Application

Guest blog post by Divya Parmar

Building off my last post, I want to use the same healthcare data to demonstrate the use of R packages. Packages in R are stored in libraries and are often pre-installed, but reaching the next level of skill requires knowing when to use new packages and what they contain. With that, let's get to our example.

Useful function: gsub

When working with vectors and strings, especially in cleaning up data, gsub makes the job much simpler. In my healthcare data, I wanted to convert dollar values to integers (e.g., $21,000 to 21000), and I used gsub as seen below.

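The original screenshots aren't recoverable, but a plausible reconstruction of the call described above is:

  # Strip "$" and "," from currency strings, then convert to numeric.
  payments <- c("$21,000", "$3,500")
  cleaned  <- gsub("[$,]", "", payments)  # "21000" "3500"
  as.numeric(cleaned)                     # 21000 3500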

Package: reshape2

In…

Read more…

Why SQL Is Not Right for Querying JSON

Guest blog post by Kurt Cagle

Recently, creators of JSON databases have dealt with a fundamental problem. Simply storing and retrieving a given document by a specific key, while useful in any number of scenarios, is not so useful when people want to query that JSON content. This is not really that surprising - once you start gathering data into a database, one of the key requirements that emerges is the ability to query that data based upon specific properties.

Early on, the solution to this dilemma was to build indexes that would create specific look-up tables for certain key-value pairs. By ordering the indexes, you could also compare values for these keys, allowing you to sort within a range or to quickly order documents by their respective…
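To illustrate the idea, here is a toy sketch in Python (mine, not any particular database's implementation): an ordered index over one key lets you range over and sort documents by that key's value:

  import bisect

  # A handful of JSON-like documents keyed by id.
  docs = {
      "doc1": {"price": 19.99},
      "doc2": {"price": 5.00},
      "doc3": {"price": 12.50},
  }

  # Ordered index: sorted (value, doc_id) pairs for the "price" key.
  index = sorted((d["price"], doc_id) for doc_id, d in docs.items())

  # Range query: ids of documents with 10 <= price <= 20.
  lo = bisect.bisect_left(index, (10,))
  hi = bisect.bisect_right(index, (20, "\uffff"))
  print([doc_id for _, doc_id in index[lo:hi]])  # ['doc3', 'doc1']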

Read more…

Guest blog post by Vincent Granville

The comparison is performed on a data set where linear regression works well: salary offered to a candidate, based on the programming language requirements in the job ad - Python, R or SQL. This is a follow-up to the article on the highest paying programming skills. The increased accuracy of linear regression estimates is negligible, and well below the noise level present in the data set. The Jackknife method has the advantages of being more stable, easy to code, easy to understand (no need to know matrix algebra), and easy to interpret (meaningful coefficients).
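For readers new to the technique, here is a generic jackknife sketch in Python (an illustration of the resampling idea only, not necessarily the author's exact method): leave-one-out estimates of a regression slope, combined into a point estimate and standard error:

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=50)
  y = 2.0 * x + rng.normal(scale=0.5, size=50)  # true slope = 2

  # Leave-one-out slope estimates.
  n = len(x)
  slopes = np.empty(n)
  for i in range(n):
      mask = np.arange(n) != i
      slopes[i] = np.cov(x[mask], y[mask])[0, 1] / np.var(x[mask], ddof=1)

  est = slopes.mean()
  se = np.sqrt((n - 1) / n * np.sum((slopes - est) ** 2))
  print(est, se)  # jackknife estimate of the slope and its standard error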

Jackknife is not the…

Read more…

Guest blog post by Andreas Blumauer

Inspired by the development of semantic technologies in recent years, the traditional methodology of designing, publishing and consuming statistical datasets is evolving into so-called "Linked Statistical Data", which associates semantics with dimensions, attributes and observation values based on Linked Data design principles.

The representation of datasets is no longer a combination of magic words and numbers. Everything becomes meaningful when URIs take their place as dereferenceable resources, which further establishes the relations between resources implicitly and automatically. Different datasets are no longer isolated, and all datasets share a globally, uniquely and uniformly defined structure.
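As an illustration (the prefixes and the observation are my own example, not from the post), a statistical observation stops being a row of opaque codes and becomes a set of dereferenceable URIs:

  @prefix ex:             <http://example.com/ns#> .
  @prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
  @prefix sdmx-code:      <http://purl.org/linked-data/sdmx/2009/code#> .

  # The "magic word" M becomes the resolvable resource sdmx-code:sex-M.
  ex:obs1
      sdmx-dimension:sex sdmx-code:sex-M ;
      ex:population      63182000 .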

With “RDF Data Cube Vocabulary” (…

Read more…

3 Essential SQL Queries in 1.5 Minutes

Guest blog post by Matt Ritter

How often are you stuck waiting for someone else to pull data for you? There's nothing worse than missing a deadline because someone else didn't run a 30-second query. Never again - here are the basic queries that will get you numbers instantly for pivoting, graphing, and other applications of your analysis skills, as applied to an imaginary table of sales data:
 
Get It All

  select * from sales
 
In SQL, the simplest queries are often the most powerful. This grabs every row and every column. If the table is under 50,000 rows, you should have no problem opening it in Excel. If it's bigger, the program may slow down, depending on your RAM and what…
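The excerpt cuts off before the remaining queries, so the following is only a guess at the kind of thing that comes next (the column names on the imaginary sales table are hypothetical): a filtered select and a grouped aggregate:

  -- Hypothetical illustration; region and amount are assumed columns.
  select * from sales where region = 'West';

  select region, sum(amount) as total_sales
  from sales
  group by region;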

Read more…

Key-Object – A New Paradigm in Search?

Guest blog post by Bill Vorhies

Summary: The premise of this new Key-Object architecture is that search is broken, at least as it applies to complex merchandise like computers, printers, and cameras. An innovative and workable solution is described. The question remains: is the pain sufficient to justify a switch?

 

As we are all fond of saying, innovation follows pain points.  If you’re reviewing the hundredth social media/instant-messaging/photo-sharing app you might conclude that those so-called pain points identified by some tech innovators hardly rise to the level of owwies.  But what about search?  Are we missing something in our uber-critical search capabilities that needs to be…

Read more…

Including NoSQL, Map-Reduce, Spark, big data, and more. This resource includes technical articles, books, training and general reading. Enjoy the reading!


Here's the list (new additions - more than 30 articles - are marked with *):

  1. Hadoop: What It Is And Why It’s Such A Big Deal *
  2. The Big 'Big Data' Question: Hadoop or Spark? *…
Read more…

The Next Big Thing In Big Data: BDaaS

Guest blog post by Bernard Marr

We’ve had software as a service, platform as a service and data as a service. Now, by mixing them all together and massively upscaling the amount of data involved, we’ve arrived at Big Data as a Service (BDaaS).

It might not be a term you're familiar with yet – but it suitably describes a fast-growing new market. In the last few years many businesses have sprung up offering cloud-based Big Data services to help other companies and organizations solve their data dilemmas.


Some estimate that business IT spending on cloud-based,…

Read more…

Guest blog post by Kumar Chinnakali

In this new year 2016, we should be excited that the Apache Spark community has released and announced the availability of Apache Spark 1.6, the seventh release on the 1.x line.

  • Committers – The number of contributors to Spark has crossed 1,000, double the previous count.
  • Patches – Apache Spark 1.6 includes some 1,000 patches.
  • Run SQL queries on files – This feature lets users and applications run SQL queries directly on files, without creating a table first. It is similar to a feature available in Apache Drill. For example: select id from json.`path/to/json/files` as j.
  • Star (*) expansion for StructTypes – This feature makes it easier to nest and unnest arbitrary numbers of columns. It is pretty common…
Read more…

Guest blog post by Florian Douetteau

If you are a start-up director in 2016, not a week goes by without someone talking to you about unicorns.

It is difficult to imagine the feeling of riding a unicorn. Do you feel the wind of success rush against your skin? Do you fear the arrows and traps of pitiless hunters? Do you feel a special sense of excitement when you get to the end of the rainbow of profits?

I find it easier to imagine the state of mind of the directors of Cloudera, Hortonworks or MapR, who are now the leading Big Data unicorns.

Three years ago, Wikibon talked about Hadoop to describe the forthcoming war between Hortonworks and Cloudera. Their particular challenge then was to establish themselves as one of the…

Read more…
