Guest blog post by Charlie Silver, CEO of Algebraix Data. Originally entitled 'Data Algebra Does Big Data'.

Algebra is powerful. It enables people to solve for unknowns and frame problems in ways that are universally understandable. For the same reason, data algebra is powerful. Why? Because it can represent data – all data – mathematically.

What is Data Algebra (and when do I use it)?

Data algebra starts small. It designates the fundamental unit of data as a couplet, which you can think of as a value (for example, “28”) associated with a qualifier (for example, “countries”). The value alone has no meaning, but by attaching a qualifier – another item of data that reveals the meaning of the value – you have a couplet, a structure that is well-defined in mathematical terms and can readily be treated mathematically.

If you write the couplet as (28, countries), it might indicate that there are 28 countries in the European Union, as indeed there are. But it might not. It might indicate that there are 28 countries in NATO. Or that you have visited 28 different countries in the last decade, and so on.

In other words, to add context to the data you need to qualify the couplet again. And to store the data in a computer, you need to add another qualifier that says where the stored data is located, so you can retrieve it whenever you need to.
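The nesting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Couplet` class, the `unwrap` helper, and the storage location string are hypothetical names invented for the example, not part of any published data algebra library.

```python
from typing import Any, NamedTuple


class Couplet(NamedTuple):
    """A value paired with a qualifier that gives it meaning."""
    value: Any
    qualifier: Any


# The bare couplet from the text: 28 qualified by "countries".
count = Couplet(28, "countries")

# Qualifying the couplet again adds context.
in_context = Couplet(count, "European Union members")

# One more qualifier records where the data is stored.
# The location string is a made-up placeholder.
stored = Couplet(in_context, "file:facts.dat#row=7")


def unwrap(c):
    """Return the raw value and its qualifiers, outermost first."""
    qualifiers = []
    while isinstance(c, Couplet):
        qualifiers.append(c.qualifier)
        c = c.value
    return c, qualifiers


raw, qualifiers = unwrap(stored)
print(raw, qualifiers)
```

Each layer is itself a couplet, so the same structure carries the value, its meaning, its context, and its location.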

Managing “Little Data”

Mathematically, the unit of data doesn’t get much more complicated than that. And from a mathematical perspective, it’s useful because it can rigorously define data. But if all you’re dealing with is a few items of data, or even a few hundred items, defining them mathematically is overkill. That amount can simply be written down in a document.

When the numbers go up, and the relationships between the various types of data get a little more complicated, applying math is still unnecessary – a spreadsheet can manage the task. Not only can a spreadsheet store larger amounts of data, but it also lets you manipulate the data in useful ways. Nowadays, a spreadsheet can easily accommodate 100,000 rows of data.

You can perform various mathematical operations on data in a spreadsheet, such as counting the occurrence of particular values, grouping it in various ways, adding up values, and more, but this is not the same as defining and manipulating it algebraically.

When it comes to graph data – that is, data expressing specific relationships between data entities – a spreadsheet is less useful, even for relatively small volumes of data. But there’s another option: switching to a graph database. This approach lets you process graph data in productive ways. In this sense, a graph database is not that different from a spreadsheet: both are practical tools for working with data, and neither defines the data algebraically.

Think of these situations as managing “little data,” and the software that exists right now is good enough for using relatively small amounts of data productively.

Managing “Big Data”

Data algebra becomes applicable when data complexity and volumes start to sharply increase. That is, when you’re dealing with Big Data. For example, let’s say you want to select a set of data from a large database. Data algebra can define the data file precisely, and then define the query you want to run against the data precisely, and finally deliver the answer precisely – and do it all rapidly.
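As a rough illustration of what defining the data and the query precisely can mean, the sketch below models a data set as a set of records, each record as a set of (value, qualifier) couplets, and a query as a composition of set operations. It is a toy under stated assumptions, not the author's or Algebraix's actual algebra; the `record`, `select`, and `project` helpers and the sample data are hypothetical.

```python
def record(**fields):
    """Build a record as a frozenset of (value, qualifier) couplets."""
    return frozenset((value, qualifier) for qualifier, value in fields.items())


# A tiny data set: a set of records.
data = {
    record(name="France", bloc="EU"),
    record(name="Norway", bloc="NATO"),
    record(name="Germany", bloc="EU"),
}


def select(dataset, couplet):
    """Selection: the subset of records that contain the given couplet."""
    return {r for r in dataset if couplet in r}


def project(dataset, qualifier):
    """Projection: the set of values carrying the given qualifier."""
    return {v for r in dataset for (v, q) in r if q == qualifier}


# The query is an expression over sets, so its result is fully
# determined by the definitions above.
eu_names = project(select(data, ("EU", "bloc")), "name")
print(sorted(eu_names))
```

Because both the data and the query are ordinary set expressions, the result does not depend on how or where the records happen to be stored.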

These processes are what database software is designed to do, by employing statistical techniques and clever algorithms that try to determine the fastest way to get the data. However, there’s a limit to how much database software can do. As time goes by and data volumes get larger, older database products run into trouble because of assumptions that were made in their design. The nature of hardware changes. The speed of CPUs changes. The speed of memory changes. And storage changes (witness the recent emergence of solid state storage). Older software has trouble keeping up with all these changes, so new database software has to be developed.

Today, there are well over 200 different database software products that run the gamut from very old to very new. Regardless of age, all these products are trying to solve the same problem: how to store and retrieve Big Data as quickly and efficiently as possible.

The Difference That Makes a Difference

If you try to write a job description for a database, it becomes clear that it has to solve multiple problems:

  • The amount of data (the volume problem)
  • The arrival speed of new data (the ingest problem)
  • The complexity of the data and metadata (the variety problem)
  • The different kinds of requests that are made for data (the workload problem)
  • The number of simultaneous requests that are made (the concurrency problem)
  • The required retrieval speed (the performance problem)

That’s a lot of difficult and important problems to solve, which is why using data algebra can make such a significant difference. By representing data algebraically, you can define everything in the computer domain mathematically, including the capacity and speed of hardware, the speed of software, the workloads being executed, the service level required for any given transaction, and so on.

Data algebra covers everything with mathematics, and this makes it possible to build software that is optimized for specific situations, because you can prove mathematically that it is.

Math Is Ageless

The fact is that even the most talented software engineers will be outdone by mathematics. Big Data may seem massive right now, but it’s actually in its early days. As data volumes and problems grow, a mathematical approach will become a necessity. Math already dominates in other spheres of engineering, and it is only a matter of time before it dominates the engineering of Big Data software. The algebra of data will become the foundation for the data economy.  

Originally posted on Data Science Central