Big Data = 3 data issues

Originally posted on Data Science Central

There are at least two definitions of Big Data: a broad one and a strict one. In the broad sense, Big Data includes all the data available on earth. In the strict sense, Big Data is a term for datasets so large that traditional data processing applications cannot handle them. We follow the strict definition here.

According to Davenport [2014], “Big Data refers to data that is too big to fit in a single server, too unstructured to fit into a row-and-column database, or too continuously flowing to fit into a static data warehouse.” This definition matches the 3V definition: Volume (too big), Velocity of updates (too continuously flowing) and Variety of formats (too unstructured).

In their definition of multimedia, Chapman and Chapman [2000] distinguish between dynamic media (video, animation and audio) and static media (photography, graphics and text).

For text, the usual dichotomy is between structured and unstructured. Structured text includes the traditional SQL databases, data warehouses and XML databases, as well as the more recent NoSQL databases. Unstructured text includes the plain text found in e-mails, forums, blogs and wikis. Event logs are a growing form of text that is neither tagged nor specially formatted.

Given the data taxonomy in Figure 1, Big Data should include three data sub-types: video, NoSQL and event logs.

Figure 1 – Data taxonomy and Big Data

Big Data, presented as a single broad term, is actually the union of three different issues. Aggregating different problems makes it harder to find answers. Big Data should therefore no longer be treated as a whole, but divided into three sub-types: (i) too unstructured or complex: video; (ii) too big: NoSQL databases of web companies; (iii) too many updates: event logs.

We believe SQL databases and data warehouses cannot be classified as Big Data, since they support traditional applications. As for the analysis of plain text in web pages, e-mails, forums, blogs and wikis, Information Retrieval has made considerable progress, so we do not include it in the Big Data challenge. However, some authors mix the traditional and the new applications, creating an even greater Big Data challenge.

The technology-driven idea of massively parallel processing is not a solution in itself, since we face three different problems, one for each data sub-type. The right approach to the Big Data challenge is to divide the problem into three sub-problems.

Video

With the advent of Web 2.0 (a web of people), combined with mobile devices and the Internet of Things (surveillance cameras, video security systems, health care), the production of video is growing exponentially.

While searching text is a fairly easy computing task, image processing involves highly complex algorithms and is consequently very time-consuming. When searching images, the human-machine interface is not direct: the dense and complex format of an image makes it hard for the user to put questions to the system. In video search, with dozens of frames (images) per second, the difficulty grows substantially.
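As a rough illustration of that growth, the sketch below counts the still images a frame-by-frame search would have to inspect. The rate of 25 frames per second and the one-hour duration are assumptions chosen for the example, not figures from the sources cited here.

```python
def frames_to_scan(duration_seconds: float, frames_per_second: float) -> int:
    """Number of still images a frame-by-frame search has to inspect."""
    return round(duration_seconds * frames_per_second)


# Assumed values for illustration only: one hour of video at 25 frames per second.
ONE_HOUR_SECONDS = 60 * 60
ASSUMED_FPS = 25

print(frames_to_scan(ONE_HOUR_SECONDS, ASSUMED_FPS))  # 90000
```

Under these assumptions, a single hour of video already amounts to 90,000 images, each of which carries the image-search difficulties described above.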

Despite these difficulties, video analysis, and human facial recognition in particular, is becoming a major area of development.

NoSQL

NoSQL technology emerged at web companies such as Google, Amazon and Facebook to overcome the limitations of the traditional, 30-year-old relational database technology.

NoSQL solutions fall into several groups:

  • Key-value storage, such as Voldemort, used at LinkedIn
  • Big-Table storage, such as HBase or Cassandra, used at Facebook
  • Document storage, such as CouchDB or MongoDB
  • Graph storage, such as InfiniteGraph or Neo4j
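
To make the four groups more concrete, the sketch below models the same small piece of information in each style, using only plain Python data structures. It does not use the API of any of the products listed above; the record contents, the names `key_value_store`, `wide_column_store`, `document_store`, `graph_nodes` and `graph_edges`, and the helper `followers_of` are all hypothetical.

```python
import json

# Key-value: the store only sees an opaque value addressed by a single key.
key_value_store = {"user:42": json.dumps({"name": "Ana", "city": "Lisbon"})}

# Big-Table (wide-column) style: row key -> column family -> column -> value.
wide_column_store = {"user:42": {"profile": {"name": "Ana", "city": "Lisbon"}}}

# Document: self-describing records whose fields may differ from one
# document to the next, with no schema declared up front.
document_store = [
    {"_id": 42, "name": "Ana", "city": "Lisbon"},
    {"_id": 43, "name": "Rui", "follows": [42]},
]

# Graph: nodes plus labelled edges between them.
graph_nodes = {42: {"name": "Ana"}, 43: {"name": "Rui"}}
graph_edges = [(43, "FOLLOWS", 42)]


def followers_of(user_id):
    """Names of the users with a FOLLOWS edge pointing at user_id."""
    return sorted(
        graph_nodes[source]["name"]
        for source, label, target in graph_edges
        if label == "FOLLOWS" and target == user_id
    )


print(json.loads(key_value_store["user:42"])["city"])
print(wide_column_store["user:42"]["profile"]["name"])
print(followers_of(42))
```

The two entries in `document_store` deliberately carry different fields, which is the kind of schema flexibility usually associated with these systems.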

On the one hand, NoSQL offers features for dealing with large volumes of data, using algorithms that scale better. It also allows great flexibility in updating the schema at runtime. On the other hand, SQL technology has a large pool of experienced consultants, which allows rapid implementation, and a unique ability to integrate applications.

NewSQL, a new class of relational databases oriented to high-performance applications on Web-based architectures, aims to unify these approaches.

Event logs

Event logs are the information about the execution of a computer program that is recorded in files.

The Internet of Things (IoT) is a network of physical objects embedded with software and electronics. Each thing is uniquely identifiable by an IP address and connected to the Internet. In the IoT, networks of sensors allow the digital to meet the physical (or, more humorously, the bit to meet the atom). The increasing use of sensors in surveillance, industry, logistics and the IoT magnifies the need for event log analysis.

There are many approaches to dealing with event logs. To single out one of them, we can mention Process Mining. Process Mining is a Data Mining technique that uses event logs to find sequences of data in organizations, and it is one of the more relevant approaches in Business Process Management technology.
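
As a minimal sketch of what finding sequences in an event log can look like, the code below counts how often one activity is directly followed by another within the same case. This directly-follows count is a common starting point in Process Mining, but it is only one simple building block, not a full discovery algorithm. The log format (case id, activity, timestamp), the sample entries and the function name `directly_follows` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case id, activity, timestamp).
event_log = [
    ("order-1", "receive", 1),
    ("order-1", "check", 2),
    ("order-1", "ship", 3),
    ("order-2", "receive", 1),
    ("order-2", "check", 4),
    ("order-2", "reject", 5),
    ("order-3", "receive", 2),
    ("order-3", "check", 3),
    ("order-3", "ship", 6),
]


def directly_follows(log):
    """Count pairs (a, b) where activity b directly follows a in the same case."""
    # Group activities by case, ordered by timestamp within each case.
    traces = defaultdict(list)
    for case_id, activity, _timestamp in sorted(log, key=lambda e: (e[0], e[2])):
        traces[case_id].append(activity)

    # Count each pair of consecutive activities across all traces.
    counts = Counter()
    for activities in traces.values():
        counts.update(zip(activities, activities[1:]))
    return counts


for (first, second), count in sorted(directly_follows(event_log).items()):
    print(f"{first} -> {second}: {count}")
```

For this toy log, "receive" is followed by "check" three times, "check" by "ship" twice, and "check" by "reject" once, which already hints at the main path and its exception.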

Conclusion

After this reflection, we can arrive at a more pedagogical definition of Big Data: a dataset that is too difficult to search because it is:

  • too big
  • or with a too complex format
  • or with too many updates

Notice that the disjunction “or” is used on purpose, instead of a conjunction that would combine the three conditions. For each specific case, the most suitable algorithms should be chosen, instead of mixing all the concepts. Big Data is not one, but three data issues.
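
The disjunctive definition can be written down as a small check that reports which of the three issues applies to a dataset, so that each can be handled separately. This is only a sketch: the `DatasetProfile` fields and both threshold constants are hypothetical values picked for the example, and using `is_video` as a stand-in for "too complex a format" is a simplification.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    size_terabytes: float
    is_video: bool
    updates_per_second: float


# Hypothetical thresholds, chosen only to make the example concrete.
MAX_SINGLE_SERVER_TB = 10.0
MAX_UPDATES_PER_SECOND = 10_000.0


def big_data_issues(profile: DatasetProfile) -> list:
    """Return every one of the three issues that applies to the dataset."""
    issues = []
    if profile.size_terabytes > MAX_SINGLE_SERVER_TB:
        issues.append("too big")
    if profile.is_video:
        issues.append("too complex a format")
    if profile.updates_per_second > MAX_UPDATES_PER_SECOND:
        issues.append("too many updates")
    return issues


def is_big_data(profile: DatasetProfile) -> bool:
    """Disjunction: a single issue is enough to qualify."""
    return bool(big_data_issues(profile))


sensor_log = DatasetProfile(size_terabytes=0.5, is_video=False, updates_per_second=50_000)
print(is_big_data(sensor_log), big_data_issues(sensor_log))
```

Returning the list of issues rather than a single flag reflects the point of the article: the answer that matters is which of the three problems is present, since each calls for different algorithms.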

References

Chapman, N. and Chapman, J. (2000), Digital Multimedia, John Wiley & Sons.

Davenport, T.H. (2014), Big Data at Work: Dispelling the Myths, Uncovering the Opportunities, Harvard Business Review Press.