Guest blog post by John P. Stevens
I recently wrote an article entitled ‘First Big Data initiative – why you need Big Data governance now!’ and one of the comments I received was from Bob Schork, a metadata expert and noted industry presenter and speaker on the subject. I had the privilege of working with Bob in the past and have benefited from his extensive metadata insights over the years. What prompted me to write this article was his comment about “metadata which is and will be ignored by many working on a BD (Big Data) project, to their own detriment.” This resonated with me because metadata is often taken for granted within Big Data projects and, for that matter, across the data management industry as a whole. This article will highlight why metadata is crucial to your Big Data project’s overall success and to your enterprise data architecture organization.
According to a recent Gartner review, unstructured data content represents as much as eighty percent of a firm’s total information assets. Managing the ever-growing volume of structured data and unstructured content effectively creates a competitive advantage, and many firms believe that failing to realize it will ultimately cost them market share or severely hamper their ability to serve ever-changing client requirements. Now, more than ever, firms need to capitalize on Big Data so that its power can be harnessed to drive critical future business decisions. Being able to effectively organize and categorize this information will ultimately deliver more intelligence to the business by enabling better and faster decision-making. A major problem is the sheer magnitude of these information assets and the number of disparate information silos, which can extend across multiple business units. Semi-structured and unstructured data content is very often spread across many internal operational systems, business applications, networks, servers and smart-connected devices.
When I talk to CIOs, metadata rarely comes up as a critical priority that they are currently addressing. They seem to keep it buried under mission-critical analytics and other data management priorities. But with the growing number of Big Data initiatives, the value of metadata is quickly coming to the forefront and is surfacing as a critical priority for Big Data success. The most important change that came with Big Data platforms such as Hadoop is that they are ‘schema-less’. This essentially means the data is stored without an accurate description of what it truly ‘is’. You will therefore need to establish an accurate and descriptive understanding of this data as you launch each new Big Data project. This is a challenge, and it will cause a wide range of potential issues if it is not addressed by identifying metadata for your Big Data initiatives at project startup. You need to recognize that an essential capability for developing and maturing Big Data processing services is a comprehensive enterprise metadata management program. Thus the importance of metadata for Big Data cannot be overstated.
Metadata and the ‘Big Data Gap’
According to a report published by IDC, and sponsored by EMC, metadata is one of the fastest-growing sub-segments of enterprise data management. The report focuses on the problem that while metadata is growing, it is not keeping pace with the rapid increase in Big Data projects currently being initiated by firms. IDC refers to this as the ‘Big Data gap’. Metadata can greatly streamline and enhance the processes used to collect, integrate, and analyze Big Data sources. Without metadata, firms can and will forfeit the deep insights that Big Data can yield. Metadata can track the entire data life cycle, along with the processes, procedures and customers or users that affect specific business information, and can provide an audit trail for any given point in time, which can be essential, especially for regulated businesses. Big Data metadata is the foundation for harnessing these vast amounts of data from new disparate data sources and information repositories before they become unmanageable.
Big Data metadata – Where’s the value?
Metadata is the information that describes other data – ‘data about data’. It is the descriptive, administrative and structural data that defines a firm’s data assets. That simple definition has been used by data practitioners for decades. Yet because metadata specifically identifies the attributes, properties and tags that describe and classify information, it would be more appropriately defined as ‘information about data’. It is represented by any number of characteristics associated with the information asset, such as type of asset, author, date originated, workflow state, and usage within the enterprise, among numerous others. Once defined, metadata conveys the value and purpose of the data content, and thus becomes an effective tool for quickly locating information – a must for Big Data analytics and business user reporting. But metadata can also identify ‘Little Data’ that ultimately provides structure to what becomes Big Data. A recent article in Harvard Business Review identified the three primary ways Big Data and Little Data differ:
- Focus: The focus of Big Data is to advance organizational goals, while Little Data helps individuals achieve personal goals.
- Visibility: Individuals can’t see Big Data; Little Data helps them see it better.
- Control: Big Data is controlled by organizations, while Little Data is controlled by individuals. Companies grant permission for individuals to access Big Data, while individuals grant permission to organizations to access Little Data.
But to realize the true value that metadata, or Little Data, brings to Big Data we need to look at structure: it helps us find data during data discovery and gives us a way to interpret and use Big Data accurately. Firms have been able to address metadata in the past because the more common data repositories, in the form of a data lake, data warehouse or normalized relational database, are structured. The data is organized into rows and columns, and the metadata model is 'native' to the structure. These data sources provide a logical structure through readily obtained metadata. Big Data does not have this 'native' metadata available, so metadata for new external data sources will be essential to unlock new meaning. Big Data will require processing through certain analytics to construct the beginnings of these new metadata definitions. For example, if you use Hadoop to capture your data, you do not have to specify the metadata at the time of data capture; you only need to define a unique key so you can get to the data when needed. But you will need to define the metadata eventually, and the Hadoop ecosystem provides HCatalog for that purpose. Once identified, this metadata can be correlated with metadata defined from other traditional (structured) data sources to provide a comprehensive metadata model for the entire enterprise.
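The idea of capturing data under a unique key first and defining its metadata later can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how Hadoop or HCatalog is actually programmed; the record contents, the `schema` mapping and the `read_with_schema` helper are all hypothetical names chosen for the example.

```python
import json

# Raw store: records are captured as opaque text under a unique key,
# with no schema declared at write time (hypothetical sample data).
raw_store = {
    "evt-001": '{"cust": "C17", "amt": "42.50", "ts": "2014-03-01"}',
    "evt-002": '{"cust": "C09", "amt": "7.25", "ts": "2014-03-02"}',
}

# Metadata defined later: raw field -> (business name, type converter).
# These names are illustrative, not an HCatalog table definition.
schema = {
    "cust": ("customer_id", str),
    "amt": ("order_amount", float),
    "ts": ("order_date", str),
}

def read_with_schema(key):
    """Fetch a raw record by key and interpret it using the schema."""
    raw = json.loads(raw_store[key])
    return {name: convert(raw[field])
            for field, (name, convert) in schema.items()}

record = read_with_schema("evt-001")
print(record)
```

The point of the sketch is that the stored text never changes; only the metadata applied at read time gives the fields business names and types that can be matched to other sources.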
Metadata can link your firm’s data assets by associating relevant criteria. It allows you to associate like data assets and disassociate dissimilar data assets across your various Big Data sources. Incorporating meaningful metadata attributes into semi-structured data and unstructured content makes these data assets more valuable, because irrelevant information can be dismissed during the search process. As you apply metadata to your search algorithms, you will be able to produce high-confidence results. This is particularly beneficial in Big Data initiatives, where standalone keyword-driven results can include an agglomeration of less relevant information. By leveraging this metadata association, Big Data and analytics users can quickly locate the right information despite the vast amount of content residing across and within these disparate repositories.
You can extend these searches across both structured data and unstructured content repositories your firm may own. The metadata can link all of the content related to one or more metadata attributes regardless of locality or format. As an example, metadata can provide information about a data item, such as product, that uniquely describes that item. A field like product ID is also a means for linking to other data sources, for cross-data integration purposes. Additionally, through metadata descriptors, we can relate data items in common terms and take advantage of this metadata to integrate and better understand our disparate Big Data sources. This approach also provides metadata consistently at the enterprise level.
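As a rough sketch of the product ID example, the Python below links a structured product table to unstructured documents that carry a matching metadata tag. All of the data, the `product_id` field name and the `link_by_product` helper are assumptions made for illustration.

```python
# Structured source: rows from a product table (hypothetical data).
catalog = [
    {"product_id": "P100", "name": "Router", "category": "Networking"},
    {"product_id": "P200", "name": "Laptop", "category": "Computing"},
]

# Unstructured source: free-text documents tagged with metadata.
documents = [
    {"doc": "review-17.txt", "meta": {"product_id": "P100", "author": "jlee"}},
    {"doc": "ticket-88.txt", "meta": {"product_id": "P200", "author": "mkhan"}},
    {"doc": "review-21.txt", "meta": {"product_id": "P100", "author": "asmith"}},
]

def link_by_product(catalog, documents):
    """Group document names under the catalog entry sharing their product_id."""
    linked = {row["product_id"]: {**row, "docs": []} for row in catalog}
    for d in documents:
        pid = d["meta"].get("product_id")
        if pid in linked:
            linked[pid]["docs"].append(d["doc"])
    return linked

linked = link_by_product(catalog, documents)
print(linked["P100"]["docs"])
```

The documents themselves are never parsed here; the shared metadata attribute alone is enough to relate them to the structured record, regardless of where or in what format they are stored.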
It is very important that metadata allow you to create and maintain data consistency. Many Fortune 1000 firms still have trouble just defining a common term such as ‘customer’. If your firm has disparate data stores and dispersed business units, the term can easily be interpreted differently across the enterprise, and thus assessed differently. Even if each source is defined correctly, the context of the same data element may change in different application areas. This is a problem in most organizations today and, if not addressed, will affect the integrity of your enterprise reporting and search results. There are two approaches to resolving this issue: rename or tag application terms to be more specific, or roll those application names up to a more abstract name at the sector or even the enterprise level. As identified in the next section, this is where a metadata repository would be extremely useful. By administering metadata, firms can define a consistent definition or business rule for a specific data attribute and apply it at the enterprise level so that it can be used against both structured and unstructured data stores. Metadata ensures a more accurate picture of data across your enterprise and supports this level of data consistency for Big Data analytics and business applications.
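Both approaches to the ‘customer’ problem can be illustrated with a small glossary lookup in Python. The application names, field names and the `to_enterprise_term` function are hypothetical; a real metadata repository would hold far richer definitions and business rules.

```python
# Hypothetical glossary: (application, local field) -> enterprise term.
glossary = {
    ("billing", "acct_holder"): "customer",
    ("crm", "client"): "customer",
    ("support", "caller"): "customer",
    ("crm", "prospect"): "lead",
}

def to_enterprise_term(application, field):
    """Roll an application-specific name up to its enterprise term.

    Names with no glossary entry are returned qualified by application,
    which corresponds to the 'rename to be more specific' approach.
    """
    return glossary.get((application, field), f"{application}.{field}")

print(to_enterprise_term("crm", "client"))
print(to_enterprise_term("hr", "employee"))
```

Rolling up gives reports one shared term to query on, while the qualified fallback keeps unmapped names unambiguous until a steward decides where they belong.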
Some Big Data metadata support considerations – BPEL, RDF and metadata repositories
Big Data metadata design tools can greatly help to visualize new data flows. A very efficient means of visualizing the instructions for Big Data and metadata handling is a data mapping service. A data mapping service connects source and destination fields visually while applying the business logic for the data transformation process, which can be shown as an integration flow diagram. Business Process Execution Language (BPEL), an OASIS standard for describing executable business processes, is a popular way to define the logic behind an integration flow diagram; such flows can also be nested to call other integration flows or other BPEL services. Utilizing these additional capabilities can lead to improved data handling and overall Big Data metadata efficiency. Examples of such tools include Eclipse BPEL Designer, Apache ODE, and offerings from Oracle and IBM WebSphere, among many others.
A future Big Data metadata management consideration would be to incorporate the Resource Description Framework (RDF) model. Although RDF has been around since its inception in 1999, it has recently gained popularity in Big Data circles as a model for representing metadata. The RDF model breaks data into ‘triples’, each consisting of a subject, a predicate and an object. This keeps the data and the metadata tightly coupled, so querying is more straightforward. One of the more widely implemented RDF query languages today is SPARQL (SPARQL Protocol and RDF Query Language), whose syntax resembles SQL. Although very promising, RDF can be time-consuming to implement if you are embarking down this path for the first time. If you do, make sure you properly assess the time and costs involved before you start your initial RDF implementation.
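To make the triple idea concrete, here is a minimal sketch in plain Python. It deliberately avoids any RDF library or SPARQL engine; the identifiers such as `doc:report-7` and `meta:author` are made up for the example, and the `match` helper only imitates the spirit of a SPARQL triple pattern.

```python
# Each fact is a (subject, predicate, object) triple; data is hypothetical.
triples = [
    ("doc:report-7", "meta:author", "person:jstevens"),
    ("doc:report-7", "meta:workflowState", "approved"),
    ("doc:memo-3", "meta:author", "person:jstevens"),
    ("doc:memo-3", "meta:workflowState", "draft"),
]

def match(pattern, store):
    """Return triples matching a pattern; None acts as a wildcard,
    loosely like a variable in a SPARQL triple pattern."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Which assets were authored by person:jstevens?
by_author = match((None, "meta:author", "person:jstevens"), triples)
subjects = [s for s, _, _ in by_author]
print(subjects)
```

Because every statement has the same three-part shape, the same `match` call answers questions about authors, workflow states or any attribute added later, which is part of what makes the model attractive for metadata.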
If you need to manage metadata on an enterprise scale, you will need to create and implement a metadata repository. There are three approaches to building one. A central metadata repository is the most widely implemented today. This approach provides managed scalability for capturing new metadata and allows high-performance access. The distributed metadata repository has evolved over the years, especially for firms that have decentralized business units. It enables users to retrieve metadata from all repositories in real time. The hybrid approach is slowly evolving and combines characteristics of the prior two. It supports real-time access to other repositories while also providing a central source for maintaining firm-wide metadata definitions. Whichever approach you implement, you will still need to address semantic integration. The new challenge arises when you bring Big Data into the picture, because of the diversity of its content. Regardless of approach, you will need to associate this content with the information itself and accurately align the rules by which the content is interpreted. Once your enterprise metadata repository is in place and matures, it can provide the specific benefits of comprehensive traceability, logical-to-physical definitions and links, cross-firm business terms, and process models as well as data model elements. But the skills needed to integrate these metadata constructs are hard to come by, and this is a challenge most firms will need to address early, when initially implementing their enterprise metadata repository, rather than later.
Metadata management must be part of your overall enterprise data governance practice. It is a critical component of any robust data governance practice. One way to support this is to establish data stewardship for metadata. Stewarding metadata will further ensure data consistency across the enterprise and support accurate Big Data analytics decision-making. Stewardship is necessary for implementing enterprise data governance practices, since it gives the users of this data value and a context for understanding the data and its components. Some of the major responsibilities of the metadata steward include documenting the context of the data content (data heritage and lineage) and the data definitions for data store entities and attributes, identifying the relationships between data, and validating data timeliness, accuracy and completeness. Metadata stewards will also assist in the development of data compliance and the associated legal and regulatory controls for data governance adherence. Maintaining proper metadata governance will contribute to the success of your Big Data initiative and help ensure full realization of the business value of the firm’s data assets.
As Big Data utilization in the future increases, new types of metadata will arise to meet the special requirements of different and evolving market segments provisioning Big Data. Implementing a metadata-driven approach and management program for your enterprise, to support structured data and Big Data, provides highly visible and critical value to your firm in the establishment of overall data consistency and ensuring better understanding of data relationships. When in place, enterprise and business initiatives will achieve greater returns through the leveraging of faster access to precise data content that resides in large diverse Big Data stores and across the various data lakes, data warehouses and relational database repositories that are of primary importance to your enterprise.