
Guest blog post by John P. Stevens

You would be surprised how many major firms that are now ready to launch their first Big Data projects have not started an internal Big Data governance program. Big Data has great potential to help organizations grow their business, but just having this data will not be enough. To derive the highest value and minimize legal exposure and risk to a firm’s reputation, a Big Data governance program is crucial from the start. Even if data governance is currently in place, it is most likely inadequate to address the new challenges that many Big Data initiatives bring. Big Data will, more than likely, push past the boundaries of the data governance processes you have today. If you don’t have any Big Data governance in your firm at this time, you must get started now, before you engage in your first Big Data initiative.

Big Data involves many forms of information from non-relational databases and other types of unstructured data - information generated by social media, video, imaging and every imaginable smart connected device. But as Big Data sources expand incrementally, they bring additional risks with them. The one critical approach to managing this risk exposure is the creation of a viable Big Data governance program. Big Data has a unique set of characteristics that affect how it needs to be governed.

Without Big Data governance policies defined specifically for this purpose, it will quickly become impossible for your organization to search, classify, and manage these petabytes of information in a truly risk-averse manner. A structured and concise set of Big Data governance policies will ensure that your firm can protect and increase the true value of Big Data and use it to drive the company's future vision, growth and success.

Let’s now focus on the following four major Big Data governance areas that you need to address before starting your first Big Data initiative.


1)    Regulatory, Compliance and Privacy 

These laws are abundant globally. In the United States alone, some of the major laws, regulations and standards include the Dodd-Frank Act, the Payment Card Industry Data Security Standard (PCI DSS), the Health Insurance Portability and Accountability Act (HIPAA), the Federal Information Security Management Act (FISMA) and the Sarbanes-Oxley Act (SOX), among many others. Failure to address all applicable legal, regulatory, and compliance requirements in relation to Big Data could result in serious legal liabilities, leading to major fines, a loss of business and customers, and ultimately lasting damage to the firm’s reputation. Individual data privacy laws also need to be highlighted in relation to the use of Big Data, since these vary by locale. In Europe, data privacy protection regarding Big Data was introduced to put safeguards around processing so that an individual’s data is not compromised. The ‘EU Data Protection Reform and Big Data’ was introduced in April of 2015 and stressed the ‘data protection by design principle for the architects of Big Data analytics to use techniques such as anonymization, pseudonymization, encryption, and protocols for anonymous communications’. You will need to assess the level of Big Data encryption and anonymization used in your organization today. Before data is uploaded to the cloud or other storage media, it should be anonymized so that any identifiers mapped to individuals are removed from these data sets. Privacy Preserving Data Mining (PPDM) is a Big Data strategy that was created to protect sensitive information from unsolicited or unsanctioned disclosure while preserving the utility of the data collected from consumers. Although mainstream Big Data product vendors are just starting to introduce more robust tools to specifically address compliance, it is still mainly a manually defined process.
If you don’t know which regulatory and compliance exposures to address, have your firm’s legal department identify which laws to focus on and which Big Data and metadata definitions and rules apply to each of your Big Data source providers.
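
To make the idea of removing identifiers before upload more concrete, here is a minimal Python sketch. It drops direct identifiers and replaces a record ID with a keyed hash. The field names (`name`, `email`, `phone`, `customer_id`) are hypothetical, and strictly speaking this is pseudonymization rather than full anonymization: anyone holding the secret key can recompute the mapping, so your legal department should confirm whether it is sufficient for your data.

```python
import hashlib
import hmac

# Hypothetical direct identifiers; adjust to your own schema.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def pseudonymize(record: dict, secret_key: bytes, id_field: str = "customer_id") -> dict:
    """Return a copy of the record with direct identifiers dropped and the
    ID field replaced by a keyed hash (HMAC-SHA256)."""
    # Drop fields that identify an individual directly.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Replace the ID with a stable pseudonym so records can still be joined.
    if id_field in cleaned:
        digest = hmac.new(secret_key, str(cleaned[id_field]).encode("utf-8"), hashlib.sha256)
        cleaned[id_field] = digest.hexdigest()
    return cleaned
```

The same key always yields the same pseudonym for a given ID, which preserves the ability to join data sets. The key itself then becomes sensitive and needs to be covered by your access and security policies.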


2)    Data Access, Protection and Security 

A survey administered by Protegrity, a provider of enterprise data security solutions, at last year’s Strata & Hadoop World Summit found that 86 percent of the more than 150 respondents agreed that data security was a crucial requirement for their Hadoop data lake or hub. Protegrity CEO Suni Munshani stated that "89 percent of the Big Data professionals we surveyed disagree, or are not sure that security tools native to Hadoop provide enough protection for their sensitive data, it demonstrates a tremendous need for increased education around Big Data security and the availability of more robust data security solutions for Hadoop".  In terms of usage, 80 percent of those surveyed indicated that their organizations are already using Hadoop in production environments. Your governance policy should address the level of control and amount of access granted to each Big Data consumer you administer (e.g. who, what, when & where). Big Data access, protection and security are a key concern, since the data is exposed at every level of Big Data activity. A concise governance policy is needed to define the different types of data that must be secured at all times. Protecting Big Data is a major security challenge in its own right. Proper data protection policies should also include backing up the data and protecting it from corruption. Data heat maps can provide useful insight when you review your snapshot and data mirroring policies. Big Data sets with high-volume update and delete activity are prime candidates for a high-frequency snapshot policy, which provides a level of insurance in case critical data is accidentally overwritten or deleted. For added data protection, you should also review the mirroring of frequently used data volumes to make sure this data is available at an alternative location.
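
As a rough illustration of the "who, what, when & where" dimensions of an access policy, the Python sketch below models each rule as a small record and checks a request against the rule set. The class name `AccessRule`, the function `is_allowed` and all role, dataset and location values are assumptions made for this example. They are not the API of Hadoop or of any security product, which would normally enforce such rules for you.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class AccessRule:
    role: str             # who
    dataset: str          # what
    start: time           # when: window start (inclusive)
    end: time             # when: window end (exclusive)
    locations: frozenset  # where


def is_allowed(rules, role, dataset, at, location):
    """Return True if any rule grants this role access to the dataset
    at the given time of day and from the given location."""
    return any(
        r.role == role
        and r.dataset == dataset
        and r.start <= at < r.end
        and location in r.locations
        for r in rules
    )


# Hypothetical rule: analysts may read clickstream data from the office, 08:00-18:00.
rules = [AccessRule("analyst", "clickstream", time(8), time(18), frozenset({"office"}))]
```

With these rules, `is_allowed(rules, "analyst", "clickstream", time(9, 30), "office")` is granted, while the same request at 19:00 or from a location not listed in the rule is denied. Anything not explicitly granted is refused, which is usually the safer default for sensitive data.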


3)    Data Validity and Quality 

Did you do due diligence on the data sources you are using for Big Data initiatives? Do you know that public and open-domain data providers will not readily accept liability for losses arising from reliance on their data or from its quality? Data obtained from contracted third parties may also raise numerous data validity concerns. You will additionally need to verify that the metadata is available, interpretable and complete enough for a proper understanding and use of the data. Such data quality issues can, and will, affect the predictive data analytics processes that your firm might use to drive critical client or customer decisions. Your data maintenance team could spend most of its time just working out which rules to apply to cleanse data from these sources. Even after multiple scrubbings, your data clean-up team still may not be able to determine the complete accuracy of the data, leading to more erroneous results. Make sure you can define Big Data quality metrics that measure and enforce adherence to your Big Data governance policies. You must also ask whether Big Data stewards and consumers truly understand the data well enough to assess its overall quality, data definitions and completeness.  If the answer is yes then, and only then, can you be assured they can use the data to make the necessary critical business, operational, client and strategic decisions.  You need strict data governance policies that look closely at all your Big Data providers, since your business consumers will constantly pressure you to accept tainted data as just ‘the cost of doing business’ with a public or third-party data provider.  Invalid data analysis and misinterpretation of Big Data that result from poor data quality can be a major detriment to the success of your future Big Data initiatives and, ultimately, to your firm.
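
One simple example of a measurable data quality metric is field completeness: the share of records in which a required field is actually filled in. The Python sketch below computes it and flags the fields that fall below a threshold you set. The function names and the threshold values are illustrative assumptions, and completeness is only one of several metrics (validity, timeliness and consistency are others) that a governance policy might define.

```python
def completeness(records, field):
    """Fraction of records where the field is present and not None or empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def failing_fields(records, thresholds):
    """Return {field: score} for every field whose completeness
    is below the minimum given in thresholds."""
    scores = {f: completeness(records, f) for f in thresholds}
    return {f: s for f, s in scores.items() if s < thresholds[f]}
```

For instance, given four records of which only two carry a non-empty `zip` value, `completeness(records, "zip")` is 0.5, and a policy threshold of 0.9 would flag that field for review before the data feeds any analytics.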


4)    Volume and Data Retention 

Unless your retention periods are already dictated by legal or regulatory expiration timeframes, with penalties for non-compliance, you will need to look seriously at defining Big Data volume limitation and retention policies.  Big Data volumes alone can have a huge impact on yearly infrastructure budgets if you need to store the data for an indefinite period of time. Identifying the complete Big Data lifecycle is relevant when addressing Big Data governance, and ‘end-of-use’ issues should not be overlooked. Previous data retention schedules may not suit all of your Big Data requirements. Different types of data will have different retention requirements, so you’ll need a concise data governance policy that defines how long each of these disparate types of data will be kept. Next, you will need to identify the policies that define when to archive this data and how future users can access it once archived (include this in your Big Data access policies as well). If the data is archived with an offshore or domestic third party, what data governance policies are in place to ensure alignment with yours?
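
A retention policy of this kind can be written down as a small table of data types and time limits. The Python sketch below is one possible shape for it, deciding whether a data set should be kept, archived or deleted based on its age. The categories and the day counts are invented for illustration. Real retention periods must come from your legal and compliance teams.

```python
from datetime import date

# Hypothetical categories and periods, for illustration only.
RETENTION_POLICY = {
    "clickstream": {"archive_after_days": 90, "delete_after_days": 365},
    "transactions": {"archive_after_days": 365, "delete_after_days": 7 * 365},
}


def retention_action(category, created, today):
    """Return 'keep', 'archive' or 'delete' for a data set of the given
    category, based on how many days have passed since it was created."""
    policy = RETENTION_POLICY[category]  # unknown categories raise KeyError
    age_days = (today - created).days
    if age_days >= policy["delete_after_days"]:
        return "delete"
    if age_days >= policy["archive_after_days"]:
        return "archive"
    return "keep"
```

Raising an error for an unknown category is a deliberate choice in this sketch: data that no policy covers should be surfaced and classified rather than silently kept forever.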

Addressing the above four Big Data governance risk areas should also protect you when going through internal or external Big Data audits.  But how robust are your current data auditing tools and audit control processes with regard to Big Data? Do you have them at all? Look at implementing audit logging to help you understand and monitor Big Data cluster usage, and at establishing additional audit controls for access breaches and violations, data activity levels and compliance infractions, among others, to support the four Big Data governance risk areas reviewed above. MapR 5.0 and Apache Oozie are examples of current technologies that already have Big Data auditing features that you can readily implement to help assess these areas now.
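
If you want a feel for what an audit control over access violations involves, the Python sketch below writes each access attempt as one JSON line and then counts denied attempts per user. The record layout (`ts`, `user`, `action`, `dataset`, `allowed`) and both function names are assumptions for this example. This is not the audit log format of MapR, Oozie or any other product.

```python
import json
from collections import Counter


def audit_event(user, action, dataset, allowed, timestamp):
    """Serialize one access attempt as a JSON line (hypothetical layout)."""
    return json.dumps(
        {"ts": timestamp, "user": user, "action": action,
         "dataset": dataset, "allowed": allowed},
        sort_keys=True,
    )


def denied_counts(lines):
    """Count denied access attempts per user from JSON audit lines."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if not event["allowed"]:
            counts[event["user"]] += 1
    return counts
```

A report like this gives auditors a simple starting point: users with an unusual number of denied requests are candidates for a closer look at either their permissions or their behavior.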

Understanding and taking these steps to put your Big Data governance processes and policies in place now, prior to starting your first Big Data initiative, will not only protect your firm by addressing the aforementioned risk areas but will also ensure the proper accessibility, usability, integrity and security of your Big Data in the years ahead.
