BASICS: DEFINING BIG DATA AND RELATED TECHNOLOGIES.
According to Gartner, Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
As allprogrammingtutorials.com explains, Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques. It requires different techniques, tools, algorithms and architectures. Some Big Data tools and technologies are Apache Hadoop, Apache Spark, the R language and Apache ZooKeeper.
Facebook and Google rely heavily on Big Data. Let us base our first case study on Facebook. Every time one of the 1.2 billion people who use Facebook visits the site, they see a completely unique, dynamically generated home page. Several different applications power this experience, and others across the site, and they require global, real-time data fetching. In this section, we discuss some of the tools, frameworks and applications that Facebook developed to overcome the challenge of processing this huge volume of data:
- RocksDB - RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database, although the project's current focus is on embedded workloads. RocksDB builds on LevelDB to scale on servers with many CPU cores, to use fast storage efficiently, to support IO-bound, in-memory and write-once workloads, and to be flexible enough to allow for innovation.
Facebook built RocksDB for storing and accessing hundreds of petabytes of data and is constantly improving and overhauling its tools to make this as fast and efficient as possible.
- Corona - Corona is a scheduling framework developed by Facebook to overcome the limitations of the Apache Hadoop MapReduce scheduling framework. That framework is responsible for two functions: cluster resource management and job tracking. Facebook noticed that it could not cope well with peak data loads. Facebook solved this problem with Corona, which separates cluster management from job tracking so that peak data loads can be processed at optimal speeds.
- Presto - Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
Facebook uses Presto for interactive queries against several internal data stores, including its 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte per day.
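The idea of one query combining data from several stores can be illustrated on a very small scale. The sketch below is only an analogy and does not use Presto: it relies on Python's built-in sqlite3 module, attaches two separate in-memory databases to one connection, and joins across them in a single SQL statement. The database names (users_db, events_db), the tables and the sample rows are all hypothetical.

```python
import sqlite3


def federated_totals():
    """Join tables that live in two separate databases with one SQL query.

    Analogy only: SQLite's ATTACH stands in for the connectors a
    distributed engine would use. All names and rows are made up.
    """
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates an independent in-memory database.
    conn.execute("ATTACH DATABASE ':memory:' AS users_db")
    conn.execute("ATTACH DATABASE ':memory:' AS events_db")

    conn.execute("CREATE TABLE users_db.users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE events_db.page_views (user_id INTEGER, views INTEGER)")
    conn.executemany(
        "INSERT INTO users_db.users VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    conn.executemany(
        "INSERT INTO events_db.page_views VALUES (?, ?)",
        [(1, 3), (1, 4), (2, 5)],
    )

    # One query reads from both databases and aggregates the result.
    rows = conn.execute(
        """
        SELECT u.name, SUM(v.views) AS total_views
        FROM users_db.users AS u
        JOIN events_db.page_views AS v ON v.user_id = u.id
        GROUP BY u.name
        ORDER BY u.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(federated_totals())
```

A real distributed engine adds what this sketch leaves out: connectors to remote systems, parallel execution across many machines, and query planning over data that does not fit on one node.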
Google probably processes more information than any company on the planet and tends to have to invent tools to cope with the data. As a result, its technology runs a good five to 10 years ahead of the competition. Google has come up with quite a few big data processing technologies, such as MapReduce and FlumeJava, whose ideas underpin many big data technologies such as Hadoop. In this section, we discuss part of the big data technology stack at Google:
- Google Mesa - Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails.
- Google File System - Google File System is a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. It is widely deployed within Google as the storage platform for the generation and processing of data used by Google's services as well as research and development efforts that require large data sets.
Google File System inspired the design of Hadoop's HDFS, which is actively used by many big data tools and databases such as HBase and Spark.
- Bigtable - Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).
- Google Flume - FlumeJava is a Java framework developed at Google for MapReduce computations. MapReduce enables distributed computing, but not every real-life problem can be expressed as a single MapReduce task. Most real-life problems require a chain of MapReduce tasks for complete processing, which in turn requires intermediate code to pipeline the tasks. FlumeJava resolves this problem by providing pipelining of MapReduce tasks out of the box.
The ideas behind FlumeJava were brought to open source by the Apache Crunch project. Apache Flume, despite the similar name, is a separate project for collecting and moving log data.
- Google MillWheel - MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework's fault-tolerance guarantees. MillWheel's programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. According to its authors, MillWheel's unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
- Dremel - Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. The Dremel paper presents a novel columnar storage representation for nested records and discusses experiments on few-thousand-node instances of the system.
Google offers a cloud analytics platform called BigQuery, based on Dremel, that enables companies to process huge structured data sets at lightning-fast speeds.
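The point made under Google Flume, that many real problems need a chain of MapReduce tasks rather than a single one, can be sketched in a few lines. The example below is a minimal in-memory illustration, not FlumeJava or Hadoop code. The helper run_mapreduce and the two stages (counting words, then grouping words by their count) are hypothetical names chosen for this sketch.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one MapReduce stage in memory: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are processed in sorted order so the output is deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Stage 1: count how often each word occurs.
def split_words(line):
    for word in line.lower().split():
        yield word, 1


def sum_counts(word, counts):
    return word, sum(counts)


# Stage 2: regroup the (word, count) pairs by count.
def key_by_count(pair):
    word, count = pair
    yield count, word


def collect_words(count, words):
    return count, sorted(words)


def pipeline(lines):
    """Chain the two stages: the output of stage 1 is the input of stage 2."""
    word_counts = run_mapreduce(lines, split_words, sum_counts)
    return run_mapreduce(word_counts, key_by_count, collect_words)


if __name__ == "__main__":
    print(pipeline(["big data big", "data tools"]))
```

Here the glue between the stages is a single function call. In a distributed setting that glue has to manage intermediate files, job submission and failure handling, which is the burden a pipelining framework takes off the programmer.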
HARNESSING BIG DATA FOR SECURITY: INTELLIGENCE IN THE ERA OF CYBER WARFARE.
There can be information without security but there can be no security without information. This is unequivocally supported by the great French military strategist and tactician, Napoleon Bonaparte, who rightfully asserted that “War is ninety percent Information.”
In the Digital Era that has heralded the dawn of the Information Age and the frontier of Cyber-Terrorism, Napoleon would have obviously noted that there can be No Command without Cyber Command!
Blind movement on the World Wide Web must not be confused with motion. Intelligent navigation requires real Predictive Analytics. We must not only detect events, but also study, analyze and foresee future patterns of occurrence.
With a combination of powerful information technology tools and methodologies such as Data Mining, Predictive Analytics, Artificial Intelligence and Machine Learning, it is possible to fight a smart war against bandits, terrorists, cyber fraudsters and other gangsters at the click of a button.
Will the construction of a physical wall covering the border with Al-Shabaab stop terrorists from attacking Kenya?
As seen recently in a shooting in Texas, USA, even walls in the form of vast oceans between the Middle East and America have not stopped the Islamic State from launching an attack on American soil! Pietersen gives a snapshot of what it takes to prepare for intelligence-led missions:
Now looking back over nearly 40 years, I think I have learned the following six things.
- First, how one thinks about the mission affects deeply how one does the mission.
- Second, intelligence failures come from failing to step back to think about underlying trends, forces, and assumptions —not from failing to connect dots or to predict the future.
- Third, good analysis makes the complex comprehensible, which is not the same as simple.
- Fourth, there is no substitute for knowing what one is talking about, which is not the same as knowing the facts.
- Fifth, intelligence analysis starts when we stop reporting on events and start explaining them.
- Sixth, managers of intelligence analysts get the behavior they reward, so they had better know what they are rewarding.
BIG DATA OR SMART DATA?
There is no telling who controls the Internet, nor who exactly contributes to the millions of websites being created every moment. The more traditional, rigid and bureaucratic government departments are finding it hard to keep up with modern, young minds full of dynamism and terrific zeal. The most horrifying fact is that these young minds are being tapped, funded and indoctrinated by jihadist and terrorist organizations to further their evil agenda: eliminating any individual who does not accept their ideology.
Therefore, it is crystal clear that for security agencies and governments to fight terrorism effectively, they must invest equally in a dynamic pool of digital talent that will form a seamless network: a smart, agile, adaptive and disruptive army of cyber-geniuses. It is possible! Perhaps you would be happy to work in such an exciting pool of digital talent.
So, how can governments leverage the power of Big Data for the Security needs of our Digital Age?
It is not all about abracadabra solutions. Think tanks must be created, digital resources must be mobilized and brows must be knit as the mind retires into depths of thought that would yield a remarkable new streak of innovations. These innovations must not only analyze the huge piles of big data around us, but also produce brand new intelligence tools that work smart round the clock to process Big Data into actionable, smart information that enhances security.
Wait a minute! Aren’t there enough technologies already about Big Data?
Yes, there are technologies such as Apache Hadoop, and it is open source! However, that is not enough, for it can be developed further into more advanced innovations that keep pace with the dynamism of our digital era. Let’s have a flashback. Recently, an eminent expert gave a great definition of security:
Security. What is security? Dan Geer defined it best. Keynoting at the Recorded Future User Network (RFUN) Conference in Washington, D.C., Geer said:
“Security is about the absence of surprises that can’t be mitigated. As such, security that is well thought is security that changes the probability of surprise while foregoing as little as possible.”
Therefore, there is no taking chances when it comes to matters of security. Planning must be comprehensive. Implementing data security plans must be surgically thorough.
For instance, America is at the forefront of the Data Mining/Big Data battleground through the Data Analysis and Research for Trade Transparency System (DARTTS). This is a system operated under the Department of Homeland Security that works well in preventing money laundering and trade-related crimes.
Why then can’t Kenya use the platform of Big Data to detect, deter and decimate terrorists?
Recorded Future. (2015). Dan Geer Keynote: Data and Open Source Security. Retrieved online from https://www.recordedfuture.com/dan-geer-keynote/
Department of Homeland Security. (2015). Data Mining Report 2011. Retrieved online on 8th May 2015 from http://www.au.af.mil/au/awc/awcgate/dhs/privacy_rpt_datamining_2011.pdf
Gartner, Inc. (2013). IT Glossary: Big Data. Retrieved online on 11th May 2015 from http://www.gartner.com/it-glossary/big-data
Sain Technology Solutions. (2015). Big Data Fundamentals. Retrieved online on 11th May 2015 from http://www.allprogrammingtutorials.com/big-data-fundamentals/
Originally posted on Data Science Central