Guest blog post by Deepak Kumar
Before going into the details of what big data is, let’s take a moment to look at the slides below by Hewlett-Packard.
Going through these slides, you must have realized how much data we are generating every second of every minute, of every hour, of every day, of every month, of every year.
A phrase that is really popular nowadays, and that also holds true: more than 90% of all data was generated in the last two years alone.
And it keeps growing exponentially with the increasing use of devices and digitization across the globe.
So what is the problem with these huge amounts of data?
Earlier, when common database management systems were built, they were designed with a certain scale in mind. Even the organizations using them were not prepared for the scale of data we are producing nowadays.
Since the requirements of these organizations have increased over time, they have to rethink and reinvest in their infrastructure. And the cost of the resources involved in scaling up that infrastructure increases exponentially.
Further, there is a limit to how far factors like the size of the machine, CPU, RAM, etc. can be scaled up. These traditional systems are not able to support the scale required by most companies.
Why can’t traditional data management tools and technologies handle these numbers?
Whatever data comes to us can be categorized with respect to VOLUME, VELOCITY and VARIETY. And this is where the problem starts.
- Volume: Today organizations like NASA, Facebook, Google and many others are producing enormous amounts of data per day. This data needs to be stored, analyzed and processed in order to learn about the market, trends, customers and their problems, along with the solutions.
- Variety: We are generating data from different sources in different forms, like videos, text, images, emails, binaries and lots more, and most of this data is unstructured or semi-structured. The traditional data systems that we know all work on structured data, so it is quite difficult for those systems to handle the quality and quantity of data we are producing nowadays.
- Velocity: Take the example of a simple query where you want to fetch the name of a person from millions of records. As long as the count is in the millions or billions, we are fine with the traditional systems, but beyond that even the simplest query takes a long time to execute. And here we are talking about the analysis and processing of data in the range of hundreds and thousands of petabytes, exabytes and much more. To analyze it, we have to develop a system that processes the data at a much higher speed and with high scalability.
Volume, velocity and variety, popularly known as the 3 Vs, are addressed by the solutions BigData provides. So before going into the details of how BigData handles these complex situations, let’s try to create a short definition for BigData.
What is Big Data?
A dataset whose volume, velocity, variety and complexity are beyond the ability of commonly used tools to capture, process, store, manage and analyze can be termed BIGDATA.
How does BigData handle these complex situations?
Most BigData tools and framework architectures are built with the following characteristics in mind:
- Data distribution: The large data set is split into chunks, or smaller blocks, and distributed over N nodes or machines. The data is thus spread across several nodes and becomes ready for parallel processing. In the BigData world this kind of data distribution is done with the help of a Distributed File System, or DFS.
- Parallel processing: The distributed data gets the power of the N servers and machines on which it resides, and these work in parallel on the processing and analysis. After processing, the partial results are merged into the final required result. This process is known as MapReduce, which is adopted from Google’s MapReduce research work.
- Fault tolerance: Generally we keep more than one replica of each block (or chunk) of data. Hence even if one of the servers or machines is completely down, we can get our data from a different machine or data center. We might think that replicating the data costs a lot of space. And here the fourth point comes to the rescue.
- Use of commodity hardware: Most BigData tools and frameworks run on commodity hardware. So we don’t need specialized hardware with special RAID as the data container. This reduces the cost of the total infrastructure.
- Flexibility and scalability: It is quite easy to add more and more rack space to the cluster as the demand for space increases. And the way these architectures are designed, they fit this scenario very well.
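To make the first three characteristics a little more concrete, here is a minimal Python sketch of the idea: split the data into chunks, place replicas on several nodes, count words in each chunk in parallel, and merge the partial results. It is only an illustration under stated assumptions: the function names, the node names and the round-robin placement are all made up for this example, and threads on a single machine stand in for the machines of a real cluster. No actual DFS or MapReduce framework works exactly like this.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, chunk_size):
    """Data distribution: cut the data set into fixed-size blocks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def place_replicas(num_chunks, nodes, replication=2):
    """Fault tolerance: put each chunk on `replication` nodes, round-robin.

    Hypothetical placement policy; assumes replication <= len(nodes)
    so that the copies of one chunk land on distinct nodes.
    """
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def surviving_copies(placement, failed_node):
    """List the nodes that still hold each chunk after one node fails."""
    return {
        chunk_id: [n for n in holders if n != failed_node]
        for chunk_id, holders in placement.items()
    }


def map_chunk(chunk):
    """Map step: count the words of one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def word_count(lines, chunk_size=2, workers=3):
    """Run the map step on all chunks in parallel, then merge (reduce)."""
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: add the partial counters together.
    return reduce(lambda left, right: left + right, partials, Counter())


if __name__ == "__main__":
    lines = ["big data big", "data is big", "small data"]
    print(word_count(lines))

    placement = place_replicas(2, ["node-a", "node-b", "node-c"])
    print(surviving_copies(placement, "node-b"))
```

With a replication factor of 2, every chunk in this toy placement still has one copy left when a single node goes down, which is the whole point of keeping replicas.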
Well, these are just a few examples from the BigData reservoir of the complex problems that are being solved using BigData solutions.
Again, this article covers only a glass of water from the entire ocean. Go get started and take a deep dive into the BigData world, or if I can say, BigData Planet :)
The article first appeared on http://www.bigdataplanet.info/