What is Big Data?
Big data is commonly defined by three factors: Volume, Velocity and Variety.
Let me take the example of an imaginary startup that has around 1 TB of data in its initial phase. How do we classify that data? Does it qualify as big data? If the amount of data is going to stay stable throughout the lifetime of the company, is it big data? Certainly not. For a data set to be called big data, it should have a good growth rate, steadily increasing the volume of data, and it should come in different varieties (text, pictures, PDFs, etc.).
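The growth-rate point can be made concrete with a small sketch. The 1 TB starting size and 20% monthly growth rate below are purely illustrative assumptions, not figures for any real company:

```python
# Rough sketch: projecting data volume under a fixed monthly growth rate.
# The starting size (1 TB) and 20% rate are invented for illustration.

def months_until(start_tb: float, monthly_growth: float, target_tb: float) -> int:
    """Return how many months it takes the data set to reach target_tb."""
    months = 0
    size = start_tb
    while size < target_tb:
        size *= 1 + monthly_growth
        months += 1
    return months

# A 1 TB data set growing 20% per month crosses 100 TB in about two years.
print(months_until(1.0, 0.20, 100.0))  # 26 (months)
```

A stable 1 TB data set is just a database; the same 1 TB growing 20% a month is a big data problem in the making.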
Here are some examples of big data.
Companies like Amazon monitor not only your purchase history and wishlist but every single click, recording all of these patterns and processing this huge amount of data to give us a better recommendation system.
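The core idea behind such a recommendation system can be sketched with a simple co-occurrence count: items that appear together in the same sessions get recommended to each other. The session data below is made up for illustration; real systems do this over vastly larger click streams:

```python
from collections import Counter
from itertools import combinations

# Toy sessions (each list = items one user interacted with); invented data.
sessions = [
    ["laptop", "mouse", "keyboard"],
    ["laptop", "mouse"],
    ["phone", "case"],
    ["laptop", "keyboard"],
]

# Count how often each ordered pair of items co-occurs in a session.
co_counts = Counter()
for session in sessions:
    for a, b in combinations(sorted(set(session)), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item: str, k: int = 2):
    """Return the k items most often seen alongside `item`."""
    scores = Counter({b: n for (a, b), n in co_counts.items() if a == item})
    return [other for other, _ in scores.most_common(k)]

print(recommend("laptop"))  # the items most associated with "laptop"
```

At Amazon's scale the same counting has to be distributed across many machines, which is exactly the kind of workload discussed later in this post.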
Here’s what NASA has to say about big data.
“In the time it took you to read this sentence, NASA gathered approximately 1.73 gigabytes of data from our nearly 100 currently active missions! We do this every hour, every day, every year – and the collection rate is growing exponentially.”
Source: http://open.nasa.gov/blog/2012/10/04/what-is-nasa-doing-with-big-data-today/
Have a look at this:
https://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/
Big Data Challenges
Storage – Data should be stored as efficiently as possible, both in terms of hardware and in terms of processing and retrieving it.
Computation Efficiency – The platform should support efficient computation over very large data sets.
Data Loss – Data may be lost due to hardware failure and other reasons, so robust data recovery strategies are essential.
Time – Big data exists primarily for analysis and processing, so the time needed to process a data set should be minimal.
Cost – The platform should provide huge storage space while remaining cost effective.
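The data-loss challenge above is usually answered with replication: each block of data is stored on several machines, so losing one machine loses nothing. Here is a minimal sketch of that idea (HDFS, for example, replicates each block three times by default); the node names and block IDs are invented:

```python
import random

# Sketch of block replication as a data-loss strategy. Node and block
# names are invented for illustration; HDFS defaults to 3 replicas.
REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = ["blk_001", "blk_002", "blk_003"]

# Place each block on REPLICATION distinct nodes.
placement = {blk: random.sample(nodes, REPLICATION) for blk in blocks}

# Simulate losing one node: every block still has surviving replicas.
failed = nodes[0]
for blk, replicas in placement.items():
    survivors = [n for n in replicas if n != failed]
    print(blk, "still recoverable from", survivors)
```

Because every block lives on at least two other machines, a single hardware failure never destroys data; the system simply re-replicates from the survivors.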
Traditional Solutions
RDBMS
The main issue is scalability. As the data grows, processing time increases, and an unmanageable number of tables forces us to denormalize. Queries may then need to be rewritten for efficiency. RDBMS is also meant for structured data sets only; once the data comes in various formats, an RDBMS cannot be used.
GRID Computing
Grid computing distributes work across nodes and is therefore well suited to compute-intensive jobs. However, it does not perform well with large data sets, and it typically requires low-level programming in languages like C.
A Good Solution: Hadoop
Supports huge volumes of data
Storage efficiency, both in terms of hardware and of processing/retrieval
Good data recovery
Horizontal scaling – processing time stays minimal
Cost effective
Easy for programmers and non-programmers alike
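Hadoop achieves this with the MapReduce model: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. The classic word-count job can be sketched in plain Python (real Hadoop distributes these same phases across a cluster):

```python
from collections import defaultdict

# Sketch of the MapReduce model on a single machine.
# Map: emit (word, 1) for every word in a line.
def map_phase(line: str):
    for word in line.lower().split():
        yield word, 1

# Shuffle: group all emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: aggregate each group (here, sum the counts).
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big tools", "hadoop handles big data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

Because map and reduce operate on independent chunks, Hadoop can run them on hundreds of commodity machines in parallel, which is where the horizontal scaling comes from.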
Is Hadoop replacing RDBMS?
So is Hadoop going to replace RDBMS? No. Hadoop and RDBMS are different tools, each better suited to specific purposes.
Hadoop
Storage: Petabytes
Horizontal Scaling
Cost Effective
Made up of commodity computers, which are cost effective rather than enterprise-level hardware.
Batch Processing System
Dynamic Schema (Different formats of files)
RDBMS
Storage: Gigabytes
Scaling is limited
Cost may increase steeply with volume
Static Schema
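The dynamic vs. static schema difference is worth a quick illustration. Hadoop-style systems use schema-on-read: raw files in mixed formats are stored as-is, and structure is applied only when the data is read. The records below are invented for illustration:

```python
import json

# Sketch of schema-on-read: records in different formats coexist in
# storage, and a schema is applied per record at read time.
raw_records = [
    '{"user": "alice", "action": "click"}',  # a JSON record
    "bob,purchase",                          # a CSV record
]

def parse(record: str) -> dict:
    """Apply a schema at read time, depending on the record's format."""
    if record.lstrip().startswith("{"):
        return json.loads(record)
    user, action = record.split(",")
    return {"user": user, "action": action}

events = [parse(r) for r in raw_records]
print([e["user"] for e in events])  # ['alice', 'bob']
```

An RDBMS, by contrast, enforces its schema on write: both records would have to be converted into one fixed table layout before they could be stored at all.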