
Hadoop Starter Kit: What Is Big Data?

What is Big Data?

Three factors help define big data: Volume, Velocity, and Variety.

Let me take the example of an imaginary startup that has around 1 TB of data in its initial phase. How do we classify this data? Does it qualify as big data? If the amount of data stays stable throughout the lifetime of the company, is it big data? Certainly not. For a data set to be called big data, it should have a strong growth rate, steadily increasing the volume of the data, and it should come in different varieties (text, pictures, PDFs, etc.).

Here are some examples of big data.

Companies like Amazon monitor not only your purchase history and wishlist but also every click, recording these patterns and processing this huge amount of data to give you a better recommendation system.

Here’s what NASA has to say about big data.

"In the time it took you to read this sentence, NASA gathered approximately 1.73 gigabytes of data from our nearly 100 currently active missions! We do this every hour, every day, every year – and the collection rate is growing exponentially." – http://open.nasa.gov/blog/2012/10/04/what-is-nasa-doing-with-big-data-today/

Have a look at this:

https://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/

Big Data Challenges

Storage – Data should be stored as efficiently as possible, both in terms of hardware and in terms of processing and retrieving the data.

Computation Efficiency – The stored data should be suitable for efficient computation.

Data Loss – Data may be lost due to hardware failures and other causes, so data recovery strategies must be robust.

Time – Big data exists primarily for analysis and processing, so the time needed to process a data set should be minimal.

Cost – The solution should provide huge storage space while remaining cost effective.
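The storage and data-loss challenges above are commonly tackled by splitting a file into fixed-size blocks and replicating each block across several machines, so that no single failure loses data. Here is a toy sketch of the idea in plain Python; the block size, node names, and replication factor are illustrative, not any real system's defaults:

```python
# Toy sketch: split data into blocks and replicate each block
# across nodes so that a single node failure loses no data.

BLOCK_SIZE = 8          # bytes per block (real systems use e.g. 128 MB)
REPLICATION = 3         # copies kept of each block
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the raw bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin)."""
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"big data needs redundant storage"
blocks = split_into_blocks(data)
placement = place_blocks(blocks)

# Even if one node dies, every block still has copies elsewhere.
dead = "node1"
survivors = {i: [n for n in ns if n != dead] for i, ns in placement.items()}
assert all(survivors.values())   # every block is still recoverable
```

The round-robin placement here is only a stand-in; real distributed file systems also take rack topology and node load into account when placing replicas.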

Traditional Solutions

RDBMS

The main issue is scalability. As the data grows, processing time increases, and an unmanageable number of tables forces us to denormalize. Queries may also need to be rewritten for efficiency. Moreover, an RDBMS works with structured data only; once the data comes in various formats, an RDBMS cannot be used.

GRID Computing

Grid computing distributes work across nodes and is therefore good for compute-intensive jobs. However, it does not perform well on large data sets, and it typically requires programming in a lower-level language like C.

A Good Solution: HADOOP

Supports huge volumes

Storage efficiency both in terms of hardware and processing/retrieval

Good Data Recovery

Horizontal Scaling – keeps processing time minimal

Cost Effective

Easy for programmers and non-programmers alike
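To see why the programming model is considered approachable, here is the classic word-count example written in the map/reduce style that Hadoop popularized. Plain Python stands in for the Hadoop API here, so the function names are illustrative only; in real Hadoop the map and reduce functions would run in parallel across the cluster:

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted pairs by key (word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data grows fast"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 2
```

The programmer only writes the small map and reduce functions; the framework handles the splitting, shuffling, and parallelism, which is what makes the model accessible.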

Is Hadoop replacing RDBMS?

So is Hadoop going to replace RDBMS? No. Hadoop and RDBMS are different tools, each better suited for specific purposes.

Hadoop

Storage: Petabytes

Horizontal Scaling

Cost Effective

Made of commodity computers. These are cost effective, as opposed to enterprise-level hardware.

Batch Processing System

Dynamic Schema (supports different file formats)

RDBMS

Storage: Gigabytes

Scaling is limited

Cost may increase steeply with volume

Static Schema



This post first appeared on The Tara Nights – Coders Code, Programmers Automate.
