
Big Data Storage Problem - HDFS

Storage and Computation Problem:

Before the Big Data evolution, the biggest problem was storing the exponentially increasing data and analyzing it to support business decisions. Traditional storage databases were not good enough to handle such rapid data growth: even when the storage itself could somehow be managed, the processing took far too long.


Hadoop solutions:


Hadoop solved the storage problem with HDFS (Hadoop Distributed File System) and the processing problem with Hadoop MapReduce.
Let's focus on the storage solution, HDFS.
Before that, let's see why Hadoop went for a Distributed File System (DFS) in the first place:


Let's take a situation where there is a high-end server machine with 4 I/O channels, each channel with a bandwidth of 100 MB/s. With that configuration, processing 1 TB of data takes about 43 minutes. Now take 10 machines with the same I/O configuration, but built from commodity hardware rather than high-end parts, and the time is cut by a factor of 10. So we now have an obvious answer to why Hadoop went for a distributed file system instead of a centralised file system for storage.
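To make the numbers concrete, here is the rough arithmetic behind that claim:

One machine: 4 channels x 100 MB/s = 400 MB/s, so 1 TB / 400 MB/s is on the order of 2,500-2,600 seconds, i.e. roughly the 43 minutes quoted above.
Ten machines: each machine handles about a tenth of the data in parallel, so roughly 260 seconds, i.e. about 4.3 minutes.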
Now, why does Hadoop have its own DFS, namely HDFS? The major difference between a conventional DFS and HDFS is this: in a conventional DFS the data is sent to the central server machine to be processed, whereas in HDFS the program/code/logic is sent to the nodes where the data resides, and only the results are sent back to the server machine.

NameNode and DataNode:

NameNode and DataNode are the major daemons that run in HDFS.

The NameNode is the master node in the Hadoop cluster and controls the DataNodes. It holds the metadata of the cluster, that is, the namespace and the locations of the blocks stored on the DataNodes.
The DataNodes are the slave nodes that store the actual data. Each DataNode sends heartbeat signals to the NameNode to say that it is still alive.
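As a minimal sketch of this master/slave relationship, the HDFS Java API lets a client ask the NameNode for the DataNodes it currently knows about (knowledge the NameNode builds from those heartbeats and block reports). The cluster URI below is a placeholder, and the snippet assumes the standard Hadoop client libraries are on the classpath:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" is a placeholder for your NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // The NameNode answers from the metadata it builds out of
            // DataNode heartbeats and block reports.
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                System.out.println(dn.getHostName()
                        + " capacity=" + dn.getCapacity()
                        + " used=" + dn.getDfsUsed());
            }
        }
        fs.close();
    }
}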
Now let's see how data is stored in an HDFS cluster.

HDFS Blocks:


Every file is stored in HDFS as blocks. The default block size is 128 MB in Hadoop version 2 (64 MB in Hadoop version 1). For example, a file of 248 MB is split into 2 blocks (Block A and Block B) of 128 MB and 120 MB respectively.
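To see that split from the client side, a small sketch using the HDFS Java API can ask the NameNode for the block locations of a file; for a 248 MB file with the default 128 MB block size you would expect two entries. The cluster URI and file path are placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/data/sample_248mb.bin");   // placeholder path
        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode where each block of the file lives.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}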

Block Replication:

Since the machines used in Hadoop clusters are commodity hardware, there is always a chance that a machine will fail, and it is not safe to keep the data on just a single machine. This is where Block Replication comes in: each block of data is replicated across multiple nodes, which is what makes Hadoop fault tolerant. If one machine fails, there is always a copy of the data on other nodes.
Hadoop follows a default replication factor of 3, which can be changed in the Hadoop configuration files.
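As a sketch of how that setting can be changed, the replication factor is controlled by the dfs.replication property (normally set in hdfs-site.xml) and can also be changed per file through the Java API. The URI and path below are placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default replication for files created with this configuration.
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Change the replication factor of an existing file (placeholder path).
        fs.setReplication(new Path("/data/sample_248mb.bin"), (short) 2);
        fs.close();
    }
}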

HDFS Architecture



Hadoop follows a Rack Awareness algorithm to place the replicas of the data. Rather than keeping all replicas in the same rack, Hadoop spreads them across racks: with the default replication factor of 3, the first replica is placed on the writer's node (or a random node), the second on a node in a different rack, and the third on another node in that same remote rack. This way the data survives even if an entire rack goes down.

Write Mechanism:

The client requests a write operation from the NameNode, and the NameNode responds with the IP addresses of the DataNodes where the client can write the data. The client sends the write request to those DataNodes. Once it gets an acknowledgement from the DataNodes that they are ready, a write pipeline is created and the data is written to the first DataNode, which takes care of the replication to the other DataNodes. Once the write is done, the DataNode sends an acknowledgement back to the client, the client reports the details to the NameNode, and the NameNode updates its metadata accordingly.
When multiple blocks are written, the first copies of the blocks are written to the DataNodes in parallel, while the replication of each block happens sequentially along its pipeline.
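From the client's point of view, a minimal write looks like the sketch below: the create() call goes to the NameNode, and the bytes then flow through the DataNode pipeline described above. The cluster URI and path are placeholders:

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // create() asks the NameNode for the target DataNodes;
        // the stream then writes through the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/data/hello.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}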

Read Mechanism:

The client requests the blocks it needs from the NameNode, and the NameNode provides the IP addresses of the DataNodes, in their racks, where those blocks are stored. The client then requests the read operation from those DataNodes directly, through the core switch.
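The read side mirrors the write: open() fetches the block locations from the NameNode, and the stream then pulls the bytes from the DataNodes. A minimal sketch, again with a placeholder URI and path:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // open() asks the NameNode where the blocks live, then reads from the DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/data/hello.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}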


So, with HDFS, Hadoop is able to give a solid solution to the Big Data storage problem, and fault tolerance through Block Replication gives HDFS the support it needs to handle Big Data storage reliably.


