HDFS stands for Hadoop Distributed File System. As the name suggests, it is the file system of Hadoop, where data is distributed across various machines. These machines are connected to each other so that they can share data. An individual machine is called a NODE.
A group of machines connected to each other is called a CLUSTER.
The individual machines are COMMODITY HARDWARE, i.e. low-cost systems like the ones we use at home. But as we have all seen, the computer at home breaks down at times, whether due to a hard disk crash or a CPU failure. The same thing can happen to the nodes (individual machines) in Hadoop. A node can go down and we might lose data.
This data loss problem is solved by replicating the data on more than one node, i.e. Hadoop stores the same data on three machines by default. So, if one machine fails, the data can be retrieved from either of the other two.
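The replication factor can also be changed for an individual file. As a minimal sketch, assuming a standard Hadoop client configuration is on the classpath and using a hypothetical '/data/employee.csv' path, the Java API lets us read and adjust it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster described by the client's configuration files
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/employee.csv");   // hypothetical path

            // Read the current replication factor of the file (3 by default)
            short replication = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication: " + replication);

            // Ask HDFS to keep one extra copy of this particular file
            fs.setReplication(file, (short) 4);
        }
    }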
Say we have a large file 'employee.csv' of 5GB that has to be stored in HDFS. What Hadoop does is break the 5GB file into blocks of 64MB each. Once the file is broken up, the 64MB blocks are distributed among the nodes in the cluster.
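To put numbers on it: 5GB is 5120MB, so the file is split into 5120 / 64 = 80 blocks of 64MB each (the last block of a file is usually smaller, since file sizes rarely fall exactly on a block boundary).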
Say there are 5 Nodes A, B, C, D and E in the Cluster. The first block of 64MB is taken and placed on the first node A. Similarly, the second block is taken and placed on the second node B. The same process repeats until all the blocks are distributed among the 5 Nodes in the Cluster.
So, when the first block of 64MB is placed on the first node, at the same time two copies of that block are created and placed on two other nodes in the cluster.
In case the first node goes down, the data can be retrieved from either of the other two nodes.
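This failover is transparent to the reader: the client simply asks for the file, and HDFS serves each block from any healthy node that holds a replica of it. A minimal read sketch, again assuming a standard Hadoop client and the hypothetical '/data/employee.csv' path:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // The client never picks a replica itself; it just opens the path,
            // and each block is read from whichever node holding it is reachable.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/employee.csv"))))) {
                System.out.println(reader.readLine());   // print the first record
            }
        }
    }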
As we have seen, the data is distributed across multiple nodes (i.e. machines) in a cluster. It is the responsibility of HDFS to keep track of how the data is distributed and managed, and it meets this requirement by using a master-slave architecture.
From the set of Nodes in the Cluster, one Node is picked and named the 'Name Node'. This 'Name Node' becomes the master and keeps track of all the distributed blocks stored on the other Nodes. All the Nodes other than the Name Node become the slaves and are called 'Data Nodes'.
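We can actually ask the Name Node for this bookkeeping from a client program. A minimal sketch, assuming the same hypothetical '/data/employee.csv' file, that prints which Data Nodes hold each block:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/employee.csv"));

            // The Name Node answers this metadata query; no file data is read.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                // Each block lists the Data Nodes holding its replicas
                System.out.println("Block at offset " + block.getOffset()
                        + " stored on " + String.join(", ", block.getHosts()));
            }
        }
    }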
The name 'Name Node' is quite fitting, as it holds the information about where the data is stored.
The same applies to the 'Data Node', as it is where the actual data is stored.
When the client sends a request, it is handled by the 'Name Node'. The 'Name Node' figures out which 'Data Nodes' have space available for storage and informs the client which Nodes it can use. The client then sends the 64MB blocks to those 'Data Nodes' for storage.
So, the Name Node is like a manager who knows which of the employees under him is working on which project. If a client wants some work done, he has to approach the manager, and the manager decides how the work is divided among the employees.
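From the client's point of view, this whole exchange is hidden behind a single call. A minimal write sketch, assuming a standard Hadoop client and the hypothetical '/data/employee.csv' path with made-up sample rows: the create() call talks to the Name Node for the file entry and block allocation, while the bytes themselves are streamed to the assigned Data Nodes.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // create() asks the Name Node where the blocks should go;
            // the data written below goes directly to those Data Nodes.
            try (FSDataOutputStream out = fs.create(new Path("/data/employee.csv"))) {
                out.write("id,name,department\n".getBytes(StandardCharsets.UTF_8));
                out.write("1,Asha,Engineering\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }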
After the Name Node has informed the client about the free space on the Data Nodes, the client goes directly to those Data Nodes to store the data blocks. Let us look at the example below to see how the data is replicated.