How Hadoop Achieves Parallelism: Let’s Prove It!

Sonam Kumari Singh
3 min read · Dec 11, 2023


Hello Readers,

Big Data has become a problem for the big tech giants these days. Storing and retrieving that data has become a challenge, but tools like Hadoop help solve it.

In this blog, I will explain how the concept of big data and tools like Hadoop solve the problems of Volume and Velocity. We will use the tcpdump tool to analyse the flow of data. I hope this will be insightful for you all.

For this, let’s get the cluster ready:
- A namenode
- 4 Datanodes

Node Type — IP
Namenode — 172.31.42.232
Datanode 1 — 172.31.33.221
Datanode 2 — 172.31.37.144
Datanode 3 — 172.31.13.12
Datanode 4 — 172.31.45.186

So, this is the total size of the cluster with 4 datanodes:

Now, let’s upload a file from the client machine. In parallel, let’s monitor the outgoing traffic with the tcpdump command.
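As a rough sketch of this step, the snippet below builds the two commands one might run side by side: a tcpdump capture filtered to the datanode IPs listed above, and an upload with `hadoop fs -put`. The interface name `any`, the file name `demo.txt` and the target path `/demo.txt` are assumptions for illustration, not values from this setup.

```python
import shlex

# Datanode IPs taken from the cluster table above.
DATANODE_IPS = [
    "172.31.33.221",
    "172.31.37.144",
    "172.31.13.12",
    "172.31.45.186",
]


def tcpdump_command(ips, interface="any"):
    """Build a tcpdump command that only shows traffic to or from the given hosts."""
    # "-n" disables name resolution so raw IPs are printed.
    host_filter = " or ".join(f"host {ip}" for ip in ips)
    return ["tcpdump", "-n", "-i", interface, host_filter]


def upload_command(local_file, hdfs_path):
    """Build the HDFS upload command (the file names used below are hypothetical)."""
    return ["hadoop", "fs", "-put", local_file, hdfs_path]


if __name__ == "__main__":
    # Print the commands instead of running them; tcpdump usually needs root.
    print(shlex.join(tcpdump_command(DATANODE_IPS)))
    print(shlex.join(upload_command("demo.txt", "/demo.txt")))
```

Running the capture in one terminal and the upload in another makes it easier to match the traffic to the upload.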

Running tcpdump from the master node, we can observe the flow of traffic.

Running tcpdump on the slave nodes:

Datanode 1:

Datanode 2:

Datanode 3:

Hence, it can be concluded that the data is initially saved to datanode 2.
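One way to read such captures is to total the payload bytes for each source and destination pair. The sketch below does this for tcpdump's one-line text output with `-n`. The sample lines are invented for illustration (the ports, the sizes and the assumption that the upload starts from 172.31.42.232) and are not the captures from this experiment.

```python
import re
from collections import Counter

# Matches lines such as:
# 10:00:00.000001 IP 1.2.3.4.40000 > 5.6.7.8.50010: Flags [P.], ..., length 1448
LINE_RE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.\d+ > (\d+\.\d+\.\d+\.\d+)\.\d+: .*length (\d+)"
)


def bytes_per_flow(lines):
    """Sum payload bytes for each (source IP, destination IP) pair."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            src, dst, length = match.groups()
            totals[(src, dst)] += int(length)
    return totals


# Hypothetical sample lines, NOT real output from the cluster.
SAMPLE = [
    "10:00:00.000001 IP 172.31.42.232.40000 > 172.31.37.144.50010: "
    "Flags [P.], seq 1:1449, ack 1, win 229, length 1448",
    "10:00:00.000002 IP 172.31.42.232.40000 > 172.31.37.144.50010: "
    "Flags [P.], seq 1449:2897, ack 1, win 229, length 1448",
    "10:00:00.000003 IP 172.31.37.144.50010 > 172.31.42.232.40000: "
    "Flags [.], ack 2897, win 240, length 0",
]

if __name__ == "__main__":
    for (src, dst), total in bytes_per_flow(SAMPLE).most_common():
        print(f"{src} -> {dst}: {total} bytes")
```

The pair with the largest total towards a datanode points to the node that received the file first.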

Now let’s try to retrieve the data and stop some datanodes while the read is in progress.

We stopped datanode 3.

Then we stopped datanode 1.

Finally, we stopped datanode 4 too.

Here are the conclusions drawn:

  • Once the data had been copied to datanode 2, it was replicated in full to 3 more nodes. This is how Hadoop achieved parallelism: the master node did not copy the file to each datanode itself.
  • When we stopped the datanodes while reading the file, the result was still displayed seamlessly, without any noticeable lag.
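Both conclusions can be pictured with a small toy model. It is only a sketch, not Hadoop's own code: it assumes a replica ended up on all four datanodes, as the first conclusion suggests, and that the data is forwarded from one datanode to the next in a chain. The order of the nodes after datanode 2 is an assumption, and the function names are made up for this example.

```python
def pipeline_hops(first_node, other_nodes):
    """Return the (sender, receiver) hops of a chained replication pipeline."""
    chain = [first_node, *other_nodes]
    return list(zip(chain, chain[1:]))


def pick_replica(replicas, stopped):
    """Return the first replica holder that is still running, or None."""
    for node in replicas:
        if node not in stopped:
            return node
    return None


if __name__ == "__main__":
    # The order after datanode2 is an assumption; the captures only show
    # that datanode2 received the data first.
    replicas = ["datanode2", "datanode1", "datanode3", "datanode4"]
    print(pipeline_hops(replicas[0], replicas[1:]))

    # Stop datanodes 3, 1 and 4 in turn, as in the experiment above.
    stopped = set()
    for node in ["datanode3", "datanode1", "datanode4"]:
        stopped.add(node)
        print(f"stopped {node}: read served by {pick_replica(replicas, stopped)}")
```

As long as one replica holder is still running, the read can be served, which matches what we saw after stopping three of the four datanodes.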

Hope you found something useful in this article. Follow me for some more technical articles.
