We have seen that in order to process big data, it has to be distributed across multiple systems, and something has to be responsible for managing that distributed data. This is where Hadoop comes into the picture: it distributes and manages big data.
Hadoop has three core components:
1. HDFS (Hadoop Distributed File System): As we have seen, data in Hadoop is distributed across various machines. HDFS is responsible for managing that distributed data.
2. MapReduce: Something has to process the data, and MapReduce is responsible for that. The business logic needed to process the data is written as MapReduce jobs.
3. YARN: Introduced in Hadoop 2.0, YARN is responsible for running the processing tasks created by MapReduce. Running a task requires a certain amount of memory and CPU, and YARN allocates those resources on the cluster.
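To make HDFS a little more concrete, here is a minimal sketch of writing a small file into HDFS using Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:9000) and the file path are placeholders for illustration, not values from this article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; this address is a placeholder
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS takes care of splitting and replicating
        // the blocks across the machines in the cluster.
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello, HDFS");
        }
        System.out.println("File exists: " + fs.exists(path));
    }
}

From the application's point of view this looks like writing to a single file, even though the data may end up spread over several machines.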
As we have seen, the core business logic is written inside MapReduce jobs, and you can write it in Java. Once the business logic is written, the job is submitted to YARN for execution. YARN checks the nodes (individual machines) in the cluster to see which ones have free resources to run the MapReduce tasks, executes the tasks there, and the resulting data is saved back into HDFS. The classic WordCount job sketched below shows what such business logic looks like.
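This is a sketch of the standard WordCount example using Hadoop's Java MapReduce API; the input and output HDFS paths are assumed to be passed as command-line arguments rather than taken from this article.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a job would typically be submitted with something like hadoop jar wordcount.jar WordCount /input /output, after which YARN schedules the map and reduce tasks on nodes with free memory and CPU, and the results land in the HDFS output directory.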