Apache Hadoop Overview
Apache Hadoop is a collection of open-source software frameworks for storing and processing large datasets, ranging from gigabytes to petabytes, in a distributed computing environment. It runs applications on clusters of commodity hardware and provides large-scale storage for any kind of data, substantial processing power, and the ability to run a very large number of tasks or jobs concurrently.
Hadoop Architecture Layers
The Hadoop architecture comprises three main layers:
Storage Layer – HDFS
HDFS (Hadoop Distributed File System) holds very large amounts of data and provides straightforward access to it. To store data at this scale, files are split into blocks that are distributed across multiple machines. Each block is stored redundantly, as replicas on several machines, to protect against data loss when a machine fails. HDFS also makes the data available to applications for parallel processing.
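The storage idea described above can be sketched in a few lines of Python. This is a toy model, not the HDFS API: the function names (`split_into_blocks`, `place_replicas`, `readable_after_failure`), the node names, the tiny block size, and the round-robin placement are all assumptions made for illustration. Real HDFS uses far larger blocks and a rack-aware placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a stand-in for illustration only; it is not the
    placement policy that HDFS itself uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement: Dict[int, List[str]], failed_node: str) -> bool:
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


# Hypothetical 4-node cluster and a 10-byte "file" cut into 4-byte blocks.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), nodes, replication=3)
```

With three replicas per block, losing any single node in this model still leaves every block readable, which is the property the redundant storage is meant to provide.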
Resource Management Layer – YARN
YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop and was introduced in Hadoop 2. YARN is designed around the idea of splitting resource management and job scheduling/monitoring into separate daemons: a global ResourceManager and a per-application ApplicationMaster.
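The separation of responsibilities can be illustrated with a small Python model. It is only a sketch under stated assumptions: the classes `ToyResourceManager` and `ToyApplicationMaster`, their methods, the node names, and the first-fit allocation rule are hypothetical and do not correspond to YARN's actual API or scheduler.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyResourceManager:
    """Cluster-wide role: tracks free memory per node and grants containers."""
    free_mb: Dict[str, int]

    def allocate(self, memory_mb: int) -> Optional[str]:
        # First-fit: grant the request on the first node with enough memory.
        for node, free in self.free_mb.items():
            if free >= memory_mb:
                self.free_mb[node] = free - memory_mb
                return node
        return None  # no capacity available right now

    def release(self, node: str, memory_mb: int) -> None:
        # Return a finished container's memory to the pool.
        self.free_mb[node] += memory_mb


@dataclass
class ToyApplicationMaster:
    """Per-application role: requests containers and tracks its own tasks."""
    app_id: str
    running: Dict[str, str] = field(default_factory=dict)  # task -> node
    pending: List[str] = field(default_factory=list)

    def launch(self, rm: ToyResourceManager, tasks: List[str], memory_mb: int) -> None:
        for task in tasks:
            node = rm.allocate(memory_mb)
            if node is None:
                self.pending.append(task)  # wait until resources free up
            else:
                self.running[task] = node


# Hypothetical two-node cluster and one application with four tasks.
rm = ToyResourceManager(free_mb={"node-a": 2048, "node-b": 1024})
am = ToyApplicationMaster(app_id="app-1")
am.launch(rm, tasks=["t1", "t2", "t3", "t4"], memory_mb=1024)
```

The resource manager here knows nothing about tasks, and the application master knows nothing about other applications; that division is the point of the sketch.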
Processing Layer – MapReduce
MapReduce is the data processing layer of Hadoop. It is a software framework for writing applications that process vast amounts of data (terabytes to petabytes) in parallel on a cluster of commodity hardware. Hadoop can run MapReduce programs written in various languages, including Java, Ruby, Python, and C++. MapReduce programs are parallel in nature, which makes them well suited to large-scale data analysis across multiple machines in a cluster.
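The programming model can be shown with the classic word-count example. The following is a single-process Python sketch of the map, shuffle, and reduce steps, not Hadoop's own API: the function names are hypothetical, and on a real cluster the map and reduce calls would run in parallel on different machines rather than in one loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Each line is mapped independently, which is what allows the map
    # step to be spread across many machines.
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data big clusters", "data on clusters"])
```

Because each map call depends only on its own input line and each reduce call only on its own key, both steps can be distributed without coordination beyond the shuffle.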
How is Hadoop special?
- No high-end, expensive systems are needed
  - Built on commodity hardware
  - Can run on a PC or laptop
- Can run on Linux, Windows, macOS, and Solaris
  - Platform-independent, as it is written in Java
- Fault-Tolerant System
  - Execution of the job continues even if nodes fail
  - It accepts failure as part of the system
- Highly Reliable and Efficient Storage System
- Built-in Intelligence to Speed Up the Application
  - Speculative Execution
- Fit for a Wide Range of Applications
  - Web Log Processing
  - Page Indexing and Page Ranking
- MapReduce Framework
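Speculative execution, listed above, means launching a duplicate attempt of a slow-running task and using whichever attempt finishes first. The Python sketch below models the effect on job completion time. It is a toy model under loudly stated assumptions: the function name, the straggler rule (slower than `slow_factor` times the median), and the timing of the backup attempt are invented for illustration and are not the heuristic Hadoop itself uses.

```python
from statistics import median
from typing import Dict


def finish_times_with_speculation(
    durations: Dict[str, float],
    slow_factor: float = 2.0,
) -> Dict[str, float]:
    """Return each task's finish time when stragglers get a backup attempt.

    Toy assumptions: a task is a straggler if its duration exceeds
    slow_factor * median; its backup attempt starts at the median time
    and itself takes the median duration.
    """
    typical = median(durations.values())
    result = {}
    for task, original in durations.items():
        if original > slow_factor * typical:
            backup_finish = typical + typical  # start time + run time
            # The task is done as soon as either attempt completes.
            result[task] = min(original, backup_finish)
        else:
            result[task] = original
    return result


# Hypothetical task durations in seconds; map-3 is the straggler.
durations = {"map-0": 10.0, "map-1": 12.0, "map-2": 11.0, "map-3": 60.0}

# A job finishes only when its slowest task finishes.
without = max(durations.values())
with_backup = max(finish_times_with_speculation(durations).values())
```

In this model a single straggler no longer dictates the job's completion time, at the cost of running one extra attempt.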