Tutorial on Hadoop basics
Hadoop is an open-source framework designed to store and process large datasets in a distributed computing environment. It provides a scalable and cost-effective way to handle massive amounts of data (i.e. big data) across clusters of commodity hardware. Hadoop is primarily used for big data processing and analysis, and it is a crucial technology in the fields of data analytics and data-driven decision making.
Hadoop and its related projects and components have been developed under the Apache Hadoop ecosystem. It is released under the Apache License, Version 2.0.
In other words, Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Its core components are:
• MapReduce - an offline (batch) computing engine
• HDFS - the Hadoop Distributed File System
• HBase (pre-alpha) - a store providing online data access
Hadoop is used by many organizations that handle big data, such as Yahoo, Amazon, Facebook and Netflix. Three main applications of Hadoop are search (grouping related documents), advertisement and security (searching for uncommon patterns).
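To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python that simulates the map, shuffle and reduce phases on a single machine. On a real cluster the same logic would be submitted as a MapReduce job (e.g. via Hadoop Streaming); the function names below are illustrative only, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # prints: {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

The key idea is that `map_phase` can run on many nodes at once over different splits of the input, and the framework handles the shuffle between phases.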
Requirements of Hadoop
Following are the goals or requirements of Hadoop:
➨High scalability and availability
➨Abstract and facilitate the storage and processing of large and/or rapidly growing data sets (both structured and non-structured).
➨Use commodity hardware with little redundancy
➨Move computation rather than data
Hadoop framework and its architecture diagram
The Hadoop framework consists of several core components that work together to enable distributed processing of data. It processes and analyzes massive datasets in parallel across distributed clusters of computers.
The salient features of Hadoop are as follows.
➨Main nodes of cluster are where most of the computational power and storage of the system lies.
➨Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and also a DataNode to store the needed blocks as close as possible.
➨Distributed, with some centralization.
➨Hadoop is written in Java. Through Hadoop Streaming, applications can also be written in other languages such as Python and Ruby.
➨The central control node runs the NameNode to keep track of HDFS directories & files, and the JobTracker to dispatch compute tasks to TaskTrackers.
Let us understand the components shown in the Hadoop framework diagram.
• HDFS (Hadoop Distributed File System) : It is a distributed file system which stores data across multiple machines. It is designed for high throughput and can handle petabytes of data.
• MapReduce : It is a programming model and processing engine used to process and analyze large datasets in parallel.
• YARN : It is a resource management layer of Hadoop. It manages and allocates resources such as CPU, memory etc. to different applications running on cluster.
• Hadoop Common : It contains libraries and utilities which are shared across various hadoop components.
• Hadoop Ecosystem : Hadoop has a rich ecosystem of related projects which include Apache Hive, Apache Spark, Apache HBase, Apache Sqoop, Apache Flume etc.
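As a back-of-the-envelope illustration of how HDFS stores a file, the sketch below computes how a file is split into fixed-size blocks and how many block replicas the cluster then holds. It assumes the defaults of recent Hadoop versions (128 MB block size, replication factor 3); the helper function itself is hypothetical, not an HDFS API.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default HDFS replication factor

def hdfs_block_layout(file_size_bytes):
    # A file is split into ceil(size / block_size) blocks;
    # each block is replicated on REPLICATION different DataNodes.
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    replicas = blocks * REPLICATION
    return blocks, replicas

# A 1 GB file -> 8 blocks of 128 MB, 24 block replicas across the cluster.
blocks, replicas = hdfs_block_layout(1024 * 1024 * 1024)
print(blocks, replicas)  # prints: 8 24
```

This replication is what gives HDFS its fault tolerance: losing one DataNode still leaves two copies of every block it held.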
Benefits of Hadoop
Following are the advantages or benefits of Hadoop.
➨It allows us to scale data processing infrastructure by adding more machines to the cluster as data under analysis grows.
➨HDFS (Hadoop Distributed File System) breaks large files into smaller blocks and stores them across multiple machines in the cluster. It ensures high data availability and fault tolerance.
➨Hadoop's MapReduce model enables parallel processing of data.
➨Hadoop can run on commodity hardware which is cheaper compared to high end expensive servers.
➨The architecture of Hadoop supports various data types, formats and sources.
➨Hadoop extends its capabilities through various projects and tools for data ingestion, data warehousing, data processing, real time analytics, machine learning and more.
➨It can process data where it is stored and hence reduces the need for data movement.
➨Its distributed nature ensures that data and processing can continue on other nodes even if any one node fails.
➨Through ecosystem tools such as Apache Spark, it enables real-time and stream processing, which allows organizations to process and analyze data as it arrives.
Conclusion : Hadoop is developed and maintained by a vibrant open source community. Hence the Hadoop framework offers continuous improvements, bug fixes and a wealth of resources and tools.