Hadoop
Hadoop Versions
- Hadoop 1.0: Common, HDFS, and MapReduce.
- Hadoop 2.0: Common, HDFS, MapReduce, and YARN. Hadoop 1.0 and 2.0 are incompatible. The 2.x line includes Ozone from 2.7 and Submarine from 2.10.
- Hadoop 3.0: Common, HDFS, MapReduce, YARN, and the Ozone module. Recent 3.x releases also include Submarine.
Hadoop Modules
- Hadoop Common: Supports other modules
- HDFS (Hadoop Distributed File System): Distributed storage
- Hadoop YARN: Task scheduling and resource management
- Hadoop MapReduce: Distributed computing framework based on YARN
- Hadoop Ozone: Object storage
- Hadoop Submarine: Machine Learning engine
Hadoop Installation
Standalone and Pseudo-Distributed
- Hadoop Pseudo-Distributed Installation.txt
- Hadoop Fully Distributed Setup.txt
Differences between Hadoop 2.0 and 3.0
- Hadoop 3.0 requires Java 8 or later.
- HDFS supports erasure coding, cutting storage overhead versus 3x replication.
- HDFS HA supports more than two NameNodes.
- Several default service ports were moved out of the ephemeral range.
- Adds YARN Timeline Service v2 and an intra-DataNode disk balancer.
HDFS
NameNode
- Role: manage DataNodes and store metadata.
- Metadata includes:
  a. file storage path
  b. file permissions
  c. file size
  d. block size
  e. mapping between file and BlockID
  f. mapping between BlockID and DataNode
  g. replication factor
- Metadata is independent of the actual file contents.
- Each metadata entry is roughly 130~180 B, and is maintained:
  a. in memory, for fast reads and writes
  b. on disk, for crash recovery
- On-disk storage files:
  - `edits`: operation log; records write operations.
  - `fsimage`: metadata image file; records metadata. Its contents often lag behind the metadata in memory.
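As a rough sizing illustration (assuming ~150 B per entry, the midpoint of the range above): 10 million files/blocks would need on the order of 1.5 GB of NameNode heap for metadata alone, which is one reason HDFS handles large numbers of small files poorly (see Small File Handling below).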
- Metadata storage path is determined by the `hadoop.tmp.dir` property. Defaults to `/tmp` if unspecified.
- When the NameNode receives a write request, it first writes the request to the `edits_inprogress` file. If that succeeds, it updates the metadata in memory; if the memory update succeeds, it returns an ack to the client. `fsimage` is not modified in this process.
As HDFS runs,
edits_inprogressgrows. Gap betweenfsimageand memory metadata grows. Need to updatefsimageat appropriate time.edits_inprogresswill roll, producing aneditsfile and a newedits_inprogress.- Space: When
editsfile reaches specified size (default 64M,fs.checkpoint.size), rolls newedits_inprogress. - Time: When time interval from last roll reaches condition (default 1H,
fs.checkpoint.period), rolls newedits_inprogress. - Force:
hadoop dfsadmin -rollEdits - Restart: NameNode restart triggers
editsfile rolling.
- Space: When
- The NameNode manages DataNodes via heartbeat:
  - Default heartbeat interval is 3 s (`dfs.heartbeat.interval`).
  - If the NameNode receives no heartbeat from a DataNode for over 10 min, it considers the node lost; the data on that node is re-replicated to other nodes to maintain the replication factor.
  - The heartbeat signal carries:
    - the current DataNode status (pre-service, in-service, pre-decommission, decommissioned)
    - the BlockIDs held by the DataNode
    - the clusterID, generated when the NameNode is formatted and sent to DataNodes on startup; each heartbeat carries it for validation.
- NameNode/HDFS restart: triggers `edits_inprogress` rolling, applies the logged operations to `fsimage`, loads `fsimage` into memory, waits for DataNode heartbeats, and verifies blocks. This process is Safe Mode; the NameNode exits it automatically on success.
  - Force exit from safe mode: `hadoop dfsadmin -safemode leave`
DataNode
- Block: data is split into blocks of a specified size.
- Purpose: 1) store huge files; 2) fast backup.
- Role: store Blocks on the DataNode's disk.
- Sends a heartbeat to the NameNode periodically.
- Storage location is determined by `hadoop.tmp.dir`.
- Default block size is 128 MB (`dfs.blocksize`).
- A block smaller than that occupies only its actual size.
- BlockIDs are globally incremented.
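For example, under the default 128 MB block size, a 300 MB file is stored as three blocks of 128 MB, 128 MB, and 44 MB; the last block occupies only its actual 44 MB.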
SecondaryNameNode
- Not a hot backup of the NameNode; it assists the NameNode with `edits` file rolling.
- If a SecondaryNameNode exists, it handles the rolling; otherwise the NameNode does it itself.
Replica Placement Strategy
- 1st Replica:
- Internal upload: On uploading DataNode.
- External upload: NameNode chooses relatively idle DataNode.
- 2nd Replica:
- Pre-Hadoop 2.7: Different rack from 1st.
- Post-Hadoop 2.7: Same rack as 1st.
- 3rd Replica:
- Pre-Hadoop 2.7: Same rack as 2nd.
- Post-Hadoop 2.7: Different rack from 2nd.
- More Replicas: On relatively idle nodes.
Rack Awareness
- Mapping IP/Hostname to a logical rack ID.
- Configured via script (Shell/Python).
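Besides a script, Hadoop 2+ also accepts the mapping as a Java class set via `net.topology.node.switch.mapping.impl`. A minimal sketch, assuming an invented convention that nodes sharing the first three IP octets share a rack:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Maps each node name/IP to a logical rack ID for the NameNode.
public class PrefixRackMapping implements DNSToSwitchMapping {
    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>(names.size());
        for (String name : names) {
            // Hypothetical rule: strip the last octet, use the prefix as the rack.
            int lastDot = name.lastIndexOf('.');
            racks.add(lastDot > 0 ? "/rack-" + name.substring(0, lastDot) : "/default-rack");
        }
        return racks;
    }

    @Override
    public void reloadCachedMappings() { /* nothing cached in this sketch */ }

    @Override
    public void reloadCachedMappings(List<String> names) { /* nothing cached */ }
}
```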
Trash Mechanism
- Disabled by default.
- Configured in `core-site.xml` via `fs.trash.interval` (in minutes). Default 0 (disabled); set it to 1440 to keep deleted files for one day.
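Clients can also route deletions through the trash programmatically. A minimal sketch, assuming trash is enabled on the cluster and using a hypothetical path `/demo/old-data`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path victim = new Path("/demo/old-data"); // hypothetical path
            // Moves the path into the current user's .Trash instead of deleting it;
            // returns false if fs.trash.interval is 0 (trash disabled).
            boolean moved = Trash.moveToAppropriateTrash(fs, victim, conf);
            System.out.println(moved ? "moved to trash" : "trash disabled, not moved");
        }
    }
}
```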
DFS Directory
- Controlled by `hadoop.tmp.dir`; created on format.
- Subdirectories: `data` (DataNode), `name` (NameNode), `namesecondary` (SecondaryNameNode).
- `in_use.lock`: marks that the process is running.
Read/Write Process
- Read:
  - Client -> NameNode (check path).
  - NameNode -> Client (Block locations).
  - Client -> DataNode (read Block, verify checksum).
  - Repeat for subsequent blocks.
- Write:
  - Client -> NameNode (check permission/existence).
  - NameNode -> Client (allow).
  - Client -> NameNode (request 1st Block location).
  - NameNode -> Client (DataNode list).
  - Client -> pipeline write (DataNode1 -> DataNode2 -> DataNode3); acks propagate back.
  - Repeat for subsequent blocks.
  - Client -> NameNode (close file).
- Delete:
  - Client -> NameNode.
  - NameNode marks the deletion in metadata; the data remains on the DataNode for now.
  - The DataNode removes the block when a later heartbeat response from the NameNode tells it to.
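From the Java client, these flows hide behind the FileSystem API. A minimal sketch, assuming a pseudo-distributed cluster at `hdfs://localhost:9000` and a hypothetical path `/demo/hello.txt`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumption: pseudo-distributed default
        try (FileSystem fs = FileSystem.get(conf)) {
            Path p = new Path("/demo/hello.txt");
            // Write: client gets DataNode lists from the NameNode, then pipelines blocks.
            try (FSDataOutputStream out = fs.create(p, true)) {
                out.writeUTF("hello hdfs");
            }
            // Read: client gets block locations from the NameNode, then reads from DataNodes.
            try (FSDataInputStream in = fs.open(p)) {
                System.out.println(in.readUTF());
            }
            // Delete: NameNode marks the file; DataNodes drop blocks on later heartbeats.
            fs.delete(p, false);
        }
    }
}
```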
MapReduce
Overview
- Distributed computing framework provided by Hadoop.
- Based on Google MapReduce.
- Split into Map phase and Reduce phase.
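As a concrete illustration of the two phases, here is a minimal WordCount-style job sketch (class and path names are illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: split each line into words, emit (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner: see the Shuffle section below
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```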
Maven Configuration
```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <configuration>
    <archive>
      <manifest>
        <addClasspath>true</addClasspath>
        <mainClass>cn.App</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>
```
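With the main class baked into the manifest this way, the job jar built by `mvn package` can be submitted with `hadoop jar <your-artifact>.jar <input> <output>` without naming the driver class on the command line (artifact name depends on your pom).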
Data Localization Strategy
- JobTracker accesses NameNode for metadata -> Slicing (Logical split).
- Assign map tasks to nodes holding data (reduce transmission).
Shuffle
- Map side:
  - `context.write` writes into a ring buffer (default 100 MB).
  - Data is partitioned and sorted in the buffer (quicksort).
  - When the threshold (0.8) is reached, the buffer spills to disk, generating spill files (each locally ordered).
  - Spill files are merged into the final map output (mergesort); the Combiner runs here if applicable.
- Reduce Side:
- ReduceTask starts fetch thread to pull data from MapTask.
- Merge pulled files (Mergesort).
- Grouping (Group values by key).
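The partition step above is pluggable. A minimal custom Partitioner sketch, where the route-by-first-character rule is invraned purely for illustration:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer (partition) each map output record goes to.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        // Hypothetical rule: bucket keys by their first character.
        return (s.isEmpty() ? 0 : s.charAt(0)) % numPartitions;
    }
}
```

It would be registered on the job with `job.setPartitionerClass(FirstCharPartitioner.class)`.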
Data Skew
- Map side: rare. Caused by uneven, non-splittable input files.
- Reduce Side: Common. Caused by key distribution.
- Solution: Two-stage aggregation (Scatter then Aggregate).
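A sketch of the "scatter" stage, assuming word counts and a salt width of 10 (both illustrative); a second job then strips the salt prefix and aggregates the partial sums:

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Stage 1 mapper: prefix each key with a random salt so one hot key
// spreads across up to SALTS reducers instead of overloading a single one.
public class SaltingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int SALTS = 10; // assumption: tune to the reducer count
    private static final IntWritable ONE = new IntWritable(1);
    private final Random rnd = new Random();
    private final Text out = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            out.set(rnd.nextInt(SALTS) + "_" + word); // e.g. "3_hadoop"
            ctx.write(out, ONE);
        }
    }
}
```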
Small File Handling
- Problems: high memory usage on the NameNode (one metadata entry per file), high thread overhead in MapReduce (one split per small file).
- Hadoop Archive: `hadoop archive -archiveName txt.har -p /result/ /`. The archive is then read through the `har://` filesystem scheme.
YARN
Overview
- Yet Another Resource Negotiator. Introduced in Hadoop 2.0.
Architecture
- ResourceManager: Master. Resource management.
- NodeManager: Slave. Task monitoring and computing.
Job Execution Flow
- Client submits job to ApplicationsManager.
- ApplicationsManager assigns to NodeManager, requesting start of ApplicationMaster.
- ApplicationMaster starts on NodeManager.
- ApplicationMaster splits job to MapTask/ReduceTask.
- ApplicationMaster requests resources from ResourceManager.
- ResourceScheduler returns Container to ApplicationMaster.
- ApplicationMaster assigns tasks to NodeManagers.
- NodeManager executes tasks.
- ApplicationMaster monitors execution.
Speculative Execution
- Optimization for slow tasks.
- Runs a backup copy of the task on another node; whichever copy finishes first wins, and the other is killed.
- Not suitable for data skew scenarios (wastes resources).
Uber Reuse
- Runs a small application's tasks inside the same JVM as the ApplicationMaster.
- Config: `mapreduce.job.ubertask.enable`.
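Related knobs bound what counts as "small": `mapreduce.job.ubertask.maxmaps` (default 9) and `mapreduce.job.ubertask.maxreduces` (default 1).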