Hadoop Chapter 2: Hadoop and Big Data Architecture(EN)

Hadoop Operating modes

four main modes of Hadoop operation

configuration file: hadoop-env.sh

simulate a fully distributed environment
- Hadoop run on a single node
- each Hadoop daemon running on a single server node
five process
- NameNode
- DataNode
- SecondaryNameNode
- ResourceManager
- NodeManager

HDFS is used to store large amounts of data
required for the normal operation of the HDFS
- NameNode
- DataNode
- SecondaryNameNode
run the start-dfs.sh command to start the HDFS daemon process to provide external services

HDFS Configuration
- core-site.xml: NameNode RPC remote communication address. The default port number is 8020
- HDFS-site.xml: set the number of data block copies to 1
YARN Configuration
- mapred-site.xml: set the MapReduce operating framework to YARN
- yarn-site,xml: set the ResourceManager communication address and aux-services of the NodeManager
Other Configuration
- hadoop-env.sh: set the java environment variables in hadoop.sh

HA architecture solve the problem of NameNode availability by allowing us to have two NameNodes in an active/passive configuration.
- Active NameNode
- Standby/Passive NameNode
Two main issues in maintaining consistency
- Active and Standby NameNode should always be in sync with each other
- There should be only one active NameNode at a time

Using Quorum Journal Nodes
- JournalNodes helps the standby and active NameNode keep in sync
- The active NameNode is responsible for updating the EditLogs present in the JournalNodes.
- The StandbyNode reads the changes made to the EditLogs in the JournalNode and applies it to its own namespace in a constant manner.
Using Shared Storage
- A shared storage device helps standby and active NameNode keep in sync
- The active NameNode logs the record of any modification done in its namespace to an EditLog present in this shared storage.
- The StandbyNode reads the changes made to the EditLogs in this shared storage and applies it to its own namespace.
The administrator must configure at least one fencing

Hadoop access control is divided into two levels
- system level: ServiceLevel Authorization. It is used to control whether specified services can be accessed.
- scheduler level

Add a host name on the master node and copy it on any DataNode node.
Hadoop has configured the relevant parameters to the newly added DataNode and started it on the new node. Specific steps are as follows:
- Increase hostname
- Copy the hadoop installation file
- Start new node

After adding new nodes, if you want to achieve load balancing, you need to use the balance command: bin/start-balancer.sh -threshold 10

Hadoop Ecosystem is neither a programming language nor a service
It is a platform or framework which solves big data problems
Most of the services available in the Hadoop ecosystem are to supplement the main four core components
- HDFS
- YARN
- MapReduce
- Common

Big Data Technology Ecosystem