What is Big Data
- A collection of data that is huge in volume
- Grows exponentially over time
- Cannot be stored or processed efficiently by traditional data management tools
Types of Big Data
- Structured: any data that can be stored, accessed, and processed in a fixed format (e.g., rows and columns in a relational table)
- Unstructured: any data whose form or structure is unknown (e.g., free text, images, video)
- Semi-structured: data that contains elements of both forms (e.g., JSON or XML, which carry tags but no rigid schema)
Characteristics of Big Data
- Volume: enormous size
- Variety: the heterogeneous sources and varied nature of the data
- Velocity: the speed at which data is generated
- Veracity: the quality and trustworthiness of the data, which determines how reliably it can be analyzed
Advantages Of Big Data Processing
- Businesses can utilize outside intelligence while making decisions.
- Improved customer service
- Early identification of risks to products/services, if any.
- Better operational efficiency
Hadoop Ecosystem
- Data Storage
- HDFS: File System
- HBase: Column DB Storage
- Data Processing
- MapReduce: Distributed Data Processing (see the word-count sketch after this list)
- YARN: Cluster & Resource Management
- Data Access
- Hive: SQL
- Pig: Dataflow
- Mahout: Machine Learning
- Avro: Data Serialization & RPC
- Sqoop: Data Transfer (between Hadoop and relational databases)
- Data Management
- Oozie: Workflow Scheduling
- Chukwa: Monitoring
- ZooKeeper: Management
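As a concrete illustration of MapReduce, below is a minimal word-count sketch written against the standard Hadoop Java API (org.apache.hadoop.mapreduce); the input and output HDFS paths are placeholders passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```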
HBase
- an open-source, non-relational distributed database
- it is a NoSQL database
- supports all types of data (structured, semi-structured, unstructured)
- designed to handle a wide variety of data within the Hadoop ecosystem
- runs on top of HDFS and provides BigTable-like capabilities
- written in Java; applications can also access it through REST, Avro, and Thrift APIs
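A minimal sketch of the native Java client API, assuming an HBase cluster is reachable via an hbase-site.xml on the classpath and that a table `users` with column family `info` already exists (both names are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row "row1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read the same cell back
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(value));
    }
  }
}
```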
HIVE
- built on Apache Hadoop
- manages large, distributed data sets
- provides the following features:
- tools to extract/transform/load (ETL) data
- store, query, and analyze large-scale data stored in HDFS (or HBase)
- HQL is converted into MapReduce jobs that run on Hadoop to perform statistical analysis and processing of massive data
- defines a query language, HQL (similar to SQL)
- users familiar with SQL can query data directly using Hive (see the JDBC sketch at the end of this section)
- allows MapReduce-savvy developers to develop custom mappers and reducers to handle complex analysis work
- disadvantages
- does not fully support transactions
- cannot modify table data in place (no row-level UPDATE, DELETE, or INSERT)
- high query latency
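A minimal sketch of querying Hive from Java over JDBC, assuming a HiveServer2 instance at localhost:10000 and a hypothetical `web_logs` table; the HQL string is compiled by Hive into MapReduce jobs behind the scenes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (not needed with JDBC 4 auto-loading)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Host, port, user, and table name are placeholders
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement()) {

      // HQL looks like SQL; Hive turns this into a MapReduce job
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```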
Hadoop
- uses a distributed approach to store massive volumes of information
- data is divided up and distributed across many individual nodes
- HDFS is a specially designed file system for storing huge datasets (see the API sketch after the feature list below)
- main features
- low cost
- scalability
- flexibility
- speed
- fault tolerance
- high throughput
- minimum network traffic
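A minimal sketch of writing and reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem); the NameNode address and file path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // NameNode address (placeholder)
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/demo/hello.txt");

    // Write a small file; HDFS splits large files into blocks and replicates them
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeUTF("hello from HDFS");
    }

    // Read the file back
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}
```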
STORM
- a free, open-source, distributed real-time computation system
- simplifies the reliable processing of streaming data
- very fast: one benchmark clocked over a million tuples processed per second on a single node
- Storm features (a topology sketch follows this list):
- easy to scale out
- fault tolerant
- low latency
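A minimal topology sketch using Storm's Java API: a spout emits sentences and a bolt splits them into words; the spout's data source is simulated, and the in-process LocalCluster is used only for local testing (API details vary slightly across Storm versions):

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StormSketch {

  // Spout: the stream source; here it just repeats a fixed sentence
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      Utils.sleep(100); // throttle the simulated source
      collector.emit(new Values("the quick brown fox"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // Bolt: splits each sentence tuple into word tuples
  public static class SplitBolt extends BaseBasicBolt {
    public void execute(Tuple input, BasicOutputCollector collector) {
      for (String word : input.getStringByField("sentence").split(" ")) {
        collector.emit(new Values(word));
      }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout(), 1);
    // shuffleGrouping spreads sentence tuples randomly across the two bolt tasks
    builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

    // In-process cluster for testing; production deployments use StormSubmitter
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("demo", new Config(), builder.createTopology());
    Thread.sleep(10_000);
    cluster.shutdown();
  }
}
```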
Zookeeper
- acts as the coordinator for any Hadoop job that involves a combination of services in the Hadoop ecosystem
- coordinates and synchronizes the various services in a distributed environment
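A minimal sketch of the ZooKeeper Java client: connect to an ensemble, then create and read a znode that distributed services could use for coordination; the connection string and znode path are placeholders:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to a ZooKeeper ensemble; the address is a placeholder
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await(); // wait until the session is established

    // Create a znode that services can use for coordination (config, locks, ...)
    String path = zk.create("/demo-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    byte[] data = zk.getData(path, false, null);
    System.out.println(path + " = " + new String(data));

    zk.close();
  }
}
```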
Hadoop History & Versions
- Two main problems with big data
- store such large amounts of data
- process the stored data
- Hadoop is the solution to the big data problem
- consists of two components
- Hadoop Distributed File System(HDFS)
- YARN
Hadoop History
- 2002: Apache Nutch was started
- 2003: Google published the Google File System paper
- 2004: Google released a paper on MapReduce
- 2005: the Nutch Distributed File System was introduced
- 2006: Hadoop was introduced, along with HDFS
- 2007: Yahoo! ran two clusters of 1,000 machines
- 2013: Hadoop 2 was released
- 2017: Hadoop 3 was released
Hadoop Distribution Evaluation Criteria
Performance
- in the early days, performance meant fast throughput
- now it also includes low latency
- the recent emphasis on performance centers on two key attributes
- raw performance
- scalability
Scalability
- File system: Hadoop's default architecture has a single NameNode; check whether a distribution avoids single-NameNode bottlenecks (e.g., via a distributed metadata architecture)
- Node count: your chosen Hadoop implementation may need to scale to 1,000 nodes or more
- Node capacity/density: you may need to scale to nodes with higher disk density
Reliability
Apache Hadoop is designed to scale from a single server to thousands of computers and is highly fault-tolerant