What is Big Data
- A collection of data that is huge in volume
- Grows exponentially over time
- Cannot be stored or processed efficiently by traditional data management tools
Types of Big Data
- Structured: any data that can be stored, accessed, and processed in a fixed format (e.g., rows and columns in a relational table)
- Unstructured: any data whose form or structure is unknown (e.g., free text, images, video)
- Semi-structured: data that contains elements of both forms (e.g., JSON or XML, which carry tags but no rigid schema)
Characteristics of Big Data
- Volume: enormous size
- Variety: the heterogeneous sources and varied nature of the data
- Velocity: the speed at which data is generated
- Veracity: the quality and trustworthiness of the data, which determines how reliably it can be analyzed
Advantages Of Big Data Processing
- Businesses can utilize outside intelligence while making decisions.
- Improved customer service
- Early identification of risks to products/services, if any.
- Better operational efficiency
Hadoop Ecosystem
- Data Storage
- HDFS: File System
- HBase: Column DB Storage
- Data Processing
- MapReduce: Distributed Data Processing (see the word-count sketch after this list)
- YARN: Cluster & Resource Management
- Data Access
- Hive: SQL
- Pig: Dataflow
- Mahout: Machine Learning
- Avro: Data Serialization & RPC
- Sqoop: Data Transfer (between Hadoop and relational databases)
- Data Management
- Oozie: Workflow Scheduling
- Chukwa: Monitoring
- ZooKeeper: Management
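As a concrete illustration of MapReduce, below is a minimal word-count sketch written against the standard Hadoop Java API (org.apache.hadoop.mapreduce); the input and output HDFS paths are placeholders passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```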
HBase
- an open-source, non-relational distributed database
- it is a NoSQL database
- supports all types of data (structured, semi-structured, unstructured)
- designed to handle a wide variety of data within the Hadoop ecosystem
- runs on top of HDFS and provides BigTable-like capabilities
- written in Java; applications can also access it through REST, Avro, and Thrift APIs
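A minimal sketch of the native Java client API, assuming an HBase cluster is reachable via an hbase-site.xml on the classpath and that a table `users` with column family `info` already exists (both names are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row "row1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read the same cell back
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(value));
    }
  }
}
```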
HIVE
- built on Apache Hadoop
- manages large, distributed data sets
- provides the following features:
- tools to extract/transform/load (ETL) data
- store, query, and analyze large-scale data stored in HDFS (or HBase)
- HQL is converted into MapReduce jobs that run on Hadoop to perform statistical analysis and processing of massive data
- defines a query language, HQL (similar to SQL)
- users familiar with SQL can query data directly using Hive (see the JDBC sketch at the end of this section)
- allows MapReduce-savvy developers to develop custom mappers and reducers to handle complex analysis work
- disadvantages
- does not fully support transactions
- cannot modify table data in place (no row-level UPDATE, DELETE, or INSERT)
- high query latency
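A minimal sketch of querying Hive from Java over JDBC, assuming a HiveServer2 instance at localhost:10000 and a hypothetical `web_logs` table; the HQL string is compiled by Hive into MapReduce jobs behind the scenes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (not needed with JDBC 4 auto-loading)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Host, port, user, and table name are placeholders
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement()) {

      // HQL looks like SQL; Hive turns this into a MapReduce job
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```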
Hadoop
- uses a distributed approach to store massive volumes of information
- data is divided up and distributed across many individual nodes
- HDFS is a specially designed file system for storing huge datasets (see the API sketch after the feature list below)
- main features
- low cost
- scalability
- flexibility
- speed
- fault tolerance
- high throughput
- minimum network traffic
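A minimal sketch of writing and reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem); the NameNode address and file path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // NameNode address (placeholder)
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/demo/hello.txt");

    // Write a small file; HDFS splits large files into blocks and replicates them
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeUTF("hello from HDFS");
    }

    // Read the file back
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}
```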
STORM
- a free, open-source, distributed real-time computation system
- simplifies the reliable processing of streaming data
- very fast: one benchmark clocked over a million tuples processed per second on a single node
- Storm features (a topology sketch follows this list):
- easy to scale out
- fault tolerant
- low latency
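A minimal topology sketch using Storm's Java API: a spout emits sentences and a bolt splits them into words; the spout's data source is simulated, and the in-process LocalCluster is used only for local testing (API details vary slightly across Storm versions):

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StormSketch {

  // Spout: the stream source; here it just repeats a fixed sentence
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      Utils.sleep(100); // throttle the simulated source
      collector.emit(new Values("the quick brown fox"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // Bolt: splits each sentence tuple into word tuples
  public static class SplitBolt extends BaseBasicBolt {
    public void execute(Tuple input, BasicOutputCollector collector) {
      for (String word : input.getStringByField("sentence").split(" ")) {
        collector.emit(new Values(word));
      }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout(), 1);
    // shuffleGrouping spreads sentence tuples randomly across the two bolt tasks
    builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

    // In-process cluster for testing; production deployments use StormSubmitter
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("demo", new Config(), builder.createTopology());
    Thread.sleep(10_000);
    cluster.shutdown();
  }
}
```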
Zookeeper
- acts as the coordinator for any Hadoop job that involves a combination of services in the Hadoop ecosystem
- coordinates and synchronizes the various services in a distributed environment
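A minimal sketch of the ZooKeeper Java client: connect to an ensemble, then create and read a znode that distributed services could use for coordination; the connection string and znode path are placeholders:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to a ZooKeeper ensemble; the address is a placeholder
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await(); // wait until the session is established

    // Create a znode that services can use for coordination (config, locks, ...)
    String path = zk.create("/demo-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    byte[] data = zk.getData(path, false, null);
    System.out.println(path + " = " + new String(data));

    zk.close();
  }
}
```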
Hadoop History & Versions
- Two main problems with big data
- store such large amounts of data
- process the stored data
- Hadoop is the solution to the big data problem
- consists of two components
- Hadoop Distributed File System(HDFS)
- YARN
Hadoop History
- 2002: Apache Nutch was started
- 2003: Google published the Google File System paper
- 2004: Google released a paper on MapReduce
- 2005: the Nutch Distributed File System was introduced
- 2006: Hadoop was introduced, along with HDFS
- 2007: Yahoo! ran two clusters of 1,000 machines
- 2013: Hadoop 2 was released
- 2017: Hadoop 3 was released
Hadoop Distribution Evaluation Criteria
Performance
- in the early days, performance meant fast throughput
- now it also includes low latency
- the recent emphasis on performance centers on two key attributes
- raw performance
- scalability
Scalability
- File system: Hadoop's default architecture has a single NameNode; check whether a distribution avoids single-NameNode bottlenecks (e.g., via a distributed metadata architecture)
- Node count: your chosen Hadoop implementation may need to scale to 1,000 nodes or more
- Node capacity/density: you may need to scale to nodes with higher disk density
Reliability
Apache Hadoop is designed to scale from a single server to thousands of computers and is highly fault-tolerant