
Hadoop Chapter 1: Big Data Concepts (EN)


Link to the Chinese version

What is Big Data

  • A collection of data that is huge in volume
  • grows exponentially with time
  • too large and complex for traditional data management tools to store or process efficiently
  • in short: still data, just of a huge size

Types of Big Data

  • Structured: any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data
  • Unstructured: any data whose form or structure is unknown is classified as unstructured data
  • Semi-structured: semi-structured data contains elements of both forms (e.g. XML or JSON)

Characteristics of Big Data

  • Volume: the enormous size of the data
  • Variety: the heterogeneous sources and varied nature of the data
  • Velocity: the speed at which data is generated
  • Veracity: the quality and trustworthiness of the data to be analyzed

Advantages Of Big Data Processing

  • Businesses can utilize outside intelligence when making decisions
  • Improved customer service
  • Early identification of risks to the product/services, if any
  • Better operational efficiency

Hadoop Ecosystem

  • Data Storage
    • HDFS: File System
    • HBase: Column DB Storage
  • Data Processing
    • MapReduce: Distributed Data Processing (see the word-count sketch after this list)
    • YARN: Cluster & Resource Management
  • Data Access
    • Hive: SQL
    • Pig: Dataflow
    • Mahout: Machine Learning
    • Avro: Data Serialization & RPC
    • Sqoop: Data Transfer (RDBMS ↔ Hadoop)
  • Data Management
    • Oozie: Workflow Scheduling
    • Chukwa: Monitoring
    • ZooKeeper: Management
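
To make the MapReduce entry above concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API. It is illustrative rather than anything from this post: the mapper emits (word, 1) pairs, the reducer sums them per word, and the input/output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in a line of input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same pattern (map to key/value pairs, shuffle by key, reduce per key) underlies most MapReduce programs.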

HBase

  • an open-source, non-relational, distributed database
  • a NoSQL database
  • supports all types of data
  • can handle anything and everything inside the Hadoop ecosystem
  • runs on top of HDFS and provides BigTable-like capabilities
  • written in Java; HBase applications can also be accessed through REST, Avro, and Thrift APIs
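
Since HBase is written in Java, its native client API is Java as well. A minimal put/get sketch follows; the table name `user` and column family `info` are illustrative and assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("user"))) {

      // Write one cell: row "row1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read it back.
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value)); // prints "Alice"
    }
  }
}
```

Everything in HBase is stored as raw bytes, which is why the Bytes helper wraps every row key, family, qualifier, and value.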

Hive

  • built on Apache Hadoop
  • manages large, distributed data sets
  • provides the following features:
    • tools to extract/transform/load (ETL) data
    • storing, querying, and analyzing large-scale data in HDFS (or HBase)
    • SQL is converted into MapReduce jobs that run on Hadoop, for statistical analysis and processing of massive data
  • defines a query language, HQL (similar to SQL)
  • users familiar with SQL can query data directly using Hive
  • also lets MapReduce-savvy developers plug in custom mappers and reducers for complex analysis work
  • disadvantages
    • does not properly support transactions
    • cannot modify table data (no update, delete, or insert)
    • slow query speed
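
Hive ships a JDBC driver, so HQL can be issued from plain Java. A minimal sketch, assuming a HiveServer2 instance on localhost:10000 and an existing table named `logs` (both are assumptions for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Explicit driver load; with JDBC 4+ it auto-registers from the hive-jdbc jar.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://localhost:10000/default"; // HiveServer2 address is assumed
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement()) {
      // HQL looks like SQL; Hive compiles it into distributed jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level");
      while (rs.next()) {
        System.out.println(rs.getString("level") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```

The compile-to-MapReduce step is also why slow interactive query speed is listed above as a disadvantage: even a small query pays the job-launch overhead.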

Hadoop

  • uses a distributed approach to store massive volumes of information
  • data is divided up and allocated across many individual machines
  • HDFS is a file system specially designed for storing huge datasets
  • main features
    • cost
    • scalability
    • flexibility
    • speed
    • fault tolerance
    • high throughput
    • minimum network traffic
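
A minimal sketch of writing and reading a file through the HDFS Java API; the NameNode address and the file path below are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // NameNode address is an assumption
    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/demo/hello.txt");

      // Write: the client streams the bytes out as replicated blocks on DataNodes.
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back and copy it to stdout.
      try (FSDataInputStream in = fs.open(path)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }
    }
  }
}
```

Note that the client never talks to the NameNode for the data itself; the NameNode only serves block metadata, which is what makes the single-NameNode design a potential bottleneck (see the evaluation criteria below).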

Storm

  • a free, open-source, distributed real-time computation system
  • simplifies the reliable processing of streaming data
  • very fast: one benchmark measured a million tuples processed per second on a single node
  • Storm features:
    • easy to scale out
    • fault tolerance
    • low latency
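
A minimal sketch of Storm's programming model, written against the org.apache.storm 2.x API: a spout emits a stream of sentences, a bolt splits them into words, and a TopologyBuilder wires the two together. Every name here is illustrative, not from the post.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class StreamingSketch {

  // Spout: endlessly emits a fixed sentence, standing in for a real stream source.
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      collector.emit(new Values("the quick brown fox"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // Bolt: splits each incoming sentence into words and re-emits them downstream.
  public static class SplitBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      for (String word : input.getStringByField("sentence").split("\\s+")) {
        collector.emit(new Values(word));
      }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout());
    builder.setBolt("split", new SplitBolt(), 4)  // 4 parallel executors
           .shuffleGrouping("sentences");         // tuples distributed randomly

    LocalCluster cluster = new LocalCluster();    // in-process cluster for testing
    cluster.submitTopology("demo", new Config(), builder.createTopology());
    Thread.sleep(10_000);
    cluster.shutdown();
  }
}
```

In production the topology would be submitted to a real cluster with StormSubmitter instead of the in-process LocalCluster.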

ZooKeeper

  • the coordinator of any Hadoop job, which involves a combination of services across the Hadoop ecosystem
  • coordinates the various services in a distributed environment
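
A minimal sketch of the coordination primitive ZooKeeper exposes: a hierarchical namespace of znodes that distributed processes can create and read. The ensemble address and znode path below are assumptions for illustration.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Connect to the ensemble; the watcher fires once the session is established.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Create a znode holding a small piece of shared state.
    String path = zk.create("/demo-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Read it back (no watch, no Stat object needed here).
    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data)); // prints "v1"

    zk.close();
  }
}
```

Services such as HBase build on exactly these primitives (plus watches and ephemeral nodes) for leader election and shared configuration.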

Hadoop History & Versions

  • Two main problems with big data:
    • storing such large amounts of data
    • processing the stored data
  • Hadoop is the solution to the big data problem
  • consists of two components:
    • Hadoop Distributed File System (HDFS)
    • YARN

Hadoop History

  • 2002: Apache Nutch was started
  • 2003: Google published the Google File System paper
  • 2004: Google released the MapReduce paper
  • 2005: the Nutch Distributed File System was introduced
  • 2006: Hadoop was introduced, along with HDFS
  • 2007: Yahoo ran two clusters of 1,000 machines
  • 2013: Hadoop 2 was released
  • 2017: Hadoop 3 was released

Hadoop Distribution Evaluation Criteria

Performance

  • in the early days: the focus was fast throughput
  • now: it also includes low latency
  • the recent emphasis on low latency focuses on two key attributes:
    • raw performance
    • scalability

Scalability

  • File system: Hadoop’s default architecture relies on a single NameNode; check whether the platform avoids the single-NameNode bottleneck, e.g. through a distributed metadata architecture
  • Node count: your chosen Hadoop implementation may need to scale to 1,000 nodes or more
  • Node capacity/density: you may need to scale to nodes with higher disk density

Reliability

Apache Hadoop is designed to scale from a single server to thousands of machines and is highly fault-tolerant.
