HBase
HBase is a column-oriented key-value store built on top of HDFS, which means it leverages the HDFS infrastructure for data replication. HBase is a CP system (Consistent and Partition-tolerant): it sacrifices availability for consistency [see CAP theorem].
HBase is a NoSQL system, so you don't normalize your data model and there are no joins to merge two tables. See Hive (and Pig) for a query language on top.
HBase automatically shards (splits) large tables into multiple pieces called regions. Regions are served by RegionServers.
Data Model
You can imagine a table that stores sparse data (many columns are missing). For any cell, the complete coordinates are given by
Table:Row:ColumnFamily:Column:Timestamp → Value
When you retrieve a value, the version with the latest timestamp is returned. By default three versions are stored (newer HBase releases default to one).
Deleted cells are marked with tombstones, to indicate they must not be returned by any query (the data is physically removed later, during major compaction).
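To make the coordinates concrete, here is a minimal sketch using the standard HBase Java client; the table `users`, column family `profile`, and qualifier `email` are made-up examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellCoordinatesDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write: row + column family + qualifier (timestamp is implicit) -> value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                    Bytes.toBytes("alice@example.com"));
            table.put(put);

            // Read: only the version with the latest timestamp is returned by default
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"))));

            // Delete writes tombstones; the data is physically removed at major compaction
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}
```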
Components
In this post we will walk through the key concepts used in HBase. HBase has the following big architectural pieces:
1. HMaster: Manages the assignment of regions to RegionServers and the splitting of tables when they become huge. All DDL commands are executed by it (see the Admin sketch after this list). The HMaster is not in the read or the write path of a query. There is usually a standby HMaster for fault tolerance; at any point in time only one HMaster is active.
2. RegionServer: Clients contact RegionServers directly for read and write queries.
3. DataNode: Stores the actual data in HFiles. It is colocated with a RegionServer in most cases. After a table split, a RegionServer and the DataNodes holding its data can be on different machines until the data is moved back local during the compaction process.
4. ZooKeeper: The coordination service for detecting and managing failures of both RegionServers and the HMaster. It uses consensus to guarantee that only one HMaster is active, and ephemeral nodes (heartbeats) to detect failures. Of course, ZooKeeper itself needs to run on multiple nodes to be fault tolerant.
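As an example of the DDL path through the HMaster, here is a sketch using the HBase 2.x Admin API; the table and column family names are the same made-up ones as above:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // DDL is executed by the HMaster, not by a RegionServer
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("users"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("profile"))
                    .build());
        }
    }
}
```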
Terminology
Region: a *contiguous* range of row keys in a table
.META table (its location is stored in ZooKeeper): this table serves a purpose very similar to DNS. It is a mapping of table_name:region_start_key:region_id → RegionServer IP. This is how the client knows which RegionServer to contact for a particular read or write request. The .META table is cached at the client.
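A client can inspect this region-to-server mapping through the RegionLocator API; a minimal sketch (again assuming the made-up `users` table):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLocationsDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("users"))) {
            // Each entry pairs a region's start key with the RegionServer serving it
            for (HRegionLocation loc : locator.getAllRegionLocations()) {
                System.out.println(Bytes.toStringBinary(loc.getRegion().getStartKey())
                        + " -> " + loc.getServerName());
            }
        }
    }
}
```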
Data structures maintained by RegionServers
WAL (write-ahead log): This is the classic technique for durability in HBase. Different databases have different names for it; it is the journal concept. DB write commands are appended to a log on disk before anything else, so a RegionServer can rebuild the MemStore after a crash by replaying the WAL.
MemStore: This is the in-memory (RAM) write cache. There is one MemStore per column family, which imposes a practical limit on how many column families you can have. Anything in the MemStore must also be present in the WAL, since the MemStore can be recreated by replaying the commands in the WAL. Once the data is in the WAL and the MemStore, the client gets the ACK for the write. From time to time a region flush happens:
Each time the MemStore is flushed, its contents are written to a new HFile in HDFS as one sequential write (fast).
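To make the write path concrete, here is a toy, self-contained sketch of the WAL → MemStore → HFile idea. This is not HBase code; all names and the file format are invented for illustration:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ToyRegionStore {
    private final Path walPath;
    private final NavigableMap<String, String> memStore = new TreeMap<>();
    private int flushCount = 0;

    ToyRegionStore(Path walPath) { this.walPath = walPath; }

    void put(String key, String value) throws IOException {
        // 1. Append the command to the WAL first (durability).
        Files.writeString(walPath, key + "\t" + value + "\n", StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // 2. Apply it to the sorted in-memory MemStore; only now is the client ACKed.
        memStore.put(key, value);
    }

    // Flush: dump the sorted MemStore to a new immutable "HFile" in one sequential write.
    Path flush() throws IOException {
        Path hfile = walPath.resolveSibling("hfile-" + (flushCount++) + ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(hfile)) {
            for (Map.Entry<String, String> e : memStore.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue());
                w.newLine();
            }
        }
        memStore.clear(); // the WAL can be truncated up to this point as well
        return hfile;
    }

    public static void main(String[] args) throws IOException {
        ToyRegionStore store = new ToyRegionStore(Path.of("wal.log"));
        store.put("row2", "b");
        store.put("row1", "a"); // TreeMap keeps the keys sorted, like the MemStore
        System.out.println("flushed to " + store.flush());
    }
}
```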
BlockCache: an LRU-based read cache.
HFile: These are the files stored in HDFS (on the DataNodes), sorted by key. An HFile also has an index (B-tree-like) written with it, which is needed for fast access during reads. There are also Bloom filters, used to quickly check whether a value for a particular key might exist in the HFile.
To read a particular value, a RegionServer has to search all the HFiles it stores and return the latest version (one without a tombstone). To make this fast, Bloom filters are used to rule out HFiles that definitely do not contain the key.
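HBase ships its own Bloom filter implementation internally; to illustrate the idea, here is a sketch using Guava's `BloomFilter` as an assumed stand-in (not what HBase actually uses):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomDemo {
    public static void main(String[] args) {
        // One Bloom filter per HFile: ~10k expected keys, 1% false-positive rate
        BloomFilter<String> keysInHFile = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);
        keysInHFile.put("row1");

        // false -> the HFile definitely lacks the key, so we can skip reading it
        System.out.println(keysInHFile.mightContain("row999"));
        // true -> the HFile *might* contain the key (false positives are possible),
        // so we must actually read the file
        System.out.println(keysInHFile.mightContain("row1"));
    }
}
```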
Read Ops
- Look in the BlockCache; if the value is present, return it
- Look in the MemStore
- Load the HFiles and look there (a toy sketch of this order follows)
Speeding up reads
To speed up reads, compaction happens from time to time (see the Admin sketch after this list):
- HBase minor compaction: merges multiple HFiles into fewer files. The output can still be multiple files.
- HBase major compaction: rewrites all the HFiles into one HFile per column family, dropping deleted (tombstoned) cells. Usually scheduled to run automatically at off-peak load.
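Compactions normally run automatically, but they can also be requested through the Admin API. A sketch, assuming the HBase 2.x client and the made-up `users` table; both calls are asynchronous requests:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactionDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.compact(TableName.valueOf("users"));      // request a (minor) compaction
            admin.majorCompact(TableName.valueOf("users")); // request a major compaction
        }
    }
}
```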
Region split: When a table grows big, a region is split into two regions (which lie on the same server/physical node). The split is reported to the HMaster. For load balancing, the HMaster may schedule one of the new regions to move to another server (but NOT the underlying DataNode blocks). In that case the receiving RegionServer serves its data from remote DataNodes, until major compaction makes the data local to the server again.
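Splits and balancing also happen automatically, but both can be triggered by hand; a sketch with the HBase 2.x Admin API (note that `balance()` is the 2.x name; older clients call it `balancer()`):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SplitAndBalanceDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.split(TableName.valueOf("users")); // ask HBase to split the table's regions
            admin.balance();                         // let the HMaster's balancer move regions
        }
    }
}
```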