??xml version="1.0" encoding="utf-8" standalone="yes"?>亚洲六月丁香婷婷综合,亚洲国产精品嫩草影院在线观看 ,亚洲欧洲国产成人精品http://m.tkk7.com/DLevin/category/54892.htmlIn general the OO style is to use a lot of little objects with a lot of little methods that give us a lot of plug points for overriding and variation. To do is to be -Nietzsche, To bei is to do -Kant, Do be do be do -Sinatrazh-cnSun, 15 Nov 2015 05:16:24 GMTSun, 15 Nov 2015 05:16:24 GMT60SSTable详解http://m.tkk7.com/DLevin/archive/2015/09/25/427481.htmlDLevinDLevinThu, 24 Sep 2015 17:35:00 GMThttp://m.tkk7.com/DLevin/archive/2015/09/25/427481.htmlhttp://m.tkk7.com/DLevin/comments/427481.htmlhttp://m.tkk7.com/DLevin/archive/2015/09/25/427481.html#Feedback0http://m.tkk7.com/DLevin/comments/commentRss/427481.htmlhttp://m.tkk7.com/DLevin/services/trackbacks/427481.html 前记几年前在读Google的BigTable论文的时候,当时q没有理解论文里面表辄思想Q因而囫囵吞枣,q没有注意到SSTable的概c再后来开始关注HBase的设计和源码后,开始对BigTable传递的思想慢慢的清晰v来,但是因ؓ事情太多Q没有安排出旉重读BigTable的论文。在目里,我因己在学HBaseQ开始主推HBaseQ而另一个同事则因ؓ对Cassandra比较感冒Q因而他主要xCassandra的设计,不过我们两个人偶都会讨Z下技术、设计的各种观点和心得,然后他偶然的说了一句:Cassandra和HBase都采用SSTable格式存储Q然后我本能的问了一句:什么是SSTableQ他q没有回{,可能也不是那么几句能说清楚的Q或者他自己也没有尝试的去问q自p个问题。然而这个问题本w却一直困扰着我,因而趁着现在有一些时间深入学习HBase和Cassandra相关设计的时候先把这个问题弄清楚了?br />

SSTable的定?/h2>要解释这个术语的真正含义Q最好的Ҏ是从它的出处找{案Q所以重新翻开BigTable的论文。在q篇论文中,最初对SSTable是这么描q的Q第三页末和W四初Q:
SSTable

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.

单的非直译:
SSTable是Bigtable内部用于数据的文件格式,它的格式为文件本w就是一个排序的、不可变的、持久的Key/Value对MapQ其中Key和value都可以是L的byte字符丌Ӏ用Key来查找ValueQ或通过l定Key范围遍历所有的Key/Value寏V每个SSTable包含一pd的BlockQ一般Block大小?4KBQ但是它是可配置的)Q在SSTable的末是Block索引Q用于定位BlockQ这些烦引在SSTable打开时被加蝲到内存中Q在查找旉先从内存中的索引二分查找扑ֈBlockQ然后一ơ磁盘寻道即可读取到相应的Block。还有一U方案是这个SSTable加蝲到内存中Q从而在查找和扫描中不需要读取磁盘?/span>

q个貌似是HFileW一个版本的格式么,贴张图感受一下:

在HBase使用q程中,对这个版本的HFile遇到以下一些问题(参?a >q里Q:
1. 解析时内存用量比较高?br />2. Bloom Filter和Block索引会变的很大,而媄响启动性能。具体的QBloom Filter可以增长?00MB每个HFileQ而Block索引可以增长?00MBQ如果一个HRegionServer中有20个HRegionQ则他们分别能增长到2GB?GB的大。HRegion需要在打开Ӟ需要加载所有的Block索引到内存中Q因而媄响启动性能Q而在W一ơRequestӞ需要将整个Bloom Filter加蝲到内存中Q再开始查找,因而Bloom Filter太大会媄响第一ơ请求的延迟?br />而HFile在版?中对q些问题做了一些优化,具体会在HFile解析时详l说明?br />

SSTable作ؓ存储使用

l箋BigTable的论文往下走Q在5.3 Tablet Serving节中这样写道(W?)Q?br />
Tablet Serving

Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables. To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise a tablet and a set of a redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

When a write operation arrives at a tablet server, the server checks that it is well-formed, and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). A valid mutation is written to the commit log. Group commit is used to improve the throughput of lots of small mutations [13, 16]. After the write has been committed, its contents are inserted into the memtable.

When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.

Incoming read and write operations can continue while tablets are split and merged.

W一D和W三D늮单描qͼ非翻译:
在新数据写入Ӟq个操作首先提交到日志中作ؓredoU录Q最q的数据存储在内存的排序~存memtable中;旧的数据存储在一pd的SSTable 中。在recover中,tablet server从METADATA表中dmetadataQmetadata包含了组成Tablet的所有SSTableQ纪录了q些SSTable的元 数据信息Q如SSTable的位|、StartKey、EndKey{)以及一pd日志中的redo炏VTablet ServerdSSTable的烦引到内存Qƈreplayq些redo点之后的更新来重构memtable?br />在读Ӟ完成格式、授权等查后Q读会同时读取SSTable、memtableQHBase中还包含了BlockCache中的数据Qƈ合ƈ他们的结果,׃SSTable和memtable都是字典序排列,因而合q操作可以很高效完成?br />

SSTable在Compactionq程中的使用

在BigTable论文5.4 Compaction节中是q样说的Q?br />
Compaction

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.

随着memtable大小增加C个阀|q个memtable会被M而创Z个新的memtable以供使用Q而旧的memtable会{换成一个SSTable而写道GFS中,q个q程叫做minor compaction。这个minor compaction可以减少内存使用量,q可以减日志大,因ؓ持久化后的数据可以从日志中删除?/span>在minor compactionq程中,可以l箋处理dh?br />每次minor compaction会生成新的SSTable文gQ如果SSTable文g数量增加Q则会媄响读的性能Q因而每ơ读都需要读取所有SSTable文gQ然后合q结果,因而对SSTable文g个数需要有上限Qƈ且时不时的需要在后台做merging compactionQ这个merging compactiond一些SSTable文g和memtable的内容,q将他们合ƈ写入一个新的SSTable中。当q个q程完成后,q些源SSTable和memtable可以被删除了?br />如果一个merging compaction是合q所有SSTableC个SSTableQ则q个q程U做major compaction。一ơmajor compaction会将mark成删除的信息、数据删除,而其他两ơcompaction则会保留q些信息、数据(mark的Ş式)。Bigtable会时不时的扫描所有的TabletQƈ对它们做major compaction。这个major compaction可以需要删除的数据真正的删除从而节省空_q保持系l一致性?/span>

SSTable的locality和In Memory

在Bigtable中,它的本地性是由Locality group来定义的Q即多个column family可以l合C个locality group中,在同一个Tablet中,使用单独的SSTable存储q些在同一个locality group的column family。HBase把这个模型简化了Q即每个column family在每个HRegion都用单独的HFile存储QHFile没有locality group的概念,或者一个column family是一个locality group?/span>

在Bigtable中,q可以支持在locality groupU别讄是否所有这个locality group的数据加载到内存中,在HBase中通过column family定义时设|。这个内存加载采用g时加载,主要应用于一些小的column familyQƈ且经常被用到的,从而提升读的性能Q因而这样就不需要再从磁盘中d了?/span>

SSTable压羃

Bigtable的压~是Zlocality groupU别Q?br />
Compression

Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size is controllable via a locality group specific tuning parameter). Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy’s scheme [6], which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. Both compression passes are very fast—they encode at 100–200 MB/s, and decode at 400–1000 MB/s on modern machines.

Bigtable的压~以SSTable中的一个Block为单位,虽然每个Block为压~单位损׃些空_但是采用q种方式Q我们可以以Block为单位读取、解压、分析,而不是每ơ以一?#8220;?#8221;的SSTable为单位读取、解压、分析?/span>

SSTable的读~存

Z提升ȝ性能QBigtable采用两层~存机制Q?br />
Caching for read performance

To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).

两层~存分别是:
1. High LevelQ缓存从SSTabled的Key/Value寏V提升那些們֐重复的读取相同的数据的操作(引用局部性原理)?br />2. Low LevelQBlockCacheQ缓存SSTable中的Block。提升那些們֐于读取相q数据的操作?br />

Bloom Filter

前文有提到Bigtable采用合ƈ读,即需要读取每个SSTable中的相关数据Qƈ合ƈ成一个结果返回,然而每ơ读都需要读取所有SSTableQ自然会耗费性能Q因而引入了Bloom FilterQ它可以很快速的扑ֈ一个RowKey不在某个SSTable中的事实Q注Q反q来则不成立Q?br />
Bloom Filter

As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom fil- ters [7] should be created for SSTables in a particu- lar locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a spec- ified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks re- quired for read operations. Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.

SSTable设计成Immutable的好?/h2>在SSTable定义中就有提到SSTable是一个Immutable的order mapQ这个Immutable的设计可以让pȝ单很多:
Exploiting Immutability

Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable. For example, we do not need any synchronization of accesses to the file system when reading from SSTables. As a result, concurrency control over rows can be implemented very efficiently. The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.

Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. Each tablet’s SSTables are registered in the METADATA table. The master removes obsolete SSTables as a mark-and-sweep garbage collection [25] over the set of SSTables, where the METADATA table contains the set of roots.

Finally, the immutability of SSTables enables us to split tablets quickly. Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.

关于Immutable的优Ҏ以下几点Q?/span>
1. 在读SSTable是不需要同步。读写同步只需要在memtable中处理,Z减少memtable的读写竞争,Bigtablememtable的row设计成copy-on-writeQ从而读写可以同时进行?/span>
2. 怹的移除数据{变ؓSSTable的Garbage Collect。每个Tablet中的SSTable在METADATA表中有注册,master使用mark-and-sweep法SSTable在GCq程中移除?/span>
3. 可以让Tablet Splitq程变的高效Q我们不需要ؓ每个子Tablet创徏新的SSTableQ而是可以׃n?/span>Tablet的SSTable?/span>

DLevin 2015-09-25 01:35 发表评论
]]>深入HBase架构解析Q二Q?/title><link>http://m.tkk7.com/DLevin/archive/2015/08/22/426950.html</link><dc:creator>DLevin</dc:creator><author>DLevin</author><pubDate>Sat, 22 Aug 2015 11:40:00 GMT</pubDate><guid>http://m.tkk7.com/DLevin/archive/2015/08/22/426950.html</guid><wfw:comment>http://m.tkk7.com/DLevin/comments/426950.html</wfw:comment><comments>http://m.tkk7.com/DLevin/archive/2015/08/22/426950.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://m.tkk7.com/DLevin/comments/commentRss/426950.html</wfw:commentRss><trackback:ping>http://m.tkk7.com/DLevin/services/trackbacks/426950.html</trackback:ping><description><![CDATA[<h2> 前言</h2>q是<a href="http://m.tkk7.com/DLevin/archive/2015/08/22/426877.html">《深入HBase架构解析Q一Q?/a>的箋Q不多废话,l箋。。。?br /><h2>HBaseȝ实现</h2>通过前文的描qͼ我们知道在HBase写时Q相同Cell(RowKey/ColumnFamily/Column相同)q不保证在一P甚至删除一个Cell也只是写入一个新的CellQ它含有Delete标记Q而不一定将一个Cell真正删除了,因而这引起了一个问题,如何实现ȝ问题Q要解决q个问题Q我们先来分析一下相同的Cell可能存在的位|:首先Ҏ写入的CellQ它会存在于MemStore中;然后对之前已lFlush到HDFS中的CellQ它会存在于某个或某些StoreFile(HFile)中;最后,对刚dq的CellQ它可能存在于BlockCache中。既然相同的Cell可能存储在三个地方,在读取的时候只需要扫瞄这三个地方Q然后将l果合ƈ卛_(Merge Read)Q在HBase中扫瞄的序依次是:BlockCache、MemStore、StoreFile(HFile)。其中StoreFile的扫瞄先会用Bloom Filterqo那些不可能符合条件的HFileQ然后用Block Index快速定位CellQƈ其加蝲到BlockCache中,然后从BlockCache中读取。我们知道一个HStore可能存在多个StoreFile(HFile)Q此旉要扫瞄多个HFileQ如果HFileq多又是会引h能问题?br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig16.png" height="278" width="769" /><br /><h2>Compaction</h2>MemStore每次Flush会创建新的HFileQ而过多的HFile会引赯的性能问题Q那么如何解册个问题呢QHBase采用Compaction机制来解册个问题,有点cMJava中的GC机制Qv初Java不停的申请内存而不释放Q增加性能Q然而天下没有免费的午餐Q最l我们还是要在某个条件下L集垃圾,很多时候需要Stop-The-WorldQ这UStop-The-World有些时候也会引起很大的问题Q比如参考本人写?a href="http://m.tkk7.com/DLevin/archive/2015/08/01/426418.html">q篇文章</a>Q因而设计是一U权衡,没有完美的。还是类似Java中的GCQ在HBase中Compaction分ؓ两种QMinor Compaction和Major Compaction?br /><ol><li>Minor Compaction是指选取一些小的、相ȝStoreFile他们合q成一个更大的StoreFileQ在q个q程中不会处理已lDeleted或Expired的Cell。一ơMinor Compaction的结果是更少q且更大的StoreFile。(q个是对的吗QBigTable中是q样描述Minor Compaction?span style="font-size: 10.000000pt; font-family: 'Times'">QAs write operations execute, the size of the memtable in- creases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This </span><span style="font-size: 10.000000pt; font-family: 'Times'; font-style: italic">minor compaction </span><span style="font-size: 10.000000pt; font-family: 'Times'">process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incom- ing read and write operations can continue while com- pactions occur. </span>也就是说它将memtable的数据flush的一个HFile/SSTableUCؓ一ơMinor CompactionQ?/li><li>Major Compaction是指所有的StoreFile合ƈ成一个StoreFileQ在q个q程中,标记为Deleted的Cell会被删除Q而那些已lExpired的Cell会被丢弃Q那些已l超q最多版本数的Cell会被丢弃。一ơMajor Compaction的结果是一个HStore只有一个StoreFile存在。Major Compaction可以手动或自动触发,然而由于它会引起很多的IO操作而引h能问题Q因而它一般会被安排在周末、凌晨等集群比较闲的旉?br /></li></ol>更Ş象一点,如下面两张图分别表示Minor Compaction和Major Compaction?br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig18.png" height="329" width="723" /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig19.png" height="339" width="653" /><br /><h2>HRegion Split</h2>最初,一个Table只有一个HRegionQ随着数据写入增加Q如果一个HRegion到达一定的大小Q就需要Split成两个HRegionQ这个大由hbase.hregion.max.filesize指定Q默认ؓ10GB。当splitӞ两个新的HRegion会在同一个HRegionServer中创建,它们各自包含父HRegion一半的数据Q当Split完成后,父HRegion会下U,而新的两个子HRegion会向HMaster注册上线Q处于负载均衡的考虑Q这两个新的HRegion可能会被HMaster分配到其他的HRegionServer中。关于Split的详l信息,可以参考这文章:<a >《Apache HBase Region Splitting and Merging?/a>?br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig21.png" height="361" width="675" /><br /><h2>HRegion负蝲均衡</h2>在HRegion Split后,两个新的HRegion最初会和之前的父HRegion在相同的HRegionServer上,Z负蝲均衡的考虑QHMaster可能会将其中的一个甚至两个重新分配的其他的HRegionServer中,此时会引h些HRegionServer处理的数据在其他节点上,直到下一ơMajor Compaction数据从q端的节点移动到本地节点?br /><br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig22.png" height="358" width="714" /><br /><h2>HRegionServer Recovery</h2>当一台HRegionServer宕机Ӟ׃它不再发送HeartbeatlZooKeeper而被监测刎ͼ此时ZooKeeper会通知HMasterQHMaster会检到哪台HRegionServer宕机Q它宕机的HRegionServer中的HRegion重新分配l其他的HRegionServerQ同时HMaster会把宕机的HRegionServer相关的WAL拆分分配l相应的HRegionServer(拆分出的WAL文g写入对应的目的HRegionServer的WAL目录中,qƈ写入对应的DataNode中)Q从而这些HRegionServer可以Replay分到的WAL来重建MemStore?br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig25.png" height="368" width="708" /><br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig26.png" height="378" width="724" /><br /><h2>HBase架构单ȝ</h2>在NoSQL中,存在著名的CAP理论Q即Consistency、Availability、Partition Tolerance不可全得Q目前市Z基本上的NoSQL都采用Partition Tolerance以实现数据得水^扩展Q来处理Relational DataBase遇到的无法处理数据量太大的问题,或引L性能问题。因而只有剩下C和A可以选择。HBase在两者之间选择了ConsistencyQ然后用多个HMaster以及支持HRegionServer的failure监控、ZooKeeper引入作ؓ协调者等各种手段来解决Availability问题Q然而当|络的Split-Brain(Network Partition)发生Ӟ它还是无法完全解决Availability的问题。从q个角度上,Cassandra选择了AQ即它在|络Split-Brain时还是能正常写,而用其他技术来解决Consistency的问题,如读的时候触发Consistency判断和处理。这是设计上的限制?br /><br />从实C的优点:<br /><ol><li>HBase采用Z致性模型,在一个写q回后,保证所有的读都d相同的数据?/li><li>通过HRegion动态Split和Merge实现自动扩展Qƈ使用HDFS提供的多个数据备份功能,实现高可用性?/li><li>采用HRegionServer和DataNodeq行在相同的服务器上实现数据的本地化Q提升读写性能Qƈ减少|络压力?/li><li>内徏HRegionServer的宕动恢复。采用WAL来Replayq未持久化到HDFS的数据?/li><li>可以无缝的和Hadoop/MapReduce集成?br /></li></ol>实现上的~点Q?br /><ol><li>WAL的Replayq程可能会很慢?/li><li>N恢复比较复杂Q也会比较慢?/li><li>Major Compaction会引起IO Storm?/li><li>。。。?br /></li></ol><h2>参考:</h2> https://www.mapr.com/blog/in-depth-look-hbase-architecture#.VdNSN6Yp3qx<br /> http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable<br /> http://hbase.apache.org/book.html <br /> http://www.searchtb.com/2011/01/understanding-hbase.html <br /> http://research.google.com/archive/bigtable-osdi06.pdf<img src ="http://m.tkk7.com/DLevin/aggbug/426950.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://m.tkk7.com/DLevin/" target="_blank">DLevin</a> 2015-08-22 19:40 <a href="http://m.tkk7.com/DLevin/archive/2015/08/22/426950.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>深入HBase架构解析Q一Q?/title><link>http://m.tkk7.com/DLevin/archive/2015/08/22/426877.html</link><dc:creator>DLevin</dc:creator><author>DLevin</author><pubDate>Sat, 22 Aug 2015 09:44:00 GMT</pubDate><guid>http://m.tkk7.com/DLevin/archive/2015/08/22/426877.html</guid><wfw:comment>http://m.tkk7.com/DLevin/comments/426877.html</wfw:comment><comments>http://m.tkk7.com/DLevin/archive/2015/08/22/426877.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://m.tkk7.com/DLevin/comments/commentRss/426877.html</wfw:commentRss><trackback:ping>http://m.tkk7.com/DLevin/services/trackbacks/426877.html</trackback:ping><description><![CDATA[<h2>前记</h2> 公司内部使用的是MapR版本的Hadoop生态系l,因而从MapR的官|看Cq篇文文章:<a >An In-Depth Look at the HBase Architecture</a>Q原本想译全文Q然而如果翻译就需要各U咬文嚼字,太麻烦,因而本文大部分使用了自q语言Qƈ且加入了其他资源的参考理解以及本p源码时对其的理解Q属于半译、半原创吧?br /> <h2>HBase架构l成</h2> HBase采用Master/Slave架构搭徏集群Q它隶属于Hadoop生态系l,׃下类型节点组成:HMaster节点、HRegionServer节点、ZooKeeper集群Q而在底层Q它数据存储于HDFS中,因而涉及到HDFS的NameNode、DataNode{,Ml构如下Q?br /> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArch1.jpg" height="389" width="603" /><br /> 其中<strong>HMaster节点</strong>用于Q?br /> <ol> <li>理HRegionServerQ实现其负蝲均衡?/li> <li>理和分配HRegionQ比如在HRegion split时分配新的HRegionQ在HRegionServer退出时q移其内的HRegion到其他HRegionServer上?/li> <li>实现DDL操作QData Definition LanguageQnamespace和table的增删改Qcolumn familiy的增删改{)?/li> <li>理namespace和table的元数据Q实际存储在HDFS上)?/li> <li>权限控制QACLQ?/li> </ol> <strong>HRegionServer节点</strong>用于Q?br /> <ol> <li>存放和管理本地HRegion?/li> <li>dHDFSQ管理Table中的数据?/li> <li>Client直接通过HRegionServerd数据Q从HMaster中获取元数据Q找到RowKey所在的HRegion/HRegionServer后)?/li> </ol> <strong>ZooKeeper集群是协调系l?/strong>Q用于:<br /> <ol> <li>存放整个 HBase集群的元数据以及集群的状态信息?/li> <li>实现HMasterM节点的failover?/li> </ol> HBase Client通过RPC方式和HMaster、HRegionServer通信Q一个HRegionServer可以存放1000个HRegionQ底层Table数据存储于HDFS中,而HRegion所处理的数据尽量和数据所在的DataNode在一P实现数据的本地化Q数据本地化q不是总能实现Q比如在HRegionUd(如因Split)Ӟ需要等下一ơCompact才能l箋回到本地化?br /> <br /> 本着半翻译的原则Q再贴一个《An In-Depth Look At The HBase Architecture》的架构图:<br /> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig1.png" height="343" width="632" /><br /> q个架构图比较清晰的表达了HMaster和NameNode都支持多个热备䆾Q用ZooKeeper来做协调QZooKeeperq不是云般神U,它一般由三台机器l成一个集,内部使用PAXOS法支持三台Server中的一台宕机,也有使用五台机器的,此时则可以支持同时两台宕机,既少于半数的宕机Q然而随着机器的增加,它的性能也会下降QRegionServer和DataNode一般会攑֜相同的Server上实现数据的本地化?br /> <h2>HRegion</h2> HBase使用RowKey表水^切割成多个HRegionQ从HMaster的角度,每个HRegion都纪录了它的StartKey和EndKeyQ第一个HRegion的StartKey为空Q最后一个HRegion的EndKey为空Q,׃RowKey是排序的Q因而Client可以通过HMaster快速的定位每个RowKey在哪个HRegion中。HRegion由HMaster分配到相应的HRegionServer中,然后由HRegionServer负责HRegion的启动和理Q和Client的通信Q负责数据的?使用HDFS)。每个HRegionServer可以同时理1000个左右的HRegionQ这个数字怎么来的Q没有从代码中看到限ӞN是出于经验?过1000个会引v性能问题Q?strong>来回{这个问?/strong>Q感觉这?000的数字是从BigTable的论文中来的Q? Implementation节)QEach tablet server manages a set of tablets(typically we have somewhere between ten to a thousand tablets per tablet server)Q?br /> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig2.png" height="337" width="724" /><br /> <h2>HMaster</h2> HMaster没有单点故障问题Q可以启动多个HMasterQ通过ZooKeeper的Master Election机制保证同时只有一个HMasterZActive状态,其他的HMaster则处于热备䆾状态。一般情况下会启动两个HMasterQ非Active的HMaster会定期的和Active HMaster通信以获取其最新状态,从而保证它是实时更新的Q因而如果启动了多个HMaster反而增加了Active HMaster的负担。前文已l介l过了HMaster的主要用于HRegion的分配和理QDDL(Data Definition LanguageQ既Table的新建、删除、修改等)的实现等Q既它主要有两方面的职责Q?br /> <ol> <li>协调HRegionServer <ol> <li>启动时HRegion的分配,以及负蝲均衡和修复时HRegion的重新分配?/li> <li>监控集群中所有HRegionServer的状?通过Heartbeat和监听ZooKeeper中的状??br /> </li> </ol> </li> <li>Admin职能 <ol> <li>创徏、删除、修改Table的定义?br /> </li> </ol> </li> </ol> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig3.png" /><br /> <h2> ZooKeeperQ协调?/h2> ZooKeeper为HBase集群提供协调服务Q它理着HMaster和HRegionServer的状?available/alive{?Qƈ且会在它们宕机时通知lHMasterQ从而HMaster可以实现HMaster之间的failoverQ或对宕机的HRegionServer中的HRegion集合的修?它们分配给其他的HRegionServer)。ZooKeeper集群本n使用一致性协?PAXOS协议)保证每个节点状态的一致性?br /> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig4.png" height="318" width="703" /><br /> <h2>How The Components Work Together</h2> ZooKeeper协调集群所有节点的׃n信息Q在HMaster和HRegionServerq接到ZooKeeper后创建Ephemeral节点Qƈ使用Heartbeat机制l持q个节点的存zȝ态,如果某个Ephemeral节点实效Q则HMaster会收到通知Qƈ做相应的处理?br /> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig5.png" height="329" width="722" /><br /> 另外QHMaster通过监听ZooKeeper中的Ephemeral节点(默认Q?hbase/rs/*)来监控HRegionServer的加入和宕机。在W一个HMasterq接到ZooKeeper时会创徏Ephemeral节点(默认Q?hbasae/master)来表CActive的HMasterQ其后加q来的HMaster则监听该Ephemeral节点Q如果当前Active的HMaster宕机Q则该节Ҏ失,因而其他HMaster得到通知Q而将自n转换成Active的HMasterQ在变ؓActive的HMaster之前Q它会创建在/hbase/back-masters/下创qEphemeral节点?br /> <h3> HBase的第一ơ读?/h3> 在HBase 0.96以前QHBase有两个特D的TableQ?ROOT-?META.Q如<a >BigTable</a>中的设计Q,其中-ROOT- Table的位|存储在ZooKeeperQ它存储?META. Table的RegionInfo信息Qƈ且它只能存在一个HRegionQ?META. Table则存储了用户Table的RegionInfo信息Q它可以被切分成多个HRegionQ因而对W一ơ访问用户TableӞ首先从ZooKeeper中读?ROOT- Table所在HRegionServerQ然后从该HRegionServer中根据请求的TableNameQRowKeyd.META. Table所在HRegionServerQ最后从该HRegionServer中读?META. Table的内容而获取此ơ请求需要访问的HRegion所在的位置Q然后访问该HRegionSever获取h的数据,q需要三ơ请求才能找到用户Table所在的位置Q然后第四次h开始获取真正的数据。当然ؓ了提升性能Q客L会缓?ROOT- Table位置以及-ROOT-/.META. Table的内宏V如下图所C:<br /> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/image0030.jpg" height="228" width="399" /><br /> 可是即客户端有~存Q在初始阶段需要三ơ请求才能直到用户Table真正所在的位置也是性能低下的,而且真的有必要支持那么多的HRegion吗?或许对Googleq样的公司来说是需要的Q但是对一般的集群来说好像q没有这个必要。在BigTable的论文中_每行METADATA存储1KB左右数据Q中{大的Tablet(HRegion)?28MB左右Q?层位|的Schema设计可以支持2^34个Tablet(HRegion)。即使去?ROOT- TableQ也q可以支?^17(131072)个HRegionQ?如果每个HRegionq是128MBQ那是16TBQ这个貌g够大Q但是现在的HRegion的最大大都会设|的比较大,比如我们讄?GBQ此时支持的大小则变成了4PBQ对一般的集群来说已经够了Q因而在HBase 0.96以后L?ROOT- TableQ只剩下q个Ҏ的目录表叫做Meta Table(hbase:meta)Q它存储了集中所有用户HRegion的位|信息,而ZooKeeper的节点中(/hbase/meta-region-server)存储的则直接是这个Meta Table的位|,q且q个Meta Table如以前的-ROOT- Table一h不可split的。这P客户端在W一ơ访问用户Table的流E就变成了:<br /> <ol> <li>从ZooKeeper(/hbase/meta-region-server)中获取hbase:meta的位|(HRegionServer的位|)Q缓存该位置信息?/li> <li>从HRegionServer中查询用户Table对应h的RowKey所在的HRegionServerQ缓存该位置信息?/li> <li>从查询到HRegionServer中读取Row?/li> </ol> 从这个过E中Q我们发现客户会~存q些位置信息Q然而第二步它只是缓存当前RowKey对应的HRegion的位|,因而如果下一个要查的RowKey不在同一个HRegion中,则需要l查询hbase:meta所在的HRegionQ然而随着旉的推U,客户端缓存的位置信息来多Q以至于不需要再ơ查找hbase:meta Table的信息,除非某个HRegion因ؓ宕机或Split被移动,此时需要重新查询ƈ且更新缓存?br /> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig6.png" height="356" width="590" /><br /> <h3> hbase:meta?/h3> hbase:meta表存储了所有用户HRegion的位|信息,它的RowKey是:tableName,regionStartKey,regionId,replicaId{,它只有info列族Q这个列族包含三个列Q他们分别是Qinfo:regioninfo列是RegionInfo的proto格式QregionId,tableName,startKey,endKey,offline,split,replicaIdQinfo:server格式QHRegionServer对应的server:portQinfo:serverstartcode格式是HRegionServer的启动时间戳?br /> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig7.png" height="362" width="736" /><br /> <h2>HRegionServer详解</h2> HRegionServer一般和DataNode在同一台机器上q行Q实现数据的本地性。HRegionServer包含多个HRegionQ由WAL(HLog)、BlockCache、MemStore、HFilel成?br /> <ol> <li><strong>WAL即Write Ahead Log</strong>Q在早期版本中称为HLogQ它是HDFS上的一个文Ӟ如其名字所表示的,所有写操作都会先保证将数据写入q个Log文g后,才会真正更新MemStoreQ最后写入HFile中。采用这U模式,可以保证HRegionServer宕机后,我们依然可以从该Log文g中读取数据,Replay所有的操作Q而不至于数据丢失。这个Log文g会定期Roll出新的文件而删除旧的文?那些已持久化到HFile中的Log可以删除)。WAL文g存储?hbase/WALs/${HRegionServer_Name}的目录中(?.94之前Q存储在/hbase/.logs/目录?Q一般一个HRegionServer只有一个WAL实例Q也是说一个HRegionServer的所有WAL写都是串行的(像log4j的日志写也是串行?Q这当然会引h能问题Q因而在HBase 1.0之后Q通过<a >HBASE-5699</a>实现了多个WALq行?MultiWAL)Q该实现采用HDFS的多个管道写Q以单个HRegion为单位。关于WAL可以参考Wikipedia?a >Write-Ahead Logging</a>。顺便吐槽一句,英文版的l基癄竟然能毫无压力的正常讉K了,q是某个GFW的疏忽还是以后的常态?</li> <li><strong>BlockCache是一个读~存</strong>Q即“引用局部?#8221;原理Q也应用于CPUQ?a >分空间局部性和旉局部?/a>Q空间局部性是指CPU在某一时刻需要某个数据,那么有很大的概率在一下时d需要的数据在其附近Q时间局部性是指某个数据在被访问过一ơ后Q它有很大的概率在不久的来会被再次的访问)Q将数据预读取到内存中,以提升读的性能。HBase中提供两UBlockCache的实玎ͼ默认on-heap LruBlockCache和BucketCache(通常是off-heap)。通常BucketCache的性能要差于LruBlockCacheQ然而由于GC的媄响,LruBlockCache的gq会变的不稳定,而BucketCache׃是自q理BlockCacheQ而不需要GCQ因而它的gq通常比较E_Q这也是有些时候需要选用BucketCache的原因。这文?a >BlockCache101</a>对on-heap和off-heap的BlockCache做了详细的比较?/li><strong> </strong><li><strong>HRegion是一个Table中的一个Region在一个HRegionServer中的表达</strong>。一个Table可以有一个或多个RegionQ他们可以在一个相同的HRegionServer上,也可以分布在不同的HRegionServer上,一个HRegionServer可以有多个HRegionQ他们分别属于不同的Table。HRegion由多个Store(HStore)构成Q每个HStore对应了一个Table在这个HRegion中的一个Column FamilyQ即每个Column Family是一个集中的存储单元Q因而最好将h相近IOҎ的Column存储在一个Column FamilyQ以实现高效d(数据局部性原理,可以提高~存的命中率)。HStore是HBase中存储的核心Q它实现了读写HDFS功能Q一个HStore׃个MemStore ?个或多个StoreFilel成?br /> <ol> <li><strong>MemStore是一个写~存</strong>(In Memory Sorted Buffer)Q所有数据的写在完成WAL日志写后Q会 写入MemStore中,由MemStoreҎ一定的法数据Flush到地层HDFS文g?HFile)Q通常每个HRegion中的每个 Column Family有一个自qMemStore?/li> <li><strong>HFile(StoreFile) 用于存储HBase的数?Cell/KeyValue)</strong>。在HFile中的数据是按RowKey、Column Family、Column排序Q对相同的Cell(卌三个值都一?Q则按timestamp倒序排列?/li> </ol> </li> </ol> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig8.png" /><br /> 虽然上面q张囑ֱ现的是最新的HRegionServer的架?但是q不是那么的_)Q但是我一直比较喜Ƣ看以下q张图,即它展现的应该?.94以前的架构?br /> <img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/image0060.jpg" height="347" width="553" /><br /> <h3> HRegionServer中数据写程图解</h3> 当客L发v一个PuthӞ首先它从hbase:meta表中查出该Put数据最l需要去的HRegionServer。然后客LPuth发送给相应的HRegionServerQ在HRegionServer中它首先会将该Put操作写入WAL日志文g?Flush到磁盘中)?br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig9.png" height="363" width="716" /><br /> 写完WAL日志文g后,HRegionServerҎPut中的TableName和RowKey扑ֈ对应的HRegionQƈҎColumn Family扑ֈ对应的HStoreQƈPut写入到该HStore的MemStore中。此时写成功Qƈq回通知客户端?br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig10.png" height="298" width="664" /><br /><h3>MemStore Flush<br /></h3>MemStore是一个In Memory Sorted BufferQ在每个HStore中都有一个MemStoreQ即它是一个HRegion的一个Column Family对应一个实例。它的排列顺序以RowKey、Column Family、Column的顺序以及Timestamp的倒序Q如下所C:<br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig11.png" height="351" width="719" /><br />每一ơPut/Deleteh都是先写入到MemStore中,当MemStore满后会Flush成一个新的StoreFile(底层实现是HFile)Q即一个HStore(Column Family)可以?个或多个StoreFile(HFile)。有以下三种情况可以触发MemStore的Flush动作Q?strong>需要注意的是MemStore的最Flush单元是HRegion而不是单个MemStore</strong>。据说这是Column Family有个数限制的其中一个原因,估计是因为太多的Column Family一起Flush会引h能问题Q具体原因有待考证?br /><ol><li>当一个HRegion中的所有MemStore的大d过了hbase.hregion.memstore.flush.size的大,默认128MB。此时当前的HRegion中所有的MemStore会Flush到HDFS中?/li><li>当全局MemStore的大超q了hbase.regionserver.global.memstore.upperLimit的大,默认40Q的内存使用量。此时当前HRegionServer中所有HRegion中的MemStore都会Flush到HDFS中,Flush序是MemStore大小的倒序Q一个HRegion中所有MemStored作ؓ该HRegion的MemStore的大还是选取最大的MemStore作ؓ参考?有待考证Q,直到M的MemStore使用量低于hbase.regionserver.global.memstore.lowerLimitQ默?8%的内存用量?/li><li>当前HRegionServer中WAL的大超q了hbase.regionserver.hlog.blocksize * hbase.regionserver.max.logs的数量,当前HRegionServer中所有HRegion中的MemStore都会Flush到HDFS中,Flush使用旉序Q最早的MemStore先Flush直到WAL的数量少于hbase.regionserver.hlog.blocksize * hbase.regionserver.max.logs?a >q里</a>说这两个怹的默认大是2GBQ查代码Qhbase.regionserver.max.logs默认值是32Q而hbase.regionserver.hlog.blocksize是HDFS的默认blocksizeQ?2MB。但不管怎么P因ؓq个大小过限制引v的Flush不是一件好事,可能引v长时间的延迟Q因而这文章给的徏议:“<strong style="color: #339966; font-family: STHeiti; font-size: medium; font-style: normal; font-variant: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px; -webkit-text-stroke-width: 0px;">Hint</strong><span style="color: #339966; font-family: STHeiti; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px; -webkit-text-stroke-width: 0px; display: inline !important; float: none;">: keep hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs just a bit above hbase.regionserver.global.memstore.lowerLimit * HBASE_HEAPSIZE.</span>”。ƈ且需要注意,<a >q里</a>l的描述是有错的(虽然它是官方的文??br /></li></ol>在MemStore Flushq程中,q会在尾部追加一些meta数据Q其中就包括Flush时最大的WAL sequence|以告诉HBaseq个StoreFile写入的最新数据的序列Q那么在Recover时就直到从哪里开始。在HRegion启动Ӟq个sequence会被dQƈ取最大的作ؓ下一ơ更新时的v始sequence?br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig12.png" height="248" width="622" /><br /><h2> HFile格式</h2>HBase的数据以KeyValue(Cell)的Ş式顺序的存储在HFile中,在MemStore的Flushq程中生成HFileQ由于MemStore中存储的Cell遵@相同的排列顺序,因而Flushq程是顺序写Q我们直到磁盘的序写性能很高Q因Z需要不停的Ud盘指针?br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig13.png" height="351" width="698" /><br />HFile参考BigTable的SSTable和Hadoop?a >TFile</a>实现Q从HBase开始到现在QHFilel历了三个版本,其中V2?.92引入QV3?.98引入。首先我们来看一下V1的格式:<br /><img src="http://m.tkk7.com/images/blogjava_net/dlevin/image0080.jpg" alt="" height="160" border="0" width="554" /><br />V1的HFile由多个Data Block、Meta Block、FileInfo、Data Index、Meta Index、Trailerl成Q其中Data Block是HBase的最存储单元,在前文中提到的BlockCache是ZData Block的缓存的。一个Data Block׃个魔数和一pd的KeyValue(Cell)l成Q魔数是一个随机的数字Q用于表C是一个Data BlockcdQ以快速监这个Data Block的格式,防止数据的破坏。Data Block的大可以在创徏Column Family时设|?HColumnDescriptor.setBlockSize())Q默认值是64KBQ大LBlock有利于顺序ScanQ小号Block利于随机查询Q因而需要权衡。Meta块是可选的QFileInfo是固定长度的块,它纪录了文g的一些Meta信息Q例如:AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, MAX_SEQ_ID_KEY{。Data Index和Meta IndexU录了每个Data块和Meta块的其实炏V未压羃时大、Key(起始RowKeyQ?{。TrailerU录了FileInfo、Data Index、Meta Index块的起始位置QData Index和Meta Index索引的数量等。其中FileInfo和Trailer是固定长度的?br /><br />HFile里面的每个KeyValue对就是一个简单的byte数组。但是这个byte数组里面包含了很多项Qƈ且有固定的结构。我们来看看里面的具体结构:<br /><img src="http://m.tkk7.com/images/blogjava_net/dlevin/image0090.jpg" alt="" height="93" border="0" width="553" /><br />开始是两个固定长度的数|分别表示Key的长度和Value的长度。紧接着是KeyQ开始是固定长度的数|表示RowKey的长度,紧接着? RowKeyQ然后是固定长度的数|表示Family的长度,然后是FamilyQ接着是QualifierQ然后是两个固定长度的数|表示Time Stamp和Key TypeQPut/DeleteQ。Value部分没有q么复杂的结构,是Ua的二q制数据了?strong>随着HFile版本q移QKeyValue(Cell)的格式ƈ未发生太多变化,只是在V3版本Q尾部添加了一个可选的Tag数组</strong>?br /> <br />HFileV1版本的在实际使用q程中发现它占用内存多,q且Bloom File和Block Index会变的很大,而引起启动时间变ѝ其中每个HFile的Bloom Filter可以增长?00MBQ这在查询时会引h能问题Q因为每ơ查询时需要加载ƈ查询Bloom FilterQ?00MB的Bloom Filer会引起很大的延迟Q另一个,Block Index在一个HRegionServer可能会增长到d6GBQHRegionServer在启动时需要先加蝲所有这些Block IndexQ因而增加了启动旉。ؓ了解册些问题,?.92版本中引入HFileV2版本Q?br /><img src="http://m.tkk7.com/images/blogjava_net/dlevin/hfilev2.png" alt="" height="418" border="0" width="566" /><br />在这个版本中QBlock Index和Bloom FilterdCData Block中间Q而这U设计同时也减少了写的内存用量Q另外,Z提升启动速度Q在q个版本中还引入了gq读的功能,卛_HFile真正被用时才对其进行解析?br /><br />FileV3版本基本和V2版本相比Qƈ没有太大的改变,它在KeyValue(Cell)层面上添加了Tag数组的支持;q在FileInfol构中添加了和Tag相关的两个字Dc关于具体HFile格式演化介绍Q可以参?a >q里</a>?br /><br />对HFileV2格式具体分析Q它是一个多层的cB+树烦引,采用q种设计Q可以实现查找不需要读取整个文Ӟ<br /><img alt="" src="http://m.tkk7.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig14.png" height="349" width="688" /><br />Data Block中的Cell都是升序排列Q每个block都有它自qLeaf-IndexQ每个Block的最后一个Key被放入Intermediate-Index中,Root-Index指向Intermediate-Index。在HFile的末还有Bloom Filter用于快速定位那么没有在某个Data Block中的RowQTimeRange信息用于l那些用时间查询的参考。在HFile打开Ӟq些索引信息都被加蝲q保存在内存中,以增加以后的d性能?br /><br />q篇先写到q里Q未完待l。。。?br /><br /> <h2>参考:</h2> https://www.mapr.com/blog/in-depth-look-hbase-architecture#.VdNSN6Yp3qx<br /> http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable<br /> http://hbase.apache.org/book.html <br /> http://www.searchtb.com/2011/01/understanding-hbase.html <br /> http://research.google.com/archive/bigtable-osdi06.pdf<img src ="http://m.tkk7.com/DLevin/aggbug/426877.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://m.tkk7.com/DLevin/" target="_blank">DLevin</a> 2015-08-22 17:44 <a href="http://m.tkk7.com/DLevin/archive/2015/08/22/426877.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss> <footer> <div class="friendship-link"> <p>лǵվܻԴȤ</p> <a href="http://m.tkk7.com/" title="亚洲av成人片在线观看">亚洲av成人片在线观看</a> <div class="friend-links"> </div> </div> </footer> վ֩ģ壺 <a href="http://yhanalati.com" target="_blank">97ѹۿƵ߹ۿ</a>| <a href="http://wangdei.com" target="_blank">Ƶ97Ӱ</a>| <a href="http://qixiresort.com" target="_blank">ձxxxx</a>| <a href="http://988938.com" target="_blank">þþþAvר</a>| <a href="http://cndianxian.com" target="_blank">պѸƬ</a>| <a href="http://lemonbt.com" target="_blank">þþþAVվ</a>| <a href="http://fenxue520.com" target="_blank">98ƷȫѹۿƵ</a>| <a href="http://ztxfkj.com" target="_blank">޹ɫƵ</a>| <a href="http://173ba.com" target="_blank">ƷѲ</a>| <a href="http://misiranim.com" target="_blank">Ļ</a>| <a href="http://yinyinai155.com" target="_blank">ĻmvѸƵ7</a>| <a href="http://eoeoyui.com" target="_blank">ۺһƷ</a>| <a href="http://5tww.com" target="_blank">Ʒ˸</a>| <a href="http://6wss.com" target="_blank">ȫѵһëƬ</a>| <a href="http://yy1288.com" target="_blank">츾AV߲</a>| <a href="http://963315.com" target="_blank">AVƬ</a>| <a href="http://wwkk3.com" target="_blank">һ߹ۿ</a>| <a href="http://tv695.com" target="_blank">ձһձһ岻</a>| <a href="http://tianwu520.com" target="_blank">Ծ޾ƷAAƬ߲</a>| <a href="http://xfmkt.com" target="_blank">ٸ</a>| <a href="http://www-8908.com" target="_blank">߹ۿ</a>| <a href="http://bznys.com" target="_blank">þþþAVƬ</a>| <a href="http://maomi90.com" target="_blank">Ѹ弤Ƶ</a>| <a href="http://shunfk.com" target="_blank">ɫƵ</a>| <a href="http://7uj3.com" target="_blank">˳ӰԺ</a>| <a href="http://yctbhb.com" target="_blank">ˬִ̼߳</a>| <a href="http://ahzlgj.com" target="_blank">ҹƵ</a>| <a href="http://gjwlgzs.com" target="_blank">ۺƵ</a>| <a href="http://hgbookvip.com" target="_blank">ձѸƵ</a>| <a href="http://4794d.com" target="_blank">պAVһ</a>| <a href="http://963315.com" target="_blank">Ůһ</a>| <a href="http://wwwdf221.com" target="_blank">һƵѹۿ</a>| <a href="http://lzlcp.com" target="_blank">ѹۿ</a>| <a href="http://kypbuy.com" target="_blank">2020þþƷۺһ</a>| <a href="http://ulihix.com" target="_blank">޵һɫַ</a>| <a href="http://xsdggzs.com" target="_blank">91ʪ</a>| <a href="http://imfakaixin.com" target="_blank">Ʒ޸һ</a>| <a href="http://66keke.com" target="_blank">AVһ </a>| <a href="http://cuuka.com" target="_blank">޹Ʒ߹ۿ</a>| <a href="http://8mav958.com" target="_blank">ר˿ŵƵ</a>| <a href="http://fenxiangceo.com" target="_blank">ȫƵaëƬ</a>| <script> (function(){ var bp = document.createElement('script'); var curProtocol = window.location.protocol.split(':')[0]; if (curProtocol === 'https') { bp.src = 'https://zz.bdstatic.com/linksubmit/push.js'; } else { bp.src = 'http://push.zhanzhang.baidu.com/push.js'; } var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(bp, s); })(); </script> </body>