    ivaneeo's blog

    The power of freedom, a life of freedom.


    Hive HBase Integration

    Introduction

    This page documents the Hive/HBase integration support originally introduced in HIVE-705. This feature allows Hive QL statements to access HBase tables for both reads (SELECT) and writes (INSERT). It is even possible to combine access to HBase tables with native Hive tables via joins and unions.

    A presentation is available from the HBase HUG10 Meetup.

    This feature is a work in progress, and suggestions for its improvement are very welcome.

    Storage Handlers

    Before proceeding, please read Hive/StorageHandlers for an overview of the generic storage handler framework on which HBase integration depends.

    Usage

    The storage handler is built as an independent module, hive_hbase-handler.jar, which must be available on the Hive client auxpath, along with the HBase and Zookeeper jars. It also requires the correct configuration property to be set in order to connect to the right HBase master. See the HBase documentation for how to set up an HBase cluster.

    Here's an example using CLI from a source build environment, targeting a single-node HBase server:

    $HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive_hbase-handler.jar,$HIVE_SRC/build/dist/lib/hbase-0.20.3.jar,$HIVE_SRC/hbase-handler/build/dist/lib/zookeeper-3.2.2.jar -hiveconf hbase.master=hbase.yoyodyne.com:60000

    Here's an example which instead targets a distributed HBase cluster where a quorum of 3 zookeepers is used to elect the HBase master:

    $HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive_hbase-handler.jar,$HIVE_SRC/build/dist/lib/hbase-0.20.3.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.2.2.jar -hiveconf hbase.zookeeper.quorum=zk1.yoyodyne.com,zk2.yoyodyne.com,zk3.yoyodyne.com

    The handler requires Hadoop 0.20 or higher, and has only been tested with dependency versions hadoop-0.20.0, hbase-0.20.3 and zookeeper-3.2.2. If you are not using hbase-0.20.3, you will need to rebuild the handler with the HBase jar matching your version, and change the --auxpath above accordingly. Failure to use matching versions will lead to misleading connection failures such as MasterNotRunningException since the HBase RPC protocol changes often.

    In order to create a new HBase table which is to be managed by Hive, use the STORED BY clause on CREATE TABLE:

    CREATE TABLE hbase_table_1(key int, value string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    TBLPROPERTIES ("hbase.table.name" = "xyz");

    The hbase.columns.mapping property is required and will be explained in the next section. The hbase.table.name property is optional; it controls the name of the table as known by HBase, and allows the Hive table to have a different name. In this example, the table is known as hbase_table_1 within Hive, and as xyz within HBase. If not specified, then the Hive and HBase table names will be identical.

    After executing the command above, you should be able to see the new (empty) table in the HBase shell:

    $ hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Version: 0.20.3, r902334, Mon Jan 25 13:13:08 PST 2010
    hbase(main):001:0> list
    xyz
    1 row(s) in 0.0530 seconds
    hbase(main):002:0> describe "xyz"
    DESCRIPTION                                                             ENABLED
    {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE', VE true
    RSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =>
    'false', BLOCKCACHE => 'true'}]}
    1 row(s) in 0.0220 seconds
    hbase(main):003:0> scan "xyz"
    ROW                          COLUMN+CELL
    0 row(s) in 0.0060 seconds

    Notice that even though a column name "val" is specified in the mapping, only the column family name "cf1" appears in the DESCRIBE output in the HBase shell. This is because in HBase, only column families (not columns) are known in the table-level metadata; column names within a column family are only present at the per-row level.
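    This split between table-level and per-row metadata can be pictured with a tiny model (purely illustrative, not an HBase API; the names mirror the example above):

    ```python
    # Illustrative model of HBase metadata: the table schema records only
    # column families; column qualifiers such as "val" exist only inside
    # individual cells, per row.
    schema = {"xyz": {"families": ["cf1"]}}     # roughly what `describe` reports
    rows = {"98": {("cf1", "val"): "val_98"}}   # (family, qualifier) -> cell value

    assert "val" not in schema["xyz"]["families"]  # not table-level metadata
    assert ("cf1", "val") in rows["98"]            # but present in the row
    ```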

    Here's how to move data from Hive into the HBase table (see Hive/GettingStarted for how to create the example table pokes in Hive first):

    INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;

    Use HBase shell to verify that the data actually got loaded:

    hbase(main):009:0> scan "xyz"
    ROW                          COLUMN+CELL
    98                          column=cf1:val, timestamp=1267737987733, value=val_98
    1 row(s) in 0.0110 seconds

    And then query it back via Hive:

    hive> select * from hbase_table_1;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    ...
    OK
    98      val_98
    Time taken: 4.582 seconds

    Inserting large amounts of data may be slow due to WAL overhead; if you would like to disable this, make sure you have HIVE-1383, and then issue this command before the INSERT:

    set hive.hbase.wal.enabled=false;

    Warning: disabling WAL may lead to data loss if an HBase failure occurs, so only use this if you have some other recovery strategy available.

    If you want to give Hive access to an existing HBase table, use CREATE EXTERNAL TABLE:

    CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
    TBLPROPERTIES("hbase.table.name" = "some_existing_table");

    Again, hbase.columns.mapping is required (and will be validated against the existing HBase table's column families), whereas hbase.table.name is optional.

    Column Mapping

    The column mapping support currently available is somewhat cumbersome and restrictive:

    • for each Hive column, the table creator must specify a corresponding entry in the comma-delimited hbase.columns.mapping string (so a Hive table with n columns needs a mapping string with n entries); do not put whitespace between entries, since it will be interpreted as part of the column name, which is almost certainly not what you want

    • a mapping entry must be either :key or of the form column-family-name:[column-name]

    • there must be exactly one :key mapping (we don't support compound keys yet)

    • (note that before HIVE-1228, :key was not supported, and the first Hive column implicitly mapped to the key; as of HIVE-1228, it is now strongly recommended that you always specify the key explicitly; we will drop support for implicit key mapping in the future)

    • if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns
    • there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
    • since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column
    • it is not necessary to reference every HBase column family, but those that are not mapped will be inaccessible via the Hive table; it's possible to map multiple Hive tables to the same HBase table
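    The rules above can be sketched as a small validator. This is an illustrative sketch only, not Hive's actual parser; the error messages and the decision to reject (rather than keep) surrounding whitespace are assumptions:

    ```python
    def parse_columns_mapping(mapping):
        """Split an hbase.columns.mapping string per the rules above."""
        entries = mapping.split(",")
        if entries.count(":key") != 1:
            raise ValueError("exactly one :key mapping is required")
        parsed = []
        for entry in entries:
            if entry != entry.strip():
                # Hive would treat the whitespace as part of the column name,
                # which is almost never intended, so this sketch rejects it.
                raise ValueError("whitespace in mapping entry: %r" % entry)
            if entry == ":key":
                parsed.append((":key", None))
            elif ":" in entry:
                family, column = entry.split(":", 1)
                if not family:
                    raise ValueError("missing column family in %r" % entry)
                # Empty column name: the Hive column maps to the whole family
                # and must be declared with the MAP datatype.
                parsed.append((family, column or None))
            else:
                raise ValueError("entry must be :key or family:[column]: %r" % entry)
        return parsed
    ```

    For example, `parse_columns_mapping(":key,cf1:val")` yields `[(":key", None), ("cf1", "val")]`, while a mapping with no :key entry raises an error.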

    The next few sections provide detailed examples of the kinds of column mappings currently possible.

    Multiple Columns and Families

    Here's an example with three Hive columns and two HBase column families, with two of the Hive columns (value1 and value2) corresponding to one of the column families (a, with HBase column names b and c), and the other Hive column corresponding to a single column (e) in its own column family (d).

    CREATE TABLE hbase_table_1(key int, value1 string, value2 int, value3 int)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = ":key,a:b,a:c,d:e"
    );
    INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar, foo+1, foo+2
    FROM pokes WHERE foo=98 OR foo=100;

    Here's how this looks in HBase:

    hbase(main):014:0> describe "hbase_table_1"
    DESCRIPTION                                                             ENABLED
    {NAME => 'hbase_table_1', FAMILIES => [{NAME => 'a', COMPRESSION => 'N true
    ONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_M
    EMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'd', COMPRESSION =>
    'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN
    _MEMORY => 'false', BLOCKCACHE => 'true'}]}
    1 row(s) in 0.0170 seconds
    hbase(main):015:0> scan "hbase_table_1"
    ROW                          COLUMN+CELL
    100                         column=a:b, timestamp=1267740457648, value=val_100
    100                         column=a:c, timestamp=1267740457648, value=101
    100                         column=d:e, timestamp=1267740457648, value=102
    98                          column=a:b, timestamp=1267740457648, value=val_98
    98                          column=a:c, timestamp=1267740457648, value=99
    98                          column=d:e, timestamp=1267740457648, value=100
    2 row(s) in 0.0240 seconds

    And when queried back into Hive:

    hive> select * from hbase_table_1;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    ...
    OK
    100     val_100 101     102
    98      val_98  99      100
    Time taken: 4.054 seconds

    Hive MAP to HBase Column Family

    Here's how a Hive MAP datatype can be used to access an entire column family. Each row can have a different set of columns, where the column names correspond to the map keys and the column values correspond to the map values.

    CREATE TABLE hbase_table_1(value map<string,int>, row_key int)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = "cf:,:key"
    );
    INSERT OVERWRITE TABLE hbase_table_1 SELECT map(bar, foo), foo FROM pokes
    WHERE foo=98 OR foo=100;

    (This example also demonstrates using a Hive column other than the first as the HBase row key.)

    Here's how this looks in HBase (with different column names in different rows):

    hbase(main):012:0> scan "hbase_table_1"
    ROW                          COLUMN+CELL
    100                         column=cf:val_100, timestamp=1267739509194, value=100
    98                          column=cf:val_98, timestamp=1267739509194, value=98
    2 row(s) in 0.0080 seconds

    And when queried back into Hive:

    hive> select * from hbase_table_1;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    ...
    OK
    {"val_100":100} 100
    {"val_98":98}   98
    Time taken: 3.808 seconds

    Note that the key of the MAP must have datatype string, since it is used for naming the HBase column, so the following table definition will fail:

    CREATE TABLE hbase_table_1(key int, value map<int,int>)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = ":key,cf:"
    );
    FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: hbase column family 'cf:' should be mapped to map<string,?> but is mapped to map<int,int>)

    Illegal: Hive Primitive to HBase Column Family

    Table definitions such as the following are illegal because a Hive column mapped to an entire column family must have MAP type:

    CREATE TABLE hbase_table_1(key int, value string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = ":key,cf:"
    );
    FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: hbase column family 'cf:' should be mapped to map<string,?> but is mapped to string)

    Key Uniqueness

    One subtle difference between HBase tables and Hive tables is that HBase tables have a unique key, whereas Hive tables do not. When multiple rows with the same key are inserted into HBase, only one of them is stored (the choice is arbitrary, so do not rely on HBase to pick the right one). This is in contrast to Hive, which is happy to store multiple rows with the same key and different values.

    For example, the pokes table contains rows with duplicate keys. If it is copied into another Hive table, the duplicates are preserved:

    CREATE TABLE pokes2(foo INT, bar STRING);
    INSERT OVERWRITE TABLE pokes2 SELECT * FROM pokes;
    -- this will return 3
    SELECT COUNT(1) FROM pokes WHERE foo=498;
    -- this will also return 3
    SELECT COUNT(1) FROM pokes2 WHERE foo=498;

    But in HBase, the duplicates are silently eliminated:

    CREATE TABLE pokes3(foo INT, bar STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = ":key,cf:bar"
    );
    INSERT OVERWRITE TABLE pokes3 SELECT * FROM pokes;
    -- this will return 1 instead of 3
    SELECT COUNT(1) FROM pokes3 WHERE foo=498;

    Overwrite

    Another difference to note between HBase tables and other Hive tables is that when INSERT OVERWRITE is used, existing rows are not deleted from the table. However, existing rows are overwritten if they have keys which match new rows.
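    Both behaviors described in the last two sections (duplicate keys collapsing, and INSERT OVERWRITE merging rather than truncating) can be modeled by treating an HBase table as a dict keyed by row key. This is a semantic sketch only; real HBase stores versioned cells, and its choice among duplicates in a single load is arbitrary:

    ```python
    def insert_overwrite(table, rows):
        """rows: iterable of (key, value) pairs. Rows with matching keys are
        overwritten; other existing rows are left in place (not deleted)."""
        for key, value in rows:
            table[key] = value
        return table

    hbase_table = {}
    insert_overwrite(hbase_table, [(498, "val_498a"), (498, "val_498b")])
    # Only one row survives for key 498 (last write wins in this model).
    assert len(hbase_table) == 1

    insert_overwrite(hbase_table, [(500, "val_500")])
    # The existing row for key 498 is not deleted by the second "overwrite".
    assert sorted(hbase_table) == [498, 500]
    ```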

    Potential Followups

    There are a number of areas where Hive/HBase integration could definitely use more love:

    • more flexible column mapping (HIVE-806, HIVE-1245)
    • default column mapping in cases where no mapping spec is given
    • filter pushdown and indexing (see Hive/FilterPushdownDev and Hive/IndexDev)

    • expose timestamp attribute, possibly also with support for treating it as a partition key
    • allow per-table hbase.master configuration
    • run profiler and minimize any per-row overhead in column mapping
    • user defined routines for lookups and data loads via HBase client API (HIVE-758 and HIVE-791)
    • logging is very noisy, with a lot of spurious exceptions; investigate these and either fix their cause or squelch them

    Build

    Code for the storage handler is located under hive/trunk/hbase-handler. The Hive build automatically enables the storage handler build for hadoop.version=0.20.x, and disables it for any other Hadoop version. This behavior can be overridden by setting the ant property hbase.enabled to either true or false.

    HBase and Zookeeper dependencies are currently checked in under hbase-handler/lib. We will convert this to use Ivy instead once the corresponding POM's are available.

    Tests

    Class-level unit tests are provided under hbase-handler/src/test/org/apache/hadoop/hive/hbase.

    Positive QL tests are under hbase-handler/src/test/queries. These use an HBase+Zookeeper mini-cluster for hosting the fixture tables in-process, so no free-standing HBase installation is needed in order to run them. To avoid failures due to port conflicts, don't try to run these tests on the same machine where a real HBase master or zookeeper is running.

    The QL tests can be executed via ant like this:

    ant test -Dtestcase=TestHBaseCliDriver -Dqfile=hbase_queries.q

    An Eclipse launch template remains to be defined.

    • For information on how to bulk load data from Hive into HBase, see Hive/HBaseBulkLoad.

    • For another project which adds SQL-like query language support on top of HBase, see HBQL (unrelated to Hive).

    Acknowledgements

    • Primary credit for this feature goes to Samuel Guo, who did most of the development work in the early drafts of the patch
    Posted on 2011-01-21 18:14 by ivaneeo (1651 reads, 0 comments)