The Data Import Handler Framework

Solr includes a very popular contrib module for importing data known as the DataImportHandler (DIH for short). It's a data processing pipeline built specifically for Solr. Here's a summary of notable capabilities:

•    Imports data from databases through JDBC (Java Database Connectivity)
    ° Supports importing only changed records, assuming a last-updated date
•    Imports data from a URL (HTTP GET)
•    Imports data from files (that is, it crawls files)
•    Imports e-mail from an IMAP server, including attachments
•    Supports combining data from different sources
•    Extracts text and metadata from rich document formats
•    Applies XSLT transformations and XPath extraction on XML data
•    Includes a diagnostic/development tool

The DIH is not considered a core part of Solr, even though it comes with the Solr download, so you must add its Java JAR files to your Solr setup to use it. If this isn't done, you'll eventually see a ClassNotFoundException error. The DIH's JAR files are located in Solr's dist directory: apache-solr-dataimporthandler-3.4.0.jar and apache-solr-dataimporthandler-extras-3.4.0.jar. The easiest way to add JAR files to a Solr configuration is to copy them to the <solr_home>/lib directory; you may need to create it. Another method is to reference them from solrconfig.xml via <lib/> tags; see Solr's example configuration for examples of that. You will most likely need some additional JAR files as well. If you'll be communicating with a database, then you'll need to get a JDBC driver for it. If you will be extracting text from various document formats, then you'll need to add the JARs in /contrib/extraction/lib. Finally, if you'll be indexing e-mail, then you'll need to add the JARs in /contrib/dataimporthandler/lib.
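As a rough illustration of the <lib/> approach, the following solrconfig.xml fragment is a sketch that assumes the relative directory layout of Solr's bundled example, where the distribution root sits two levels above the core's instance directory. The dir values are assumptions; adjust them to wherever your Solr distribution lives.

```xml
<config>
  <!-- Assumed layout: the Solr distribution root is ../../ relative to this core. -->
  <!-- Load the DIH JARs from dist; the regex matches both the main and the extras JAR. -->
  <lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
  <!-- Only needed when indexing e-mail. -->
  <lib dir="../../contrib/dataimporthandler/lib" regex=".*\.jar" />
  <!-- Only needed when extracting text from rich document formats. -->
  <lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
  <!-- Remaining solrconfig.xml content not shown. -->
</config>
```

A JDBC driver JAR can be loaded the same way, or simply copied into the lib directory mentioned above.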

The DIH needs to be registered with Solr in solrconfig.xml like so:

<requestHandler name="/dih_artists_jdbc"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">mb-dih-artists-jdbc.xml</str>
    </lst>
</requestHandler>

The referenced mb-dih-artists-jdbc.xml file is located in <solr-home>/conf and specifies the details of a data importing process. We'll get to that file in a bit.

DIHQuickStart

http://wiki.apache.org/solr/DIHQuickStart

Index a DB table directly into Solr

Step 1: Edit your solrconfig.xml to add the request handler

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

Step 2: Create a data-config.xml file and save it to the conf directory
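A minimal sketch of such a file, assuming a MySQL database with a table named mytable. The driver class, JDBC URL, credentials, and table name are placeholders rather than values from this walkthrough.

```xml
<dataConfig>
  <!-- Placeholder connection settings: substitute your own driver, URL, and credentials. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname"
              user="user-name"
              password="password"/>
  <document>
    <!-- Each row returned by the query becomes one Solr document. -->
    <!-- The column names (id, name, desc) are used as the Solr field names. -->
    <!-- desc is a reserved word in MySQL, hence the backticks. -->
    <entity name="mytable"
            query="select id, name, `desc` from mytable">
    </entity>
  </document>
</dataConfig>
```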

Step 3: Ensure that your Solr schema (schema.xml) has the fields 'id', 'name', and 'desc'. Change the appropriate details in data-config.xml.

Step 4: Drop your JDBC driver JAR file into the <solr-home>/lib directory.

Step 5: Run the command

http://solr-host:port/solr/dataimport?command=full-import

Keep in mind that every time a full-import is executed, the existing index is cleared first. If you do not want that to happen, add clean=false. For example:

http://solr-host:port/solr/dataimport?command=full-import&clean=false

Index the fields in different names

Step 1: Change the data-config so that each column is mapped to a differently named Solr field
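One way this could look, again as a sketch with placeholder connection settings and a hypothetical table named mytable:

```xml
<dataConfig>
  <!-- Placeholder connection settings; substitute your own. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname"
              user="user-name"
              password="password"/>
  <document>
    <entity name="mytable" query="select id, name, `desc` from mytable">
      <!-- column is the name in the SQL result set; name is the target Solr field. -->
      <field column="id" name="solr_id"/>
      <field column="name" name="solr_name"/>
      <field column="desc" name="solr_desc"/>
    </entity>
  </document>
</dataConfig>
```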

Step 2: This time the fields will be written to the Solr fields 'solr_id', 'solr_name', and 'solr_desc'. You must have these fields in the schema.xml.

Step 3: Run the command http://solr-host:port/solr/dataimport?command=full-import

Index data from multiple tables into Solr

Step 1: Change the data-config to pull in data from the additional table
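A sketch of a two-table configuration using a nested entity. The table another_table, its details column, and the connection settings are hypothetical names chosen for illustration.

```xml
<dataConfig>
  <!-- Placeholder connection settings; substitute your own. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname"
              user="user-name"
              password="password"/>
  <document>
    <entity name="outer" query="select id, name, `desc` from mytable">
      <field column="id" name="solr_id"/>
      <field column="name" name="solr_name"/>
      <field column="desc" name="solr_desc"/>
      <!-- The nested query runs once per row of the outer entity; -->
      <!-- ${outer.id} is replaced with the id of the current outer row. -->
      <entity name="inner"
              query="select details from another_table where id='${outer.id}'">
        <field column="details" name="solr_details"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```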

Step 2: The schema.xml should have the solr_details field

Step 3: Run the full-import command

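Configuring the Data Source

Add the dataSource tag directly under dataConfig, that is, as a child element of dataConfig.

driver (required): the JDBC driver class name
url (required): the JDBC connection URL
user: the user name
password: the password
batchSize: the batch size used on the JDBC connection

The data source can also be configured in solrconfig.xml.
The type attribute specifies the implementation class. It is optional; the default implementation is JdbcDataSource.
The name attribute is the name of the data source. When there are several data sources, name is used to tell them apart.
All other attributes are arbitrary and depend on the DataSource implementation you use.
You can of course also implement your own DataSource.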
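As an illustration of the attributes listed above, a dataSource element might look like the following sketch. The driver class, URL, credentials, and the name ds-1 are placeholders, and the batchSize value is arbitrary.

```xml
<!-- type is optional (JdbcDataSource is the default); -->
<!-- name only matters when more than one data source is defined. -->
<dataSource type="JdbcDataSource"
            name="ds-1"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/dbname"
            user="user-name"
            password="password"
            batchSize="500"/>
```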

Multiple Data Sources

Usage:

<entity name="one" dataSource="ds-1" ...>
    ...
</entity>

<entity name="two" dataSource="ds-2" ...>
    ...
</entity>
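Put together, a configuration with two named data sources could be sketched as follows. The host names, credentials, and the tables table_one and table_two are hypothetical; only the name and dataSource attributes are the point of the example.

```xml
<dataConfig>
  <!-- Two data sources, told apart by their name attribute. -->
  <dataSource name="ds-1" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://db1-host/dbname"
              user="db_username" password="db_password"/>
  <dataSource name="ds-2" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://db2-host/dbname"
              user="db_username" password="db_password"/>
  <document>
    <!-- Each entity selects its data source by name. -->
    <entity name="one" dataSource="ds-1" query="select id, name from table_one"/>
    <entity name="two" dataSource="ds-2" query="select id, name from table_two"/>
  </document>
</dataConfig>
```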

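Configuring data-config.xml

A Solr document can be thought of as a denormalized schema whose field values may come from multiple tables.

A data-config.xml starts by defining a document element. A document element represents one kind of document. It contains one or more root entities. A root entity can contain sub-entities, which in turn can contain other entities. An entity is a table or view in a relational database. Each entity can contain multiple fields, and each field corresponds to a column in the result set returned by the query. By default, the field name is the same as the column name. If a column name differs from the name of the Solr field, the name attribute should be given. Any other required attributes are taken from Solr's schema.xml.

To fetch the desired data from the database, the design supports standard SQL, which lets users write any SQL statement they want. The root entity is the central table, and its columns can be used to join the other tables to it.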

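The Structure of dataConfig

The structure of dataConfig is not fixed. The attributes of the entity and field elements are arbitrary and depend mainly on the processor and transformer in use.

The default attributes of an entity are:

name (required): a unique name that identifies the entity
processor: required only if the data source is not an RDBMS. The default is SqlEntityProcessor.
transformer: the transformers to apply to this entity. See the transformer section for details.
pk: the primary key of the entity. It is optional, but required when using delta import. It has no necessary relation to the uniqueKey defined in schema.xml, although the two can be the same.
rootEntity: by default, the entities directly under the document element are root entities. If rootEntity is set to false on an entity, the entities directly beneath it are treated as root entities instead. For each row returned by a root entity, Solr creates one document.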

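The attributes of SqlEntityProcessor are:

query (required): the SQL statement
deltaQuery: used only in delta import
parentDeltaQuery: used only in delta import
deletedPkQuery: used only in delta import
deltaImportQuery: (used only in delta import). If present, it is used instead of query during the import phase of a delta import. The ${dataimporter.delta.} namespace can be used in this query.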
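To make the last attribute more concrete, here is a sketch of an entity that uses deltaImportQuery. It assumes an item table with an ID primary key and a last_modified timestamp column, the same shape as the examples later in this article. The ${dataimporter.delta.id} variable refers to the id column returned by deltaQuery; depending on the JDBC driver, the key may need to match the column name's case as the driver reports it.

```xml
<!-- deltaQuery finds the keys of changed rows; -->
<!-- deltaImportQuery then loads each changed row by that key. -->
<entity name="item" pk="ID"
        query="select * from item"
        deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where ID='${dataimporter.delta.id}'">
</entity>
```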

`Commands`

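Open the data import page at http://192.168.0.248:9080/solr/admin/dataimport.jsp. It shows several buttons, each of which invokes a different import command.

full-import: a full import is started by requesting the URL http://192.168.0.248:9080/solr/dataimport?command=full-import
- The operation starts in a new thread, and the status attribute in the response shows busy while it runs.
- How long the operation takes depends on the size of the data set.
- When the operation finishes, it records its start time in conf/dataimport.properties.
- That stored timestamp is used when a delta-import is executed.
- Queries to Solr are not blocked during a full import.
- It accepts the following parameters:
  - clean: (default 'true'). Whether to delete the existing index before indexing starts.
  - commit: (default 'true'). Whether to commit after the operation.
  - optimize: (default 'true'). Whether to optimize the index after the operation.
  - debug: (default 'false'). Runs in debug mode, which is used by the interactive development mode.
delta-import: for incremental imports that pick up changes, request http://192.168.0.248:9080/solr/dataimport?command=delta-import. It supports the same clean, commit, optimize, and debug parameters.
status: to find out the state of a running command, request the URL http://192.168.0.248:9080/solr/dataimport. It reports detailed statistics on documents created and deleted, queries run, rows fetched, and so on.
reload-config: if data-config.xml has changed and you want to reload the configuration without restarting Solr, run http://192.168.0.248:9080/solr/dataimport?command=reload-config
abort: you can abort a running operation by requesting http://192.168.0.248:9080/solr/dataimport?command=abort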

Full Import Example

The data-config.xml looks like this:
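A sketch of a configuration matching the description that follows. The entity names, queries, and the features and cat target fields are taken from the delta-import examples later in this article; the dataSource settings and the NAME column are assumptions for illustration only.

```xml
<dataConfig>
  <!-- Assumed connection settings; replace with your own. -->
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa"/>
  <document>
    <!-- Root entity: one Solr document per row of item. -->
    <entity name="item" query="select * from item">
      <field column="ID" name="id"/>
      <field column="NAME" name="name"/>
      <!-- One-to-many: the features of the current item. -->
      <entity name="feature"
              query="select description from feature where item_id='${item.ID}'">
        <field column="description" name="features"/>
      </entity>
      <!-- Many-to-many: item to category through the item_category join table. -->
      <entity name="item_category"
              query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
        <entity name="category"
                query="select description from category where id='${item_category.CATEGORY_ID}'">
          <field column="description" name="cat"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```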


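Here, the root entity is a table named "item" whose primary key is id. Its data is read with the statement "select * from item". Each item can have multiple features, which the feature entity fetches with its own query.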
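That sub-entity could be written roughly as follows. This is a sketch; the query is adapted from the delta-import example later in this article.

```xml
<!-- Runs once per item row; ${item.ID} is the primary key of the current item. -->
<entity name="feature"
        query="select description from feature where item_id='${item.ID}'">
  <field column="description" name="features"/>
</entity>
```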


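The foreign key item_id in the feature table is joined with the primary key of item to fetch the rows that belong to each item. In the same way, item is joined with category (a many-to-many relationship). Note how the join goes through an intermediate table using standard SQL.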
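The many-to-many join described above could be sketched like this, with the queries adapted from the delta-import example later in this article:

```xml
<!-- Outer level: rows of the join table for the current item. -->
<entity name="item_category"
        query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
  <!-- Inner level: the category row referenced by each join-table row. -->
  <entity name="category"
          query="select description from category where id='${item_category.CATEGORY_ID}'">
    <field column="description" name="cat"/>
  </entity>
</entity>
```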


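A Shorter data-config

In the example above there are several mappings from columns to Solr fields. If the column names are the same as the Solr field names, you can skip configuring fields in the entities altogether. You still need to add the field entries if you want to use transformers.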
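A sketch of such a shortened configuration, in which SQL column aliases (features, cat) take the place of explicit field elements. The entity names and queries mirror the delta-import example later in this article; the dataSource settings are placeholders.

```xml
<dataConfig>
  <!-- Assumed connection settings; replace with your own. -->
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa"/>
  <document>
    <entity name="item" query="select * from item">
      <!-- The alias makes the column name equal to the Solr field name. -->
      <entity name="feature"
              query="select description as features from feature where item_id='${item.ID}'"/>
      <entity name="item_category"
              query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
        <entity name="category"
                query="select description as cat from category where id='${item_category.CATEGORY_ID}'"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```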


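Request http://localhost:8983/solr/dataimport?command=full-import to run a full import.

Using the delta-import Command

You can run a delta import by requesting the URL http://localhost:8983/solr/dataimport?command=delta-import. The operation starts in a new thread, and the status attribute in the response shows busy while it runs. How long it takes depends on the size of your data set. At any time, you can check the status by requesting http://localhost:8983/solr/dataimport.

When a delta import runs, it reads the start time stored in conf/dataimport.properties. It uses that timestamp to run the delta queries and, once finished, updates the timestamp in conf/dataimport.properties.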

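Delta-Import Example

We will use the same database as in the full-import example. Note that the database has been updated so that every table contains an additional timestamp column named last_modified. You may need to download the database again, because it was updated recently. We use this timestamp column to determine which rows have changed since the last indexing run.

Take a look at the following data-config.xml: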

<dataConfig>
    <!-- dataSource element not shown -->
    <document>
        <entity name="item" pk="ID" query="select * from item"
                deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
            <entity name="feature" pk="ITEM_ID"
                    query="select description as features from feature where item_id='${item.ID}'">
            </entity>
            <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                    query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
                <entity name="category" pk="ID"
                        query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>

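Notice the deltaQuery attribute on the item entity: it contains an SQL statement that finds the recently changed rows. The variable ${dataimporter.last_index_time} is supplied by the DataImportHandler. We call it the timestamp; it holds the last time a full-import or delta-import ran. You can use this variable anywhere in the SQL in data-config.xml, and it is substituted when the query is processed.

The deltaQuery in the example above only detects changes in item, not in the other tables. You can cover the changes in all tables with a single SQL statement like the one below: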

deltaQuery="select id from item where id in
                (select item_id as id from feature where last_modified > '${dataimporter.last_index_time}')
            or id in
                (select item_id as id from item_category where category_id in
                    (select id as category_id from category where last_modified > '${dataimporter.last_index_time}')
                 or last_modified > '${dataimporter.last_index_time}')
            or last_modified > '${dataimporter.last_index_time}'"

Writing a huge deltaQuery like the one above is not a pleasant task, so there is another way to achieve the same result:

<dataConfig>
    <!-- dataSource element not shown -->
    <document>
        <entity name="item" pk="ID" query="select * from item"
                deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
            <entity name="feature" pk="ITEM_ID"
                    query="select DESCRIPTION as features from FEATURE where ITEM_ID='${item.ID}'"
                    deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
                    parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
            <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                    query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
                    deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
                    parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
                <entity name="category" pk="ID"
                        query="select DESCRIPTION as cat from category where ID = '${item_category.CATEGORY_ID}'"
                        deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
                        parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}"/>
            </entity>
        </entity>
    </document>
</dataConfig>

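deltaQuery returns the primary keys of the entity's rows that have changed since the last index time.
parentDeltaQuery takes the changed rows of the current table found by deltaQuery and passes them up to the parent table. This is needed because when a row in a child table changes, the Solr document of its parent has to be regenerated.

A few points worth noting:

For each row returned by query, the query of each sub-entity is executed once.
For each row returned by deltaQuery, the parentDeltaQuery is executed.
Whenever a row in the root entity or in a sub-entity changes, the Solr document that contains that row is regenerated.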

Posted on 2012-05-30 14:33 by CONAN. Category: Solr
