Reposted from: http://www.zdnet.com.cn/developer/webdevelop/story/0,2000081602,39154640,00.htm

Building an Index with Lucene

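Adding search to your Web site is one of the simplest ways to improve the browsing experience, but integrating a search engine into your application is not always easy. To help you give your own Java application a flexible search engine, I will explain how to use Lucene, an extremely flexible open source search engine.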

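Lucene integrates directly with your Web application. It is written in Java by the Apache Jakarta project, and your Java application can use it as the core of any search feature. Lucene can handle any kind of text data, but it has no built-in support for Word, Excel, PDF, or XML. Solutions exist that let Lucene work with each of these formats.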

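An important point about Lucene is that it is only a search engine, so it has no built-in Web user interface and no Web crawler. To integrate Lucene into your Web application, you write a servlet or JSP page that displays the query form, and another page that lists the results.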

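Lucene indexes your application's text content and saves it on the file system as a set of index files. Lucene accepts Document objects, each of which represents a single piece of content, such as a Web page or a PDF file. Your application is responsible for turning its content into Document objects that Lucene can understand.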

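Each document is made up of one or more Field objects. A field has a name and a value, much like an entry in a hash map. Each field should correspond to a piece of information that you need to either query or display in the search results. For example, a title should appear in the search results, so it is added to the Document as a field. Fields may or may not be indexed, and the original data can optionally be stored in the index. Stored fields are useful when you build the results page. A field that is of no use for searching, such as a unique ID, does not need to be indexed; it only needs to be stored.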

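Fields can also be tokenized, which means that an analyzer breaks the field's content into tokens that the search engine can use. Lucene ships with several analyzers, but I will only use the most powerful one, the StandardAnalyzer class.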

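The StandardAnalyzer class converts all of the text to lower case and removes common stop words. Stop words are words like "a", "the", and "in" that occur very often in content but are of no use for searching. The analyzer is also applied to search queries, so the query terms line up with the indexed terms. For example, the text "The dog is a golden retriever" is indexed as "dog golden retriever". When a user searches for "a Golden Dog", the analyzer processes the query and turns it into "golden dog", which matches our content.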
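The analysis step described above can be sketched in plain Java without Lucene. This is a simplified stand-in, not StandardAnalyzer's actual implementation: the class name AnalyzerSketch, the stop-word list, and the rule that every query token must appear in the content are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerSketch {
    // Assumed stop-word list for illustration only; StandardAnalyzer ships its own list.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "in", "is"));

    /** Lowercases the text, splits it on non-alphanumeric characters, and drops stop words. */
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** True when every analyzed query token also occurs in the analyzed content. */
    public static boolean matches(String content, String query) {
        return analyze(content).containsAll(analyze(query));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The dog is a golden retriever")); // [dog, golden, retriever]
        System.out.println(analyze("a Golden Dog"));                  // [golden, dog]
        System.out.println(matches("The dog is a golden retriever", "a Golden Dog")); // true
    }
}
```

Because the same analysis runs on both the content and the query, differences in case and the presence of stop words no longer prevent a match.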

Our example uses a Data Access Object (DAO) and a business object; the DAO is a common pattern in Java application development. The DAO I use, ProductDAO, is shown in Listing A.

Listing A

package com.greenninja.lucene;

import java.util.*;

public class ProductDAO {

    private Map map = new HashMap();

    /**
     * Initializes the map with new Products
     */
    public void init() {

        Product product1 = new Product("1E344", "Blizzard Convertible",
            "The Blizzard is the finest convertible on the market today, with 120 horsepower, 6 seats, and a steering wheel.",
            "The Blizzard convertible model is a revolutionary vehicle that looks like a minivan, but has a folding roof like a roadster. We took all of the power from our diesel engines and put it into our all new fuel cell power system.");
        map.put(product1.getId(), product1);

        Product product2 = new Product("R5TS7", "Truck 3000",
            "Our Truck 3000 model comes in all shapes and sizes, including dump truck, garbage truck, and pickup truck. The garbage truck has a full 3 year warranty.",
            "The Truck 3000 is built on the same base as our bulldozers and can be outfitted with an optional hovercraft attachment for all-terrain travel.");
        map.put(product2.getId(), product2);

        Product product3 = new Product("VC456", "i954d-b Motorcycle",
            "The motorcycle comes with a sidecar on each side, for additional stability and cornering ability.",
            "Our motorcycle has the same warranty as our other products and is guaranteed for many miles of fun biking. Each motorcycle is shipped with a nylon windbreaker, goggles, and a helmet with a neat visor.");
        map.put(product3.getId(), product3);
    }

    /**
     * Gets a collection of all of the products
     *
     * @return all of the products
     */
    public Collection getAllProducts() {
        return map.values();
    }

    /**
     * Gets a product, given the unique id
     *
     * @param id the unique id
     * @return the Product object, or null if the id wasn't found
     */
    public Product getProduct(String id) {
        if (map.containsKey(id)) {
            return (Product) map.get(id);
        }

        // the product id wasn't found
        return null;
    }
}


To keep this demo simple, I am not going to use a database; the DAO just contains a collection of Product objects. In this example, I take the Product objects in Listing B and convert them into documents for indexing.

Listing B

package com.greenninja.lucene;

public class Product {

    private String name;
    private String shortDescription;
    private String longDescription;
    private String id;

    /**
     * Constructor to create a new product
     */
    public Product(String i, String n, String sd, String ld) {
        this.id = i;
        this.name = n;
        this.shortDescription = sd;
        this.longDescription = ld;
    }

    // getters used by ProductDAO and Indexer (the original listing elides
    // the setters and getters; setters are left out here)
    public String getId() { return id; }
    public String getName() { return name; }
    public String getShortDescription() { return shortDescription; }
    public String getLongDescription() { return longDescription; }
}

The Indexer class is in Listing C. It is responsible for converting each Product into a Lucene document and for creating the Lucene index.

Listing C

package com.greenninja.lucene;

import java.io.IOException;
import java.util.Collection;
import java.util.Iterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Indexer {

    protected IndexWriter writer = null;

    protected Analyzer analyzer = new StandardAnalyzer();

    public void init(String indexPath) throws IOException {
        // create a new index every time this is run
        writer = new IndexWriter(indexPath, analyzer, true);
    }

    public void buildIndex() throws IOException {
        // get the products from the DAO
        ProductDAO dao = new ProductDAO();
        dao.init();
        Collection products = dao.getAllProducts();

        Iterator iter = products.iterator();

        while (iter.hasNext()) {
            Product product = (Product) iter.next();

            // convert the product to a document.
            Document doc = new Document();

            // create an unindexed, untokenized, stored field for the product id
            doc.add(Field.UnIndexed("productId", product.getId()));

            // create an indexed, untokenized, stored field for the name
            doc.add(Field.Keyword("name", product.getName()));

            // create an indexed, untokenized, stored field for the short description
            doc.add(Field.Keyword("short", product.getShortDescription()));

            // create an indexed, tokenized field for all of the content;
            // Field.Text with a String value also stores the text
            String content = product.getName() + " " + product.getShortDescription() +
                " " + product.getLongDescription();
            doc.add(Field.Text("content", content));

            // add the document to the index
            try {
                writer.addDocument(doc);
                System.out.println("Document " + product.getName() + " added to index.");
            } catch (IOException e) {
                System.out.println("Error adding document: " + e.getMessage());
            }
        }

        // optimize the index
        writer.optimize();

        // close the index
        writer.close();
    }
}

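The fields in the Product class are the ID, name, short description, and long description. Using the UnIndexed method of the Field class, the ID is stored as an unindexed, untokenized field. Using the Keyword method of the Field class, the name and short description are stored as indexed, untokenized fields. The search engine runs queries against the content field, which contains the product's name, short description, and long description.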

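After all of the documents have been added, optimize the index and close the index writer so that the index can be used. Most real-world uses of Lucene rely on incremental indexing, in which documents already in the index are updated individually, instead of deleting the index and creating a new one every time.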
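The difference between rebuilding an index and updating it incrementally can be illustrated with a toy in-memory inverted index. This sketch does not use the Lucene API and says nothing about how Lucene stores its index files; the class IncrementalIndexSketch and its methods are hypothetical names chosen for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class IncrementalIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<String>> postings = new HashMap<>();
    // id -> terms currently indexed for that document, so an update can drop stale postings
    private final Map<String, Set<String>> termsByDoc = new HashMap<>();

    /** Replaces one document in place; every other document is left untouched. */
    public void addOrUpdate(String id, String text) {
        remove(id);
        Set<String> terms = new TreeSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(id);
        }
        termsByDoc.put(id, terms);
    }

    /** Removes a document and any postings that pointed at it. */
    public void remove(String id) {
        Set<String> old = termsByDoc.remove(id);
        if (old == null) {
            return;
        }
        for (String t : old) {
            Set<String> ids = postings.get(t);
            ids.remove(id);
            if (ids.isEmpty()) {
                postings.remove(t);
            }
        }
    }

    /** Returns the ids of the documents that contain the given single term. */
    public Set<String> search(String term) {
        Set<String> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(ids);
    }

    public static void main(String[] args) {
        IncrementalIndexSketch index = new IncrementalIndexSketch();
        index.addOrUpdate("1E344", "Blizzard Convertible");
        index.addOrUpdate("R5TS7", "Truck 3000");
        index.addOrUpdate("R5TS7", "Hovercraft 3000"); // update a single document
        System.out.println(index.search("truck"));      // []
        System.out.println(index.search("hovercraft")); // [R5TS7]
        System.out.println(index.search("blizzard"));   // [1E344]
    }
}
```

The update touches only the postings of the changed document, which is why incremental indexing scales better than recreating the whole index when a few documents change.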

Running Queries

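Creating a query and searching the index for results is simpler than creating the index. Your application asks the user for a search query, which can be a single word. Lucene also has more advanced Query classes for Boolean searches and phrase searches.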

An example of an advanced query is "Mutual Fund" AND stock*, which searches for documents that contain the phrase Mutual Fund and a word beginning with stock (for example stocks, stock, or even stockings).
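To make the meaning of that query concrete, here is a small plain-Java sketch of what it asks for: a phrase match combined with a prefix match. It only illustrates the semantics; it is not how Lucene's QueryParser parses or evaluates a query, and QuerySemanticsSketch and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class QuerySemanticsSketch {
    /** Splits text into lowercase words. */
    static List<String> words(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("[^a-z0-9]+"));
    }

    /** True when the words of the phrase occur next to each other, in order. */
    static boolean containsPhrase(String text, String phrase) {
        return Collections.indexOfSubList(words(text), words(phrase)) >= 0;
    }

    /** True when at least one word starts with the prefix (the part before the *). */
    static boolean containsPrefix(String text, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String w : words(text)) {
            if (w.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    /** Semantics of the query: "Mutual Fund" AND stock* */
    static boolean matches(String text) {
        return containsPhrase(text, "Mutual Fund") && containsPrefix(text, "stock");
    }

    public static void main(String[] args) {
        System.out.println(matches("Our mutual fund holds stocks and bonds"));  // true
        System.out.println(matches("A mutual fund sold with free stockings"));  // true
        System.out.println(matches("Mutual respect; the fund buys stock"));     // false
        System.out.println(matches("A mutual fund that holds only bonds"));     // false
    }
}
```

The second example shows why the wildcard can surprise you: stockings starts with stock, so the document matches even though it has nothing to do with shares.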

More information about queries in Lucene
The query syntax page on the Lucene Web site provides more detail.

The Searcher class is in Listing D. It is responsible for finding the terms you search for in the Lucene index.

Listing D

package com.greenninja.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class Searcher {

    protected Analyzer analyzer = new StandardAnalyzer();

    public Hits search(String indexPath, String queryString) throws IOException, ParseException {
        // the Lucene index Searcher class, which uses the query on the index
        IndexSearcher indexSearcher = new IndexSearcher(indexPath);

        // make the query with our content field, the query string, and the analyzer
        Query query = QueryParser.parse(queryString, "content", analyzer);

        Hits hits = indexSearcher.search(query);

        return hits;
    }
}

For this demo I use a simple query that is just a string, without any of the advanced query features. I create a Query object from the query string with the QueryParser class, which uses the StandardAnalyzer class to break the query string into tokens, remove stop words, and convert the text to lower case.

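The query is passed to an IndexSearcher object, which is initialized with the file system location of the index. The search method of IndexSearcher takes the query and returns a Hits object. The Hits object holds the search results as Lucene Document objects, along with the number of results. The doc method of the Hits object retrieves each document in the hits.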

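The Document objects contain the fields I added to the document in the Indexer. Some of these fields were stored but not tokenized, and you can pull them back out of the document. The sample application runs a query against the search engine and then displays the names of the products it finds.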

Running the Demo

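To run the sample program in this article, you need to download the latest Lucene binary distribution from the Lucene Web site. The lucene-1.3-rc1.jar file from the Lucene distribution must be on your Java classpath to run the demo. The demo creates an index directory named index in the directory from which you run the com.greenninja.lucene.Demo class. You also need to have the JDK installed. A typical command line is: java -cp c:\java\lucene-1.3-rc1\lucene-1.3-rc1.jar;. com.greenninja.lucene.Demo (see Figure A). The sample data used in this example is contained in the ProductDAO class. The query is part of the Demo class.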

Figure A: Command-line example

References
· Download the code for this article
· javaworld.com: javaworld.com
· Matrix Java developer community: http://www.matrix.org.cn/
· Lucene search engine library: http://jakarta.apache.org/lucene/docs/index.html
· MAOS open source project: http://sourceforge.net/projects/maos/

Posted on 2006-05-15 22:54 by 地獄男爵 (hellboys). Category: programming languages (c/c++ java python sql ......)
