
Learning Lucene (article source: http://lighter.javaeye.com)

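When I write an article, the hardest part is the title; sometimes I just can't think of a good one. Everything here is a set of simple Lucene examples anyway, so I picked a title more or less at random.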

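Lucene is actually quite simple. It mainly does two things: building an index and searching it.
Let's look at some of the terms used in Lucene. I don't intend to cover them in detail, only to touch on each one, because there is a good thing in this world called search.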

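IndexWriter: one of the most important classes in Lucene. It is mainly used to add documents to the index, and it also controls some of the parameters used during indexing.

Analyzer: the analyzer, mainly used to analyze the various kinds of text the search engine encounters. Commonly used analyzers include StandardAnalyzer, StopAnalyzer and WhitespaceAnalyzer.

Directory: where the index is stored. Lucene provides two kinds of storage, disk and memory. The index is normally kept on disk. Correspondingly, Lucene provides the two classes FSDirectory and RAMDirectory.

Document: a document. A Document is one unit to be indexed; any file that should be indexed must first be converted into a Document object.

Field: a field, that is, a named part of a Document. The examples below use fields called "name", "path" and "body".

IndexSearcher: the most basic search tool in Lucene; every search goes through an IndexSearcher.

Query: a query. Lucene supports fuzzy queries, phrase queries, combined queries and so on, through classes such as TermQuery, BooleanQuery, RangeQuery and WildcardQuery.

QueryParser: a tool for parsing user input. It scans the string the user typed and produces a Query object.

Hits: once a search has finished, the results have to be returned and shown to the user; only then has the search served its purpose. In Lucene, the collection of search results is represented by an instance of the Hits class.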

That was a pile of definitions; now let's look at a few simple examples:
1. A simple StandardAnalyzer test

Code

package lighter.javaeye.com;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StandardAnalyzerTest
{
    // Constructor
    public StandardAnalyzerTest()
    {
    }

    public static void main(String[] args)
    {
        // Create a StandardAnalyzer
        Analyzer aAnalyzer = new StandardAnalyzer();
        // Test string
        StringReader sr = new StringReader("lighter javaeye com is the are on");
        // Create the TokenStream; "name" is the field name
        TokenStream ts = aAnalyzer.tokenStream("name", sr);
        try {
            int i = 0;
            Token t = ts.next();
            while (t != null)
            {
                // Counter used to number the output lines
                i++;
                // Print the processed token
                System.out.println("Line " + i + ": " + t.termText());
                // Fetch the next token
                t = ts.next();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

Quote

Line 1: lighter
Line 2: javaeye
Line 3: com

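A note:
StandardAnalyzer is Lucene's built-in "standard analyzer". It does the following:
1. Splits the original sentence into tokens at whitespace
2. Converts all uppercase letters to lowercase
3. Removes common words that carry little meaning for search (stop words), such as "is", "the" and "are", and also removes all punctuation
Compare the output with the input, new StringReader("lighter javaeye com is the are on"), and this becomes clear: only "lighter", "javaeye" and "com" survive.
I won't explain the API here; see the official Lucene documentation for details. One thing to note: the code here uses the Lucene 2 API, which differs noticeably from version 1.4.3.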
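
To make the three steps listed above concrete, here is a minimal plain-Java sketch that needs no Lucene on the classpath. It is only an illustration of the idea (split, lowercase, drop stop words), not how StandardAnalyzer is actually implemented. The class name ToyAnalyzer and its four-word stop list are assumptions made for this example; Lucene's own stop list is longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hand-rolled stand-in, not Lucene's StandardAnalyzer.
class ToyAnalyzer {
    // Assumed stop-word list for this sketch; Lucene's real list is longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "the", "are", "on"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that is neither a letter nor a digit,
        // which discards whitespace and punctuation in one step.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = analyze("Lighter JavaEye com is the are on");
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println("Line " + (i + 1) + ": " + tokens.get(i));
        }
    }
}
```

Running main prints the same three terms as the Lucene example: lighter, javaeye and com.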

2. Another example: build a simple index, then search it

Code

package lighter.javaeye.com;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;

public class FSDirectoryTest {

    // Path where the index is stored
    public static final String path = "c:\\index2";

    public static void main(String[] args) throws Exception {
        Document doc1 = new Document();
        doc1.add(new Field("name", "lighter javaeye com", Field.Store.YES, Field.Index.TOKENIZED));

        Document doc2 = new Document();
        doc2.add(new Field("name", "lighter blog", Field.Store.YES, Field.Index.TOKENIZED));

        // The final "true" creates a new index, replacing any existing one
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(path, true), new StandardAnalyzer(), true);
        // Index at most 3 terms per field; setting it once covers both documents
        writer.setMaxFieldLength(3);
        writer.addDocument(doc1);
        writer.addDocument(doc2);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(path);
        // Search the "name" field with the same analyzer used for indexing
        QueryParser qp = new QueryParser("name", new StandardAnalyzer());

        Query query = qp.parse("lighter");
        Hits hits = searcher.search(query);
        System.out.println("Search for \"lighter\": " + hits.length() + " result(s)");

        query = qp.parse("javaeye");
        hits = searcher.search(query);
        System.out.println("Search for \"javaeye\": " + hits.length() + " result(s)");

        searcher.close();
    }

}

Output:

Code

Search for "lighter": 2 result(s)
Search for "javaeye": 1 result(s)
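
Why does "lighter" match two documents and "javaeye" only one? Conceptually, an index maps each term to the documents that contain it. The following plain-Java sketch shows that idea with an in-memory map. It is a toy under stated assumptions, not Lucene's index format, and the class name ToyInvertedIndex and its methods are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy in-memory inverted index, not Lucene's index format.
class ToyInvertedIndex {
    private final List<String> documents = new ArrayList<>();
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stores the text and records each lowercased word under the new document id.
    int addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of all documents containing the term (empty set if none).
    Set<Integer> search(String term) {
        Set<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("lighter javaeye com");
        index.addDocument("lighter blog");
        System.out.println("lighter: " + index.search("lighter").size() + " result(s)");
        System.out.println("javaeye: " + index.search("javaeye").size() + " result(s)");
    }
}
```

With the same two documents as above, the sketch reports 2 results for "lighter" and 1 for "javaeye".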

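// Test string
StringReader sr = new StringReader("lighter javaeye com");
// Create the TokenStream
TokenStream ts = aAnalyzer.tokenStream("name", sr);
Question: what rule does the parsing above follow? Why does it automatically split the text at spaces or ","? Also, when the text passed to sr is Chinese, every single character becomes its own token. Why is that, and what does this behavior mean in practice?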

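StandardAnalyzer is Lucene's built-in "standard analyzer". It does the following:
1. Splits the original sentence into tokens at whitespace
2. Converts all uppercase letters to lowercase
3. Removes common words that carry little meaning for search (stop words), such as "is", "the" and "are", and also removes all punctuation
It can also tokenize Chinese, but the results are poor; there are now many Chinese word-segmentation packages that can be used instead.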
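
The character-by-character splitting of Chinese mentioned in the question can be pictured with a small plain-Java sketch. It is an approximation written for illustration, not Lucene's tokenizer: each Han character becomes its own token, while runs of other letters and digits are kept together and lowercased. The class name ToyCjkTokenizer is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the behaviour described above, not Lucene's tokenizer.
class ToyCjkTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                // A Han character ends any pending word and is a token by itself.
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
            } else {
                // Whitespace and punctuation only separate tokens.
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("中華人民 Lucene 2006年"));
    }
}
```

Single-character tokens carry no notion of word boundaries, which is one reason dedicated Chinese segmentation packages tend to give better search results.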

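A note first: this article uses Lucene 2.0. The main differences between 2.0 and earlier versions are in querying.
This code was actually written a few months ago, but I am lazy and only got around to posting it on my blog today. I can't write deep articles, only simple notes and bits and pieces. The code is for my own amusement, so I hope the experts won't hit me with demands for refactoring...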

1. On Windows, create a folder named s on drive C and create three txt files in it. Any names will do; call them "1.txt", "2.txt" and "3.txt".
The content of 1.txt is as follows:

Code

中華人民共和國
全國人民
2006年

The contents of "2.txt" and "3.txt" can be anything you like. I was too lazy to write something new, so I copied the content of 1.txt into both.
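
If you would rather create the three sample files from code than by hand, a sketch like the following should do it. It writes the files in GBK because the indexer later in this article reads them with the "GBK" charset. The class name SampleFiles and the fallback directory "s" are hypothetical; pass the target directory (for example c:\s) as the first argument.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Illustrative helper for preparing the test data; not part of the original code.
class SampleFiles {
    // Writes 1.txt, 2.txt and 3.txt with identical content into dir, encoded as GBK.
    static void create(Path dir) throws IOException {
        Files.createDirectories(dir);
        List<String> lines = Arrays.asList("中華人民共和國", "全國人民", "2006年");
        for (int i = 1; i <= 3; i++) {
            Files.write(dir.resolve(i + ".txt"), lines, Charset.forName("GBK"));
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default: a folder named "s" under the working directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : "s");
        create(dir);
        System.out.println("Wrote sample files to " + dir.toAbsolutePath());
    }
}
```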

2. Download the Lucene package and put it on the classpath.
Building the index:

Code

package lighter.javaeye.com;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * author lighter date 2006-8-7
 */
public class TextFileIndexer {
    public static void main(String[] args) throws Exception {
        /* Folder containing the files to index: the s folder on drive C */
        File fileDir = new File("c:\\s");
        File[] textFiles = fileDir.listFiles();
        if (textFiles == null) {
            // listFiles() returns null when the folder is missing or unreadable
            throw new IOException("Cannot list files in " + fileDir.getPath());
        }

        /* Where the index files are stored */
        File indexDir = new File("c:\\index");
        Analyzer luceneAnalyzer = new StandardAnalyzer();
        // The final "true" creates a new index, replacing any existing one
        IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer,
                true);
        long startTime = System.currentTimeMillis();

        // Add one Document to the index for each .txt file
        for (int i = 0; i < textFiles.length; i++) {
            if (textFiles[i].isFile()
                    && textFiles[i].getName().endsWith(".txt")) {
                System.out.println("File " + textFiles[i].getCanonicalPath()
                        + " is being indexed....");
                String temp = FileReaderAll(textFiles[i].getCanonicalPath(),
                        "GBK");
                System.out.println(temp);
                Document document = new Document();
                // The path is stored so it can be shown, but it is not searchable
                Field pathField = new Field("path", textFiles[i].getPath(),
                        Field.Store.YES, Field.Index.NO);
                // The body is stored, tokenized and indexed with term vectors
                Field bodyField = new Field("body", temp, Field.Store.YES,
                        Field.Index.TOKENIZED,
                        Field.TermVector.WITH_POSITIONS_OFFSETS);
                document.add(pathField);
                document.add(bodyField);
                indexWriter.addDocument(document);
            }
        }
        // optimize() optimizes the index
        indexWriter.optimize();
        indexWriter.close();

        // Measure how long indexing took
        long endTime = System.currentTimeMillis();
        System.out.println("Adding the documents to the index took "
                + (endTime - startTime) + " ms: " + fileDir.getPath());
    }

    // Reads the whole file with the given charset. Line breaks are dropped,
    // so the lines of the file are joined into a single string.
    public static String FileReaderAll(String FileName, String charset)
            throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(FileName), charset));
        StringBuilder temp = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                temp.append(line);
            }
        } finally {
            reader.close();
        }
        return temp.toString();
    }
}

Indexing output:

Code

File C:\s\1.txt is being indexed....
中華人民共和國全國人民2006年
File C:\s\2.txt is being indexed....
中華人民共和國全國人民2006年
File C:\s\3.txt is being indexed....
中華人民共和國全國人民2006年
Adding the documents to the index took 297 ms: c:\s

3. With the index built, it's time to query...

Code

package lighter.javaeye.com;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TestQuery {
    public static void main(String[] args) throws IOException, ParseException {
        String queryString = "中華";
        IndexSearcher searcher = new IndexSearcher("c:\\index");

        // Use the same analyzer that was used to build the index
        Analyzer analyzer = new StandardAnalyzer();
        // Let a ParseException propagate; swallowing it would leave query null
        QueryParser qp = new QueryParser("body", analyzer);
        Query query = qp.parse(queryString);

        Hits hits = searcher.search(query);
        if (hits.length() > 0) {
            System.out.println("Found: " + hits.length() + " results!");
        }
        searcher.close();
    }

}

Output:

Quote

Found: 3 results!
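
Why do all three files match? As far as I understand it, the analyzer breaks the Chinese text into single characters, and a multi-character query is then matched as a phrase, that is, the same characters appearing next to each other in the same order. The plain-Java sketch below imitates that idea. It is a simplified assumption about the behaviour, not Lucene code, and the class name ToyPhraseSearch is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: a toy phrase match over single-character tokens.
class ToyPhraseSearch {
    // One token per code point, ignoring whitespace.
    static List<String> singleCharTokens(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Counts documents whose tokens contain the query tokens consecutively, in order.
    static int countMatches(List<String> documents, String query) {
        List<String> queryTokens = singleCharTokens(query);
        int matches = 0;
        for (String doc : documents) {
            if (Collections.indexOfSubList(singleCharTokens(doc), queryTokens) >= 0) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The same text that the indexer read from each of the three files.
        String body = "中華人民共和國全國人民2006年";
        List<String> docs = Arrays.asList(body, body, body);
        System.out.println("Found: " + countMatches(docs, "中華") + " results!");
    }
}
```

Because the three files have identical content, the phrase occurs in each of them, which matches the count of 3 reported above.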

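balaschen wrote:

Quote

A note first: this article uses Lucene 2.0. The main differences between 2.0 and earlier versions are in querying.

What are the main differences? Judging from your code, the methods look the same.

Here is an example.
This is the Lucene 2.0 API:

Code

QueryParser qp = new QueryParser("body", analyzer);
query = qp.parse(queryString);

This is the Lucene 1.4.3 API:

Code

// Static method: query text first, then the default field, then the analyzer
query = QueryParser.parse(queryString, "body", new StandardAnalyzer());

For the detailed changes, see the official documentation.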

Posted on 2007-01-26 13:41 by leoli. Category: java

Thinking in MyLife
