
    A search framework built on top of Lucene, developed out of personal interest in my spare time. Feedback and suggestions are welcome.

    I. Introduction

    A full-text search framework based on Lucene. It provides fast, convenient index creation and querying, plus extension points for customizing the framework.

    Project home: http://code.google.com/p/snoics-retrieval/

    II. User Guide

    1. Requirements

    Java 1.5+

    Lucene 3.0.x+

    2. Loading

    Configuration parameters are loaded through RetrievalApplicationContext. Each RetrievalApplicationContext instance created this way contains a complete, independent context.

    In most cases an application only needs to create one RetrievalApplicationContext instance at startup and share it across the whole application.

    A RetrievalApplicationContext instance can be created in any of the following ways:

    Using the default approach, reading the retrieval.properties configuration file from the classpath:

           RetrievalApplicationContext retrievalApplicationContext =
                   new RetrievalApplicationContext("c:\\index");

    Loading from a Properties instance holding the configuration:

           Properties properties = ...
           ...
           RetrievalApplicationContext retrievalApplicationContext =
                   new RetrievalApplicationContext(properties, "c:\\index");

    Reading a specific configuration file, which must be on the classpath:

           RetrievalApplicationContext retrievalApplicationContext =
                   new RetrievalApplicationContext("app-retrieval.properties",
                           "c:\\index");

    Building a RetrievalProperties object:

           RetrievalProperties retrievalProperties = ...
           ...
           RetrievalApplicationContext retrievalApplicationContext =
                   new RetrievalApplicationContext(retrievalProperties,
                           "c:\\index");

    3. Configuration

    The default configuration file is retrieval.properties on the classpath. The parameters are described below.
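    As an illustration, a minimal retrieval.properties might look like the sketch below. It assumes the parameter names documented in this section are used verbatim as property keys; the values shown are simply the documented defaults, not recommendations:

```properties
# Lucene settings
LUCENE_PARAM_VERSION=LUCENE_30
LUCENE_PARAM_MAX_FIELD_LENGTH=10000
LUCENE_PARAM_RAM_BUFFER_SIZE_MB=16
LUCENE_PARAM_MERGE_FACTOR=10

# Index-creation settings
INDEX_MAX_FILE_DOCUMENT_PAGE_SIZE=20
INDEX_MAX_INDEX_FILE_SIZE=3145728
INDEX_MAX_DB_DOCUMENT_PAGE_SIZE=500
INDEX_DEFAULT_CHARSET=utf-8

# Query settings
QUERY_RESULT_TOP_DOCS_NUM=3000

# Extension classes (framework defaults)
RETRIEVAL_EXTENDS_CLASS_ANALYZER_BUILDER=com.snoics.retrieval.engine.analyzer.CJKAnalyzerBuilder
```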

    LUCENE_PARAM_VERSION

    Lucene parameter; defaults to LUCENE_30 if not set.
    Sets the Lucene version, which affects the index file format and query results. Valid values:

    LUCENE_20
    LUCENE_21
    LUCENE_22
    LUCENE_23
    LUCENE_24
    LUCENE_29
    LUCENE_30

    LUCENE_PARAM_MAX_FIELD_LENGTH

    Lucene parameter; defaults to DEFAULT_MAX_FIELD_LENGTH=10000 if not set.

    The maximum number of terms that will be indexed for a single field in a document.
    This limits the amount of memory required for indexing, so that collections with
    very large files will not crash the indexing process by running out of memory.
    This setting refers to the number of running terms, not to the number of different terms.

    Note: this silently truncates large documents, excluding from the index all terms
    that occur further in the document. If you know your source documents are large,
    be sure to set this value high enough to accommodate the expected size. If you set
    it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate
    an OutOfMemoryError.

    By default, no more than DEFAULT_MAX_FIELD_LENGTH=10000 terms will be indexed for a field.

    LUCENE_PARAM_RAM_BUFFER_SIZE_MB

    Lucene parameter; defaults to DEFAULT_RAM_BUFFER_SIZE_MB=16 if not set.
    Caps the amount of RAM used to buffer indexed documents; when the buffered changes reach this limit they are flushed to disk. Larger values generally mean faster indexing.

    Determines the amount of RAM that may be used for buffering added documents and
    deletions before they are flushed to the Directory. Generally for faster indexing
    performance it's best to flush by RAM usage instead of document count and use as
    large a RAM buffer as you can.

    When this is set, the writer will flush whenever buffered documents and deletions
    use this much RAM. Pass in DISABLE_AUTO_FLUSH to prevent triggering a flush due to
    RAM usage. Note that if flushing by document count is also enabled, then the flush
    will be triggered by whichever comes first.

    NOTE: the account of RAM usage for pending deletions is only approximate. Specifically,
    if you delete by Query, Lucene currently has no way to measure the RAM usage of
    individual Queries so the accounting will under-estimate and you should compensate by
    either calling commit() periodically yourself, or by using setMaxBufferedDeleteTerms
    to flush by count instead of RAM usage (each buffered delete Query counts as one).

    NOTE: because IndexWriter uses ints when managing its internal storage, the absolute
    maximum value for this setting is somewhat less than 2048 MB. The precise limit depends on
    various factors, such as how large your documents are, how many fields have norms,
    etc., so it's best to set this value comfortably under 2048.

    The default value is DEFAULT_RAM_BUFFER_SIZE_MB=16.

    LUCENE_PARAM_MAX_BUFFERED_DOCS

    Lucene parameter; if not set, flushing by document count is disabled (the default).
    It can be used together with LUCENE_PARAM_RAM_BUFFER_SIZE_MB; when both are set, a flush to disk (creating a new index segment) happens as soon as either threshold is reached.

    Determines the minimal number of documents required before the buffered in-memory documents
    are flushed as a new Segment. Large values generally give faster indexing.

    When this is set, the writer will flush every maxBufferedDocs added documents. Pass in
    DISABLE_AUTO_FLUSH to prevent triggering a flush due to number of buffered documents.
    Note that if flushing by RAM usage is also enabled, then the flush will be triggered by
    whichever comes first.

    Disabled by default (writer flushes by RAM usage).

    LUCENE_PARAM_MERGE_FACTOR

    Lucene parameter; defaults to 10 if not set.
    MergeFactor controls how many segment files may accumulate on disk. Do not set it too high,
    especially when MaxBufferedDocs is small (which produces more segments); otherwise you risk "too many open files" errors, or even crashing the JVM.

    Determines how often segment indices are merged by addDocument(). With smaller values,
    less RAM is used while indexing, and searches on unoptimized indices are faster,
    but indexing speed is slower. With larger values, more RAM is used during indexing,
    and while searches on unoptimized indices are slower, indexing is faster. Thus larger
    values (> 10) are best for batch index creation, and smaller values (< 10) for indices
    that are interactively maintained.

    Note that this method is a convenience method: it just calls mergePolicy.setMergeFactor
    as long as mergePolicy is an instance of LogMergePolicy. Otherwise an IllegalArgumentException
    is thrown.

    This must never be less than 2. The default value is 10.

    LUCENE_PARAM_MAX_MERGE_DOCS

    Lucene parameter; defaults to Integer.MAX_VALUE if not set.
    Sets the largest segment, measured in documents, that may be merged with other segments, as described below.

    Determines the largest segment (measured by document count) that may be merged with other segments.
    Small values (e.g., less than 10,000) are best for interactive indexing, as this limits the length
    of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches.

    The default value is Integer.MAX_VALUE.

    Note that this method is a convenience method: it just calls mergePolicy.setMaxMergeDocs as long as
    mergePolicy is an instance of LogMergePolicy. Otherwise an IllegalArgumentException is thrown.

    The default merge policy (LogByteSizeMergePolicy) also allows you to set this limit by net
    size (in MB) of the segment, using LogByteSizeMergePolicy.setMaxMergeMB.

    INDEX_MAX_FILE_DOCUMENT_PAGE_SIZE

    Index-creation parameter; defaults to 20 if not set.

    The maximum number of file index documents per page when batch-indexing files.
    A larger value passed through the API at index time will not take effect.

    INDEX_MAX_INDEX_FILE_SIZE

    Index-creation parameter; defaults to 3145728 (bytes) if not set.

    When indexing a file larger than this limit, the file's content is ignored: it is not parsed and not indexed.

    INDEX_MAX_DB_DOCUMENT_PAGE_SIZE

    Index-creation parameter; defaults to 500 if not set.

    The maximum number of records read from the database per page when batch-indexing database records.
    A larger value passed through the API at index time will not take effect.

    INDEX_DEFAULT_CHARSET

    Index-creation parameter; defaults to utf-8 if not set.

    The default character set used when parsing text file content.

    QUERY_RESULT_TOP_DOCS_NUM

    Query parameter; defaults to 3000 if not set.

    The maximum number of results a query returns.

    RETRIEVAL_EXTENDS_CLASS_FILE_CONTENT_PARSER_MANAGER

    Retrieval extension; defaults to com.snoics.retrieval.engine.index.create.impl.file.FileContentParserManager if not set.

    The file content parser manager. When indexing files, this manager dispatches each file type to its corresponding content parser.
    Must implement the interface com.snoics.retrieval.engine.index.create.impl.file.IFileContentParserManager

    RETRIEVAL_EXTENDS_CLASS_ANALYZER_BUILDER

    Retrieval extension; defaults to com.snoics.retrieval.engine.analyzer.CJKAnalyzerBuilder if not set.

    The index analyzer builder. Built-in analyzer builders:
    com.snoics.retrieval.engine.analyzer.CJKAnalyzerBuilder (default)
    com.snoics.retrieval.engine.analyzer.IKCAnalyzerBuilder (strongly recommended for Chinese text)
    com.snoics.retrieval.engine.analyzer.StandardAnalyzerBuilder
    com.snoics.retrieval.engine.analyzer.ChineseAnalyzerBuilder

    Must implement the interface com.snoics.retrieval.engine.analyzer.IRAnalyzerBuilder

    RETRIEVAL_EXTENDS_CLASS_HEIGHLIGHTER_MAKER

    Retrieval extension; defaults to com.snoics.retrieval.engine.query.formatter.HighlighterMaker if not set.

    Highlights matched terms in query results.

    Must implement the interface com.snoics.retrieval.engine.query.formatter.IHighlighterMaker

    RETRIEVAL_EXTENDS_CLASS_DATABASE_INDEX_ALL

    Retrieval extension; defaults to com.snoics.retrieval.engine.index.all.impl.DefaultRDatabaseIndexAllImpl if not set.

    Batch-reads database records and writes them to the index.

    Must extend the abstract class com.snoics.retrieval.engine.index.all.impl.AbstractDefaultRDatabaseIndexAll
    or directly implement the interface com.snoics.retrieval.engine.index.all.IRDatabaseIndexAll

    4. Indexing

    4.1. Initializing the index

           retrievalApplicationContext
                   .getFacade()
                   .initIndex(new String[]{"DB","FILE"});

    4.2. Five ways to create an index

    Creating an index the plain way:

           RFacade facade = retrievalApplicationContext.getFacade();

           NormalIndexDocument normalIndexDocument =
                   facade.createNormalIndexDocument(false);

           RDocItem docItem1 = new RDocItem();
           docItem1.setContent("搜索引擎");
           docItem1.setName("KEY_FIELD");
           normalIndexDocument.addKeyWord(docItem1);

           RDocItem docItem2 = new RDocItem();
           docItem2.setContent("速度覅藕斷絲連房價多少了咖啡卡拉圣誕節");
           docItem2.setName("TITLE_FIELD");
           normalIndexDocument.addContent(docItem2);

           RDocItem docItem3 = new RDocItem();
           docItem3.setContent("哦瓦爾卡及討論熱離開家");
           docItem3.setName("CONTENT_FIELD");
           normalIndexDocument.addContent(docItem3);

           IRDocOperatorFacade docOperatorFacade =
                   facade.createDocOperatorFacade();

           docOperatorFacade.create(normalIndexDocument);

    Indexing a single database record:

           IRDocOperatorFacade docOperatorHelper =
                   retrievalApplicationContext
                           .getFacade()
                           .createDocOperatorFacade();

           String tableName = "TABLE1";
           String recordId = "849032809432490324093";

           DatabaseIndexDocument databaseIndexDocument =
                   retrievalApplicationContext
                           .getFacade()
                           .createDatabaseIndexDocument(false);

           databaseIndexDocument.setIndexPathType("DB");
           databaseIndexDocument.setIndexInfoType("TABLE1");

           databaseIndexDocument.setTableNameAndRecordId(tableName, recordId);

           RDocItem docItem1 = new RDocItem();
           docItem1.setName("TITLE");
           docItem1.setContent("SJLKDFJDSLK F");

           RDocItem docItem2 = new RDocItem();
           docItem2.setName("CONTENT");
           docItem2.setContent("RUEWOJFDLSKJFLKSJGLKJSFLKDSJFLKDSF");

           RDocItem docItem3 = new RDocItem();
           docItem3.setName("field3");
           docItem3.setContent("adsjflkdsjflkdsf");

           RDocItem docItem4 = new RDocItem();
           docItem4.setName("field4");
           docItem4.setContent("45432534253");

           RDocItem docItem5 = new RDocItem();
           docItem5.setName("field5");
           docItem5.setContent("87987yyyyyyyy");

           RDocItem docItem6 = new RDocItem();
           docItem6.setName("field6");
           docItem6.setContent("87987yyyyyyyy");

           databaseIndexDocument.addContent(docItem1);
           databaseIndexDocument.addContent(docItem2);
           databaseIndexDocument.addContent(docItem3);
           databaseIndexDocument.addContent(docItem4);
           databaseIndexDocument.addContent(docItem5);
           databaseIndexDocument.addContent(docItem6);

           docOperatorHelper.create(databaseIndexDocument);

    Indexing a single file's content and metadata:

           IRDocOperatorFacade docOperatorHelper =
                   retrievalApplicationContext
                           .getFacade()
                           .createDocOperatorFacade();

           FileIndexDocument fileIndexDocument =
                   retrievalApplicationContext
                           .getFacade()
                           .createFileIndexDocument(false, "utf-8");
           fileIndexDocument.setFileBasePath("c:\\doc");
           fileIndexDocument.setFileId("fileId_123");
           fileIndexDocument.setFile(new File("c:\\doc\\1.doc"));
           fileIndexDocument.setIndexPathType("FILE");
           fileIndexDocument.setIndexInfoType("SFILE");

           docOperatorHelper.create(fileIndexDocument, 3*1024*1024);

           

    Batch-indexing database records:

           String tableName = "TABLE1";
           String keyField = "ID";
           String sql = "SELECT ID,"
                   + "TITLE,"
                   + "CONTENT,"
                   + "FIELD3,"
                   + "FIELD4,"
                   + "FIELD5,"
                   + "FIELD6 FROM TABLE1 ORDER BY ID ASC";

           RDatabaseIndexAllItem databaseIndexAllItem =
                   retrievalApplicationContext
                           .getFacade()
                           .createDatabaseIndexAllItem(false);

           databaseIndexAllItem.setIndexPathType("DB");
           databaseIndexAllItem.setIndexInfoType("TABLE1");

           // Use RetrievalType.RIndexOperatorType.INSERT to always add a new
           // index entry, whether or not the record already exists.
           // Use RetrievalType.RIndexOperatorType.UPDATE to update the matching
           // index entry when it already exists, and add it otherwise.
           databaseIndexAllItem
                   .setIndexOperatorType(RetrievalType.RIndexOperatorType.INSERT);
           databaseIndexAllItem.setTableName(tableName);
           databaseIndexAllItem.setKeyField(keyField);
           databaseIndexAllItem.setDefaultTitleFieldName("TITLE");
           databaseIndexAllItem.setDefaultResumeFieldName("CONTENT");
           databaseIndexAllItem.setPageSize(500);
           databaseIndexAllItem.setSql(sql);
           databaseIndexAllItem.setParam(new Object[] {});
           databaseIndexAllItem
                   .setDatabaseRecordInterceptor(new TestDatabaseRecordInterceptor());

           IRDocOperatorFacade docOperatorFacade =
                   retrievalApplicationContext
                           .getFacade()
                           .createDocOperatorFacade();

           long indexCount = docOperatorFacade.createAll(databaseIndexAllItem);

           // Optimize the index
           retrievalApplicationContext
                   .getFacade()
                   .createIndexOperatorFacade("DB")
                   .optimize();

    Batch-indexing a large number of files:

           IRDocOperatorFacade docOperatorHelper =
                   retrievalApplicationContext
                           .getFacade()
                           .createDocOperatorFacade();

           RFileIndexAllItem fileIndexAllItem =
                   retrievalApplicationContext
                           .getFacade()
                           .createFileIndexAllItem(false, "utf-8");
           fileIndexAllItem.setIndexPathType("FILE");

           // Use RetrievalType.RIndexOperatorType.INSERT to always add a new
           // index entry, whether or not the record already exists.
           // Use RetrievalType.RIndexOperatorType.UPDATE to update the matching
           // index entry when it already exists, and add it otherwise.
           fileIndexAllItem
                   .setIndexOperatorType(RetrievalType.RIndexOperatorType.INSERT);
           fileIndexAllItem.setIndexInfoType("SFILE");

           fileIndexAllItem.setFileBasePath("D:\\workspace\\resources\\docs");
           fileIndexAllItem.setIncludeSubDir(true);
           fileIndexAllItem.setPageSize(100);
           fileIndexAllItem
                   .setIndexAllFileInterceptor(new TestFileIndexAllInterceptor());

           // To index files of every type, skip the settings below;
           // once file types are added, only files of those types are indexed.
           fileIndexAllItem.addFileType("doc");
           fileIndexAllItem.addFileType("docx");
           fileIndexAllItem.addFileType("sql");
           fileIndexAllItem.addFileType("html");
           fileIndexAllItem.addFileType("htm");
           fileIndexAllItem.addFileType("txt");
           fileIndexAllItem.addFileType("xls");

           long count = docOperatorHelper.createAll(fileIndexAllItem);

           retrievalApplicationContext
                   .getFacade()
                   .createIndexOperatorFacade("FILE")
                   .optimize();

    Multithreaded index creation is supported without corrupting the index files:

           Thread thread1 = new Thread(new Runnable(){
               public void run() {
                   // create index entries, singly or in batch
               }
           });
           Thread thread2 = new Thread(new Runnable(){
               public void run() {
                   // create index entries, singly or in batch
               }
           });
           Thread thread3 = new Thread(new Runnable(){
               public void run() {
                   // create index entries, singly or in batch
               }
           });

           thread1.start();
           thread2.start();
           thread3.start();

    5. Querying

    Queries are run through an RQuery instance by passing in composed QueryItem instances; results are ordered with a QuerySort instance.

           public QueryItem createQueryItem(
                   RetrievalType.RDocItemType docItemType,
                   Object name,
                   String value){
               QueryItem queryItem =
                       retrievalApplicationContext
                               .getFacade()
                               .createQueryItem(docItemType,
                                       String.valueOf(name),
                                       value);
               return queryItem;
           }

           IRQueryFacade queryFacade =
                   retrievalApplicationContext
                           .getFacade()
                           .createQueryFacade();
           RQuery query = queryFacade.createRQuery(indexPathType);

           QueryItem queryItem0 =
                   testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                           "TITLE","啊啊");
           QueryItem queryItem1 =
                   testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                           "TITLE","");
           QueryItem queryItem2 =
                   testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                           "CONTENT","工作");
           QueryItem queryItem3 =
                   testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                           "CONTENT","地方");
           QueryItem queryItem4 =
                   testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                           "FIELD3","過節");
           QueryItem queryItem5 =
                   testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                           "FIELD4","高興");

           QueryItem queryItem =
                   queryItem0
                           .should(QueryItem.SHOULD, queryItem1)
                           .should(queryItem2)
                           .should(queryItem3.mustNot(QueryItem.SHOULD, queryItem4))
                           .should(queryItem5);

           QuerySort querySort = new QuerySort(QueryUtil.createScoreSort());
           QueryResult[] queryResults =
                   query.getQueryResults(queryItem, querySort);
           query.close();

    6. Extensions

    The framework can be extended in two ways:

    1) Specify extension classes in the configuration file; they are read and applied automatically at load time.

    2) Set extension classes on a RetrievalProperties instance and use that instance to create the RetrievalApplicationContext.

    IFileContentParserManager

    Implement this interface to replace the entire file content parser manager and change how file content is parsed.

    Alternatively, keep the existing manager and add or replace individual parsers: implement the IFileContentParser interface and register it for a file type as follows.

           retrievalApplicationContext
                   .getRetrievalFactory()
                   .getFileContentParserManager()
                   .regFileContentParser("docx", fileContentParser);

    IRAnalyzerBuilder

    Implement this interface to replace the analyzer builder.

    IHighlighterMaker

    Implement this interface to replace the result highlighter.

    IRDatabaseIndexAll

    Implement this interface to batch-read database records and write them to the index,

    or extend the abstract class AbstractRDatabaseIndexAll and implement its abstract method:

           /**
            * Returns the current page of database records;
            * each call to this method returns one page.
            * @return the current page of records
            */
           public abstract List<Map> getResultList();
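    The contract behind getResultList() is simply: each call hands back the next page of records, and the batch indexer keeps calling until the source is exhausted (a null result marking the end is an assumption here, as is treating an empty page the same way). The self-contained sketch below illustrates that paging pattern; the class name PagedRecordSource and its internals are illustrative, not part of the framework:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the getResultList() paging contract:
// each call returns the next page of records, null marks the end.
public class PagedRecordSource {

    private final List<Map<String, Object>> records;
    private final int pageSize;
    private int cursor = 0;

    public PagedRecordSource(List<Map<String, Object>> records, int pageSize) {
        this.records = records;
        this.pageSize = pageSize;
    }

    // One page per call; null once every record has been handed out.
    public List<Map<String, Object>> getResultList() {
        if (cursor >= records.size()) {
            return null;
        }
        int end = Math.min(cursor + pageSize, records.size());
        List<Map<String, Object>> page =
                new ArrayList<Map<String, Object>>(records.subList(cursor, end));
        cursor = end;
        return page;
    }
}
```

    In a real subclass of AbstractRDatabaseIndexAll the page would typically come from a paged SQL query rather than an in-memory list, but the stop condition and one-page-per-call shape stay the same.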

    7. Miscellaneous

    For more detailed examples, see the code under test.

    The snoics-retrieval project uses snoics-base.jar; its source code can be downloaded from http://code.google.com/p/snoics-base/

    III. About

    Project home: http://code.google.com/p/snoics-retrieval/

    Email: snoics@gmail.com

    Blog: http://m.tkk7.com/snoics/

    posted on 2010-07-26 08:06 by snoics (2759 reads, 0 comments)
