

    Implementing a Lucene search engine [repost]

    Source:

    http://edn.embarcadero.com/article/38723



    By: John Kaster

    Abstract: A guide to implementing Lucene as a cross-platform, cross-language search engine

        Implementing a Lucene search engine

    If you are interested in implementing a search engine based on Lucene, you may be interested in these notes compiled during our research.

        Lucene

    The Lucene FAQ is an excellent resource for learning about integration options and determining the best approaches to take; several of its questions and answers are directly relevant to the topics covered below.

    Reading this FAQ before starting on a Lucene implementation is highly recommended.

        Indexing

    Lucene requires a unique ID for each content record it indexes. It uses Java's native Unicode string support for parsing text. Analyzers are available for various languages, including support for CJK languages (Chinese, Japanese, Korean, and some Vietnamese).

    Lucene can also be used to index source code with a custom Analyzer, but we modified YAPP to provide what we needed. You can also generate an AST that can be consumed by Lucene, similar to an existing solution for Java source code.

    Here's an example from the Lucene JavaDoc (which, unfortunately, does not have convenient direct URLs) showing a complete use case (or unit test, in this example):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    Analyzer analyzer = new StandardAnalyzer();
    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.getDirectory("/tmp/testindex");
    IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
    iwriter.setMaxFieldLength(25000);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, Field.Store.YES,
        Field.Index.TOKENIZED));
    iwriter.addDocument(doc);
    iwriter.optimize();
    iwriter.close();

    // Now search the index:
    IndexSearcher isearcher = new IndexSearcher(directory);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("text");
    Hits hits = isearcher.search(query);
    assertEquals(1, hits.length());
    // Iterate through the results:
    for (int i = 0; i < hits.length(); i++) {
        Document hitDoc = hits.doc(i);
        assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    isearcher.close();
    directory.close();

    Some other information I pulled directly from their JavaDoc intro is included here.

    The Lucene API is divided into several packages, which are described in the JavaDoc package overview.

    To use Lucene, an application should:

    1. Create Documents by adding Fields;
    2. Create an IndexWriter and add documents to it with addDocument();
    3. Call QueryParser.parse() to build a query from a string; and
    4. Create an IndexSearcher and pass the query to its search() method.

    The complete example earlier in this article shows each of these steps.

        Fields to Index

    Here is a list of common fields we index with Lucene. Certain applications may have other fields to add to their index, but most should be able to provide a value that represents these fields.

    AppID: ID of the application being indexed. This could be "gp" for GetPublished, "cc" for CodeCentral, "qc" for QualityCentral, "blog" for Blogs, etc.

    ID: ID of the discrete item being indexed. This would be SNIPPET_ID in CodeCentral, VERSIONID in GetPublished, etc. The unique "LuceneID" will be <AppID>.<ID>.

    Author: The full name of the author of the content.

    Title: Title of the item.

    Summary: Short description, summary, or abstract of the item.

    Body: Main contents of the item.

    PubDate: Publication, or modified, timestamp of the item. There are trade-offs for date precision storage.

    Language: 2-character code for the human language used for the text of the item.

    Product: Optional. Product(s) to which the item applies.

    Version: Optional. Version(s) of the product to which the item applies.

    Tags: Optional. Social bookmarking tags entered for the item.

    Category: Optional. Category(ies) for the item.

    We also have several custom fields for source code metadata in the index.

    The social bookmarking tags field could be interesting to search on, but social bookmarking tags are really supposed to be an alternative to free text search, so this isn't something we've pursued for the current interface.
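    As a rough illustration, the common fields above can be gathered into a single record before being turned into Lucene Fields. This is a minimal plain-Java sketch; the class and method names are my own, not part of the article's web service.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexRecord {
    // Collect the common index fields, keyed by the field names listed above.
    // The unique "LuceneID" is the composite <AppID>.<ID>.
    public static Map<String, String> build(String appId, String id,
            String author, String title, String summary, String body,
            String pubDate, String language) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("AppID", appId);
        fields.put("ID", id);
        fields.put("LuceneID", appId + "." + id);
        fields.put("Author", author);
        fields.put("Title", title);
        fields.put("Summary", summary);
        fields.put("Body", body);
        fields.put("PubDate", pubDate);
        fields.put("Language", language);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> rec = build("cc", "12345", "John Kaster",
            "Implementing Lucene", "Notes on Lucene", "Body text",
            "201009252325", "en");
        System.out.println(rec.get("LuceneID")); // cc.12345
    }
}
```

    The optional fields (Product, Version, Tags, Category) would be added the same way when present.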

        Searching

    Lucene has very robust searching options. Phrase searching, wildcard searching, fuzzy searching, and very extensive search syntax are supported.
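    For reference, here are a few queries in Lucene's classic query syntax, shown simply as strings (the field names follow the table earlier in this article):

```java
public class QueryExamples {
    // Phrase search: the terms must appear together, in order.
    public static final String PHRASE = "\"search engine\"";
    // Wildcard search: matches index, indexing, indexer, ...
    public static final String WILDCARD = "index*";
    // Fuzzy search: matches terms within a small edit distance of "lucene".
    public static final String FUZZY = "lucene~";
    // Field restriction combined with boolean operators.
    public static final String COMBINED = "Title:lucene AND Body:\"unique id\"";

    public static void main(String[] args) {
        System.out.println(PHRASE + " | " + WILDCARD + " | " + FUZZY);
        System.out.println(COMBINED);
    }
}
```

    Any of these strings can be handed to QueryParser.parse(), as in the complete example near the top of the article.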

    For your own needs, you may also want to evaluate Solr to see if it can be readily adopted as a standard search interface. A good summary presentation on Solr is available in slide form online.

    I find it ironic, however, that the Apache Lucene site uses Google for searching instead of an instance of Solr.

    We may also explore Lucene's MoreLikeThis feature.

        Analysis

    We want to track user searches to find out what kinds of searches are most popular, and what our most popular keywords are, so we can offer suggestions to search engine users. This requires more research to learn how Lucene supports analysis and reporting of usage patterns, or we may need to just track it ourselves.

    A very handy tool to use when working with Lucene indices is Luke, the Lucene Index Toolbox.

        Application and Database Integration

    Lucene can directly index database records with any available JDBC connection. Custom SQL statements are required for retrieving the fields and records to be indexed, but that appears to be pretty easy and flexible.
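    A sketch of what such a custom SQL statement and its row-to-field mapping might look like, with a plain map standing in for a java.sql.ResultSet row; the table and column names here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JdbcIndexing {
    // The custom SQL that selects the records and fields to index.
    public static final String INDEX_SQL =
        "select article_id, author, title, summary, body, updated "
        + "from articles where updated > ?";

    // Map one result-set row (a plain map here, in place of a ResultSet)
    // onto the index fields used throughout this article.
    public static Map<String, String> rowToFields(Map<String, String> row) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("ID", row.get("article_id"));
        fields.put("Author", row.get("author"));
        fields.put("Title", row.get("title"));
        fields.put("Summary", row.get("summary"));
        fields.put("Body", row.get("body"));
        fields.put("PubDate", row.get("updated"));
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(INDEX_SQL);
    }
}
```

    In a real indexer, the loop would execute INDEX_SQL over JDBC, call rowToFields() for each row, and feed the result to the IndexWriter.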

    By default, Lucene indexes plain text content. We provide either HTML or plain text values to Lucene, which our Document Adapter already supports.

        Our web service interface

    Since Lucene is written in Java, we created a web service to call it from our non-Java applications. The Lucene for .NET project appears to have stagnated, and since JBuilder makes it so easy to create a web service, the web service is the best way to make it available for all of our platforms and languages.

    These are some of the relevant functions of our web service:

    • Index the specified content ID and text
    • Index the specified content ID and html
    • Support "extra data" such as QualityCentral workarounds, products, versions, etc.
    • Perform a search from the specified search expression, returning all matching IDs and scoring
    • Refine an existing search result with the specified search expression

    We hope to make our search web service publicly available in the future, along with a browser plug-in and IDE integration for it.

    You can download the source code to our web service from CodeCentral.

        Application interaction

    There are only three interactions an application needs to have with the search web service: 1) indexing new or updated content, 2) removing an index entry for deleted content, and 3) retrieving and processing search results.

        Indexing content

    Whenever content is created or updated on one of our web sites, the new content is submitted to our indexing queue with a call to the web service. If the specified unique ID already exists in the index, the current entry is deleted, then the content is reindexed. Lucene is very fast at indexing search content, and most requests happen in real time. However, if many indexing requests are being made at the same time, they are put into a queue to be processed when less demand is being placed on the indexer.

    To ensure content gets indexed, most of our web systems also maintain their own queue of items to be resubmitted later to the Lucene index if the indexing request fails for any reason.
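    The delete-then-reindex behavior and the client-side retry queue described above can be modeled in a few lines. This is a sketch only, using an in-memory map in place of the real Lucene index; all names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class IndexQueue {
    private final Map<String, String> index = new HashMap<>();
    private final Queue<String[]> retry = new ArrayDeque<>();

    // Index content under its unique ID. If the ID already exists, the
    // old entry is replaced, mirroring delete-then-reindex.
    public void index(String luceneId, String content) {
        index.put(luceneId, content);
    }

    // On a failed indexing request, the caller keeps the item to resubmit.
    public void queueForRetry(String luceneId, String content) {
        retry.add(new String[] { luceneId, content });
    }

    // Drain the retry queue, resubmitting each pending item.
    public void drainRetries() {
        String[] item;
        while ((item = retry.poll()) != null) {
            index(item[0], item[1]);
        }
    }

    public String get(String luceneId) { return index.get(luceneId); }
    public int size() { return index.size(); }

    public static void main(String[] args) {
        IndexQueue q = new IndexQueue();
        q.index("blog.1", "old");
        q.index("blog.1", "new"); // replaces the earlier entry
        System.out.println(q.size()); // 1
    }
}
```

    The real system adds a second layer: when demand is high, requests wait in a server-side queue before being applied to the index at all.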

    Example: Indexing blogs

    Here's a relevant PHP snippet from batch indexing our WordPress-based blog content:

    private function indexPost( $site, $blogID, $blogName, $postID, $authorName,
                                $body, $title, $modifiedDate )
    {
        echo( "  Indexing post $postID ($authorName): $title\r\n" );
        $contentID = $site . '.' . $blogName . '.' . $postID;
        // Convert to standardized date format.
        $modifiedDate = str_replace( ' ', '', str_replace( ':', '',
            str_replace( '-', '', $modifiedDate ) ) );
        // Make sure that everything we're passing is valid UTF-8
        $authorName = validateUtf8( $authorName );
        $title = validateUtf8( $title );
        $body = validateUtf8( $body );
        $comments = validateUtf8( $this->getPostComments( $blogID, $postID ) );
        $startTime = microtime( true );
        $resultObj = $this->sc->indexHTMLContentEx( 'blog', $contentID,
            $authorName, $title, $title, $body, $modifiedDate, 'en', $comments,
            '', '', '', '', '' );
        $endTime = microtime( true );
        $result = $resultObj->result;
        $message = $resultObj->message;
        if( !$result )
            throw new Exception( "Indexing failed, error: " . $message );
        // Track call time.
        if( count( $this->callTimes ) == 1000 )
            array_shift( $this->callTimes );
        $this->callTimes[] = ( $endTime - $startTime );
        ++$this->postCount;
    }
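    The date normalization in the PHP above turns "2010-09-25 23:25:00" into the sortable string "20100925232500" by stripping the separators. If you are producing the same value from Java, SimpleDateFormat can format it directly (the helper name is mine, and the UTC zone is an assumption):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class LuceneDates {
    // Format a timestamp as the sortable yyyyMMddHHmmss string that the
    // PHP snippet builds by stripping '-', ':' and spaces.
    public static String toLuceneDate(Date d) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(d);
    }

    public static void main(String[] args) {
        // 2010-09-25 23:25:00 UTC as epoch milliseconds
        System.out.println(toLuceneDate(new Date(1285457100000L))); // 20100925232500
    }
}
```

    Because the string sorts lexicographically in date order, it works for range queries; truncating it (for example to yyyyMMdd) trades precision for a smaller term count, which is the trade-off mentioned for the PubDate field.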

    Example: Indexing CodeCentral

    Here is a C# sample that calls our web service, based on CodeCentral's metadata:

    private static long indexCC()
    {
        DbConnection con = DbUtils.CreateConnectionName("CodeCentral");
        DateTime started = DateTime.Now;
        Console.WriteLine("Indexing CodeCentral at "
            + DateTime.Now.ToShortTimeString());
        long recCount = 0;
        try
        {
            using (DbCommand cmd = DbUtils.MakeCommand(con,
                "select snippet_id, u.first_name, u.last_name, s.title, s.shortdesc, "
                + "s.description, s.updated, s.comments, p.description, s.low_version, "
                + "s.high_version, c.description "
                + "from snippet s, category c, product p "
                + "inner join users u on s.user_id = u.user_id "
                + "where s.category_id = c.category_id "
                + "and s.product_id = p.product_id"))
            {
                using (DbDataReader reader = DbUtils.InitReader(cmd))
                {
                    while (reader.Read())
                    {
                        try
                        {
                            // recCount++;
                            // indexer.indexHTMLContentEx(
                            using (StreamWriter sr = new StreamWriter(
                                string.Format("{0}\\current.txt",
                                Path.GetDirectoryName(Application.ExecutablePath))))
                            {
                                sr.Write(reader[0].ToString());
                            }
                            ZipFile file;
                            try
                            {
                                file = CCZipUtils.GetZipFile(con as TAdoDbxConnection,
                                    reader.GetInt32(0),
                                    ConfigurationSettings.AppSettings["cacheRootPath"]);
                            }
                            catch
                            {
                                file = null;
                            }
                            List<SearchResult> result = indexHtmlContent(
                                "cc",                  // appid
                                reader[0].ToString(),  // contentid
                                reader[1].ToString(),  // first name
                                reader[2].ToString(),  // last name
                                reader[3].ToString(),  // title
                                reader[4].ToString(),  // summary
                                reader[5].ToString(),  // body
                                reader.GetDateTime(6), // pubdate
                                "en",                  // language
                                reader[7].ToString(),  // comments
                                reader[8].ToString(),  // product
                                string.Format("{0} to {1}", reader.GetDouble(9),
                                    reader.GetDouble(10)), // version
                                "",                    // tags
                                reader[11].ToString(), // category
                                "",                    // extradata
                                "",                    // contenttype
                                "",                    // workaround
                                file,
                                recCount++
                            );
                            foreach (SearchResult item in result)
                            {
                                if (!item.Result)
                                {
                                    badIDs.AppendFormat("cc:{0} - {1}{2}",
                                        item.ContentID, item.Message,
                                        Environment.NewLine);
                                }
                            }
                        }
                        catch (Exception e)
                        {
                            badIDs.AppendFormat("cc:{0} - {1}{2}", reader[0],
                                e.Message, Environment.NewLine);
                        }
                    }
                }
            }
        }
        finally
        {
            con.Close();
        }
        showTiming(started, recCount);
        return recCount;
    }

    Example: Indexing QualityCentral

    The QualityCentral web service is built with Delphi native code. Although the automatic indexing has not been deployed for the QualityCentral web service yet, we do have a reindexing application that can be used to index all reports in QualityCentral. Here's the routine that submits the content to the search engine:

    procedure TQCIndexDataModule.IndexSince(AFromDate: TDateTime);
    var
      TickStart,
      TickTime: Cardinal;
      SearchWS: Search;
      FldID,
      FldFirstName,
      FldLastName,
      FldTitle,
      FldDescription,
      FldSteps,
      FldModified,
      FldProject,
      FldVersion,
      FldDataType,
      FldWorkaround: TField;
      Recs: Int64;
      Name: UnicodeString;
      Result: BooleanResult;
    begin
      SearchWS := GetSearch;
      WriteLn('Started at ' + DateTimeToStr(Now));
      Recs := 0;
      TickStart := GetTickCount;
      try
        QCReports.Active := False;
        QCReports.ParamByName('FromDate').AsDateTime := AFromDate;
        QCReports.Active := True;
        FldID := QCReports.FieldByName('defect_no');
        FldFirstName := QCReports.FieldByName('first_name');
        FldLastName := QCReports.FieldByName('last_name');
        FldTitle := QCReports.FieldByName('short_description');
        FldDescription := QCReports.FieldByName('description');
        FldSteps := QCReports.FieldByName('steps');
        FldModified := QCReports.FieldByName('modified_date');
        FldProject := QCReports.FieldByName('project');
        FldVersion := QCReports.FieldByName('version');
        FldDataType := QCReports.FieldByName('data_type');
        FldWorkaround := QCReports.FieldByName('workaround');
        while not QCReports.EOF do
        begin
          Name := FullName(FldFirstName.AsWideString, FldLastName.AsWideString);
          try
            Result := SearchWS.indexContentEx('qc', FldID.AsString, Name,
              FldTitle.AsWideString, FldDescription.AsWideString,
              FldSteps.AsWideString, LuceneDateTime(FldModified.AsDateTime),
              'en', QCCommentText(FldID.AsInteger), FldProject.AsWideString,
              FldVersion.AsWideString, '', FldDataType.AsWideString,
              '', '', FldWorkaround.AsWideString);
            if Result.result then
            begin
              Inc(Recs);
              if Recs mod 30 = 0 then
                WriteLn(Recs, '#', FldID.AsInteger, ':', Name, ':',
                  FldTitle.AsWideString);
            end
            else
              WriteLn('#Error on QC#' + FldID.AsString + ':' + Result.message_);
          except
            on E: Exception do
              WriteLn('#Error on QC#' + FldID.AsString + ':' + E.Message);
          end;
          QCReports.Next;
        end;
      finally
        TickTime := GetTickCount - TickStart;
        QCReports.Close;
        WriteLn('Stopped at ' + DateTimeToStr(Now));
        WriteLn('Records: ', Recs, ' Elapsed time: ' + TicksToTime(TickTime));
      end;
    end;

    The Search type is the Delphi class for calling the web service, generated by Delphi's WSDL importer. This is a console application, so progress is displayed once every 30 records. We also track the amount of time it takes to create the index, and report on any exceptions we encounter.

        Removing an index entry

    The same method that deletes an index entry that already exists when a reindex request is submitted can be called directly via the web service. All of our content systems tell Lucene to remove index entries from the index when content is deleted.

    Example: Removing QualityCentral reports from the index

    Here's Delphi code for removing all QC reports from the search index. All that needs to be specified is the application ID and the unique contentID, which in this case is the report id.

    procedure TQCIndexDataModule.ClearIndex;
    var
      SearchWS: Search;
      FldID: TField;
    begin
      SearchWS := GetSearch;
      QCAllReports.Close;
      try
        QCAllReports.Open;
        FldID := QCAllReports.FieldByName('defect_no');
        while not QCAllReports.EOF do
        begin
          SearchWS.deleteContent('qc', FldID.AsString);
          QCAllReports.Next;
        end;
      finally
        QCAllReports.Close;
      end;
    end;

        Retrieving and processing search results

    Of course, the purpose of indexing all this content is to be able to find it when you search for it, and Lucene makes that very easy. However, displaying the search results is a bit more complicated, due to the complexity of our data models and the requirement to conform with the visibility rules of the systems using our search engine.

    Therefore, application-specific searches are resolved in two passes:

    Applications submit keyword search expressions to Lucene, which returns the list of IDs whose keywords match the required criteria. The search expression can be customized with additional field restrictions such as Title, Abstract, Code, Comments, Workarounds, etc.

    The returned list of IDs from the search expression is then further filtered based on additional metadata values (some of these could be Lucene fields as well if you want to frequently update your index), such as products, versions, author, site, publish date, popularity, ratings, entitlement, and so on.
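    The second pass can be sketched in plain Java; the metadata filter below stands in for the application-specific visibility rules, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class TwoPassSearch {
    // Pass 1 (elsewhere): the search engine returns matching IDs.
    // Pass 2 (here): keep only the IDs whose metadata passes the
    // application's visibility rules.
    public static List<String> filterByMetadata(List<String> matchingIds,
            Map<String, Map<String, String>> metadata,
            Predicate<Map<String, String>> visible) {
        List<String> out = new ArrayList<>();
        for (String id : matchingIds) {
            Map<String, String> meta = metadata.get(id);
            if (meta != null && visible.test(meta)) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> meta = Map.of(
            "cc.1", Map.of("product", "Delphi"),
            "cc.2", Map.of("product", "C++Builder"));
        System.out.println(filterByMetadata(List.of("cc.1", "cc.2"), meta,
            m -> "Delphi".equals(m.get("product")))); // prints [cc.1]
    }
}
```

    Keeping the filter out of the Lucene index means visibility rules can change without reindexing, at the cost of a second lookup per result.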

        A look back in time

    The previous search engine used by CodeCentral, QualityCentral, and SiteCentral (our web site presentation engine) was based on a Delphi component I wrote in 1998 for searching CodeCentral. The architecture of this search technology remained basically the same since that time. Keywords are persisted to a database, and searches are performed with nested SQL select statements that combine the keyword filtering results (AND, OR, and NOT) with other criteria for the various databases using this keyword search engine.

    Because we know the metadata we're searching, the results were generally good, but the only options for keywords are the AND, OR, and NOT operations, plus wildcards.

    This search engine had only minor updates to it as we moved from native Delphi code to .NET. The "2.0" version was going to include frequency scoring, proximity searching, and Unicode. I was done with the general-purpose Unicode keyword breaking and persistence engine classes, and was starting to work on the updated query generation code when I decided to revisit Lucene. I'm really glad I did.

    posted on 2010-09-25 23:25 by fox009, filed under: search engine technology
