The IndexSearcher class has a method that explains how Lucene scored a Document:

public Explanation explain(Weight weight, int doc) throws IOException {
return weight.explain(reader, doc);
}

The returned Explanation instance describes how a Document's score was computed. A quick test shows exactly what information an Explanation records about a Document.

The test class is as follows:

package org.shirdrn.lucene.learn;

import java.io.IOException;
import java.util.Date;

import net.teamhot.lucene.ThesaurusAnalyzer;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.LockObtainFailedException;

public class AboutLuceneScore {

private String path = "E:\\Lucene\\index";

public void createIndex(){
   IndexWriter writer;
   try {
    writer = new IndexWriter(path,new ThesaurusAnalyzer(),true);

    Field fieldA = new Field("contents","一人",Field.Store.YES,Field.Index.TOKENIZED);
    Document docA = new Document();
    docA.add(fieldA);

    Field fieldB = new Field("contents","一人之交一人之交",Field.Store.YES,Field.Index.TOKENIZED);
    Document docB = new Document();
    docB.add(fieldB);

    Field fieldC = new Field("contents","一人之下一人之下",Field.Store.YES,Field.Index.TOKENIZED);
    Document docC = new Document();
    docC.add(fieldC);

    Field fieldD = new Field("contents","一人做事一人當 一人做事一人當",Field.Store.YES,Field.Index.TOKENIZED);
    Document docD = new Document();
    docD.add(fieldD);

    Field fieldE = new Field("contents","一人做事一人當 一人做事一人當",Field.Store.YES,Field.Index.TOKENIZED);
    Document docE = new Document();
    docE.add(fieldE);

    writer.addDocument(docA);
    writer.addDocument(docB);
    writer.addDocument(docC);
    writer.addDocument(docD);
    writer.addDocument(docE);

    writer.close();
   } catch (CorruptIndexException e) {
    e.printStackTrace();
   } catch (LockObtainFailedException e) {
    e.printStackTrace();
   } catch (IOException e) {
    e.printStackTrace();
   }
}

public static void main(String[] args) {
   AboutLuceneScore aus = new AboutLuceneScore();
   aus.createIndex();    // build the index
   try {
    String keyword = "一人";
    Term term = new Term("contents",keyword);
    Query query = new TermQuery(term);
    IndexSearcher searcher = new IndexSearcher(aus.path);
    Date startTime = new Date();
    Hits hits = searcher.search(query);
    TermDocs termDocs = searcher.getIndexReader().termDocs(term);
    while(termDocs.next()){
     System.out.print("Keyword <"+keyword+"> occurs in Document "+termDocs.doc());
     System.out.println(" "+termDocs.freq()+" time(s)");
    }
    System.out.println("********************************************************************");
    for(int i=0;i<hits.length();i++){
     System.out.println("Document internal id: "+hits.id(i));
     System.out.println("Document content: "+hits.doc(i));
     System.out.println("Document score: "+hits.score(i));
     Explanation e = searcher.explain(query, hits.id(i));
     System.out.println("Explanation: \n"+e);
     System.out.println("Explanation values for this Document: ");
     System.out.println("Explanation getValue(): "+e.getValue());
     System.out.println("Explanation getDescription(): "+e.getDescription());
     System.out.println("********************************************************************");
    }
    System.out.println("Found "+hits.length()+" matching Document(s).");
    Date finishTime = new Date();
    long timeOfSearch = finishTime.getTime() - startTime.getTime();
    System.out.println("Search took "+timeOfSearch+" ms");
   } catch (CorruptIndexException e) {
    e.printStackTrace();
   } catch (IOException e) {
    e.printStackTrace();
   }

}
}

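The test class provides a createIndex() method that builds the index. It then searches for the keyword “一人” and prints information about every matching Document.

The first part of the output shows how many times the keyword “一人” occurs in each Document.

The second part shows each Document's Explanation and score.

The test output is: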

Keyword <一人> occurs in Document 0 1 time(s)
Keyword <一人> occurs in Document 1 1 time(s)
Keyword <一人> occurs in Document 2 1 time(s)
Keyword <一人> occurs in Document 3 2 time(s)
Keyword <一人> occurs in Document 4 2 time(s)
********************************************************************
Document internal id: 0
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人>>
Document score: 0.81767845
Explanation:
0.81767845 = (MATCH) fieldWeight(contents:一人 in 0), product of:
1.0 = tf(termFreq(contents:一人)=1)
0.81767845 = idf(docFreq=5)
1.0 = fieldNorm(field=contents, doc=0)

Explanation values for this Document:
Explanation getValue(): 0.81767845
Explanation getDescription(): fieldWeight(contents:一人 in 0), product of:
********************************************************************
Document internal id: 3
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人做事一人當 一人做事一人當>>
Document score: 0.5059127
Explanation:
0.5059127 = (MATCH) fieldWeight(contents:一人 in 3), product of:
1.4142135 = tf(termFreq(contents:一人)=2)
0.81767845 = idf(docFreq=5)
0.4375 = fieldNorm(field=contents, doc=3)

Explanation values for this Document:
Explanation getValue(): 0.5059127
Explanation getDescription(): fieldWeight(contents:一人 in 3), product of:
********************************************************************
Document internal id: 4
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人做事一人當 一人做事一人當>>
Document score: 0.5059127
Explanation:
0.5059127 = (MATCH) fieldWeight(contents:一人 in 4), product of:
1.4142135 = tf(termFreq(contents:一人)=2)
0.81767845 = idf(docFreq=5)
0.4375 = fieldNorm(field=contents, doc=4)

Explanation values for this Document:
Explanation getValue(): 0.5059127
Explanation getDescription(): fieldWeight(contents:一人 in 4), product of:
********************************************************************
Document internal id: 1
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人之交一人之交>>
Document score: 0.40883923
Explanation:
0.40883923 = (MATCH) fieldWeight(contents:一人 in 1), product of:
1.0 = tf(termFreq(contents:一人)=1)
0.81767845 = idf(docFreq=5)
0.5 = fieldNorm(field=contents, doc=1)

Explanation values for this Document:
Explanation getValue(): 0.40883923
Explanation getDescription(): fieldWeight(contents:一人 in 1), product of:
********************************************************************
Document internal id: 2
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人之下一人之下>>
Document score: 0.40883923
Explanation:
0.40883923 = (MATCH) fieldWeight(contents:一人 in 2), product of:
1.0 = tf(termFreq(contents:一人)=1)
0.81767845 = idf(docFreq=5)
0.5 = fieldNorm(field=contents, doc=2)

Explanation values for this Document:
Explanation getValue(): 0.40883923
Explanation getDescription(): fieldWeight(contents:一人 in 2), product of:
********************************************************************
Found 5 matching Document(s).
Search took 79 ms

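Starting from the test output, we can observe the following:

■ In the test class, hits.score(i) equals the Explanation's getValue(); this is the score Lucene uses by default;

■ By default, Lucene sorts search results by Document score, highest first;

■ By default, Documents with the same score are ordered by their internal id. Above, (3, 4) and (1, 2) are two pairs of Documents with equal scores, and within each pair the results appear in id order;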
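The ordering rule in the list above can be sketched without Lucene. This is only an illustration: HitOrderDemo, Hit and rankedDocIds are hypothetical names, and the scores are copied from the test output rather than computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene: reproduces the default result order.
class HitOrderDemo {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Sort by score descending; break ties by ascending internal document id.
    static int[] rankedDocIds(int[] docs, float[] scores) {
        List<Hit> hits = new ArrayList<Hit>();
        for (int i = 0; i < docs.length; i++) {
            hits.add(new Hit(docs[i], scores[i]));
        }
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                int byScore = Float.compare(b.score, a.score);
                return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
            }
        });
        int[] order = new int[hits.size()];
        for (int i = 0; i < order.length; i++) {
            order[i] = hits.get(i).doc;
        }
        return order;
    }

    public static void main(String[] args) {
        int[] docs = {0, 1, 2, 3, 4};
        // Scores taken from the test output above.
        float[] scores = {0.81767845f, 0.40883923f, 0.40883923f, 0.5059127f, 0.5059127f};
        System.out.println(Arrays.toString(rankedDocIds(docs, scores))); // prints [0, 3, 4, 1, 2]
    }
}
```

The printed order matches the order in which the Documents appear in the test output.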

Looking at the explain method of the IndexSearcher class:

public Explanation explain(Weight weight, int doc) throws IOException {
return weight.explain(reader, doc);
}

we can see that it actually calls the explain() method of the Weight interface. A Weight is tied to a Query: it holds the state computed for one search with that Query, so the Query instance itself can be reused.

Concretely, explain() can be traced to TermWeight, a class that implements Weight and is defined as an inner class of TermQuery. TermWeight's explain() method is:

public Explanation explain(IndexReader reader, int doc)
throws IOException {

ComplexExplanation result = new ComplexExplanation();
result.setDescription("weight("+getQuery()+" in "+doc+"), product of:");

Explanation idfExpl = new Explanation(idf, "idf(docFreq=" + reader.docFreq(term) + ")");

      // explain query weight
      Explanation queryExpl = new Explanation();
      queryExpl.setDescription("queryWeight(" + getQuery() + "), product of:");

      Explanation boostExpl = new Explanation(getBoost(), "boost");
      if (getBoost() != 1.0f)
        queryExpl.addDetail(boostExpl);
      queryExpl.addDetail(idfExpl);

Explanation queryNormExpl = new Explanation(queryNorm,"queryNorm");
queryExpl.addDetail(queryNormExpl);

queryExpl.setValue(boostExpl.getValue() *idfExpl.getValue() *queryNormExpl.getValue());

result.addDetail(queryExpl);

      // explain field weight
      String field = term.field();
      ComplexExplanation fieldExpl = new ComplexExplanation();
      fieldExpl.setDescription("fieldWeight("+term+" in "+doc+"), product of:");

      Explanation tfExpl = scorer(reader).explain(doc);
      fieldExpl.addDetail(tfExpl);
      fieldExpl.addDetail(idfExpl);

      Explanation fieldNormExpl = new Explanation();
      byte[] fieldNorms = reader.norms(field);
      float fieldNorm =
        fieldNorms!=null ? Similarity.decodeNorm(fieldNorms[doc]) : 0.0f;
      fieldNormExpl.setValue(fieldNorm);
      fieldNormExpl.setDescription("fieldNorm(field="+field+", doc="+doc+")");
      fieldExpl.addDetail(fieldNormExpl);

      fieldExpl.setMatch(Boolean.valueOf(tfExpl.isMatch()));
      fieldExpl.setValue(tfExpl.getValue() *idfExpl.getValue() *fieldNormExpl.getValue());

      result.addDetail(fieldExpl);
      result.setMatch(fieldExpl.getMatch());

      // combine them
      result.setValue(queryExpl.getValue() * fieldExpl.getValue());

if (queryExpl.getValue() == 1.0f)
return fieldExpl;

return result;
}

Comparing the search output with TermWeight's explain() method above, the printed strings correspond one to one, for example idf (Inverse Document Frequency), fieldNorm, and fieldWeight.
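To make the shape of that output easier to see, here is a minimal sketch of an explanation tree whose root value is the product of its details. MiniExplanation is a hypothetical stand-in written for this illustration; it is not Lucene's Explanation class and only mimics the "value = description" layout, using the numbers printed for Document 3.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an explanation node: a value, a description, and child details.
class MiniExplanation {
    final float value;
    final String description;
    final List<MiniExplanation> details = new ArrayList<MiniExplanation>();

    MiniExplanation(float value, String description) {
        this.value = value;
        this.description = description;
    }

    // Build a node whose value is the product of its parts, as fieldWeight is.
    static MiniExplanation product(String description, MiniExplanation... parts) {
        float v = 1.0f;
        for (MiniExplanation p : parts) {
            v *= p.value;
        }
        MiniExplanation node = new MiniExplanation(v, description + ", product of:");
        for (MiniExplanation p : parts) {
            node.details.add(p);
        }
        return node;
    }

    // Render the tree, indenting each level by two spaces.
    String render(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(value).append(" = ").append(description).append('\n');
        for (MiniExplanation d : details) {
            sb.append(d.render(depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values for Document 3, taken from the test output.
        MiniExplanation fieldWeight = product("fieldWeight(doc=3)",
                new MiniExplanation((float) Math.sqrt(2), "tf(termFreq=2)"),
                new MiniExplanation(0.81767845f, "idf(docFreq=5)"),
                new MiniExplanation(0.4375f, "fieldNorm(doc=3)"));
        System.out.print(fieldWeight.render(0));
    }
}
```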

The information for the first Document in the results:

Document internal id: 0
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人>>
Document score: 0.81767845
Explanation:
0.81767845 = (MATCH) fieldWeight(contents:一人 in 0), product of:
1.0 = tf(termFreq(contents:一人)=1)
0.81767845 = idf(docFreq=5)
1.0 = fieldNorm(field=contents, doc=0)

Explanation values for this Document:
Explanation getValue(): 0.81767845
Explanation getDescription(): fieldWeight(contents:一人 in 0), product of:

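Computing tf

The tf value above is the Term Frequency. The org.apache.lucene.search.Similarity class documents it in detail. Lucene does not use the raw term frequency directly; it uses its square root:

tf(t in d) = frequency^½

This is computed by the following method of DefaultSimilarity, a subclass of org.apache.lucene.search.Similarity: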

/** Implemented as <code>sqrt(freq)</code>. */
public float tf(float freq) {
return (float)Math.sqrt(freq);
}

That is, a Document's tf is the square root of freq, the number of times the search term occurs in that Document.

For our search results, the tf of each Document is:

Document 0: tf = (float)Math.sqrt(1) = 1.0;
Document 1: tf = (float)Math.sqrt(1) = 1.0;
Document 2: tf = (float)Math.sqrt(1) = 1.0;
Document 3: tf = (float)Math.sqrt(2) = 1.4142135;
Document 4: tf = (float)Math.sqrt(2) = 1.4142135;
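The tf values can be checked with a few lines of plain Java. This is a small sketch with no Lucene dependency: TfDemo is a hypothetical class name, and the formula is the one quoted from DefaultSimilarity above.

```java
// Stand-alone check of the tf formula; TfDemo is a hypothetical name.
class TfDemo {
    // Same formula as the DefaultSimilarity.tf(float) method quoted above.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Term frequencies of the keyword in Documents 0..4, from the test output.
        int[] freqs = {1, 1, 1, 2, 2};
        for (int doc = 0; doc < freqs.length; doc++) {
            System.out.println("doc " + doc + ": tf = " + tf(freqs[doc]));
        }
    }
}
```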

Computing idf

Every Document in the results has an associated idf. DefaultSimilarity implements the idf computation as follows:

/** Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>. */
public float idf(int docFreq, int numDocs) {
return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}

Here docFreq is the number of Documents that match the given keyword (docFreq=5 in our test), and numDocs is the total number of Documents in the index. Our test is a special case because every Document is retrieved, so numDocs=5 as well.

The idf of each Document is:

Document 0: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
Document 1: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
Document 2: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
Document 3: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
Document 4: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
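Because idf depends only on docFreq and numDocs, it is the same for every Document matched by a given term. The sketch below reuses the quoted formula in plain Java. IdfDemo is a hypothetical class name, and the second call uses an invented docFreq of 1 purely to show that a rarer term gets a larger idf.

```java
// Stand-alone check of the idf formula; IdfDemo is a hypothetical name.
class IdfDemo {
    // Same formula as the DefaultSimilarity.idf(int, int) method quoted above.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // The test case: all 5 Documents contain the term.
        System.out.println("idf(5, 5) = " + idf(5, 5));
        // Invented case for comparison: only 1 of 5 Documents contains the term.
        System.out.println("idf(1, 5) = " + idf(1, 5));
    }
}
```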

Computing lengthNorm

DefaultSimilarity implements the lengthNorm computation as follows:

public float lengthNorm(String fieldName, int numTerms) {
return (float)(1.0 / Math.sqrt(numTerms));
}

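Here numTerms is the total number of tokens in the field, not the number of times the search term occurs. The test does not print the token counts; working backwards from the fieldNorm values in the output (1.0, 0.5 and 0.4375), they appear to be 1, 4, 4, 5 and 5, which gives:

Document 0: lengthNorm = (float)(1.0 / Math.sqrt(1)) = 1.0;
Document 1: lengthNorm = (float)(1.0 / Math.sqrt(4)) = 0.5;
Document 2: lengthNorm = (float)(1.0 / Math.sqrt(4)) = 0.5;
Document 3: lengthNorm = (float)(1.0 / Math.sqrt(5)) ≈ 0.4472136 (read back from the index as 0.4375, because the norm is stored in a single byte);
Document 4: lengthNorm = (float)(1.0 / Math.sqrt(5)) ≈ 0.4472136 (read back from the index as 0.4375);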

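About fieldNorm

fieldNorm is written when the index is built. At search time it is read from the index files and decoded into a float, which is then used to compute the Document's score.

In the org.apache.lucene.search.TermQuery.TermWeight class, the explain method reads the norms through the open IndexReader. They are stored in the index as a byte[] with one byte per Document, so each value has to be decoded: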

byte[] fieldNorms = reader.norms(field);
float fieldNorm = fieldNorms!=null ? Similarity.decodeNorm(fieldNorms[doc]) : 0.0f;

The decodeNorm method of the Similarity class converts one byte into a float:

public static float decodeNorm(byte b) {
return NORM_TABLE[b & 0xFF]; // & 0xFF maps negative bytes to positive above 127
}

The float fieldNorm value has now been read and can take part in the computation of the Document's score.
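The following sketch illustrates why a lengthNorm can come back from the index slightly smaller than it went in. It is modeled on the one-byte float scheme I understand Lucene to use for norms, but it is written from memory and is not copied from the Lucene version used in this test. NormDemo and its method names are hypothetical, and the token counts 1, 4 and 5 are assumptions chosen because they reproduce the fieldNorm values 1.0, 0.5 and 0.4375 in the output.

```java
// Sketch of a lossy one-byte float encoding; NormDemo is a hypothetical name.
class NormDemo {
    // Same formula as the DefaultSimilarity.lengthNorm method quoted above.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Encode a float into one byte by keeping only its top exponent and mantissa bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;          // sign, exponent and the top 2 mantissa bits
        int zero = (63 - 15) << 3;       // offset that maps the usable range into 1..255
        if (small <= zero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;   // zero, negative, or underflow
        }
        if (small >= zero + 0x100) {
            return (byte) -1;            // overflow: clamp to the largest code
        }
        return (byte) (small - zero);
    }

    // Decode the byte back into a float; the dropped mantissa bits come back as zeros.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xFF) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        int[] assumedTokenCounts = {1, 4, 5};   // assumption, see the text above
        for (int n : assumedTokenCounts) {
            float exact = lengthNorm(n);
            System.out.println("numTerms=" + n + " lengthNorm=" + exact
                    + " stored as " + decode(encode(exact)));
        }
    }
}
```

Values such as 1.0 and 0.5 survive the round trip unchanged, while 1/sqrt(5) is truncated to 0.4375, the fieldNorm printed for Documents 3 and 4.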

Computing queryWeight

The computation of queryWeight can be seen in the sumOfSquaredWeights method of the org.apache.lucene.search.TermQuery.TermWeight class:

    public float sumOfSquaredWeights() {
      queryWeight = idf * getBoost();             // compute query weight
      return queryWeight * queryWeight;          // square it
    }

By default queryWeight = idf, because Lucene's default boost is 1.0.

queryWeight depends only on the query, so it is the same for every Document. In our test, before normalization:

queryWeight = 0.81767845 * 1.0 = 0.81767845;

Computing queryNorm

queryNorm is implemented in the DefaultSimilarity class:

/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
public float queryNorm(float sumOfSquaredWeights) {
return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}

Here, sumOfSquaredWeights is computed by the sumOfSquaredWeights method of the org.apache.lucene.search.TermQuery.TermWeight class:

    public float sumOfSquaredWeights() {
      queryWeight = idf * getBoost();             // compute query weight
      return queryWeight * queryWeight;          // square it
    }

By default sumOfSquaredWeights = idf * idf, because Lucene's default boost is 1.0.

For the test above, sumOfSquaredWeights is:

sumOfSquaredWeights = 0.81767845 * 0.81767845 = 0.6685980475944025;

queryNorm can then be computed:

queryNorm = (float)(1.0 / Math.sqrt(0.6685980475944025)) ≈ 1.2229746;

Computing value

The org.apache.lucene.search.TermQuery.TermWeight class also defines a value member:

private float value;

value is computed in the normalize method of the same class:

    public void normalize(float queryNorm) {
      this.queryNorm = queryNorm;
      queryWeight *= queryNorm;                   // normalize query weight
      value = queryWeight * idf;                  // idf for document
    }

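normalize first multiplies queryWeight by queryNorm and then multiplies the result by idf. Written in terms of the un-normalized queryWeight (idf * boost):

value = queryNorm * queryWeight * idf;

For the test above:

value = 1.2229746 * 0.81767845 * 0.81767845 ≈ 0.81767845;

Since queryNorm * queryWeight = 1.0 here, the normalized queryWeight is 1.0 and value equals idf. This is also why explain() returns only the fieldWeight part for our query: its final check, queryExpl.getValue() == 1.0f, succeeds, which matches the test output.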
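The order of the calls matters here, so a small stand-alone sketch may help. QueryWeightDemo is a hypothetical class that copies only the two TermWeight methods and the queryNorm formula quoted above. It does not use Lucene, and the idf value is taken from the test output.

```java
// Hypothetical re-creation of the query-side weight computation for a single term.
class QueryWeightDemo {
    final float idf;
    final float boost;
    float queryWeight;
    float queryNorm;
    float value;

    QueryWeightDemo(float idf, float boost) {
        this.idf = idf;
        this.boost = boost;
    }

    // Mirrors the quoted TermWeight.sumOfSquaredWeights().
    float sumOfSquaredWeights() {
        queryWeight = idf * boost;            // un-normalized query weight
        return queryWeight * queryWeight;
    }

    // Mirrors the quoted DefaultSimilarity.queryNorm(float).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Mirrors the quoted TermWeight.normalize(float).
    void normalize(float norm) {
        this.queryNorm = norm;
        queryWeight *= norm;                  // normalized query weight
        value = queryWeight * idf;
    }

    public static void main(String[] args) {
        QueryWeightDemo w = new QueryWeightDemo(0.81767845f, 1.0f);
        float sum = w.sumOfSquaredWeights();
        w.normalize(queryNorm(sum));
        System.out.println("sumOfSquaredWeights = " + sum);
        System.out.println("queryNorm           = " + w.queryNorm);
        System.out.println("queryWeight         = " + w.queryWeight);
        System.out.println("value               = " + w.value);
    }
}
```

For a single-term query with boost 1.0, the normalized queryWeight comes out at (approximately) 1.0, so value is essentially idf.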

About fieldWeight

In the search output we can see:

0.81767845 = (MATCH) fieldWeight(contents:一人 in 0), product of:

The "(MATCH) " string is produced by the getSummary method of the ComplexExplanation class:

protected String getSummary() {
    if (null == getMatch())
      return super.getSummary();

    return getValue() + " = "
      + (isMatch() ? "(MATCH) " : "(NON-MATCH) ")
      + getDescription();
}

The fieldWeight value is in fact equal to the Document's score. First, let us see how fieldWeight is computed, in the explain method of the org.apache.lucene.search.TermQuery.TermWeight class:

      ComplexExplanation fieldExpl = new ComplexExplanation();
      fieldExpl.setDescription("fieldWeight("+term+" in "+doc+
                               "), product of:");

      Explanation tfExpl = scorer(reader).explain(doc);
      fieldExpl.addDetail(tfExpl);
      fieldExpl.addDetail(idfExpl);

      result.addDetail(fieldExpl);
      result.setMatch(fieldExpl.getMatch());

      // combine them
      result.setValue(queryExpl.getValue() * fieldExpl.getValue());

if (queryExpl.getValue() == 1.0f)
return fieldExpl;

Several things are set on the ComplexExplanation fieldExpl above, and this is where the computation of fieldWeight can be found.

The key computation is:

fieldExpl.setValue(tfExpl.getValue() *
idfExpl.getValue() *
fieldNormExpl.getValue());

Written as a formula:

fieldWeight = tf * idf * fieldNorm

Because fieldNorm is written to the index files when the index is built, we simply take its value from the test output above and use it to verify the Document scores.

Using our example and the results computed earlier, the fieldWeight of each retrieved Document is:

Document 0: fieldWeight = 1.0 * 0.81767845 * 1.0 = 0.81767845;
Document 1: fieldWeight = 1.0 * 0.81767845 * 0.5 = 0.408839225;
Document 2: fieldWeight = 1.0 * 0.81767845 * 0.5 = 0.408839225;
Document 3: fieldWeight = 1.4142135 * 0.81767845 * 0.4375 = 0.5059127074089703125;
Document 4: fieldWeight = 1.4142135 * 0.81767845 * 0.4375 = 0.5059127074089703125;

Comparing these with the search output, each Document's score is exactly its fieldWeight (note that I have not rounded the values here).
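The same verification can be done in code. The sketch below is plain Java with no Lucene dependency: FieldWeightDemo is a hypothetical class name, the tf and idf formulas are the ones quoted from DefaultSimilarity, and the term frequencies, fieldNorm values and reported scores are copied from the test output.

```java
// Recomputes fieldWeight = tf * idf * fieldNorm for the five test Documents.
class FieldWeightDemo {
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    static float fieldWeight(int freq, int docFreq, int numDocs, float fieldNorm) {
        return tf(freq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // All three arrays are indexed by internal document id and taken from the test output.
        int[] freqs = {1, 1, 1, 2, 2};
        float[] fieldNorms = {1.0f, 0.5f, 0.5f, 0.4375f, 0.4375f};
        float[] reported = {0.81767845f, 0.40883923f, 0.40883923f, 0.5059127f, 0.5059127f};
        for (int doc = 0; doc < freqs.length; doc++) {
            float computed = fieldWeight(freqs[doc], 5, 5, fieldNorms[doc]);
            System.out.println("doc " + doc + ": computed=" + computed
                    + " reported=" + reported[doc]);
        }
    }
}
```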

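Summary

The scores above were computed under Lucene's default settings, for example a default boost of 1.0. The boost expresses how important a Document is, and it is reflected in the fieldWeight.

A boost can be set not only on a Document but also on a Field within a Document, because a Document's weight is expressed through its Fields; as computed above, the fieldWeight equals the Document's score.

Raising a Document's boost moves it up in the default ordering of the results, marking it as more important. In other words, changing the boost can change the ranking of search results.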
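A rough sketch of that effect follows. It rests on an assumption that does not come from the test above: that index-time boosts are folded into fieldNorm as docBoost * fieldBoost * lengthNorm. It also ignores the one-byte rounding of the stored norm. BoostDemo is a hypothetical class name, and the token count of 4 for Document 1 and the boost of 2.5 are illustrative values.

```java
// Illustrates how a boost could move a Document up the ranking; not Lucene code.
class BoostDemo {
    // Assumption: boosts are multiplied into the norm together with lengthNorm.
    static float fieldNorm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * (float) (1.0 / Math.sqrt(numTerms));
    }

    // fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq).
    static float fieldWeight(int freq, float idf, float fieldNorm) {
        return (float) Math.sqrt(freq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        float idf = 0.81767845f;   // from the test output
        float doc0 = fieldWeight(1, idf, fieldNorm(1.0f, 1.0f, 1));
        float doc1 = fieldWeight(1, idf, fieldNorm(1.0f, 1.0f, 4));
        float doc1Boosted = fieldWeight(1, idf, fieldNorm(2.5f, 1.0f, 4));
        System.out.println("doc 0, no boost:  " + doc0);
        System.out.println("doc 1, no boost:  " + doc1);
        System.out.println("doc 1, boost 2.5: " + doc1Boosted);
    }
}
```

Under these assumptions, the boosted Document 1 would outrank Document 0, which otherwise has the highest score.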

Posted 2011-04-15 11:02 by 西瓜. Category: Lucene
