
February 9, 2008

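Inverted index:

Forward index (a draft, and incomplete: the layout depends on the field info, because different fields store different content, and some of the fieldInfo flags (TOKENIZED, BINARY, COMPRESSED) are stored in the bits of each document's section of the .fdt file rather than in .fnm):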

posted @ 2008-02-27 18:14 鵬飛萬里 Views (1585) | Comments (0)

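Lucene and GCJ

Lucene added GCJ support back in version 1.9. You can compile Lucene with GCJ and use the new GCJIndexInput.java to read and write the file system by calling OS-level native methods directly, which I expect to improve I/O performance considerably.

The code is in the gcj directory of the Lucene source; build it with ant gcj.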

posted @ 2008-02-14 15:27 鵬飛萬里 Views (451) | Comments (0)

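Memo: the ranking algorithm in Lucene

The explanation is in the javadoc of Similarity.java:

For the algorithm itself, refer to the javadoc; it uses the Vector Space Model (VSM) of Information Retrieval.

For a query q, the score of a document d is given by: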

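$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)$$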

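where:

tf(t in d) is the term frequency: the number of times term t occurs in the current document d. The more often a query term occurs in a document, the higher that document scores. In the default implementation, DefaultSimilarity, the formula is: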

$$\mathrm{tf}(t \text{ in } d) = \mathrm{frequency}^{1/2}$$

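idf(t) is the inverse document frequency. It is an inverse function of docFreq (the number of documents in which term t appears), so terms that appear in fewer documents contribute more to the score. In the default implementation, DefaultSimilarity, the formula is: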

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$
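The tf and idf formulas above can be sketched in a few lines of Java. This is only an illustration of the arithmetic: the class name TfIdfSketch and its method signatures are made up for this note and are not Lucene's API, and the log is assumed to be the natural logarithm (Java's Math.log).

```java
// Illustrative only: TfIdfSketch is a hypothetical class, not part of Lucene.
class TfIdfSketch {
    // tf(t in d) = frequency^(1/2)
    static float tf(float frequency) {
        return (float) Math.sqrt(frequency);
    }

    // idf(t) = 1 + log(numDocs / (docFreq + 1)), assuming the natural log
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(tf(4));       // prints 2.0
        System.out.println(idf(9, 10));  // prints 1.0, because log(10/10) = 0
        // A rarer term (docFreq = 0) gets a larger idf than a common one.
        System.out.println(idf(0, 10) > idf(9, 10)); // prints true
    }
}
```

Note that idf enters the score formula squared, so a rare term's advantage is larger than the raw idf value suggests.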

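coord(q,d) is a score factor based on how many of the query's terms appear in the document: the number of query terms found in the document divided by the total number of terms in the query. Typically, a document that contains more of the query's terms receives a higher score. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.

queryNorm(q) is a normalizing factor used to make scores comparable across queries. It does not affect document ranking (every ranked document is multiplied by the same value); it only attempts to make scores from different queries (or even different indexes) comparable. This is a search time factor computed by the Similarity in effect at search time. In the default implementation, DefaultSimilarity, the formula is: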

$$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

Here sumOfSquaredWeights (of the query terms) is computed by the query Weight object. For example, a boolean query computes this value as:

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$
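The two search time factors, coord and queryNorm, can be worked through with small numbers. The sketch below only mirrors the formulas in this note. The class QueryFactorsSketch, its method signatures and the array-based inputs are assumptions made for illustration; in Lucene these values come from the Similarity and the query's Weight object.

```java
// Illustrative only: QueryFactorsSketch is a hypothetical class, not part of Lucene.
class QueryFactorsSketch {
    // coord(q,d) = query terms found in the document / total terms in the query
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // sumOfSquaredWeights = q.getBoost()^2 * sum over t in q of (idf(t) * t.getBoost())^2
    static float sumOfSquaredWeights(float queryBoost, float[] idfs, float[] termBoosts) {
        float sum = 0f;
        for (int i = 0; i < idfs.length; i++) {
            float weight = idfs[i] * termBoosts[i];
            sum += weight * weight;
        }
        return queryBoost * queryBoost * sum;
    }

    // queryNorm = 1 / sumOfSquaredWeights^(1/2)
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // A document matching 2 of the 4 query terms.
        System.out.println(coord(2, 4)); // prints 0.5

        // Two terms with idf 3 and 4, and all boosts equal to 1: 3^2 + 4^2 = 25.
        float sum = sumOfSquaredWeights(1f, new float[] {3f, 4f}, new float[] {1f, 1f});
        System.out.println(sum);            // prints 25.0
        System.out.println(queryNorm(sum)); // 1 / sqrt(25), about 0.2
    }
}
```

Because queryNorm depends only on the query, it scales every document's score by the same amount, which is why it leaves the ranking unchanged.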

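t.getBoost() is the search time boost of term t in query q, either specified in the query text (see query syntax) or set by the application calling setBoost(). Note that there is no direct API for accessing the boost of a single term in a multi-term query. Instead, multiple terms are represented in a query as multiple TermQuery objects, so the boost of a term in the query can be accessed by calling getBoost() on the sub-query.

norm(t,d) encapsulates a few (indexing time) boost and length factors: ??? Does this parameter depend only on the number of tokens in the field, and not on the term itself ???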

Document boost - set by calling doc.setBoost() before adding the document to the index.

Field boost - set by calling field.setBoost() before adding the field to a document.

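lengthNorm(field) - computed when the document is added to the index, according to the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing. The implementation in DefaultSimilarity is (float)(1.0 / Math.sqrt(numTerms));

When a document is added to the index, all of the above factors are multiplied together. If the document has multiple fields with the same name, all of their boosts are multiplied together: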

$$\mathrm{norm}(t,d) = doc.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}()$$

However, the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte is read from the index directory and decoded back to a float. This encoding/decoding loses precision: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Also notice that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
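The norm product described above, before it is squeezed into a single byte, can be sketched as follows. The class NormSketch and the norm(...) signature are hypothetical names used only here. Only the lengthNorm expression is taken from the DefaultSimilarity implementation quoted above, and the byte encoding step is deliberately left out.

```java
// Illustrative only: NormSketch is a hypothetical class, not part of Lucene.
class NormSketch {
    // lengthNorm(field) as quoted from DefaultSimilarity: 1 / sqrt(numTerms)
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // norm(t,d) = doc boost * lengthNorm(field) * product of the boosts of
    // every field instance in d that shares the field name.
    static float norm(float docBoost, int numTerms, float[] fieldBoosts) {
        float product = docBoost * lengthNorm(numTerms);
        for (float boost : fieldBoosts) {
            product *= boost;
        }
        return product;
    }

    public static void main(String[] args) {
        // A shorter field gets a larger length norm than a longer one.
        System.out.println(lengthNorm(4));   // prints 0.5
        System.out.println(lengthNorm(100)); // about 0.1

        // Document boost 2, a 4-token field, and two same-named fields
        // boosted 1.5 and 2: 2 * 0.5 * 1.5 * 2 = 3.
        System.out.println(norm(2f, 4, new float[] {1.5f, 2f})); // prints 3.0
    }
}
```

Since nothing in this product depends on the query term, the value is the same for every term of a given field in a given document.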

posted @ 2008-02-09 17:58 鵬飛萬里 Views (1863) | Comments (0)
