score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

tf(t in d) = frequency^(1/2)

idf(t) = 1 + log( numDocs / (docFreq + 1) )
coord(q,d) is a score factor based on how many of the query's terms are found in document d; it is computed by the Similarity in effect at search time.

queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / sumOfSquaredWeights^(1/2)

The sum of squared weights is computed by the query's Weight object. For example, a boolean query computes this value as:
sumOfSquaredWeights = q.getBoost()² · Σ_{t in q} ( idf(t) · t.getBoost() )²
t.getBoost() is the search-time boost of term t in query q, as set by setBoost(). Note that there is no direct API for reading the boost of an individual term inside a multi-term query; but since multiple terms are represented as multiple TermQuery objects within a query, the boost of a term in the query can be read with the sub-query's getBoost().

doc.setBoost() must be called before adding the document to the index, and field.setBoost() before adding the field to a document.

lengthNorm(field) is computed when the document is added to the index, according to the field's length in the document.

norm(t,d) = doc.getBoost() · lengthNorm(field) · ∏_{field f in d named as t} f.getBoost()
These norm values are then used with the Similarity at search time.
------------------------------------
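The scoring formula above can be checked numerically. Below is a minimal Python sketch (not Lucene code): the two-term query and its document statistics are hypothetical, and all boosts and norms are assumed to be 1.

```python
import math

# Toy statistics for a hypothetical two-term query against one document d.
num_docs = 1000
doc_freq = {"quick": 50, "fox": 10}   # how many documents contain each term
freq_in_d = {"quick": 3, "fox": 1}    # occurrences of each term in document d

def idf(t):
    # idf(t) = 1 + log( numDocs / (docFreq + 1) )
    return 1.0 + math.log(num_docs / (doc_freq[t] + 1))

def tf(t):
    # tf(t in d) = frequency^(1/2)
    return math.sqrt(freq_in_d[t])

# queryNorm(q) = 1 / sqrt(sumOfSquaredWeights), with all boosts = 1
sum_sq = sum(idf(t) ** 2 for t in doc_freq)
query_norm = 1.0 / math.sqrt(sum_sq)

# coord(q,d): fraction of query terms found in d; both terms match here.
coord = 2 / 2

score = coord * query_norm * sum(tf(t) * idf(t) ** 2 for t in doc_freq)
print(round(score, 4))
```

Changing a term's document frequency or its frequency in d and re-running shows how rarer and more frequent terms pull the score up, exactly as the formula predicts.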
Store.COMPRESS Store the original field value in the index in a compressed form. This is useful for long documents and for binary valued fields. (Compressed storage.)
Store.YES Store the original field value in the index. This is useful for short texts like a document's title which should be displayed with the results. The value is stored in its original form, i.e. no analyzer is used before it is stored. The index file normally holds only index data; this option stores the original content, such as a document's title, directly in the index file as well.
Store.NO Do not store the field value in the index. The original text is not kept in the index file; after a search hit, the original is reopened via other attributes such as a file path or a database primary key. Suitable when the original content is large.
These options determine the Field object's this.isStored and this.isCompressed flags.
------------------------------------
Index.NO Do not index the field value. The field can thus not be searched, but its contents can still be accessed provided it is Field.Store stored. Used for unsearchable extra attributes such as document type, URL, etc.
Index.TOKENIZED Index the field's value so it can be searched. An Analyzer will be used to tokenize and possibly further normalize the text before its terms are stored in the index. This is useful for ordinary text. (Tokenized indexing.)
Index.UN_TOKENIZED Index the field's value without using an Analyzer, so it can be searched. As no analyzer is used, the value is stored as a single term. This is useful for unique IDs such as product numbers, or for author names and dates: "Rod Johnson" is itself a single term and needs no further tokenization.
Index.NO_NORMS Index the field's value without an Analyzer, and disable the storing of norms. No norms means that index-time boosting and field length normalization are disabled. The benefit is less memory usage, as norms take up one byte per indexed field for every document in the index (the field's norm is not stored as usual but reduced to a single byte, saving space). Note that once you index a given field with norms enabled, disabling norms will have no effect. In other words, for NO_NORMS to have the described effect on a field, all instances of that field must be indexed with NO_NORMS from the beginning.
These options determine the Field object's this.isIndexed, this.isTokenized and this.omitNorms flags.
------------------------------------
New in Lucene 1.4.3:
TermVector.NO Do not store term vectors.
TermVector.YES Store the term vectors of each document. A term vector is a list of the document's terms and their number of occurrences in that document.
TermVector.WITH_POSITIONS Store the term vector plus token position information.
TermVector.WITH_OFFSETS Store the term vector plus token offset information.
TermVector.WITH_POSITIONS_OFFSETS Store the term vector plus token position and offset information.
These options determine the Field object's this.storeTermVector, this.storePositionWithTermVector and this.storeOffsetWithTermVector flags.
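The three option groups above each map onto internal boolean flags of the Field object. The following Python sketch models that mapping (it is my own illustration of the flag names listed above, not Lucene source code):

```python
# Hypothetical mapping from Field option constants to the internal
# boolean flags named above (Lucene 1.4.x-era flag names).
STORE = {
    "Store.YES":      {"isStored": True,  "isCompressed": False},
    "Store.COMPRESS": {"isStored": True,  "isCompressed": True},
    "Store.NO":       {"isStored": False, "isCompressed": False},
}

INDEX = {
    "Index.NO":           {"isIndexed": False, "isTokenized": False, "omitNorms": False},
    "Index.TOKENIZED":    {"isIndexed": True,  "isTokenized": True,  "omitNorms": False},
    "Index.UN_TOKENIZED": {"isIndexed": True,  "isTokenized": False, "omitNorms": False},
    "Index.NO_NORMS":     {"isIndexed": True,  "isTokenized": False, "omitNorms": True},
}

def make_field(store, index):
    """Combine one Store.* choice and one Index.* choice into a flag dict."""
    flags = {}
    flags.update(STORE[store])
    flags.update(INDEX[index])
    return flags

# A typical "title" field: stored verbatim for display, tokenized for search.
title = make_field("Store.YES", "Index.TOKENIZED")
print(title)
```

A "primary key" field would instead combine Store.YES with Index.UN_TOKENIZED, and a large document body might use Store.NO with Index.TOKENIZED.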
The following content is reproduced from elsewhere; see the individual links for the sources.
The four most common Analyzers, explained: http://windshowzbf.bokee.com/3016397.html
WhitespaceAnalyzer: only splits on whitespace; characters are not lowercased. No Chinese support.
SimpleAnalyzer: builds on WhitespaceAnalyzer; filters out everything except letters and lowercases all characters. No Chinese support.
StopAnalyzer: goes beyond SimpleAnalyzer by also removing stop words. A static array ENGLISH_STOP_WORDS holds the words too common to be worth indexing. No Chinese support.
StandardAnalyzer: a rigorous grammar defined as an EBNF with JavaCC. Its English handling matches StopAnalyzer. It supports Chinese via single-character segmentation (I have not compared carefully, so I am not certain).
Other extensions:
ChineseAnalyzer: from the Lucene sandbox. Performance is similar to StandardAnalyzer; its drawback is poor handling of mixed Chinese and English text.
CJKAnalyzer: written by chedong. Its English handling matches StandardAnalyzer, but for Chinese it uses bigram segmentation and cannot filter out punctuation.
TjuChineseAnalyzer: written by the author of http://windshowzbf.bokee.com/3016397.html, and the most powerful of the set. For Chinese segmentation it calls the Java interface of ICTCLAS, so its Chinese performance equals that of ICTCLAS. For English it uses Lucene's StopAnalyzer, so it removes stop words, ignores case, and filters out punctuation.
Example (the link also includes a brief code walkthrough):
http://www.langtech.org.cn/index.php/uid-5080-action-viewspace-itemid-68
Analyzing "The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
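The differing token streams above can be reproduced with toy re-implementations. This Python sketch is my own simplification of the behavior described above, not Lucene's actual code; STOP_WORDS is a small illustrative subset of ENGLISH_STOP_WORDS.

```python
import re

# Small subset of Lucene's ENGLISH_STOP_WORDS, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def whitespace_analyzer(text):
    # Split on whitespace only; no lowercasing, punctuation kept.
    return text.split()

def simple_analyzer(text):
    # Keep only runs of letters, lowercased; digits and punctuation split tokens.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def stop_analyzer(text):
    # SimpleAnalyzer plus stop-word removal.
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]

text = "The quick brown fox jumped over the lazy dogs"
print(whitespace_analyzer(text))
print(simple_analyzer(text))
print(stop_analyzer(text))
```

Running it on "XY&Z Corporation - xyz@example.com" likewise shows SimpleAnalyzer splitting on "&", "@" and "." while WhitespaceAnalyzer keeps them; StandardAnalyzer's grammar-based special cases (acronyms, e-mail addresses) are not modeled here.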
Reference links:
http://macrochen.blogdriver.com/macrochen/1167942.html
http://macrochen.blogdriver.com/macrochen/1153507.html
http://my.dmresearch.net/bbs/viewthread.php?tid=8318
http://windshowzbf.bokee.com/3016397.html
ISO-10646 term | Unicode term
UCS-2          | BMP, UTF-16
UCS-4          | UTF-32
Also:
Java 1.0 supports Unicode version 1.1.
Java 1.1 onwards supports Unicode version 2.0.
Character handling in J2SE 1.4 is based on the Unicode 3.0 standard.
J2SE 1.5 supports the Unicode 4.0 character set.
Whereas:
Unicode 3.0 (September 1999) covers the 16-bit Universal Character Set (UCS) Basic Multilingual Plane from ISO 10646-1.
Unicode 3.1 (March 2001) adds the Supplementary Planes defined in ISO 10646-2.
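The BMP/supplementary-plane distinction is easy to observe from code. A short Python sketch (the example characters are my own choice): a BMP character fits in one UTF-16 code unit, while a supplementary-plane character from Unicode 3.1 needs a surrogate pair.

```python
# A BMP character vs. a supplementary-plane character.
bmp_char = "\u4e2d"       # U+4E2D, inside the Basic Multilingual Plane
supp_char = "\U00010348"  # U+10348 Gothic hwair, in a supplementary plane

def utf16_units(ch):
    # Number of 16-bit UTF-16 code units needed to encode the character.
    return len(ch.encode("utf-16-be")) // 2

print(ord(bmp_char) <= 0xFFFF, utf16_units(bmp_char))
print(ord(supp_char) > 0xFFFF, utf16_units(supp_char))
```

This is why Java's 16-bit char (designed for Unicode 2.0-era UCS-2) can no longer hold every Unicode 4.0 character by itself.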
U-00000000 - U-0000007F: | 0xxxxxxx |
U-00000080 - U-000007FF: | 110xxxxx 10xxxxxx |
U-00000800 - U-0000FFFF: | 1110xxxx 10xxxxxx 10xxxxxx |
U-00010000 - U-001FFFFF: | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-00200000 - U-03FFFFFF: | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-04000000 - U-7FFFFFFF: | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
Code range (hex)                    | Scalar value (binary)      | UTF-8 (binary / hex)                        | Notes
000000-00007F (128 code points)     | 00000000 00000000 0zzzzzzz | 0zzzzzzz (00-7F)                            | ASCII-equivalent range; the byte starts with a zero bit
000080-0007FF (1920 code points)    | 00000000 00000yyy yyzzzzzz | 110yyyyy (C2-DF) 10zzzzzz (80-BF)           | first byte starts with 110, the following byte with 10
000800-00FFFF (63488 code points)   | 00000000 xxxxyyyy yyzzzzzz | 1110xxxx (E0-EF) 10yyyyyy 10zzzzzz          | first byte starts with 1110, following bytes with 10
010000-10FFFF (1048576 code points) | 000wwwxx xxxxyyyy yyzzzzzz | 11110www (F0-F4) 10xxxxxx 10yyyyyy 10zzzzzz | first byte starts with 11110, following bytes with 10
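The bit patterns in the table can be implemented directly. A minimal Python encoder sketch (my own, for illustration), checked against the built-in codec:

```python
def utf8_encode(cp):
    """Encode a Unicode scalar value per the bit patterns in the table above."""
    if cp <= 0x7F:                       # 0zzzzzzz
        return bytes([cp])
    if cp <= 0x7FF:                      # 110yyyyy 10zzzzzz
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:                     # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    if cp <= 0x10FFFF:                   # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return bytes([0xF0 | cp >> 18,
                      0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    raise ValueError("code point beyond U+10FFFF")

# One code point from each of the four length ranges.
for cp in (0x41, 0x7F1, 0x4E2D, 0x10348):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
print("all four lengths match the built-in codec")
```

(A real encoder would also reject the surrogate range D800-DFFF, which UTF-8 must not encode.)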
Most modern programming languages developed after about 1993 have a special data type for Unicode/ISO 10646-1 characters, called Wide_Character in Ada95 and char in Java.
ISO C also specifies mechanisms for handling multibyte encodings and wide characters, with more added when Amendment 1 to ISO C was published in September 1994. These mechanisms were designed mainly for the various East Asian encodings, and are considerably more robust than what handling UCS requires. UTF-8 is an example of what the ISO C standard calls a multibyte string encoding, and the wchar_t type can be used to hold Unicode characters.
The analysis section follows.
btw, a guess: in JavaCC, wrapping an element in [] allows it to appear 0 or 1 times.