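The code is QueryParser.jj; the syntax is an LL(k) grammar implemented with JavaCC.
Full documentation: http://lucene.apache.org/java/2_0_0/queryparsersyntax.html
As in regular expressions:
? means zero or one
+ means one or more
* means zero or more
The token section follows: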
_NUM_CHAR::=["0"-"9"] //digit
_ESCAPED_CHAR::= "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"", "{", "}", "~", "*", "?" ] //special characters, escaped with a backslash
_TERM_START_CHAR ::=( ~[ " ", "\t", "\n", "\r", "+", "-", "!", "(", ")", ":", "^","[", "]", "\"", "{", "}", "~", "*", "?" ] ) //first character of a TERM: any character except those listed
_TERM_CHAR::=( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" | "+" ) //characters allowed inside a TERM
_WHITESPACE::= ( " " | "\t" | "\n" | "\r") //space, tab and line breaks
<DEFAULT> TOKEN:
AND::=("AND" | "&&")
OR::=("OR" | "||")
NOT::=("NOT" | "!")
PLUS::="+"
MINUS::="-"
LPAREN::="("
RPAREN::=")"
COLON::=":"
STAR::="*"
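CARAT::="^" //switches to Boost. The original reads <CARAT: "^" > : Boost; the trailing ": Boost" is a JavaCC lexical-state switch, so the token after ^ is read in the Boost state
QUOTED::="\"" (~["\""] | "\\\"")+ "\"" //a string wrapped in double quotes: it starts with ", continues with any characters that are not " or with the two-character sequence \", and ends with "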
TERM::=<_TERM_START_CHAR> (<_TERM_CHAR>)*
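FUZZY_SLOP::="~" ( (<_NUM_CHAR>)+ ( "." (<_NUM_CHAR>)+ )? )? //starts with ~, optionally followed by a number. Lucene supports fuzzy queries, e.g. "roam~" or "roam~0.8"; the value is between 0 and 1, and the algorithm is the Levenshtein Distance (Edit Distance) algorithm
PREFIXTERM::=(<_TERM_START_CHAR> | "*") (<_TERM_CHAR>)* "*" //prefix search: matches terms that begin with the given text, written "something*". The first character may itself be the wildcard *, the characters in between are optional, and the token must end with *
WILDTERM::=(<_TERM_START_CHAR> | [ "*", "?" ]) (<_TERM_CHAR> | ( [ "*", "?" ] ))* //similar to the above, but also supports ?; the token may end with an ordinary character, * or ?. Inside [] the alternatives are separated by commas, so | is not needed
RANGEIN_START::="[" //in a RangeQuery, [ or { indicates whether the boundary values themselves are included; written "[begin TO end]" or "{begin TO end}". Switches to RangeIn
RANGEEX_START::="{" //same as above; switches to RangeEx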
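To make the DEFAULT-state word tokens above easier to try out, here is a rough sketch that transcribes _ESCAPED_CHAR, _TERM_START_CHAR, _TERM_CHAR, STAR, TERM, PREFIXTERM and WILDTERM into Python regular expressions. It is not Lucene's code: the names `char_class` and `classify` are made up for this illustration, and it only classifies one whole word at a time. When several rules match the same text, JavaCC picks the rule declared first, which is what the loop imitates.

```python
import re


def char_class(chars, negate=False):
    """Build a regex character class from a string of literal characters."""
    body = "".join(re.escape(c) for c in chars)
    return "[" + ("^" if negate else "") + body + "]"


# Character lists copied from the private tokens in these notes.
ESCAPABLE = '\\+-!():^[]"{}~*?'
NOT_TERM_START = ' \t\n\r+-!():^[]"{}~*?'

ESCAPED_CHAR = r"\\" + char_class(ESCAPABLE)
TERM_START_CHAR = char_class(NOT_TERM_START, negate=True)
TERM_CHAR = "(?:" + TERM_START_CHAR + "|" + ESCAPED_CHAR + "|[-+])"

# Same order as in the grammar; the earliest matching rule wins.
WORD_TOKENS = [
    ("STAR", r"\*"),
    ("TERM", TERM_START_CHAR + TERM_CHAR + "*"),
    ("PREFIXTERM", "(?:" + TERM_START_CHAR + r"|\*)" + TERM_CHAR + r"*\*"),
    ("WILDTERM", "(?:" + TERM_START_CHAR + r"|[*?])(?:" + TERM_CHAR + r"|[*?])*"),
]


def classify(word):
    """Return the name of the first rule that matches the whole word, or None."""
    for name, pattern in WORD_TOKENS:
        if re.fullmatch(pattern, word):
            return name
    return None


if __name__ == "__main__":
    for word in ["roam", "te*", "te?t", "*", r"a\*b", "-a"]:
        print(repr(word), "->", classify(word))
```

With these rules "roam" comes out as TERM, "te*" as PREFIXTERM, "te?t" as WILDTERM, and "-a" matches none of them, because in the real lexer the leading - is a separate MINUS token.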
<Boost> TOKEN:
NUMBER::=(<_NUM_CHAR>)+ ( "." (<_NUM_CHAR>)+ )? //an integer or a decimal; switches back to DEFAULT
<RangeIn> TOKEN:
RANGEIN_TO::="TO"
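RANGEIN_END::="]" //end of RangeIn; switches back to DEFAULT
RANGEIN_QUOTED::= "\"" (~["\""] | "\\\"")+ "\"" //same as QUOTED above: a string wrapped in double quotes
RANGEIN_GOOP::= (~[ " ", "]" ])+ //one or more characters that are neither a space nor ]; this is how the content inside [] is extracted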
<RangeEx> TOKEN :
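RANGEEX_TO::="TO"
RANGEEX_END::="}" //end of RangeEx; switches back to DEFAULT
RANGEEX_QUOTED::="\"" (~["\""] | "\\\"")+ "\"" //same as QUOTED above: a string wrapped in double quotes
RANGEEX_GOOP::=(~[ " ", "}" ])+ //one or more characters that are neither a space nor }; this is how the content inside {} is extracted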
<DEFAULT, RangeIn, RangeEx> SKIP : {
< <_WHITESPACE>>
} //all whitespace and line breaks are skipped
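The lexical states (DEFAULT, Boost, RangeIn, RangeEx) are easier to follow with a small simulation. The sketch below is an assumption-laden toy, not the generated Lucene token manager: `RULES`, `SKIP_STATES` and `tokenize` are hypothetical names, TERM is simplified to a run of non-special characters, and quoted strings, wildcards and the remaining DEFAULT tokens are left out. It does follow the two JavaCC matching rules that matter here: the longest match wins, and on a tie the rule declared first wins.

```python
import re

# Per-state rules: (token name, regex, state to switch to, or None to stay).
RULES = {
    "DEFAULT": [
        ("COLON", r":", None),
        ("CARAT", r"\^", "Boost"),
        ("TERM", r'[^ \t\n\r+\-!():^\[\]"{}~*?]+', None),  # simplified TERM
        ("RANGEIN_START", r"\[", "RangeIn"),
        ("RANGEEX_START", r"\{", "RangeEx"),
    ],
    "Boost": [
        ("NUMBER", r"[0-9]+(?:\.[0-9]+)?", "DEFAULT"),
    ],
    "RangeIn": [
        ("RANGEIN_TO", r"TO", None),
        ("RANGEIN_END", r"\]", "DEFAULT"),
        ("RANGEIN_GOOP", r"[^ \]]+", None),
    ],
    "RangeEx": [
        ("RANGEEX_TO", r"TO", None),
        ("RANGEEX_END", r"\}", "DEFAULT"),
        ("RANGEEX_GOOP", r"[^ }]+", None),
    ],
}

# Whitespace is skipped in these states only, as in the SKIP block.
SKIP_STATES = {"DEFAULT", "RangeIn", "RangeEx"}


def tokenize(text):
    """Split text into (token name, matched text) pairs, tracking the state."""
    state, pos, out = "DEFAULT", 0, []
    while pos < len(text):
        if state in SKIP_STATES and text[pos] in " \t\n\r":
            pos += 1
            continue
        best = None
        for name, pattern, next_state in RULES[state]:
            m = re.compile(pattern).match(text, pos)
            # Strict ">" keeps the earlier rule when two matches tie in length.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), next_state)
        if best is None:
            raise ValueError(f"no token matches at position {pos} in state {state}")
        name, end, next_state = best
        out.append((name, text[pos:end]))
        pos = end
        state = next_state or state
    return out


if __name__ == "__main__":
    for name, value in tokenize("date:[20020101 TO 20030101]^2"):
        print(name, repr(value))
```

In the example query, "TO" inside the brackets matches both RANGEIN_TO and RANGEIN_GOOP with the same length, so the earlier rule RANGEIN_TO is chosen, while "20020101" can only be RANGEIN_GOOP.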
The parser section follows:
Conjunction::=[ <AND> { ret = CONJ_AND; } | <OR> { ret = CONJ_OR; } ] //conjunction
Modifiers::=[ <PLUS> { ret = MOD_REQ; } | <MINUS> { ret = MOD_NOT; } | <NOT> { ret = MOD_NOT; } ] //the + - ! symbols
Query::=Modifiers Clause (Conjunction Modifiers Clause)*
Clause::=[(<TERM> <COLON>|<STAR> <COLON>)] //btw: LOOKAHEAD(2) in the code means two tokens of lookahead are used at this point, i.e. LL(2)
(Term|<LPAREN> Query <RPAREN> (<CARAT> <NUMBER>)?) //clause. The grammar here is a little odd: it appears to allow input such as *:(*:dog), which is strange
Term::=(
(<TERM>|<STAR>|<PREFIXTERM>|<WILDTERM>|<NUMBER>) [<FUZZY_SLOP>] [<CARAT><NUMBER>[<FUZZY_SLOP>]]
| ( <RANGEIN_START> (<RANGEIN_GOOP>|<RANGEIN_QUOTED>) [ <RANGEIN_TO> ] (<RANGEIN_GOOP>|<RANGEIN_QUOTED>) <RANGEIN_END> ) [ <CARAT> boost=<NUMBER> ] //this shows that a range must have both ends; it cannot have only one
| ( <RANGEEX_START> (<RANGEEX_GOOP>|<RANGEEX_QUOTED>) [ <RANGEEX_TO> ] (<RANGEEX_GOOP>|<RANGEEX_QUOTED>) <RANGEEX_END> ) [ <CARAT> boost=<NUMBER> ] //in a RangeQuery, [ or { indicates whether the boundary values themselves are included; written "[begin TO end]" or "{begin TO end}"
| <QUOTED> [ <FUZZY_SLOP> ] [ <CARAT> boost=<NUMBER> ] //content enclosed in ""
)
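To check the observation about *:(*:dog), the Query, Clause, Modifiers, Conjunction and Term rules can be turned into a small recursive-descent recognizer. This is only a sketch under assumptions: `Recognizer` and `accepts` are hypothetical names, the input is a list of token names rather than raw text, the two range alternatives of Term are omitted for brevity, and nothing is built (no Query objects), so it only answers whether a token sequence fits the rules as written in these notes.

```python
class Recognizer:
    """Recursive-descent recognizer over a list of token names."""

    def __init__(self, tokens):
        self.toks = list(tokens)
        self.i = 0

    def peek(self, k=0):
        j = self.i + k
        return self.toks[j] if j < len(self.toks) else None

    def take(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, got {self.peek()} at token {self.i}")
        self.i += 1

    def optional(self, kind):
        # [X] in the grammar: consume X if present, otherwise do nothing.
        if self.peek() == kind:
            self.i += 1
            return True
        return False

    def modifiers(self):
        if self.peek() in ("PLUS", "MINUS", "NOT"):
            self.i += 1

    def conjunction(self):
        if self.peek() in ("AND", "OR"):
            self.i += 1

    def query(self):
        self.modifiers()
        self.clause()
        # (Conjunction Modifiers Clause)* : repeat until input or group ends.
        while self.peek() not in (None, "RPAREN"):
            self.conjunction()
            self.modifiers()
            self.clause()

    def clause(self):
        # Two tokens of lookahead: a field prefix needs COLON as second token.
        if self.peek() in ("TERM", "STAR") and self.peek(1) == "COLON":
            self.i += 2
        if self.peek() == "LPAREN":
            self.take("LPAREN")
            self.query()
            self.take("RPAREN")
            if self.optional("CARAT"):
                self.take("NUMBER")
        else:
            self.term()

    def term(self):
        kind = self.peek()
        if kind in ("TERM", "STAR", "PREFIXTERM", "WILDTERM", "NUMBER"):
            self.i += 1
            self.optional("FUZZY_SLOP")
            if self.optional("CARAT"):
                self.take("NUMBER")
                self.optional("FUZZY_SLOP")
        elif kind == "QUOTED":
            self.i += 1
            self.optional("FUZZY_SLOP")
            if self.optional("CARAT"):
                self.take("NUMBER")
        else:
            raise SyntaxError(f"unexpected {kind} at token {self.i}")


def accepts(tokens):
    """True if the whole token sequence is a Query under the rules above."""
    r = Recognizer(tokens)
    try:
        r.query()
    except SyntaxError:
        return False
    return r.i == len(r.toks)


if __name__ == "__main__":
    # *:(*:dog) as token names
    print(accepts(["STAR", "COLON", "LPAREN", "STAR", "COLON", "TERM", "RPAREN"]))
```

Under these rules the token sequence for *:(*:dog) is accepted, which matches the reading that the field prefix may be applied both to a parenthesized group and to the clauses inside it.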
btw: in JavaCC, wrapping something in [] means it may occur zero or one time, the same as ( ... )?