A variable-length format for positive integers is
defined where the high-order bit of each byte indicates whether more
bytes remain to be read. The low-order seven bits are appended as
increasingly more significant bits in the resulting integer value.
Thus values from zero to 127 may be stored in a single byte, values
from 128 to 16,383 may be stored in two bytes, and so on.

In other words, in this variable-length integer format the highest bit of each byte says whether another byte still has to be read, and the low seven bits are the actual payload, appended to the result. For example, 00000001 has its high bit set to 0, so the number occupies a single byte; its payload is the remaining seven bits 0000001, i.e. the value 1. In 10000010 00000001, the first byte's high bit is 1, meaning more bytes follow, and the second byte's high bit is 0, meaning it is the last one, so the number occupies two bytes. When assembling the value, note that the seven payload bits of the last byte are the most significant and those of the first byte are the least significant, so this number reads 0000001 0000010, which is 130.

VInt encoding examples:
Value    First byte    Second byte    Third byte
--------------------------------------------------
0        00000000
1        00000001
2        00000010
...
127      01111111
128      10000000      00000001
129      10000001      00000001
130      10000010      00000001
...
16,383   11111111      01111111
16,384   10000000      10000000       00000001
16,385   10000001      10000000       00000001
...
In the Lucene source code, writing and reading this format are handled as follows; OutputStream is responsible for writing (see the writeVInt listing further below):
Distributed data processing with Hadoop, Part 1: Getting started
Below are some notes from the theory reading:
The storage is provided by HDFS, and analysis by MapReduce.
MapReduce is a good fit for problems
that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
An RDBMS is good for point queries or updates, where the dataset has been indexed
to deliver low-latency retrieval and update times of a relatively small amount of
data.
MapReduce suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are continually updated.
MapReduce tries to colocate the data with the compute node, so data access is fast
since it is local. This feature, known as data locality, is at the heart of MapReduce and
is the reason for its good performance.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined
map function for each record in the split.
On the other hand, if splits are too small, then the overhead of managing the splits and
of map task creation begins to dominate the total job execution time. For most jobs, a
good split size tends to be the size of an HDFS block, 64 MB by default.
Reduce tasks don't have the advantage of data locality: the input to a single reduce
task is normally the output from all mappers.
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output; the combiner function's
output forms the input to the reduce function.
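As a small illustration, wiring a combiner into a job with the old pre-0.20 "mapred" API could look like the sketch below. This is only a sketch; the MaxTemperature mapper/reducer class names are hypothetical stand-ins, not something defined in these notes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

JobConf conf = new JobConf(MaxTemperature.class);
conf.setJobName("max temperature with combiner");
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
conf.setMapperClass(MaxTemperatureMapper.class);
conf.setCombinerClass(MaxTemperatureReducer.class);  // combiner runs on each map's output
conf.setReducerClass(MaxTemperatureReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);

Because the combiner here reuses the reducer class, it only works when the reduce function is commutative and associative (as taking a maximum is).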
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk
can be made to be significantly larger than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk transfer
rate.
A quick calculation shows that if the seek time is around 10ms, and the transfer rate is
100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB. The default is actually 64 MB, although many HDFS installations
use 128 MB blocks. This figure will continue to be revised upward as transfer
speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally
operate on one block at a time, so if you have too few tasks (fewer than nodes in the
cluster), your jobs will run slower than they could otherwise.
The point is this: with a large block, the time spent seeking to the block is comparatively small and most of the time goes into the transfer itself; with a small block, the seek time becomes comparable to the transfer time, which effectively means the total cost is twice the transfer time, a poor trade. Concretely, with 100 MB of data and a 100 MB block, the transfer takes 1 s (at 100 MB/s); with 1 MB blocks the transfer still takes 1 s, but the seeks take 10 ms x 100 = 1 s, so the total time is 2 s. Is bigger therefore always better? No: if blocks are too large, a file may no longer be spread across the cluster, the MapReduce model cannot be exploited well, and things may actually get slower.
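A quick sketch (not from the original text) that reproduces this arithmetic, using the 10 ms seek time and 100 MB/s transfer rate quoted above:

double seekSeconds  = 0.010;   // 10 ms per seek
double transferMBps = 100.0;   // 100 MB/s
double fileMB       = 100.0;   // a 100 MB file

for (double blockMB : new double[] {1, 64, 100}) {
    double seeks = Math.ceil(fileMB / blockMB);                  // one seek per block
    double total = seeks * seekSeconds + fileMB / transferMBps;  // seeks plus transfer
    System.out.printf("block = %3.0f MB  seeks = %3.0f  total = %.2f s%n", blockMB, seeks, total);
}
// 1 MB blocks  -> 100 seeks -> about 2.00 s (seek time equals transfer time)
// 100 MB block ->   1 seek  -> about 1.01 s (seek time is ~1% of the total)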
/** Writes an int in a variable-length format.  Writes between one and
 * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
 * supported.
 * @see InputStream#readVInt()
 */
public final void writeVInt(int i) throws IOException {
  while ((i & ~0x7F) != 0) {
    writeByte((byte)((i & 0x7f) | 0x80));
    i >>>= 7;
  }
  writeByte((byte)i);
}
InputStream is responsible for reading:
/** Reads an int stored in variable-length format.  Reads between one and
 * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
 * supported.
 * @see OutputStream#writeVInt(int)
 */
public final int readVInt() throws IOException {
  byte b = readByte();
  int i = b & 0x7F;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = readByte();
    i |= (b & 0x7F) << shift;
  }
  return i;
}
>>> is the unsigned right-shift operator.
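As a standalone check (this is not Lucene code, just the same algorithm inlined), encoding 130 produces exactly the two bytes shown in the table above:

int value = 130;
java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
int i = value;
while ((i & ~0x7F) != 0) {
    out.write((i & 0x7F) | 0x80);   // low 7 bits with the continuation bit set
    i >>>= 7;
}
out.write(i);                        // last byte, high bit clear
for (byte b : out.toByteArray()) {
    System.out.println(String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
}
// prints: 10000010
//         00000001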
Step 1: Tomcat and the JDK are already installed.
Step 2: nutch-0.9.tar.gz
Extract the downloaded tar.gz package into the /opt directory and rename it:
# gunzip -c nutch-0.9.tar.gz | tar -xf -
# mv nutch-0.9 /opt/nutch
To test that the environment is set up, run /opt/nutch/bin/nutch and check whether a usage message listing the commands is printed; if so, everything is fine.
Crawling process: # cd /opt/nutch
#mkdir urls
# vi urls/nutch.txt    (enter www.aicent.net)
# vi conf/crawl-urlfilter.txt    Add the following; the regular expression filters which site URLs get crawled.
/**** accept hosts in MY.DOMAIN.NAME******/
+^http://([a-z0-9]*\.)*aicent.net/
# vi conf/nutch-site.xml    (give your crawler a name) and set it as follows:
<configuration>
<property>
<name>http.agent.name</name>
<value>test/unique</value>
</property>
</configuration>
Start crawling: # bin/nutch crawl urls -dir crawl -depth 5 -threads 10 >& crawl.log
Wait a while; the time depends on the size of the site and the configured crawl depth.
Step 3: apache-tomcat
If every search here returns 0 pages, a parameter needs to be changed, because the search path for Nutch inside Tomcat is wrong.
#vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
<property>
<name>searcher.dir</name>
<value>/opt/nutch/crawl</value>  <!-- the path where the crawl data is stored -->
<description>My path to nutch's searcher dir.</description>
</property>
#/opt/tomcat/bin/startup.sh
OK, done.
Summary of problems:
Run: sh ./bin/nutch crawl urls -dir crawl -depth 3 -threads 60 -topN 100 >& ./logs/nutch_log.log
1.Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Searching online, some posts said it was a JDK version problem and that JDK 1.6 cannot be used, so I installed 1.5. The same error remained, which was puzzling.
So I kept googling and found the following possible cause:
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Explanation: this is usually a configuration problem in crawl-urlfilter.txt; for example, the filter should be
+^http://www.ihooyo.com , but if it is configured as http://www.ihooyo.com instead, exactly the error above is raised.
In my case, however, the configuration was perfectly fine.
Under the logs directory, besides nutch_log.log, another log file is generated automatically: hadoop.log.
It contains this error:
2009-07-22 22:20:55,501 INFO crawl.Crawl - crawl started in: crawl
2009-07-22 22:20:55,501 INFO crawl.Crawl - rootUrlDir = urls
2009-07-22 22:20:55,502 INFO crawl.Crawl - threads = 60
2009-07-22 22:20:55,502 INFO crawl.Crawl - depth = 3
2009-07-22 22:20:55,502 INFO crawl.Crawl - topN = 100
2009-07-22 22:20:55,603 INFO crawl.Injector - Injector: starting
2009-07-22 22:20:55,604 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-07-22 22:20:55,604 INFO crawl.Injector - Injector: urlDir: urls
2009-07-22 22:20:55,605 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-07-22 22:20:56,574 INFO plugin.PluginRepository - Plugins: looking in: /opt/nutch/plugins
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - Registered Plugins:
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Site Query Filter (query-site)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - URL Query Filter (query-url)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Registered Extension-Points:
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-07-22 22:20:56,786 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2009-07-22 22:20:56,829 WARN mapred.LocalJobRunner - job_2319eh
java.lang.RuntimeException: java.net.UnknownHostException: jackliu: jackliu
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:617)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:591)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:364)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:390)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.startPartition(MapTask.java:294)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:355)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$100(MapTask.java:231)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:180)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Caused by: java.net.UnknownHostException: jackliu: jackliu
at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:614)
... 8 more
In other words, the host configuration is wrong, so:
Add the following to your /etc/hosts file
127.0.0.1 jackliu
Running it again after this change succeeded!
2. Open http://127.0.0.1:8080/nutch-0.9
Entering nutch as a search term produces an error:
HTTP Status 500 -
type Exception report
message
description The server encountered an internal error () that prevented it from fulfilling this request.
exception
org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value
org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)
org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)
org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)
org.apache.jasper.compiler.Parser.parseQuoted(Parser.java:299)
org.apache.jasper.compiler.Parser.parseAttributeValue(Parser.java:249)
org.apache.jasper.compiler.Parser.parseAttribute(Parser.java:211)
org.apache.jasper.compiler.Parser.parseAttributes(Parser.java:154)
org.apache.jasper.compiler.Parser.parseInclude(Parser.java:867)
org.apache.jasper.compiler.Parser.parseStandardAction(Parser.java:1134)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1461)
org.apache.jasper.compiler.Parser.parse(Parser.java:137)
org.apache.jasper.compiler.ParserController.doParse(ParserController.java:255)
org.apache.jasper.compiler.ParserController.parse(ParserController.java:103)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:170)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:332)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:312)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:299)
org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:586)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:317)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
note The full stack trace of the root cause is available in the Apache Tomcat/6.0.20 logs.
Analysis: looking at search.jsp under the Nutch web application root, it is a quote-matching problem.
<jsp:include page="<%= language + "/include/header.html"%>"/> //line 152 search.jsp
The first quote is paired with the next quote that appears, not with the last quote on the line, which is what causes the problem.
Solution:
Change that line to: <jsp:include page="<%= language + urlsuffix %>"/>
Here we define a string urlsuffix, placed right after the definition of the language string:
String language = // line 116 search.jsp
ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale())
.getLocale().getLanguage();
String urlsuffix="/include/header.html";
After the change, restart the Tomcat server to make sure it takes effect; searching no longer reports the error.
3. No search results:
The nutch_log.log output differed from what is described online, and the crawl directory contained only two folders, segments and crawldb. After crawling again, everything was fine;
oddly, I do not know why the first crawl failed.
4. cached.jsp, explain.jsp and others show the same error as item 2 above; after applying the same change they work.
5. Today it took a whole morning and half an afternoon to finally get Nutch installed and configured. Study continues tomorrow.
Order matters:
String[] phrase = new String[] {"fox", "quick"};
assertFalse("hop flop", matched(phrase, 2));
assertTrue("hop hop slop", matched(phrase, 3));
The principle (illustrated by a figure in the original notes) is as follows:
For the query keywords quick and fox, fox only needs to move one position to match "quick brown fox". For fox followed by quick, however,
fox has to move three positions. The larger the move distance, the lower the record's score and the less likely the document is to be returned.
SpanQuery uses position information to build more interesting queries:
SpanQuery type   Description
----------------------------------------------------------------------------
SpanTermQuery    Used in conjunction with the other span query types. On its
                 own, it's functionally equivalent to TermQuery.
SpanFirstQuery   Matches spans that occur within the first part of a field.
SpanNearQuery    Matches spans that occur near one another.
SpanNotQuery     Matches spans that don't overlap one another.
SpanOrQuery      Aggregates matches of span queries.
SpanFirstQuery: To query for spans that occur within the first n positions of a field, use SpanFirstQuery.
quick = new SpanTermQuery(new Term("f", "quick"));
brown = new SpanTermQuery(new Term("f", "brown"));
red = new SpanTermQuery(new Term("f", "red"));
fox = new SpanTermQuery(new Term("f", "fox"));
lazy = new SpanTermQuery(new Term("f", "lazy"));
sleepy = new SpanTermQuery(new Term("f", "sleepy"));
dog = new SpanTermQuery(new Term("f", "dog"));
cat = new SpanTermQuery(new Term("f", "cat"));
SpanFirstQuery sfq = new SpanFirstQuery(brown, 2);
assertNoMatches(sfq);
sfq = new SpanFirstQuery(brown, 3);
assertOnlyBrownFox(sfq);
SpanNearQuery: matches spans that lie near one another, within a configurable distance (slop).
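A small sketch of how it could be used with the SpanTermQuery objects declared above (assuming the same "quick brown fox ..." test field as in the other examples):

SpanQuery[] quickBrownFox = new SpanQuery[] { quick, brown, fox };
SpanNearQuery exact  = new SpanNearQuery(quickBrownFox, 0, true);  // terms must be adjacent and in order
SpanNearQuery sloppy = new SpanNearQuery(quickBrownFox, 4, true);  // up to 4 positions apart, still in order
Hits hits = searcher.search(sloppy);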
3. PhrasePrefixQuery is mainly used for synonym queries:
IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
Document doc1 = new Document();
doc1.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));
writer.addDocument(doc1);
Document doc2 = new Document();
doc2.add(Field.Text("field","the fast fox hopped over the hound"));
writer.addDocument(doc2);
PhrasePrefixQuery query = new PhrasePrefixQuery();
query.add(new Term[] {new Term("field", "quick"), new Term("field", "fast")});
query.add(new Term("field", "fox"));
Hits hits = searcher.search(query);
assertEquals("fast fox match", 1, hits.length());
query.setSlop(1);
hits = searcher.search(query);
assertEquals("both match", 2, hits.length());
/* Recursive binary search: returns a pointer to the element equal to val,
   or NULL if it is not present. array must be sorted in ascending order. */
int *binarySearch(int val, int array[], int n)
{
    if (n <= 0) return NULL;
    int m = n / 2;                          /* middle index */
    if (val == array[m]) return array + m;
    if (val < array[m])  return binarySearch(val, array, m);               /* left half  */
    else                 return binarySearch(val, array + m + 1, n - m - 1); /* right half */
}
For an array of n elements, binary search performs at most 1 + log2(n) comparisons. With a million elements that is about 20 comparisons, i.e. at most about 20 recursive calls to binarySearch().
3.Index dates
Document doc = new Document();
doc.add(Field.Keyword("indexDate", new Date()));
4.Tuning indexing performance
IndexWriter setting   System property                  Default value      Description
--------------------------------------------------------------------------------------------------
mergeFactor           org.apache.lucene.mergeFactor    10                 Controls segment merge frequency and size
maxMergeDocs          org.apache.lucene.maxMergeDocs   Integer.MAX_VALUE  Limits the number of documents per segment
minMergeDocs          org.apache.lucene.minMergeDocs   10                 Controls the amount of RAM used when indexing
mergeFactor controls how many documents are buffered in memory before being flushed to disk, and also how often index segments are merged. Its default is 10: once 10 documents are buffered they must be written to disk, and whenever the number of segments reaches a power of 10 they are merged into a single segment; maxMergeDocs, of course, limits how many documents a single segment may hold. The larger mergeFactor is, the more RAM is used and the higher the indexing throughput, but a large mergeFactor also means merges happen less often, which can leave a very large number of segments (because nothing gets merged); searching then has to open many more segment files, which lowers search performance. minMergeDocs is another IndexWriter instance variable that affects indexing performance. Its value controls how many Documents have to be buffered before they're merged to a segment. In other words, minMergeDocs shares with mergeFactor the role of controlling how many documents are buffered.
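A minimal sketch, assuming the Lucene 1.x API used elsewhere in these notes, in which these three settings are public fields on IndexWriter:

IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
writer.mergeFactor  = 50;      // buffer more documents and merge less often (uses more RAM)
writer.minMergeDocs = 100;     // documents held in RAM before a segment is written
writer.maxMergeDocs = 10000;   // upper bound on documents per segment
// ... writer.addDocument(...) calls ...
writer.optimize();             // merge segments down so searches open fewer files
writer.close();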
5. RAMDirectory helps exploit RAM; clustering or multiple threads can also be used to make full use of hardware and software resources and raise indexing throughput.
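One common pattern, sketched here under the same Lucene 1.x API assumption, is to index into a RAMDirectory first and then merge the in-memory index into the on-disk one with addIndexes:

RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new WhitespaceAnalyzer(), true);
// ... ramWriter.addDocument(...) calls, all in memory ...
ramWriter.close();

IndexWriter fsWriter = new IndexWriter(FSDirectory.getDirectory("/tmp/index", true),
                                       new WhitespaceAnalyzer(), true);
fsWriter.addIndexes(new Directory[] { ramDir });   // merge the RAM index into the disk index
fsWriter.close();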
6. Sometimes you want to bound the size of each field, for example indexing only the first 1,000 terms; maxFieldLength controls this.
7. IndexWriter's optimize() method merges segments, reducing the number of segments and therefore the time spent reading the index at search time.
8. Note for multithreaded environments: an index-modifying IndexReader operation can't be executed
while an index-modifying IndexWriter operation is in progress. To prevent misuse, Lucene locks the
index while certain APIs are in use.
1. TermQuery is the common case: it queries the index for a single Term (the smallest unit of indexing, consisting of a field name and a value). A Term corresponds directly to the key and field passed to QueryParser.parse.
IndexSearcher searcher = new IndexSearcher(directory);
Term t = new Term("isbn", "1930110995");
Query query = new TermQuery(t);
Hits hits = searcher.search(query);
2. RangeQuery is used for range queries; its third parameter says whether the range is open or closed. QueryParser builds a query over the N terms lying between begin and end.
Term begin, end;
Searcher searcher = new IndexSearcher(dbpath);
begin = new Term("pubmonth","199801");
end = new Term("pubmonth","199810");
RangeQuery query = new RangeQuery(begin, end, true);
RangeQuery is essentially a comparison of term order, so the following query is also legal, though it means something rather different from the one above. In short, the comparison defines an interval and everything inside it can be found;
the comparison rule is lexicographic, e.g. strings are compared starting from the first character, independent of string length.
begin = new Term("pubmonth","19");
end = new Term("pubmonth","20");
RangeQuery query = new RangeQuery(begin, end, true);
3. PrefixQuery. With TermQuery, only an exact match (on a field built with Field.Keyword) is found,
which limits the flexibility of queries. PrefixQuery only needs to match a leading portion of the value. For example, if the field is name and the records
contain jackliu, jackwu and jackli, then the prefix jack finds all of them. QueryParser creates a PrefixQuery
for a term when it ends with an asterisk (*) in query expressions.
IndexSearcher searcher = new IndexSearcher(directory);
Term term = new Term("category", "/technology/computers/programming");
PrefixQuery query = new PrefixQuery(term);
Hits hits = searcher.search(query);
4. BooleanQuery. All of the queries above work on a single field; to query several fields at once, BooleanQuery
solves the problem of combining multiple queries. Sub-queries are added with add(Query query, boolean required, boolean prohibited),
and by nesting BooleanQuerys you can compose very complex queries.
IndexSearcher searcher = new IndexSearcher(directory);
TermQuery searchingBooks =
new TermQuery(new Term("subject","search"));
RangeQuery currentBooks =
new RangeQuery(new Term("pubmonth","200401"),
new Term("pubmonth","200412"),true);
BooleanQuery currentSearchingBooks = new BooleanQuery();
currentSearchingBooks.add(searchingBooks, true, false);
currentSearchingBooks.add(currentBooks, true, false);
Hits hits = searcher.search(currentSearchingBooks);
The add method of BooleanQuery takes two boolean parameters:
true, false: the clause being added must be satisfied;
false, true: the clause being added must not be satisfied;
false, false: the clause being added is optional;
true, true: an invalid combination.
QueryParser handily constructs BooleanQuerys when multiple terms are specified.
Grouping is done with parentheses, and the prohibited and required flags are
set when the –, +, AND, OR, and NOT operators are specified.
5. PhraseQuery performs a more precise search: it constrains the positions of two or more keywords in the indexed
text, for example finding documents that contain A and B with exactly one word between them. Terms surrounded by double quotes in
QueryParser parsed expressions are translated into a PhraseQuery.
The slop factor defaults to zero, but you can adjust the slop factor
by adding a tilde (~) followed by an integer.
For example, the expression "quick fox"~3
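For comparison, a sketch of the equivalent query built directly: quick and fox, up to three position moves apart, matching the "quick fox"~3 expression above.

PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("field", "quick"));
phrase.add(new Term("field", "fox"));
phrase.setSlop(3);                       // allow up to 3 position moves between the two terms
Hits hits = searcher.search(phrase);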
6. WildcardQuery. WildcardQuery offers finer control and more flexibility than PrefixQuery, and it is the easiest
to understand and use.
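A short sketch (not from the original notes): ? matches exactly one character and * matches zero or more, so the pattern below matches wild, mild, wildcard, and so on.

Query wildcard = new WildcardQuery(new Term("contents", "?ild*"));
Hits hits = searcher.search(wildcard);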
7. FuzzyQuery. This query is special: it also matches records whose terms merely look similar to the keyword. QueryParser
supports FuzzyQuery by suffixing a term with a tilde (~), for example wuzza~.
public void testFuzzy() throws Exception {
indexSingleFieldDocs(new Field[] {
Field.Text("contents", "fuzzy"),
Field.Text("contents", "wuzzy")
});
IndexSearcher searcher = new IndexSearcher(directory);
Query query = new FuzzyQuery(new Term("contents", "wuzza"));
Hits hits = searcher.search(query);
assertEquals("both close enough", 2, hits.length());
assertTrue("wuzzy closer than fuzzy",
hits.score(0) != hits.score(1));
assertEquals("wuzza bear","wuzzy", hits.doc(0).get("contents"));
}
Method                                       Tokenized  Indexed  Stored  Typical use
--------------------------------------------------------------------------------------------------
Field.Text(String name, String value)        Yes        Yes      Yes     Tokenized, indexed, and stored; e.g. title or body fields
Field.Text(String name, Reader value)        Yes        Yes      No      Tokenized and indexed but not stored; e.g. META content that must be searchable but is never returned for display
Field.Keyword(String name, String value)     No         Yes      Yes     Indexed as a single term and stored; e.g. date fields
Field.UnIndexed(String name, String value)   No         No       Yes     Not indexed, only stored; e.g. file paths
Field.UnStored(String name, String value)    Yes        Yes      No      Tokenized and indexed only, not stored
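A small sketch (field names invented for illustration) showing the factory methods from the table on a single Document:

Document doc = new Document();
doc.add(Field.Text("title", "Lucene in Action"));                  // tokenized, indexed, stored
doc.add(Field.Keyword("pubmonth", "200406"));                      // single term, indexed, stored
doc.add(Field.UnIndexed("path", "/data/books/lia.pdf"));           // stored only, not searchable
doc.add(Field.UnStored("contents", "full text of the book ..."));  // searchable, not stored
writer.addDocument(doc);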