
April 27, 2007

Nutch 0.9 Notes

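      I have been following the progress of Lucene and Nutch for a while. Both projects have been moving very quickly lately: Lucene has reached 2.1 and Nutch has reached 0.9, with many welcome improvements.
      I gave Nutch 0.9 a quick try today. My notes follow: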

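1. Unpack the Nutch package. In the Nutch root directory, create a directory named urls, and inside it create one or more text files containing URLs, such as urlt.txt, with one URL per line. For example:
http://m.tkk7.com
http://www.javaeye.com/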

2. Edit crawl-urlfilter.txt in the conf directory. The relevant fragment is:
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://m.tkk7.com/
+^http://www.javaeye.com/
+^http://lucene.apache.org/

3. Edit nutch-site.xml in the conf directory so that it reads:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
  <name>http.agent.name</name>
  <value>Nutch</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>Nutch,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch Search Engineer</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/bot.html</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.apache.org</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

</configuration>

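Note: nutch-0.9.jar already contains a nutch-site.xml, because the files under conf are all copied to the classpath root. When running in a web environment, the nutch-site.xml on the classpath is loaded first. When running as a standalone application, pack the nutch-site.xml shown above into nutch-0.9.jar; otherwise some of the properties above will be empty and Nutch will not run.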

4. Running Nutch on Windows is simple: all you need is to be able to execute the Crawl class. Write an Ant script, put it in the Nutch root directory, and run it. Its contents are as follows:

</path>
<echo>crawling starting</echo>
</java>
<echo>crawling finished</echo>
</target>
</project>
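
For readers who want a complete file to start from, here is a minimal sketch of a build file of the kind step 4 describes. It is not the original script: the project name, target name, path id, heap size, and the -depth and -topN values are all assumptions chosen for illustration. It also assumes the standard Nutch 0.9 layout (nutch-0.9.jar, lib, conf, and plugins in the root directory) and that the Crawl class is org.apache.nutch.crawl.Crawl. Putting conf ahead of the jar on the classpath is intended to let the edited nutch-site.xml be found first; if that does not work in your setup, pack the file into the jar as the note in step 3 suggests.

```xml
<?xml version="1.0"?>
<!-- Hypothetical build.xml; place it in the Nutch root directory. -->
<project name="nutch-crawl" default="crawl" basedir=".">

  <!-- Classpath for the crawl. All ids and locations here are assumptions. -->
  <path id="nutch.classpath">
    <!-- conf comes first so its nutch-site.xml precedes the copy in the jar -->
    <pathelement location="${basedir}/conf"/>
    <!-- the root directory itself, so the plugins folder can be located -->
    <pathelement location="${basedir}"/>
    <fileset dir="${basedir}" includes="nutch-0.9.jar"/>
    <fileset dir="${basedir}/lib" includes="*.jar"/>
  </path>

  <target name="crawl">
    <echo>crawling starting</echo>
    <!-- fork a separate JVM so that jvmarg takes effect -->
    <java classname="org.apache.nutch.crawl.Crawl"
          classpathref="nutch.classpath"
          fork="true"
          failonerror="true">
      <jvmarg value="-Xmx512m"/>
      <!-- seed directory created in step 1 -->
      <arg value="urls"/>
      <!-- output directory; step 4 expects results under crawl -->
      <arg value="-dir"/>
      <arg value="crawl"/>
      <!-- illustrative limits, not values from the original post -->
      <arg value="-depth"/>
      <arg value="3"/>
      <arg value="-topN"/>
      <arg value="50"/>
    </java>
    <echo>crawling finished</echo>
  </target>

</project>
```

With this file saved as build.xml, running ant from the Nutch root directory would invoke the crawl target. Keeping the depth and topN values small makes a first test run finish quickly.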

At this point, barring surprises, Nutch should be running happily. When it finishes, you will find what you are looking for in the crawl directory. Enjoy it!

posted @ 2007-04-27 11:09 by 小魚 (blog: 小魚的空氣)
