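I have been following the progress of Lucene and Nutch for a while. Both projects have been moving very quickly lately: Lucene has reached 2.1 and Nutch has reached 0.9, with many improvements, which is good to see.
Today I gave Nutch-0.9 a quick try. My notes follow: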
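1. Unpack the Nutch package. In the Nutch root directory create a directory named urls, and in it create one or more text files containing URLs, for example urlt.txt, with one URL per line. Sample contents:
http://m.tkk7.com
http://www.javaeye.com/
2. Edit crawl-urlfilter.txt in the conf directory. The relevant fragment: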
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://m.tkk7.com/
+^http://www.javaeye.com/
+^http://lucene.apache.org/
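To make the effect of these rules concrete, here is a minimal sketch in plain Java of how such a rule list can be evaluated. It assumes the usual reading of crawl-urlfilter.txt: lines starting with # are comments, + accepts and - rejects, rules are tried top to bottom, the first rule whose regex is found in the URL decides, and a URL that matches no rule is rejected. The class name UrlFilterSketch and its methods are hypothetical and are not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Nutch's own filter class.
public class UrlFilterSketch {

    /** One rule: '+' accepts and '-' rejects URLs in which the regex is found. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule lines, skipping blank lines and '#' comments. */
    public UrlFilterSketch(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** The first matching rule decides; a URL matching no rule is rejected. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(
                "# accept hosts in MY.DOMAIN.NAME",
                "+^http://m.tkk7.com/",
                "+^http://www.javaeye.com/",
                "+^http://lucene.apache.org/");
        System.out.println(filter.accepts("http://www.javaeye.com/topic/1")); // true
        System.out.println(filter.accepts("http://example.com/"));            // false
    }
}
```

The sketch does no URL normalization, so a bare http://m.tkk7.com without the trailing slash would not match the first rule here. Also note that the dots in these rules are unescaped, so each one matches any character; writing \. is stricter.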
3. Edit nutch-site.xml in the conf directory so that it reads as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>http.agent.name</name>
<value>Nutch</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.robots.agents</name>
<value>Nutch,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>

<property>
<name>http.agent.description</name>
<value>Nutch Search Engineer</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value>http://lucene.apache.org/nutch/bot.html</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value>nutch-agent@lucene.apache.org</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>

</configuration>

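Note: nutch-0.9.jar already contains a nutch-site.xml, and the files in the conf directory all get copied to the classpath root. When Nutch runs in a web environment, the nutch-site.xml on the classpath is loaded first. When it runs as a standalone application, the nutch-site.xml shown above should be packed into nutch-0.9.jar; otherwise some of the properties above will be empty and Nutch will not run.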
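Since the wrong nutch-site.xml being picked up is easy to miss, a quick check can help. The sketch below uses only the JDK: it loads whichever nutch-site.xml the class loader finds first and prints the value of http.agent.name. The class name SiteConfigCheck and the method propertyValue are hypothetical, and the sketch does not reproduce Nutch's own configuration loading; it only shows which copy of the file wins on the classpath it is run with.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper; not part of Nutch.
public class SiteConfigCheck {

    /** Returns the trimmed <value> of the named <property>, or null if absent. */
    public static String propertyValue(InputStream xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml);
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not read configuration XML", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // getResourceAsStream returns the first nutch-site.xml in classpath order.
        try (InputStream in = SiteConfigCheck.class.getClassLoader()
                .getResourceAsStream("nutch-site.xml")) {
            if (in == null) {
                System.out.println("nutch-site.xml not found on the classpath");
                return;
            }
            String agent = propertyValue(in, "http.agent.name");
            if (agent == null || agent.isEmpty()) {
                System.out.println("http.agent.name is missing or empty");
            } else {
                System.out.println("http.agent.name = " + agent);
            }
        }
    }
}
```

Run it with the same classpath as the crawl (the Nutch jar, the lib jars, and the conf directory) so that it sees the same files the crawler would.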
4. Running Nutch on Windows is simple: all you need is a way to execute the Crawl class. Write an Ant script, put it in the Nutch root directory, and run it. Its contents:
<project name="nutch-crawl" default="crawl" basedir=".">
<property name="lib.dir" location="lib"/>
<property name="conf.dir" location="conf"/>

<path id="project.classpath">
<fileset dir="." includes="nutch-*.jar"/>
<fileset dir="lib" />
<pathelement path="."/>
<pathelement path="${conf.dir}"/>
</path>
<target name="crawl" >
<echo>Crawl starting</echo>
<property name="JVM.extra.args" value="-Xmx512m" />
<java classname="org.apache.nutch.crawl.Crawl" classpathref="project.classpath" fork="true">
<jvmarg line="${JVM.extra.args}"/>
<arg value="C:/dev-tools/nutch-0.9/urls"/>
<arg value="-dir"/>
<arg value="C:/dev-tools/nutch-0.9/crawl"/>
<arg value="-depth"/>
<arg value="3"/>
<arg value="-threads"/>
<arg value="15"/>
</java>
<echo>Crawl finished</echo>
</target>
</project>
At this point, barring surprises, Nutch should be happily running. When it finishes, you will find what you are after in the crawl directory. Enjoy it!