2. Edit crawl-urlfilter.txt in the conf directory. The relevant fragment is as follows:
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://m.tkk7.com/
+^http://www.javaeye.com/
+^http://lucene.apache.org/
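To see how these filter lines behave, here is a minimal Python sketch of the rule format. It is not Nutch's own code. It assumes that a leading + accepts and a leading - rejects, that the first matching rule decides, and that a URL matching no rule is rejected. The names RULES, parse_rules and accepts are made up for this illustration.

```python
import re

# Rules copied from the crawl-urlfilter.txt fragment in step 2.
RULES = [
    "# accept hosts in MY.DOMAIN.NAME",
    "+^http://m.tkk7.com/",
    "+^http://www.javaeye.com/",
    "+^http://lucene.apache.org/",
]


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    parsed = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        parsed.append((sign == "+", re.compile(pattern)))
    return parsed


def accepts(url, rules):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://lucene.apache.org/nutch/", "http://example.com/"):
        print(url, accepts(url, rules))
```

Note that the dots in the three host patterns are not escaped, so each one matches any single character. Writing them as \. (as the commented-out MY.DOMAIN.NAME template does for its subdomain part) would make the rules stricter.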
3. Edit nutch-site.xml in the conf directory so that its content is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>Nutch</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>Nutch,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch Search Engineer</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/bot.html</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.apache.org</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

</configuration>
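
The property descriptions above state two constraints that are easy to get wrong when editing by hand: http.agent.name must not be empty, and http.robots.agents should list that name first and keep * last. The following Python sketch checks both. It is only an illustration and not part of Nutch; load_properties, check_agent_settings and the shortened SAMPLE document are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for conf/nutch-site.xml (descriptions omitted).
SAMPLE = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>Nutch,*</value>
  </property>
</configuration>"""


def load_properties(xml_text):
    """Return a {name: value} dict from a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        props[name] = value
    return props


def check_agent_settings(props):
    """Return a list of problems; an empty list means both constraints hold."""
    problems = []
    agent = props.get("http.agent.name", "")
    if not agent:
        problems.append("http.agent.name must not be empty")
    robots = [a.strip() for a in props.get("http.robots.agents", "").split(",")]
    if agent and robots[0] != agent:
        problems.append("http.robots.agents should start with " + agent)
    if robots[-1] != "*":
        problems.append("http.robots.agents should end with *")
    return problems


if __name__ == "__main__":
    print(check_agent_settings(load_properties(SAMPLE)))
```

To check a real file, read conf/nutch-site.xml as bytes and pass the result to load_properties in place of SAMPLE.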