
HTML page parsing in Nutch

Posted on 2010-04-23 17:38 by 泰仔在線. Views: 3076. Comments: 1. Category: Cloud Computing

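Today I mainly looked into how Nutch parses HTML pages. My task is to extract specific text from pages, so the first step was to find out how Nutch pulls the text out of the HTML. Nutch provides two HTML parsers, NekoHTML and TagSoup; I used the Neko parser. After reading the code, I found that the text extraction lives in the DOMContentUtils class in the org.apache.nutch.parse.html package, and that the main function is getTextHelper. It is explained below.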

// Appends the text under `node` to `sb`; returns true if the walk was
// aborted because a nested anchor (<a> inside <a>) was found.
private boolean getTextHelper(StringBuffer sb, Node node,
                              boolean abortOnNestedAnchors,
                              int anchorDepth) {
    boolean abort = false;
    NodeWalker walker = new NodeWalker(node); // NodeWalker traverses the DOM tree without recursion

    while (walker.hasNext()) { // while there are nodes left

      Node currentNode = walker.nextNode();          // next node in document order
      String nodeName = currentNode.getNodeName();   // node name
      short nodeType = currentNode.getNodeType();    // node type

      if ("script".equalsIgnoreCase(nodeName)) { // skip scripts
        walker.skipChildren();
      }
      if ("style".equalsIgnoreCase(nodeName)) { // skip style sheets
        walker.skipChildren();
      }
      if (abortOnNestedAnchors && "a".equalsIgnoreCase(nodeName)) { // detect nested anchors
        anchorDepth++;
        if (anchorDepth > 1) {
          abort = true;
          break;
        }
      }
      if (nodeType == Node.COMMENT_NODE) { // skip comments
        walker.skipChildren();
      }
      if (nodeType == Node.TEXT_NODE) {
        // cleanup and trim the value
        String text = currentNode.getNodeValue(); // text content of the node
        text = text.replaceAll("\\s+", " ");      // collapse runs of spaces, newlines, etc. into one space
        text = text.trim();
        if (text.length() > 0) {
          if (sb.length() > 0) sb.append(' ');
          sb.append(text);
        }
      }
    }

    return abort;
}

This function is called from the HtmlParser class. If you want to write your own text extraction function, you can modify it accordingly.
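As a rough starting point for a custom extraction function, here is a minimal standalone sketch of the same idea that uses only the JDK's DOM API. It is not Nutch code: the class name TextExtractSketch and the methods extractText and parse are made up for illustration, an explicit stack stands in for Nutch's NodeWalker, the nested-anchor check is left out, and the input is assumed to be well-formed XHTML because the JDK's XML parser replaces NekoHTML here.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Nutch.
public class TextExtractSketch {

  /** Collects the text under root, skipping script, style and comment nodes. */
  public static String extractText(Node root) {
    StringBuilder sb = new StringBuilder();
    Deque<Node> stack = new ArrayDeque<>(); // explicit stack instead of recursion
    stack.push(root);
    while (!stack.isEmpty()) {
      Node current = stack.pop();
      String name = current.getNodeName();
      short type = current.getNodeType();
      if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)
          || type == Node.COMMENT_NODE) {
        continue; // children are never pushed, so the whole subtree is skipped
      }
      if (type == Node.TEXT_NODE) {
        // collapse whitespace runs to one space, then trim
        String text = current.getNodeValue().replaceAll("\\s+", " ").trim();
        if (!text.isEmpty()) {
          if (sb.length() > 0) sb.append(' ');
          sb.append(text);
        }
      }
      // push children in reverse so they are popped in document order
      NodeList children = current.getChildNodes();
      for (int i = children.getLength() - 1; i >= 0; i--) {
        stack.push(children.item(i));
      }
    }
    return sb.toString();
  }

  /** Parses a well-formed XHTML string into a DOM tree (assumption: no tag soup). */
  public static Node parse(String xhtml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><head><style>p{color:red}</style></head><body>"
        + "<p>Hello   <b>Nutch</b></p><!-- note --><script>var x=1;</script>"
        + "<p>parser</p></body></html>";
    System.out.println(extractText(parse(page))); // prints "Hello Nutch parser"
  }
}
```

To extract only specific text, the place to add a filter would be the TEXT_NODE branch, for example by checking the parent element's name or attributes before appending.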

Reposted from: Internship Diary (6)

Feedback

# re: HTML page parsing in Nutch

2013-03-19 16:53 by gongshijun

How do I modify it? Nutch 1.6 doesn't have any of the things you mention; I can't find them.

