亚洲高清最新av网站,亚洲首页在线观看,国产AV无码专区亚洲A∨毛片

htmlparser解析一些網(wǎng)頁(yè)時(shí),繁體中文會(huì)變成亂碼

最近發(fā)現(xiàn)用htmlparser解析一些網(wǎng)頁(yè)時(shí),繁體中文會(huì)變成亂碼.分析了下原因,發(fā)現(xiàn)在用stringbean的時(shí)候htmlparser會(huì)自己根據(jù)meta來(lái)決定用哪種內(nèi)碼來(lái)解碼,而有的網(wǎng)站在meta中是用gb2312來(lái)做charset,實(shí)際應(yīng)用的時(shí)候又用到了gbk.gb2312是不能表示繁體的,所以就出現(xiàn)了亂碼.解決的辦法很簡(jiǎn)單,gbk是兼容gb2312的,所以在htmlparser的page.java的getcharser()那里加一句判斷,如果ret是gb2312就設(shè)置為gbk,這樣問(wèn)題就解決了.?

修改的page.java的代碼如下(/lexer/page.java)

??? public String getCharset (String content)
??? {
??????? final String CHARSET_STRING = "charset";
??????? int index;
??????? String ret;

??????? if (null == mSource)
??????????? ret = DEFAULT_CHARSET;
??????? else
??????????? // use existing (possibly supplied) character set:
??????????? // bug #1322686 when illegal charset specified
??????????? ret = mSource.getEncoding ();
??????? if (null != content)
??????? {
??????????? index = content.indexOf (CHARSET_STRING);

??????????? if (index != -1)
??????????? {
??????????????? content = content.substring (index +
??????????????????? CHARSET_STRING.length ()).trim ();
??????????????? if (content.startsWith ("="))
??????????????? {
??????????????????? content = content.substring (1).trim ();
??????????????????? index = content.indexOf (";");
??????????????????? if (index != -1)
??????????????????????? content = content.substring (0, index);

??????????????????? //remove any double quotes from around charset string
??????????????????? if (content.startsWith ("\"") && content.endsWith ("\"")
??????????????????????? && (1 < content.length ()))
??????????????????????? content = content.substring (1, content.length () - 1);

??????????????????? //remove any single quote from around charset string
??????????????????? if (content.startsWith ("'") && content.endsWith ("'")
??????????????????????? && (1 < content.length ()))
??????????????????????? content = content.substring (1, content.length () - 1);

??????????????????? ret = findCharset (content, ret);

??????????????????? // Charset names are not case-sensitive;
??????????????????? // that is, case is always ignored when comparing
??????????????????? // charset names.
//??????????????????? if (!ret.equalsIgnoreCase (content))
//??????????????????? {
//??????????????????????? System.out.println (
//??????????????????????????? "detected charset \""
//??????????????????????????? + content
//??????????????????????????? + "\", using \""
//??????????????????????????? + ret
//??????????????????????????? + "\"");
//??????????????????? }
??????????????? }
??????????? }
??????? }
??????? if(ret.equalsIgnoreCase("gb2312"))ret="GBK"; //to avoid decode problem
??????????????????????????????????????????????????????????????????????????????????????? ? ?//edited by linyunfan
??????? return (ret);
??? }

在最后加入了這句

??????? if(ret.equalsIgnoreCase("gb2312"))ret="GBK";

大盤(pán)預(yù)測(cè) 國(guó)富論

posted on 2008-10-09 13:33 華夢(mèng)行閱讀(1778) 評(píng)論(3) 編輯收藏

常用鏈接

留言簿(2)

隨筆分類(lèi)(91)

隨筆檔案(293)

友情鏈接

最新隨筆

搜索

積分與排名

最新評(píng)論

閱讀排行榜

評(píng)論排行榜


只有注冊(cè)用戶(hù)登錄后才能發(fā)表評(píng)論。




網(wǎng)站導(dǎo)航: 博客園 IT新聞 Chat2DB C++博客博問(wèn) 管理