While using ICTCLAS.dll to segment web page content, I ran into the following exception:
An unexpected exception has been detected in native code outside the VM.
Unexpected Signal : EXCEPTION_ACCESS_VIOLATION (0xc0000005) occurred at PC=0x18473252
Function=Ordinal5+0x3252
Library=D:\workspace3\Lucene_191\ICTCLAS.dll
Current Java thread:
at com.xjt.nlp.word.ICTCLAS.paragraphProcess(Native Method)
- locked <0x1003db78> (a com.xjt.nlp.word.ICTCLAS)
at org.apache.lucene.analysis.cn.TjuChineseAnalyzer.tokenStream(TjuChineseAnalyzer.java:59)
at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:162)
at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:93)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:450)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:436)
at ch14.performance.index.IndexPerformanceTest.addDocument(IndexPerformanceTest.java:192)
at ch14.performance.index.IndexPerformanceTest.indexFiles(IndexPerformanceTest.java:214)
at ch14.performance.index.IndexPerformanceTest.indexFiles(IndexPerformanceTest.java:207)
at ch14.performance.index.IndexPerformanceTest.indexFiles(IndexPerformanceTest.java:207)
at ch14.performance.index.IndexPerformanceTest.toIndex(IndexPerformanceTest.java:141)
at ch14.performance.index.IndexPerformanceTest.toIndex(IndexPerformanceTest.java:127)
at ch14.performance.index.IndexPerformanceTest.main(IndexPerformanceTest.java:228)
Dynamic libraries:
0x00400000 - 0x00407000 C:\j2sdk1.4.2_03\bin\javaw.exe
0x77F80000 - 0x77FFC000 C:\WINNT\system32\ntdll.dll
0x796D0000 - 0x79735000 C:\WINNT\system32\ADVAPI32.dll
0x77E60000 - 0x77F32000 C:\WINNT\system32\KERNEL32.dll
0x786F0000 - 0x78768000 C:\WINNT\system32\RPCRT4.dll
0x77DF0000 - 0x77E59000 C:\WINNT\system32\USER32.dll
0x77F40000 - 0x77F7C000 C:\WINNT\system32\GDI32.dll
0x78000000 - 0x78045000 C:\WINNT\system32\MSVCRT.dll
0x75E00000 - 0x75E1A000 C:\WINNT\system32\IMM32.DLL
0x6C330000 - 0x6C338000 C:\WINNT\system32\LPK.DLL
0x65D20000 - 0x65D74000 C:\WINNT\system32\USP10.dll
0x08000000 - 0x08138000 C:\j2sdk1.4.2_03\jre\bin\client\jvm.dll
0x77530000 - 0x77560000 C:\WINNT\system32\WINMM.dll
0x6BD00000 - 0x6BD0D000 C:\WINNT\system32\SYNCOR11.DLL
0x10000000 - 0x10007000 C:\j2sdk1.4.2_03\jre\bin\hpi.dll
0x007F0000 - 0x007FE000 C:\j2sdk1.4.2_03\jre\bin\verify.dll
0x00800000 - 0x00819000 C:\j2sdk1.4.2_03\jre\bin\java.dll
0x00820000 - 0x0082D000 C:\j2sdk1.4.2_03\jre\bin\zip.dll
0x18470000 - 0x1853F000 D:\workspace3\Lucene_191\ICTCLAS.dll
0x777C0000 - 0x777DE000 C:\WINNT\system32\WINSPOOL.DRV
0x79B20000 - 0x79B31000 C:\WINNT\system32\MPR.DLL
0x71710000 - 0x71794000 C:\WINNT\system32\COMCTL32.dll
0x77900000 - 0x77923000 C:\WINNT\system32\imagehlp.dll
0x72960000 - 0x7298D000 C:\WINNT\system32\DBGHELP.dll
0x687E0000 - 0x687EB000 C:\WINNT\system32\PSAPI.DLL
Heap at VM Abort:
Heap
def new generation total 576K, used 429K [0x10010000, 0x100b0000, 0x104f0000)
eden space 512K, 74% used [0x10010000, 0x1006f0f8, 0x10090000)
from space 64K, 77% used [0x100a0000, 0x100ac5e0, 0x100b0000)
to space 64K, 0% used [0x10090000, 0x10090000, 0x100a0000)
tenured generation total 1408K, used 128K [0x104f0000, 0x10650000, 0x14010000)
the space 1408K, 9% used [0x104f0000, 0x105103e8, 0x10510400, 0x10650000)
compacting perm gen total 4096K, used 1782K [0x14010000, 0x14410000, 0x18010000)
the space 4096K, 43% used [0x14010000, 0x141cd860, 0x141cda00, 0x14410000)
Local Time = Wed May 10 19:19:16 2006
Elapsed Time = 8
#
# The exception above was detected in native code outside the VM
#
# Java VM: Java HotSpot(TM) Client VM (1.4.2_03-b02 mixed mode)
#
# An error report file has been saved as hs_err_pid2120.log.
# Please refer to the file for further information.
#
I don't know whether the author or anyone else here has run into this. Other people online have reported the same problem.
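Good question.
This is a bug in the Chinese Academy of Sciences segmentation tool (ICTCLAS).
For example, segmenting either of the following inputs with this tool is guaranteed to crash:
"5/"
"6/"
Any input like this, a digit followed by a "/", triggers the problem. Because the access violation is raised inside the DLL, the JVM cannot catch it as a Java exception, so the JVM itself is forcibly terminated.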
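Since a try/catch around the native call cannot help here, one option is to check the input before it ever reaches the DLL. The following is only a minimal sketch: the class name, the method names, and the exact pattern (an ASCII or full-width digit immediately followed by an ASCII or full-width slash) are my assumptions based on the two examples above, not a confirmed description of what the DLL chokes on. The segmenter is passed in as a plain function so the sketch does not depend on the real ICTCLAS wrapper.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

class NativeCallGuard {
    // Assumed trigger: a digit (ASCII or full-width, U+FF10..U+FF19)
    // directly followed by a slash (ASCII or full-width, U+FF0F).
    private static final Pattern DIGIT_SLASH =
            Pattern.compile("[0-9\uFF10-\uFF19][/\uFF0F]");

    /** Returns true if the text contains the suspected crash pattern. */
    static boolean looksRisky(String text) {
        return text != null && DIGIT_SLASH.matcher(text).find();
    }

    /**
     * Calls the segmenter only when the input passes the check.
     * Risky input is returned unsegmented instead of being sent to native code.
     */
    static String segmentOrSkip(String text, UnaryOperator<String> segmenter) {
        if (looksRisky(text)) {
            return text; // skip the native call entirely
        }
        return segmenter.apply(text);
    }
}
```

In an analyzer this would wrap the call to paragraphProcess, so a document containing "5/" is indexed without segmentation rather than taking down the whole JVM.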
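My suggestion is to preprocess each sentence before segmenting it, for example:
1) Convert full-width characters to half-width
2) Remove redundant whitespace
3) Keep each sentence passed to the segmenter reasonably short
4) Convert special symbols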
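The four steps above could be sketched roughly as follows. Everything here is illustrative: the class and method names are made up, the 200-character limit is an arbitrary placeholder (the advice only says "not too long"), and replacing "/" with a space is just one guess at what "converting special symbols" might mean for this particular bug. Note that this replacement also changes dates and fractions such as "2006/05/10".

```java
import java.util.ArrayList;
import java.util.List;

class SegmenterInput {
    // Hypothetical limit; tune it for the segmenter actually in use.
    static final int MAX_CHUNK = 200;

    /** Step 1: map full-width forms (U+FF01..U+FF5E) and the ideographic space to ASCII. */
    static String toHalfWidth(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u3000') {
                sb.append(' ');
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Step 4 (assumed mapping): replace the slash that triggers the crash with a space. */
    static String replaceSpecialSymbols(String s) {
        return s.replace('/', ' ');
    }

    /** Step 2: collapse runs of whitespace into one space and trim the ends. */
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    /** Step 3: cut at sentence-ending punctuation, or at maxLen characters if none is found. */
    static List<String> splitIntoChunks(String s, int maxLen) {
        List<String> chunks = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            cur.append(c);
            // U+3002 is the ideographic full stop
            boolean sentenceEnd = "\u3002!?;".indexOf(c) >= 0;
            if (sentenceEnd || cur.length() >= maxLen) {
                chunks.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            chunks.add(cur.toString());
        }
        return chunks;
    }

    /** Runs all steps; symbols are replaced before whitespace is collapsed. */
    static List<String> prepare(String raw) {
        String s = toHalfWidth(raw);
        s = replaceSpecialSymbols(s);
        s = collapseWhitespace(s);
        return splitIntoChunks(s, MAX_CHUNK);
    }
}
```

Each string returned by prepare would then be passed to the segmenter separately. The hard cut at maxLen is deliberately crude and can split a word in the middle, so a real implementation would probably look for a better break point.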
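That said, this bug appears to have been fixed in newer versions of the segmentation tool.
A junior colleague of mine has written a Java segmenter modeled on the Chinese Academy of Sciences tool. He also converted the dictionary, so it can use plain TXT files as its dictionary and accepts new words. I hope to get the chance to share it with everyone :) He tells me he plans to put it on SourceForge, so everyone will be able to download it then :)
If possible, I'd also like to include it in my next Lucene book. Of course, I'll have to ask him first, haha.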