#
Oracle10g的DBA_TAB_STATISTICS視圖的STATTYPE_LOCKED列沒有正確的顯示結果。
看一個簡單的例子:
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bi
PL/SQL Release 10.2.0.5.0 - Production
CORE 10.2.0.5.0 Production
TNS for Linux: Version 10.2.0.5.0 - Production
NLSRTL Version 10.2.0.5.0 - Production
SQL> create table t_part (id number, name varchar2(30))
2 partition by range (id)
3 (partition p1 values less than (10),
4 partition p2 values less than (20),
5 partition pmax values less than (maxvalue));
Table created.
SQL> select table_name, partition_name, stattype_locked from tb all_tab_statistics where wner = user and table_name = 'T_PART';
TABLE_NAME PARTITION_NAME STATT
------------------------------ ------------------------------ -----
T_PART
T_PART P1
T_PART P2
T_PART PMAX
SQL> exec dbms_stats.lock_partition_stats(user, 'T_PART', 'P1')
PL/SQL procedure successfully completed.
SQL> select table_name, partition_name, stattype_locked from all_tab_statistics where wner = user and table_name = 'T_PART';
TABLE_NAME PARTITION_NAME STATT
------------------------------ ------------------------------ -----
T_PART
T_PART P1
T_PART P2
T_PART PMAX
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART')
PL/SQL procedure successfully completed.
SQL> select table_name, partition_name, last_analyzed, stattype_locked from all_tab_statistics where wner = user and table_name = 'T_PART';
TABLE_NAME PARTITION_NAME LAST_ANAL STATT
------------------------------ ------------------------------ --------- -----
T_PART 16-JUL-12
T_PART P1
T_PART P2 16-JUL-12
T_PART PMAX 16-JUL-12
可以看到在10.2環境中,LOCK_PARTITION_STATS過程是正常工作的,但是DBA_TAB_STATISTICS視圖的STATTYPE_LOCKED列并沒有正確的顯示分區被鎖定的結果。
而對于表來說,LOCK_TABLE_STATS過程執行后,STATTYPE_LOCKED的結果顯示是正常的:
SQL> exec dbms_stats.lock_table_stats(user, 'T_PART')
PL/SQL procedure successfully completed.
SQL> select table_name, partition_name, last_analyzed, stattype_locked from all_tab_statistics where wner = user and table_name = 'T_PART';
TABLE_NAME PARTITION_NAME LAST_ANAL STATT
------------------------------ ------------------------------ --------- -----
T_PART 16-JUL-12 ALL
T_PART P1 ALL
T_PART P2 16-JUL-12 ALL
T_PART PMAX 16-JUL-12 ALL
這說明在10.2中,Oracle對于分區列的鎖定的支持是存在問題的。查詢了一下MOS,Oracle將這個問題確認為內部BUG:7240460,這個問題在11.1.0.7中被FIXED。
而在11.2中,這個問題以及不存在了:
SQL> select * from v$version;
BANNER
----------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Solaris: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
SQL> select owner, table_name, partition_name, stattype_locked
2 from dba_tab_statistics
3 where wner = 'TEST'
4 and table_name = 'T_PART';
OWNER TABLE_NAME PARTITION_NAME STATT
---------- ------------ --------------- -----
TEST T_PART
TEST T_PART P2
TEST T_PART P3
TEST T_PART P4
TEST T_PART P5
TEST T_PART PMAX
6 rows selected.
SQL> exec dbms_stats.lock_partition_stats('TEST', 'T_PART', 'P2')
PL/SQL procedure successfully completed.
SQL> select owner, table_name, partition_name, stattype_locked
2 from dba_tab_statistics
3 where wner = 'TEST'
4 and table_name = 'T_PART';
OWNER TABLE_NAME PARTITION_NAME STATT
---------- ------------ --------------- -----
TEST T_PART
TEST T_PART P2 ALL
TEST T_PART P3
TEST T_PART P4
TEST T_PART P5
TEST T_PART PMAX
6 rows selected
客戶數據庫出現ORA-4031錯誤,隨后出現了大量的ORA-1003和ORA-604錯誤。
數據庫版本為10.2.0.3 RAC for HP-UX,詳細的報錯信息為:
Mon Jul 16 15:30:30 2012
Errors in file /u01/app/oracle/admin/ORCL/udump/orcl2_ora_2389.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select name,online$,contents...","sql area","tmp")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select name,online$,contents...","sql area","tmp")
Mon Jul 16 15:30:32 2012
Errors in file /u01/app/oracle/admin/ORCL/udump/orcl2_ora_2878.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 144 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","coldef: qcopCreateCol")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 144 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","coldef: qcopCreateCol")
Mon Jul 16 15:30:32 2012
Errors in file /u01/app/oracle/admin/ORCL/udump/orcl2_ora_10030.trc:
ORA-00603: ORACLE server session terminated by fatal error tb
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 144 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","coldef: qcopCreateCol")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 144 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","coldef: qcopCreateCol")
Mon Jul 16 15:30:39 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select o.owner#,o.name,o.nam...","sql area","kobjn : kkdcchs")
Mon Jul 16 15:30:40 2012
Trace dumping is performing id=[cdmp_20120716153040]
Mon Jul 16 15:32:19 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:33:59 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:35:39 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:37:19 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:39:00 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:40:40 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQ
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: err
Mon Jul 16 15:41:59 2012
Thread 2 advanced to log sequence 61522
Current log# 7 seq# 61522 mem# 0: +ORCL_CTL/orcl/onlinelog/group_7.263.611598065
Mon Jul 16 15:42:20 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:44:00 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:45:40 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQ
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: err
可以看得,開始還是比較“正常”的ORA-4031錯誤,但是隨著CJQ進程出現了ORA-4031的錯誤,數據庫開始每1分40秒報一次ORA-1003和ORA-604的錯誤。這里CJQ進程似乎陷入了一個死循環中,持續不停的報錯。查詢MOS發現,在9.2中Oracle存在類似的BUG:Bug 3289063 ORA-1003 every 5 seconds after CJQ hits ORA-4031。這個bug已經在9.2.0.5和10.1.0.2中被FIXED,但是這個bug的現象和當前的錯誤完全一致,從這點上看,這個問題顯然在10.2.0.3中仍然存在。
Oracle給出的解決方案也很簡單,重啟CJQ進程既可。除了在操作系統去kill -9殺掉cjq0對應的進程之外,直接通過alter system set job_queue_processes=0,然后在設置回原值,也有可能解決該問題。此外,盡量避免ORA-4031錯誤的產生是最根本的問題解決之道。
可以利用tar命令向磁盤備份文件,同時可以利用tar從磁帶中恢復指定文件。
備份的命令為:
tar cvf /dev/rmt/0m filename…
恢復的命令為:
tar xvf /dev/rmt/0m filename…
從磁帶上恢復文件的例子:
> tar xvf /dev/rmt/0m /archive/orcl_exp_report_1.dmp /archive/orcl_exp_report_2.dmp
x /archive/temp_exp/acdb_exp_report_1.dmp, 3198353408 bytes, 6246784 tape blocks
x /archive/temp_exp/acdb_exp_report_2.dmp, 3078950912 bytes, 6013576 tape blocks
tb恢復時指定的文件名必須是備份到磁帶的文件名。
當用戶密碼即將過期時,在登錄時Oracle會提示ORA-28002錯誤,但是并不會影響正常的登錄。
本來認為這個信息并沒有太大的影響,但是沒想到這個錯誤會導致SET AUTOTRACE功能失效:
solaris*orcl-/home/oracle$ sqlplus test/test
SQL*Plus: Release 11.2.0.3.0 Production on Fri Jul 13 11:27:28 2012
Copyright (c) 1982, 2011, Oracle. tb All rights reserved.
ERROR:
ORA-28002: the password will expire within 1 days
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security and Real Application Testing options
SQL> set autot on
ERROR:
ORA-28002: the password will expire within 1 days
SP2-0619: Error while connecting
SP2-0611: Error enabling STATISTICS report
SQL> alter user test identified by test;
User altered.
SQL> set autot on
ERROR:
ORA-24315: illegal attribute type
SP2-0619: Error while connecting
SP2-0611: Error enabling STATISTICS report
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security and Real Application Testing options
solaris*orcl-/home/oracle$ sqlplus test/test
SQL*Plus: Release 11.2.0.3.0 Production on Fri Jul 13 11:27:52 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security and Real Application Testing options
SQL> set autot on
SQL> insert into t values (1, 'a');
1 row created.
Execution Plan
----------------------------------------------------------
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 1 | 12 | tb 1 (0)| 00:00:01 |
| 1 | LOAD TABLE CONVENTIONAL | T | | | | |
---------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
71 recursive calls
9 db block gets
75 consistent gets
10 physical reads
1284 redo size
829 bytes sent via SQL*Net to client
785 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
9 sorts (memory)
0 sorts (disk)
1 rows processed
很顯然,由于ORA-28002錯誤導致了SET AUTOTRACE ON功能啟用時碰到了錯誤。當修改了當前的用戶密碼,則ORA-28002不再出現后,SET AUTOTRACE ON的功能恢復正常。
根據上面的信息其實可以判斷,在啟用SET AUTOTRACE ON功能時,sqlplus會自動創建一個新的會話來記錄當前會話的統計信息。
而啟用的新的會話會使用當前會話登錄時保存的密碼來進行登錄,因此可以通過下面的例子來驗證這個推論:
solaris*orcl-/home/oracle$ sqlplus test/test
SQL*Plus: Release 11.2.0.3.0 Production on Sun Jul 15 01:28:38 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security and Real Application Testing options
SQL> set autot on
SQL> set autot off
SQL> alter user test identified by test1;
User altered.
SQL> set autot on
ERROR:
ORA-01017: invalid username/password; logon denied
SP2-0619: Error while connecting
SP2-0611: Error enabling STATISTICS report
果然,在用戶登錄后,如果當前的密碼被修改,是會導致SET AUTOTRACE ON啟用時報錯不正確的用戶名密碼錯誤的。
一個客戶碰到的具體需求,分區表中有些分區所在的表空間被OFFLINE,導致在刪除統計信息時報錯。
下面通過例子來說明這個問題:
SQL> create table t_part_read (id number)
2 partition by range (id)
3 (partition p1 values less than (10) tablespace ts1,
4 partition p2 values less than (20) tablespace ts2,
5 partition pmax values less than (maxvalue) tablespace users);
Table created.
SQL> insert into t_part_read select rownum from tab;
54 rows created.
SQL> commit;
Commit complete.
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ')
PL/SQL procedure successfully completed.
SQL> alter tablespace ts1 read only;
Tablespace altered.
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ')
PL/SQL procedure successfully completed.
SQL> alter tablespace ts1 offline;
Tablespace altered.
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ')
BEGIN dbms_stats.gather_table_stats(user, 'T_PART_READ'); END;
*
ERROR at line 1:
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: '/u01/app/oracle/oradata/ORCL/datafile/tbo1_mf_ts1_7w8l5fz1_.dbf'
ORA-06512: at "SYS.DBMS_STATS", line 23829
ORA-06512: at "SYS.DBMS_STATS", line 23880
ORA-06512: at line 1
如果將表空間只讀,并不會影響到表空間上的表或分區的統計信息的收集,因為收集過程只是讀取,而收集的結果信息是寫到SYSTEM表空間的。
但是如果分區所在的表空間處于OFFLINE狀態,那么在統計信息收集的過程中就會報錯。
有一個很簡單的方法可以解決這個問題,就是將被OFFLINE影響的分區的統計信息鎖定,這樣Oracle在收集統計信息時就會跳過鎖定的分區,通過這個辦法就可以避免統計信息收集過程中的報錯:
SQL> exec dbms_stats.lock_partition_stats(user, 'T_PART_READ', 'P1')
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ')
BEGIN dbms_stats.gather_table_stats(user, 'T_PART_READ'); END;
*
ERROR at line 1:
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: '/u01/app/oracle/oradata/ORCL/datafile/o1_mf_ts1_7w8l5fz1_.dbf'
ORA-06512: at "SYS.DBMS_STATS", line 23829
ORA-06512: at "SYS.DBMS_STATS", line 23880
ORA-06512: at line 1
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ', granularity => 'PARTITION')
PL/SQL procedure successfully completed.
即使鎖定分區后,嘗試收集統計信息仍然報錯,這是因為Oracle默認除了要收集分區上的統計信息以外,還要收集表級的統計信息,而這就會造成被OFFLINE影響的分區也要被讀取。
解決方法就是在收集統計信息的時候指定收集的粒度是分區,不收集表上的GLOBAL信息。
第一次碰到Exadata上的bug。
數據庫環境Exadata V2-2,數據庫版本為11.2.0.2,錯誤信息為:
Wed Apr 25 11:32:35 2012
Errors in file /u01/app/oracle/diag/rdbms/ods/orcl2/trace/orcl2_ora_9495.trc (incident=304808):
ORA-00600: internal error code, arguments: [kffmXpGet], [145], [69784], [], [], [], [], [], [], [], [], []
ORA-03135: connection lost contact
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_304808/orcl2_ora_9495_i304808.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
opidcl aborting process tb unknown ospid (9495) as a result of ORA-600
Wed Apr 25 11:32:36 2012
Dumping diagnostic data in directory=[cdmp_20120425113236], requested by (instance=2, sid=9495), summary=[incident=304808].
Wed Apr 25 11:32:36 2012
Sweep [inc][304808]: completed
Sweep [inc2][304808]: completed
Wed Apr 25 11:32:39 2012
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_pmon_10797.trc (incident=302224):
ORA-00600: internal error code, arguments: [kffmXpGet], [181], [87276], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_302224/orcl2_pmon_10797_i302224.trc
Dumping diagnostic data in directory=[cdmp_20120425113240], requested by (instance=2, sid=10797 (PMON)), summary=[incident=302224].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_pmon_10797.trc:
ORA-00600: internal error code, arguments: [kffmXpGet], [181], [87276], [], [], [], [], [], [], [], [], []
PMON (ospid: 10797): terminating the instance due to error 472
Wed Apr 25 11:32:41 2012
opiodr aborting process unknown ospid (11248) as a result of ORA-1092
Wed Apr 25 11:32:41 2012
opiodr aborting process unknown ospid (1967) as a result of ORA-1092
Wed Apr 25 11:32:42 2012
ORA-1092 : opitsk aborting process
Wed Apr 25 11:32:43 2012
License high water mark = 573
Instance terminated by PMON, pid = 10797
USER (ospid: 8755): terminating the instance
Instance terminated by USER, pid = 8755
由于ORA-600[kffmXpGet]錯誤的出現,最終出現了ORA-1092的錯誤,并致使opitsk進程退出,導致數據庫實例崩潰。
Oracle在MOS文檔Bug 12387467 instance crash by ORA-600 [kffmxpget]中描述了這個問題,確認影響的版本為11.2.0.1、11.2.0.2、11.2.0.3,Oracle在11.2.0.2的Bundle Patch 16 for Exadata以及11.2.0.3 Bundle Patch 5 for Exadata中fixed了這個問題,Oracle計劃在11.2.0.4和12.1中徹底Fixed該問題。
這個錯誤只會發生在Exadata中,因為導致錯誤的原因和Smart Scan有關。當會話執行Smart Scan操作時使用了Ctrl + C中止該操作,會導致PMON進程出現這個600錯誤。Oracle建議在執行數據文件的SHRINK操作前,中止并退出所有執行Exadata Smart Scan的會話。
Eygle的新書經歷了4個多月的等待,終于面世了。
我應該是這本書的第一個讀者,4個月前Eygle剛剛完成初稿的時候,我就完整的看過一遍了。
我平??磿容^慢,不過那次卻看得很快。一方面是由于大部分案例都比較熟悉;另一方面得益于Eygle的文筆,把故障原因、分析過程、解決思路和處理過程描述得非常清晰,給人一種一氣呵成的感覺。導致我這個幫忙審稿的,多次都陷入到具體的內容中了,雖然對于我來說看得tb很爽,但是對于審稿而言并不是一件好事。審稿應該始終站在一個中立的角度,而如果在審稿的過程中過于關注內容,就會忽略掉一些細節的問題。好在Eygle對于自己文章的嚴謹程度很高,因此通篇看完也沒有發現多少不妥之處,估計也不會遺漏太多的問題。
書中的所有內容都來自真實的案例,而且其中有三個個重要的案例都是來自2011年12月30日到2011年12月31日這兩天。在2012年元旦馬上要來臨之前,Eygle接連幫助三個客戶進行了數據庫的恢復,這件事刺激了Eygle,于是元旦回來,Eygle就開始構思并執筆他的新作。一個多月的時間,這本《數據安全警示錄》就基本上完成了。
上次看得是電子版,拿到實體書后感覺這次的印刷質量還是很不錯的,等有空的話還要再把這本書再看一遍。
客戶的數據庫出現了ORA-600(kfnsBackground03)錯誤。
數據庫版本為10.2.0.3 RAC for HP-UX 11.23。這個錯誤在ASM實例和數據庫實例都可能出現,如果發生在ASM實例,并不會導致ASM實例的崩潰,但是如果發生在數據庫實例,則會導致數據庫實例被強制關閉:
Tue May 15 10:28:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:32:50 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:33:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:34:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:43:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:46:13 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_18846.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May tb 15 10:46:14 2012
Trace dumping is performing id=[cdmp_20120515104614]
上面是ASM實例的報錯,下面是對應時刻數據庫實例的報錯:
Tue May 15 10:38:12 2012
kkjcre1p: unable to spawn jobq slave process
Tue May 15 10:38:12 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_cjq0_17957.trc:
Tue May 15 10:42:19 2012
PMON failed to acquire latch, see PMON dump
Tue May 15 10:43:04 2012
found dead shared server 'S006', pid = (90, 4)
Tue May 15 10:43:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j000_19938.trc:
ORA-12012: error on auto execute of job 42579
ORA-27468: "EXFSYS.RLM$EVTCLEANUP" is locked by another process
Tue May 15 10:45:06 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j002_23628.trc:
ORA-12012: error on auto execute of job 8888975
ORA-27468: "ORCL.P_DATA_C" is locked by another process
Tue May 15 10:45:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j003_23959.trc:
ORA-12012: error on auto execute of job 8855572
ORA-27468: "ORCL.P_DATA" is locked by another process
Tue May 15 10:46:14 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_18844.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May 15 10:46:14 2012
ASMB: terminating instance due to error 15064
Tue May 15 10:46:15 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_17903.trc
Tue May 15 10:46:16 2012
Shutting down instance (abort)
License high water mark = 52
如果從這次數據庫的實例崩潰看,問題似乎和主機上的資源耗盡有關。在問題發生之前,數據庫實例已經出現了kkjcre1p: unable to spawn jobq slave process和PMON failed to acquire latch的問題。
當時其他時刻出現這個錯誤時,似乎并沒有確定的資源不足的信息:
Sat May 26 09:47:49 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:49:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:52:23 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_21722.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:25 2012
Trace dumping is performing id=[cdmp_20120526095225]
對應這個時刻的數據庫告警信息為:
Sat May 26 09:52:24 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_21720.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:24 2012
ASMB: terminating instance due to error 15064
Sat May 26 09:52:25 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_20837.trc
Sat May 26 09:52:26 2012
Shutting down instance (abort)
License high water mark = 46
Sat May 26 09:52:30 2012
Instance terminated by ASMB, pid = 21720
Sat May 26 09:52:31 2012
Instance terminated by USER, pid = 536
這次錯誤的出現并沒有任何其他的信息,數據庫實例就直接DOWN掉了。不過每次在出現這個錯誤時,ASM實例上都會存在告警信息:NOTE: database ORCL1:ORCL failed during msg 19, reply 2。這說明ASM實例和數據庫的通信存在了問題。kfnsBackground是Kernel Files Network Service Background的縮寫。其中MSG 19是指IOSTAT,而reply 2指的是TIMEOUT,這說明ASM在進行io操作是出現了timeout導致了ASM的異常并導致實例的崩潰。
這個錯誤相對比較罕見,整個METALINK中,只有3篇文章和這個錯誤相關,其中兩篇是和歸檔路徑空間不足導致系統HANG住,最終導致IO的TIMEOUT,并產生了錯誤;而另外一篇則沒有進一步的信息。其中這三次錯誤對應的版本分別是10.2.0.4 FOR AIX、10.2.0.4 FOR SOLARIS和10.2.0.3 FOR HPUX,這說明這個錯誤和平臺沒有關系,但是問題集中在10.2.0.3和10.2.0.4版本上。
根據上面的分析,應該部署操作系統信息監控工具,以便于隨時觀察系統資源的使用情況,在出現類似的錯誤可以進行輔助分析。由于這個問題沒有出現在10.2.0.5中的記錄,因此把數據庫升級到10.2.0.5有可能避開這個問題。
前幾天在給公司的員工講一個案例的提到這個問題。
其實當時提到了這個特點,DROP TABLE會進入回收站,但是DROP PARTITION并不會,因此DROP PARTITION之后,數據無法簡單的回復,只能通過邏輯或物理備份的方式來進行數據的回復。
SQL> create table t_drop (id number);
Table created.
SQL> drop table t_drop;
Table dropped.
SQL> select object_name, original_name from recyclebin;
OBJECT_NAME ORIGINAL_NAME
------------------------------ --------------------------------
BIN$xJhZqpmfWZXgRDzZK0pZWw==$0 T_DROP
SQL> create table t_part_drop (id number) partition by range (id)
2 (partition p1 values less than (10),
3 partition p2 values less than (20),
4 partition p3 values less than (30),
5 partition pmax values less than (maxvalue));
Table created.
SQL> insert into t_part_drop tb select rownum from user_objects;
176 rows created.
SQL> commit;
Commit complete.
SQL> alter table t_part_drop drop partition p1;
Table altered.
SQL> select object_name, original_name from recyclebin;
OBJECT_NAME ORIGINAL_NAME
------------------------------ --------------------------------
BIN$xJhZqpmfWZXgRDzZK0pZWw==$0 T_DROP
本來只是普及一下這個常識,不過有人問我Oracle為什么沒有實現將刪除分區放在回收站中。這個問題問的很好,因為如果這個功能很容易實現,那么Oracle肯定早就實現了,而到了11.2中Oracle仍然沒有實現這個功能,那么一定說明這個功能不是無法實現,就是實現的困難太大。
回收站的實現并不復雜,當一張表被刪除的時候,Oracle沒有直接釋放表在表空間上的空間占用,而是將表簡單的打了個標識,這樣在正常查詢數據字典時就不會看到這張被刪除的表,而如果需要恢復這張表時,只需要將標識位改回來既可。
那么同樣是修改數據字典,為什么不能將被刪除的分區通過標識的方法放到回收站中呢,這是因為,對于表而言,刪除操作是將一個整理完全刪除。而對于分區的刪除,是刪除整體中的一個部分。對于刪除這個動作其實并沒有太大的影響,但是回收站的功能不是為了刪除,而是為了可以快速的恢復。對表而言,直接恢復整體不存在任何的問題,即使同名對象存在,也只需改個名字既可。而對于刪除分區的恢復而言, tb 問題就不那么簡單了。由于分區表并沒有刪除,因此這個表仍然可以繼續進行操作,雖然某個分區被刪除了,但是除非是范圍分區中的MAXVALUE分區和列表分區中的DEFAULT分區,否則再插入原分區對應的數據時,并不會報錯,而是會插入到其他分區中:
SQL> select * from t_part_drop partition (p2);
ID
----------
10
11
12
13
14
15
16
17
18
19
10 rows selected.
SQL> insert into t_part_drop values (5);
1 row created.
SQL> select * from t_part_drop partition (p2);
ID
----------
10
11
12
13
14
15
16
17
18
19
5
11 rows selected.
原表應該插入分區P1的數據,由于分區P1被刪除,因此現在滿足分區P2的條件,被插入到分區P2中,考慮這種情況下,如果直接恢復P1分區會怎樣。
顯然這不是一個簡單的數據字典的修改就能解決的問題,不但涉及到分區數據改變的問題,還必然會帶來全局和本地索引失效的問題,更重要的是,可能帶來主鍵沖突的情況。
這還只是分區表進行了DML的情況,如果刪除分區后,分區表又進行了DDL,比如新SPLIT了P1分區,那么刪除分區的恢復操作就更無法進行了。
如果一個功能覺得很簡單就可以實現,但是Oracle卻一直沒有實現,那么很可能實現這個功能并不像想象的那么簡單。
11.2對于CREATE OR REPLACE TYPE語句進行了增加,增加了FORCE選項。
在11.2之前,只要有其他的表或TYPE依賴了當前對象,這個對象就無法進行REPLACE了:
SQL> create type t_num_tab is table of number;
2 /
Type created.
SQL> create type t_record is object (id number, n_tab t_num_tab);
2 /
Type created.
SQL> create or replace type t_num_tab is table of number(5);
2 /
create or replace type t_num_tab is table of number(5);
*
ERROR at line 1:
ORA-02303: cannot drop or replace a type with type or table dependents
這是11.2之前的情況,嘗試執行CREATE OR REPLACE將會導致ORA-2303錯誤,在11.2中Oracle增加了FORCE功能,使得一個僅被TYPE所依賴的對象仍然可以執行REPLACE的操作:
SQL> create or replace type t_num_tab force is tbable of number(5);
2 /
Type created.
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Solaris: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
有意思的是,FORCE語句的位置不對則不會生效,這時同樣會報ORA-2303的錯誤,而并不會導致語句錯誤:
SQL> create or replace force type t_num_tab is table of number(5);
create or replace force type t_num_tab is table of number(5)
*
ERROR at line 1:
ORA-02303: cannot drop or replace a type with type or table dependents
SQL> create or replace type t_num_tab is table of number(5) force;
2 /
create or replace type t_num_tab is table of number(5) force;
*
ERROR at line 1:
ORA-02303: cannot drop or replace a type with type or table dependents
最后這個功能只對被TYPE所依賴的對象有效,一旦對象被表所依賴,則FORCE功能也不起作用:
SQL> create table t_type_tab (id number, c_tab t_num_tab)
2 nested table c_tab store tb as c_tab_tab;
Table created.
SQL> create or replace type t_num_tab force is table of number(6);
2 /
create or replace type t_num_tab force is table of number(6);
*
ERROR at line 1:
ORA-22866: cannot replace a type with table dependents
Oracle的錯誤信息也變成了ORA-22866。其實這時可以預料的,因為一旦創建了表,就相當于進行了實體化的工作,如果依賴的類型發生了變化,將會影響表中已有的數據的讀寫。不過其實Oracle可以做到更進一步,就是如果表段沒有創建或者表中沒有插入數據的情況下,允許對依賴的對象進行修改。