
How to Collect Diagnostics for Database Hanging Issues

In a previous post I mentioned a customer database that hung. Because no detailed diagnostic information was collected while the database was hung (apart from the alert.log), pinpointing the root cause afterwards proved very difficult. This post therefore reproduces the MOS article "How to Collect Diagnostics for Database Hanging Issues (Doc ID 452358.1)" for reference.
Applies to 
Oracle Database – Enterprise Edition – Version 9.0.1.0 and later
Oracle Database – Standard Edition – Version 9.0.1.0 and later
Oracle Database – Personal Edition – Version 9.0.1.0 and later
Information in this document applies to any platform.
Goal
When a database appears to be hung, it is useful to collect information from the database in order to determine the root cause of the hang. The root cause of the hang can often be isolated and solved using the diagnostic information gathered. Alternatively, if this is not possible, we can use the information obtained in order to help eliminate future occurrences.
Solution
What is needed to diagnose "Database Hang" issues? Database hangs are characterised by a number of processes waiting for some other activity to complete. Typically there are one or more blockers that are stuck, or perhaps working hard and not freeing resources quickly enough. In order to diagnose this, the following diagnostics are needed:
A. Hanganalyze and Systemstate Dumps
B. AWR/Statspack snapshots of General database performance
C. Up to date RDA
Please refer to the relevant sections below for more details on how to collect these.

A. Hanganalyze and Systemstate Dumps
Hanganalyze and Systemstate dumps provide information on the processes in the database at a specific point in time. Hanganalyze provides information on all processes involved in the hang chain, whereas systemstate provides information on all processes in the database. When looking at a potential hang situation, you need to determine whether a process is stuck or moving slowly. By collecting these dumps at 2 consecutive intervals, this can be established. If a process is stuck, these traces also provide the information to start further diagnosis and possibly help to provide the solution.
a) Hanganalyze is a summary and will confirm if the db is really hung or just slow and provides a consistent snapshot.
b) Systemstate dump shows what each process on the database is doing
Collecting Hanganalyze and Systemstate Dumps steps:
1) Logging in to the system
Using SQL*Plus connect as SYSDBA using the following command:

sqlplus '/ as sysdba'

If there are problems making this connection then in 10gR2 and above, the sqlplus “preliminary connection” can be used :

sqlplus -prelim '/ as sysdba'

Note: From 11.2.0.2 onwards, hanganalyze will not produce output under a sqlplus “preliminary connection” since it requires a process state object and a session state object. If a hanganalyze is attempted, the trace will output:
HANG ANALYSIS
ERROR: Can not perform hang analysis dump without a process state object and a session state object.
( process=(nil), sess=(nil) )
2) Collection commands for Hanganalyze and Systemstate: Non-RAC
Sometimes the database may actually just be very slow and not actually hanging. It is therefore recommended, where possible, to get 2 hanganalyze and 2 systemstate dumps in order to determine whether processes are moving at all or whether they are "frozen".
2-1) Hanganalyze

sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
-- Wait one minute before getting the second hanganalyze
oradebug hanganalyze 3
oradebug tracefile_name
exit

2-2) Systemstate

sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug dump systemstate 266
oradebug dump systemstate 266
oradebug tracefile_name
exit

3) Collection commands for Hanganalyze and Systemstate: RAC
There are 2 bugs affecting RAC that, without the relevant patches applied on your system, make using level 266 or 267 very costly. Therefore, without these fixes in place it is highly inadvisable to use these levels.
For information on these patches see:

Document 11800959.8 Bug 11800959 - A SYSTEMSTATE dump with level >= 10 in RAC dumps huge BUSY GLOBAL CACHE ELEMENTS - can hang/crash instances 
Document 11827088.8 Bug 11827088 - Latch 'gc element' contention, LMHB terminates the instance

3-1) Collection commands for Hanganalyze and Systemstate: RAC with fixes for bug 11800959 and bug 11827088

sqlplus '/ as sysdba'
oradebug setorapname reco
oradebug unlimit
oradebug -g all hanganalyze 3
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate 266
oradebug -g all dump systemstate 266
exit

3-2) Collection commands for Hanganalyze and Systemstate: RAC without fixes for Bug 11800959 and Bug 11827088

sqlplus '/ as sysdba'
oradebug setorapname reco
oradebug unlimit
oradebug -g all hanganalyze 3
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate 258
oradebug -g all dump systemstate 258
exit
Note: In a RAC environment, a dump will be created for all RAC instances in the DIAG trace file of each instance.

Explanation of Hanganalyze and Systemstate Levels
Hanganalyze levels:
1) Level 3: From 11g onwards, level 3 also collects a short stack for the relevant processes in the hang chain.
Systemstate levels:
1) Level 258 is a fast alternative, but some lock element data is lost.
2) Level 267 can be used if additional buffer cache / lock element data is needed, with an understanding of the cost.
Other Methods
If connection to the system is not possible in any form, then please refer to the following article which describes how to collect systemstates in that situation:

Document 121779.1 Taking a SYSTEMSTATE dump when you cannot CONNECT to Oracle.

On RAC Systems, hanganalyze, systemstates and some other RAC information can be collected using the ‘racdiag.sql’ script, see:

Document 135714.1 Script to Collect RAC Diagnostic Information (racdiag.sql)

V$wait_chains
Starting from 11g Release 1, the DIA0 background process collects hanganalyze information and stores it in memory in the "hang analysis cache". It does this every 3 seconds for local hanganalyze information and every 10 seconds for global (RAC) hanganalyze information. This information can provide a quick view of the hang chains occurring at the time a hang is experienced.
For more information see:

Document 1428210.1 Troubleshooting Database Contention With V$Wait_Chains
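
As a quick illustration, hang chain information can be read directly from this cache with a query like the one below (a minimal sketch using a few of the standard V$WAIT_CHAINS columns):

select chain_id, sid, blocker_sid, wait_event_text, in_wait_secs, num_waiters
from v$wait_chains
order by chain_id;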

B. Provide AWR/Statspack snapshots of General database performance
Hangs are a visible effect of a number of potential causes; these can range from a single-process issue to something brought on by a global problem.
Collecting information about the general performance of the database in the build up to, during and after the problem is of primary importance since these snapshots can help to determine the nature of the load on the database at these times and can provide vital diagnostic information. This may prove invaluable in identifying the area of the problem and ultimately resolving the issue.
To do this, please take and upload snapshot reports of database performance (AWR or Statspack reports) immediately before, during and after the hang.
Please refer to the following article for details of what to collect:

Document 781198.1 Diagnostics for Database Performance Issues
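
For example (a sketch only, assuming the AWR feature is licensed and in use), a snapshot can be forced immediately before and after the problem window and a report then generated between the two snapshot IDs:

exec dbms_workload_repository.create_snapshot;
-- ... problem window ...
exec dbms_workload_repository.create_snapshot;
@?/rdbms/admin/awrrpt.sql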

C. Gather an up to date RDA
An up to date RDA provides a lot of additional information about the configuration of the database, as well as performance metrics, and can be examined to spot background issues that may impact performance.
See the following note on My Oracle Support:

Document 314422.1 Remote Diagnostic Agent (RDA) 4 - Getting Started
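
On most platforms RDA is driven by the rda.sh (or rda.cmd) script; a minimal run, assuming the kit has been unpacked in the current directory and setup has been completed as described in the note above, is simply:

./rda.sh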

Proactive Methods to gather information on a Hanging System
On some systems a hang can occur when the DBA is not available to run diagnostics or at times it may be too late to collect the relevant diagnostics. In these cases, the following methods may be used to gather diagnostics:
1) As an alternative to the manual collection methods noted above, it is also possible to use the HANGFG script, as described in the following note, to collect the information:

Document 362094.1 HANGFG User Guide

Additionally, this script can collect information with lower impact on the target database.
2) LTOM
The Lite Onboard Monitor (LTOM) is a java program designed as a real-time diagnostic platform for deployment to a customer site. LTOM proactively provides real-time automatic problem detection and data collection.
For more information see:

Document 352363.1 LTOM - The On-Board Monitor User Guide

3) Procwatcher
Procwatcher is a tool that examines and monitors Oracle database and/or clusterware processes at a specific interval. The following notes explain how to use Procwatcher:

Document 459694.1 Procwatcher: Script to Monitor and Examine Oracle DB and Clusterware Processes
Document 1352623.1 How To Troubleshoot Database Contention With Procwatcher

4) OSWatcher Black Box
OSWatcher Black Box contains a built-in analyzer that allows the data that has been collected to be automatically analyzed, proactively looking for CPU, memory, I/O and network issues. It is recommended that all users install and run OSWbb, since it is invaluable for looking at issues on the OS and has very little overhead. It can also be extremely useful for looking at OS performance degradation that may be seen when a hang situation occurs.
Refer to the following for download, user guide and usage videos on OSWatcher Black Box:

Document 301137.1 OSWatcher Black Box User Guide (Includes: [Video])

Oracle Enterprise Manager 12c Real-Time ADDM
Real-Time ADDM is a feature of Oracle Enterprise Manager Cloud Control 12c that allows you to analyze database performance automatically when you cannot log on to the database because it is hung or performing very slowly due to a performance issue. It analyzes current performance while the database is hanging or running slowly and reports the sources of severe contention.
For more information see the following video: Oracle Enterprise Manager 12c Real-Time ADDM

Retroactive Information Collection
Sometimes we may only notice a hang after it has occurred. In this case the following information may help with Root Cause Analysis:
1) A series of AWR/Statspack reports leading up to and during the hang
2) ASH reports – more granular reports can be obtained for the time of the hang, down to intervals of one minute.
3) Raw ASH information. This can be obtained by issuing an ashdump trace (see the example after this list). See:

Document 243132.1 10g and above Active Session History (Ash) And Analysis Of Ash Online And Offline
Document 555303.1 ashdump* scripts and post-load processing of MMNL traces

4) Alert log and any traces created at time of hang
On RAC, specifically check the following trace files as well: dia0, lmhb, diag and lmd0 traces.
5) RDA as above
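
As a sketch of items 2) and 3) above, an ASH report for the hang window can be generated with the standard ashrpt.sql script, and the in-memory ASH data can be dumped to a trace file with oradebug (the ashdump level is the number of minutes of history to dump):

@?/rdbms/admin/ashrpt.sql

sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug dump ashdump 10
oradebug tracefile_name
exit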

PS: This article comes from MOS…

Analysis of "PMON failed to acquire latch"

Recently a customer called: the application staff reported that new connections could not be established, while existing sessions could not perform any operations on the data. The initial judgment was that the database was suffering from resource waits or was hung. After going on site and reviewing the alert.log and related information, it was confirmed that the database was indeed hung, with PMON repeatedly failing to acquire a latch in the alert.log:

Wed Oct 16 22:00:12 2013
Thread 1 advanced to log sequence 25413 (LGWR switch)
  Current log# 5 seq# 25413 mem# 0: /ora_data/orcl/redo05.log
Wed Oct 16 22:07:15 2013
WARNING: inbound connection timed out (ORA-3136)
Wed Oct 16 22:08:18 2013
WARNING: inbound connection timed out (ORA-3136)
Wed Oct 16 22:08:33 2013
kkjcre1p: unable to spawn jobq slave process 
Wed Oct 16 22:08:33 2013
Errors in file /oracle/admin/orcl/bdump/orcl_cjq0_512234.trc:
Wed Oct 16 22:09:33 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:10:33 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:10:54 2013
kkjcre1p: unable to spawn jobq slave process 
Wed Oct 16 22:10:54 2013
Errors in file /oracle/admin/orcl/bdump/orcl_cjq0_512234.trc:
Wed Oct 16 22:11:33 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:12:33 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:13:33 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:14:33 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:15:33 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:16:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:17:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:18:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:19:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:20:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:21:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:22:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:23:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:24:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:25:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:26:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:27:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:28:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:29:34 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:30:35 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:31:35 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:32:35 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:33:35 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:34:35 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:35:35 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:36:35 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:37:35 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:38:35 2013
PMON failed to acquire latch, see PMON dump
Wed Oct 16 22:39:35 2013
PMON failed to acquire latch, see PMON dump
...........................................

To determine the exact cause, I tried to log in with sqlplus as sysdba, but the connection itself hung, so database resource usage and wait information could not be queried. ps -ef | grep local=no showed a large number of application processes; after killing them the database remained hung. In order to restore the application as quickly as possible, the customer finally chose to reboot the host to restart the database (in fact, a hanganalyze and systemstate dump, or an RDA, should have been taken before the reboot so that the root cause of the hang could be determined afterwards).
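
For reference, even when a normal sqlplus logon hangs, the preliminary connection described in the MOS note above can often still be used to capture dumps before restarting; a minimal sketch for this 10.2.0.4 single instance would be:

sqlplus -prelim '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
oradebug dump systemstate 258
-- wait about a minute, then repeat for comparison
oradebug hanganalyze 3
oradebug dump systemstate 258
oradebug tracefile_name
exit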

Afterwards I attempted a more detailed analysis, but in the end the root cause of this case still could not be determined with certainty. The analysis process was as follows:
1) Before "PMON failed to acquire latch" appeared in the alert.log, a "kkjcre1p: unable to spawn jobq slave process" error had already occurred. The contents of the corresponding trace file orcl_cjq0_512234.trc are as follows:

*** 2013-10-16 22:07:19.030
Waited for process J002 to initialize for 60 seconds
*** 2013-10-16 22:07:19.199
Dumping diagnostic information for J002:
OS pid = 5587050
loadavg : 5.31 5.76 5.78
swap info: free_mem = 77.07M rsv = 72.00M 
           alloc = 6257.89M avail = 18432.00M swap_free = 12174.11M
       F S      UID     PID    PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 A   oracle 5587050       1   0  60 20 e858f590 92492          22:06:18      -  0:00 ora_j002_orcl 
open: The file access permissions do not allow the specified action.
Warning: executed in non-root mode
procstack cannot verify that /unix matches the running kernel.
Kernel symbols might not be validated.
5587050: ora_j002_orcl
0x00000001000fc898  sskgpwwait(??, ??, ??, ??, ??) + 0x38
0x00000001000f9e7c  skgpwwait(??, ??, ??, ??, ??) + 0xbc
0x000000010011dbac  kslges(??, ??, ??, ??, ??) + 0x54c
0x00000001001219dc  kslgetl(??, ??, ??, ??) + 0x33c
0x00000001049f0978  ksfglt(??, ??, ??, ??, ??) + 0x198
0x00000001000847b4  kghfrunp(??, ??, ??, ??, ??, ??, ??) + 0x794
0x000000010007a488  kghfnd(??, ??, ??, ??, ??, ??) + 0x7e8
0x0000000100098444  kghalo(??, ??, ??, ??, ??, ??, ??, ??) + 0xa24
0x0000000100005908  ksp_param_handle_alloc(??) + 0x168
0x000000010001d21c  kspcrec(??) + 0x1bc
0x00000001001407e8  ksucre(??) + 0x408
0x0000000103081ab0  kkjrdp() + 0x350
0x000000010430c190  opirip(??, ??, ??) + 0x4f0
0x0000000102d9a598  opidrv(??, ??, ??) + 0x458
0x000000010370b0b0  sou2o(??, ??, ??, ??) + 0x90
0x0000000100000870  opimai_real(??, ??) + 0x150
0x00000001000006d8  main(??, ??) + 0x98
0x0000000100000368  __start() + 0x98
*** 2013-10-16 22:07:25.067
*** 2013-10-16 22:07:35.068
Waited for process J002 to initialize for 70 seconds
*** 2013-10-16 22:07:35.068
Dumping diagnostic information for J002:
OS pid = 5587050
loadavg : 5.75 5.83 5.81
swap info: free_mem = 72.61M rsv = 72.00M 
           alloc = 6257.04M avail = 18432.00M swap_free = 12174.96M
       F S      UID     PID    PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 A   oracle 5587050       1   0  60 20 e858f590 92492          22:06:18      -  0:00 ora_j002_orcl 
open: The file access permissions do not allow the specified action.
Warning: executed in non-root mode
procstack cannot verify that /unix matches the running kernel.
Kernel symbols might not be validated.
5587050: ora_j002_orcl
0x00000001000fc898  sskgpwwait(??, ??, ??, ??, ??) + 0x38
0x00000001000f9e7c  skgpwwait(??, ??, ??, ??, ??) + 0xbc
0x000000010011dbac  kslges(??, ??, ??, ??, ??) + 0x54c
0x00000001001219dc  kslgetl(??, ??, ??, ??) + 0x33c
0x00000001049f0978  ksfglt(??, ??, ??, ??, ??) + 0x198
0x00000001000847b4  kghfrunp(??, ??, ??, ??, ??, ??, ??) + 0x794
0x000000010007a488  kghfnd(??, ??, ??, ??, ??, ??) + 0x7e8
0x0000000100098444  kghalo(??, ??, ??, ??, ??, ??, ??, ??) + 0xa24
0x0000000100005908  ksp_param_handle_alloc(??) + 0x168
0x000000010001d21c  kspcrec(??) + 0x1bc
0x00000001001407e8  ksucre(??) + 0x408
0x0000000103081ab0  kkjrdp() + 0x350
0x000000010430c190  opirip(??, ??, ??) + 0x4f0
0x0000000102d9a598  opidrv(??, ??, ??) + 0x458
0x000000010370b0b0  sou2o(??, ??, ??, ??) + 0x90
0x0000000100000870  opimai_real(??, ??) + 0x150
0x00000001000006d8  main(??, ??) + 0x98
0x0000000100000368  __start() + 0x98

The orcl_pmon_508278.trc file shows that the job queue process failed to start and that the process was later found dead:

*** 2013-10-17 13:50:01.856
found process 7000001938ff898 pid=185 serial=235 ospid = 5587050 dead

2) The contents of the PMON trace file orcl_pmon_508278.trc, produced by the "PMON failed to acquire latch" messages in the alert.log, are as follows:

*** 2013-10-16 22:09:33.682
PMON unable to acquire latch  7000000100e9bb0 Child shared pool level=7 child#=2 
       Location from where latch is held: kghfrunp: alloc: session dur: 
       Context saved from call: 0
       state=busy, wlstate=free
   waiters [orapid (seconds since: put on list, posted, alive check)]:
    225 (111, 1381932633, 3)
    231 (111, 1381932633, 0)
    232 (105, 1381932633, 3)
    190 (105, 1381932633, 3)
    258 (3, 1381932633, 3)
    259 (3, 1381932633, 3)
    260 (3, 1381932633, 0)
    261 (3, 1381932633, 0)
    262 (3, 1381932633, 0)
    waiter count=44
   gotten 471786337 times wait, failed first 1679732 sleeps 3225437
   gotten 0 times nowait, failed: 0
 possible holder pid = 227 ospid=5193976
----------------------------------------
SO: 700000193909df8, type: 2, owner: 0, flag: INIT/-/-/0x00
 (process) Oracle pid=227, calls cur/top: 0/700000181230280, flag: (0) -
           int error: 0, call error: 0, sess error: 0, txn error 0
 (post info) last post received: 0 0 0
             last post received-location: No post
             last process to post me: none
             last post sent: 0 0 0
             last post sent-location: No post
             last process posted by me: none
 (latch info) wait_event=0 bits=a0
   holding    (efd=6) 7000000100e9bb0 Child shared pool level=7 child#=2 
       Location from where latch is held: kghfrunp: alloc: session dur: 
       Context saved from call: 0
       state=busy, wlstate=free
       waiters [orapid (seconds since: put on list, posted, alive check)]:
        225 (652, 1381933174, 3)
        232 (646, 1381933174, 3)
        190 (646, 1381933174, 3)
        236 (634, 1381933174, 3)
        363 (240, 1381933174, 3)
        364 (234, 1381933174, 3)
        394 (153, 1381933174, 3)
        waiter count=146
   holding    (efd=6) 700000189fe4098 Child library cache level=5 child#=2 
       Location from where latch is held: kghfrunp: clatch: wait: 
       Context saved from call: 0
       state=busy, wlstate=free
       waiters [orapid (seconds since: put on list, posted, alive check)]:
        199 (652, 1381933174, 3)
        38 (652, 1381933174, 3)
        220 (652, 1381933174, 3)
        211 (652, 1381933174, 3)
        207 (652, 1381933174, 3)
        226 (652, 1381933174, 3)
        196 (652, 1381933174, 3)
        223 (652, 1381933174, 3)
        215 (652, 1381933174, 3)
        144 (652, 1381933174, 3)
        221 (652, 1381933174, 3)
        76 (634, 1381933174, 3)
        35 (529, 1381933174, 3)
        106 (514, 1381933174, 3)
        135 (463, 1381933174, 3)
        104 (408, 1381933174, 3)
        105 (393, 1381933174, 3)
        133 (288, 1381933174, 3)
        162 (168, 1381933174, 3)
        waiter count=19
   Process Group: DEFAULT, pseudo proc: 700000193a46758
   O/S info: user: oracle, term: UNKNOWN, ospid: 5193976 
   OSD pid info: Unix process pid: 5193976, image: oracle@gxjh_db
   Short stack dump: 
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-skgpwwait+00bc<-ksliwat+06c0<-kslwaitns_timed+0024<-kskthbwt+022c<-kslwait+00f4<-kksLockWait+01fc<-kgxWait+0168<-kgxExclusive+00bc<-kksFreeHeapGetMutex+0158<-kksCursorFreeCallBack+0088<-kglobf0+0264<-kglhpd_internal+0228<-kglhpd+0010<-kghfrx+0028<-kghfrunp+0b44<-kghfnd+07e8<-kghalo+0a24<-ksp_param_handle_alloc+0100<-kspcrec+01bc<-ksucre+0408<-ksucresg+017c<-kpolna+02c4<-kpolon+02e0<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098
Dump of memory from 0x0700000193881630 to 0x0700000193881838
700000193881630 00000004 00000000 07000001 81169F18  [................]
700000193881640 00000010 000313A7 07000001 81230280  [.............#..]
700000193881650 00000003 000313A7 07000001 8E11F350  [...............P]
700000193881660 0000000B 000313A7 07000001 90E23DF0  [..............=.]
700000193881670 00000004 0003129B 00000000 00000000  [................]
700000193881680 00000000 00000000 00000000 00000000  [................]
       Repeat 26 times
700000193881830 00000000 00000000                    [........]  
*** 2013-10-16 22:28:34.878
PMON unable to acquire latch  700000010007608 session allocation level=5 
       Location from where latch is held: ksucrp: 
       Context saved from call: 0
       state=busy, wlstate=free
   waiters [orapid (seconds since: put on list, posted, alive check)]:
    21 (834, 1381933774, 0)
    26 (834, 1381933774, 0)
    175 (258, 1381933774, 3)
    151 (258, 1381933774, 3)
    9 (183, 1381933774, 0)
    205 (18, 1381933774, 3)
    waiter count=151
   gotten 2384768601 times wait, failed first 1244465 sleeps 32489
   gotten 0 times nowait, failed: 0
   recovery area:
Dump of memory from 0x0700000010007978 to 0x0700000010007978
 possible holder pid = 364 ospid=3707158
----------------------------------------
SO: 700000190920928, type: 2, owner: 0, flag: INIT/-/-/0x00
 (process) Oracle pid=364, calls cur/top: 0/0, flag: (0) -
           int error: 0, call error: 0, sess error: 0, txn error 0
 (post info) last post received: 0 0 0
             last post received-location: No post
             last process to post me: none
             last post sent: 0 0 0
             last post sent-location: No post
             last process posted by me: none
 (latch info) wait_event=0 bits=20
       Location from where call was made: kghalo: 
   waiting for 7000000100e9bb0 Child shared pool level=7 child#=2 
       Location from where latch is held: kghfrunp: alloc: session dur: 
       Context saved from call: 0
       state=busy, wlstate=free
       waiters [orapid (seconds since: put on list, posted, alive check)]:
        225 (1252, 1381933774, 0)
        232 (1246, 1381933774, 0)
        363 (840, 1381933774, 0)
        364 (834, 1381933774, 0)
        394 (753, 1381933774, 0)
        waiter count=146
       gotten 471786337 times wait, failed first 1679732 sleeps 3225437
       gotten 0 times nowait, failed: 0
       possible holder pid = 227 ospid=5193976
   on wait list for 7000000100e9bb0
   holding    (efd=3) 700000010007608 session allocation level=5 
       Location from where latch is held: ksucrp: 
       Context saved from call: 0
       state=busy, wlstate=free
       waiters [orapid (seconds since: put on list, posted, alive check)]:
        21 (834, 1381933774, 0)
        26 (834, 1381933774, 0)
        43 (834, 1381933774, 0)
        178 (834, 1381933774, 0)
        41 (834, 1381933774, 0)
        181 (834, 1381933774, 0)
        151 (258, 1381933774, 3)
        9 (183, 1381933774, 0)
        205 (18, 1381933774, 3)
        waiter count=151
   O/S info: user: oracle, term: UNKNOWN, ospid: 3707158 
   OSD pid info: Unix process pid: 3707158, image: oracle@gxjh_db
   Short stack dump: 
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-skgpwwait+00bc<-kslges+054c<-kslgetl+033c<-ksfglt+0198<-kghalo+06c8<-ksmdacnk+0154<-ksmdget+0248<-kssadpm+02e4<-ksucrp+04e8<-opiino+03d4<-opiodr+0ae0<-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098
Dump of memory from 0x070000019087EE28 to 0x070000019087F030
70000019087EE20                   00000001 00000000          [........]
70000019087EE30 07000000 A9BB4038 00000010 000313A7  [......@8........]
70000019087EE40 00000000 00000000 00000000 00000000  [................]
 Repeat 30 times

This PMON trace file shows that at 2013-10-16 22:09:33 PMON failed to acquire latch 7000000100e9bb0, whose holder was pid=227, and that at 2013-10-16 22:28:34 PMON failed to acquire latch 700000010007608, whose holder was pid=364; that process was itself waiting for latch 7000000100e9bb0.
3) The most recent AWR report and the OS logs from before "PMON failed to acquire latch" showed nothing abnormal.

At this point the analysis could go no further. From the PMON trace, the problem was caused by shared pool contention, as can be seen from the following part of the trace file:

*** 2013-10-16 22:09:33.682
PMON unable to acquire latch 7000000100e9bb0 Child shared pool level=7 child#=2
Location from where latch is held: kghfrunp: alloc: session dur: 

However, because there is no systemstate dump, we lack sufficient evidence to determine the root cause. Based on experience, the most likely causes are:
1. The shared pool is too small;
2. An Oracle bug (for example, bug 7039896 is very similar to this case);
3. Some other incidental event causing processes to block one another.
Recommended solution:
1. Increase the shared pool size appropriately; the workaround given for bug 7039896 can also be tried;
2. The database version is 10.2.0.4, and many bugs in this version can cause similar problems, so upgrading to 10.2.0.5 or later is recommended;
3. If the problem recurs, take a systemstate dump and hanganalyze first so that the root cause can be determined.
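
As a starting point for recommendation 1, shared pool free memory can be checked before and after resizing with a query such as the one below (a sketch only; the 600M value is purely illustrative, an spfile is assumed for scope=both, and with sga_target set shared_pool_size acts as a minimum rather than a fixed size):

select pool, name, round(bytes/1024/1024) mb
from v$sgastat
where pool = 'shared pool' and name = 'free memory';

alter system set shared_pool_size = 600M scope = both;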