This problem was experienced in a 18.104.22.168 production database, and started when we saw in increased amount of sessions waiting for enq – TX row lock contention.
Looking at the blocking sessions in grid control we had a blocking session running
a simple delete from a table. We have no foreign keys for this table nor do we have any triggers on this table. The delete is a single row delete and plan was using a unique scan of the PK index.
DELETE FROM proddta.f3201 WHERE ( cmcfgid = :KEY1 AND cmkcoo = :KEY2 AND cmdoco = :KEY3 AND cmdcto = :KEY4 AND cmsfxo = :KEY5 AND cmlnid = :KEY6 AND cmcfgcid = :KEY7 AND cmsy = :KEY8 )
We can see that we spend all time on cpu. Looking at the server when a similar scenario occurred we saw:
Tasks: 1075 total, 2 running, 1073 sleeping, 0 stopped, 0 zombie Cpu(s): 16.4%us, 0.9%sy, 0.0%ni, 82.4%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 65798828k total, 62052676k used, 3746152k free, 1801920k buffers Swap: 2097148k total, 87636k used, 2009512k free, 3253732k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 30885 oracle 20 0 30.3g 27m 23m R 100.0 0.0 5:21.29 oraclecdb4166 (LOCAL=NO) < === 21190 oracle 20 0 30.3g 37m 31m S 15.6 0.1 0:42.23 oraclecdb4166 (LOCAL=NO) 71102 oracle 20 0 30.4g 49m 24m S 6.0 0.1 0:04.20 oraclecdb4166 (LOCAL=NO)
PID 30885 is always on top using 100% cpu, the pid matches our session that performs the delete. This spinning on the cpu prevents the transaction to finish (commit) contributing to the row lock contention. Looking at the history for this sql i looked at awr sql reports for the various timestamps. the script to extract awr sql reports are awrsqrpt.sql
The first one ..
Is from ~16 days back, April 27. For the period we had 14 executions with total elapsed time of 8 ms. A very fast execution. on average each execution took 0.6 ms and did ~8 buffer gets.
Looking one day back from the day when we experienced the problem, we started to see a change in the execution time.
The same sql_id and plan hash value, 228 execution using 1.794 total elapsed time.
On average each execution now took 7.87 ms. I huge increase, perhaps not in time but in percent increase. My assumption is not that all 228 executions took 7.84 ms each.
I think we had a few of the executions taking much longer time and many executions to be as fast as in the first example. But this is an average value for all 228 executions. We noticed that some executions was very fast and some was stuck.
The last timesatmp are from the day when we noticed this to be a problem.
Again we can see same sql_id,plan hash value and almost the same amount of buffer gets per read. This mean we executed the sql in the same way. This period we have 19 executions using 756 seconds! A huge increase of the execution time. Average execution is 39 seconds. And you can see the cpu time is equal to elapsed time so no wait time involved. (cpu time is actually higher than elapsed time which I think is a rounding error)
I have also traced the sql by using alter system set events ‘sql_trace [sql:sql_id|sql_id] level 12′; This command can enable tracing for one or more sqls. You disable the tracing by issuing alter session set events ‘sql_trace [sql:sql_id|sql_id] off’;
This is how that trace looks like
As you see we have 5 executions using 138 seconds of elapsed time. So the different data gathered supports our conclusion that this sql is now taking much longer to run.
With the long cpu time and the small amount of work we do and also knowing that this was earlier executed in less than 1 ms I started to think that we are spinning on cpu.
I took an errorstack from the spinning session, I found the process id in grid control.
connect / as sysdba ALTER SESSION SET tracefile_identifier = 'ERRORSTACK'; oradebug setospid PID oradebug unlimit oradebug dump errorstack 3 oradebug dump errorstack 3 oradebug dump errorstack 3 oradebug tracefile_name
Looking in the tracefile generated I looked at the stack trace, not the full error stack is listed here. You read a stack trace from the bottom up. I learned that the calls having ksd*/kse* can be ignored since they comes from the dump process.
So the top call that I’m interested in is kkscsFindMatchingRange. To better understand what parts of Oracle are involved you can use the first 3 characters to understand which part of the code we reference. kks means Kernel Kompile Shared (cursors).
If I searched for that call I immediately come across an interesting bug.
Query Intermittently Spins on CPU in Function kkscsFindMatchingRange() (Doc ID 1949350.1)
The bug has this information:
Looking at the symptoms we have
- Query intermittently spins on CPU, but most of the time runs OK without spinning.
- The execution plan is good and does not change when the problem is hit.
- Errorstacks of the spinning process show that it is stuck in the function kkscsFindMatchingRange().
Call stack in the trace file :
kkscsFindMatchingRange() kkscsVerifyBindEquivRP() kkscsVerifyBindEquivalence() kkscsMatchEquivLiterals() kkscsCompareBinds() kkscscid_bnd_eval() kkscsCheckCriteria()
All of the above symptoms matches our case. What is also interesting is the reference to the shared cursor function (KKS). I have a case with support for the same database regarding high numbers of shared cursors (version count). Looking in the bug referenced
Bug 19392364 we can also see a symptom matching our case. High number of child cursors and we had to kill our session manually.
Hang/Spin in kkscsFindMatchingRange requiring process to be killed manually when the number of child cursors is very high
We tested the workaround:
alter system set “_optimizer_extended_cursor_sharing”=’none’
alter system set “_optimizer_extended_cursor_sharing_rel”=’none’
After that we haven’t experienced the problem. We have a case with Oracle Support and waiting for a one-off patch for the bug.
If you don’t want to read the stack trace manually you can use a Oracle Support tool
ORA-600/ORA-7445/ORA-700 Error Look-up Tool (Doc ID 153788.1). If I use that and paste the stack trace.
Then I click search call stack trace button.
You see we get the same note I referenced above…
How to Collect Errorstacks for use in Diagnosing Performance Issues. (Doc ID 1364257.1)
SQL_TRACE (10046), TKProf and Explain Plan – Overview Reference (Doc ID 199081.1)
ORA-600/ORA-7445/ORA-700 Error Look-up Tool (Doc ID 153788.1)