Session with spinning CPU

This problem was experienced in a 12.1.0.2 production database, and started when we saw an increased number of sessions waiting on "enq: TX - row lock contention".

[Screenshot: sessions waiting on enq: TX - row lock contention]

Looking at the blocking sessions in Grid Control, we found a blocking session running a simple delete from a table. The table has no foreign keys and no triggers. The delete removes a single row, and the plan uses a unique scan of the primary key index (one way to double-check the absence of triggers and foreign keys is sketched after the statement below).

DELETE FROM proddta.f3201 
WHERE ( cmcfgid = :KEY1 
 AND cmkcoo = :KEY2 
 AND cmdoco = :KEY3 
 AND cmdcto = :KEY4 
 AND cmsfxo = :KEY5 
 AND cmlnid = :KEY6 
 AND cmcfgcid = :KEY7 
 AND cmsy = :KEY8 )
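A sketch of such a double-check against the standard DBA views (owner and table name taken from the statement above):

-- any triggers defined on the table?
select trigger_name
from   dba_triggers
where  table_owner = 'PRODDTA'
and    table_name  = 'F3201';

-- any foreign keys pointing at the table's primary/unique keys?
select c.owner, c.table_name, c.constraint_name
from   dba_constraints c
where  c.constraint_type = 'R'
and    c.r_owner = 'PRODDTA'
and    c.r_constraint_name in (
         select pk.constraint_name
         from   dba_constraints pk
         where  pk.owner = 'PRODDTA'
         and    pk.table_name = 'F3201'
         and    pk.constraint_type in ('P', 'U')
       );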

[Screenshot: SQL details for the delete statement]

We can see that all of the time is spent on CPU. Looking at the server when a similar scenario occurred, we saw:

Tasks: 1075 total, 2 running, 1073 sleeping, 0 stopped, 0 zombie 
Cpu(s): 16.4%us, 0.9%sy, 0.0%ni, 82.4%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st 
Mem: 65798828k total, 62052676k used, 3746152k free, 1801920k buffers 
Swap: 2097148k total, 87636k used, 2009512k free, 3253732k cached 

PID   USER   PR NI VIRT  RES SHR S %CPU  %MEM TIME+ COMMAND 
30885 oracle 20 0  30.3g 27m 23m R 100.0 0.0  5:21.29 oraclecdb4166 (LOCAL=NO) < === 
21190 oracle 20 0  30.3g 37m 31m S 15.6  0.1  0:42.23 oraclecdb4166 (LOCAL=NO) 
71102 oracle 20 0  30.4g 49m 24m S 6.0   0.1  0:04.20 oraclecdb4166 (LOCAL=NO)

PID 30885 is always on top using 100% CPU, and the pid matches our session performing the delete. This spinning on the CPU prevents the transaction from finishing (committing), which contributes to the row lock contention. To look at the history for this SQL, I extracted AWR SQL reports for the various timestamps; the script to extract them is awrsqrpt.sql.
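The script ships with the database and prompts for the snapshot range and the sql_id:

-- generate an AWR report for a single sql_id
-- (prompts for report type, snapshot range and sql_id)
@?/rdbms/admin/awrsqrpt.sql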

The first report:

[AWR SQL report 1]

It is from ~16 days back, April 27. For that period we had 14 executions with a total elapsed time of 8 ms: very fast. On average each execution took 0.6 ms and did ~8 buffer gets.

Looking one day back from the day when we experienced the problem, we started to see a change in the execution time.

[AWR SQL report 2]

The same sql_id and plan hash value: 228 executions using 1.794 seconds of total elapsed time.
On average each execution now took 7.87 ms. A huge increase, perhaps not in absolute time but in percentage terms. My assumption is not that all 228 executions took 7.87 ms each;
I think a few executions took much longer while many were as fast as in the first example, since this is an average over all 228 executions. We noticed that some executions were very fast and some were stuck.

The last timestamp is from the day when we noticed this to be a problem.

[AWR SQL report 3]

Again we can see the same sql_id and plan hash value, and almost the same number of buffer gets per execution, which means we executed the SQL in the same way. For this period we have 19 executions using 756 seconds! A huge increase in execution time: the average execution now takes ~39 seconds. And you can see that the CPU time equals the elapsed time, so there is no wait time involved. (The CPU time is actually slightly higher than the elapsed time, which I think is a rounding error.)
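The same trend can also be pulled from AWR with a single query instead of three separate reports; a sketch against the standard DBA_HIST views (:sqlid is the sql_id of the delete):

select s.begin_interval_time,
       st.plan_hash_value,
       st.executions_delta                   as execs,
       round(st.elapsed_time_delta / 1e6, 3) as elapsed_s,
       round(st.cpu_time_delta / 1e6, 3)     as cpu_s,
       st.buffer_gets_delta                  as gets
from   dba_hist_sqlstat  st,
       dba_hist_snapshot s
where  st.sql_id         = :sqlid
and    s.snap_id         = st.snap_id
and    s.dbid            = st.dbid
and    s.instance_number = st.instance_number
order  by s.snap_id;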

I have also traced the SQL by using the event-based syntax: alter system set events 'sql_trace [sql: sql_id|sql_id] level 12'. This command can enable tracing for one or more sql_ids. You disable the tracing by issuing alter system set events 'sql_trace [sql: sql_id|sql_id] off' (note that an event set with alter system is also switched off with alter system).
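For example (the sql_id below is a placeholder):

-- enable level 12 (binds + waits) tracing for this statement only
alter system set events 'sql_trace [sql: 7q1sd2g5xkk27] level 12';

-- ... reproduce the problem, then switch it off again
alter system set events 'sql_trace [sql: 7q1sd2g5xkk27] off';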

This is what that trace looks like:

[Trace summary for the delete statement]

As you can see, we have 5 executions using 138 seconds of elapsed time. So all the data gathered supports the conclusion that this SQL now takes much longer to run.

With the long CPU time, the small amount of work being done, and the knowledge that this statement earlier executed in less than 1 ms, I started to suspect that we were spinning on CPU.

I took an errorstack from the spinning session; I found the process id in Grid Control.
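If you want to find the OS pid from the database instead, the classic v$session/v$process join works (a sketch; :sid is the blocking session's SID):

select p.spid
from   v$session s,
       v$process p
where  s.paddr = p.addr
and    s.sid   = :sid;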

connect / as sysdba
ALTER SESSION SET tracefile_identifier = 'ERRORSTACK';
-- attach to the spinning process via its OS process id
oradebug setospid PID
-- remove any trace file size limit
oradebug unlimit

-- take several errorstack dumps, a few seconds apart,
-- to confirm where the process is stuck
oradebug dump errorstack 3
oradebug dump errorstack 3
oradebug dump errorstack 3
-- show the name of the generated trace file
oradebug tracefile_name

Looking at the generated trace file, I examined the stack trace (the full errorstack is not listed here). You read a stack trace from the bottom up. I learned that the calls containing ksd*/kse* can be ignored, since they come from the dump process itself.
So the top call that I'm interested in is kkscsFindMatchingRange. To better understand which parts of Oracle are involved, you can use the first three characters of a function name to identify which part of the code it belongs to; kks means Kernel Kompile Shared (cursors).

[Excerpt of the errorstack trace file]

Searching for that call, I immediately came across an interesting note:
Query Intermittently Spins on CPU in Function kkscsFindMatchingRange() (Doc ID 1949350.1)

The bug has this information:

[Excerpt from the bug note]

Looking at the symptoms we have:

  • Query intermittently spins on CPU, but most of the time runs OK without spinning.
  • The execution plan is good and does not change when the problem is hit.
  • Errorstacks of the spinning process show that it is stuck in the function kkscsFindMatchingRange().
    Call stack in the trace file:
    kkscsFindMatchingRange()
    kkscsVerifyBindEquivRP()
    kkscsVerifyBindEquivalence()
    kkscsMatchEquivLiterals()
    kkscsCompareBinds()
    kkscscid_bnd_eval()
    kkscsCheckCriteria()

All of the above symptoms match our case. What is also interesting is the reference to the shared cursor functions (kks): I have a case open with support for the same database regarding a high number of child cursors (version count; a quick check for this is shown below). Looking at the referenced
Bug 19392364, we can also see a symptom matching our case: a high number of child cursors, and we had to kill our session manually.

Hang/Spin in kkscsFindMatchingRange requiring process to be killed manually
     when the number of child cursors is very high
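A quick way to check the number of child cursors for the statement (a sketch; :sqlid is the sql_id in question):

select sql_id, version_count, executions
from   v$sqlarea
where  sql_id = :sqlid;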

We tested the workaround:

Workaround:
Set both
alter system set "_optimizer_extended_cursor_sharing"='none';
alter system set "_optimizer_extended_cursor_sharing_rel"='none';
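To verify that the hidden parameters took effect, the usual x$ksppi/x$ksppcv query can be used (run as SYS; a sketch):

select p.ksppinm  as parameter_name,
       v.ksppstvl as current_value
from   x$ksppi  p,
       x$ksppcv v
where  p.indx = v.indx
and    p.ksppinm like '\_optimizer\_extended\_cursor\_sharing%' escape '\';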

After that we haven't experienced the problem. We have a case open with Oracle Support and are waiting for a one-off patch for the bug.

If you don't want to read the stack trace manually, you can use an Oracle Support tool:
ORA-600/ORA-7445/ORA-700 Error Look-up Tool (Doc ID 153788.1). I used it and pasted in the stack trace.

[Screenshot: the look-up tool with the stack trace pasted in]

Then I click the "Search Call Stack Trace" button.

[Screenshot: the search result, pointing to the same note]

You see we get the same note I referenced above…

References:

How to Collect Errorstacks for use in Diagnosing Performance Issues. (Doc ID 1364257.1)
SQL_TRACE (10046), TKProf and Explain Plan – Overview Reference (Doc ID 199081.1)
ORA-600/ORA-7445/ORA-700 Error Look-up Tool (Doc ID 153788.1)


5 thoughts on "Session with spinning CPU"

  1. Pingback: Poor Performance related to ADS | Oracle Tuning

  2. Hi

    We are facing the same issue with an update statement in our environment, which is taking 22 hours to update some 8000 rows. Please advise whether it can affect all DML statements. We have 12.1.0.2 on a Linux x86-64 machine.

    • I do believe it can hit any statement. If you experience the issue now, take an errorstack dump and compare the stack trace with the one in the bug, or test the workaround mentioned in the bug. It worked for us.

      Br
      Magnus

  3. How many rows do you have when you execute the following query?

    select count(1)
    from gv$sql_cs_selectivity
    where sql_id = 'xxxxxxxx';

    It seems that your delete is suffering from a side effect of Extended Cursor Sharing, similar to the case I have blogged about in the article below:

    https://hourim.wordpress.com/2015/04/06/bind_equiv_failure-or-when-you-will-regret-using-adaptive-cursor-sharing/

    This behaviour appears very often with cursors having several bind variables (say more than 4 or 5), for which Oracle Extended Cursor Sharing is unable to find an existing child cursor optimised for a set of bind variable values with a similar selectivity. Consequently Oracle hard parses a new plan for each execution, which obviously burns a lot of CPU and traumatises the library cache.

    Instead of disabling the ACS/ECS feature at system level, you could use the hint /*+ no_bind_aware */ in your delete statement, so that Oracle will not mark the underlying cursor bind aware and will not optimise a new execution plan at each execution.
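    For example, applied to the delete in the post (just a sketch of where the hint goes):

    -- no_bind_aware stops this cursor from being marked bind aware
    DELETE /*+ no_bind_aware */ FROM proddta.f3201
    WHERE ( cmcfgid = :KEY1
     AND cmkcoo = :KEY2
     AND cmdoco = :KEY3
     AND cmdcto = :KEY4
     AND cmsfxo = :KEY5
     AND cmlnid = :KEY6
     AND cmcfgcid = :KEY7
     AND cmsy = :KEY8 )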

    Best regards
    Mohamed Houri

    • I read your blog and it is interesting. You are probably right; we never saw the huge number of child cursors that you did. I have to check in my other case what the reason was for not sharing the cursors. This is a JD Edwards system, so we have no control over the code. We have the patch and plan to apply it in June.

      Thanks
      Magnus
