In an EDI application we use, we had times of very high buffer busy waits
and enq: TX – contention. This was seen while processing edi messages in the system. Database is 18.104.22.168.4
The top 5 wait events shows this when the problem is seen.
Top 5 Timed Foreground Events
|Event||Waits||Time(s)||Avg wait (ms)||% DB time||Wait Class|
|buffer busy waits||654,952||49,655||76||65.87||Concurrency|
|enq: TX – contention||1,273,449||23,157||18||30.72||Other|
|db file sequential read||367,266||604||2||0.80||User I/O|
|latch: enqueue hash chains||109,790||256||2||0.34||Other|
At this time the application didn’t process many messages at this time and it quickly became a sever problem for the comapny.
Our further analysis showed that buffer busy waits was related to one single block.
Looking in v$session for all sessions having buffer busy waits as event the p1,p2 and p3 values shows file,block and reasoncode.
Reasoncode for us was 8 which means 1st level bmp
This is the ASSM bitmap block, see the concept guide for details about ASSM.
This was also confirmed looking into the awr report
Buffer Wait Statistics
|Class||Waits||Total Wait Time (s)||Avg Time (ms)|
|1st level bmb||720,881||49,593||69|
|3rd level bmb||38||0||9|
|2nd level bmb||2||0||0|
Looking at the sql that was running we can see the following
SQL ordered by Elapsed Time
|Elapsed Time (s)||Executions||Elapsed Time per Exec (s)||%Total||%CPU||%IO||SQL Id||SQL Module||SQL Text|
|27,091.58||254||106.66||35.94||1.05||0.78||0u0vxfm0k4yba||SeeBIS6 AS 19686||INSERT INTO TBPINSTANCES (cID,…|
|8,593.48||48||179.03||11.40||1.07||0.71||4xfhztsqxh1nq||SeeBIS6 AS 19686||UPDATE TBPINSTANCES SET cProce…|
|7,013.41||29||241.84||9.30||1.01||0.84||20a291w9y9ruy||SeeBIS6 AS 19686||UPDATE TBPINSTANCES SET cProce…|
|6,238.21||13||479.86||8.28||1.01||0.83||ght7b34v6f7g4||SeeBIS6 AS 19686||UPDATE TBPINSTANCES SET cProce…|
|5,731.39||38||150.83||7.60||1.04||0.73||3bfqvwrqqpgmr||SeeBIS6 AS 19686||UPDATE TBPINSTANCES SET cProce…|
|5,187.19||50||103.74||6.88||1.12||0.68||7u1b7rtp4stn3||SeeBIS6 AS 19686||UPDATE TBPINSTANCES SET cState…|
|5,059.49||14||361.39||6.71||1.03||0.80||8a1urkdd3y048||SeeBIS6 AS 19686||UPDATE TBPINSTANCES SET cState…|
|4,855.83||28||173.42||6.44||1.08||0.65||899f7bqcg1rg7||SeeBIS6 AS 19686||UPDATE TBPINSTANCES SET cProce…|
|1,342.83||31||43.32||1.78||0.99||0.79||88w0d433ftf1v||SeeBIS6 AS 27951||UPDATE TBPINSTANCES SET cState…|
|1,088.53||9||120.95||1.44||1.07||0.82||4554dhmrq0kht||SeeBIS6 AS 19686||UPDATE TBPINSTANCES SET cProce…|
Looking at the sql we can see that all top sql performs insert or updates into one specific table, tbpinstances. Analyzing further shows that top segment for
buffer busy waits is a lobsegment.
Segments by Buffer Busy Waits
|Owner||Tablespace Name||Object Name||Subobject Name||Obj. Type||Buffer Busy Waits||% of Capture|
We learned that the application is using blobs for edi messages data, when processing it inserts the row and then updates it during the various stages of the process.
This could accours between 5-25 times depending what type of mesage we are processing.
We logged a P1 servcie request with Oracle Support and got help analyzing the problem. We first thought this was a bug but development
didn’t agree on that.
We was recommended to change the parameter “_securefiles_concurrency_estimate”=50 (default set to 12)
IMPORTANT: You must rebuild the table for this change to be effective.
In our case the table has grown to 260Gb. Our solution was to create an empty table (after changing the parameter)
and inserting the non processed edi messages into the table, when that was done and the application started and
confirmed to working as expected we staretd to add odler data that was mainly used by customers to verify that there messages
has been processed.
The parameter is hidden and undocumented so I can not say anything what changes when we set it. But I assume it has to do with
2013-09-18. Problem was experienced again, rebuilding with the parameter changed did not help.
We are currently working with Oracle to try to find the root cause. We did rebuild the table again
and used retention = none (Oracles recommendation) for the securefiles and the problem was gone.
This disables read consistency for securefiles.
I doubt that this is the solution, I will update when we have more information from Oracle.
We have Bug 17479510 : SECUREFILES DMLS CAUSE HIGH ‘BUFFER BUSY WAITS’ & ‘ENQ: TX – CONTENTION’ registered with Oracle Development.
I believe they have difficulties to analyse the issue, I don’t expect this to be solved for some time.
meanwhile we schedule frequent rebuilds of the table.
The rebuild fixes the problem, but it is introduced again. We did our last rebuild 2 month ago and we start to see the problem again, so we scheduled another rebuild this weekend.
Latest update in this case:
As it seems now this is not a new bug, when we took over the db we noticed that we had several Wnnn processes on top in the machine using 100% cpu.
This was confirmed by Oracle Support to be Wnnn processes consuming high CPU (Doc ID 1492880.1)
The recommended solution in the note is to SQL> ALTER SYSTEM SET “_ENABLE_SPACE_PREALLOCATION”=0;
If you want more info about space pre allocation see:
SMCO (Space Management Coordinator) For Autoextend On Datafiles And How To Disable/Enable (Doc ID 743773.1)
So the recommendation was to patch to 22.214.171.124 and set _ENABLE_SPACE_PREALLOCATION”=3 (default value) and
rebuild the table, we tried without rebuildingt he table and the table was still there. Rebuilding the table fixed the issue
but it is also our current workaround. After we have rebuilt the table it takes anything from 2 to 6 months until we see the
issue again. So this is means that it can take several months for us to verify the fix.
Development say that we can confirm this by:
“You can verify that the suggested solution is working in the following way:
In AWR report, you can search for “dispenser allocations” and “CFS allocations”.
That should give the number of chunks allocated from dispenser vs CFS.
CFS allocations should be very minimal. CFS allocations will happen only when dispenser is not able to meet requests.”
This is our values:
AWR from a test system where we reproduced the issue:
Statistic Total per Second per Trans
——————————– —————— ————– ————-
segment cfs allocations 153 0.1 0.0
segment dispenser allocations 3,662 2.0 0.1
After the patch and rebuild:
Statistic Total per Second per Trans
——————————– —————— ————– ————-
segment cfs allocations 1 0.0 0.0
segment dispenser allocations 12,626 3.5 0.1
cfs allocation is lower but I expected bigger differences between good and bad.
But we will patch and set the parameter back to its default value in production
and hopefully this issue is solved.