I have a JD Edwards system where we have some severe waits for logfile sync, I have spent some time analyzing the issue. This is how I did it.
My AWR info
I use one hour for my analyze. This is the load profile
Log File Sync
Let’s start by checking what Log File Sync are and why we see it. The Database Administration manual has a good description
When a user session commits, the session’s redo information needs to be flushed to the redo logfile. The user session will post the LGWR to write the log buffer to the redo log file. When the LGWR has finished writing, it will post the user session.
Wait Time: The wait time includes the writing of the log buffer and the post.
The wait time includes writing, this is interesting since it involves another wait event, Log File Parallel Write, this event can only be waited for by the LGWR.
The Manual says this about the Log File Parallel Write
Writing redo records to the redo log files from the log buffer.
Wait Time: Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.
Analyzing following Oracle support Note: Troubleshooting: “log file sync” Waits
The note has four sections that are contributing to the analyse of Log File Sync.
– LGWR I/O performance
– LGWR IO Peak
– Application commits
– CPU capacity
LGWR I/O performance
The first things we want to analyze is if writing to the log files are slow, we can compare avg wait time for log file sync with log file parallel write. Tanel Pöder has a good picture showing the relation between the events.
If Log File Parallel Write is the main contributor to the overall Log File Sync, most of tha time is spent on writing redo and the IO system should be examined.
Since LGWR needs cpu a cpu shortage can also cause waits on log file sync.
From our data we can see that log file sync uses ~59% of db time with an average wait time of 13ms.
We have the foreground wait event log file sync with avg wait for 13ms.
The wait event registered when LGWR writes to the log files are the background wait event log file parallel write, the LGWR is the only process waiting on log file parallel writes. The relation is 13ms to 6ms which means writing redo is not standing for the majority of the time. The 6 ms wait for Log File Parallel Write isn’t great, but as it seems now it has no major contribution to the Log File Sync.
LGWR peak times
LGWR writes data in smaller bursts, most disks are not configured to handle these small bursts. Disks are configured to handle larger chunks of IO. These small bursts that LGWR writes can cause IO waits, but still show normal values for average IO (log file parallel write) waits. In situations like these the storage team can report that IO is not an issue. The peak time of the write is concealed in the average time.
One way to discover if you suffer from these bursts is to check the LGWR trace file, we have entries there reporting when a write to the redo logs takes more than 500ms. This is a high value so even if you don’t see them, we cannot guarantee that IO isn’t a problem. You should look if several warnings comes at the same time stamp, if it does IO peak can be an issue. In my case I see them but have long times apart.
The messages looks like this:
*** 2015-04-21 22:26:13.490 Warning: log write elapsed time 608ms, size 23848KB
Another message that you can see are
kcrfw_update_adaptive_sync_mode: post->poll long#=22 sync#=169 sync=61333 poll=5598 rw=2799 rw+=2799 ack=0 min_sleep=1974
This is related to Adaptive Log File Sync. The foreground (FG) process has two ways to check if the writing is done. POST/WAIT the FG process post the LGWr and waits for the LGWR to tell th FG process that the writing is done, post/wait works bet with small to medium load and is the faster of the two. Polling is better with high load, polling means that the FG process is polling the LGWR frequently to check if the writing is done. The switching between Post/Wait and Polling is done using an internal algorithm.
If the switching occurs to frequent it can be an issue, you can disable the feature by setting _use_adaptive_log_file_sync to FALSE. Note that 126.96.36.199-188.8.131.52 the default value was FALSE and from 184.108.40.206 it was changed to TRUE.
Are the redo logs large enough ?
On each log file switch a log file sync operation is performed. Oracle recommends that we do 3 to 4 switches per hour. Higher frequency can increase log file sync waits.
Looking at our stats we can see that we are most of the time within or close to the recommendations. At 8 PM we seems to have a peek in the log switches.
AWR has also information about log switches.
During my period we have 4 switches per hour, well within the recommendations.
If you perform log switches more often then recommended during a longer period, adjust the size and number of redo log groups. You can check the current size and groups by running:
select group#,thread#,sequence#,bytes/1024/1024 "MB" ,members,archived,status from v$log; GROUP# THREAD# SEQUENCE# MB MEMBERS ARC STATUS ------- ------- --------- ----- ------- --- -------- 1 1 79224 1024 1 YES INACTIVE 2 1 79228 1024 1 NO CURRENT 3 1 79227 1024 1 YES INACTIVE 4 1 79225 1100 1 YES INACTIVE 5 1 79226 1100 1 YES INACTIVE
We have different sizes, not by much but it should be adjusted.
Are we short of CPU ?
Do we suffer from CPU shortage ? The Operating System statistics tells us how much cpu we are using, in total not only the Instance. Craig Shalahammer has a nice blog how to calculate.
The formula for calculate utilization is U=R/C where R is Requirement (Busy Time ) and C is Capacity (Busy Time + Idle Time). (Note that the timing in Operating System statistics is centi seconds) Our formula will be 210.702/(210.702+4.115.592)=0,0487 ~ 4.9%. So we have plenty of CPU resources available.
Is the application committing to often ?
If the application commits to often it can cause high waits on log file sync since each commit flushes redo data from the redo buffer to the redo logs. Oracle has a recommendation that user calls per (commits+rollbacks) should not be lower than 30, if it is the application is committing to frequent. In our case we have:
The formula are: user calls/(user commits+user rollbacks) which gives us
3.192.460/(705.112+357)=3.192.460/705.469=4.52 avg user calls per commit.
This is way below Oracles recommendations, I have also confirmed this by looking at the DML’s in the system using grid control. We can see that we often perform single row DML’s.
executions is 17065 and rows are also 17065, row-by-row insert. Committing between each execution contributes to the high wait time on log file sync.
Writing to the disk is not the main contributor, 6 ms average wait time is not great but still we spend 7 seconds elsewhere. We do commit way to often if we follow Oracles recommendation, this can be an issue. Back to the text in the Note i mentioned…
“However , if 100s of sessions are waiting for the ‘log file parallel wait’ to complete, the total wait for ‘log file sync’ can be high as the wait time will be multiplied for the 100s of sessions.”
Our AWR report tells us we have around 350 sessions in this instance, and we have 195 commits per second. So if we commit to frequent can it be that we put each committing session on a queue waiting for the log writer and register that wait on log file sync ?
I’ll continue in my next post
Thanks to Craig Shalahammer for giving feedback about my case !
Adaptive Log File Sync Optimization
(Doc ID 1541136.1)
Adaptive Switching Between Log Write Methods can Cause ‘log file sync’ Waits
(Doc ID 1462942.1)
Troubleshooting: ‘Log file sync’ Waits
(Doc ID 1376916.1)
Craig Shalahammer, http://www.orapub.com