Log File Sync Issue (Part 2)

In part 1 of this log file sync case, I performed an analysis following Troubleshooting: ‘Log file sync’ Waits (Doc ID 1376916.1).

But I was not able to fully pinpoint the reason for the high average time for log file sync. Writing to the log files was not the main contributor to the wait, nor was CPU shortage.
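For reference, the average wait time for an event is just delta time waited over delta waits between two snapshots of v$system_event. A minimal sketch of that arithmetic, with made-up snapshot numbers (not our actual figures):

```python
# Hypothetical values for the 'log file sync' event, as you might
# read them from v$system_event at two points in time (made up).
snap1 = {"total_waits": 1_000_000, "time_waited_micro": 5_000_000_000}
snap2 = {"total_waits": 1_250_000, "time_waited_micro": 7_500_000_000}

def avg_wait_ms(before, after):
    """Average wait (ms) for the interval between two snapshots."""
    waits = after["total_waits"] - before["total_waits"]
    micros = after["time_waited_micro"] - before["time_waited_micro"]
    return (micros / waits) / 1000 if waits else 0.0

print(avg_wait_ms(snap1, snap2))  # 10.0 ms average in the interval
```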

The application commits very frequently, with an average of 5 user calls per commit, which is way below Oracle’s recommendation of 30 calls per commit.
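That ratio comes from dividing the ‘user calls’ statistic by ‘user commits’ (both in v$sysstat). A quick sketch with illustrative numbers, assuming snapshot deltas roughly like ours:

```python
# Delta values between two v$sysstat snapshots (made up for illustration).
user_calls = 5_000_000
user_commits = 1_000_000

calls_per_commit = user_calls / user_commits
print(calls_per_commit)        # 5.0 calls per commit
print(calls_per_commit < 30)   # True: below the ~30 calls/commit guideline
```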

Craig Shallahamer (OraPub) helped me analyze the issue, and he used some of my data to show how you can use R to visualize the wait event. Read his post and you get more details about the first picture below. I added LFPW just so you can see that it correlates with LFS.

[Charts: ‘log file sync’ (LFS) and ‘log file parallel write’ (LFPW) wait times]

In general we are doing well, so the question is: is this an issue?
It depends on your users. Are they complaining? Then it is an issue.

In my case it is an issue, since our users do complain about performance. I believe this is related to the combination of an “over-committing” application and fairly slow disks.
At peak times, when we have many sessions committing, we cannot cope with the load, so our users have to wait for LFS (and LFPW): a queue of sessions wanting to commit builds up. If we could reduce the commit rate from the application, the picture would be different. The same goes for improving write performance to disk; doing both would be very good. We are building a new test environment where we will use Fusion-io cards to improve write performance. Changing the commit rate is not in our control and not an option right now. We haven’t been able to test it yet, but I will come back and write about it when we have done our tests.
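The queueing effect at peak is easy to illustrate: treat LGWR as a single server and watch how the wait explodes as the commit arrival rate approaches the write service rate. This is only a toy M/M/1 sketch with made-up rates, not our actual measurements:

```python
def mm1_wait_ms(arrival_per_s, service_per_s):
    """Expected time in system (ms) for an M/M/1 queue; None if overloaded."""
    if arrival_per_s >= service_per_s:
        return None  # the queue grows without bound
    return 1000.0 / (service_per_s - arrival_per_s)

service = 1000.0  # writes/s the redo path can sustain (hypothetical)
for commits in (500, 900, 990):
    print(commits, mm1_wait_ms(commits, service))
# 500 -> 2.0 ms, 900 -> 10.0 ms, 990 -> 100.0 ms: the same disks feel
# far slower at peak commit rates, which matches what our users see.
```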


2 thoughts on “Log File Sync Issue (Part 2)”

  1. I had a similar situation in production (in part 1 you wrote what you’re on, same here but with the app going up to 700 commits/s) – there are 1 or 2 bugs related to RAC (you’re on RAC, aren’t you?) with these symptoms:
    – kcrfw_update_adaptive_sync_mode() in LGWR trace, and
    – BoC (Broadcast on Commit) in RAC being bugged

    The matching criteria from the bug descriptions did NOT match our situation 100%, but after patching I had more “stable performance” (no commit stalls). As far as I remember those bugfixes are online-applicable, but are NOT part of the PSU! No idea why they didn’t include them. What really helped nail down such an issue was Tanel’s snapper used against LGWR, and tracing the redo I/O LUNs via iostat -x (excluding I/O as the root cause).

    Keep us posted via part 3 🙂

    BTW: I was able to push a single 2s16c32t server (even without LGWR slaves – the new “adaptive” thing in 12c!) up to 20000 ACID commits/s with several tricks on ramdisk/tmpfs (and up to 60000 commits/s semi-ACID with COMMIT BATCH NOWAIT), till I ran out of CPU power. So LGWR itself certainly shouldn’t be a bottleneck (you’ll hit several other walls before LGWR won’t be able to keep up, especially with pooling).

    • Hi Jakub, thanks for your comment. No, we do not use RAC.
      The switching between post/wait and polling had a couple of bugs; that’s why I disabled it. Storage is an issue: we do not have the full attention of our storage department, and I (and some others) believe the configuration is not optimal for databases.
      So our architect introduced the Fusion-io cards as a workaround.
      The test system is built and will hopefully be available in a few weeks.
      I will test both the current version and 12c, and I might also test COMMIT BATCH NOWAIT on specific batches.

