Enq TX for a long time.

I’m helping a customer who experienced a scenario where some functions in the application stopped responding. The situation lasted for 48 hours, and it has also been seen pretty frequently over the past months.
When the vendor of the application was approached, the response was that the session was not receiving data from the database, which implied it was a database issue. A blog post by Oracle was referenced to confirm it as a database issue. Looking at the details and reading the blog, I agree with the conclusion that the JDBC session is stuck waiting for data from the database, but the reason for not receiving data is not database related. I think it is clearly a design issue in the application. Here is my analysis…

lck1

This is how it looks at the beginning of the issue.

Looking at the SQL that is stuck, we get this picture.

lck2

We can see that we have five sessions that want to run this update, but they are blocked since the rows they want to update are already locked by another process.
The following info is found in Doc ID 62354.1

What is a TX lock?

A TX lock is acquired when a transaction initiates its first change and is held until the transaction does a COMMIT or ROLLBACK. It is used mainly as a queuing mechanism so that other sessions can wait for the transaction to complete. The lock name (ID1 and ID2) of the TX lock reflect the transaction ID of the active transaction.
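
If the contention is still ongoing, the queue that the note describes is visible in v$lock. A minimal sketch, to be run as a DBA while the sessions are still connected:

-- TX enqueues: BLOCK = 1 marks the transaction the others are queuing behind,
-- REQUEST > 0 marks the waiters (mode 6 in the classic row-lock case)
SELECT sid, type, id1, id2, lmode, request, block
FROM   v$lock
WHERE  type = 'TX'
ORDER  BY id1, id2, request;

The waiters show the same ID1/ID2 as the blocking transaction, which is exactly the queuing mechanism the note talks about.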

How can we dig deeper? An AWR report is not the right tool as I see it. It can help you confirm the issue, but it will not give you the details you need.
I think an ASH report is the best tool to understand what happens here, but since this happened some time ago we might not get what we want from ASH. Let us look at an ASH report starting a few hours before we saw the locking issue and ending a few hours after.
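
For reference, the report can be generated with the standard script shipped with the database; it prompts for the report type, the begin time and the duration (a sketch, run as a DBA in SQL*Plus):

-- answer the prompts with a window around the incident,
-- e.g. a begin time a few hours before the locking started
@?/rdbms/admin/ashrpt.sql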

lck3

The source is dba_hist_active_sess_history. If I get a more recent ASH report…

lck4

…here we get the data from v$active_session_history. What are the differences?
v$active_session_history is populated every second and keeps the data for a shorter time. As you can see, the first report lists which AWR snapshots we have, so if the data cannot be found in v$active_session_history the report falls back to dba_hist_active_sess_history. That table is populated every 10th second with data from v$active_session_history.
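
A simple way to see how far back each source reaches on your own instance (just a sketch):

SELECT MIN(sample_time) AS oldest, MAX(sample_time) AS newest, COUNT(*) AS samples
FROM   v$active_session_history;

-- the AWR copy reaches much further back, but holds only every 10th sample
SELECT MIN(sample_time) AS oldest, MAX(sample_time) AS newest, COUNT(*) AS samples
FROM   dba_hist_active_sess_history;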

v$active_session_history has a column named is_awr_sample, and if I list a few sample_time values we can see the following.

lck5.png

Here you can clearly see that v$active_session_history has a sample every second and that a sample goes to dba_hist_active_sess_history every 10th second. This is crucial for our further investigation. There is one row for each active session at each timestamp.
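
The listing above can be reproduced with something along these lines, assuming the samples are still in memory:

-- one row per active session per sample; IS_AWR_SAMPLE = 'Y' marks
-- the samples that are persisted to dba_hist_active_sess_history
SELECT sample_id, sample_time, is_awr_sample
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - INTERVAL '1' MINUTE
ORDER  BY sample_time;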

Looking at our ASH report…

lck6

We can see that the enq: TX lock is the top event; this is expected.
What else can we get from the ASH report?

lck7

We have three sessions running the update, and they all suffer from the enq: TX lock, a lock that we so far do not know where it comes from or what the SQL behind it looks like.
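
To narrow down which statements are affected, you can group the ASH samples by sql_id. A rough sketch against the AWR copy (adjust the time window to your incident):

SELECT sql_id, COUNT(*) AS samples
FROM   dba_hist_active_sess_history
WHERE  event = 'enq: TX - row lock contention'
  AND  sample_time > SYSDATE - 2   -- adjust to the incident window
GROUP  BY sql_id
ORDER  BY samples DESC;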

lck9.png

We can see our three sessions waiting; the user is SYNC and it is a JDBC connection.

lck10.png

Finally some useful information: who is the “bad guy” blocking the other sessions?
SID 22 is the reason the other sessions have to wait. It has one ongoing transaction and it is causing the event enq: TX - row lock contention. The information is not sufficient to get to the root cause, but we know it is the user SYNC running whatever locks the data, and it is a JDBC connection. As I see it, this is the first indication that the problem is application related.
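
The blocker can also be pulled straight out of the ASH data by grouping on the blocking_session columns. A minimal sketch:

-- the top blocking_session is the holder everyone queues behind (sid 22 in our case)
SELECT blocking_session, blocking_session_serial#, COUNT(*) AS samples
FROM   dba_hist_active_sess_history
WHERE  event = 'enq: TX - row lock contention'
  AND  sample_time > SYSDATE - 2   -- adjust to the incident window
GROUP  BY blocking_session, blocking_session_serial#
ORDER  BY samples DESC;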

How can this help us? We can now try to get an ASH report for that session and see if we can find which SQL is causing the locking scenario.
Why do we not get any details about session 22 in this ASH report? Remember that ASH only samples active sessions. Our session 22 is idle; it still has an ongoing transaction, but for the duration of my ASH report it hasn’t done anything. As I see it, this is the second indication that we have an application issue.
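
While the blocker is still connected you can also confirm this state directly: an idle session that still owns a transaction shows up when you join v$session to v$transaction. A sketch, assuming you catch it live:

-- idle sessions that still hold an open transaction (and therefore TX locks)
SELECT s.sid, s.serial#, s.username, s.program, s.status,
       s.last_call_et AS seconds_idle, t.start_time AS txn_start
FROM   v$session s
JOIN   v$transaction t ON t.addr = s.taddr
WHERE  s.status = 'INACTIVE';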

I got a new ASH report starting much earlier than my first ASH report; the reason for doing that was to see if I could get any details about session 22, but I didn’t find anything useful. One reason can be that we get our report from the AWR tables, which only keep every 10th sample. We do not know anything about our session; the SQL that locked the data could have been a very fast execution. If it completed in less than a second we might not even see it in v$active_session_history. I ran a select against dba_hist_active_sess_history to see what info I could get for sid 22 and session_serial# 53359.
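
The query was something along these lines (a sketch, not the exact statement I ran):

SELECT sample_time, sql_id, sql_exec_start, event, session_state
FROM   dba_hist_active_sess_history
WHERE  session_id      = 22
  AND  session_serial# = 53359
ORDER  BY sample_time;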

lck11

I have one SQL executed before we see the locking issue and it is a select statement (the locking issue started at 06:00 AM on November 18). This statement will not lock any rows.
Unfortunately I could not find the root SQL causing this issue; we will see if it happens again. Then we need to be more alert and collect the proper data for our analysis.


I still think the evidence is clear that the issue is related to a design flaw: a session can sit idle with an ongoing transaction for a long time. Eventually the session was killed and the lock was released. We got information from the vendor that they have added a few commit statements in the code. I think this also indicates that the vendor realizes that the problem is application related. If you want more indications that it is application related, you can have a look at an ADDM report.

lck8.png

It says to check the application logic to find the contention, and it points to our session 22 as the cause of the issue, which is the same information we found in our ASH report.
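
If you want to produce the ADDM report yourself, the standard script works much like the ASH one; it prompts for the snapshot range covering the incident (a sketch, run as a DBA in SQL*Plus):

-- pick the begin and end snapshots around the incident when prompted
@?/rdbms/admin/addmrpt.sql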

Thanks for reading !

References:

Waits for ‘Enq: TX – …’ Type Events – Transaction (TX) Lock Example Scenarios
(Doc ID 62354.1)


One thought on “Enq TX for a long time.”

  1. Hi Magnus,

    A suggestion, if you haven’t already tried it. In case you have supplemental logging enabled (https://docs.oracle.com/database/122/SUTIL/oracle-logminer-utility.htm#SUTIL1583), and you still have the archived redo logs available for the problematic period, then one possibility to find what session 22 did in the specific problematic transaction might be to run a LogMiner session across those archived redo logs.

    Regards,
    Jure Bratina
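
For reference, a minimal sketch of the LogMiner approach Jure suggests, assuming supplemental logging was enabled and the archived log covering the transaction is still available (the file name below is only a placeholder):

BEGIN
  DBMS_LOGMNR.ADD_LOGFILE(
    logfilename => '/arch/1_12345_987654321.arc',   -- placeholder archived log
    options     => DBMS_LOGMNR.NEW);
  DBMS_LOGMNR.START_LOGMNR(
    options => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
/

-- what session 22 changed, and the SQL that would redo it
SELECT scn, timestamp, operation, seg_owner, table_name, sql_redo
FROM   v$logmnr_contents
WHERE  session# = 22;

EXEC DBMS_LOGMNR.END_LOGMNR;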
