Bug 998945

Summary:

Too many statement timeout exceptions in server log

Product:

[Other] RHQ Project

Reporter:

Armine Hovsepyan <ahovsepy>

Component:

Core Server

Assignee:

Jay Shaughnessy <jshaughn>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Mike Foley <mfoley>

Severity:

urgent

Docs Contact:

Priority:

high

Version:

unspecified

CC:

genman, hrupp, mfoley

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-12-21 14:07:58 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

999165

Bug Blocks:

Attachments:

Description	Flags
statement-timeout-server.log	none

Description Armine Hovsepyan 2013-08-20 12:03:22 UTC

Created attachment 788460 [details]
statement-timeout-server.log

Description of problem:
Too many statement timeout exceptions in server log

Version-Release number of selected component (if applicable):
rhq :  3c8110a

How reproducible:


Steps to Reproduce:
1. install rhq server with storage and agent
2. install new storage node with agent and connect to server
3.

Actual results:
too many statement timeout exceptions in the log after both step 1 and step 2

Expected results:
no statement timeout exception in server.logs

Additional info:
log attached

Comment 1 Armine Hovsepyan 2013-08-21 09:09:41 UTC

rising severity according ti bug #999165

Comment 2 Heiko W. Rupp 2013-12-09 13:38:57 UTC

In the postgres log with statement logging enabled for statements taking more than 1.5seconds:

LOG:  duration: 110801.341 ms  execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
DETAIL:  parameters: $1 = '1386595959475', $2 = '10001'
LOG:  duration: 110806.233 ms  execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
DETAIL:  parameters: $1 = '1386595959479', $2 = '10001'
LOG:  duration: 110813.257 ms  execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
DETAIL:  parameters: $1 = '1386595959483', $2 = '10001'
LOG:  duration: 110659.414 ms  execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
DETAIL:  parameters: $1 = '1386595959645', $2 = '10001'
LOG:  duration: 79007.073 ms  execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_PING=$1, BACKFILLED=false where NAME=$2
DETAIL:  parameters: $1 = '1386595991303', $2 = 'snert'
LOG:  duration: 110645.288 ms  execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
DETAIL:  parameters: $1 = '1386595959678', $2 = '10001'
LOG:  duration: 94849.078 ms  execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
DETAIL:  parameters: $1 = '1386595975495', $2 = '10001'


I have 2 agents in that table

Comment 3 Heiko W. Rupp 2013-12-09 13:54:09 UTC

More database output:

2nd entry after the date is the transaction Id. Looks like we may lock the table for much too long and have thus an issue
with subsequent reports coming in

2013-12-09 14:49:23.775 CET 8995618:LOG:  process 33089 still waiting for ShareLock on transaction 8995615 after 1000.086 ms
2013-12-09 14:49:23.776 CET 8995618:STATEMENT:  update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
2013-12-09 14:50:09.789 CET 8995627:LOG:  process 33302 still waiting for ExclusiveLock on tuple (1,15) of relation 464991 of database 437690 after 1000.322 ms
2013-12-09 14:50:09.789 CET 8995627:STATEMENT:  update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
2013-12-09 14:50:46.681 CET 0:LOG:  checkpoints are occurring too frequently (4 seconds apart)
2013-12-09 14:50:46.681 CET 0:HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2013-12-09 14:50:50.990 CET 0:LOG:  checkpoints are occurring too frequently (4 seconds apart)
2013-12-09 14:50:50.990 CET 0:HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2013-12-09 14:50:55.317 CET 0:LOG:  checkpoints are occurring too frequently (5 seconds apart)
2013-12-09 14:50:55.317 CET 0:HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2013-12-09 14:50:55.801 CET 8995656:LOG:  process 33345 still waiting for ExclusiveLock on tuple (1,15) of relation 464991 of database 437690 after 1000.176 ms
2013-12-09 14:50:55.802 CET 8995656:STATEMENT:  update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
2013-12-09 14:50:58.587 CET 0:LOG:  checkpoints are occurring too frequently (3 seconds apart)
2013-12-09 14:50:58.587 CET 0:HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2013-12-09 14:51:01.574 CET 8995618:LOG:  process 33089 acquired ShareLock on transaction 8995615 after 98799.009 ms
2013-12-09 14:51:01.574 CET 8995618:STATEMENT:  update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
2013-12-09 14:51:01.575 CET 8995627:LOG:  process 33302 acquired ExclusiveLock on tuple (1,15) of relation 464991 of database 437690 after 52786.340 ms
2013-12-09 14:51:01.575 CET 8995627:STATEMENT:  update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
2013-12-09 14:51:01.575 CET 8995656:LOG:  process 33345 acquired ExclusiveLock on tuple (1,15) of relation 464991 of database 437690 after 6773.574 ms
2013-12-09 14:51:01.575 CET 8995656:STATEMENT:  update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
2013-12-09 14:51:01.575 CET 8995618:LOG:  duration: 98800.166 ms  execute S_74: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
2013-12-09 14:51:01.575 CET 8995618:DETAIL:  parameters: $1 = '1386596962774', $2 = '10001'
2013-12-09 14:51:01.582 CET 8995627:LOG:  duration: 52793.416 ms  execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
2013-12-09 14:51:01.582 CET 8995627:DETAIL:  parameters: $1 = '1386597008787', $2 = '10001'
2013-12-09 14:51:01.584 CET 8995656:LOG:  duration: 6782.346 ms  execute S_63: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2
2013-12-09 14:51:01.584 CET 8995656:DETAIL:  parameters: $1 = '1386597054800', $2 = '10001'

Comment 4 Jay Shaughnessy 2013-12-09 20:09:09 UTC

commit d01bb14a36a1121d377dc2445d1456c611c37007
Author: Jay Shaughnessy <jshaughn>
Date:   Mon Dec 9 14:13:52 2013 -0500

 I did a visual inspection of code looking for things that may
 contribute to the slow updates and/or timeouts reported in these BZs. In
 general the updates for availReports (infrequent but can spike in
 certain scenarios, 1 per avail report) and availPing
 (frequent, 1 per minute per agent) should be very fast and not in
 themselves be an issue. Slowness in these queries is likely caused by the
 Agent table being locked by some other longer running transaction.
 Everything looked pretty clean with one exception. When backfilling an agent
 due to an agent shutdown or suspect job detection, it looked like we
 may have locked th agent table for the duration of the backfilling, which
 can be a fairly large operation for a big agent inventory.  I'm not sure
 this is the problem, but this commit should allevaite the Agent table
 locking while the backfilling is performed.

 Also:
 - move some inline JPA updates on the Agent to be NamedQuery in the Agent
   entity.  No need to build those every time we update an agent ping time.

Comment 5 Elias Ross 2014-01-09 01:48:48 UTC

I've seen a similar issue with Oracle.

/* Formatted on 1/8/2014 5:39:59 PM (QP5 v5.185.11230.41888) */
UPDATE RHQ_AGENT
   SET ADDRESS = :1,
       AFFINITY_GROUP_ID = :2,
       AGENTTOKEN = :3,
       BACKFILLED = :4,
       CTIME = :5,
       LAST_AVAILABILITY_PING = :6,
       LAST_AVAILABILITY_REPORT = :7,
       MTIME = :8,
       NAME = :9,
       PORT = :10,
       REMOTE_ENDPOINT = :11,

This is the query I got back from the DBA that's causing trouble. I have close to 2000 agents.

Comment 7 Jay Shaughnessy 2015-12-21 14:07:58 UTC

Closing as fixed, inactive.