Bug 998945
Summary: | Too many statement timeout exceptions in server log | ||||||
---|---|---|---|---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Armine Hovsepyan <ahovsepy> | ||||
Component: | Core Server | Assignee: | Jay Shaughnessy <jshaughn> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | high | ||||||
Version: | unspecified | CC: | genman, hrupp, mfoley | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2015-12-21 14:07:58 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 999165 | ||||||
Bug Blocks: | |||||||
Attachments: |
|
rising severity according ti bug #999165 In the postgres log with statement logging enabled for statements taking more than 1.5seconds: LOG: duration: 110801.341 ms execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 DETAIL: parameters: $1 = '1386595959475', $2 = '10001' LOG: duration: 110806.233 ms execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 DETAIL: parameters: $1 = '1386595959479', $2 = '10001' LOG: duration: 110813.257 ms execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 DETAIL: parameters: $1 = '1386595959483', $2 = '10001' LOG: duration: 110659.414 ms execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 DETAIL: parameters: $1 = '1386595959645', $2 = '10001' LOG: duration: 79007.073 ms execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_PING=$1, BACKFILLED=false where NAME=$2 DETAIL: parameters: $1 = '1386595991303', $2 = 'snert' LOG: duration: 110645.288 ms execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 DETAIL: parameters: $1 = '1386595959678', $2 = '10001' LOG: duration: 94849.078 ms execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 DETAIL: parameters: $1 = '1386595975495', $2 = '10001' I have 2 agents in that table More database output: 2nd entry after the date is the transaction Id. Looks like we may lock the table for much too long and have thus an issue with subsequent reports coming in 2013-12-09 14:49:23.775 CET 8995618:LOG: process 33089 still waiting for ShareLock on transaction 8995615 after 1000.086 ms 2013-12-09 14:49:23.776 CET 8995618:STATEMENT: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 2013-12-09 14:50:09.789 CET 8995627:LOG: process 33302 still waiting for ExclusiveLock on tuple (1,15) of relation 464991 of database 437690 after 1000.322 ms 2013-12-09 14:50:09.789 CET 8995627:STATEMENT: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 2013-12-09 14:50:46.681 CET 0:LOG: checkpoints are occurring too frequently (4 seconds apart) 2013-12-09 14:50:46.681 CET 0:HINT: Consider increasing the configuration parameter "checkpoint_segments". 2013-12-09 14:50:50.990 CET 0:LOG: checkpoints are occurring too frequently (4 seconds apart) 2013-12-09 14:50:50.990 CET 0:HINT: Consider increasing the configuration parameter "checkpoint_segments". 2013-12-09 14:50:55.317 CET 0:LOG: checkpoints are occurring too frequently (5 seconds apart) 2013-12-09 14:50:55.317 CET 0:HINT: Consider increasing the configuration parameter "checkpoint_segments". 2013-12-09 14:50:55.801 CET 8995656:LOG: process 33345 still waiting for ExclusiveLock on tuple (1,15) of relation 464991 of database 437690 after 1000.176 ms 2013-12-09 14:50:55.802 CET 8995656:STATEMENT: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 2013-12-09 14:50:58.587 CET 0:LOG: checkpoints are occurring too frequently (3 seconds apart) 2013-12-09 14:50:58.587 CET 0:HINT: Consider increasing the configuration parameter "checkpoint_segments". 2013-12-09 14:51:01.574 CET 8995618:LOG: process 33089 acquired ShareLock on transaction 8995615 after 98799.009 ms 2013-12-09 14:51:01.574 CET 8995618:STATEMENT: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 2013-12-09 14:51:01.575 CET 8995627:LOG: process 33302 acquired ExclusiveLock on tuple (1,15) of relation 464991 of database 437690 after 52786.340 ms 2013-12-09 14:51:01.575 CET 8995627:STATEMENT: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 2013-12-09 14:51:01.575 CET 8995656:LOG: process 33345 acquired ExclusiveLock on tuple (1,15) of relation 464991 of database 437690 after 6773.574 ms 2013-12-09 14:51:01.575 CET 8995656:STATEMENT: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 2013-12-09 14:51:01.575 CET 8995618:LOG: duration: 98800.166 ms execute S_74: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 2013-12-09 14:51:01.575 CET 8995618:DETAIL: parameters: $1 = '1386596962774', $2 = '10001' 2013-12-09 14:51:01.582 CET 8995627:LOG: duration: 52793.416 ms execute <unnamed>: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 2013-12-09 14:51:01.582 CET 8995627:DETAIL: parameters: $1 = '1386597008787', $2 = '10001' 2013-12-09 14:51:01.584 CET 8995656:LOG: duration: 6782.346 ms execute S_63: update RHQ_AGENT set LAST_AVAILABILITY_REPORT=$1, BACKFILLED=false where ID=$2 2013-12-09 14:51:01.584 CET 8995656:DETAIL: parameters: $1 = '1386597054800', $2 = '10001' commit d01bb14a36a1121d377dc2445d1456c611c37007 Author: Jay Shaughnessy <jshaughn> Date: Mon Dec 9 14:13:52 2013 -0500 I did a visual inspection of code looking for things that may contribute to the slow updates and/or timeouts reported in these BZs. In general the updates for availReports (infrequent but can spike in certain scenarios, 1 per avail report) and availPing (frequent, 1 per minute per agent) should be very fast and not in themselves be an issue. Slowness in these queries is likely caused by the Agent table being locked by some other longer running transaction. Everything looked pretty clean with one exception. When backfilling an agent due to an agent shutdown or suspect job detection, it looked like we may have locked th agent table for the duration of the backfilling, which can be a fairly large operation for a big agent inventory. I'm not sure this is the problem, but this commit should allevaite the Agent table locking while the backfilling is performed. Also: - move some inline JPA updates on the Agent to be NamedQuery in the Agent entity. No need to build those every time we update an agent ping time. I've seen a similar issue with Oracle. /* Formatted on 1/8/2014 5:39:59 PM (QP5 v5.185.11230.41888) */ UPDATE RHQ_AGENT SET ADDRESS = :1, AFFINITY_GROUP_ID = :2, AGENTTOKEN = :3, BACKFILLED = :4, CTIME = :5, LAST_AVAILABILITY_PING = :6, LAST_AVAILABILITY_REPORT = :7, MTIME = :8, NAME = :9, PORT = :10, REMOTE_ENDPOINT = :11, This is the query I got back from the DBA that's causing trouble. I have close to 2000 agents. Closing as fixed, inactive. |
Created attachment 788460 [details] statement-timeout-server.log Description of problem: Too many statement timeout exceptions in server log Version-Release number of selected component (if applicable): rhq : 3c8110a How reproducible: Steps to Reproduce: 1. install rhq server with storage and agent 2. install new storage node with agent and connect to server 3. Actual results: too many statement timeout exceptions in the log after both step 1 and step 2 Expected results: no statement timeout exception in server.logs Additional info: log attached