1566244 – CertificateRevocationListTask failing with lock wait timeout errors

Summary: CertificateRevocationListTask failing with lock wait timeout errors

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Candlepin
Classification:	Community
Component:	candlepin
Sub Component:
Version:	2.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Chris "Ceiu" Rog
QA Contact:	Katello QA List
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-04-11 20:19 UTC by Tramaine Darby
Modified:	2018-04-13 17:56 UTC
CC List:	5 users
Fixed In Version:	candlepin-2.3.6-1
Clone Of:
Environment:
Last Closed:	2018-04-13 17:56:25 UTC
Embargoed:

Attachments
Stack Trace - lock wait timeout (5.75 KB, text/plain) 2018-04-11 20:19 UTC, Tramaine Darby

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	candlepin candlepin pull 1968	0	None	closed	[M] 1566244: Speculative Fix For Failing CertificateRevocationListTask	2020-10-28 13:57:06 UTC

Description Tramaine Darby 2018-04-11 20:19:36 UTC

Created attachment 1420514
Stack Trace - lock wait timeout

Description of problem: The scheduled CertificateRevocationListTask is consistently failing.  


Version-Release number of selected component (if applicable): 2.2.3


How reproducible: Every time


Steps to Reproduce:
1. run the job or wait for the scheduled run

Actual results:
The CertificateRevocationListTask had not run since mid-November
because its next_fire_time in QRTZ_TRIGGERS was never fired on
schedule, and it stayed that way until recently. We had Anil run a
query on the table to correct the next_fire_times. Now
CertificateRevocationListTask runs as scheduled (taking
approximately 12 each time), but it always ends with an SQLException
"Lock wait timeout exceeded" [1] (see the attached stack trace).

Judging by the attached stack trace, the problem appears to be
related to CrlFileUtil line 316, where
certificateSerialCurator.getExpiredSerials() is called. I'm not sure
why, but that appears to be where the task hangs.
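
To tell this failure apart from other SQLExceptions in the logs, the exception's cause chain can be checked for the vendor error code. The following is only a minimal sketch, not Candlepin code. It assumes the backing database is MySQL or MariaDB, where "Lock wait timeout exceeded" is reported as vendor error code 1205. The class and method names are hypothetical.

```java
import java.sql.SQLException;

public class LockWaitTimeoutCheck {
    // Assumption: MySQL/MariaDB backend, which reports
    // "Lock wait timeout exceeded" as vendor error code 1205.
    public static final int MYSQL_LOCK_WAIT_TIMEOUT = 1205;

    /** Returns true if any exception in the cause chain is a lock wait timeout. */
    public static boolean isLockWaitTimeout(Throwable error) {
        // Job frameworks and ORMs usually wrap the SQLException, so walk the causes.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof SQLException
                    && ((SQLException) t).getErrorCode() == MYSQL_LOCK_WAIT_TIMEOUT) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SQLException sql = new SQLException(
            "Lock wait timeout exceeded; try restarting transaction", "HY000", 1205);
        RuntimeException wrapped = new RuntimeException("CRL task failed", sql);

        System.out.println(isLockWaitTimeout(wrapped));                       // prints true
        System.out.println(isLockWaitTimeout(new RuntimeException("other"))); // prints false
    }
}
```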


Expected results:
Successful completion of the job

Additional info:

Comment 1 Chris "Ceiu" Rog 2018-04-11 21:03:08 UTC

While I was unable to reproduce the lock-wait timeout, I was able to reproduce severe performance loss when processing large numbers of revoked serials. Given how the data was being fetched and processed, that slowdown could easily lead to an eventual lock-wait timeout.

As a speculative fix, we've done significant performance tuning of the CRL task, which should help alleviate this situation, and we've restored a cleanup operation that was accidentally removed a couple of years back.

Ideally the combination of those two things will outright fix the problem, or at least make it fast enough that the probability it gets hung up on itself (or another task) is low enough that this is no longer a major concern.
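
The actual changes are in the linked pull request and are not reproduced here. As a general illustration of why faster, smaller units of work lower the chance of a lock-wait timeout, the sketch below splits a large list of serials into fixed-size batches so that each batch could be handled in its own short transaction. This is a hypothetical example under that assumption. The class, the method names, and the batch size are invented for illustration and are not Candlepin APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchedSerialProcessing {
    /** Splits items into consecutive batches of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /**
     * Passes each batch to the handler and returns the number of batches handled.
     * In a real task the handler would open and commit its own transaction,
     * so row locks are held only for the duration of one batch.
     */
    public static <T> int processInBatches(List<T> items, int batchSize,
                                           Consumer<List<T>> handler) {
        int handled = 0;
        for (List<T> batch : partition(items, batchSize)) {
            handler.accept(batch);
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        List<Long> serials = new ArrayList<>();
        for (long i = 1; i <= 10; i++) {
            serials.add(i);
        }
        // Hypothetical handler: just report the batch instead of touching a database.
        int batches = processInBatches(serials, 4,
            batch -> System.out.println("processing " + batch));
        System.out.println(batches + " batches"); // prints "3 batches"
    }
}
```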

Should this issue persist after this fix is in place, we'll need to do more in-depth testing to identify where the lock contention is coming from.
