Bug 1566244 - CertificateRevocationListTask failing with lock wait timeout errors
Summary: CertificateRevocationListTask failing with lock wait timeout errors
Alias: None
Product: Candlepin
Classification: Community
Component: candlepin
Version: 2.2
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Chris "Ceiu" Rog
QA Contact: Katello QA List
Depends On:
Reported: 2018-04-11 20:19 UTC by Tramaine Darby
Modified: 2018-04-13 17:56 UTC
5 users

Fixed In Version: candlepin-2.3.6-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-04-13 17:56:25 UTC

Attachments
Stack Trace - lock wait timeout (5.75 KB, text/plain)
2018-04-11 20:19 UTC, Tramaine Darby

System ID Private Priority Status Summary Last Updated
Github candlepin candlepin pull 1968 0 None closed [M] 1566244: Speculative Fix For Failing CertificateRevocationListTask 2020-10-28 13:57:06 UTC

Description Tramaine Darby 2018-04-11 20:19:36 UTC
Created attachment 1420514 [details]
Stack Trace - lock wait timeout

Description of problem: The scheduled CertificateRevocationListTask is consistently failing.  

Version-Release number of selected component (if applicable): 2.2.3

How reproducible: Every time

Steps to Reproduce:
1. run the job or wait for the scheduled run

Actual results:
The CertificateRevocationListTask had not been running since
mid-November: the next_fire_time values in the QRTZ_TRIGGERS table were
never updated, so the job stayed stuck that way until recently. We had
Anil run a query on the table to correct the next_fire_time values.
Now, CertificateRevocationListTask runs as scheduled (taking
approximately 12 each time), but it always ends with an SQLException,
"Lock wait timeout exceeded" [1] (see the attached stack trace).

Judging by the attached stack trace, the problem appears to be
related to CrlFileUtil line 316, where
certificateSerialCurator.getExpiredSerials() is called. I'm not sure
why that call is the issue, but it appears to be where the job hangs.

Expected results:
Successful completion of the job

Additional info:

Comment 1 Chris "Ceiu" Rog 2018-04-11 21:03:08 UTC
While I was unable to reproduce the lock-wait timeout itself, I was able to reproduce severe performance degradation when processing large numbers of revoked serials, which could easily lead to an eventual lock-wait timeout given how the data was being fetched and processed.

As a speculative fix, we've done significant performance tuning of the CRL task, which should help alleviate this situation, and restored a cleanup operation that was accidentally removed a couple of years ago.

Ideally, the combination of those two changes will fix the problem outright, or at least make the task fast enough that the probability of it blocking on itself (or another task) is low enough that this is no longer a major concern.

Should this issue persist after the fix is in place, we'll need to do more in-depth testing to identify where the contention is occurring.
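One common way to reduce lock-wait pressure when a task processes a large number of expired serials is to handle them in small batches, each in its own short transaction, so no single statement holds row locks for long. The sketch below is a minimal illustration of that idea only; the SerialBatcher class, partition helper, batch size, and table name are all assumptions for the example, not Candlepin's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split a large set of expired serial IDs into
// fixed-size batches so each delete holds its row locks only briefly.
public class SerialBatcher {

    // Split 'ids' into consecutive sublists of at most 'batchSize' elements.
    public static List<List<Long>> partition(List<Long> ids, int batchSize) {
        List<List<Long>> batches = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                ids.subList(i, Math.min(i + batchSize, ids.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> expired = new ArrayList<>();
        for (long id = 1; id <= 2500; id++) {
            expired.add(id);
        }

        // Each batch would be deleted in its own short transaction, e.g.
        // "DELETE FROM ... WHERE id IN (...)" (schema details assumed).
        for (List<Long> batch : partition(expired, 1000)) {
            System.out.println("deleting " + batch.size() + " expired serials");
        }
    }
}
```

With 2,500 serials and a batch size of 1,000, this processes two full batches and one partial batch; the point is that each delete finishes quickly instead of one long-running operation contending with other jobs for the duration of the run.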
