Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1566244

Summary: CertificateRevocationListTask failing with lock wait timeout errors
Product: [Community] Candlepin (Migrated to Jira)
Reporter: Tramaine Darby <tdarby>
Component: candlepin
Assignee: Chris "Ceiu" Rog <crog>
Status: CLOSED CURRENTRELEASE
QA Contact: Katello QA List <katello-qa-list>
Severity: unspecified
Priority: unspecified
Version: 2.2
CC: awood, crog, khowell, redakkan, skallesh
Hardware: Unspecified
OS: Unspecified
Fixed In Version: candlepin-2.3.6-1
Last Closed: 2018-04-13 17:56:25 UTC
Type: Bug
Attachments:
Stack Trace - lock wait timeout (flags: none)

Description Tramaine Darby 2018-04-11 20:19:36 UTC
Created attachment 1420514
Stack Trace - lock wait timeout

Description of problem: The scheduled CertificateRevocationListTask is consistently failing.  


Version-Release number of selected component (if applicable): 2.2.3


How reproducible: Every time


Steps to Reproduce:
1. Run the job or wait for the scheduled run.

Actual results:
The CertificateRevocationListTask had not been running since
mid-November because its next_fire_time in QRTZ_TRIGGERS was never
fired in time, and it stayed that way until recently. We had Anil run
a query on the table to correct the next_fire_time values. Now
CertificateRevocationListTask runs as scheduled (taking
approximately 12 each time), but it always ends with an SQLException
"Lock wait timeout exceeded" [1] (see the attached stack trace).

Judging by the attached stack trace, the problem may be related to
CrlFileUtil line 316, where
certificateSerialCurator.getExpiredSerials() is called. I'm not sure
why, but that appears to be the hang-up.
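
For anyone checking whether a trigger is stuck with a stale next_fire_time, a query along these lines may help. This is only a sketch, not the query that was actually run: it assumes the standard Quartz JDBC job store layout, where QRTZ_TRIGGERS.NEXT_FIRE_TIME holds epoch milliseconds, and it uses an in-memory SQLite table as a stand-in for the real database. The trigger names, the grace period, and the helper find_overdue_triggers are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def to_ms(dt):
    """Convert an aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def find_overdue_triggers(conn, now_ms, grace_ms=60_000):
    """Return (trigger_name, next_fire_time) rows that should already have fired.

    A trigger counts as overdue when its NEXT_FIRE_TIME is more than
    grace_ms behind now_ms. Both values are epoch milliseconds.
    """
    cur = conn.execute(
        "SELECT TRIGGER_NAME, NEXT_FIRE_TIME FROM QRTZ_TRIGGERS "
        "WHERE NEXT_FIRE_TIME IS NOT NULL AND NEXT_FIRE_TIME < ? "
        "ORDER BY NEXT_FIRE_TIME",
        (now_ms - grace_ms,),
    )
    return cur.fetchall()


# Stand-in table with only the columns this sketch needs (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE QRTZ_TRIGGERS ("
    "TRIGGER_NAME TEXT, TRIGGER_GROUP TEXT, "
    "TRIGGER_STATE TEXT, NEXT_FIRE_TIME INTEGER)"
)
now_ms = to_ms(datetime(2018, 4, 11, tzinfo=timezone.utc))
conn.executemany(
    "INSERT INTO QRTZ_TRIGGERS VALUES (?, ?, ?, ?)",
    [
        # Stuck since mid-November: fire time is far in the past.
        ("crl_task_trigger", "cron group", "WAITING",
         to_ms(datetime(2017, 11, 15, tzinfo=timezone.utc))),
        # Healthy trigger: fire time is still in the future.
        ("other_task_trigger", "cron group", "WAITING",
         to_ms(datetime(2018, 4, 12, tzinfo=timezone.utc))),
    ],
)

overdue = find_overdue_triggers(conn, now_ms)
for name, next_fire in overdue:
    print(name, next_fire)
```

Against a real deployment the same SELECT would be run directly on the Candlepin database; any correction of the stored values should be reviewed separately, since this sketch only reads.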


Expected results:
Successful completion of the job

Additional info:

Comment 1 Chris "Ceiu" Rog 2018-04-11 21:03:08 UTC
While I was unable to reproduce the lock-wait timeout, I was able to reproduce some severe performance loss associated with processing large amounts of revoked serials that could easily lead to an eventual lock-wait given how the data was being fetched and processed.

As a speculative fix, we've done some significant performance tuning of the CRL task, which should help alleviate this situation, and restored a cleanup operation that was accidentally removed a couple of years back.
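
To illustrate the general idea of processing a large set of serials in bounded chunks rather than in one pass, here is a minimal sketch. It is not the actual Candlepin change: the function names, the batch size, and the in-memory set standing in for the CRL contents are all hypothetical, and a real implementation would stream serials from the database.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Set


def batched(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def remove_expired(crl_serials: Set[int], expired: Iterable[int],
                   batch_size: int = 1000) -> int:
    """Drop expired serials from `crl_serials` one batch at a time.

    `expired` may be a lazy iterator, so only `batch_size` serials are
    held in memory at once. Returns how many serials were removed.
    """
    removed = 0
    for batch in batched(expired, batch_size):
        before = len(crl_serials)
        crl_serials.difference_update(batch)
        removed += before - len(crl_serials)
    return removed


# Hypothetical data: serials 0-9 are on the CRL; 2, 3 and 7 have expired.
# Serial 42 is expired but was never on the CRL, so it is not counted.
crl = set(range(10))
removed_count = remove_expired(crl, iter([2, 3, 42, 7]), batch_size=2)
print(removed_count, sorted(crl))
```

Keeping each unit of work small is one way to shorten how long any single step holds resources, which is the kind of effect the tuning described above is aiming for.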

Ideally the combination of those two things will outright fix the problem, or at least make it fast enough that the probability it gets hung up on itself (or another task) is low enough that this is no longer a major concern.

Should this issue persist after this fix is in place, we'll need to do more in-depth testing to identify what contentions we have in place.