Created attachment 1420514 [details]
Stack Trace - lock wait timeout
Description of problem: The scheduled CertificateRevocationListTask is consistently failing.
Version-Release number of selected component (if applicable): 2.2.3
How reproducible: Every time
Steps to Reproduce:
1. Run the job, or wait for the scheduled run.
Actual results: The CertificateRevocationListTask hasn't run since
mid-November because the trigger's next_fire_time in QRTZ_TRIGGERS was
never fired on time, and it stayed that way until recently. We had
Anil run a query on the table to correct the next_fire_time values.
Now CertificateRevocationListTask runs as scheduled (taking
approximately 12 each time), but it always ends with an SQLException:
"Lock wait timeout exceeded" (see the attached stack trace).
Judging by the stack trace, it looks like the problem may be
related to CrlFileUtil line 316, where
certificateSerialCurator.getExpiredSerials() is called. I'm not sure
why that is, but that appears to be the hang-up.
Expected results: Successful completion of the job
While I was unable to reproduce the lock wait timeout, I was able to reproduce severe performance loss when processing large numbers of revoked serials. Given how the data was being fetched and processed, that could easily lead to an eventual lock wait timeout.
As a speculative fix, we've done some significant performance tuning of the CRL task, which should help alleviate this situation, and restored a cleanup operation that was accidentally removed a couple of years back.
Ideally the combination of those two things will outright fix the problem, or at least make it fast enough that the probability it gets hung up on itself (or another task) is low enough that this is no longer a major concern.
Should this issue persist after this fix is in place, we'll need to do more in-depth testing to identify which lock contentions are involved.
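
For anyone following along, here is a rough sketch of the general idea behind this kind of tuning: work through a large list of serials in small, fixed-size batches, so that no single database operation holds its locks for long. This is not the actual change that went into the CRL task. The class name SerialBatcher, both method names, and the batch size are hypothetical, and the handler stands in for whatever per-batch database work is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only; not part of the real code base.
class SerialBatcher {

    // Splits the input into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    // Hands each batch to the handler in order and returns the number of
    // batches processed. In a real task the handler would do its work in a
    // short transaction of its own (an assumption, not shown here).
    static <T> int processInBatches(List<T> items, int batchSize,
                                    Consumer<List<T>> handler) {
        int processed = 0;
        for (List<T> batch : partition(items, batchSize)) {
            handler.accept(batch);
            processed++;
        }
        return processed;
    }
}
```

Calling SerialBatcher.processInBatches(serials, 1000, batch -> ...) would hand the serials to the handler 1000 at a time. The trade-off is more round trips in exchange for shorter-lived locks, so a suitable batch size would have to be found by testing against realistic data.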