Created attachment 1420514
Stack Trace - lock wait timeout

Description of problem:
The scheduled CertificateRevocationListTask is consistently failing.

Version-Release number of selected component (if applicable):
2.2.3

How reproducible:
Every time

Steps to Reproduce:
1. run the job or wait for the scheduled run

Actual results:
The CertificateRevocationListTask hasn't been running since mid-November because QRTZ_TRIGGERS next_fire_time never got fired in time, and it just stayed that way until recently. We had Anil run a query on the table to get the next_fire_times corrected. Now, CertificateRevocationListTask runs as scheduled (taking approximately 12 each time), but it always ends with an SQLException "Lock wait timeout exceeded" [1] (see attached logged stack trace).

Judging by the stack trace (attached), looks like the problem may be related to CrlFileUtil line 316, where certificateSerialCurator.getExpiredSerials() is called. I'm not sure why that is, but that appears to be the hang up.

Expected results:
Successful completion of the job

Additional info:
While I was unable to reproduce the lock-wait timeout, I was able to reproduce a severe performance loss when processing large numbers of revoked serials. Given how the data was being fetched and processed, that slowdown could easily lead to an eventual lock-wait timeout.

As a speculative fix, we've done some significant performance tuning of the CRL task, which should help alleviate this situation, and we've restored a cleanup operation that was accidentally removed a couple of years back. Ideally, the combination of those two changes will fix the problem outright, or at least make the task fast enough that the probability of it getting hung up on itself (or on another task) is low enough that this is no longer a major concern.

Should this issue persist after the fix is in place, we'll need to do more in-depth testing to identify where the contention is coming from.
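
For illustration only: the actual tuning changes are not shown in this report. One common way to keep a long-running job from holding row locks for its whole duration is to process a large ID set in bounded batches, so each unit of work (and any transaction around it) stays short. The sketch below assumes that approach; `SerialBatcher`, `partition`, and `processInBatches` are hypothetical names and are not part of Candlepin's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not taken from the CRL task itself.
final class SerialBatcher {
    private SerialBatcher() { }

    // Splits items into consecutive sublists of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    // Hands each batch to the handler in order and returns the batch count.
    // In a real job the handler would typically wrap one short transaction
    // per batch (an assumption, not something the report confirms).
    static <T> int processInBatches(List<T> items, int batchSize,
                                    Consumer<List<T>> handler) {
        int count = 0;
        for (List<T> batch : partition(items, batchSize)) {
            handler.accept(batch);
            count++;
        }
        return count;
    }
}
```

With something like this, a call such as `SerialBatcher.processInBatches(serialIds, 1000, batch -> ...)` would touch at most 1000 serials per unit of work; the batch size is an arbitrary placeholder and would need tuning against the real data.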