Bug 1620226

Summary: CertificateRevocationListTask consumes 100% CPU and Memory for large lists
Product: [Community] Candlepin Reporter: Shayne Riley <sriley>
Component: candlepin Assignee: Michael Stead <mstead>
Status: CLOSED CURRENTRELEASE QA Contact: Katello QA List <katello-qa-list>
Severity: medium Docs Contact:
Priority: high    
Version: 2.3 CC: andrew.schofield, bcourt, khowell, mstead, redakkan, skallesh
Target Milestone: --- Keywords: Triaged
Target Release: 2.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: candlepin-2.6.1-1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-01-23 16:56:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1672706    

Description Shayne Riley 2018-08-22 18:24:54 UTC
Description of problem:
When a CertificateRevocationListTask runs, it eventually drives the node to 100% CPU use and exhausts the available heap, 14GB in this case. The node also becomes unresponsive to any HTTP requests, because the majority of the CPU time is spent in full GC.

Other nodes that can create or run async jobs, such as hypervisor checkins or refreshPoolsJobs, also become unresponsive to HTTP requests, even though they are using barely any CPU and their memory use is fine.

Version-Release number of selected component (if applicable):
2.3.9

How reproducible:
Always, but only in prod.


Steps to Reproduce:
1. Schedule a CertificateRevocationListTask
2. Wait 10-15 minutes
3. Try to make a benign HTTP call, like GET status
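Step 3 can be scripted so the check is repeatable across nodes. The following is only a sketch, not part of the original report: the function name `probe`, the timeout value, and the injectable `opener` parameter are assumptions made for illustration, and the status URL has to be supplied by whoever runs it.

```python
import socket
import time
import urllib.error
import urllib.request


def probe(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Issue one GET against url and report whether the node answered.

    Returns (responsive, detail). `opener` is a hypothetical injection point
    so the function can be exercised without a live server.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered with an error status, so it is still responsive.
        return True, f"HTTP {exc.code}"
    except (urllib.error.URLError, socket.timeout, TimeoutError, ConnectionError) as exc:
        # No answer within the timeout, or the connection failed outright.
        return False, f"no response: {exc}"
    elapsed = time.monotonic() - start
    return True, f"HTTP {status} in {elapsed:.2f}s"
```

Calling `probe()` with each node's status URL once a minute after scheduling the task would show when a node stops answering. A URL such as `https://candlepin.example.com/candlepin/status` is a placeholder, not a value taken from this report.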

Actual results:
The task runs for over two hours and still doesn't complete. The node running it becomes unresponsive due to high GC, and the other "worker" nodes become unresponsive as well (with no GC activity).


Expected results:
CertificateRevocationListTask completes without consuming all 14GB of RAM and without locking out the other worker nodes.


Additional info:
This can be considered the sequel to BZ1566244.