Bug 1620226 - CertificateRevocationListTask consumes 100% CPU and Memory for large lists
Summary: CertificateRevocationListTask consumes 100% CPU and Memory for large lists
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Candlepin
Classification: Community
Component: candlepin
Version: 2.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 2.6
Assignee: Michael Stead
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks: 1672706
 
Reported: 2018-08-22 18:24 UTC by Shayne Riley
Modified: 2019-04-22 15:14 UTC
CC List: 6 users

Fixed In Version: candlepin-2.6.1-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-23 16:56:08 UTC
Embargoed:




Links
System ID: Github candlepin/candlepin pull 2204
Status: closed
Summary: 1620226: Perform CRL generation without exceeding memory and maxing CPU
Last Updated: 2020-06-01 10:21:24 UTC

Description Shayne Riley 2018-08-22 18:24:54 UTC
Description of problem:
When a CertificateRevocationListTask runs, it eventually hits 100% CPU use and maxes out the available heap (14GB in this case). The node also becomes unresponsive to any HTTP requests, as the majority of the CPU time is spent in full GC.

Additionally, other nodes that create/run async jobs, such as hypervisor check-ins or refreshPoolsJobs, become unresponsive to HTTP requests as well, even though they are barely using any CPU and their memory use is fine.
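For illustration only (not from the original report): one way to confirm that a node's CPU time is dominated by garbage collection is to sample the JVM's GarbageCollectorMXBeans. The sketch below assumes it runs inside the affected JVM; for a deployed Candlepin node you would attach over JMX or use jstat instead.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeSampler {
    public static void main(String[] args) throws InterruptedException {
        long last = 0;
        while (true) {
            long totalGcMillis = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                totalGcMillis += gc.getCollectionTime(); // cumulative GC time in ms
            }
            // If this prints close to 1000 every second, the JVM is spending nearly
            // all of its time in GC, matching the behavior described above.
            System.out.println("GC ms during last second: " + (totalGcMillis - last));
            last = totalGcMillis;
            Thread.sleep(1000);
        }
    }
}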

Version-Release number of selected component (if applicable):
2.3.9

How reproducible:
Always, but only in prod.


Steps to Reproduce:
1. Schedule a CertificateRevocationListTask
2. Wait 10-15 minutes
3. Try to make a benign HTTP call, like GET status

Actual results:
The task runs for over two hours and still doesn't complete, the node becomes unresponsive due to heavy GC, and the other "worker" nodes become unresponsive as well (with no GC pressure of their own).


Expected results:
CertificateRevocationListTask completes without consuming all 14GB of RAM and without locking out the other worker nodes.


Additional info:
This can be considered the sequel to BZ1566244.
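
For illustration only (not the actual change in pull 2204): a minimal sketch of memory-bounded CRL generation using BouncyCastle's X509v2CRLBuilder. The RevokedSerialSource pager, the batch size, and the revocation dates are hypothetical; the point is simply to add entries in fixed-size batches pulled from the database instead of materializing the entire revocation list in the heap at once.

import java.io.IOException;
import java.math.BigInteger;
import java.security.PrivateKey;
import java.util.Date;
import java.util.List;

import org.bouncycastle.asn1.x500.X500Name;
import org.bouncycastle.asn1.x509.CRLReason;
import org.bouncycastle.cert.X509CRLHolder;
import org.bouncycastle.cert.X509v2CRLBuilder;
import org.bouncycastle.operator.ContentSigner;
import org.bouncycastle.operator.OperatorCreationException;
import org.bouncycastle.operator.jcajce.JcaContentSignerBuilder;

public class StreamingCrlSketch {

    // Hypothetical pager: returns at most 'limit' revoked serials starting at 'offset',
    // or an empty list when there is nothing left to fetch.
    public interface RevokedSerialSource {
        List<BigInteger> fetchRevokedSerials(int offset, int limit);
    }

    public byte[] buildCrl(X500Name issuer, PrivateKey key, RevokedSerialSource source)
        throws OperatorCreationException, IOException {

        X509v2CRLBuilder builder = new X509v2CRLBuilder(issuer, new Date());

        int batchSize = 10_000;  // sized so one batch fits comfortably in the heap
        int offset = 0;
        List<BigInteger> batch;

        // Add CRL entries batch by batch instead of loading every revoked serial at once.
        while (!(batch = source.fetchRevokedSerials(offset, batchSize)).isEmpty()) {
            for (BigInteger serial : batch) {
                // Placeholder revocation date and reason; a real CRL would use the
                // stored revocation metadata for each serial.
                builder.addCRLEntry(serial, new Date(), CRLReason.privilegeWithdrawn);
            }
            offset += batch.size();
        }

        ContentSigner signer = new JcaContentSignerBuilder("SHA256withRSA").build(key);
        X509CRLHolder crl = builder.build(signer);
        return crl.getEncoded();
    }
}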

