Bug 1927532

Summary: Large CRL file operation causes OOM error in Candlepin
Product: Red Hat Satellite
Reporter: Mike McCune <mmccune>
Component: Candlepin
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA
QA Contact: Danny Synk <dsynk>
Severity: high
Priority: high
Version: 6.8.0
CC: dsynk, ehelms, ltran, msunil, nmoumoul, pmendezh, pmoravec, redakkan, saydas
Target Milestone: 6.11.0
Keywords: Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Fixed In Version: candlepin-3.1.31-1, candlepin-4.0.8-1
Doc Type: If docs needed, set a value
Clones: 1928161 1958127 1958128 2027358
Type: Bug
Last Closed: 2022-07-05 14:28:45 UTC
Bug Depends On: 1958127, 1958128

Description Mike McCune 2021-02-10 22:21:41 UTC
Large customer environments occasionally generate a very large CRL file after a major subscription update or refresh, which causes the following error:

```
2021-02-08 12:01:46,929 [thread=QuartzScheduler_Worker-6] [job=CertificateRevocationListTask-911d9c76-5768-4b10-b827-16dfac3c8b78, org=, csid=] ERROR org.quartz.core.JobRunShell - Job cron group.CertificateRevocationListTask-911d9c76-5768-4b10-b827-16dfac3c8b78 threw an unhandled Exception:
java.lang.OutOfMemoryError: input is too large to fit in a byte array
        at com.google.common.io.ByteStreams.toByteArrayInternal(ByteStreams.java:195)
```

The customer in question had a 2.8 GB CRL at:

/var/lib/candlepin/candlepin-crl.crl

This exceeds the 1.99 GB limit for processing this file (the file is read into a single Java byte array, which can hold at most 2^31 - 1 bytes), so the customer has to take manual steps to get past the collection process.
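
Before applying the workaround below, it can help to confirm that the CRL really is over the limit. The script below is a hypothetical pre-check, not a Satellite or Candlepin tool; the function names and the `CRL_FILE` override are invented for this sketch. It compares the file size with 2^31 - 1 bytes (2147483647), the largest length a Java array can have; Guava's own cutoff may be a few bytes lower, so treat the threshold as approximate. It assumes GNU `stat`, as shipped on RHEL.

```shell
#!/bin/sh
# Hypothetical pre-check (not a Satellite/Candlepin tool): is the CRL too big
# to be read into one Java byte array? Java arrays are int-indexed, so the
# ceiling is 2^31 - 1 bytes; Guava's exact cutoff may be slightly lower.
JAVA_MAX_ARRAY_BYTES=2147483647
CRL_FILE=${CRL_FILE:-/var/lib/candlepin/candlepin-crl.crl}

# Succeeds (exit 0) when the given size in bytes exceeds the ceiling.
crl_too_large() {
  [ "$1" -gt "$JAVA_MAX_ARRAY_BYTES" ]
}

# Report whether the CRL at $CRL_FILE can still be loaded.
check_crl() {
  if [ ! -f "$CRL_FILE" ]; then
    echo "no CRL at $CRL_FILE"
    return 0
  fi
  size=$(stat -c %s "$CRL_FILE")   # GNU coreutils: size in bytes
  if crl_too_large "$size"; then
    echo "CRL is $size bytes: too large to load, apply the workaround"
  else
    echo "CRL is $size bytes: below the byte-array limit"
  fi
}

check_crl
```

A 2.8 GB file, like the one in this report, is roughly 3,000,000,000 bytes and therefore well past the ceiling.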

WORKAROUND:

1) Stop services:

# satellite-maintain service stop

2) Start PostgreSQL (required for the database update in step 4):

# systemctl start postgresql

3) Move the CRL out of the way:

# mv /var/lib/candlepin/candlepin-crl.crl /var/lib/candlepin/candlepin-crl.BAK

4) Update the database, marking all revoked serials as collected:

# echo "UPDATE cp_cert_serial SET collected=true WHERE revoked=true;" | sudo -u postgres psql -d candlepin
UPDATE 134330

5) Start services and resume operations:

# satellite-maintain service start
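
The five steps can also be run as one script, so that a failing step stops the sequence instead of continuing with the next one. This is a sketch, not a supported Satellite tool: it assumes it is run as root on the Satellite server, and the `DRY_RUN` variable is an invention of this example (the default, `DRY_RUN=1`, only prints the commands). It passes the SQL with `psql -c` instead of piping it through `echo`; the statement itself is unchanged.

```shell
#!/bin/sh
# Sketch of the CRL workaround above as a single script (hypothetical, not
# shipped with Satellite). Run as root. DRY_RUN=1 (default) only prints the
# commands; DRY_RUN=0 executes them. set -e stops at the first failure.
set -eu

DRY_RUN=${DRY_RUN:-1}
CRL=/var/lib/candlepin/candlepin-crl.crl
CRL_BAK=/var/lib/candlepin/candlepin-crl.BAK
SQL="UPDATE cp_cert_serial SET collected=true WHERE revoked=true;"

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = 0 ]; then
    "$@"
  fi
}

crl_workaround() {
  run satellite-maintain service stop               # 1) stop services
  run systemctl start postgresql                    # 2) start PostgreSQL only
  run mv "$CRL" "$CRL_BAK"                          # 3) move the CRL aside
  run sudo -u postgres psql -d candlepin -c "$SQL"  # 4) update the database
  run satellite-maintain service start              # 5) resume operations
}

crl_workaround
```

Review the printed commands first, then rerun with `DRY_RUN=0` to apply them.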

Comment 2 Nikos Moumoulidis 2021-09-02 10:32:07 UTC
An update on how this issue will be resolved: we have a solution currently under review that will remove the CertificateRevocationListTask job entirely and replace it with a new job called CertificateCleanupJob, which will run periodically and:
- Will no longer generate a CRL file.
- Will revoke all expired but not yet revoked identity and SCA certificates. These can pile up when hosts register and then never unregister: as time passes, the certificates expire and are therefore invalid, but they are never revoked.
- Will delete all certificate serials that are both expired and revoked. This covers serials of all three certificate types: identity, SCA, and entitlement.
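
The per-certificate rules in this comment can be summarized as a small decision function. This is a hypothetical illustration of the rules as stated above, not the actual CertificateCleanupJob code (which lives in Candlepin's Java code base); the `cert_action` name, its arguments, and its outputs are invented for this sketch.

```shell
#!/bin/sh
# Hypothetical sketch of the cleanup rules described in comment 2; NOT the
# real CertificateCleanupJob implementation.
# Usage: cert_action TYPE EXPIRED REVOKED
#   TYPE:    identity | sca | entitlement
#   EXPIRED: yes | no
#   REVOKED: yes | no
# Prints: delete-serial, revoke, or keep.
cert_action() {
  cert_type=$1 expired=$2 revoked=$3
  if [ "$expired" = yes ] && [ "$revoked" = yes ]; then
    echo delete-serial              # expired and revoked: any cert type
  elif [ "$expired" = yes ]; then
    case "$cert_type" in
      identity|sca) echo revoke ;;  # expired but never revoked
      *) echo keep ;;               # the revoke rule names identity/SCA only
    esac
  else
    echo keep                       # still valid: nothing to do
  fi
}
```

For a host that registered and never unregistered, `cert_action identity yes no` prints `revoke`; once that certificate is revoked, the same serial matches the delete rule (`cert_action identity yes yes` prints `delete-serial`).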

Comment 4 Danny Synk 2022-01-12 15:15:14 UTC
Verified on Satellite 7.0.0, snap 4 running on RHEL 7 and RHEL 8.

Steps to Test:

1. Add the following line to /etc/candlepin/candlepin.conf to set the CRLUpdateJob to run every 3 minutes:

candlepin.async.jobs.CRLUpdateJob.schedule=0 0/3 * * * ?

2. Restart the Tomcat service:

# systemctl restart tomcat

3. Register a host to Satellite.

4. Verify that /var/lib/candlepin/candlepin-crl.crl is not present.

5. Follow the Candlepin log and wait for the CRLUpdateJob to attempt to run.

6. Between two runs of the job, attach a subscription to the host registered in step 3, then immediately remove that subscription.

7. After the job runs again, verify that the CRL file is still not present.
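
Parts of these steps can be scripted. The helpers below are a hypothetical sketch: the paths come from this bug, but the function names and the `CONF`/`CRL`/`LOG` overrides are invented for the example. The error count matches only the `JobManager` ERROR line from the log excerpt in this comment, since the exception line repeats the same message. The cron comment reflects the Quartz field order (seconds first), which is why the schedule has six fields.

```shell
#!/bin/sh
# Hypothetical helpers for the verification steps above; paths are the ones
# used in this bug, function names are invented for this sketch.
CONF=${CONF:-/etc/candlepin/candlepin.conf}
CRL=${CRL:-/var/lib/candlepin/candlepin-crl.crl}
LOG=${LOG:-/var/log/candlepin/candlepin.log}

# Quartz cron fields: seconds minutes hours day-of-month month day-of-week.
# "0 0/3 * * * ?" fires at second 0 of every third minute; "?" means
# "no specific value" for day-of-week.
SCHEDULE_LINE='candlepin.async.jobs.CRLUpdateJob.schedule=0 0/3 * * * ?'

# Step 1: append the schedule line only if it is not already present.
add_schedule() {
  grep -qxF "$SCHEDULE_LINE" "$CONF" 2>/dev/null ||
    printf '%s\n' "$SCHEDULE_LINE" >> "$CONF"
}

# Steps 4 and 7: the CRL file must not exist.
check_no_crl() {
  if [ -e "$CRL" ]; then
    echo "FAIL: $CRL exists"
    return 1
  fi
  echo "PASS: no CRL file"
}

# Count the JobManager errors logged for the removed CRLUpdateJob.
count_crl_job_errors() {
  grep -cF 'ERROR org.candlepin.async.JobManager - No registered job class for job: CRLUpdateJob' "$LOG" || true
}
```

After step 6 and another scheduled run, `check_no_crl` should still print `PASS`, and `count_crl_job_errors` should keep increasing as the job keeps firing.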

Expected Results:

The CRL file is not present when Satellite is installed, and the file is not created when the CRLUpdateJob is triggered.

Actual Results:

The CRL file is not present when Satellite is installed, and the file is not created when the CRLUpdateJob is triggered.

The attempted job run results in an error in /var/log/candlepin/candlepin.log:

```
2022-01-11 14:57:00,006 [thread=QuartzScheduler_Worker-14] [=, org=, csid=] INFO  org.candlepin.async.JobManager - Job queued: AsyncJobStatus [id: 8a81828b7e4a0179017e4ab716e4096f, name: CRLUpdateJob, key: CRLUpdateJob, state: QUEUED]
2022-01-11 14:57:00,048 [thread=Thread-151 (ActiveMQ-client-global-threads)] [job=8a81828b7e4a0179017e4ab716e4096f, job_key=CRLUpdateJob, org=, csid=] ERROR org.candlepin.async.JobManager - No registered job class for job: CRLUpdateJob
2022-01-11 14:57:00,048 [thread=Thread-151 (ActiveMQ-client-global-threads)] [=, org=, csid=] ERROR org.candlepin.async.JobMessageReceiver - Job processing failed terminally; committing job message as acknowledged: Message [id: 1292, address: job, body: {"jobId":"8a81828b7e4a0179017e4ab716e4096f","jobKey":"CRLUpdateJob"}]
org.candlepin.async.JobInitializationException: No registered job class for job: CRLUpdateJob
```

This error is expected: the CRLUpdateJob was removed from Candlepin and replaced with the CertificateCleanupJob in https://github.com/candlepin/candlepin/pull/3078, so no job class is registered under that name anymore.

Comment 6 Nikos Moumoulidis 2022-03-25 14:53:10 UTC
*** Bug 1999089 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2022-07-05 14:28:45 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.11 Release) and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498