Bug 2229095

Summary: CertificateCleanupJob fails with foreign key constraint violation on table cp_certificate
Product: Red Hat Satellite Reporter: Alexey Masolov <amasolov>
Component: CandlepinAssignee: Barnaby Court <bcourt>
Status: CLOSED ERRATA QA Contact: Shweta Singh <shwsingh>
Severity: urgent Docs Contact:
Priority: high    
Version: 6.11.5CC: ahumbe, alsouza, hyu, jpasqual, momran, nmoumoul, pmoravec, rlavi, saydas, shwsingh, snemeth, zhunting
Target Milestone: 6.15.0Keywords: PrioBumpGSS, Triaged, WorkAround
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: candlepin-4.3.12-1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2231503 2256950 2256951 2266145 (view as bug list) Environment:
Last Closed: 2024-04-23 17:12:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2231503, 2234486, 2256950, 2256951    
Bug Blocks:    

Description Alexey Masolov 2023-08-04 06:16:55 UTC
Description of problem:

On a very loaded system "CertificateCleanupJob" job shows following error:

2023-08-03 17:45:22,829 [thread=Thread-3 (ActiveMQ-client-global-threads)] [job=8a35b3cf89b9b7aa0189bb44de043b1b, job_key=CertificateCleanupJob, org=, csid=] WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper - SQL Error: 0, SQLState: 23503
2023-08-03 17:45:22,829 [thread=Thread-3 (ActiveMQ-client-global-threads)] [job=8a35b3cf89b9b7aa0189bb44de043b1b, job_key=CertificateCleanupJob, org=, csid=] ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper - ERROR: update or delete on table "cp_cert_serial" violates foreign key constraint "fk16370c54b9410fc" on table "cp_certificate"
  Detail: Key (id)=(1298353109733755480) is still referenced from table "cp_certificate".
2023-08-03 17:45:22,840 [thread=Thread-3 (ActiveMQ-client-global-threads)] [job=8a35b3cf89b9b7aa0189bb44de043b1b, job_key=CertificateCleanupJob, org=, csid=] ERROR org.candlepin.async.JobManager - Job "CertificateCleanupJob" failed in 20722815ms

Version-Release number of selected component (if applicable):
Satellite 6.11

Comment 4 William Poteat 2023-08-04 14:17:41 UTC
Can I get a database dump for this?

Comment 15 William Poteat 2023-08-15 13:53:01 UTC
Then certificates that are not yet expired but have been revoked can be used to retrieve content.

What specific performance is currently affected? Deleting rows with no concrete justification is not advisable.

If there are rows that the job is not removing, they are there for a reason.

Comment 22 Pavel Moravec 2023-12-07 09:07:56 UTC
My understanding is there are two issues behind this BZ, foreign key error and performance:


1) CertificateCleanupJob fails with foreign key constraint error:

ERROR: update or delete on table "cp_cert_serial" violates foreign key constraint "fk16370c54b9410fc" on table "cp_certificate"
  Detail: Key (id)=(1298353109733755480) is still referenced from table "cp_certificate".

Do we need a reproducer for this? Is this error acknowledged to be fixed, or is it recommended to just clean some redundant certs?

For the workaround / cleanup: what is the right command? Is it the

su - postgres -c "psql candlepin -c \"delete from cp_certificate where id in (select c.id as certificate_id from cp_cert_serial cs inner join cp_certificate c on c.serial_id = cs.id where cs.revoked = 't');\""

one?


2) Independent problem is the performance: deleting hundreds of thousands (or millions) of such certs takes hours to days what makes candlepin unresponsive during that time.

Is this planned to be fixed? What indices can be added as a workaround - and do we need to remove them before an upgrade (that would, as a part of fix, try to add them and re-adding an index would fail and halt the upgrade, potentially)? Is it the:

 ~~~
 CREATE INDEX cp_cont_access_cert_serial_id_idx on cp_cont_access_cert(serial_id); 
 CREATE INDEX cp_ueber_cert_serial_id_idx on cp_ueber_cert(serial_id); 
 ~~~

 Hao's created the index below too:
 ~~~
 CREATE INDEX cp_consumer_cont_acc_cert_id_idx on cp_consumer(cont_acc_cert_id);
 ~~~

?


(we should write KCS for it but I would like to double-check its content before writing)


(please consider the questions "will it be fixed" as informative, not pushing for specific fix/solution)

Comment 23 William Poteat 2023-12-07 13:31:21 UTC
Cleaning the data on an ad hoc basis is the plan. There is no need to write code to clean these rows. The current code does not leave them and it would not be an effective use of anyone's time.


The indices are not a workaround, they are the fix. They have been added to the current Candlepin for use in Satellite. No patch will be produced to create these indices. They should be applied to a database as needed.

Comment 24 Pavel Moravec 2023-12-07 15:20:45 UTC
Thanks for confirmation, KCS created: https://access.redhat.com/solutions/7048429

Comment 25 Pavel Moravec 2023-12-08 07:53:48 UTC
I am linking support cases where I found (by some limited search) sosreport shows the same error affects these Satellites. Despite the cases were opened for different issues, these customers are affected by this problem as well.

Comment 26 Nikos Moumoulidis 2023-12-12 12:16:51 UTC
To add even more context to Pavel's comment above (#c22), after further investigation:

This issue is really multiple issues:
1. A foreign key violation against fk54b0f288a0b39916 (referencing table cp_upstream_consumer) that was already fixed in https://bugzilla.redhat.com/show_bug.cgi?id=2084130
2. The job being very slow due to lack of appropriate indices.
3. A foreign key violation against fk16370c54b9410fc (referencing table cp_certificate) that has not yet to be addressed. This violation should have never happened under
   normal circumstances because certificates in that table should never be revoked. However, an old db migration script (https://github.com/candlepin/candlepin/blob/6f1a83ced858abc560666af90ecc659f138bb338/src/main/resources/db/changelog/20170301083925-update-revoked-cert-serial-data.xml#L24-L32) accidentally set those to revoked, and once they also were expired, this fk violation started happening.

We will use this BZ id to fix both issues 2 and 3, and also prevent any FK violations from happening in the future due to rogue bad data.

Comment 33 errata-xmlrpc 2024-04-23 17:12:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.15.0 release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:2010