Red Hat Satellite engineering is moving the tracking of its product development work on Satellite to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "Satellite project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs will be migrated starting at the end of May. If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "Satellite project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/SAT-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1793092 - candlepin hangs on ExpiredPoolsJob forever while Updating stacked entitlements
Summary: candlepin hangs on ExpiredPoolsJob forever while Updating stacked entitlements
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Candlepin
Version: 6.5.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: 6.7.0
Assignee: satellite6-bugs
QA Contact: Cole Higgins
URL:
Whiteboard:
Depends On:
Blocks: 1794041 1794042 1794043
TreeView+ depends on / blocked
 
Reported: 2020-01-20 16:18 UTC by Pavel Moravec
Modified: 2023-09-07 21:32 UTC (History)
4 users (show)

Fixed In Version: candlepin-2.9.23-1,foreman-installer-1.24.1.10-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1794041 1794042 1794043 1794583 (view as bug list)
Environment:
Last Closed: 2020-04-14 13:28:34 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github theforeman puppet-candlepin pull 141 0 None closed Fixes #28840 - move to every 24 hours instead of every hour 2020-08-26 01:43:28 UTC
Red Hat Product Errata RHSA-2020:1454 0 None None None 2020-04-14 13:28:48 UTC

Internal Links: 1794041 1794042 1794043

Description Pavel Moravec 2020-01-20 16:18:28 UTC
Description of problem:
On a customer data (internally reproduced), candlepin stops responding on requests. The cause is hung ExpiredPoolsJob after logging:

Updating stacked entitlements for 1661 consumers...

usually followed by:
- a postgres connection leak ("remaining connection slots are reserved for non-replication superuser connections")
- a postgres deadlock

followed 5-20 minutes after the Job kick off.


Version-Release number of selected component (if applicable):
Sat6.5 / candlepin-2.5.17-1.el7sat.noarch


How reproducible:
100% on the customer data


Steps to Reproduce:
1. Have the customer candlepin's DB (reproducer to be provided)
2. Add to candlepin.conf line like:

pinsetter.org.candlepin.pinsetter.tasks.ExpiredPoolsJob.schedule=0 40 * * * ? *

where the "40" is the minute when ExpiredPoolsJob should be kicked off (so one does not need to wait for full hour to default kick off)

3. Follow https://access.redhat.com/solutions/2955931 just to enable curl requests to check candlepin responsiveness (maybe "hammer ping" is sufficient)

4. Monitor /var/log/candlepin/candlepin.log for "ExpiredPoolsJob"

5. Sometimes check candlepin responds to requests like:

curl -u admin:admin -ks https://127.0.0.1:8443/candlepin/consumers/e26929e7-5249-47ce-b5c2-035b05c2456f

6. (havent seen on reproducer yet) monitor postgres logs for a deadlock and/or no free connections log entry

7. monitor CPU usage of tomcat process


Actual results:
4. ExpiredPoolsJob is started but never terminates, like:
2020-01-20 16:40:00,065 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.pinsetter.tasks.KingpinJob - Starting job: org.candlepin.pinsetter.tasks.ExpiredPoolsJob
2020-01-20 16:40:00,203 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Attempting to delete 10 pools...
2020-01-20 16:40:00,203 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Fetching related pools and entitlements...
2020-01-20 16:40:08,960 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Locked 11 pools for deletion...
2020-01-20 16:42:13,429 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Revoking 11271 entitlements...
2020-01-20 16:42:54,000 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Entitlements successfully revoked
2020-01-20 16:42:54,000 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Deleting 11 pools...
2020-01-20 16:42:56,258 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Pools successfully deleted
2020-01-20 16:42:56,258 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Updating entitlement counts on remaining, affected pools...
2020-01-20 16:43:04,344 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Entitlement counts successfully updated for 0 pools and 11242 consumers
2020-01-20 16:43:04,601 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Updating stacked entitlements for 1661 consumers...

(and no "Job completed:")

Sometimes (like on the reproducer just recently), the job can fail with:

2020-01-20 17:16:17,304 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.pinsetter.tasks.KingpinJob - Job completed: time=2177239
2020-01-20 17:16:17,305 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.quartz.core.JobRunShell - Job cron group.ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8 threw an unhandled Exception: 
java.lang.NullPointerException: null
	at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435)
	at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49)
	at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
2020-01-20 17:16:17,306 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.quartz.core.ErrorLogger - Job (cron group.ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8 threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception.
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.NullPointerException: null
	at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435)
	at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49)
	at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	... 1 common frames omitted
2020-01-20 17:16:17,308 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.candlepin.pinsetter.core.PinsetterJobListener - Job [ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8] failed.
org.quartz.JobExecutionException: org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.NullPointerException]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:218)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: org.quartz.SchedulerException: Job threw an unhandled exception.
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	... 1 common frames omitted
Caused by: java.lang.NullPointerException: null
	at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435)
	at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49)
	at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	... 1 common frames omitted

5. the curl command hangs after a time (20-ish minutes after Job started, I expect - havent seen on my reproducer, just at customer)

6. either no below connection error or deadlock is logged (this probably requires some specific load / requests fired, havent seen on my reproducer, just sometimes in customer logs):

2020-01-16 23:06:00 CST FATAL:  remaining connection slots are reserved for non-replication superuser connections

2020-01-17 21:02:48 CST ERROR:  deadlock detected
2020-01-17 21:02:48 CST DETAIL:  Process 177429 waits for ShareLock on transaction 1842161269; blocked by process 217155.
        Process 217155 waits for ShareLock on transaction 1842136013; blocked by process 177429.
        Process 177429: update cp_consumer set created=$1, updated=$2, annotations=$3, autoheal=$4, complianceStatusHash=$5, cont_acc_cert_id=$6, content_access_mode=$7, entitlement_count=$8, entitlementStatus=$9, environment_id=$10, consumer_idcert_id=$11, keyPair_id=$12, lastCheckin=$13, name=$14, owner_id=$15, releaseVer=$16, sp_role=$17, serviceLevel=$18, sp_status=$19, sp_status_hash=$20, type_id=$21, sp_usage=$22, username=$23, uuid=$24 where id=$25
        Process 217155: delete from cp_entitlement where id=$1

7. CPU slowly grows over time


Expected results:
4. Job completes in a reasonable time

5. No requests hung

6. No postgres connection leak or deadlock

7. no CPU growth


Additional info:

Comment 4 Mike McCune 2020-01-22 18:07:47 UTC
After discussion within Engineering we are going to make a change to Satellite so the ExpiredPoolsJob only runs 1x every 24 hours. That will be included in this BZ.

Comment 7 Cole Higgins 2020-03-03 21:16:01 UTC
Verified in 6.7 snap 14

[root@dhcp-2-119 ~]# grep Expired /etc/candlepin/candlepin.conf
# Quartz schedule notation for how often to run the ExpiredPoolsJob
pinsetter.org.candlepin.pinsetter.tasks.ExpiredPoolsJob.schedule=0 0 0 * * ?

Comment 10 errata-xmlrpc 2020-04-14 13:28:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1454


Note You need to log in before you can comment on or make changes to this bug.