Bug 1793092 - candlepin hangs on ExpiredPoolsJob forever while Updating stacked entitlements
Summary: candlepin hangs on ExpiredPoolsJob forever while Updating stacked entitlements
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Candlepin
Version: 6.5.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: 6.7.0
Assignee: satellite6-bugs
QA Contact: Cole Higgins
URL:
Whiteboard:
Depends On:
Blocks: 1794041 1794042 1794043
TreeView+ depends on / blocked
 
Reported: 2020-01-20 16:18 UTC by Pavel Moravec
Modified: 2023-09-07 21:32 UTC (History)
4 users (show)

Fixed In Version: candlepin-2.9.23-1,foreman-installer-1.24.1.10-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1794041 1794042 1794043 1794583 (view as bug list)
Environment:
Last Closed: 2020-04-14 13:28:34 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github theforeman puppet-candlepin pull 141 0 None closed Fixes #28840 - move to every 24 hours instead of every hour 2020-08-26 01:43:28 UTC
Red Hat Product Errata RHSA-2020:1454 0 None None None 2020-04-14 13:28:48 UTC

Internal Links: 1794041 1794042 1794043

Description Pavel Moravec 2020-01-20 16:18:28 UTC
Description of problem:
On a customer data (internally reproduced), candlepin stops responding on requests. The cause is hung ExpiredPoolsJob after logging:

Updating stacked entitlements for 1661 consumers...

usually followed by:
- a postgres connection leak ("remaining connection slots are reserved for non-replication superuser connections")
- a postgres deadlock

followed 5-20 minutes after the Job kick off.


Version-Release number of selected component (if applicable):
Sat6.5 / candlepin-2.5.17-1.el7sat.noarch


How reproducible:
100% on the customer data


Steps to Reproduce:
1. Have the customer candlepin's DB (reproducer to be provided)
2. Add to candlepin.conf line like:

pinsetter.org.candlepin.pinsetter.tasks.ExpiredPoolsJob.schedule=0 40 * * * ? *

where the "40" is the minute when ExpiredPoolsJob should be kicked off (so one does not need to wait for full hour to default kick off)

3. Follow https://access.redhat.com/solutions/2955931 just to enable curl requests to check candlepin responsiveness (maybe "hammer ping" is sufficient)

4. Monitor /var/log/candlepin/candlepin.log for "ExpiredPoolsJob"

5. Sometimes check candlepin responds to requests like:

curl -u admin:admin -ks https://127.0.0.1:8443/candlepin/consumers/e26929e7-5249-47ce-b5c2-035b05c2456f

6. (havent seen on reproducer yet) monitor postgres logs for a deadlock and/or no free connections log entry

7. monitor CPU usage of tomcat process


Actual results:
4. ExpiredPoolsJob is started but never terminates, like:
2020-01-20 16:40:00,065 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.pinsetter.tasks.KingpinJob - Starting job: org.candlepin.pinsetter.tasks.ExpiredPoolsJob
2020-01-20 16:40:00,203 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Attempting to delete 10 pools...
2020-01-20 16:40:00,203 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Fetching related pools and entitlements...
2020-01-20 16:40:08,960 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Locked 11 pools for deletion...
2020-01-20 16:42:13,429 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Revoking 11271 entitlements...
2020-01-20 16:42:54,000 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Entitlements successfully revoked
2020-01-20 16:42:54,000 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Deleting 11 pools...
2020-01-20 16:42:56,258 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Pools successfully deleted
2020-01-20 16:42:56,258 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Updating entitlement counts on remaining, affected pools...
2020-01-20 16:43:04,344 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Entitlement counts successfully updated for 0 pools and 11242 consumers
2020-01-20 16:43:04,601 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Updating stacked entitlements for 1661 consumers...

(and no "Job completed:")

Sometimes (like on the reproducer just recently), the job can fail with:

2020-01-20 17:16:17,304 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.pinsetter.tasks.KingpinJob - Job completed: time=2177239
2020-01-20 17:16:17,305 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.quartz.core.JobRunShell - Job cron group.ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8 threw an unhandled Exception: 
java.lang.NullPointerException: null
	at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435)
	at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49)
	at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
2020-01-20 17:16:17,306 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.quartz.core.ErrorLogger - Job (cron group.ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8 threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception.
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.NullPointerException: null
	at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435)
	at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49)
	at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	... 1 common frames omitted
2020-01-20 17:16:17,308 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.candlepin.pinsetter.core.PinsetterJobListener - Job [ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8] failed.
org.quartz.JobExecutionException: org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.NullPointerException]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:218)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: org.quartz.SchedulerException: Job threw an unhandled exception.
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	... 1 common frames omitted
Caused by: java.lang.NullPointerException: null
	at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435)
	at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49)
	at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	... 1 common frames omitted

5. the curl command hangs after a time (20-ish minutes after Job started, I expect - havent seen on my reproducer, just at customer)

6. either no below connection error or deadlock is logged (this probably requires some specific load / requests fired, havent seen on my reproducer, just sometimes in customer logs):

2020-01-16 23:06:00 CST FATAL:  remaining connection slots are reserved for non-replication superuser connections

2020-01-17 21:02:48 CST ERROR:  deadlock detected
2020-01-17 21:02:48 CST DETAIL:  Process 177429 waits for ShareLock on transaction 1842161269; blocked by process 217155.
        Process 217155 waits for ShareLock on transaction 1842136013; blocked by process 177429.
        Process 177429: update cp_consumer set created=$1, updated=$2, annotations=$3, autoheal=$4, complianceStatusHash=$5, cont_acc_cert_id=$6, content_access_mode=$7, entitlement_count=$8, entitlementStatus=$9, environment_id=$10, consumer_idcert_id=$11, keyPair_id=$12, lastCheckin=$13, name=$14, owner_id=$15, releaseVer=$16, sp_role=$17, serviceLevel=$18, sp_status=$19, sp_status_hash=$20, type_id=$21, sp_usage=$22, username=$23, uuid=$24 where id=$25
        Process 217155: delete from cp_entitlement where id=$1

7. CPU slowly grows over time


Expected results:
4. Job completes in a reasonable time

5. No requests hung

6. No postgres connection leak or deadlock

7. no CPU growth


Additional info:

Comment 4 Mike McCune 2020-01-22 18:07:47 UTC
After discussion within Engineering we are going to make a change to Satellite so the ExpiredPoolsJob only runs 1x every 24 hours. That will be included in this BZ.

Comment 7 Cole Higgins 2020-03-03 21:16:01 UTC
Verified in 6.7 snap 14

[root@dhcp-2-119 ~]# grep Expired /etc/candlepin/candlepin.conf
# Quartz schedule notation for how often to run the ExpiredPoolsJob
pinsetter.org.candlepin.pinsetter.tasks.ExpiredPoolsJob.schedule=0 0 0 * * ?

Comment 10 errata-xmlrpc 2020-04-14 13:28:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1454


Note You need to log in before you can comment on or make changes to this bug.