Bug 1793092

Summary: candlepin hangs on ExpiredPoolsJob forever while Updating stacked entitlements
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: Candlepin
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA
QA Contact: Cole Higgins <chiggins>
Severity: high
Priority: unspecified
Version: 6.5.0
CC: dsynk, egolov, mmccune, zhunting
Target Milestone: 6.7.0
Keywords: Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Fixed In Version: candlepin-2.9.23-1, foreman-installer-1.24.1.10-1
Doc Type: If docs needed, set a value
Clones: 1794041, 1794042, 1794043, 1794583
Last Closed: 2020-04-14 13:28:34 UTC
Type: Bug
Bug Blocks: 1794041, 1794042, 1794043

Description Pavel Moravec 2020-01-20 16:18:28 UTC
Description of problem:
With customer data (reproduced internally), Candlepin stops responding to requests. The cause is a hung ExpiredPoolsJob that stalls after logging:

Updating stacked entitlements for 1661 consumers...

This is usually followed, 5-20 minutes after the job kicks off, by:
- a postgres connection leak ("remaining connection slots are reserved for non-replication superuser connections")
- a postgres deadlock


Version-Release number of selected component (if applicable):
Sat6.5 / candlepin-2.5.17-1.el7sat.noarch


How reproducible:
100% on the customer data


Steps to Reproduce:
1. Have the customer's Candlepin DB (reproducer to be provided)
2. Add to candlepin.conf a line like:

pinsetter.org.candlepin.pinsetter.tasks.ExpiredPoolsJob.schedule=0 40 * * * ? *

where "40" is the minute at which ExpiredPoolsJob should be kicked off (so one does not need to wait up to a full hour for the default kick-off)

3. Follow https://access.redhat.com/solutions/2955931 to enable curl requests for checking Candlepin responsiveness ("hammer ping" may be sufficient)

4. Monitor /var/log/candlepin/candlepin.log for "ExpiredPoolsJob"

5. Periodically check that Candlepin responds to requests like:

curl -u admin:admin -ks https://127.0.0.1:8443/candlepin/consumers/e26929e7-5249-47ce-b5c2-035b05c2456f

6. (not yet seen on the reproducer) Monitor postgres logs for a deadlock and/or a "no free connections" log entry

7. Monitor CPU usage of the tomcat process
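For reference, the schedule in step 2 is a Quartz cron expression, whose field order differs from classic crontab: it adds a leading seconds field and an optional trailing year, and uses "?" for "no specific value". A minimal sketch that just labels the fields (parse_quartz_cron is a hypothetical helper, not Candlepin code):

```python
# Label the fields of the Quartz cron expression used in step 2.
# Quartz order: seconds, minutes, hours, day-of-month, month,
# day-of-week, and an optional trailing year.
FIELDS = ("seconds", "minutes", "hours", "day_of_month",
          "month", "day_of_week", "year")

def parse_quartz_cron(expr):
    """Split a Quartz cron expression into named fields (no validation)."""
    values = expr.split()
    if len(values) not in (6, 7):
        raise ValueError("Quartz cron expressions have 6 or 7 fields")
    return dict(zip(FIELDS, values))

# "0 40 * * * ? *" fires at second 0 of minute 40 of every hour.
schedule = parse_quartz_cron("0 40 * * * ? *")
print(schedule["minutes"])  # prints "40"
```

The same breakdown explains the daily schedule "0 0 0 * * ?" used later as the fix: second 0, minute 0, hour 0, i.e. once per day at midnight.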


Actual results:
4. ExpiredPoolsJob starts but never terminates, e.g.:
2020-01-20 16:40:00,065 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.pinsetter.tasks.KingpinJob - Starting job: org.candlepin.pinsetter.tasks.ExpiredPoolsJob
2020-01-20 16:40:00,203 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Attempting to delete 10 pools...
2020-01-20 16:40:00,203 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Fetching related pools and entitlements...
2020-01-20 16:40:08,960 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Locked 11 pools for deletion...
2020-01-20 16:42:13,429 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Revoking 11271 entitlements...
2020-01-20 16:42:54,000 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Entitlements successfully revoked
2020-01-20 16:42:54,000 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Deleting 11 pools...
2020-01-20 16:42:56,258 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Pools successfully deleted
2020-01-20 16:42:56,258 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Updating entitlement counts on remaining, affected pools...
2020-01-20 16:43:04,344 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Entitlement counts successfully updated for 0 pools and 11242 consumers
2020-01-20 16:43:04,601 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.controller.CandlepinPoolManager - Updating stacked entitlements for 1661 consumers...

(and no "Job completed:")

Sometimes (as on the reproducer just recently), the job instead fails with:

2020-01-20 17:16:17,304 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO  org.candlepin.pinsetter.tasks.KingpinJob - Job completed: time=2177239
2020-01-20 17:16:17,305 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.quartz.core.JobRunShell - Job cron group.ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8 threw an unhandled Exception: 
java.lang.NullPointerException: null
	at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435)
	at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49)
	at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
2020-01-20 17:16:17,306 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.quartz.core.ErrorLogger - Job (cron group.ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8 threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception.
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.NullPointerException: null
	at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435)
	at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49)
	at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	... 1 common frames omitted
2020-01-20 17:16:17,308 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.candlepin.pinsetter.core.PinsetterJobListener - Job [ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8] failed.
org.quartz.JobExecutionException: org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.NullPointerException]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:218)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: org.quartz.SchedulerException: Job threw an unhandled exception.
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	... 1 common frames omitted
Caused by: java.lang.NullPointerException: null
	at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62)
	at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466)
	at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70)
	at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435)
	at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49)
	at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	... 1 common frames omitted

5. The curl command hangs after a while (roughly 20 minutes after the job started, I expect; not seen on my reproducer, only at the customer)

6. Either the connection error or the deadlock below is logged (this probably requires some specific load / requests being fired; not seen on my reproducer, only sometimes in customer logs):

2020-01-16 23:06:00 CST FATAL:  remaining connection slots are reserved for non-replication superuser connections

2020-01-17 21:02:48 CST ERROR:  deadlock detected
2020-01-17 21:02:48 CST DETAIL:  Process 177429 waits for ShareLock on transaction 1842161269; blocked by process 217155.
        Process 217155 waits for ShareLock on transaction 1842136013; blocked by process 177429.
        Process 177429: update cp_consumer set created=$1, updated=$2, annotations=$3, autoheal=$4, complianceStatusHash=$5, cont_acc_cert_id=$6, content_access_mode=$7, entitlement_count=$8, entitlementStatus=$9, environment_id=$10, consumer_idcert_id=$11, keyPair_id=$12, lastCheckin=$13, name=$14, owner_id=$15, releaseVer=$16, sp_role=$17, serviceLevel=$18, sp_status=$19, sp_status_hash=$20, type_id=$21, sp_usage=$22, username=$23, uuid=$24 where id=$25
        Process 217155: delete from cp_entitlement where id=$1

7. CPU usage of the tomcat process slowly grows over time
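The hang in item 4 can be spotted mechanically: every healthy run pairs a "Starting job:" line with a "Job completed:" line for the same job id. A minimal sketch that flags runs missing the completion line (unfinished_jobs is a hypothetical helper for scanning /var/log/candlepin/candlepin.log, not part of Candlepin):

```python
import re

def unfinished_jobs(log_lines):
    """Return ExpiredPoolsJob ids that logged 'Starting job:' but never
    'Job completed:'."""
    started, completed = set(), set()
    for line in log_lines:
        # Job ids look like ExpiredPoolsJob-<uuid> in the [job=...] field.
        m = re.search(r"\[job=(ExpiredPoolsJob-[0-9a-f-]+)", line)
        if not m:
            continue
        job_id = m.group(1)
        if "Starting job:" in line:
            started.add(job_id)
        elif "Job completed:" in line:
            completed.add(job_id)
    return started - completed

sample = [
    '[job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8] INFO '
    'Starting job: org.candlepin.pinsetter.tasks.ExpiredPoolsJob',
    '[job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8] INFO '
    'Updating stacked entitlements for 1661 consumers...',
]
# Flags the run that logged its start but never its completion.
print(unfinished_jobs(sample))
```

Feeding it the full log from the actual results above would report the hung job id, while a log containing the later "Job completed: time=2177239" line would not.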


Expected results:
4. Job completes in a reasonable time

5. No requests hung

6. No postgres connection leak or deadlock

7. No CPU growth


Additional info:

Comment 4 Mike McCune 2020-01-22 18:07:47 UTC
After discussion within Engineering, we are going to change Satellite so that the ExpiredPoolsJob runs only once every 24 hours. That change will be included in this BZ.

Comment 7 Cole Higgins 2020-03-03 21:16:01 UTC
Verified in 6.7 snap 14

[root@dhcp-2-119 ~]# grep Expired /etc/candlepin/candlepin.conf
# Quartz schedule notation for how often to run the ExpiredPoolsJob
pinsetter.org.candlepin.pinsetter.tasks.ExpiredPoolsJob.schedule=0 0 0 * * ?

Comment 10 errata-xmlrpc 2020-04-14 13:28:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1454