Description of problem: On a customer data (internally reproduced), candlepin stops responding on requests. The cause is hung ExpiredPoolsJob after logging: Updating stacked entitlements for 1661 consumers... usually followed by: - a postgres connection leak ("remaining connection slots are reserved for non-replication superuser connections") - a postgres deadlock followed 5-20 minutes after the Job kick off. Version-Release number of selected component (if applicable): Sat6.5 / candlepin-2.5.17-1.el7sat.noarch How reproducible: 100% on the customer data Steps to Reproduce: 1. Have the customer candlepin's DB (reproducer to be provided) 2. Add to candlepin.conf line like: pinsetter.org.candlepin.pinsetter.tasks.ExpiredPoolsJob.schedule=0 40 * * * ? * where the "40" is the minute when ExpiredPoolsJob should be kicked off (so one does not need to wait for full hour to default kick off) 3. Follow https://access.redhat.com/solutions/2955931 just to enable curl requests to check candlepin responsiveness (maybe "hammer ping" is sufficient) 4. Monitor /var/log/candlepin/candlepin.log for "ExpiredPoolsJob" 5. Sometimes check candlepin responds to requests like: curl -u admin:admin -ks https://127.0.0.1:8443/candlepin/consumers/e26929e7-5249-47ce-b5c2-035b05c2456f 6. (havent seen on reproducer yet) monitor postgres logs for a deadlock and/or no free connections log entry 7. monitor CPU usage of tomcat process Actual results: 4. ExpiredPoolsJob is started but never terminates, like: 2020-01-20 16:40:00,065 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.pinsetter.tasks.KingpinJob - Starting job: org.candlepin.pinsetter.tasks.ExpiredPoolsJob 2020-01-20 16:40:00,203 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Attempting to delete 10 pools... 2020-01-20 16:40:00,203 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Fetching related pools and entitlements... 2020-01-20 16:40:08,960 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Locked 11 pools for deletion... 2020-01-20 16:42:13,429 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Revoking 11271 entitlements... 2020-01-20 16:42:54,000 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Entitlements successfully revoked 2020-01-20 16:42:54,000 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Deleting 11 pools... 2020-01-20 16:42:56,258 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Pools successfully deleted 2020-01-20 16:42:56,258 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Updating entitlement counts on remaining, affected pools... 2020-01-20 16:43:04,344 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Entitlement counts successfully updated for 0 pools and 11242 consumers 2020-01-20 16:43:04,601 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.controller.CandlepinPoolManager - Updating stacked entitlements for 1661 consumers... (and no "Job completed:") Sometimes (like on the reproducer just recently), the job can fail with: 2020-01-20 17:16:17,304 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] INFO org.candlepin.pinsetter.tasks.KingpinJob - Job completed: time=2177239 2020-01-20 17:16:17,305 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.quartz.core.JobRunShell - Job cron group.ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8 threw an unhandled Exception: java.lang.NullPointerException: null at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143) at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389) at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62) at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143) at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466) at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70) at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435) at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49) at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115) at org.quartz.core.JobRunShell.run(JobRunShell.java:202) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) 2020-01-20 17:16:17,306 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.quartz.core.ErrorLogger - Job (cron group.ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8 threw an exception. org.quartz.SchedulerException: Job threw an unhandled exception. at org.quartz.core.JobRunShell.run(JobRunShell.java:213) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) Caused by: java.lang.NullPointerException: null at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143) at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389) at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62) at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143) at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466) at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70) at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435) at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49) at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115) at org.quartz.core.JobRunShell.run(JobRunShell.java:202) ... 1 common frames omitted 2020-01-20 17:16:17,308 [thread=QuartzScheduler_Worker-7] [job=ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8, org=, csid=] ERROR org.candlepin.pinsetter.core.PinsetterJobListener - Job [ExpiredPoolsJob-0406c674-ac1c-4d09-b27b-4ccbc67da3e8] failed. org.quartz.JobExecutionException: org.quartz.SchedulerException: Job threw an unhandled exception. [See nested exception: java.lang.NullPointerException] at org.quartz.core.JobRunShell.run(JobRunShell.java:218) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) Caused by: org.quartz.SchedulerException: Job threw an unhandled exception. at org.quartz.core.JobRunShell.run(JobRunShell.java:213) ... 1 common frames omitted Caused by: java.lang.NullPointerException: null at org.candlepin.policy.SystemPurposeComplianceRules.getStatus(SystemPurposeComplianceRules.java:143) at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2389) at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:62) at org.candlepin.controller.CandlepinPoolManager.deletePools(CandlepinPoolManager.java:2143) at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPoolsImpl(CandlepinPoolManager.java:466) at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:70) at org.candlepin.controller.CandlepinPoolManager.cleanupExpiredPools(CandlepinPoolManager.java:435) at org.candlepin.pinsetter.tasks.ExpiredPoolsJob.toExecute(ExpiredPoolsJob.java:49) at org.candlepin.pinsetter.tasks.KingpinJob.execute(KingpinJob.java:115) at org.quartz.core.JobRunShell.run(JobRunShell.java:202) ... 1 common frames omitted 5. the curl command hangs after a time (20-ish minutes after Job started, I expect - havent seen on my reproducer, just at customer) 6. either no below connection error or deadlock is logged (this probably requires some specific load / requests fired, havent seen on my reproducer, just sometimes in customer logs): 2020-01-16 23:06:00 CST FATAL: remaining connection slots are reserved for non-replication superuser connections 2020-01-17 21:02:48 CST ERROR: deadlock detected 2020-01-17 21:02:48 CST DETAIL: Process 177429 waits for ShareLock on transaction 1842161269; blocked by process 217155. Process 217155 waits for ShareLock on transaction 1842136013; blocked by process 177429. Process 177429: update cp_consumer set created=$1, updated=$2, annotations=$3, autoheal=$4, complianceStatusHash=$5, cont_acc_cert_id=$6, content_access_mode=$7, entitlement_count=$8, entitlementStatus=$9, environment_id=$10, consumer_idcert_id=$11, keyPair_id=$12, lastCheckin=$13, name=$14, owner_id=$15, releaseVer=$16, sp_role=$17, serviceLevel=$18, sp_status=$19, sp_status_hash=$20, type_id=$21, sp_usage=$22, username=$23, uuid=$24 where id=$25 Process 217155: delete from cp_entitlement where id=$1 7. CPU slowly grows over time Expected results: 4. Job completes in a reasonable time 5. No requests hung 6. No postgres connection leak or deadlock 7. no CPU growth Additional info:
After discussion within Engineering we are going to make a change to Satellite so the ExpiredPoolsJob only runs 1x every 24 hours. That will be included in this BZ.
Verified in 6.7 snap 14 [root@dhcp-2-119 ~]# grep Expired /etc/candlepin/candlepin.conf # Quartz schedule notation for how often to run the ExpiredPoolsJob pinsetter.org.candlepin.pinsetter.tasks.ExpiredPoolsJob.schedule=0 0 0 * * ?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:1454