Bug 717650
Summary: | Candlepin jobs failing in the event of a container bounce | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Community] Candlepin | Reporter: | Jordan OMara <jomara> | ||||||
Component: | candlepin | Assignee: | Jesus M. Rodriguez <jesusr> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | John Sefler <jsefler> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 0.5 | CC: | athomas, bkearney, jesusr | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2012-07-17 13:03:34 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 682238 | ||||||||
Attachments: |
|
Description
Jordan OMara
2011-06-29 13:34:41 UTC
Ensure Quartz is configured for clustering mode (i.e. storing job data to database): https://fedorahosted.org/candlepin/wiki/QuartzSetup Testing candlepin version... [root@jsefler-onprem-62candlepin proxy]# git branch 0.3 * master [root@jsefler-onprem-62candlepin proxy]# git show-ref | grep master 7c8d283e8e8f02f6492db78db46ac637c48c3a5a refs/heads/master 7c8d283e8e8f02f6492db78db46ac637c48c3a5a refs/remotes/origin/master [root@jsefler-onprem-62candlepin proxy]# curl -k -u admin:admin --request GET https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/status | python -mjson.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 48 100 48 0 0 55 0 --:--:-- --:--:-- --:--:-- 62 { "release": "1", "result": true, "version": "0.4.17" } [root@jsefler-onprem-62candlepin proxy]# Attempt to verify this bug.... APPROACH: Using my onpremise candlepin server with TESTDATA=1 deployed, create a ton of subscriptions so that when I run refresh pools, the job will be in a RUNNING state for awhile. Then stop and start the container (tomcat6) and re-check the status of the refresh_pools job. It should still be in a Running state after the container bounce and eventually I expect it to Finish. STEP 0: add all the required quartz paramters to candlepin.conf (reference: https://fedorahosted.org/candlepin/wiki/QuartzSetup) [root@jsefler-onprem-62candlepin ~]# cat /etc/candlepin/candlepin.conf | grep quartz org.quartz.scheduler.instanceName = MyClusteredScheduler org.quartz.scheduler.instanceId = AUTO org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate org.quartz.jobStore.useProperties = false org.quartz.jobStore.dataSource = cpqrtz org.quartz.jobStore.tablePrefix = QRTZ_ org.quartz.jobStore.isClustered = true org.quartz.jobStore.clusterCheckinInterval = 20000 # this must match the value of org.quartz.jobStore.dataSource org.quartz.dataSource.cpqrtz.driver = org.postgresql.Driver org.quartz.dataSource.cpqrtz.URL = jdbc:postgresql:candlepin org.quartz.dataSource.cpqrtz.user = candlepin org.quartz.dataSource.cpqrtz.password = candlepin org.quartz.dataSource.cpqrtz.maxConnections = 5 [root@jsefler-onprem-62candlepin ~]# [root@jsefler-onprem-62candlepin ~]# service tomcat6 restart Stopping tomcat6: [ OK ] Starting tomcat6: [ OK ] [root@jsefler-onprem-62candlepin ~]# STEP 1: Create a ton of future subscriptions [root@jsefler-onprem-62candlepin ~]# for i in `seq 1 1000`; do curl -k --user admin:admin --request POST --data '{"product":{"id":"awesomeos-server-basic"},"startDate":"Tue, 13 Sep 2016 01:00:00 -0400","accountNumber":123456,"quantity":20,"endDate":"Wed, 13 Sep 2017 01:00:00 -0400","contractNumber":123,"providedProducts":[{"id":"37060"}]}' --header 'accept: application/json' --header 'content-type: application/json' https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/owners/admin/subscriptions 1>/dev/null 2>/dev/null; done; STEP 2: Refresh pools [root@jsefler-onprem-62candlepin ~]# curl -k -u admin:admin --request PUT https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/owners/admin/subscriptions | python -mjson.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 366 100 366 0 0 1728 0 --:--:-- --:--:-- --:--:-- 2000 { "created": "2011-09-28T20:13:50.577+0000", "finishTime": null, "group": "async group", "id": "refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", "principalName": "admin", "result": null, "startTime": null, "state": "CREATED", "statusPath": "/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", "targetId": "admin", "targetType": "owner", "updated": "2011-09-28T20:13:50.577+0000" } STEP 3: VERY QUICKLY STOP THE SERVER, THEN START AND CHECK THE REFRESH STATUS [root@jsefler-onprem-62candlepin ~]# date; service tomcat6 stop; date Wed Sep 28 16:13:51 EDT 2011 Stopping tomcat6: [ OK ] Wed Sep 28 16:13:54 EDT 2011 [root@jsefler-onprem-62candlepin ~]# service tomcat6 start Starting tomcat6: [ OK ] [root@jsefler-onprem-62candlepin ~]# curl -k -u admin:admin --request GET https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e | python -mjson.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 392 100 392 0 0 13 0 0:00:30 0:00:28 0:00:02 13 { "created": "2011-09-28T20:13:50.577+0000", "finishTime": null, "group": "async group", "id": "refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", "principalName": "admin", "result": null, "startTime": "2011-09-28T20:13:50.680+0000", "state": "RUNNING", "statusPath": "/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", "targetId": "admin", "targetType": "owner", "updated": "2011-09-28T20:13:50.701+0000" } ^^^ OUR REFRESH POOLS JOB IS IN A RUNNING STATE AFTER BOUNCING THE CONTAINER. THAT"S GOOD [root@jsefler-onprem-62candlepin ~]# date Wed Sep 28 16:51:35 EDT 2011 [root@jsefler-onprem-62candlepin ~]# curl -k -u admin:admin --request GET https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e | python -mjson.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 392 100 392 0 0 734 0 --:--:-- --:--:-- --:--:-- 933 { "created": "2011-09-28T20:13:50.577+0000", "finishTime": null, "group": "async group", "id": "refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", "principalName": "admin", "result": null, "startTime": "2011-09-28T20:13:50.680+0000", "state": "RUNNING", "statusPath": "/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", "targetId": "admin", "targetType": "owner", "updated": "2011-09-28T20:13:50.701+0000" } ^^^^ AFTER ALMOST AN HOUR, THE JOB IS STILL RUNNING. THAT"S BAD [root@jsefler-onprem-62candlepin ~]# psql -U candlepin psql (9.0.4) Type "help" for help. candlepin=> select * from qrtz_simple_triggers; trigger_name | trigger_group | repeat_count | repeat_interval | times_triggered --------------+---------------+--------------+-----------------+---------------- - (0 rows) candlepin=> \q [root@jsefler-onprem-62candlepin ~]# ^^^ NO ENTRIES IN THE qrtz_simple_triggers TABLE. OUR RUNNING JOB APPEARS TO BE A RUNAWAY. THAT"S HOPELESS. Moving back to NEW/FailsQA Attaching the tail of catalina.out Created attachment 525406 [details] portion of /var/log/tomcat6/catalina.out during the test period in comment 2 Upon stopping tomcat, connect to the database to verify a few things: 1) select trigger_name, requests_recovery from qrtz_fired_triggers; verify requests_recovery is true (t). 2) select * from qrtz_simple_triggers; verify the refresh pools exists there. 3) select id, state from cp_job; verify the state is *NOT* 3 ------------- Restart tomcat, then wait a bit for the job to re-run. Then verify that the above 3 queries changed; 1) verify that refresh job was removed from the fired_triggers table 2) verify that refresh job was removed from the simple_triggers table 3) verify the state changed to 3 (FINISHED) Created attachment 528626 [details]
set some async jobs to recover
fixed in master by 4a94df0aff73231917c88fdbf96c9b8804174f5c requires candlepin >= 0.4.22-1 to get this functionality. Marking all community bugs modified or beyong as closed. |