Bug 717650

Summary: Candlepin jobs failing in the event of a container bounce
Product: [Community] Candlepin Reporter: Jordan OMara <jomara>
Component: candlepinAssignee: Jesus M. Rodriguez <jesusr>
Status: CLOSED CURRENTRELEASE QA Contact: John Sefler <jsefler>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 0.5CC: athomas, bkearney, jesusr
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-07-17 13:03:34 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 682238    
Attachments:
Description Flags
portion of /var/log/tomcat6/catalina.out during the test period in comment 2
none
set some async jobs to recover none

Description Jordan OMara 2011-06-29 13:34:41 UTC
Description of problem:
If a Candlepin job, like refresh pools, is running while the container gets forcibly bounced (including as far as we can tell a clean shutdown, like "service jbossas stop"), the job never finishes and stays in the DB.

Version-Release number of selected component (if applicable):
0.3.x (but I'm fairly certain it exists in 0.4.x also)

How reproducible:
Always

Steps to Reproduce:
1. Schedule long running job
2. Bounce container before job finishes
3. Restart container
  
Actual results:
Job does not get picked back up when container restarts


Expected results:
Job gets picked back up when container restarts

Additional info:

Comment 1 Jesus M. Rodriguez 2011-09-28 19:19:55 UTC
Ensure Quartz is configured for clustering mode (i.e. storing job data to database):

https://fedorahosted.org/candlepin/wiki/QuartzSetup

Comment 2 John Sefler 2011-09-28 21:06:42 UTC
Testing candlepin version...
[root@jsefler-onprem-62candlepin proxy]# git branch
  0.3
* master
[root@jsefler-onprem-62candlepin proxy]# git show-ref | grep master
7c8d283e8e8f02f6492db78db46ac637c48c3a5a refs/heads/master
7c8d283e8e8f02f6492db78db46ac637c48c3a5a refs/remotes/origin/master
[root@jsefler-onprem-62candlepin proxy]# curl -k -u admin:admin --request GET https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/status | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    48  100    48    0     0     55      0 --:--:-- --:--:-- --:--:--    62
{
    "release": "1", 
    "result": true, 
    "version": "0.4.17"
}
[root@jsefler-onprem-62candlepin proxy]# 



Attempt to verify this bug....
APPROACH:
Using my onpremise candlepin server with TESTDATA=1 deployed, create a ton of subscriptions so that when I run refresh pools, the job will be in a RUNNING state for awhile.  Then stop and start the container (tomcat6) and re-check the status of the refresh_pools job.  It should still be in a Running state after the container bounce and eventually I expect it to Finish.


STEP 0: add all the required quartz paramters to candlepin.conf (reference: https://fedorahosted.org/candlepin/wiki/QuartzSetup)
[root@jsefler-onprem-62candlepin ~]# cat /etc/candlepin/candlepin.conf  | grep quartz
org.quartz.scheduler.instanceName = MyClusteredScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
org.quartz.jobStore.useProperties = false
org.quartz.jobStore.dataSource = cpqrtz
org.quartz.jobStore.tablePrefix = QRTZ_
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
#  this must match the value of org.quartz.jobStore.dataSource
org.quartz.dataSource.cpqrtz.driver = org.postgresql.Driver
org.quartz.dataSource.cpqrtz.URL = jdbc:postgresql:candlepin
org.quartz.dataSource.cpqrtz.user = candlepin
org.quartz.dataSource.cpqrtz.password = candlepin
org.quartz.dataSource.cpqrtz.maxConnections = 5
[root@jsefler-onprem-62candlepin ~]# 
[root@jsefler-onprem-62candlepin ~]# service tomcat6 restart
Stopping tomcat6:                                          [  OK  ]
Starting tomcat6:                                          [  OK  ]
[root@jsefler-onprem-62candlepin ~]# 


STEP 1: Create a ton of future subscriptions

[root@jsefler-onprem-62candlepin ~]# for i in `seq 1 1000`; do curl -k --user admin:admin --request POST --data '{"product":{"id":"awesomeos-server-basic"},"startDate":"Tue, 13 Sep 2016 01:00:00 -0400","accountNumber":123456,"quantity":20,"endDate":"Wed, 13 Sep 2017 01:00:00 -0400","contractNumber":123,"providedProducts":[{"id":"37060"}]}' --header 'accept: application/json' --header 'content-type: application/json' https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/owners/admin/subscriptions 1>/dev/null 2>/dev/null; done;


STEP 2: Refresh pools

[root@jsefler-onprem-62candlepin ~]# curl -k -u admin:admin --request PUT https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/owners/admin/subscriptions | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   366  100   366    0     0   1728      0 --:--:-- --:--:-- --:--:--  2000
{
    "created": "2011-09-28T20:13:50.577+0000", 
    "finishTime": null, 
    "group": "async group", 
    "id": "refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", 
    "principalName": "admin", 
    "result": null, 
    "startTime": null, 
    "state": "CREATED", 
    "statusPath": "/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", 
    "targetId": "admin", 
    "targetType": "owner", 
    "updated": "2011-09-28T20:13:50.577+0000"
}


STEP 3: VERY QUICKLY STOP THE SERVER, THEN START AND CHECK THE REFRESH STATUS

[root@jsefler-onprem-62candlepin ~]# date; service tomcat6 stop; date
Wed Sep 28 16:13:51 EDT 2011
Stopping tomcat6:                                          [  OK  ]
Wed Sep 28 16:13:54 EDT 2011
[root@jsefler-onprem-62candlepin ~]# service tomcat6 start
Starting tomcat6:                                          [  OK  ]
[root@jsefler-onprem-62candlepin ~]# curl -k -u admin:admin --request GET https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   392  100   392    0     0     13      0  0:00:30  0:00:28  0:00:02    13
{
    "created": "2011-09-28T20:13:50.577+0000", 
    "finishTime": null, 
    "group": "async group", 
    "id": "refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", 
    "principalName": "admin", 
    "result": null, 
    "startTime": "2011-09-28T20:13:50.680+0000", 
    "state": "RUNNING", 
    "statusPath": "/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", 
    "targetId": "admin", 
    "targetType": "owner", 
    "updated": "2011-09-28T20:13:50.701+0000"
}

^^^ OUR REFRESH POOLS JOB IS IN A RUNNING STATE AFTER BOUNCING THE CONTAINER.  THAT"S GOOD

[root@jsefler-onprem-62candlepin ~]# date
Wed Sep 28 16:51:35 EDT 2011
[root@jsefler-onprem-62candlepin ~]# curl -k -u admin:admin --request GET https://jsefler-onprem-62candlepin.usersys.redhat.com:8443/candlepin/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   392  100   392    0     0    734      0 --:--:-- --:--:-- --:--:--   933
{
    "created": "2011-09-28T20:13:50.577+0000", 
    "finishTime": null, 
    "group": "async group", 
    "id": "refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", 
    "principalName": "admin", 
    "result": null, 
    "startTime": "2011-09-28T20:13:50.680+0000", 
    "state": "RUNNING", 
    "statusPath": "/jobs/refresh_pools_147f5c49-87ed-4106-abce-87561e00606e", 
    "targetId": "admin", 
    "targetType": "owner", 
    "updated": "2011-09-28T20:13:50.701+0000"
}

^^^^ AFTER ALMOST AN HOUR, THE JOB IS STILL RUNNING.  THAT"S BAD


[root@jsefler-onprem-62candlepin ~]# psql -U candlepin
psql (9.0.4)
Type "help" for help.

candlepin=> select * from qrtz_simple_triggers;
 trigger_name | trigger_group | repeat_count | repeat_interval | times_triggered
 
--------------+---------------+--------------+-----------------+----------------
-
(0 rows)

candlepin=> \q
[root@jsefler-onprem-62candlepin ~]# 


^^^ NO ENTRIES IN THE qrtz_simple_triggers TABLE.  OUR RUNNING JOB APPEARS TO BE A RUNAWAY.  THAT"S HOPELESS.


Moving back to NEW/FailsQA
Attaching the tail of catalina.out

Comment 3 John Sefler 2011-09-28 21:09:27 UTC
Created attachment 525406 [details]
portion of /var/log/tomcat6/catalina.out during the test period in comment 2

Comment 4 Jesus M. Rodriguez 2011-10-17 18:56:31 UTC
Upon stopping tomcat, connect to the database to verify a few things:

1) select trigger_name, requests_recovery from qrtz_fired_triggers;

verify requests_recovery is true (t).

2) select * from qrtz_simple_triggers;

verify the refresh pools exists there.

3) select id, state from cp_job;

verify the state is *NOT* 3

-------------

Restart tomcat, then wait a bit for the job to re-run.
Then verify that the above 3 queries changed;

1) verify that refresh job was removed from the fired_triggers table
2) verify that refresh job was removed from the simple_triggers table
3) verify the state changed to 3 (FINISHED)

Comment 6 Jesus M. Rodriguez 2011-10-17 18:57:49 UTC
Created attachment 528626 [details]
set some async jobs to recover

Comment 7 Jesus M. Rodriguez 2011-10-17 19:22:27 UTC
fixed in master by 4a94df0aff73231917c88fdbf96c9b8804174f5c
requires candlepin >= 0.4.22-1 to get this functionality.

Comment 8 Bryan Kearney 2012-07-17 13:03:34 UTC
Marking all community bugs modified or beyong as closed.