+++ This bug was initially created as a clone of Bug #1120270 +++

Description of problem:
After killing the worker running a task, the task halts immediately and the worker dies, but the status of the task is "Successful".

Before kill -9:

Operations:   sync
Resources:    pup (repository)
State:        Running
Start Time:   2014-07-16T14:06:44Z
Finish Time:  Incomplete
Task Id:      e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:   sync
Resources:    pup1 (repository)
State:        Waiting
Start Time:   Unstarted
Finish Time:  Incomplete

Right after kill -9:

Operations:   sync
Resources:    pup (repository)
State:        Successful
Start Time:   2014-07-16T14:07:36Z
Finish Time:  2014-07-16T14:07:37Z
Task Id:      e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:   sync
Resources:    pup1 (repository)
State:        Waiting
Start Time:   Unstarted
Finish Time:  Incomplete
Task Id:      091a2bec-5c45-4494-b367-e1370bd23398

Operations:   publish
Resources:    pup (repository)
State:        Waiting
Start Time:   Unstarted
Finish Time:  Incomplete
Task Id:      0c83b413-d55d-4549-8c34-0b1f6dcb497a

After 5 minutes:

Operations:   sync
Resources:    pup (repository)
State:        Successful
Start Time:   2014-07-16T14:07:36Z
Finish Time:  2014-07-16T14:07:37Z
Task Id:      e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:   sync
Resources:    pup1 (repository)
State:        Cancelled
Start Time:   Unstarted
Finish Time:  Incomplete
Task Id:      091a2bec-5c45-4494-b367-e1370bd23398

Operations:   publish
Resources:    pup (repository)
State:        Cancelled
Start Time:   Unstarted
Finish Time:  Incomplete
Task Id:      0c83b413-d55d-4549-8c34-0b1f6dcb497a

Version-Release number of selected component (if applicable):
2.4.0-0.24.beta

How reproducible:
always

Steps to Reproduce:
1. Have 1 worker.
2. Create and sync 2 repos.
3. Kill the worker.

Actual results:
The state of the task is 'Successful'.

Expected results:
The state of the task is 'Cancelled'.

Additional info:

--- Additional comment from on 2014-08-13 16:16:25 EDT ---

This BZ is very much related to [0], but it claims that a killed worker leaves the task state incorrect in a different way (by marking the task successful when it should be cancelled). Given the importance of Pulp 2.4.1 having all task states accurate, I'm adjusting the priority of this BZ to high so it can be fixed along with [0].

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1129858
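The reproduction depends on Pulp running exactly one worker. A minimal sketch of forcing that, assuming the /etc/init.d/pulp_workers configuration mechanism described in the verification comments of this bug (the function name and file-handling are illustrative, not an official tool):

```shell
# Hypothetical helper: pin Pulp to a single worker by editing its init
# config. Takes the config file path as an argument; on a real system
# this would be /etc/init.d/pulp_workers, followed by a
# `katello-service restart` (assumption based on this bug's comments).
set_single_worker() {
  config="$1"
  # Comment out any existing DEFAULT_PULP_CONCURRENCY line ...
  sed -i 's/^\(DEFAULT_PULP_CONCURRENCY=.*\)$/#\1/' "$config"
  # ... and append a setting that forces one worker.
  echo 'DEFAULT_PULP_CONCURRENCY=1' >> "$config"
}
```

Operating on a passed-in path keeps the sketch testable against a scratch copy before touching the live init script.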
Completing the triage of this bug and moving it to ON_QA (since it should be included as part of Snap7).
This looks to be a Pulp bug; please provide what needs to be tested on the Satellite 6 side to verify this.
I am assuming that cancelling a sync on the sync status page should show the state as "cancelled" instead of "sync complete". Is that correct?
@Kedar - The verification steps are to reduce Pulp to one worker:
1. Comment out the existing DEFAULT_PULP_CONCURRENCY line in /etc/init.d/pulp_workers and add: DEFAULT_PULP_CONCURRENCY=1
2. katello-service restart
3. Sync a new large repo.
4. Go to the Dynflow page and find the worker running the task on Actions::Pulp::Repository::Sync, e.g. queue: reserved_resource_worker-0.lab.bos.redhat.com.dq
5. ps -Af | grep reserved_resource_worker-0
6. kill -9 all the processes for that worker.
7. The task should go to stopped/error. (Currently it does not: the Pulp::Sync succeeds, and the task itself goes to paused/error.)
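Steps 5 and 6 above can be sketched as a small shell helper. This is illustrative only: the worker name comes from step 4, and the helper reads `ps -Af`-style output on stdin so it can be exercised without killing live processes:

```shell
# Print the PID column of processes whose command line mentions the
# worker identified in step 4 (reserved_resource_worker-0). Reads
# `ps -Af` output on stdin; the PID is the second field.
worker_pids() {
  awk '/reserved_resource_worker-0/ {print $2}'
}

# On a live system the sketch would be wired together as (not run here):
#   ps -Af | worker_pids | xargs -r kill -9
```

Note that a live pipeline would also match the `grep`/`awk` process itself unless the pattern is written to exclude it (e.g. `[r]eserved_resource_worker-0`).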
I am no longer able to access the task referenced in #5; however, I would generally expect it to be in a final state (e.g. stopped) with a result indicating warning or error. I agree that 'paused' state would be misleading as the user is not able to resume the task. From the behavior, it sounds like there may be a Sat6 change needed to ensure the proper state in dynflow.
The task isn't available anymore because testing this bug really messed up my Pulp instance for other testing, and I needed to do a katello-reset (and upgrade to compose 3 anyway). We have some work to do on cleaning up after bad things happen to our orchestration tasks, including this one when a worker is killed: it takes a while for us to notice, and when we do, we don't really do anything about it beyond pausing the task. Resuming isn't sufficient to correct the failure, either.
VERIFIED:

# rpm -qa | grep foreman
foreman-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman-tasks-0.6.12.5-1.el7sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el7sat.noarch
ibm-hs21-04.lab.bos.redhat.com-foreman-proxy-1.0-1.noarch
ruby193-rubygem-foreman_docker-1.2.0.14-1.el7sat.noarch
foreman-debug-1.7.2.25-1.el7sat.noarch
foreman-ovirt-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman-redhat_access-0.1.0-1.el7sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.7-1.el7sat.noarch
rubygem-hammer_cli_foreman_docker-0.0.3.6-1.el7sat.noarch
foreman-selinux-1.7.2.13-1.el7sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.13-1.el7sat.noarch
foreman-vmware-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el7sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.10-1.el7sat.noarch
foreman-proxy-1.7.2.4-1.el7sat.noarch
ibm-hs21-04.lab.bos.redhat.com-foreman-client-1.0-1.noarch
ibm-hs21-04.lab.bos.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-gce-1.7.2.25-1.el7sat.noarch
rubygem-hammer_cli_foreman-0.1.4.12-1.el7sat.noarch
foreman-compute-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.14-1.el7sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.4-1.el7sat.noarch
foreman-libvirt-1.7.2.25-1.el7sat.noarch
foreman-postgresql-1.7.2.25-1.el7sat.noarch

Steps:
1. Comment out the existing DEFAULT_PULP_CONCURRENCY line in /etc/init.d/pulp_workers and add: DEFAULT_PULP_CONCURRENCY=1
2. katello-service restart
3. Sync a new large repo.
4. Go to the Dynflow page and find the worker running the task on Actions::Pulp::Repository::Sync, e.g. queue: reserved_resource_worker-0.lab.bos.redhat.com.dq
5. ps -Af | grep reserved_resource_worker-0
6. kill -9 all the processes for that worker.
7. The task goes to stopped/error.
This bug is slated to be released with Satellite 6.1.
This bug was fixed in version 6.1.1 of Satellite which was released on 12 August, 2015.