Bug 1129877 - After killing the worker running a task the task status is successful
Summary: After killing the worker running a task the task status is successful
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Other
Version: Unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Stephen Benjamin
QA Contact: Tazim Kolhar
URL:
Whiteboard:
Depends On: 1120270
Blocks: sat6-pulp-blocker
 
Reported: 2014-08-13 20:19 UTC by Brian Bouterse
Modified: 2017-07-26 19:41 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1120270
Environment:
Last Closed: 2015-08-12 13:58:54 UTC



Description Brian Bouterse 2014-08-13 20:19:43 UTC
+++ This bug was initially created as a clone of Bug #1120270 +++

Description of problem:
After killing the worker running a task, the task halts immediately and the worker dies, but the status of the task is reported as successful.

before kill -9

Operations:  sync
Resources:   pup (repository)
State:       Running
Start Time:  2014-07-16T14:06:44Z
Finish Time: Incomplete
Task Id:     e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:  sync
Resources:   pup1 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete



right after kill -9 

Operations:  sync
Resources:   pup (repository)
State:       Successful
Start Time:  2014-07-16T14:07:36Z
Finish Time: 2014-07-16T14:07:37Z
Task Id:     e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:  sync
Resources:   pup1 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     091a2bec-5c45-4494-b367-e1370bd23398

Operations:  publish
Resources:   pup (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     0c83b413-d55d-4549-8c34-0b1f6dcb497a



after 5 mins

Operations:  sync
Resources:   pup (repository)
State:       Successful
Start Time:  2014-07-16T14:07:36Z
Finish Time: 2014-07-16T14:07:37Z
Task Id:     e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:  sync
Resources:   pup1 (repository)
State:       Cancelled
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     091a2bec-5c45-4494-b367-e1370bd23398

Operations:  publish
Resources:   pup (repository)
State:       Cancelled
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     0c83b413-d55d-4549-8c34-0b1f6dcb497a
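To follow a single task through the three listings above, a small filter can pull out its state by Task Id. This is only a sketch: it assumes the listing arrives on stdin in exactly the blank-line-separated "Field: value" layout shown above, and the function name task_state is hypothetical. Entries without a Task Id line are ignored.

```shell
#!/bin/sh
# task_state ID: read a task listing on stdin (entries separated by blank
# lines, one "Field: value" pair per line) and print the State of the entry
# whose Task Id equals ID. Prints nothing if no entry matches.
task_state() {
    awk -v id="$1" '
        BEGIN { RS = ""; FS = "\n" }    # paragraph mode: one entry per record
        {
            state = ""; task = ""
            for (i = 1; i <= NF; i++) {
                line = $i
                sub(/[ \t\r]+$/, "", line)              # drop trailing blanks
                if (line ~ /^State:/) {
                    sub(/^State:[ \t]*/, "", line); state = line
                } else if (line ~ /^Task Id:/) {
                    sub(/^Task Id:[ \t]*/, "", line); task = line
                }
            }
            if (task == id) print state
        }
    '
}
```

Fed the "right after kill -9" listing, task_state e5cc116b-9266-4afa-b144-b425bb7450cd would print Successful, which is the incorrect state this bug describes.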

Version-Release number of selected component (if applicable):
2.4.0-0.24.beta


How reproducible:
always

Steps to Reproduce:
1. have 1 worker
2. create and sync 2 repos
3. kill the worker

Actual results:
the state of the task is 'successful'

Expected results:
the state of the task is 'cancelled'

Additional info:

--- Additional comment from  on 2014-08-13 16:16:25 EDT ---

This BZ is closely related to [0], but it claims that a killed worker leaves the task state incorrect in a different way (by marking it successful when it should be cancelled). Given the importance of all task states being accurate in Pulp 2.4.1, I'm adjusting the priority of this BZ to high so it can be fixed along with [0].

[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129858

Comment 2 Brad Buckingham 2014-08-28 14:54:33 UTC
Completing the triage of this bug and moving it to ON_QA (since it should be included as part of Snap7).

Comment 3 Kedar Bidarkar 2014-09-01 09:26:51 UTC
This looks to be a Pulp bug. Please describe what needs to be tested in Satellite 6 to verify it.

Comment 4 Kedar Bidarkar 2014-09-01 09:29:13 UTC
I am assuming that canceling a sync on the sync status page should show the state as "cancelled" instead of "sync complete". Is that so?

Comment 6 Stephen Benjamin 2014-09-01 09:49:20 UTC
@Kedar - The verification steps are to reduce Pulp to one worker:

1. Comment out the existing DEFAULT_PULP_CONCURRENCY line in /etc/init.d/pulp_workers and add:
  DEFAULT_PULP_CONCURRENCY=1

2. katello-service restart

3. Sync a new large repo

4. Go to the Dynflow page, and look for the worker running the task on Actions::Pulp::Repository::Sync:

  queue: reserved_resource_worker-0@sat-perf-01.idm.lab.bos.redhat.com.dq


5. ps -Af | grep reserved_resource_worker-0

6. kill -9 all the processes for that worker

7. Task should go to stopped/error (although currently it doesn't: the Pulp::Sync succeeds, and the task itself goes to paused/error).
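
Steps 5 and 6 above can be combined into one helper. This is a sketch only: the function names worker_pids and kill_worker are hypothetical, and it assumes the worker name from the Dynflow page appears in the command line of every process belonging to that worker, as the grep in step 5 implies.

```shell
#!/bin/sh
# worker_pids NAME: read `ps -Af`-style lines on stdin and print the PID
# (second column) of every line that mentions NAME.
worker_pids() {
    awk -v name="$1" 'index($0, name) > 0 { print $2 }'
}

# kill_worker NAME: send SIGKILL to every process whose command line
# mentions NAME. The process list is captured before filtering, so the awk
# filter (which carries NAME in its own arguments) is not in the snapshot.
kill_worker() {
    snapshot=$(ps -Af)
    for pid in $(printf '%s\n' "$snapshot" | worker_pids "$1"); do
        kill -9 "$pid"
    done
}

# Usage (destructive, intended for a test machine only):
# kill_worker reserved_resource_worker-0
```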

Comment 8 Brad Buckingham 2014-09-02 13:17:40 UTC
I am no longer able to access the task referenced in #5; however, I would generally expect it to be in a final state (e.g. stopped) with a result indicating warning or error.  I agree that 'paused' state would be misleading as the user is not able to resume the task.

From the behavior, it sounds like there may be a Sat6 change needed to ensure the proper state in dynflow.

Comment 9 Stephen Benjamin 2014-09-02 13:32:06 UTC
The task isn't available anymore because testing this bug really messed up my pulp instance for other testing and I needed to do a katello-reset (and upgrade to compose 3 anyway).

We have some work to do about cleaning up after bad things happen to some of our orchestration tasks, including this one when a worker is killed -- it takes a while for us to notice, and we don't really do anything about it, just pause the task.  Resuming isn't sufficient to correct the failure, either.

Comment 16 Tazim Kolhar 2015-06-02 09:07:37 UTC
VERIFIED:
# rpm -qa | grep foreman
foreman-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman-tasks-0.6.12.5-1.el7sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el7sat.noarch
ibm-hs21-04.lab.bos.redhat.com-foreman-proxy-1.0-1.noarch
ruby193-rubygem-foreman_docker-1.2.0.14-1.el7sat.noarch
foreman-debug-1.7.2.25-1.el7sat.noarch
foreman-ovirt-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman-redhat_access-0.1.0-1.el7sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.7-1.el7sat.noarch
rubygem-hammer_cli_foreman_docker-0.0.3.6-1.el7sat.noarch
foreman-selinux-1.7.2.13-1.el7sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.13-1.el7sat.noarch
foreman-vmware-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el7sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.10-1.el7sat.noarch
foreman-proxy-1.7.2.4-1.el7sat.noarch
ibm-hs21-04.lab.bos.redhat.com-foreman-client-1.0-1.noarch
ibm-hs21-04.lab.bos.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-gce-1.7.2.25-1.el7sat.noarch
rubygem-hammer_cli_foreman-0.1.4.12-1.el7sat.noarch
foreman-compute-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.14-1.el7sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.4-1.el7sat.noarch
foreman-libvirt-1.7.2.25-1.el7sat.noarch
foreman-postgresql-1.7.2.25-1.el7sat.noarch

steps:
1. Comment out the existing DEFAULT_PULP_CONCURRENCY line in /etc/init.d/pulp_workers and add:
  DEFAULT_PULP_CONCURRENCY=1
2. katello-service restart
3. Sync a new large repo
4. Go to the Dynflow page, and look for the worker running the task on Actions::Pulp::Repository::Sync:

  queue: reserved_resource_worker-0@sat-perf-01.idm.lab.bos.redhat.com.dq
5. ps -Af | grep reserved_resource_worker-0
6. kill -9 all the processes for that worker
7. Task should go to stopped/error
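
Step 1 can also be scripted rather than edited by hand. The sketch below assumes the setting is a plain shell assignment that starts its own line (DEFAULT_PULP_CONCURRENCY=...); the actual layout of /etc/init.d/pulp_workers is not shown in this report, and the function name single_worker_config is hypothetical. It writes the edited script to stdout and leaves the original file untouched.

```shell
#!/bin/sh
# single_worker_config FILE: print FILE to stdout with every line that
# assigns DEFAULT_PULP_CONCURRENCY commented out and followed by a
# replacement line that sets the value to 1. FILE itself is not modified.
single_worker_config() {
    awk '
        /^[ \t]*DEFAULT_PULP_CONCURRENCY=/ {
            print "#" $0                        # keep the old line, commented
            print "DEFAULT_PULP_CONCURRENCY=1"  # single-worker override
            next
        }
        { print }
    ' "$1"
}

# Usage (review the result before installing it over the original):
# single_worker_config /etc/init.d/pulp_workers > /tmp/pulp_workers.new
```

The replacement line is placed directly after the commented-out one so that it takes effect at the same point in the script as the original assignment.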

Comment 17 Bryan Kearney 2015-08-11 13:23:53 UTC
This bug is slated to be released with Satellite 6.1.

Comment 18 Bryan Kearney 2015-08-12 13:58:54 UTC
This bug was fixed in version 6.1.1 of Satellite which was released on 12 August, 2015.

