1129877 – After killing the worker running a task the task status is successful

Summary: After killing the worker running a task the task status is successful

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Other
Sub Component:
Version:	Unspecified
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	Unspecified
Assignee:	Stephen Benjamin
QA Contact:	Tazim Kolhar
Docs Contact:
URL:
Whiteboard:
Depends On:	1120270
Blocks:	sat6-pulp-blocker

Reported:	2014-08-13 20:19 UTC by Brian Bouterse
Modified:	2017-07-26 19:41 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	1120270
Environment:
Last Closed:	2015-08-12 13:58:54 UTC
Target Upstream Version:
Embargoed:

Description Brian Bouterse 2014-08-13 20:19:43 UTC

+++ This bug was initially created as a clone of Bug #1120270 +++

Description of problem:
After killing the worker running a task, the task halts immediately and the worker dies, but the task status is reported as successful.

Before kill -9:

Operations:  sync
Resources:   pup (repository)
State:       Running
Start Time:  2014-07-16T14:06:44Z
Finish Time: Incomplete
Task Id:     e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:  sync
Resources:   pup1 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete



Right after kill -9:

Operations:  sync
Resources:   pup (repository)
State:       Successful
Start Time:  2014-07-16T14:07:36Z
Finish Time: 2014-07-16T14:07:37Z
Task Id:     e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:  sync
Resources:   pup1 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     091a2bec-5c45-4494-b367-e1370bd23398

Operations:  publish
Resources:   pup (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     0c83b413-d55d-4549-8c34-0b1f6dcb497a



After 5 minutes:

Operations:  sync
Resources:   pup (repository)
State:       Successful
Start Time:  2014-07-16T14:07:36Z
Finish Time: 2014-07-16T14:07:37Z
Task Id:     e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:  sync
Resources:   pup1 (repository)
State:       Cancelled
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     091a2bec-5c45-4494-b367-e1370bd23398

Operations:  publish
Resources:   pup (repository)
State:       Cancelled
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     0c83b413-d55d-4549-8c34-0b1f6dcb497a

Version-Release number of selected component (if applicable):
2.4.0-0.24.beta


How reproducible:
always

Steps to Reproduce:
1. Have one worker.
2. Create and sync two repos.
3. Kill the worker (kill -9) while the first sync is running.

Actual results:
The state of the task is 'successful'.

Expected results:
The state of the task is 'cancelled'.
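A minimal Python sketch of the expected behavior, assuming a simplified task model. The `State`, `Task` and `after_worker_death` names are hypothetical and do not come from Pulp's code. The rule it encodes: a task assigned to a killed worker may stay 'Successful' only if the worker actually reported completing it. Every other task the worker held, running or waiting, ends up 'Cancelled'.

```python
from dataclasses import dataclass, replace
from enum import Enum


class State(Enum):
    # Subset of the states shown in the task listings above.
    WAITING = "Waiting"
    RUNNING = "Running"
    SUCCESSFUL = "Successful"
    CANCELLED = "Cancelled"


@dataclass(frozen=True)
class Task:
    task_id: str
    worker: str
    state: State
    completed: bool = False  # hypothetical flag: True only if the worker reported finishing


def after_worker_death(tasks: list[Task], dead_worker: str) -> list[Task]:
    """Return the states the tasks should have once dead_worker has been killed."""
    result = []
    for task in tasks:
        if task.worker == dead_worker and not task.completed:
            # Interrupted or never started: must not be reported as successful.
            task = replace(task, state=State.CANCELLED)
        result.append(task)
    return result
```

Suppose the running sync of pup and the waiting sync of pup1 are both assigned to the killed worker. Under this rule, both come back as `State.CANCELLED`. Tasks on other workers are left untouched.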
Additional info:

--- Additional comment from  on 2014-08-13 16:16:25 EDT ---

This BZ is closely related to [0], but it reports that a killed worker leaves the task state incorrect in a different way (marking it successful when it should be cancelled). Given the importance of all task states being accurate in Pulp 2.4.1, I'm raising the priority of this BZ to high so it can be fixed along with [0].

[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129858

Comment 2 Brad Buckingham 2014-08-28 14:54:33 UTC

Completing the triage of this bug and moving it to ON_QA (since it should be included as part of Snap7).

Comment 3 Kedar Bidarkar 2014-09-01 09:26:51 UTC

This looks like a Pulp bug. Please describe what needs to be tested on Satellite 6 to verify it.

Comment 4 Kedar Bidarkar 2014-09-01 09:29:13 UTC

I assume that cancelling a sync on the sync status page should show the state as "cancelled" instead of "sync complete". Is that correct?

Comment 6 Stephen Benjamin 2014-09-01 09:49:20 UTC

@Kedar - To verify, reduce Pulp to a single worker and then kill that worker during a sync:

1. Comment out the existing DEFAULT_PULP_CONCURRENCY line in /etc/init.d/pulp_workers and add:
  DEFAULT_PULP_CONCURRENCY=1

2. katello-service restart

3. Sync a new large repo

4. On the Dynflow page, find the Actions::Pulp::Repository::Sync action and note the queue of the worker running it:

  queue: reserved_resource_worker-0.lab.bos.redhat.com.dq


5. ps -Af | grep reserved_resource_worker-0

6. kill -9 all the processes for that worker

7. The task should go to stopped/error. (Currently it does not: the Pulp::Sync succeeds, and the task itself goes to paused/error.)
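Steps 5 and 6 can also be scripted. The sketch below assumes the standard `ps -Af` column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD). It matches the worker name as a plain substring of the command line. `worker_pids`, `live_ps_output` and `kill_all` are hypothetical helpers and are not part of Pulp or Katello. Unlike `ps -Af | grep`, the sketch skips the grep process itself, and it defaults to a dry run.

```python
import os
import signal
import subprocess


def worker_pids(ps_output: str, worker: str) -> list[int]:
    """Return the PIDs in `ps -Af` output whose command line mentions `worker`."""
    pids = []
    for line in ps_output.splitlines():
        # ps -Af columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header row or malformed line
        pid, cmd = int(fields[1]), fields[7]
        if worker in cmd and not cmd.startswith("grep"):
            pids.append(pid)
    return pids


def live_ps_output() -> str:
    """Capture `ps -Af` output on the local machine."""
    return subprocess.run(["ps", "-Af"], capture_output=True, text=True, check=True).stdout


def kill_all(pids: list[int], dry_run: bool = True) -> list[int]:
    """Send SIGKILL (the kill -9 of step 6) to every PID; with dry_run, only report them."""
    for pid in pids:
        if dry_run:
            print(f"would run: kill -9 {pid}")
        else:
            os.kill(pid, signal.SIGKILL)
    return pids
```

For example, `kill_all(worker_pids(live_ps_output(), "reserved_resource_worker-0"), dry_run=False)` would kill every process of that worker. Run it on the Satellite host with sufficient privileges.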

Comment 8 Brad Buckingham 2014-09-02 13:17:40 UTC

I am no longer able to access the task referenced in #5; however, I would generally expect it to be in a final state (e.g. stopped) with a result indicating warning or error. I agree that the 'paused' state would be misleading, as the user is not able to resume the task.

Based on this behavior, a Sat6 change may be needed to ensure the proper state in Dynflow.

Comment 9 Stephen Benjamin 2014-09-02 13:32:06 UTC

The task is no longer available because testing this bug left my Pulp instance unusable for other testing, so I had to run katello-reset (and I needed to upgrade to compose 3 anyway).

We have work to do on cleaning up after failures in some of our orchestration tasks, including this one, where a worker is killed. It takes a while for us to notice, and when we do, we don't really handle it: we just pause the task. Resuming isn't sufficient to correct the failure, either.
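Comments 8 and 9 describe two separate gaps. The first is noticing that a worker has died. The second is finishing the task in a final state instead of pausing it. A hypothetical sketch of both follows. The heartbeat map, the five-minute `WORKER_TIMEOUT` and the function names are illustrative assumptions, not how Pulp or Dynflow actually implement this.

```python
from datetime import datetime, timedelta

# Hypothetical timeout, chosen for illustration only.
WORKER_TIMEOUT = timedelta(minutes=5)


def missing_workers(last_heartbeat: dict[str, datetime], now: datetime,
                    timeout: timedelta = WORKER_TIMEOUT) -> list[str]:
    """Names of workers whose most recent heartbeat is older than `timeout`."""
    return sorted(name for name, seen in last_heartbeat.items() if now - seen > timeout)


def dynflow_outcome(pulp_state: str) -> tuple[str, str]:
    """Map a final Pulp task state to a (Dynflow state, result) pair.

    Anything other than success ends as stopped/error rather than paused,
    because resuming cannot repair a sync whose worker was killed.
    """
    if pulp_state == "successful":
        return ("stopped", "success")
    return ("stopped", "error")
```

Comment 8 allows either warning or error as the result. The sketch picks error for a cancelled or failed sync.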

Comment 16 Tazim Kolhar 2015-06-02 09:07:37 UTC

VERIFIED:
# rpm -qa | grep foreman
foreman-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman-tasks-0.6.12.5-1.el7sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el7sat.noarch
ibm-hs21-04.lab.bos.redhat.com-foreman-proxy-1.0-1.noarch
ruby193-rubygem-foreman_docker-1.2.0.14-1.el7sat.noarch
foreman-debug-1.7.2.25-1.el7sat.noarch
foreman-ovirt-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman-redhat_access-0.1.0-1.el7sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.7-1.el7sat.noarch
rubygem-hammer_cli_foreman_docker-0.0.3.6-1.el7sat.noarch
foreman-selinux-1.7.2.13-1.el7sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.13-1.el7sat.noarch
foreman-vmware-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el7sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.10-1.el7sat.noarch
foreman-proxy-1.7.2.4-1.el7sat.noarch
ibm-hs21-04.lab.bos.redhat.com-foreman-client-1.0-1.noarch
ibm-hs21-04.lab.bos.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-gce-1.7.2.25-1.el7sat.noarch
rubygem-hammer_cli_foreman-0.1.4.12-1.el7sat.noarch
foreman-compute-1.7.2.25-1.el7sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.14-1.el7sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.4-1.el7sat.noarch
foreman-libvirt-1.7.2.25-1.el7sat.noarch
foreman-postgresql-1.7.2.25-1.el7sat.noarch

Steps:
1. Comment out the existing DEFAULT_PULP_CONCURRENCY line in /etc/init.d/pulp_workers and add:
  DEFAULT_PULP_CONCURRENCY=1
2. katello-service restart
3. Sync a new large repo
4. On the Dynflow page, find the Actions::Pulp::Repository::Sync action and note the queue of the worker running it:

  queue: reserved_resource_worker-0.lab.bos.redhat.com.dq
5. ps -Af | grep reserved_resource_worker-0
6. kill -9 all the processes for that worker
7. The task should go to stopped/error.

Comment 17 Bryan Kearney 2015-08-11 13:23:53 UTC

This bug is slated to be released with Satellite 6.1.

Comment 18 Bryan Kearney 2015-08-12 13:58:54 UTC

This bug was fixed in version 6.1.1 of Satellite which was released on 12 August, 2015.
