Bug 1120270

Summary:	After killing the worker running a task the task status is successful
Product:	[Retired] Pulp	Reporter:	Ina Panova <ipanova>
Component:	async/tasks	Assignee:	Brian Bouterse <bmbouter>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Ina Panova <ipanova>
Severity:	high	Docs Contact:
Priority:	high
Version:	2.4 Beta	CC:	bcourt, bmbouter, ipanova, mhrivnak
Target Milestone:	---	Keywords:	Triaged
Target Release:	2.4.1
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	1129877 (view as bug list)		Environment:
Last Closed:	2014-09-23 17:54:13 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1129877, 1131719

Description Ina Panova 2014-07-16 15:16:06 UTC

Description of problem:
After killing worker running a task,the task halts immediately, and the worker dies, but the status of the task is successful.

before kill -9

Operations:  sync
Resources:   pup (repository)
State:       Running
Start Time:  2014-07-16T14:06:44Z
Finish Time: Incomplete
Task Id:     e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:  sync
Resources:   pup1 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete



right after kill -9 

Operations:  sync
Resources:   pup (repository)
State:       Successful
Start Time:  2014-07-16T14:07:36Z
Finish Time: 2014-07-16T14:07:37Z
Task Id:     e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:  sync
Resources:   pup1 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     091a2bec-5c45-4494-b367-e1370bd23398

Operations:  publish
Resources:   pup (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     0c83b413-d55d-4549-8c34-0b1f6dcb497a



after 5mins

Operations:  sync
Resources:   pup (repository)
State:       Successful
Start Time:  2014-07-16T14:07:36Z
Finish Time: 2014-07-16T14:07:37Z
Task Id:     e5cc116b-9266-4afa-b144-b425bb7450cd

Operations:  sync
Resources:   pup1 (repository)
State:       Cancelled
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     091a2bec-5c45-4494-b367-e1370bd23398

Operations:  publish
Resources:   pup (repository)
State:       Cancelled
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     0c83b413-d55d-4549-8c34-0b1f6dcb497a

Version-Release number of selected component (if applicable):
2.4.0-0.24.beta


How reproducible:
always

Steps to Reproduce:
1. have 1 worker
2. create and sync 2 repos
3. kill the worker

Actual results:
the state of the task is 'successful'

Expected results:

the state of the task is 'cancelled'
Additional info:

Comment 1 Brian Bouterse 2014-08-13 20:16:25 UTC

This BZ is very much related to [0], but it does claim that killed worker leaves task state incorrect in a different way (by marking it successful when it should be cancelled). Given the importance of Pulp 2.4.1 having all task states being accurate I'm adjusting the priority of this BZ to high so it can be fixed along with [0].

[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129858

Comment 2 Barnaby Court 2014-08-19 22:00:12 UTC

Ina, can you list the how you selected & killed the worker process?

Each worker is made up of two processes.  Which one did you kill?

Comment 3 Ina Panova 2014-08-20 08:30:26 UTC

I killed the parent process.
For example:

# ps auxf | grep celery
root      6936  0.0  0.0 112668   964 pts/0    S+   08:15   0:00          \_ grep --color=auto celery
apache   20089  0.2  0.6 774280 26132 ?        Ssl  Aug18   6:44 /usr/bin/python /usr/bin/celery beat --scheduler=pulp.server.async.scheduler.Scheduler
apache   20107  0.2  1.2 638876 49288 ?        Ssl  Aug18   5:16 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18
apache   20141  0.0  1.0 564628 42228 ?        Sl   Aug18   0:00  \_ /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18
apache    6791  0.7  1.2 638692 49020 ?        Ssl  08:12   0:01 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18
apache    6801  0.3  1.7 946744 65608 ?        Sl   08:12   0:00  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18

# kill -9 6791

Comment 4 Brian Bouterse 2014-08-22 19:23:38 UTC

For simplicity, assume there is exactly 1 worker in the system. I've configured my system that way for testing. That 1 worker has 2 processes a parent and a child. I then do this:

1) I start a long-running sync which gets assigned to that worker
2) I kill both the parent and child at the same time `pkill -9 -f reserved_resource_worker`
3) I observe the task state just afterwards `pulp-admin tasks list` I see the task is still in the running state (as expected).


If I wait 5-6 minutes the worker will timeout assuming the pulp_celerybeat component is running. Once the timeout occurs the task state will go to Cancelled. If I restart the worker because of the fix from [0], the work will also show up as cancelled just after starting the worker.

Next I need to investigate just killing the parent...

[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129858

Comment 5 Brian Bouterse 2014-08-22 20:13:12 UTC

In the same situation as above, if I kill just the parent process the task currently in progress updates to 'Failed'.

Comment 6 Brian Bouterse 2014-08-22 20:19:51 UTC

I'm moving this to ON_QA because I believe the task state in all situations is correct. This BZ was originally filed against a 2.4.0-beta, and the task status updating has been reworked some from then.

Comment 7 Ina Panova 2014-08-25 11:06:20 UTC

(In reply to bbouters from comment #4)
> For simplicity, assume there is exactly 1 worker in the system. I've
> configured my system that way for testing. That 1 worker has 2 processes a
> parent and a child. I then do this:
> 
> 1) I start a long-running sync which gets assigned to that worker
> 2) I kill both the parent and child at the same time `pkill -9 -f
> reserved_resource_worker`
> 3) I observe the task state just afterwards `pulp-admin tasks list` I see
> the task is still in the running state (as expected).
> 
> 
> If I wait 5-6 minutes the worker will timeout assuming the pulp_celerybeat
> component is running. Once the timeout occurs the task state will go to
> Cancelled. If I restart the worker because of the fix from [0], the work
> will also show up as cancelled just after starting the worker.
> 
> Next I need to investigate just killing the parent...
> 
> [0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129858

I think there is a difference how kill and pkill kills the processes, i.e. the order of process killing.

When I use:

kill -9 'parent pid'

*Both* parent and child are killed, but first parent is killed, then child.

If I use:
pkill -9 -f reserved_resource_worker

First child then parent are killed.

So I tried to first kill child then parent(to confirm my theory):

kill -9 'child pid'
kill -9 'parent pid'

And in this order everything works as you described in step 3.

As far as I understand to have everything work properly first *child* then *parent* should be killed, and command 'kill' does not follow this order.

Comment 8 Ina Panova 2014-08-25 11:10:39 UTC

(In reply to bbouters from comment #5)
> In the same situation as above, if I kill just the parent process the task
> currently in progress updates to 'Failed'.

I can suppose that you received status 'Failed' due to this bz[0] if you were syncing puppet repo.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1131062

Comment 9 Ina Panova 2014-08-25 11:21:47 UTC

Based on comment 7 and 8, I am moving back this bz to Assigned.

I still have same behaviour as stated in description if I use command 'kill':

ps ax -o  "%p %r %n %c "|grep celery
PID  PGID  NI COMMAND
16032 16032   0 celery          
16050 16050   0 celery          
16067 16067   0 celery          
16086 16050   0 celery          
16091 16067   0 celery 


# systemctl status pulp_worker-0
pulp_worker-0.service - Pulp Worker #0
   Loaded: loaded (/run/systemd/system/pulp_worker-0.service; static)
   Active: active (running) since Mon 2014-08-25 10:42:06 UTC; 33min ago
 Main PID: 16067 (celery)
   CGroup: /system.slice/pulp_worker-0.service
           ├─16067 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18
           └─16091 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18

Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - ** ---------- [config]
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - ** ---------- .> app:         tasks:0x2beaf90
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - ** ---------- .> transport:   qpid://guest.compute.amazonaws.com:5672//
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - ** ---------- .> results:     mongodb
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - *** --- * --- .> concurrency: 1 (prefork)
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: -- ******* ----
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: --- ***** ----- [queues]
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: -------------- .> celery           exchange=celery(direct) key=celery
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: .> reserved_resource_worker-0.compute.amazonaws.com.dq exchange=C.dq(dir...naws.com
Aug 25 10:42:09 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16067]: celery.worker.consumer:INFO: Connected to qpid://guest.compute.amazonaws.com:5672//
Hint: Some lines were ellipsized, use -l to show in full.



# kill -9 16067

NOTE: both processes(parent + child)  were killed after 'kill -9 "parent pid"'

[root@ec2-54-74-16-107 ~]# ps ax -o  "%p %r %n %c "|grep celery
PID  PGID  NI COMMAND
16032 16032   0 celery          
16050 16050   0 celery          
16086 16050   0 celery         



# systemctl status pulp_worker-0
pulp_worker-0.service - Pulp Worker #0
   Loaded: loaded (/run/systemd/system/pulp_worker-0.service; static)
   Active: failed (Result: signal) since Mon 2014-08-25 11:16:38 UTC; 18s ago
  Process: 16067 ExecStart=/usr/bin/celery worker -n reserved_resource_worker-0@%%h -A pulp.server.async.app -c 1 --events --umask 18 (code=killed, signal=KILL)
 Main PID: 16067 (code=killed, signal=KILL)

Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com systemd[1]: pulp_worker-0.service: main process exited, code=killed, status=9/KILL
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING: /usr/lib/python2.7/site-packages/celery/result.py:45: RuntimeWarning: Never cal...n a task!
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING: See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING:
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING: In Celery 3.2 this will result in an exception being
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING: raised instead of just being a warning.
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING:
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING:   warnings.warn(RuntimeWarning(E_WOULDBLOCK))
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING:
Aug 25 11:16:38 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com systemd[1]: Unit pulp_worker-0.service entered failed state.
Hint: Some lines were ellipsized, use -l to show in full.

State of the sync task changed 'running'---> 'successful'.

Comment 10 Ina Panova 2014-08-26 08:06:29 UTC

Moving it to Verified.

Brian found out that when killing just the parent process "kill -9 'parent pid'" child is not killed, but a TERM is sent to it. So it can exit immediately or continue to run,finish the task and then exit.

So the right way of killing the worker is :

# systemctl status pulp_worker-0
pulp_worker-0.service - Pulp Worker #0
   Loaded: loaded (/run/systemd/system/pulp_worker-0.service; static)
   Active: active (running) since Mon 2014-08-25 21:38:50 UTC; 10h ago
 Main PID: 21039 (celery)
   CGroup: /system.slice/pulp_worker-0.service
           ├─21039 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18
           └─21102 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18

[root@ec2-54-75-164-84 ~]# kill -9 21039 21102
[root@ec2-54-75-164-84 ~]# pulp-admin -u admin -p admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  sync
Resources:   pup (repository)
State:       Running
Start Time:  2014-08-26T07:46:55Z
Finish Time: Incomplete
Task Id:     285680d1-22b0-4829-bce1-4e0bbf514799

After 5mins:
# pulp-admin -u admin -p admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  sync
Resources:   pup (repository)
State:       Cancelled
Start Time:  2014-08-26T07:46:55Z
Finish Time: Incomplete
Task Id:     285680d1-22b0-4829-bce1-4e0bbf514799


'pkill -9 -f reserved_resource_worker' also works as supposed.

Tested in 2.4.1-0.3.beta.

Comment 11 Randy Barlow 2014-09-23 17:54:13 UTC

This is fixed in Pulp-2.4.1-1.