Bug 1120270
| Summary: | After killing the worker running a task the task status is successful | | |
| --- | --- | --- | --- |
| Product: | [Retired] Pulp | Reporter: | Ina Panova <ipanova> |
| Component: | async/tasks | Assignee: | Brian Bouterse <bmbouter> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ina Panova <ipanova> |
| Severity: | high | Priority: | high |
| Version: | 2.4 Beta | Target Release: | 2.4.1 |
| Hardware: | Unspecified | OS: | Unspecified |
| CC: | bcourt, bmbouter, ipanova, mhrivnak | Keywords: | Triaged |
| Type: | Bug | Doc Type: | Bug Fix |
| Last Closed: | 2014-09-23 17:54:13 UTC | Bug Blocks: | 1129877, 1131719 |
Description (Ina Panova, 2014-07-16 15:16:06 UTC)
This BZ is closely related to [0], but it claims that a killed worker leaves the task state incorrect in a different way: by marking the task successful when it should be cancelled. Given the importance of Pulp 2.4.1 having all task states accurate, I'm raising the priority of this BZ to high so it can be fixed along with [0].

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1129858

Ina, can you describe how you selected and killed the worker process? Each worker is made up of two processes. Which one did you kill?

I killed the parent process. For example:

```
# ps auxf | grep celery
root      6936  0.0  0.0 112668   964 pts/0 S+  08:15 0:00  \_ grep --color=auto celery
apache   20089  0.2  0.6 774280 26132 ?     Ssl Aug18 6:44 /usr/bin/python /usr/bin/celery beat --scheduler=pulp.server.async.scheduler.Scheduler
apache   20107  0.2  1.2 638876 49288 ?     Ssl Aug18 5:16 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18
apache   20141  0.0  1.0 564628 42228 ?     Sl  Aug18 0:00  \_ /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18
apache    6791  0.7  1.2 638692 49020 ?     Ssl 08:12 0:01 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18
apache    6801  0.3  1.7 946744 65608 ?     Sl  08:12 0:00  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18
# kill -9 6791
```

For simplicity, assume there is exactly one worker in the system; I've configured my system that way for testing. That one worker has two processes, a parent and a child. I then do this:

1) I start a long-running sync, which gets assigned to that worker.
2) I kill both the parent and child at the same time: `pkill -9 -f reserved_resource_worker`
3) I observe the task state just afterwards with `pulp-admin tasks list`. I see the task is still in the running state (as expected).
If I wait 5-6 minutes, the worker will time out, assuming the pulp_celerybeat component is running. Once the timeout occurs, the task state goes to Cancelled. If I restart the worker instead, then because of the fix from [0] the task will also show up as cancelled just after the worker starts.

Next I need to investigate killing just the parent...

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1129858

In the same situation as above, if I kill just the parent process, the task currently in progress updates to 'Failed'. I'm moving this to ON_QA because I believe the task state is correct in all situations. This BZ was originally filed against a 2.4.0 beta, and the task status updating has been reworked since then.

(In reply to bbouters from comment #4)

I think there is a difference in how kill and pkill kill the processes, i.e. the order in which the processes are killed. When I use `kill -9 <parent pid>`, *both* parent and child are killed, but the parent is killed first, then the child. If I use `pkill -9 -f reserved_resource_worker`, the child is killed first, then the parent.
So I tried to first kill the child, then the parent (to confirm my theory):

```
kill -9 <child pid>
kill -9 <parent pid>
```

In this order, everything works as you described in step 3. As far as I understand, for everything to work properly the *child* must be killed first, then the *parent*, and the 'kill' command does not follow this order.

(In reply to bbouters from comment #5) I suspect you received the 'Failed' status due to this bz [0], if you were syncing a puppet repo.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1131062

Based on comments 7 and 8, I am moving this bz back to Assigned. I still see the same behaviour as stated in the description if I use the 'kill' command:

```
# ps ax -o "%p %r %n %c" | grep celery
  PID  PGID  NI COMMAND
16032 16032   0 celery
16050 16050   0 celery
16067 16067   0 celery
16086 16050   0 celery
16091 16067   0 celery

# systemctl status pulp_worker-0
pulp_worker-0.service - Pulp Worker #0
   Loaded: loaded (/run/systemd/system/pulp_worker-0.service; static)
   Active: active (running) since Mon 2014-08-25 10:42:06 UTC; 33min ago
 Main PID: 16067 (celery)
   CGroup: /system.slice/pulp_worker-0.service
           ├─16067 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18
           └─16091 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18

Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - ** ---------- [config]
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - ** ---------- .> app: tasks:0x2beaf90
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - ** ---------- .> transport: qpid://guest.compute.amazonaws.com:5672//
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - ** ---------- .> results: mongodb
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: - *** --- * --- .> concurrency: 1 (prefork)
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: -- ******* ----
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: --- ***** ----- [queues]
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: -------------- .> celery exchange=celery(direct) key=celery
Aug 25 10:42:08 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com celery[16067]: .> reserved_resource_worker-0.compute.amazonaws.com.dq exchange=C.dq(dir...naws.com
Aug 25 10:42:09 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16067]: celery.worker.consumer:INFO: Connected to qpid://guest.compute.amazonaws.com:5672//
Hint: Some lines were ellipsized, use -l to show in full.

# kill -9 16067
```

NOTE: both processes (parent + child) were killed after `kill -9 <parent pid>`:

```
[root@ec2-54-74-16-107 ~]# ps ax -o "%p %r %n %c" | grep celery
  PID  PGID  NI COMMAND
16032 16032   0 celery
16050 16050   0 celery
16086 16050   0 celery

# systemctl status pulp_worker-0
pulp_worker-0.service - Pulp Worker #0
   Loaded: loaded (/run/systemd/system/pulp_worker-0.service; static)
   Active: failed (Result: signal) since Mon 2014-08-25 11:16:38 UTC; 18s ago
  Process: 16067 ExecStart=/usr/bin/celery worker -n reserved_resource_worker-0@%%h -A pulp.server.async.app -c 1 --events --umask 18 (code=killed, signal=KILL)
 Main PID: 16067 (code=killed, signal=KILL)

Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com systemd[1]: pulp_worker-0.service: main process exited, code=killed, status=9/KILL
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING: /usr/lib/python2.7/site-packages/celery/result.py:45: RuntimeWarning: Never cal...n a task!
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING: See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING:
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING: In Celery 3.2 this will result in an exception being
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING: raised instead of just being a warning.
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING:
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING: warnings.warn(RuntimeWarning(E_WOULDBLOCK))
Aug 25 11:16:37 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com pulp[16091]: py.warnings:WARNING:
Aug 25 11:16:38 ec2-54-74-16-107.eu-west-1.compute.amazonaws.com systemd[1]: Unit pulp_worker-0.service entered failed state.
Hint: Some lines were ellipsized, use -l to show in full.
```

The state of the sync task changed from 'running' to 'successful'.

Moving it to Verified.

Brian found out that when killing just the parent process with `kill -9 <parent pid>`, the child is not killed directly; instead a TERM is sent to it, so it can either exit immediately or continue to run, finish the task, and then exit.
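The finding above is consistent with basic POSIX process semantics: the kernel does not forward a SIGKILL delivered to the parent down to its children, so any TERM the child sees must come from the worker's own parent-monitoring logic, after which the child may finish its task before exiting. The kernel-level half can be demonstrated with hypothetical stand-in processes (a plain shell and a sleep, which have no such monitoring):

```shell
# Parent shell forks a single child; neither forwards signals to the other.
sh -c 'sleep 30 & wait' &
PARENT=$!
sleep 1                            # let the parent fork its child
CHILD=$(pgrep -P "$PARENT")

kill -9 "$PARENT"                  # SIGKILL the parent only
wait "$PARENT" 2>/dev/null || true # reap it so kill -0 reports it gone
sleep 1

# The child keeps running: it was reparented to init, not signalled.
kill -0 "$CHILD" 2>/dev/null && echo "child still alive"
```

With the real worker, the orphaned child is exactly the process that can keep running long enough to mark the task successful, which is the behaviour this BZ describes.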
So the right way to kill the worker is:

```
# systemctl status pulp_worker-0
pulp_worker-0.service - Pulp Worker #0
   Loaded: loaded (/run/systemd/system/pulp_worker-0.service; static)
   Active: active (running) since Mon 2014-08-25 21:38:50 UTC; 10h ago
 Main PID: 21039 (celery)
   CGroup: /system.slice/pulp_worker-0.service
           ├─21039 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18
           └─21102 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18

[root@ec2-54-75-164-84 ~]# kill -9 21039 21102

[root@ec2-54-75-164-84 ~]# pulp-admin -u admin -p admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  sync
Resources:   pup (repository)
State:       Running
Start Time:  2014-08-26T07:46:55Z
Finish Time: Incomplete
Task Id:     285680d1-22b0-4829-bce1-4e0bbf514799
```

After 5 minutes:

```
# pulp-admin -u admin -p admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  sync
Resources:   pup (repository)
State:       Cancelled
Start Time:  2014-08-26T07:46:55Z
Finish Time: Incomplete
Task Id:     285680d1-22b0-4829-bce1-4e0bbf514799
```

`pkill -9 -f reserved_resource_worker` also works as expected. Tested in 2.4.1-0.3.beta.

This is fixed in Pulp-2.4.1-1.