Bug 1105195
| Summary: | pulp might leak async tasks in case worker exited abnormally | | |
|---|---|---|---|
| Product: | [Retired] Pulp | Reporter: | mkovacik |
| Component: | async/tasks | Assignee: | Brian Bouterse <bmbouter> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ina Panova <ipanova> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | medium | | |
| Version: | 2.4 Beta | CC: | bmbouter, ipanova, skarmark |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 2.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-08-09 06:56:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 902567 [details]
successful reproducer screen log
Originally the delay was put in place because Pulp's use of Celery workers made heavy use of Celery's management and control commands [0]. When I was load testing babysit(), the workers were continually subscribing to and unsubscribing from the same queue, which had auto-delete enabled. Introducing the delay mostly avoided the problem, unless the same worker unsubscribed and then re-subscribed to the same queue exactly "delay" seconds later. We no longer need this behavior because babysit() has been removed and, with the exception of task revocation, no Celery management and control commands are used anymore. revoke() is a broadcast command that does not expect replies, so even many revocations should not cause a problem once the timeout is removed.

I'm going to remove the use of qpid.auto_delete_timeout [1] from the code. It doesn't solve the whole problem, but it is a behavioral change that needs to be made. It should cause the queues to be deleted immediately, which will cause the dispatcher to raise an exception when it tries to dispatch work to a worker that died unexpectedly. That will be resolved through a broader effort to ensure that the Resource Manager recovers from these types of situations.

[0]: http://celery.readthedocs.org/en/latest/userguide/monitoring.html#management-command-line-utilities-inspect-control
[1]: https://github.com/pulp/kombu/blob/fb96b61a4e58d0bfc3d6d5539cdf9228e12fc115/kombu/transport/qpid.py#L771

Regarding task cancellation: when a worker goes missing, its tasks should be canceled automatically soon thereafter. The worker_watcher module [0], which is part of the Scheduler, will notice that heartbeats have stopped arriving between 5 and 6 minutes after the worker dies unexpectedly and will "clean up". Cleanup involves canceling and revoking all tasks that were dispatched to that worker [1].
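The cleanup behavior described above (cancel every task dispatched to a worker whose heartbeats have stopped) can be sketched roughly as follows. This is a minimal simulation, not Pulp's actual code: the `WorkerWatcher` class, the 300-second timeout constant, and the `cancel_task` callback are all assumptions for illustration.

```python
import time

# Hypothetical timeout after which a silent worker is considered dead;
# the comment above says cleanup fires 5-6 minutes after heartbeats stop.
WORKER_TIMEOUT_SECONDS = 300


class WorkerWatcher:
    """Track worker heartbeats and cancel tasks of workers that go silent."""

    def __init__(self, cancel_task, clock=time.time):
        self._last_heartbeat = {}   # worker name -> timestamp of last heartbeat
        self._dispatched = {}       # worker name -> list of dispatched task ids
        self._cancel_task = cancel_task
        self._clock = clock

    def handle_heartbeat(self, worker_name):
        self._last_heartbeat[worker_name] = self._clock()

    def handle_dispatch(self, worker_name, task_id):
        self._dispatched.setdefault(worker_name, []).append(task_id)

    def clean_up_dead_workers(self):
        """Cancel every task dispatched to a worker whose heartbeat is stale."""
        now = self._clock()
        for worker, last_seen in list(self._last_heartbeat.items()):
            if now - last_seen > WORKER_TIMEOUT_SECONDS:
                for task_id in self._dispatched.pop(worker, []):
                    self._cancel_task(task_id)
                del self._last_heartbeat[worker]
```

With an injected fake clock, advancing time past the timeout and calling `clean_up_dead_workers()` cancels the dead worker's tasks and forgets the worker.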
[0]: https://github.com/pulp/pulp/blob/19ad153eb69a22f4b3ddc9d01a390f26d83fb1ad/server/pulp/server/async/worker_watcher.py
[1]: https://github.com/pulp/pulp/blob/19ad153eb69a22f4b3ddc9d01a390f26d83fb1ad/server/pulp/server/async/tasks.py#L59

There are two PRs as part of this: one in pulp/kombu, and one in pulp/pulp for a docs change.

pulp/kombu PR: https://github.com/pulp/kombu/pull/11
pulp/pulp PR: https://github.com/pulp/pulp/pull/1002

Merged in both pulp/kombu and pulp/pulp. This has not yet been built into a python-kombu release, so it should stay at POST until a release has been built, at which point it will be moved to MODIFIED. That will be done when it is time to release another beta; bmbouter will move it to MODIFIED as part of creating the next beta.

The patch containing this fix has been merged to the pulp/pulp 2.4 branch and has been tagged into python-kombu-3.0.15-11. Moving to MODIFIED.

Fix included in python-kombu-3.0.15-11, which ships with Pulp beta 2.4.0-0.24.beta.

Created attachment 917436 [details]
Qpid Queues Output Example
To verify this BZ do the following:
1. Install qpid-tools using `sudo yum install qpid-tools -y`
2. Deploy a normal installation of Pulp using Qpid on localhost with authentication off (auth=no in qpidd.conf)
3. Start all services and sanity check Pulp
4. Get the output of `qpid-config queues`
Each queue the broker knows about will be listed. The output will be similar to attachment 917436 [details] on this ticket.
To verify this bug, confirm that none of the queues has the qpid.auto_delete_timeout setting; there should be no mention of auto_delete_timeout anywhere in the output.
Created attachment 918412 [details]
output from qpid-config queues
Based on c#10, moving to VERIFIED. This has been fixed in Pulp 2.4.0-1.
Created attachment 902566 [details]
reproducer bash script

Description of problem:
If pulp_worker-*.service exits abnormally --- for example, killed by signal 9 --- an asynchronous task might still be assigned to the pulp_worker-* queue. There seems to be a race between qpid removing the worker queues (the 3 s worker queue inactivity timeout) and tasks being assigned to queues and pulp_database collections. As a consequence, stale/leaked tasks may occur.

Version-Release number of selected component (if applicable):
pulp-server-2.4.0-0.19.beta.fc20.noarch

How reproducible:
~30% on a system with 2 pulp_workers

Steps to Reproduce:
1. Kill the worker processes with SIGKILL.
2. Check the mongo pulp_database.available_queues collection --- it should still show the worker queues.
3. Check qpid-stat -q --- it should no longer show the pulp_worker- queues.
4. Induce an async task (such as syncing a repo).
5. Check the contents of the mongo collections pulp_database.reserved_resources and pulp_database.task_status --- they should show a pending repo sync reservation/task.
6. Wait as long as the WorkerTimeoutMonitor cleaning thread needs to remove the unavailable worker queues.
7. Restart pulp_worker-*.service.
8. Check the status of the pending sync task --- it should still be pending.

Expected results:
The task is either always finished, or not accepted when pulp_worker exited abnormally.

Additional info:
In the reproducer, un-commenting the delay on line #60 decreases the frequency of stale task occurrence several times. It seems there is also some cleaning mechanism involved that reacts to qpidd removing inactive worker queues.
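The inconsistent state checked in steps 2, 3 and 5 above can be expressed as a small consistency check: a task is leaked if it is still pending on a worker queue that the broker no longer has. A minimal sketch, with plain dicts and lists standing in for the pulp_database collections (the collection layouts here are simplified for illustration, not Pulp's actual schema):

```python
def find_leaked_tasks(available_queues, broker_queues, task_statuses):
    """Return ids of pending tasks assigned to queues gone from the broker.

    available_queues: queue names known to pulp_database.available_queues
    broker_queues:    queue names actually present on qpid (qpid-stat -q)
    task_statuses:    task id -> {"queue": ..., "state": ...}
    """
    dead_queues = set(available_queues) - set(broker_queues)
    return [task_id for task_id, status in task_statuses.items()
            if status["state"] == "waiting" and status["queue"] in dead_queues]


# Simulated state after step 5 of the reproducer: the workers were SIGKILLed,
# qpid auto-deleted their queues, but mongo still lists them and a sync task
# was dispatched anyway.
available = ["reserved_resource_worker-0.dq", "reserved_resource_worker-1.dq"]
broker = []
tasks = {
    "sync-1": {"queue": "reserved_resource_worker-0.dq", "state": "waiting"},
    "done-1": {"queue": "reserved_resource_worker-1.dq", "state": "finished"},
}

leaked = find_leaked_tasks(available, broker, tasks)
```

Here `leaked` contains only the pending sync task, which is exactly the stale task the reproducer demonstrates.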