Bug 1105195
| Summary: | pulp might leak async tasks in case worker exited abnormally | | |
|---|---|---|---|
| Product: | [Retired] Pulp | Reporter: | mkovacik |
| Component: | async/tasks | Assignee: | Brian Bouterse <bmbouter> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ina Panova <ipanova> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | medium | | |
| Version: | 2.4 Beta | CC: | bmbouter, ipanova, skarmark |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 2.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-08-09 06:56:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 902567 [details]
successful reproducer screen log
Originally the delay was put in place because Pulp's use of Celery workers made heavy use of Celery's management and control commands [0]. When I was load testing babysit(), the workers were continually subscribing to and unsubscribing from the same queue, which had auto-delete enabled. Introducing the delay mostly avoided the problem, unless the same worker unsubscribed and then re-subscribed to the same queue exactly "delay" seconds later. We no longer need this behavior because babysit() has been removed and, with the exception of task revocation, no Celery management and control commands are used anymore. revoke() is a broadcast command that does not expect replies, so even many revocations should not cause a problem once the timeout is removed.

I'm going to remove the use of qpid.auto_delete_timeout [1] from the code. It doesn't solve the whole problem, but it is a behavioral change that needs to be made. It should cause the queues to be deleted immediately, which will cause the dispatcher to raise an exception when it tries to dispatch work to a worker that died unexpectedly. That will be resolved through a broader effort to ensure that the Resource Manager recovers from these types of situations.

[0]: http://celery.readthedocs.org/en/latest/userguide/monitoring.html#management-command-line-utilities-inspect-control
[1]: https://github.com/pulp/kombu/blob/fb96b61a4e58d0bfc3d6d5539cdf9228e12fc115/kombu/transport/qpid.py#L771

Regarding task cancellation: when a worker goes missing, its tasks should be canceled automatically soon thereafter. The worker_watcher module [0], which is part of the Scheduler, will notice that heartbeats have stopped arriving between 5 and 6 minutes after the worker dies unexpectedly and will "clean up". Cleanup involves canceling and revoking all tasks that were dispatched to that worker [1].
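The cleanup behavior described above (cancel every task dispatched to a worker whose heartbeats have stopped) can be sketched roughly as follows. This is a minimal simulation, not Pulp's actual code: the `WorkerWatcher` class, the 300-second timeout constant, and the `cancel_task` callback are all assumptions for illustration.

```python
import time

# Hypothetical timeout after which a silent worker is considered dead;
# the comment above says cleanup fires 5-6 minutes after heartbeats stop.
WORKER_TIMEOUT_SECONDS = 300


class WorkerWatcher:
    """Track worker heartbeats and cancel tasks of workers that go silent."""

    def __init__(self, cancel_task, clock=time.time):
        self._last_heartbeat = {}   # worker name -> timestamp of last heartbeat
        self._dispatched = {}       # worker name -> list of dispatched task ids
        self._cancel_task = cancel_task
        self._clock = clock

    def handle_heartbeat(self, worker_name):
        self._last_heartbeat[worker_name] = self._clock()

    def handle_dispatch(self, worker_name, task_id):
        self._dispatched.setdefault(worker_name, []).append(task_id)

    def clean_up_dead_workers(self):
        """Cancel every task dispatched to a worker whose heartbeat is stale."""
        now = self._clock()
        for worker, last_seen in list(self._last_heartbeat.items()):
            if now - last_seen > WORKER_TIMEOUT_SECONDS:
                for task_id in self._dispatched.pop(worker, []):
                    self._cancel_task(task_id)
                del self._last_heartbeat[worker]
```

With an injected fake clock, advancing time past the timeout and calling `clean_up_dead_workers()` cancels the dead worker's tasks and forgets the worker.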
[0]: https://github.com/pulp/pulp/blob/19ad153eb69a22f4b3ddc9d01a390f26d83fb1ad/server/pulp/server/async/worker_watcher.py
[1]: https://github.com/pulp/pulp/blob/19ad153eb69a22f4b3ddc9d01a390f26d83fb1ad/server/pulp/server/async/tasks.py#L59

There are two PRs as part of this: one in pulp/kombu, and one in pulp/pulp for a docs change.

pulp/kombu PR: https://github.com/pulp/kombu/pull/11
pulp/pulp PR: https://github.com/pulp/pulp/pull/1002

Merged in both pulp/kombu and pulp/pulp. This has not yet been built into a python-kombu release, so it should stay at POST until a release has been built, at which point it will be moved to MODIFIED. That will be done when it is time to release another beta; bmbouter will move it to MODIFIED as part of creating the next beta.

The patch containing this fix has been merged to the pulp/pulp 2.4 branch and has been tagged into python-kombu-3.0.15-11. Moving to MODIFIED.

Fix included in python-kombu-3.0.15-11, which ships with Pulp beta 2.4.0-0.24.beta.

Created attachment 917436 [details]
Qpid Queues Output Example
To verify this BZ do the following:
1. Install qpid-tools using `sudo yum install qpid-tools -y`
2. Deploy a normal installation of Pulp using Qpid on localhost with authentication off (auth=no in qpidd.conf)
3. Start all services and sanity check Pulp
4. Get the output of `qpid-config queues`
Each queue the broker knows about will be listed. The output will be similar to attachment 917436 [details] on this ticket.
To verify this bug, confirm that none of the queues has the qpid.auto_delete_timeout setting; there should be no mention of auto_delete_timeout anywhere in the output.
Created attachment 918412 [details]
output from qpid-config queues
Based on c#10, moving to VERIFIED. This has been fixed in Pulp 2.4.0-1.
Created attachment 902566 [details]
reproducer bash script

Description of problem:
If pulp_worker-*.service exits abnormally --- for example, killed by signal 9 --- an asynchronous task might still be assigned to the pulp_worker-* queue. There seems to be a race between qpid removing the worker queues (the 3 s worker queue inactivity timeout) and tasks being assigned to queues and pulp_database collections. As a consequence, stale/leaked tasks may occur.

Version-Release number of selected component (if applicable):
pulp-server-2.4.0-0.19.beta.fc20.noarch

How reproducible:
~30% on a system with 2 pulp_workers

Steps to Reproduce:
1. Kill the worker processes with SIGKILL.
2. Check the mongo pulp_database.available_queues collection --- it should still show the worker queues.
3. Check qpid-stat -q --- it should no longer show the pulp_worker- queues.
4. Induce an async task (such as syncing a repo).
5. Check the contents of the mongo collections pulp_database.reserved_resources and pulp_database.task_status --- they should show a pending repo sync reservation/task.
6. Wait as long as the WorkerTimeoutMonitor cleaning thread needs to remove the unavailable worker queues.
7. Restart pulp_worker-*.service.
8. Check the status of the pending sync task --- it should still be pending.

Expected results:
The task is either always finished, or not accepted when pulp_worker exited abnormally.

Additional info:
In the reproducer, un-commenting the delay on line #60 decreases the frequency of stale task occurrence several times. It seems there is also some cleaning mechanism involved that reacts to qpidd removing inactive worker queues.
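The inconsistent state checked in steps 2, 3 and 5 above can be expressed as a small consistency check: a task is leaked if it is still pending on a worker queue that the broker no longer has. A minimal sketch, with plain dicts and lists standing in for the pulp_database collections (the collection layouts here are simplified for illustration, not Pulp's actual schema):

```python
def find_leaked_tasks(available_queues, broker_queues, task_statuses):
    """Return ids of pending tasks assigned to queues gone from the broker.

    available_queues: queue names known to pulp_database.available_queues
    broker_queues:    queue names actually present on qpid (qpid-stat -q)
    task_statuses:    task id -> {"queue": ..., "state": ...}
    """
    dead_queues = set(available_queues) - set(broker_queues)
    return [task_id for task_id, status in task_statuses.items()
            if status["state"] == "waiting" and status["queue"] in dead_queues]


# Simulated state after step 5 of the reproducer: the workers were SIGKILLed,
# qpid auto-deleted their queues, but mongo still lists them and a sync task
# was dispatched anyway.
available = ["reserved_resource_worker-0.dq", "reserved_resource_worker-1.dq"]
broker = []
tasks = {
    "sync-1": {"queue": "reserved_resource_worker-0.dq", "state": "waiting"},
    "done-1": {"queue": "reserved_resource_worker-1.dq", "state": "finished"},
}

leaked = find_leaked_tasks(available, broker, tasks)
```

Here `leaked` contains only the pending sync task, which is exactly the stale task the reproducer demonstrates.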