Bug 1129758 - When a worker dies or restarts the tasks assigned to it are not processed
Summary: When a worker dies or restarts the tasks assigned to it are not processed
Alias: None
Product: Pulp
Classification: Retired
Component: async/tasks
Version: 2.4.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Brian Bouterse
QA Contact: pulp-qe-list
Depends On:
Reported: 2014-08-13 15:26 UTC by Barnaby Court
Modified: 2015-02-28 22:15 UTC
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2015-02-28 22:15:22 UTC

Attachments

System ID Priority Status Summary Last Updated
Pulp Redmine 489 None None None Never

Description Barnaby Court 2014-08-13 15:26:22 UTC
Description of problem:

When a worker restarts (or dies), the tasks assigned to that worker are not processed. Previously, in Pulp 2.3.x, tasks would be restarted when Pulp restarted. In 2.4 the task queue is cleared when the worker stops.

If the worker dies unexpectedly or the power goes out, tasks can be left listed as "NOT_STARTED" or "IN_PROGRESS" while holding a reservation that prevents other tasks requiring that resource from running until the stuck tasks are manually canceled by the user.
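The stranded-task condition described above can be sketched as a small filter. This is purely illustrative: the field names (`state`, `worker_name`, `task_id`) and worker names below are assumptions for the example, not Pulp's actual task schema.

```python
# Illustrative sketch: find tasks left stranded by a dead worker.
# Field and worker names are hypothetical, not Pulp's actual schema.

STUCK_STATES = {"NOT_STARTED", "IN_PROGRESS"}

def find_stranded_tasks(tasks, live_workers):
    """Return tasks that hold a reservation but whose worker is gone.

    tasks        -- iterable of dicts with 'task_id', 'state', 'worker_name'
    live_workers -- set of worker names currently running
    """
    return [
        t for t in tasks
        if t["state"] in STUCK_STATES and t["worker_name"] not in live_workers
    ]

tasks = [
    {"task_id": "1", "state": "IN_PROGRESS", "worker_name": "worker-0"},
    {"task_id": "2", "state": "FINISHED", "worker_name": "worker-0"},
    {"task_id": "3", "state": "NOT_STARTED", "worker_name": "worker-1"},
]
# worker-0 died; worker-1 is still alive. Only task 1 is stranded:
# task 2 already finished, and task 3's worker is still running.
stranded = find_stranded_tasks(tasks, live_workers={"worker-1"})
```

Until such stranded tasks are canceled, their reservations block any later task that needs the same resource.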

How reproducible: 
Every time (this behavior is by design in 2.4.0)

Steps to Reproduce:
1. Queue a bunch of tasks
2. Restart Pulp

Actual results:
Tasks are either canceled or hang in the waiting/running state

Expected results:
Tasks are processed or restarted after the restart.

Comment 1 Brian Bouterse 2014-08-13 20:09:59 UTC
I believe the reason other tasks are prevented from running has much more to do with the task status showing either in-progress or waiting than with the fact that a reserved_resource reservation exists for the worker that restarted or died.

Given that, this bug implicitly describes two behaviors: first, that work submitted to Pulp gets "lost" when workers restart or die; second, that the status of those tasks is incorrect. The second behavior has been moved to a different BZ [0] altogether, to be fixed in the short term (i.e., Pulp 2.4.1).

This BZ should focus on the first behavior only: that Pulp loses work it already knew about. Once that defect is fixed and Pulp no longer loses work, it is important to undo the short-term fix put in place by [0]. Basically, once work restarts properly there should be no reason to proactively mark those tasks as canceled.

[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129858

Comment 2 Brian Bouterse 2015-01-12 18:08:07 UTC
I was able to do some prototyping and made some progress toward resolving this issue, but it is not finished. We'll need to move away from Celery's dedicated queues feature and use the CELERY_QUEUES setting instead to create similar auto-deleting, dedicated queues, with the additional options needed to support an alternate exchange.
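A rough sketch of what such a CELERY_QUEUES-based configuration might look like follows. This is a hypothetical config fragment, not Pulp's actual settings: the exchange and queue names, and the use of RabbitMQ's `alternate-exchange` argument to catch messages that would otherwise be dropped when an auto-deleting queue disappears, are all assumptions for illustration.

```python
# Hypothetical celeryconfig sketch: declare per-worker dedicated queues
# via CELERY_QUEUES instead of relying on CELERY_WORKER_DIRECT.
# All names and arguments below are illustrative.
from kombu import Exchange, Queue

# Messages routed to a missing queue fall through to this exchange
# (RabbitMQ alternate-exchange feature), so work is not silently lost.
fallback = Exchange('pulp.fallback', type='fanout', durable=True)

worker_exchange = Exchange(
    'pulp.workers', type='direct', durable=True,
    arguments={'alternate-exchange': 'pulp.fallback'},
)

CELERY_QUEUES = (
    # One auto-deleting dedicated queue per worker, mimicking the
    # behavior of CELERY_WORKER_DIRECT but under our own exchange.
    Queue('reserved_resource_worker-0.dq',
          exchange=worker_exchange,
          routing_key='reserved_resource_worker-0',
          auto_delete=True),
)
```

The idea is that when a worker (and its auto-deleting queue) goes away, pending messages are rerouted via the fallback exchange rather than discarded, which is the durability property the dedicated-queues feature lacks.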

In addition to the implementation, it still needs a lot of testing on both Qpid and RabbitMQ before it's ready to be put into a PR or have tests written for it. I'm going to focus on some task/story work for a bit, but since I'm so far into this fix I'm leaving it in the assigned state.

Comment 3 Brian Bouterse 2015-02-06 22:25:53 UTC
I filed an upstream bug [1] with Celery reporting that CELERY_WORKER_DIRECT can lose work. I think solving this upstream in Celery is better than reworking Pulp to provide a similar feature with durability.


Comment 4 Brian Bouterse 2015-02-06 22:26:11 UTC
[1] in Comment 3 is https://github.com/celery/celery/issues/2492

Comment 5 Brian Bouterse 2015-02-28 22:15:22 UTC
Moved to https://pulp.plan.io/issues/489
