Description of problem:
When a worker restarts (or dies), the tasks assigned to that worker are not processed. Previously, in Pulp 2.3.x, tasks would be restarted when Pulp restarted. In 2.4 the task queue is cleared when the worker stops. If the worker dies unexpectedly or the power goes out, tasks can end up listed as "NOT_STARTED" or "IN_PROGRESS" while still holding a reservation that prevents other tasks requiring that resource from running until the user manually cancels them.

How reproducible:
Every time (this behavior is by design in 2.4.0)

Steps to Reproduce:
1. Queue a bunch of tasks
2. Restart Pulp

Actual results:
Tasks are either canceled or hang in the waiting/running state

Expected results:
Tasks are processed or restarted after the restart.
I believe the reason other tasks are prevented from running has a lot more to do with the task status showing either in-progress or waiting than with the fact that a reserved_resource reservation exists for the worker that restarted or died. Given that, this bug implicitly describes two behaviors: first, that work submitted to Pulp gets "lost" when workers restart or die, and second, that the status of those tasks is incorrect.

The second behavior has been moved to a different BZ [0] altogether, to be fixed in the short term (i.e. Pulp 2.4.1). This BZ should focus on the first behavior only: that Pulp loses work which it already knew about. Once that defect is fixed and Pulp no longer loses work, it's important to undo the short-term fix put in place by [0]. Basically, once work does restart properly, there should be no reason to proactively mark those tasks as cancelled.

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1129858
I was able to do some prototyping and made some progress towards resolving this issue, but it is not finished. We'll need to move away from the dedicated queues feature of Celery and use the CELERY_QUEUES feature instead to create similar auto-deleting, dedicated queues with the additional options needed to support the alternate-exchange. Beyond the implementation itself, the change still needs a lot of testing on both Qpid and RabbitMQ before it's ready to be put into a PR or have tests written for it. I'm going to focus on some task/story work for a bit, but since I'm so far into this fix I'm leaving it in the assigned state.
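To illustrate why an alternate-exchange matters for auto-deleting dedicated queues, here is a toy in-memory model. It is only a sketch, not Celery, Qpid, or RabbitMQ code: the `ToyBroker` class, the queue names `worker1.dq` and `orphaned_tasks`, and the task names are all hypothetical. It also assumes the broker reroutes messages orphaned by a queue deletion to that queue's alternate, which is my understanding of the AMQP 0-10 queue option; whether both brokers can be configured to behave this way is exactly what still needs testing.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a broker; not a real Celery/Qpid/RabbitMQ API."""

    def __init__(self):
        self.queues = {}      # queue name -> deque of pending messages
        self.alternates = {}  # queue name -> fallback queue name, or None
        self.dropped = []     # messages lost when a queue was deleted

    def declare_queue(self, name, alternate=None):
        self.queues[name] = deque()
        self.alternates[name] = alternate

    def publish(self, queue_name, message):
        self.queues[queue_name].append(message)

    def delete_queue(self, name):
        """Simulate auto-delete firing once the worker's consumer goes away."""
        pending = self.queues.pop(name)
        fallback = self.alternates.pop(name)
        if fallback in self.queues:
            # Assumed behavior: orphaned messages move to the alternate.
            self.queues[fallback].extend(pending)
        else:
            # No alternate configured: the pending work is simply gone.
            self.dropped.extend(pending)


def run_scenario(use_alternate):
    """Queue two tasks for one worker, kill the worker, report the outcome."""
    broker = ToyBroker()
    broker.declare_queue("orphaned_tasks")
    broker.declare_queue(
        "worker1.dq",
        alternate="orphaned_tasks" if use_alternate else None,
    )
    for task in ("sync-repo-a", "publish-repo-a"):
        broker.publish("worker1.dq", task)
    broker.delete_queue("worker1.dq")  # the worker dies or restarts
    return list(broker.queues["orphaned_tasks"]), list(broker.dropped)


if __name__ == "__main__":
    for use_alternate in (False, True):
        kept, lost = run_scenario(use_alternate)
        print(f"alternate={use_alternate}: kept={kept} lost={lost}")
```

Without the alternate, the toy loses both tasks when the queue is deleted, which matches the "lost work" behavior this BZ is about. With it, the tasks survive in a queue from which they could be re-dispatched.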
I filed an upstream bug [1] with Celery reporting that the CELERY_WORKER_DIRECT feature can lose work. I think solving this in upstream Celery is better than reworking Pulp to have a similar feature, only with durability. https://github.com/celery/celery/issues/2492
[1] in Comment 3 is https://github.com/celery/celery/issues/2492
Moved to https://pulp.plan.io/issues/489