1129758 – When a worker dies or restarts the tasks assigned to it are not processed

Summary: When a worker dies or restarts the tasks assigned to it are not processed

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Pulp
Classification:	Retired
Component:	async/tasks
Sub Component:
Version:	2.4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Brian Bouterse
QA Contact:	pulp-qe-list
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-08-13 15:26 UTC by Barnaby Court
Modified:	2015-02-28 22:15 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-02-28 22:15:22 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Pulp Redmine	489	0	None	None	None	Never

Description Barnaby Court 2014-08-13 15:26:22 UTC

Description of problem:

When a worker restarts or dies, the tasks assigned to that worker are not processed. Previously, in Pulp 2.3.x, tasks would be restarted when Pulp restarted. In 2.4, the task queue is cleared when the worker stops.

If the worker dies unexpectedly or the power goes out, tasks can be left listed as "NOT_STARTED" or "IN_PROGRESS" while still holding a reservation. That reservation prevents other tasks that require the same resource from running until the user manually cancels the stuck tasks.

How reproducible: 
Every time (this behavior is by design in 2.4.0)

Steps to Reproduce:
1. Queue a bunch of tasks
2. Restart Pulp


Actual results:
Tasks are either canceled or hang in the waiting/running state

Expected results:
Tasks are processed or restarted after the restart.

Comment 1 Brian Bouterse 2014-08-13 20:09:59 UTC

I believe the reason other tasks are prevented from running has much more to do with the task status showing either in-progress or waiting than with the fact that a reserved_resource reservation exists for the worker that restarted or died.

Given that, this bug implicitly describes two behaviors: first, that work submitted to Pulp gets "lost" when workers restart or die, and second, that the status of those tasks is incorrect. The second behavior has been moved to a different BZ [0] altogether, to be fixed in the short term (i.e., Pulp 2.4.1).

This BZ should focus on the first behavior only: that Pulp loses work which it already knew about. Once that defect is fixed and Pulp no longer loses work, it's important to undo the short-term fix put in place by [0]. Basically, once work restarts properly, there should be no reason to proactively mark those tasks as cancelled.

[0]:  https://bugzilla.redhat.com/show_bug.cgi?id=1129858

Comment 2 Brian Bouterse 2015-01-12 18:08:07 UTC

I was able to do some prototyping and made some progress toward resolving this issue, but it is not finished. We'll need to move away from the dedicated queues feature of Celery and use the CELERY_QUEUES feature instead to create similar auto-deleting, dedicated queues, with the additional options needed to support the alternate-exchange.

In addition to the implementation, it still needs a lot of testing on both Qpid and RabbitMQ before it's ready to be put into a PR or have tests written for it. I'm going to focus on some task/story work for a bit, but since I'm so far into this fix I'm leaving it in the assigned state.
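
For readers following the thread, the failure mode and the alternate-exchange idea from this comment can be illustrated with a toy, in-memory model. This is only a sketch and not Celery, kombu, or Pulp code: ToyBroker, fallback_queue, and worker_died are hypothetical names, and the model simply assumes that messages belonging to a deleted queue get re-routed to a fallback queue, which is broker-specific behavior that would need the Qpid and RabbitMQ testing mentioned above.

```python
from collections import deque


class ToyBroker:
    """Tiny in-memory stand-in for a message broker (hypothetical, not a real API)."""

    def __init__(self, fallback_queue=None):
        # queue name -> deque of pending task names
        self.queues = {}
        # Plays the role of the queue behind an alternate exchange (assumption).
        self.fallback_queue = fallback_queue
        if fallback_queue is not None:
            self.queues[fallback_queue] = deque()

    def declare_worker_queue(self, worker):
        """Create the dedicated queue for one worker."""
        self.queues.setdefault(worker, deque())

    def publish(self, worker, task):
        """Route a task to a worker's dedicated queue; return False if it is dropped."""
        if worker in self.queues:
            self.queues[worker].append(task)
            return True
        if self.fallback_queue is not None:
            self.queues[self.fallback_queue].append(task)
            return True
        return False  # unroutable and no fallback: the work is lost

    def worker_died(self, worker):
        """Auto-delete the worker's queue and return the tasks that were lost."""
        pending = self.queues.pop(worker, deque())
        if self.fallback_queue is not None:
            # Assumed behavior: orphaned messages move to the fallback queue.
            self.queues[self.fallback_queue].extend(pending)
            return []
        return list(pending)


# Without a fallback, queued work disappears together with the worker's queue.
plain = ToyBroker()
plain.declare_worker_queue("worker-1")
plain.publish("worker-1", "sync-repo-a")
plain.publish("worker-1", "sync-repo-b")
lost_plain = plain.worker_died("worker-1")
late_plain = plain.publish("worker-1", "sync-repo-c")

# With a fallback queue, the same work survives and can be dispatched again.
safe = ToyBroker(fallback_queue="orphaned")
safe.declare_worker_queue("worker-1")
safe.publish("worker-1", "sync-repo-a")
safe.publish("worker-1", "sync-repo-b")
lost_safe = safe.worker_died("worker-1")
late_safe = safe.publish("worker-1", "sync-repo-c")
recovered = list(safe.queues["orphaned"])
```

In this model the second broker keeps every task, so something would still have to consume the fallback queue and hand the recovered tasks to a live worker. That part is not shown here.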

Comment 3 Brian Bouterse 2015-02-06 22:25:53 UTC

I filed an upstream bug [1] with Celery reporting that the CELERY_WORKER_DIRECT feature can lose work. I think solving this in upstream Celery is better than reworking Pulp to have a similar feature, only with durability.

https://github.com/celery/celery/issues/2492

Comment 4 Brian Bouterse 2015-02-06 22:26:11 UTC

[1] in Comment 3 is https://github.com/celery/celery/issues/2492

Comment 5 Brian Bouterse 2015-02-28 22:15:22 UTC

Moved to https://pulp.plan.io/issues/489
