Bug 1088060 - Sometimes the babysit() task thinks workers are missing that are not missing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Pulp
Classification: Retired
Component: async/tasks
Version: Master
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 2.4.0
Assignee: Brian Bouterse
QA Contact: Ina Panova
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-04-16 00:31 UTC by Randy Barlow
Modified: 2014-08-09 06:57 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-08-09 06:57:07 UTC
Embargoed:



Description Randy Barlow 2014-04-16 00:31:21 UTC
I added a trace statement to an installation that only had one worker, and watched while several babysit() tasks went by. The babysit() task has an internal variable called active_queues that is the response from Celery's inspection tool. It is supposed to list all the workers and which queues they are subscribed to. Sometimes it seems that it returns None:

Apr 15 21:01:17 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: celery.worker.job:DEBUG: Task accepted: pulp.server.async.tasks.babysit[903f9c6b-01ef-4675-bc0d-d964da87f609] pid:1356
Apr 15 21:01:17 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: kombu.pidbox:DEBUG: pidbox received method active_queues() [reply_to:{u'routing_key': u'5a5cfec7-0b73-3e2a-b38a-976440ae7de6', u'exchange': u'reply.celery.pidbox'} ticket:fa1ab8f2-62ee-4f42-8ea5-7c89a394072e]
Apr 15 21:01:18 ip-10-227-65-61.eu-west-1.compute.internal pulp[1356]: pulp.server.async.tasks:DEBUG: babysit(): active_queues: None
Apr 15 21:01:18 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: celery.worker.job:INFO: Task pulp.server.async.tasks.babysit[903f9c6b-01ef-4675-bc0d-d964da87f609] succeeded in 1.356255633s: None

I made this observation on ipanova's QE instance. The problem seems to recur, but it is difficult to reproduce on demand.
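To illustrate why a None response matters, here is a minimal sketch, not Pulp's actual babysit() code. It assumes the inspection result is either a dict mapping worker names to their queues, or None when no worker answered before the timeout. The function names and worker names are hypothetical. The idea is that None means "no answer yet", which is different from "zero workers are alive".

```python
def workers_from_reply(active_queues):
    """Return the set of worker names that replied, or None if no reply arrived.

    `active_queues` models the inspection result: a dict mapping worker name
    to its list of queues, or None when nobody answered before the timeout.
    """
    if active_queues is None:
        # No reply within the timeout: the state is unknown, not "no workers".
        return None
    return set(active_queues)


def missing_workers(known_workers, active_queues):
    """Return known workers absent from the reply; report none on a timeout."""
    replied = workers_from_reply(active_queues)
    if replied is None:
        # Hypothetical policy: a timed-out inspection flags nobody as missing.
        return set()
    return set(known_workers) - replied
```

With this policy, a timed-out inspection on a single-worker installation would not, by itself, mark that worker as missing.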

Comment 1 Randy Barlow 2014-04-16 14:54:42 UTC
We have talked about this issue, and we believe that we have a reasonable solution:

1) Raise the default time that babysit() waits for workers to reply from 1s to 10s. This will mean that babysit() always takes 10s to complete.

2) Make the babysit() timer configurable.

3) Make the babysit() frequency configurable (still defaulting to 1m).

4) Make the timer for missing workers configurable (still defaulting to 5m).

The current plan is to do another beta build on Friday, and I believe this change can be done in time. Ina, if this is affecting you badly let me know and I can plan a sooner beta build with this fix.
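The four settings in the plan above could be grouped roughly as follows. This is only a sketch under stated assumptions: the class, field, and function names are hypothetical and are not the configuration keys Pulp ended up using. Only the default values (10s, 1m, 5m) come from the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BabysitSettings:
    # Hypothetical names; only the default values come from the plan.
    inspect_timeout_s: float = 10.0          # items 1 and 2: wait for replies
    babysit_interval_s: float = 60.0         # item 3: how often babysit() runs
    missing_worker_timeout_s: float = 300.0  # item 4: silence before "missing"


def stale_workers(last_seen, now, settings=BabysitSettings()):
    """Return sorted names of workers silent for longer than the timeout.

    `last_seen` maps a worker name to the time (in seconds) it last replied.
    """
    return sorted(
        name
        for name, seen in last_seen.items()
        if now - seen > settings.missing_worker_timeout_s
    )
```

With the defaults, a worker has roughly five babysit() runs in which to answer before it is treated as missing, so one slow reply should not be enough to flag it.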

Comment 2 Randy Barlow 2014-04-16 15:09:41 UTC
This issue is making life difficult for our QE friends, so I am marking it as urgent priority.

Comment 4 Randy Barlow 2014-04-24 20:28:39 UTC
The fix for this bug is included in the 2.4.0-0.10.beta build that was just published to the Pulp fedorapeople.org repository.

Comment 5 Ina Panova 2014-05-16 12:33:48 UTC
Moving to Verified; since the fix, I have not faced the issue again.

Comment 6 Randy Barlow 2014-08-09 06:57:07 UTC
This has been fixed in Pulp 2.4.0-1.

