Bug 1088060

Summary: Sometimes the babysit() task thinks workers are missing that are not missing
Product: [Retired] Pulp
Reporter: Randy Barlow <rbarlow>
Component: async/tasks
Assignee: Brian Bouterse <bmbouter>
Status: CLOSED CURRENTRELEASE
QA Contact: Ina Panova <ipanova>
Severity: high
Priority: urgent
Version: Master
CC: ipanova, mhrivnak
Target Milestone: ---
Keywords: Triaged
Target Release: 2.4.0
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2014-08-09 06:57:07 UTC
Type: Bug

Description Randy Barlow 2014-04-16 00:31:21 UTC
I added a trace statement to an installation that only had one worker, and watched while several babysit() tasks went by. The babysit() task has an internal variable called active_queues that is the response from Celery's inspection tool. It is supposed to list all the workers and which queues they are subscribed to. Sometimes it seems that it returns None:

Apr 15 21:01:17 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: celery.worker.job:DEBUG: Task accepted: pulp.server.async.tasks.babysit[903f9c6b-01ef-4675-bc0d-d964da87f609] pid:1356
Apr 15 21:01:17 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: kombu.pidbox:DEBUG: pidbox received method active_queues() [reply_to:{u'routing_key': u'5a5cfec7-0b73-3e2a-b38a-976440ae7de6', u'exchange': u'reply.celery.pidbox'} ticket:fa1ab8f2-62ee-4f42-8ea5-7c89a394072e]
Apr 15 21:01:18 ip-10-227-65-61.eu-west-1.compute.internal pulp[1356]: pulp.server.async.tasks:DEBUG: babysit(): active_queues: None
Apr 15 21:01:18 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: celery.worker.job:INFO: Task pulp.server.async.tasks.babysit[903f9c6b-01ef-4675-bc0d-d964da87f609] succeeded in 1.356255633s: None

I made this observation on ipanova's QE instance, and it seems to be a recurring, though difficult-to-reproduce, problem.
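To make the failure mode concrete, here is a minimal sketch of how a None reply can be read as "no workers at all". It does not use Pulp's or Celery's code: simulated_active_queues() is a hypothetical stand-in for the inspection call, the worker name and delays are made up, and the only thing taken from this report is that the reply is sometimes None.

```python
def workers_seen(active_queues):
    """Return the set of worker names found in an active_queues reply.

    A reply of None means that no worker answered before the timeout,
    so it is treated as an empty set of workers rather than an error.
    """
    if active_queues is None:
        return set()
    return set(active_queues)


def simulated_active_queues(reply_delay, timeout):
    """Hypothetical stand-in for the inspection call.

    Models a single worker that needs reply_delay seconds to answer
    while the caller only waits timeout seconds for replies.
    """
    if reply_delay > timeout:
        return None  # nobody replied in time
    return {"worker1@host": [{"name": "worker1@host"}]}


# The same healthy worker looks absent with a short wait and present
# with a longer one.
print(workers_seen(simulated_active_queues(reply_delay=1.3, timeout=1.0)))
print(workers_seen(simulated_active_queues(reply_delay=1.3, timeout=10.0)))
```

With the short wait the worker is not seen even though it is running, which is the kind of false negative described above.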

Comment 1 Randy Barlow 2014-04-16 14:54:42 UTC
We have talked about this issue, and we believe that we have a reasonable solution:

1) Raise the default time that babysit() waits for workers to reply from 1s to 10s. This will mean that babysit() always takes 10s to complete.

2) Make the babysit() timer configurable.

3) Make the babysit() frequency configurable (still defaulting to 1m).

4) Make the timer for missing workers configurable (still defaulting to 5m).
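
The four settings above could be sketched roughly as follows. This is only an illustration under assumptions: BabysitSettings, find_missing_workers, and the field names are hypothetical and are not Pulp's actual configuration keys; only the default values (10s, 1m, 5m) come from the plan in this comment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BabysitSettings:
    """Hypothetical container for the three configurable values."""
    reply_timeout: timedelta = timedelta(seconds=10)          # items 1 and 2
    babysit_interval: timedelta = timedelta(minutes=1)        # item 3
    missing_worker_timeout: timedelta = timedelta(minutes=5)  # item 4


def find_missing_workers(last_seen, now, settings):
    """Return the names of workers not seen within the missing-worker timeout.

    last_seen maps a worker name to the datetime of its most recent reply.
    """
    cutoff = now - settings.missing_worker_timeout
    return sorted(name for name, seen in last_seen.items() if seen < cutoff)


now = datetime(2014, 4, 16, 15, 0, 0)
last_seen = {
    "worker1": now - timedelta(minutes=6),   # silent for longer than 5m
    "worker2": now - timedelta(seconds=70),  # missed one babysit() run only
}
print(find_missing_workers(last_seen, now, BabysitSettings()))
```

With the defaults, a worker that misses a single babysit() run is not reported as missing; only one that stays silent past the missing-worker timeout is.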

The current plan is to do another beta build on Friday, and I believe this change can be done in time. Ina, if this is affecting you badly let me know and I can plan a sooner beta build with this fix.

Comment 2 Randy Barlow 2014-04-16 15:09:41 UTC
This issue is making life difficult for our QE friends, so I am marking it as urgent priority.

Comment 4 Randy Barlow 2014-04-24 20:28:39 UTC
The fix for this bug is included in the 2.4.0-0.10.beta build that was just published to the Pulp fedorapeople.org repository.

Comment 5 Ina Panova 2014-05-16 12:33:48 UTC
Moving to Verified: since the fix was applied, I have not encountered the issue again.

Comment 6 Randy Barlow 2014-08-09 06:57:07 UTC
This has been fixed in Pulp 2.4.0-1.