I added a trace statement to an installation that only had one worker, and watched while several babysit() tasks went by. The babysit() task has an internal variable called active_queues that is the response from Celery's inspection tool. It is supposed to list all the workers and which queues they are subscribed to. Sometimes it seems that it returns None:

Apr 15 21:01:17 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: celery.worker.job:DEBUG: Task accepted: pulp.server.async.tasks.babysit[903f9c6b-01ef-4675-bc0d-d964da87f609] pid:1356
Apr 15 21:01:17 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: kombu.pidbox:DEBUG: pidbox received method active_queues() [reply_to:{u'routing_key': u'5a5cfec7-0b73-3e2a-b38a-976440ae7de6', u'exchange': u'reply.celery.pidbox'} ticket:fa1ab8f2-62ee-4f42-8ea5-7c89a394072e]
Apr 15 21:01:18 ip-10-227-65-61.eu-west-1.compute.internal pulp[1356]: pulp.server.async.tasks:DEBUG: babysit(): active_queues: None
Apr 15 21:01:18 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: celery.worker.job:INFO: Task pulp.server.async.tasks.babysit[903f9c6b-01ef-4675-bc0d-d964da87f609] succeeded in 1.356255633s: None

I made this observation on ipanova's QE instance, and it seems to be a recurring, though difficult to reproduce problem.
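To illustrate why a None reply matters, here is a minimal sketch of how a caller could tell "no worker replied in time" apart from "workers replied". It is not the Pulp code: the function name summarize_active_queues is hypothetical, and the reply shape (a dict mapping worker name to a list of queue dicts with a "name" key, or None) is an assumption about what the inspection call returns.

```python
def summarize_active_queues(active_queues):
    """Map each worker name to the sorted names of its subscribed queues.

    Assumption: `active_queues` is either None (no worker answered before
    the inspection timeout) or a dict of worker name -> list of queue
    dicts, each carrying a "name" key.
    """
    if active_queues is None:
        # No reply at all: return None so the caller can skip this round
        # instead of concluding that every worker has gone away.
        return None
    return {
        worker: sorted(queue["name"] for queue in queues)
        for worker, queues in active_queues.items()
    }
```

With this shape, a caller would only update its worker records when the result is not None, so a single slow reply does not look like a vanished worker.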
We have talked about this issue, and we believe that we have a reasonable solution:

1) Raise the default time that babysit() waits for workers to reply from 1s to 10s. This will mean that babysit() always takes 10s to complete.
2) Make the babysit() timer configurable.
3) Make the babysit() frequency configurable (still defaulting to 1m).
4) Make the timer for missing workers configurable (still defaulting to 5m).

The current plan is to do another beta build on Friday, and I believe this change can be done in time. Ina, if this is affecting you badly let me know and I can plan a sooner beta build with this fix.
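The three configurable values in the plan above could be grouped roughly as follows. This is only a sketch under stated assumptions: BabysitSettings, its field names, and find_missing_workers are hypothetical and are not the names used in the actual fix; only the defaults (10s, 1m, 5m) come from the plan.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BabysitSettings:
    # Hypothetical setting names; defaults follow the plan above.
    reply_timeout: float = 10.0   # seconds babysit() waits for worker replies
    interval: float = 60.0        # seconds between babysit() runs
    missing_after: float = 300.0  # seconds of silence before a worker is missing


def find_missing_workers(last_seen, now, settings):
    """Return the sorted names of workers not seen within `missing_after`.

    `last_seen` maps worker name -> datetime of the last reply seen.
    """
    cutoff = now - timedelta(seconds=settings.missing_after)
    return sorted(name for name, seen in last_seen.items() if seen < cutoff)
```

Keeping the missing-worker timer several times longer than the babysit() interval means a worker has to miss more than one inspection round before it is treated as gone.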
This issue is making life difficult for our QE friends, so I am marking it as urgent priority.
https://github.com/pulp/kombu/pull/3 https://github.com/pulp/pulp/pull/911
The fix for this bug is included in the 2.4.0-0.10.beta build that was just published to the Pulp fedorapeople.org repository.
Moving to Verified: since the fix was applied, I have not hit the issue again.
This has been fixed in Pulp 2.4.0-1.