Bug 1088060
| Summary: | Sometimes the babysit() task thinks workers are missing that are not missing | ||
|---|---|---|---|
| Product: | [Retired] Pulp | Reporter: | Randy Barlow <rbarlow> |
| Component: | async/tasks | Assignee: | Brian Bouterse <bmbouter> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ina Panova <ipanova> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | ||
| Version: | Master | CC: | ipanova, mhrivnak |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 2.4.0 | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2014-08-09 06:57:07 UTC | Type: | Bug |
We have talked about this issue, and we believe that we have a reasonable solution:

1) Raise the default time that babysit() waits for workers to reply from 1s to 10s. This will mean that babysit() always takes 10s to complete.
2) Make the babysit() timer configurable.
3) Make the babysit() frequency configurable (still defaulting to 1m).
4) Make the timer for missing workers configurable (still defaulting to 5m).

The current plan is to do another beta build on Friday, and I believe this change can be done in time. Ina, if this is affecting you badly, let me know and I can plan a sooner beta build with this fix.

This issue is making life difficult for our QE friends, so I am marking it as urgent priority.

The fix for this bug is included in the 2.4.0-0.10.beta build that was just published to the Pulp fedorapeople.org repository.

Moving to Verified: since the fix was applied, I have never faced the issue again.

This has been fixed in Pulp 2.4.0-1.
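To make the four-point plan above concrete, here is a minimal sketch of how the three configurable values could fit together. It is not Pulp's actual code: `BabysitSettings`, its field names, and `find_missing_workers` are hypothetical names chosen for illustration, and only the defaults (10s, 1m, 5m) come from the plan itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BabysitSettings:
    """Hypothetical settings object; the names are illustrative, not Pulp's."""
    reply_timeout_s: float = 10.0     # how long babysit() waits for worker replies
    frequency_s: float = 60.0         # how often babysit() is scheduled
    missing_timeout_s: float = 300.0  # silence after which a worker counts as missing


def find_missing_workers(last_seen: Dict[str, float], now: float,
                         settings: BabysitSettings) -> List[str]:
    """Return the workers whose last reply is older than missing_timeout_s."""
    return sorted(
        name for name, seen in last_seen.items()
        if now - seen > settings.missing_timeout_s
    )


# Made-up timestamps (seconds): worker-a last replied 301s ago, worker-b 11s ago.
settings = BabysitSettings()
last_seen = {"worker-a": 1000.0, "worker-b": 1290.0}
missing = find_missing_workers(last_seen, now=1301.0, settings=settings)
print(missing)
```

With these defaults, a single slow or lost reply does not mark a worker as missing; the worker has to stay silent for longer than the missing-worker timer.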
I added a trace statement to an installation that only had one worker, and watched while several babysit() tasks went by. The babysit() task has an internal variable called active_queues that is the response from Celery's inspection tool. It is supposed to list all the workers and which queues they are subscribed to. Sometimes it seems that it returns None:

Apr 15 21:01:17 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: celery.worker.job:DEBUG: Task accepted: pulp.server.async.tasks.babysit[903f9c6b-01ef-4675-bc0d-d964da87f609] pid:1356
Apr 15 21:01:17 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: kombu.pidbox:DEBUG: pidbox received method active_queues() [reply_to:{u'routing_key': u'5a5cfec7-0b73-3e2a-b38a-976440ae7de6', u'exchange': u'reply.celery.pidbox'} ticket:fa1ab8f2-62ee-4f42-8ea5-7c89a394072e]
Apr 15 21:01:18 ip-10-227-65-61.eu-west-1.compute.internal pulp[1356]: pulp.server.async.tasks:DEBUG: babysit(): active_queues: None
Apr 15 21:01:18 ip-10-227-65-61.eu-west-1.compute.internal pulp[1341]: celery.worker.job:INFO: Task pulp.server.async.tasks.babysit[903f9c6b-01ef-4675-bc0d-d964da87f609] succeeded in 1.356255633s: None

I made this observation on ipanova's QE instance, and it seems to be a recurring, though difficult-to-reproduce, problem.
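The failure mode described above can be sketched as follows. This sketch assumes that a None value for active_queues means no worker replied within the inspection timeout, not that no workers exist. `record_replies` and the sample reply are hypothetical and are not taken from Pulp's code; the reply shape (worker name mapped to a list of queue descriptions) is a simplified assumption.

```python
from typing import Dict, List, Optional


def record_replies(active_queues: Optional[Dict[str, List[dict]]],
                   last_seen: Dict[str, float], now: float) -> bool:
    """Update last_seen from one inspection reply.

    Returns True if at least one worker replied, False otherwise.
    """
    if not active_queues:
        # None (or an empty reply) means nobody answered in time. Leave the
        # existing timestamps alone rather than treating workers as gone.
        return False
    for worker_name in active_queues:
        last_seen[worker_name] = now
    return True


# Made-up data: one known worker, then a timed-out reply and a normal reply.
last_seen = {"worker-a": 100.0}
got_reply_first = record_replies(None, last_seen, now=160.0)
seen_after_timeout = dict(last_seen)
got_reply_second = record_replies({"worker-a": [{"name": "queue-a"}]},
                                  last_seen, now=220.0)
print(got_reply_first, got_reply_second, last_seen)
```

Keeping the last known timestamp on a timed-out inspection leaves the decision about whether a worker is really missing to a separate, longer timer.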