Bug 1088481
| Summary: | qpidd restart needed to unblock celery processing | | |
|---|---|---|---|
| Product: | [Retired] Pulp | Reporter: | Jeff Ortel <jortel> |
| Component: | async/tasks | Assignee: | Brian Bouterse <bmbouter> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | mkovacik |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.4.0 | CC: | bmbouter, mkovacik, rbarlow, tross |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 2.4.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Last Closed: | 2014-08-09 06:56:40 UTC | Type: | Bug |
Description
Jeff Ortel
2014-04-16 17:23:18 UTC
I have observed this problem on several other people's machines, though never on my own. It could be due to the versions of the various software involved, though we really don't know much about it at this point. If anyone is able to provide more information, please do so.

Created attachment 886965 [details]
jortel's /var/log/messages for the past few days.
The fix for this bug is included in the 2.4.0-0.10.beta build that was just published to the Pulp fedorapeople.org repository.

I am reopening this bug because it has been experienced again in 2.4.0-0.10.beta.

I've looked at data supplied by Brian Bouterse. It appears that Celery creates and binds queues (called celeryev.<uuid>). It then proceeds to read messages from those queues but never acknowledges them. As a result, the messages remain in the queues and the queues reach their configured depth limit. The default behavior for a queue at its limit is to apply back-pressure to producers of messages to that queue, which causes the producing process to block in the "send" call. This lock-up is inevitable after a time inversely proportional to the message transfer rate. I suggest creating the queues with the "qpid.policy_type" argument set to "ring". This prevents the back-pressure and instead causes the oldest message in the queue to be discarded when a new message arrives on a full queue.

tl;dr: setting 'qpid.policy_type' to 'ring' works! First, I verified that the number of unacked messages was the problem by reducing the allowed queue depth, setting the 'qpid.max_count' option to 10. This stopped the system heartbeats just after startup, with 10 messages in each celeryev queue. This is the same behavior observed on the affected systems in this issue. To resolve the issue that Celery does not ACK event messages, I set 'qpid.policy_type' to 'ring' for any queue whose name starts with 'celeryev'. I believe the pidbox queues also show this behavior, but they fill more slowly than the 'celeryev' queues, so I'm going to guard the pidbox queues against this outstanding-messages issue as well. The other areas of Celery appear to ACK messages correctly when they are handled, so I believe that part is a resolved issue. I'll prepare a PR for this. Thanks so much to tross and gsim for their guidance on this.
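The declare-time arguments described above can be sketched as a small helper. This is a hedged illustration, not the actual Pulp/kombu patch: the function name and the assumption that pidbox queue names end in 'pidbox' are mine, and the real change lives in the linked kombu fork.

```python
def qpid_queue_arguments(queue_name, max_count=None):
    """Illustrative: build Qpid declare arguments for a queue.

    Queues whose messages Celery never acknowledges ('celeryev.<uuid>'
    and the pidbox queues) get 'qpid.policy_type': 'ring', so a full
    queue drops its oldest message instead of applying back-pressure
    that blocks the producer's send() call.
    """
    args = {}
    if queue_name.startswith('celeryev') or queue_name.endswith('pidbox'):
        args['qpid.policy_type'] = 'ring'
    if max_count is not None:
        # Optional depth cap, as used in the qpid.max_count=10 experiment.
        args['qpid.max_count'] = max_count
    return args
```

In a kombu-based application these arguments would typically be supplied when the queue is declared; ordinary task queues, whose messages are ACKed, are left with the default policy.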
PR is available here: https://github.com/pulp/kombu/pull/4

Merged into pulp with this PR: https://github.com/pulp/pulp/pull/949

The fix for this issue is included in pulp-2.4.0-0.12.beta.

Created attachment 895511 [details]
verifying screen log
Verified in version python-kombu-3.0.15-5.pulp.fc20.noarch.

Reproduced by inducing the issue (reverting the patch), which led to a sudden halt of Pulp where test cases would just get stuck. Restoring the patch brought test-case execution back to normal.
Created attachment 895814 [details]
Screen log from another node
Unfortunately, I've encountered the same issue with different queues on another node. Running our CI suite 16 times fills up other Pulp qpid queues, rendering the system stuck. Restarting qpidd unblocks the system, too. See the attachment for the screen log.

I missed the version info:
[root@ec2-54-73-155-135 pulp_auto]# rpm -q pulp-server python-kombu
pulp-server-2.4.0-0.13.beta.fc20.noarch
python-kombu-3.0.15-5.pulp.fc20.noarch

I'm not sure what you mean by "another node". Based on your attachment, I'm also not sure how you are hitting "the same issue" or why you believe that you are. Here's what I see in your attachments. The only queues in your attachment which have large outstanding message counts (the msg column) are the following:

celeryev.*
*.celery.pidbox

Those queues also have their ring policy set correctly, which means that we've resolved the root cause of this bug. Having large message counts in those queues is part of normal operation, and restarting the broker will erase those messages, so I don't see anything unusual in the attachments. Running a CI suite over and over again to leave Pulp in a bad state could indicate there is an issue with the CI suite. I don't run the CI suite (nor do customers), so I would need a normal-use method to reproduce this issue without the CI suite. If the max message count is related to the problem, then you should be able to reproduce without the CI suite by lowering the max count on the queues listed above using qpid.max_count and observing the system stop. I'll wait for more info after this reliability issue is investigated and reproduced independently of the CI suite. I agree there could be a problem, but it needs more info before it can be properly investigated.

Reviewing the screen log, and based on #Comment14, I have to agree that the root cause of the system block I'm seeing is somewhere else. Moving this particular issue back to Verified. Sorry for the confusion.

This has been fixed in Pulp 2.4.0-1.
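The back-pressure mechanism discussed throughout this bug (and the reproduction idea of lowering qpid.max_count) can be illustrated with a toy model. This is a simulation sketch of my own, not broker code: it contrasts the default depth-limit behavior, where an unacked-message backlog eventually blocks the producer, with the 'ring' policy, where the oldest message is dropped and sends keep succeeding.

```python
from collections import deque

class BoundedQueue:
    """Toy model of a Qpid queue with a max_count depth limit.

    policy='default' models the back-pressure seen in this bug: once
    unacked messages reach max_count, send() fails (a real producer
    would block). policy='ring' models qpid.policy_type=ring: the
    oldest message is discarded and send() always succeeds.
    """
    def __init__(self, max_count, policy='default'):
        self.max_count = max_count
        self.policy = policy
        self.messages = deque()

    def send(self, msg):
        if len(self.messages) < self.max_count:
            self.messages.append(msg)
            return True
        if self.policy == 'ring':
            self.messages.popleft()  # drop the oldest message
            self.messages.append(msg)
            return True
        return False  # full queue: producer would block here
```

With max_count=10 and a consumer that never ACKs, the default policy stalls on the 11th send, matching the heartbeat stop just after startup described in the comments, while the ring queue keeps accepting messages at a constant depth of 10.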