Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1088481

Summary: qpidd restart needed to unblock celery processing
Product: [Retired] Pulp
Reporter: Jeff Ortel <jortel>
Component: async/tasks
Assignee: Brian Bouterse <bmbouter>
Status: CLOSED CURRENTRELEASE
QA Contact: mkovacik
Severity: high
Priority: high
Version: 2.4.0
Target Release: 2.4.0
Target Milestone: ---
CC: bmbouter, mkovacik, rbarlow, tross
Keywords: Triaged
Hardware: All
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-08-09 06:56:40 UTC
Attachments:
- jortel's /var/log/messages for past few days
- verifying screen log
- Screen log from another node

Description Jeff Ortel 2014-04-16 17:23:18 UTC
Description of problem:

Task processing seems to get blocked occasionally and the only way to fix it seems to be to restart qpidd.  When tracing with the debugger, it seemed to get caught in backend/base.py wait_for() but I suspect that is a symptom.

I've seen this for sure on Fedora 18 with qpid 0.24; not sure about other platforms or versions of qpid.

Version-Release number of selected component (if applicable):

2.4.0-0.9.beta

How reproducible:

Unknown

Comment 1 Randy Barlow 2014-04-16 17:27:50 UTC
I have observed this problem on several other people's machines, though never on my own. It could be due to versions of various involved software, though we really don't know much about it at this point. If anyone is able to provide more information, please do so.

Comment 2 Jeff Ortel 2014-04-16 17:53:38 UTC
Created attachment 886965 [details]
jortel's /var/log/messages for past few days.

Comment 4 Randy Barlow 2014-04-24 20:28:42 UTC
The fix for this bug is included in the 2.4.0-0.10.beta build that was just published to the Pulp fedorapeople.org repository.

Comment 5 Brian Bouterse 2014-04-29 15:30:12 UTC
I am reopening this bug because it has been experienced again in 2.4.0-0.10.beta.

Comment 6 Ted Ross 2014-05-06 21:43:48 UTC
I've looked at data supplied by Brian Bouterse.

It appears that Celery creates and binds queues (called celeryev.<uuid>).  It then proceeds to read messages from the queues but never acknowledges these messages.  As a result, the messages remain in the queues and the queues reach their configured depth limit.  The default behavior for queues-at-the-limit is to apply back-pressure to producers of messages to the queue.  This will cause the producing process to block in the "send" call.

This lock-up is inevitable after a time inversely proportional to the message transfer rate.

I suggest creating the queues with the "qpid.policy_type" argument set to "ring".  This will prevent the back pressure and will cause the oldest message in the queue to be discarded when a new message arrives on a full queue.
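A minimal sketch of what that suggestion could look like using the qpid.messaging address syntax (the helper name and queue name are illustrative, not Pulp's actual code):

```python
def ring_queue_address(queue_name, max_count=None):
    """Build a qpid.messaging address string that declares `queue_name`
    with qpid.policy_type=ring, so a full queue discards its oldest
    message instead of blocking producers with back-pressure."""
    args = ['"qpid.policy_type": "ring"']
    if max_count is not None:
        # Optional explicit depth limit for the queue.
        args.append('"qpid.max_count": %d' % max_count)
    return "%s; {create: always, node: {x-declare: {arguments: {%s}}}}" % (
        queue_name, ", ".join(args))

# e.g. the result would be passed to session.sender()/receiver():
addr = ring_queue_address("celeryev.1234", max_count=100)
```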

Comment 7 Brian Bouterse 2014-05-07 13:09:17 UTC
tl;dr: setting 'qpid.policy_type' to 'ring' works!

First, I verified that the number of unacked messages was the problem by reducing the allowed queue depth, setting the 'qpid.max_count' option to 10.  This stopped the system heartbeats just after startup, with 10 messages in each celeryev queue.  This is the same behavior observed on the systems affected by this issue.

To resolve the issue that celery does not ACK event messages, I set the 'qpid.policy_type' to 'ring' for any queue that starts with 'celeryev'.

I believe the pidbox queues will show the same behavior, but they fill more slowly than the 'celeryev' queues.  I'm going to guard the pidbox queues against this outstanding-messages issue as well.  The other areas of celery appear to ACK messages correctly when they are handled, so I believe this issue is resolved.
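The guard described above could be sketched as a simple name check (hypothetical helper names; the real change was made in the pulp fork of kombu):

```python
def needs_ring_policy(queue_name):
    # Queues celery reads from but never ACKs: event queues
    # ('celeryev.<uuid>') and pidbox broadcast queues.
    return queue_name.startswith("celeryev") or queue_name.endswith(".pidbox")

def declare_arguments(queue_name):
    # Only the affected queues get the ring policy; everything else
    # keeps the broker defaults, since those consumers ACK correctly.
    if needs_ring_policy(queue_name):
        return {"qpid.policy_type": "ring"}
    return {}
```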

I'll prepare a PR for this.  Thanks so much to tross and gsim for their guidance on this.

Comment 8 Brian Bouterse 2014-05-07 15:22:44 UTC
PR is available here:

https://github.com/pulp/kombu/pull/4

Comment 9 Brian Bouterse 2014-05-07 20:51:03 UTC
Merged into pulp with this PR:

https://github.com/pulp/pulp/pull/949

Comment 10 Randy Barlow 2014-05-07 22:16:27 UTC
The fix for this issue is included in pulp-2.4.0-0.12.beta.

Comment 11 mkovacik 2014-05-14 14:39:16 UTC
Created attachment 895511 [details]
verifying screen log

Verified in version python-kombu-3.0.15-5.pulp.fc20.noarch.
Reproduced by reverting the patch to induce the issue, which led to a sudden halt of pulp where test cases would just get stuck.
Restoring the patch brought test case execution back to normal.

Comment 12 mkovacik 2014-05-15 09:08:18 UTC
Created attachment 895814 [details]
Screen log from another node

Unfortunately, I've encountered the same issue with different queues on another node.  Running our CI suite 16 times fills up other pulp qpid queues, leaving the system stuck.  Restarting qpidd unblocks the system here, too.  See the attachment for the screen log.

Comment 13 mkovacik 2014-05-15 09:11:05 UTC
missed the version info:
[root@ec2-54-73-155-135 pulp_auto]# rpm -q pulp-server python-kombu
pulp-server-2.4.0-0.13.beta.fc20.noarch
python-kombu-3.0.15-5.pulp.fc20.noarch

Comment 14 Brian Bouterse 2014-05-15 10:16:12 UTC
I'm not sure what you mean by "another node".  Based on your attachment, I'm also not sure how you are hitting "the same issue" or why you believe that you are.

Here's what I see in your attachments.  The only queues in your attachment which have large outstanding message counts (the msg column) are the following:

celeryev.*
*.celery.pidbox

Those queues also have their ring policy set correctly, which means that we've resolved the root cause of this bug.  Having large message counts in those queues is part of normal operation, and restarting the broker will erase those messages, so I don't see anything unusual in the attachments.

Running a CI suite over and over again to leave pulp in a bad state could indicate there is an issue with the CI suite.

I don't run the CI suite (nor do customers) so I would need a normal use method to reproduce this issue without the CI suite.  If max message count is related to the problem, then you should be able to reproduce without the CI suite by lowering the max count on the queues listed above using qpid.max_count and having the system stop without the use of the CI suite.
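One way to sketch that reproduction, assuming the Qpid C++ broker's command-line tools are available (the queue name is illustrative, and these commands must run against a live broker):

```shell
# Recreate one event queue with a tiny depth limit and no ring policy,
# so back-pressure should kick in almost immediately.
qpid-config add queue 'celeryev.test-uuid' --max-queue-count 10

# Watch the msg column: once 10 unacked messages accumulate, producers
# sending to this queue should block in send(), without the CI suite.
qpid-stat -q
```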

I'll wait to hear more info after this reliability issue is investigated/reproduced independent of the CI suite.  I agree there could be a problem, but it needs more info before it can be properly investigated.

Comment 15 mkovacik 2014-05-15 16:42:56 UTC
Reviewing the screen log, and based on Comment 14, I have to agree that the root cause of the system block I'm seeing is somewhere else. Moving this particular issue back to Verified. Sorry for the confusion.

Comment 16 Randy Barlow 2014-08-09 06:56:40 UTC
This has been fixed in Pulp 2.4.0-1.