Bug 1268916 - [RabbitMQ] rabbitmq-server becomes wedged
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assigned To: Peter Lemenkov
QA Contact: yeylon@redhat.com
Depends On:
Blocks:
 
Reported: 2015-10-05 11:35 EDT by Joe Talerico
Modified: 2016-04-18 02:55 EDT (History)
CC: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-09 12:06:04 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Joe Talerico 2015-10-05 11:35:12 EDT
Description of problem:
The nova-compute service kept hitting timeouts against the message queue whenever I launched a workload. Once the workload started, `nova hypervisor-list` would report all of my compute nodes as "down".

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-5.el7ost.noarch

How reproducible:
N/A

Steps to Reproduce:
1. Launch 128 guests concurrently with Rally (Rally boot-list scenario).

We also saw problems running `rabbitmqctl list_queues`; it would hang at "Listing queues ...".
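
For reference, the reproduction step above corresponds roughly to a Rally task like the following sketch (the scenario name, flavor, and image are assumptions; only the 128-guest concurrency comes from this report):

    # boot-list.json -- sketch of a Rally "boot and list" task at 128-way concurrency
    cat > boot-list.json <<'EOF'
    {
      "NovaServers.boot_and_list_server": [
        {
          "args": {
            "flavor": {"name": "m1.small"},
            "image": {"name": "cirros"},
            "detailed": true
          },
          "runner": {
            "type": "constant",
            "times": 128,
            "concurrency": 128
          }
        }
      ]
    }
    EOF
    rally task start boot-list.json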

Actual results:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/rabbit-maybe_stuck.out
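
The attached output looks like what RabbitMQ's built-in stuck-process check produces; assuming that is how it was gathered (and that the rabbit_diagnostics module is present in this RabbitMQ version), it can be collected on a running node with something like:

    # Dump Erlang processes that RabbitMQ suspects are stuck
    rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
    # General node and cluster state for comparison
    rabbitmqctl status
    rabbitmqctl cluster_status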
Comment 3 John Eckersberg 2015-10-08 15:50:48 EDT
We looked at this a few days ago, and the giant red flag we found was that the link on one of the NICs was flapping up and down. Five seconds after the link flapped (which happens to be the TCP timeout we've configured), the connection between nodes timed out and the cluster became partitioned.
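
For anyone retracing this, a few commands that can confirm the same pattern (the interface name is a placeholder, not taken from this environment):

    # Carrier drops / errors on the suspect NIC
    ip -s link show dev eth0
    # Kernel link up/down messages
    journalctl -k | grep -i 'link is'
    # Whether RabbitMQ currently sees the cluster as partitioned
    rabbitmqctl cluster_status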

So I see two possible outcomes given that:

(1) Bonding is not working correctly for some reason.  NOTABUG, at least not as far as rabbitmq is concerned.

(2) Bonding is working correctly, and a five-second stall in TCP traffic is expected behavior. (I have no idea whether that is the case; just thinking out loud.) If so, we need to identify what a reasonable and expected delay is and raise the default timeout from 5 seconds to something higher; see the sketch below. I picked that number on a whim, with no real reasoning behind it other than "well, 5 seconds should be enough to get an ACK back, right?".
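
For context on where that 5 seconds lives: one way such a TCP timeout is commonly wired into rabbitmq-server is the TCP_USER_TIMEOUT socket option passed through RABBITMQ_SERVER_ERL_ARGS. If that is the mechanism in play here, raising it would look roughly like this sketch (the file contents and the 15000 ms figure are illustrative assumptions, not a tested recommendation):

    # /etc/rabbitmq/rabbitmq-env.conf (sketch only -- verify against the deployed config)
    # {raw,6,18,<<N:64/native>>} sets TCP_USER_TIMEOUT (IPPROTO_TCP=6, option 18) to N milliseconds
    RABBITMQ_SERVER_ERL_ARGS="+K true -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<15000:64/native>>}]"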

Joe, you probably know more about expected bonding behavior than I do.  Thoughts?
Comment 4 Joe Talerico 2015-10-09 11:59:00 EDT
John - yes, bonding was causing some odd failures.

We have most recently switched to Active/Backup and are seeing better results. 
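
For anyone verifying the same change, the active bonding mode and slave link state can be checked with something like the following (bond0 and the ifcfg path are placeholders, not taken from this environment):

    # Current bonding mode and per-slave link status
    cat /proc/net/bonding/bond0
    # On an ifcfg-based deployment, active/backup shows up as mode=active-backup
    grep BONDING_OPTS /etc/sysconfig/network-scripts/ifcfg-bond0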

For more on the bonding issue with OSPd deployments: https://bugzilla.redhat.com/show_bug.cgi?id=1267291
Comment 5 John Eckersberg 2015-10-09 12:06:04 EDT
Awesome, I will assume that the bonding change fixes this then.  If you see it again, feel free to re-open.
