Bug 1268916 - [RabbitMQ] rabbitmq-server becomes wedged
Summary: [RabbitMQ] rabbitmq-server becomes wedged
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assignee: Peter Lemenkov
QA Contact: yeylon@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-10-05 15:35 UTC by Joe Talerico
Modified: 2016-04-18 06:55 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-09 16:06:04 UTC
Target Upstream Version:
Embargoed:



Description Joe Talerico 2015-10-05 15:35:12 UTC
Description of problem:
The nova-compute service kept hitting timeouts against the message queue whenever I launched a workload. Once the workload started, `nova hypervisor-list` would show all of my compute nodes as "down".

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-5.el7ost.noarch

How reproducible:
N/A

Steps to Reproduce:
1. Launch 128 guests concurrently with Rally (Rally boot-list scenario).

Also, we saw problems running `rabbitmqctl list_queues`; it would hang at "Listing queues ...".
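The Rally "boot-list" scenario referenced in step 1 is presumably NovaServers.boot_and_list_server; a minimal invocation might look like the following sketch (the task file name, and the assumption that times/concurrency were both set to 128, are mine, not taken from this report):

  # sketch only - assumes a Rally deployment is already configured for this cloud,
  # and that boot-and-list-128.json defines NovaServers.boot_and_list_server
  # with runner settings times=128, concurrency=128
  rally task start boot-and-list-128.json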

Actual results:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/rabbit-maybe_stuck.out
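Judging by the file name, that output appears to have been collected with RabbitMQ's built-in "stuck process" probe, i.e. something along the lines of the command below; this is a guess based on the "maybe_stuck" name, since the exact command isn't recorded in this bug.

  rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'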

Comment 3 John Eckersberg 2015-10-08 19:50:48 UTC
We looked at this a few days ago, and the giant red flag we found was that the link on one of the NICs was flapping up and down. Five seconds after the link flapped (which happens to be the TCP timeout we've configured), the connection between the nodes timed out and the cluster became partitioned.
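For anyone retracing this later, both symptoms can be confirmed from the controllers; the interface name below is an assumption:

  # link state plus error/carrier counters for the suspect NIC (replace eth0 as needed)
  ip -s link show dev eth0

  # run on any cluster node; a non-empty {partitions,[...]} entry means the cluster has split
  rabbitmqctl cluster_status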

So I see two possible outcomes given that:

(1) Bonding is not working correctly for some reason.  NOTABUG, at least not as far as rabbitmq is concerned.

(2) Bonding is working correctly, and it's expected behavior to incur a five-second delay in TCP traffic (I have no idea whether that is the case; just thinking out loud). In that case, we need to identify what a reasonable and expected delay might be, and raise the default timeout from 5 seconds to something higher. I picked that number on a whim, without any real reasoning behind it other than "well, 5 seconds should be enough to get an ACK back, right?".
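The exact 5-second knob isn't named above. Purely as an illustration, if it maps to a kernel-level retransmission limit rather than a RabbitMQ setting, checking and raising it would look roughly like this (net.ipv4.tcp_retries2 is an assumed example, not confirmed to be the setting in question):

  # TCP retransmissions back off exponentially, so the effective
  # connection timeout grows quickly with this count
  sysctl net.ipv4.tcp_retries2          # show the current value
  sysctl -w net.ipv4.tcp_retries2=8     # raise it (example value only)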

Joe, you probably know more about expected bonding behavior than I do.  Thoughts?

Comment 4 Joe Talerico 2015-10-09 15:59:00 UTC
John - yes, bonding was causing some odd failures.

We have since switched to Active/Backup bonding and are seeing better results.

For more on the bonding issue with OSPd deployments: https://bugzilla.redhat.com/show_bug.cgi?id=1267291
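As a side note, the active bonding mode and per-slave failure counters can be confirmed straight from the kernel (bond0 is an assumed interface name):

  # "Bonding Mode: fault-tolerance (active-backup)" confirms the new mode;
  # each slave's "Link Failure Count" shows how often that link has flapped
  cat /proc/net/bonding/bond0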

Comment 5 John Eckersberg 2015-10-09 16:06:04 UTC
Awesome, I will assume that the bonding change fixes this then.  If you see it again, feel free to re-open.

