Bug 1268916 - [RabbitMQ] rabbitmq-server becomes wedged
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assigned To: Peter Lemenkov
QA Contact: yeylon@redhat.com
Depends On:
Blocks:
 
Reported: 2015-10-05 11:35 EDT by Joe Talerico
Modified: 2016-04-18 02:55 EDT (History)
CC: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-09 12:06:04 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Joe Talerico 2015-10-05 11:35:12 EDT
Description of problem:
The nova-compute service kept hitting timeouts against the message queue whenever I launched a workload. Once the workload started, `nova hypervisor-list` would report all of my compute nodes as "down".

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-5.el7ost.noarch

How reproducible:
N/A

Steps to Reproduce:
1. Launch 128 guests concurrently with Rally (Rally boot-list scenario).

We also saw problems running `rabbitmqctl list_queues`; it would hang at "Listing queues ...".
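
For reference, the reproduction step above corresponds roughly to a Rally task like the following sketch (the scenario name, flavor, and image are assumptions; only the 128-guest concurrency comes from this report):

    # boot-list.json -- sketch of a Rally "boot and list" task at 128-way concurrency
    cat > boot-list.json <<'EOF'
    {
      "NovaServers.boot_and_list_server": [
        {
          "args": {
            "flavor": {"name": "m1.small"},
            "image": {"name": "cirros"},
            "detailed": true
          },
          "runner": {
            "type": "constant",
            "times": 128,
            "concurrency": 128
          }
        }
      ]
    }
    EOF
    rally task start boot-list.json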

Actual results:
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/Random/rabbit-maybe_stuck.out
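
The attached output looks like what RabbitMQ's built-in stuck-process check produces; assuming that is how it was gathered (and that the rabbit_diagnostics module is present in this RabbitMQ version), it can be collected on a running node with something like:

    # Dump Erlang processes that RabbitMQ suspects are stuck
    rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
    # General node and cluster state for comparison
    rabbitmqctl status
    rabbitmqctl cluster_status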
Comment 3 John Eckersberg 2015-10-08 15:50:48 EDT
We looked at this a few days ago, and the giant red flag we found was that the link on one of the NICs was flapping up and down. Five seconds after the link flapped (which happens to be the TCP timeout we've configured), the connection between nodes timed out and the cluster became partitioned.
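
For anyone retracing this, a few commands that can confirm the same pattern (the interface name is a placeholder, not taken from this environment):

    # Carrier drops / errors on the suspect NIC
    ip -s link show dev eth0
    # Kernel link up/down messages
    journalctl -k | grep -i 'link is'
    # Whether RabbitMQ currently sees the cluster as partitioned
    rabbitmqctl cluster_status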

So I see two possible outcomes given that:

(1) Bonding is not working correctly for some reason.  NOTABUG, at least not as far as rabbitmq is concerned.

(2) Bonding is working correctly, and a five-second stall in TCP traffic is expected behavior. (I have no idea whether that is the case; just thinking out loud.) If so, we need to identify what a reasonable and expected delay is and raise the default timeout from 5 seconds to something higher; see the sketch below. I picked that number on a whim, with no real reasoning behind it other than "well, 5 seconds should be enough to get an ACK back, right?".
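
For context on where that 5 seconds lives: one way such a TCP timeout is commonly wired into rabbitmq-server is the TCP_USER_TIMEOUT socket option passed through RABBITMQ_SERVER_ERL_ARGS. If that is the mechanism in play here, raising it would look roughly like this sketch (the file contents and the 15000 ms figure are illustrative assumptions, not a tested recommendation):

    # /etc/rabbitmq/rabbitmq-env.conf (sketch only -- verify against the deployed config)
    # {raw,6,18,<<N:64/native>>} sets TCP_USER_TIMEOUT (IPPROTO_TCP=6, option 18) to N milliseconds
    RABBITMQ_SERVER_ERL_ARGS="+K true -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<15000:64/native>>}]"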

Joe, you probably know more about expected bonding behavior than I do.  Thoughts?
Comment 4 Joe Talerico 2015-10-09 11:59:00 EDT
John - yes, bonding was causing some odd failures.

We have most recently switched to Active/Backup and are seeing better results. 
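
For anyone verifying the same change, the active bonding mode and slave link state can be checked with something like the following (bond0 and the ifcfg path are placeholders, not taken from this environment):

    # Current bonding mode and per-slave link status
    cat /proc/net/bonding/bond0
    # On an ifcfg-based deployment, active/backup shows up as mode=active-backup
    grep BONDING_OPTS /etc/sysconfig/network-scripts/ifcfg-bond0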

For more on the bonding issue with OSPd deployments: https://bugzilla.redhat.com/show_bug.cgi?id=1267291
Comment 5 John Eckersberg 2015-10-09 12:06:04 EDT
Awesome, I will assume that the bonding change fixes this then.  If you see it again, feel free to re-open.
