Red Hat Bugzilla – Bug 1268916
[RabbitMQ] rabbitmq-server becomes wedged
Last modified: 2016-04-18 02:55:33 EDT
Description of problem:
The nova-compute service kept hitting message-queue timeouts whenever I launched a workload. Once the workload started, `nova hypervisor-list` showed all of my compute nodes as "down".
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Launch 128 guests concurrently with Rally (Rally boot-list scenario).
We also saw problems running `rabbitmqctl list_queues`: it would hang at "Listing queues ...".
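One way to tell a hung `rabbitmqctl` from a merely slow one is to run it under a deadline. The following is a minimal Python sketch, not something used in this report: the helper name `run_with_deadline` and the 30-second limit are illustrative, and it assumes `rabbitmqctl` is on the PATH of the node being checked.

```python
import subprocess

def run_with_deadline(cmd, deadline_s):
    """Run cmd and return its combined output, or None if it did not finish in time."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=deadline_s,  # the child is killed when the deadline passes
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout

# Example call on a broker node (assumes rabbitmqctl is installed there):
#   out = run_with_deadline(["rabbitmqctl", "list_queues"], 30)
#   print("hung or too slow" if out is None else out)
```

A `None` result only says the command did not return within the deadline; it does not say why.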
We looked at this a few days ago, and the giant red flag we found was that the link on one of the NICs was flapping up and down. Five seconds after the link flaps (which just happens to be the TCP timeout setting we've configured), the connection between nodes times out and the cluster becomes partitioned.
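The correlation described above (a timeout roughly five seconds after each flap) can be checked mechanically once both sets of timestamps have been pulled from the logs. This is a rough Python sketch under stated assumptions: the function name, the one-second tolerance, and the sample timestamps are all hypothetical, and extracting the timestamps from kernel and RabbitMQ logs is left out.

```python
def timeouts_explained_by_flaps(flap_times, timeout_times,
                                tcp_timeout_s=5.0, tolerance_s=1.0):
    """Return the timeouts that occur about tcp_timeout_s after some link flap."""
    explained = []
    for t in timeout_times:
        # A timeout is "explained" if some flap precedes it by roughly the TCP timeout.
        if any(abs((t - f) - tcp_timeout_s) <= tolerance_s for f in flap_times):
            explained.append(t)
    return explained

# Hypothetical timestamps, in seconds since an arbitrary start.
flaps = [100.0, 250.0]
timeouts = [105.2, 180.0, 254.9]
print(timeouts_explained_by_flaps(flaps, timeouts))
```

Timeouts that do not line up with any flap (180.0 in the sample data) would point at a cause other than the NIC.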
So I see two possible outcomes given that:
(1) Bonding is not working correctly for some reason. NOTABUG, at least not as far as rabbitmq is concerned.
(2) Bonding is working correctly - it's expected behavior to incur a five-second delay in TCP traffic. (I have no idea whether that is the case, just thinking out loud.) In this case, we need to identify what a reasonable and expected delay might be, and raise the default timeout from 5 seconds to something higher. I picked that number on a whim without any real reasoning behind it other than "well, 5 seconds should be enough to get an ACK back, right?".
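If outcome (2) turns out to be the case, one simple way to pick a less arbitrary number is to measure the delays actually seen during bond failover and add headroom. The Python sketch below is only an illustration of that idea: the function name, the 2x margin, and the sample delays are assumptions, not measurements or recommendations from this bug.

```python
def suggest_timeout(observed_delays_s, current_timeout_s=5.0, margin=2.0):
    """Suggest a timeout covering the worst observed delay with headroom.

    Never suggests less than the current timeout.
    """
    if not observed_delays_s:
        return current_timeout_s
    return max(current_timeout_s, max(observed_delays_s) * margin)

# Hypothetical failover delays, in seconds, measured during link flaps.
print(suggest_timeout([1.2, 4.8, 3.0]))
```

The margin trades faster detection of genuinely dead peers against tolerance for transient link loss, so the right value depends on how long a partition-free stall is acceptable.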
Joe, you probably know more about expected bonding behavior than I do. Thoughts?
John - yes, bonding was causing some odd failures.
We recently switched to Active/Backup and are seeing better results.
For more on the bonding issue with OSPd deployments: https://bugzilla.redhat.com/show_bug.cgi?id=1267291
Awesome, I will assume that the bonding change fixes this then. If you see it again, feel free to re-open.