Description of problem:
In an HA setup where nova-compute and nova-conductor connect to rabbitmq through a VIP/haproxy, if the haproxy service (or the whole node) is shut down, nova-conductor does not notice that the connection was closed and never reconnects to rabbitmq. The result is that nova-compute is marked as "down" in the "nova service-list" output.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Deploy HA setup with staypuft with multiple ctrl nodes
2. Detect ctrl node where rabbitmq's VIP is currently assigned
3. Terminate this node, or just move rabbitmq's VIP to another node with:
crm_resource -M -r ip-192.168.0.99 -H another_host
Actual results:
After ~1 minute the nova-compute service is marked as "down"; there is no message in the nova-conductor log about reconnecting to the AMQP server.

Expected results:
nova-conductor reconnects to the AMQP server (there should be a message in the log) and the nova-compute service stays "up".
This issue is caused by lack of heartbeat in oslo.messaging tracked here:
A workaround for the missing heartbeat which worked for me in the TripleO project is decreasing the sysctl keepalive timeouts as described here (though I used higher values, which close the connection after ~50 secs):
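For reference, the sysctl values I have in mind look roughly like the following (the exact numbers are illustrative, not the ones from the referenced guide; a dead peer is detected after tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl seconds, so 5 + 9*5 = 50 secs here):

```shell
# /etc/sysctl.d/90-keepalive.conf -- illustrative values, tune per deployment.
# Start probing after 5s of idle, probe every 5s, give up after 9 failed
# probes: the connection is torn down after roughly 5 + 9*5 = 50 seconds.
net.ipv4.tcp_keepalive_time = 5
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 9
```

Note these timers only affect sockets that have opted into keepalive via SO_KEEPALIVE; apply with "sysctl -p /etc/sysctl.d/90-keepalive.conf".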
In TripleO I also had to apply the patches linked from https://bugs.launchpad.net/oslo.messaging/+bug/1338732, otherwise nova-compute kept flapping between "up" and "down" status (this probably happens in OFI too, but I did not test it with an OFI setup):
The above patches helped mitigate the issue in most cases in TripleO (though not 100%).
OK, reading the above, it looks like we need to track three upstream issues:
1) https://bugzilla.redhat.com/1129242 - implement RabbitMQ heartbeating so we notice more quickly when the other side has gone away
Candidate fixes in gerrit right now are https://review.openstack.org/94656 and https://review.openstack.org/36606
2) https://bugs.launchpad.net/oslo.messaging/+bug/1338732 - an ordering issue with how queues are declared which causes issues on reconnect
3) https://bugs.launchpad.net/oslo.messaging/+bug/1349301 - in-flight replies get lost during failover because the reply queue is auto-delete?
It may be worth filing separate bugzillas for these, perhaps with this one depending on each of them. We need to be able to talk clearly about each sub-issue individually and, once each of them is fixed, figure out whether the symptoms described in this bug have been fully eliminated.
*** Bug 1141958 has been marked as a duplicate of this bug. ***
I've had some initial success with the heartbeat by slightly modifying the patch from https://review.openstack.org/#/c/94656/. I've left comments in the review there, just waiting for the patch author to get back to me.
Neutron services suffer from a similar issue too: "neutron agent-list" shows all agents down until neutron-server is restarted. This is not a big surprise, since neutron uses the oslo.messaging lib too.
I tried the heartbeat patch (https://review.openstack.org/#/c/94656/) in TripleO but didn't have much luck with it. John did more testing on this and discovered that the patch may not work if a lower heartbeat interval is used (which is my case; I used 10 or 15 secs). A quick test with a 30s interval still left nova-compute marked as down.
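For anyone reproducing this, the knobs I mean by "heartbeat interval" are the config options proposed in that review (option names may still change before it merges; the values here are just the ones I was experimenting with):

```shell
# nova.conf / neutron.conf fragment -- option names from the in-review
# heartbeat patch, values illustrative.
[oslo_messaging_rabbit]
# Consider the connection dead if no heartbeat is seen for this many seconds.
heartbeat_timeout_threshold = 30
# Number of heartbeats sent per timeout window (i.e. one every 15s here).
heartbeat_rate = 2
```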
I have observed the same behaviour with neutron-server and other AMQP-dependent services.
Agents will reconnect, because they try to send a heartbeat periodically, which triggers a TCP reset and a reconnection.
But neutron-server keeps waiting for data on the old TCP connection, which takes several hours to reset.
I'm going to test the TCP keepalive solution. I believe it's not intrusive to other system apps, since it looks like applications (rabbitmq/haproxy) need to request TCP keepalive mode per listening socket (verifying this).
"Remember that keepalive support, even if configured in the kernel, is not the default behavior in Linux. Programs must request keepalive control for their sockets using the setsockopt interface. There are relatively few programs implementing keepalive, but you can easily add keepalive support for most of them following the instructions explained later in this document." from http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#examples
(In reply to Miguel Angel Ajo from comment #16)
> I have observed the same behaviour with neutron-server and other AMQP
> dependant services.
> Agents will reconnect, because they try to send a heartbeat periodically,
> that triggers a TCP reset, and a reconnection.
> But neutron-server stays waiting for data on the old TCP connection, which
> takes several hours to reset.
> I'm going to test the TCP keepalive connection solution, I believe that's
> not intrusive to other system apps, as it looks like apps (rabbitmq/haproxy)
> need to request the TCP keepalive mode per "LISTEN" (verifying this).
Well, it is intrusive in the sense that the TCP keepalive timer settings are system-wide, so any application that requests TCP keepalive (SO_KEEPALIVE) will use the system-wide settings.
And yes, haproxy can turn on TCP keepalives per proxy by using either 'option tcpka' in a listen block or clitcpka/srvtcpka in the client/server blocks.
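To make that concrete, a minimal stanza would look something like this (the listener name, addresses and ports are made up for illustration):

```shell
# haproxy.cfg fragment -- illustrative backend names/addresses.
listen rabbitmq
    bind 192.168.0.99:5672
    balance roundrobin
    option clitcpka    # TCP keepalives on client-side (frontend) connections
    option srvtcpka    # TCP keepalives on server-side (backend) connections
    server ctrl1 192.168.0.11:5672 check
    server ctrl2 192.168.0.12:5672 check
```

('option tcpka' in the same block would enable both sides at once.)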
> "Remember that keepalive support, even if configured in the kernel, is not
> the default behavior in Linux. Programs must request keepalive control for
> their sockets using the setsockopt interface. There are relatively few
> programs implementing keepalive, but you can easily add keepalive support
> for most of them following the instructions explained later in this
> document." from 
>  http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#examples
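As a sketch of what "request keepalive control using the setsockopt interface" means in practice (illustrative only; oslo.messaging/kombu would have to do the equivalent on its AMQP socket), note that on Linux a program can also override the system-wide sysctl timers per socket via TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT:

```python
import socket

def enable_keepalive(sock, idle=5, interval=5, probes=9):
    """Opt the socket into TCP keepalive and (Linux-specific) override the
    system-wide timers so a dead peer is noticed after ~idle + probes*interval
    seconds instead of the default couple of hours."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of net.ipv4.tcp_keepalive_* (Linux-only options):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero when enabled
```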
Using rabbitmq-3.3.5 did not solve the message timeout issue when testing this locally on TripleO.
When using the sysctl keepalive workaround and moving only the VIP around, hitting the MessageTimeout issue may be caused by haproxy on the old VIP node - see comment 22 of this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1175685#c22
You also need to restart haproxy on the node that was holding the VIP if you want to be safe.
There is a problem with some connections staying in the keepalive probing state and keeping the backend connections open. The VIP is removed in the middle of TCP keepalive probing, and since the kernel no longer holds that IP it cannot send probes from it; until the probes time out, the backend connections stay open.