Bug 1129242

Summary: Nova-compute service and neutron-agents go down when rabbitmq connection is terminated or vip-rabbitmq is moved from host to host
Product: Red Hat OpenStack
Component: openstack-foreman-installer
Reporter: Jan Provaznik <jprovazn>
Assignee: John Eckersberg <jeckersb>
Status: CLOSED CURRENTRELEASE
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Priority: high
Version: 5.0 (RHEL 7)
Target Release: Installer
Hardware: All
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-04-16 12:06:50 UTC
CC: aberezin, benglish, dmaley, ebarrera, fdinitto, gfidente, hbrock, jeckersb, jguiditt, lnatapov, majopela, markmc, mburns, morazi, oblaut, rhos-maint, rohara, sstar, tshefi, vnarayan, yeylon
Bug Depends On: 1167414, 1171744
Bug Blocks: 1163783

Description Jan Provaznik 2014-08-12 11:20:18 UTC
Description of problem:
In an HA setup where the nova-compute and nova-conductor services connect to rabbitmq through a VIP/haproxy, if the haproxy service or the whole node is shut down, nova-conductor does not notice that the connection was closed and does not reconnect to rabbitmq. As a result, nova-compute is marked as "down" in the "nova service-list" output.

Version-Release number of selected component (if applicable):
openstack-nova-conductor-2014.1.1-4.el7ost.noarch
openstack-nova-scheduler-2014.1.1-4.el7ost.noarch
openstack-dashboard-theme-2014.1.1-2.el7ost.noarch
openstack-heat-api-cloudwatch-2014.1.1-2.2.el7ost.noarch
openstack-cinder-2014.1.1-1.el7ost.noarch
openstack-nova-console-2014.1.1-4.el7ost.noarch
python-django-openstack-auth-1.1.5-2.el7ost.noarch
openstack-heat-api-2014.1.1-2.2.el7ost.noarch
openstack-heat-api-cfn-2014.1.1-2.2.el7ost.noarch
openstack-nova-novncproxy-2014.1.1-4.el7ost.noarch
openstack-nova-api-2014.1.1-4.el7ost.noarch
redhat-access-plugin-openstack-5.0.0-3.el7ost.noarch
openstack-utils-2014.1-3.el7ost.noarch
openstack-glance-2014.1.1-1.el7ost.noarch
openstack-selinux-0.5.14-3.el7ost.noarch
openstack-heat-common-2014.1.1-2.2.el7ost.noarch
openstack-nova-common-2014.1.1-4.el7ost.noarch
openstack-keystone-2014.1.1-1.el7ost.noarch
openstack-nova-cert-2014.1.1-4.el7ost.noarch
openstack-dashboard-2014.1.1-2.el7ost.noarch
openstack-heat-engine-2014.1.1-2.2.el7ost.noarch


Steps to Reproduce:
1. Deploy an HA setup with Staypuft with multiple controller nodes
2. Find the controller node where rabbitmq's VIP is currently assigned (see the note after these steps)
3. Terminate this node, or just move rabbitmq's VIP to another node with:
crm_resource -M -r ip-192.168.0.99 -H another_host
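
The node currently holding the VIP (step 2) can, for example, be located with pacemaker's own tooling, using the same resource name as in step 3:
crm_resource --locate -r ip-192.168.0.99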

Actual results:
After ~1 minute the nova-compute service is marked as "down"; there is no message in the nova-conductor log about reconnecting to the AMQP server.

Expected results:
nova-conductor reconnects to the AMQP server (there should be a message in the log) and the nova-compute service stays "up".

Additional info:
This issue is caused by the lack of heartbeat support in oslo.messaging, tracked here:
https://bugs.launchpad.net/oslo.messaging/+bug/856764/

A workaround for the missing heartbeat which worked for me in the TripleO project is decreasing the sysctl keepalive timeouts as described here (though I used higher values, which close the connection after ~50 secs):
https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19
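
The workaround amounts to lowering the kernel's TCP keepalive sysctls on the hosts running the affected services. A minimal sketch with illustrative values (not necessarily the exact ones from the linked comment) that add up to roughly 50 seconds:

sysctl -w net.ipv4.tcp_keepalive_time=30     # idle seconds before the first probe
sysctl -w net.ipv4.tcp_keepalive_intvl=5     # seconds between probes
sysctl -w net.ipv4.tcp_keepalive_probes=4    # failed probes before the connection is dropped

With these values a dead peer is detected after roughly 30 + 5*4 = 50 seconds, but only for sockets that have SO_KEEPALIVE enabled.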


In TripleO I also had to apply the patches linked from https://bugs.launchpad.net/oslo.messaging/+bug/1338732, otherwise nova-compute was still switching between up/down status (this probably happens with OFI too, but I did not test it with an OFI setup):
https://review.openstack.org/#/c/110058/
https://review.openstack.org/#/c/109373/ 
The above patches helped mitigate the issue in most cases in TripleO (though not 100%).

Comment 5 Mark McLoughlin 2014-08-13 05:50:40 UTC
Ok, reading the above, it looks like we need to track three upstream issues:

1) https://bugzilla.redhat.com/1129242 - implement RabbitMQ heartbeating so we notice more quickly when the other side has gone away

Candidate fixes in gerrit right now are https://review.openstack.org/94656 and https://review.openstack.org/36606

2) https://bugs.launchpad.net/oslo.messaging/+bug/1338732 - an ordering issue with how queues are declared, which causes problems on reconnect

See https://review.openstack.org/110058

3) https://bugs.launchpad.net/oslo.messaging/+bug/1349301 - in-flight replies get lost during failover because the reply queue is auto-delete?

See https://review.openstack.org/109373

It may be worth having separate bugzillas for these, and perhaps having this one depend on each of them. We need to be able to talk clearly about each sub-issue individually and, when each of them is fixed, figure out whether the symptoms described in this bug have been fully eliminated.

Comment 6 John Eckersberg 2014-09-16 17:29:46 UTC
*** Bug 1141958 has been marked as a duplicate of this bug. ***

Comment 7 John Eckersberg 2014-10-01 16:57:10 UTC
I've had some initial success with the heartbeat by slightly modifying the patch from https://review.openstack.org/#/c/94656/.  I've left comments in the review there, just waiting for the patch author to get back to me.

Comment 8 Jan Provaznik 2014-10-21 11:23:44 UTC
Neutron services suffer from a similar issue too: "neutron agent-list" shows all agents down until neutron-server is restarted. This is not a big surprise since neutron uses the oslo.messaging lib too.

Comment 15 Jan Provaznik 2014-10-31 14:45:31 UTC
I tried the heartbeat patch (https://review.openstack.org/#/c/94656/) in TripleO but did not have much luck with it. John did more testing on this and discovered that the patch may not work if a lower heartbeat interval is used (which is my case; I used 10 or 15 secs). I did a quick test with a 30s interval but nova-compute was still down.

Comment 16 Miguel Angel Ajo 2014-11-19 09:15:41 UTC
I have observed the same behaviour with neutron-server and other AMQP-dependent services.

Agents will reconnect, because they try to send a heartbeat periodically, which triggers a TCP reset and a reconnection.

But neutron-server stays waiting for data on the old TCP connection, which takes several hours to reset.

I'm going to test the TCP keepalive solution. I believe it's not intrusive to other system apps, as it looks like apps (rabbitmq/haproxy) need to request TCP keepalive mode per "LISTEN" socket (verifying this).


"Remember that keepalive support, even if configured in the kernel, is not the default behavior in Linux. Programs must request keepalive control for their sockets using the setsockopt interface. There are relatively few programs implementing keepalive, but you can easily add keepalive support for most of them following the instructions explained later in this document." from [1]

[1] http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive 
[2] http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#examples
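
To illustrate the quote above: keepalive is opt-in per socket, so an application has to do something along these lines (a minimal Python sketch with arbitrary values, not code from any of the services discussed here):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Opt this socket into TCP keepalive; without this the tcp_keepalive_* sysctls have no effect on it.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific per-socket overrides of the net.ipv4.tcp_keepalive_* defaults.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)   # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4)     # probes before the connection is dropped
sock.connect(('192.168.0.99', 5672))  # rabbitmq VIP from the reproduction steps, standard AMQP port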

Comment 18 Ryan O'Hara 2014-11-19 14:13:26 UTC
(In reply to Miguel Angel Ajo from comment #16)
> I have observed the same behaviour with neutron-server and other AMQP
> dependant services. 
> 
> Agents will reconnect, because they try to send a heartbeat periodically,
> that triggers a TCP reset, and a reconnection.
> 
> But neutron-server stays waiting for data on the old TCP connection, which
> takes several hours to reset.
> 
> I'm going to test the TCP keepalive connection solution, I believe that's
> not intrusive to other system apps, as it looks like apps (rabbitmq/haproxy)
> need to request the TCP keepalive mode per "LISTEN" (verifying this).

Well it is intrusive in the sense that the TCP keepalive settings are system-wide, so any application that requests TCP keepalive (SO_KEEPALIVE) will use the system-wide settings.

And yes, haproxy can turn on TCP keepalives per proxy by using either 'option tcpka' in a listen block, or 'option clitcpka'/'option srvtcpka' for the client-side and server-side connections respectively.

> "Remember that keepalive support, even if configured in the kernel, is not
> the default behavior in Linux. Programs must request keepalive control for
> their sockets using the setsockopt interface. There are relatively few
> programs implementing keepalive, but you can easily add keepalive support
> for most of them following the instructions explained later in this
> document." from [1]
> 
> [1]
> http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive 
> [2] http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#examples
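
To make the haproxy side concrete, a minimal sketch of a listen block with TCP keepalives enabled (the backend names and addresses are made up; only the VIP comes from the reproduction steps):

listen rabbitmq
    bind 192.168.0.99:5672
    mode tcp
    option tcpka
    balance roundrobin
    server ctrl0 192.168.0.10:5672 check
    server ctrl1 192.168.0.11:5672 check

'option tcpka' enables keepalive on both the client-facing and the server-facing connections; 'option clitcpka' and 'option srvtcpka' enable only one side each.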

Comment 29 Jan Provaznik 2014-12-12 15:03:18 UTC
Using rabbitmq-3.3.5 did not solve the message timeout issue when testing this on TripleO locally.

Comment 35 Jan Provaznik 2015-01-20 14:02:50 UTC
When using the sysctl keepalive workaround and only moving the VIP around, hitting the MessageTimeout issue may be caused by haproxy on the old VIP node; see comment 22 of this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1175685#c22

Comment 36 Miguel Angel Ajo 2015-01-21 11:31:29 UTC
You also need to restart haproxy on the node that was holding the VIP if you want to be safe.

There is a problem with some connections staying in the keepalive probing state and keeping the backend connections open. The VIP is removed in the middle of the TCP keepalive probing state, and because the kernel no longer holds that IP it cannot send probes from it; until the probing times out, the backend connections stay open.