Bug 1129242
| Summary: | Nova-compute service and neutron-agents go down when rabbitmq connection is terminated or vip-rabbitmq is moved from host to host | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jan Provaznik <jprovazn> |
| Component: | openstack-foreman-installer | Assignee: | John Eckersberg <jeckersb> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.0 (RHEL 7) | CC: | aberezin, benglish, dmaley, ebarrera, fdinitto, gfidente, hbrock, jeckersb, jguiditt, lnatapov, majopela, markmc, mburns, morazi, oblaut, rhos-maint, rohara, sstar, tshefi, vnarayan, yeylon |
| Target Milestone: | --- | | |
| Target Release: | Installer | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-04-16 12:06:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1167414, 1171744 | | |
| Bug Blocks: | 1163783 | | |
Description
Jan Provaznik
2014-08-12 11:20:18 UTC
Ok, reading the above it looks like we need to track three upstream issues:

1) https://bugzilla.redhat.com/1129242 - implement RabbitMQ heartbeating so we notice more quickly when the other side has gone away. Candidate fixes in gerrit right now are https://review.openstack.org/94656 and https://review.openstack.org/36606

2) https://bugs.launchpad.net/oslo.messaging/+bug/1338732 - an ordering issue with how queues are declared, which causes issues on reconnect. See https://review.openstack.org/110058

3) https://bugs.launchpad.net/oslo.messaging/+bug/1349301 - in-flight replies get lost during failover because the reply queue is auto-delete? See https://review.openstack.org/109373

It may be worth having separate bugzillas for these, maybe having this one depend on each of them - we need to be able to talk clearly about each sub-issue individually and, when each of them is fixed, figure out whether the symptoms described in this bug have been fully eliminated.

*** Bug 1141958 has been marked as a duplicate of this bug. ***

I've had some initial success with the heartbeat by slightly modifying the patch from https://review.openstack.org/#/c/94656/. I've left comments in the review there; just waiting for the patch author to get back to me.

Neutron services suffer from a similar issue too - "neutron agent-list" shows all agents down until neutron-server is restarted. This is not much of a surprise, since neutron uses the oslo.messaging lib too.

I tried the heartbeat patch (https://review.openstack.org/#/c/94656/) in TripleO but did not have much luck with it. John did more testing on this and discovered that the patch may not work if a lower heartbeat interval is used (which is my case - I used 10 or 15 secs). I did a quick test with a 30s interval, but nova-compute was still down.

I have observed the same behaviour with neutron-server and other AMQP-dependent services.

Agents will reconnect, because they try to send a heartbeat periodically; that triggers a TCP reset and a reconnection.
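For reference, this is roughly the shape of the tuning being tested above (10-30 s heartbeat intervals). This is a hypothetical sketch using the option names from the later upstream oslo.messaging heartbeat work; the candidate patches under review here may use different names:

```ini
# Hypothetical nova.conf / neutron.conf fragment (assumed option names).
[oslo_messaging_rabbit]
# Consider the broker dead if no heartbeat is seen for this many seconds.
heartbeat_timeout_threshold = 30
# Heartbeats are sent every timeout/rate seconds (here: every 15s).
heartbeat_rate = 2
```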
But neutron-server stays waiting for data on the old TCP connection, which takes several hours to reset.

I'm going to test the TCP keepalive connection solution. I believe that's not intrusive to other system apps, as it looks like apps (rabbitmq/haproxy) need to request the TCP keepalive mode per "LISTEN" (verifying this).

"Remember that keepalive support, even if configured in the kernel, is not the default behavior in Linux. Programs must request keepalive control for their sockets using the setsockopt interface. There are relatively few programs implementing keepalive, but you can easily add keepalive support for most of them following the instructions explained later in this document." from [1]

[1] http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive
[2] http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#examples

(In reply to Miguel Angel Ajo from comment #16)
> I have observed the same behaviour with neutron-server and other AMQP
> dependent services.
>
> Agents will reconnect, because they try to send a heartbeat periodically,
> that triggers a TCP reset, and a reconnection.
>
> But neutron-server stays waiting for data on the old TCP connection, which
> takes several hours to reset.
>
> I'm going to test the TCP keepalive connection solution, I believe that's
> not intrusive to other system apps, as it looks like apps (rabbitmq/haproxy)
> need to request the TCP keepalive mode per "LISTEN" (verifying this).

Well, it is intrusive in the sense that the TCP keepalive settings are system-wide, so any application that requests TCP keepalive (SO_KEEPALIVE) will use the system-wide settings. And yes, haproxy can turn on TCP keepalives per proxy by using either 'option tcpka' in a listen block or clitcpka/srvtcpka in the client/server blocks.

> "Remember that keepalive support, even if configured in the kernel, is not
> the default behavior in Linux.
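The "programs must request keepalive via setsockopt" point above can be sketched in a few lines. This is an illustrative Python example, not code from any of the services discussed here; the function name and timer values are made up, and the `TCP_KEEP*` per-socket overrides are Linux-specific (hence the `hasattr` guards):

```python
import socket

def enable_tcp_keepalive(sock, idle=30, interval=10, probes=3):
    """Request TCP keepalive on a socket.

    Without the per-socket overrides below, the system-wide sysctl
    values (net.ipv4.tcp_keepalive_*) apply once SO_KEEPALIVE is set.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-only per-socket overrides of the system-wide defaults:
    if hasattr(socket, "TCP_KEEPIDLE"):   # seconds of idle before probing
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):  # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):    # failed probes before reset
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock

sock = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
```

With settings like the above, a peer that silently disappeared (e.g. because the VIP moved) would be detected after roughly idle + interval * probes seconds instead of the multi-hour default.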
> Programs must request keepalive control for
> their sockets using the setsockopt interface. There are relatively few
> programs implementing keepalive, but you can easily add keepalive support
> for most of them following the instructions explained later in this
> document." from [1]
>
> [1] http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive
> [2] http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#examples

Using rabbitmq-3.3.5 didn't solve the message timeout problem when testing this on TripleO locally.

When using the sysctl keepalive workaround and moving around only the VIP, if you hit the MessageTimeout issue, it may be caused by haproxy on the old VIP node - see comment 22 of this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1175685#c22 You also need to restart haproxy on the node that was holding the VIP if you want to be safe.

There is a problem with some connections staying in the keepalive probing state and keeping the backend connections open: the VIP is removed in the middle of TCP keepalive probing, and the kernel no longer holds the IP, so it cannot send probes from that IP; until the probing times out, it keeps the backend connections open.
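The haproxy side mentioned above would look roughly like this. A hypothetical sketch: the proxy name, addresses, and ports are placeholders, while `option tcpka` (and `clitcpka`/`srvtcpka` for one side only) are the real directives:

```
# Hypothetical haproxy listen block for the RabbitMQ VIP.
listen rabbitmq
    bind 192.0.2.10:5672
    # Enable TCP keepalives on both the client and server sides;
    # use 'option clitcpka' / 'option srvtcpka' for one side only.
    option tcpka
    server rabbit1 192.0.2.11:5672 check
    server rabbit2 192.0.2.12:5672 check
```

Note that the keepalive timers themselves remain the system-wide sysctls (net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, net.ipv4.tcp_keepalive_probes), which is the "intrusive in the sense that the settings are system-wide" point made above.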