Bug 1175685 - in some circumstances, nova-compute appears to be unable to send updates after reconnecting to rabbit
Summary: in some circumstances, nova-compute appears to be unable to send updates after reconnecting to rabbit
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
: 6.0 (Juno)
Assignee: John Eckersberg
QA Contact: Gabriel Szasz
URL:
Whiteboard:
Depends On: 1151756 1186459 1186463 1186966 1189241 1190776 1190783 1194407
Blocks: 1186749
 
Reported: 2014-12-18 11:43 UTC by Giulio Fidente
Modified: 2019-12-16 04:36 UTC (History)
24 users

Fixed In Version: python-oslo-messaging-1.4.1-3.el7ost
Doc Type: Bug Fix
Doc Text:
-
Clone Of:
Environment:
Last Closed: 2016-08-26 19:10:58 UTC
Target Upstream Version:
Embargoed:


Attachments
nova-compute.log (25.42 KB, text/plain)
2014-12-18 11:43 UTC, Giulio Fidente
no flags Details
nova-compute.log.recovered (21.94 KB, text/plain)
2014-12-18 12:18 UTC, Giulio Fidente
no flags Details
nova-compute.log.trunk.recovered_15mins (44.51 KB, text/plain)
2015-01-07 17:42 UTC, Giulio Fidente
no flags Details
console log - sequence of commands I did (27.50 KB, text/plain)
2015-01-14 13:46 UTC, Jan Provaznik
no flags Details
nova-compute log from compute node (74.29 KB, text/plain)
2015-01-14 13:46 UTC, Jan Provaznik
no flags Details
nova-conductor log from one of controller nodes (8.27 KB, text/plain)
2015-01-14 13:47 UTC, Jan Provaznik
no flags Details
nova-compute.log.never_recover (345.10 KB, text/plain)
2015-01-14 15:36 UTC, Giulio Fidente
no flags Details

Description Giulio Fidente 2014-12-18 11:43:51 UTC
Created attachment 970504 [details]
nova-compute.log

Description of problem:
By using aggressive keepalive settings and an updated version of rabbitmq, as documented in bz #1129242, the majority of services seem to recover correctly after a VIP failover, except nova-compute, which instead appears to remain stuck in a loop and unable to send updates.

Attached is a piece of the nova-compute.log going from a stable working condition through loss of connection and reconnection, and finally entering the loop.


Version-Release number of selected component (if applicable):
openstack-nova-compute-2014.2.1-7.el7ost.noarch


Steps to Reproduce:
1. deploy with 3 controllers, ensure keepalive timings and rabbitmq are configured as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1167414
2. shutdown one of the controllers

Comment 2 Giulio Fidente 2014-12-18 12:18:43 UTC
Created attachment 970518 [details]
nova-compute.log.recovered

*NOTE*: this does not reproduce consistently; some compute nodes will recover, which makes it pretty hard to debug.

By contrast, I have attached the logs of a system which did recover.

Comment 4 Giulio Fidente 2015-01-07 17:42:22 UTC
Created attachment 977471 [details]
nova-compute.log.trunk.recovered_15mins

Upstream nova code (deployed with upstream TripleO devtest) seems to be able to recover from the timeouts, even though in one test it took 15 minutes, so the issue does not seem to be related to either the sysctl keepalive settings or the message updates occurring every minute.

Comment 5 Dan Smith 2015-01-08 16:11:22 UTC
Looks to me like there is a typo in oslo.messaging that is causing the retry loop to bail:

--- impl_rabbit.py.orig	2015-01-08 11:01:52.455492467 -0500
+++ impl_rabbit.py	2015-01-08 11:02:00.234621475 -0500
@@ -615,7 +615,7 @@
                 # a protocol response.  (See paste link in LP888621)
                 # So, we check all exceptions for 'timeout' in them
                 # and try to reconnect in this case.
-                if 'timeout' not in six.text_type(e):
+                if 'timeout' not in six.text_type(ex):
                     raise
                 e = ex
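
For context, here is a minimal sketch (not the actual oslo.messaging code, and with hypothetical function names) of the retry pattern the one-character fix above restores: the loop must inspect the exception it just caught (ex), not a stale variable (e), otherwise a timeout escapes the loop instead of triggering another reconnect.

    import six

    def ensure(method, reconnect):
        """Keep calling `method`, reconnecting whenever a timeout-like error occurs."""
        while True:
            try:
                return method()
            except Exception as ex:
                # Some transport errors only reveal themselves as timeouts by the
                # word 'timeout' in their message, so inspect the exception text.
                # Testing a stale variable here (e instead of ex) made the original
                # code re-raise and abandon the retry loop.
                if 'timeout' not in six.text_type(ex):
                    raise
                reconnect()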

Comment 7 Jan Provaznik 2015-01-13 14:35:37 UTC
I tried it with the patch from comment 5 applied and I can still see the same error as in the attachment from comment 2. The nova-compute service is down. I'll attach more details (my logs) later today.

Comment 9 Ami Jeain 2015-01-14 13:20:51 UTC
Please provide the extra logs per comment #7.

Comment 10 Jan Provaznik 2015-01-14 13:46:00 UTC
Created attachment 980020 [details]
console log - sequence of commands I did

Comment 11 Jan Provaznik 2015-01-14 13:46:47 UTC
Created attachment 980021 [details]
nova-compute log from compute node

Comment 12 Jan Provaznik 2015-01-14 13:47:20 UTC
Created attachment 980022 [details]
nova-conductor log from one of controller nodes

Comment 13 Giulio Fidente 2015-01-14 15:36:07 UTC
Created attachment 980067 [details]
nova-compute.log.never_recover

Hi, I am seeing the same issue after the patch as well.

Specifically, after the VIP (hosting nova-conductor and rabbit) relocated, I left nova-compute running for 2h to see if recovery would ever happen as I have seen before on occasion ... but this time it didn't, so it really seems there are different patterns that cause it to recover only on some occasions.

Attached is the log of the failure (message timeout) persisting for 2h.

Comment 14 Dan Smith 2015-01-14 15:53:51 UTC
Updating some state from an IRC discussion:

After applying the patch, the log files show that nova no longer aborts the retry loop due to the bug in oslo.messaging; it continues to retry the connection, getting a variety of network-related errors in the process.

Sounds like haproxy is sitting between these nodes and the rabbit server, so I would tend to suspect that something between nova and rabbit is preventing things from working.

In any case, after the patch, nova continues to try to reconnect to the rabbit server.

Comment 15 Jan Provaznik 2015-01-14 16:04:20 UTC
From the nova-conductor log it seems that the conductor service reconnected successfully to rabbitmq at 08:35:04.717 and has stayed connected since then. IOW, it doesn't seem to be an issue with nova-conductor or nova-compute failing to connect to rabbitmq itself.

Note that we observed this issue with nova services only; for neutron, for example, the keepalive workaround seems to work fine.

The related older bug, which addresses the missing heartbeat issue and mentions both this workaround and the issue it causes, is here:
https://bugzilla.redhat.com/show_bug.cgi?id=1129242

Comment 16 Ryan O'Hara 2015-01-15 00:07:52 UTC
I looked at this setup and there is no evidence of a problem with haproxy. Keepalived appears to be working correctly, with the possible exception that the ARP tables are goofy. Specifically, the node that owns the VIP seems to have an incorrect ARP entry for the VIP. I am not sure why this happens, nor am I sure that it is causing the problem.

What would be useful here is to get debug logs from haproxy. At least then you could verify 1) that traffic is being directed to the VIP owner (i.e. not an ARP problem) and 2) whether haproxy is passing the traffic to a backend rabbitmq server. Without logs there is no evidence that this is a problem with haproxy. Having the logs may help you track down the real issue.

Comment 17 Mehdi ABAAKOUK 2015-01-15 15:47:29 UTC
Most of the time oslo.messaging does not set a timeout when draining events from rabbitmq, which is not recommended. Without this timeout, oslo.messaging relies on the system to time out; configuring TCP keepalive mitigates the issue, but does not fix it.

oslo.messaging 1.5.X (the Kilo release) always sets this timeout when draining rabbitmq events. A partial backport exists (but is not yet merged) for Juno:
https://review.openstack.org/#/c/143805/ - it should fix your use case by detecting a broken connection faster.
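
To make the distinction concrete, here is a minimal sketch (not the oslo.messaging implementation) using kombu, the library the rabbit driver is built on; the broker URL is hypothetical.

    import socket

    import kombu

    conn = kombu.Connection('amqp://guest:guest@192.0.2.10:5672//')
    conn.connect()
    try:
        # Without a timeout, drain_events() can block indefinitely on a half-dead
        # connection, leaving only TCP keepalive to break the wait. With a timeout,
        # the client notices the dead connection quickly and can reconnect itself.
        conn.drain_events(timeout=1)
    except socket.timeout:
        pass  # nothing arrived in time; poll again or reconnect
    finally:
        conn.close()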

Comment 22 Jan Provaznik 2015-01-20 13:59:27 UTC
I was now testing a very similar situation - moving only the VIP around instead of shutting down the whole node holding the VIP - and here are a couple of findings (when moving the VIP from node1 to node2):
1) Only some messages from nova-compute to nova-conductor seem to get stuck in the rabbitmq message queue. After adding some debugging info I can see that some of the iterative messages (going to the same queue/channel) are delivered and answered by nova-conductor, and some of them are not. I can see an increasing number of messages in "rabbitmqctl list_queues | grep conductor".

2) In this case the problem seems to be caused by a connection left open from haproxy on node1 (the old VIP node) to the rabbitmq server. When I stop haproxy on node1, this connection is closed; then all pending messages are picked up by nova-conductor and nova-compute gets a response to all of them. IOW it seems that the old queue is held until the last client/consumer disconnects?

I would expect rabbitmq to detect the closed connection from haproxy on node1 when the whole node is shut down; this doesn't seem to happen, so there is still room for investigating this use case.

Comment 23 Ryan O'Hara 2015-01-20 18:22:48 UTC
(In reply to Jan Provaznik from comment #22)
> I was now testing a very similar situation - moving only the VIP around
> instead of shutting down the whole node holding the VIP - and here are a
> couple of findings (when moving the VIP from node1 to node2):
> 1) Only some messages from nova-compute to nova-conductor seem to get stuck
> in the rabbitmq message queue. After adding some debugging info I can see
> that some of the iterative messages (going to the same queue/channel) are
> delivered and answered by nova-conductor, and some of them are not. I can
> see an increasing number of messages in "rabbitmqctl list_queues | grep
> conductor".
> 
> 2) In this case the problem seems to be caused by a connection left open
> from haproxy on node1 (the old VIP node) to the rabbitmq server. When I stop
> haproxy on node1, this connection is closed; then all pending messages are
> picked up by nova-conductor and nova-compute gets a response to all of them.
> IOW it seems that the old queue is held until the last client/consumer
> disconnects?

This seems to indicate to me that TCP keepalives are not working. The consumer is not going to let go of the connection unless 1) an haproxy timeout occurs or 2) TCP keepalive causes a reset. In your case, haproxy will never time out rabbitmq connections because the timeout is set to 0. If TCP keepalives are working on the consumer side and haproxy is setting the TCP keepalive flag, you should get a reset.

> I would expect rabbitmq to detect the closed connection from haproxy on
> node1 when the whole node is shut down; this doesn't seem to happen, so
> there is still room for investigating this use case.

By shutdown do you mean immediate power off? If you power off an haproxy node it is not going to send TCP resets to close connections. The consumer will need to detect this, which doesn't seem to be happening.

Just out of curiosity, can you tell me if TCP keepalives are set on both the controller and compute nodes? If so, what are the values?

Comment 24 Jan Provaznik 2015-01-21 08:15:35 UTC
(In reply to Ryan O'Hara from comment #23)
> (In reply to Jan Provaznik from comment #22)
> > I would expect rabbitmq to detect the closed connection from haproxy on
> > node1 when the whole node is shut down; this doesn't seem to happen, so
> > there is still room for investigating this use case.
> 
> By shutdown do you mean immediate power off? If you power off an haproxy
> node it is not going to send TCP resets to close connections. The consumer
> will need to detect this, which doesn't seem to be happening.
> 

Yes, I mean immediate power off.

> Just out of curiosity, can you tell me if TCP keepalives are set on both the
> controller and compute nodes? If so, what are the values?

For all nodes (both computes and controllers):
net.ipv4.tcp_keepalive_time: 5
net.ipv4.tcp_keepalive_probes: 5
net.ipv4.tcp_keepalive_intvl: 1
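
For reference, a sketch of the equivalent per-socket options (Linux-only constants) that mirror the host-wide sysctls above; the socket here is just a placeholder, not taken from any of the services involved.

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # tcp_keepalive_time
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)    # tcp_keepalive_probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 1)  # tcp_keepalive_intvl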

Comment 25 Jan Provaznik 2015-01-21 08:21:32 UTC
(In reply to Jan Provaznik from comment #22)
> I would expect rabbitmq to detect the closed connection from haproxy on
> node1 when the whole node is shut down; this doesn't seem to happen, so
> there is still room for investigating this use case.

I checked shutting down the whole node yesterday and there were no leftover connections to the rabbitmq nodes - IOW this BZ is caused by a different issue than the one described in comment 22.

Comment 26 Miguel Angel Ajo 2015-01-21 15:37:06 UTC
Try restarting haproxy on the node where the rabbitmq VIP was located before the VIP moved.

Comment 27 Jan Provaznik 2015-01-28 19:20:25 UTC
I already mentioned this in a thread, but for the record: so far, setting the more aggressive net.ipv4.tcp_retries2=5 seems to solve the issue on my deployment.

Comment 28 John Eckersberg 2015-02-11 20:18:29 UTC
A summary is in order here!

First, we've added the ability to connect directly from openstack services to the rabbitmq server, thus bypassing haproxy and VIPs entirely.  This can avoid a set of problems with load balancing and VIP failover.  That's bug 1186966.

Jan identified that setting net.ipv4.tcp_retries2=5 mitigates the problem.  However, RFC1122 explicitly states:

    The value of R2 SHOULD correspond to at least 100 seconds.

And from the kernel documentation for the tcp_retries2 sysctl[1]:

    RFC 1122 recommends at least 100 seconds for the timeout,
    which corresponds to a value of at least 8.

So although setting it down to 5 may help, it's a rather "big hammer" approach to fixing the problem, as it puts the TCP stack for the entire host into a non-compliant configuration.

Instead, programs can set TCP_USER_TIMEOUT on a per-connection basis.  This allows applications to modify the timeout and override the default imposed by the tcp_retries sysctls.
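
Illustrative only (not code from any of the components mentioned above): setting TCP_USER_TIMEOUT, in milliseconds, on a single connection overrides the host-wide tcp_retries2 behaviour for just that socket. The constant is Linux-specific, so the sketch falls back to its raw value (18) where the socket module does not define it.

    import socket

    TCP_USER_TIMEOUT = getattr(socket, 'TCP_USER_TIMEOUT', 18)  # Linux-only option

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Give up on unacknowledged data after 30 seconds instead of the much
    # longer default derived from tcp_retries2.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 30000)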

This has spawned a few different offshoots of work, in order to provide that functionality across components involved in AMQP communication.

(1) HAProxy upstream has added a configuration option to set TCP_USER_TIMEOUT on listening frontend sockets.  This is being tracked for backport to RHEL7 in bug 1190776.

(2) RabbitMQ can set TCP_USER_TIMEOUT on all connecting and listening sockets.  This helps cluster nodes to detect quickly when other cluster nodes go away.  This also helps between server and clients to detect when a consumer has gone away.  This is being implemented as part of bug 1189480.

(3) There's a bug in the RHEL7 kernel with TCP_USER_TIMEOUT that is a blocker for the above two fixes.  That's bug 1189241.

When the above fixes come together, it should greatly reduce or eliminate weird network edge cases that crop up when control nodes disappear and/or fail over between each other.

There is one caveat to all of this.  There may very well still be issues in oslo.messaging that prevent the client from reconnecting successfully.  This is not something I have personally explored yet.


[1] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/ip-sysctl.txt?id=bfa76d49576599a4b9f9b7a71f23d73d6dcff735#n413

Comment 30 John Eckersberg 2016-08-26 19:10:58 UTC
Closing this tracking bug; all dependent bugs have been closed.

