1311597 – Nonoptimal failover strategy can lead to RPC timeout

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	python-oslo-messaging
Sub Component:
Version:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	async
Target Release:	7.0 (Kilo)
Assignee:	Victor Stinner
QA Contact:	Udi Shkalim
Docs Contact:
URL:
Whiteboard:
Depends On:	1302391
Blocks:

Reported:	2016-02-24 14:51 UTC by Marian Krcmarik
Modified:	2021-02-01 02:40 UTC
CC List:	13 users
Fixed In Version:	python-oslo-messaging-1.8.3-6.el7ost
Doc Type:	Bug Fix
Doc Text:	Oslo Messaging used the 'shuffle' strategy to select a RabbitMQ host from the list of RabbitMQ servers. When a node of the cluster running RabbitMQ was restarted, each OpenStack service connected to this server reconnected to a new RabbitMQ server. Unfortunately, this strategy does not handle dead RabbitMQ servers correctly; it can try to connect to the same dead server multiple times in a row. The strategy also leads to increased reconnection time, and sometimes may lead to RPC operations timing out because no guarantee is provided on how long the reconnection process will take. With this update, Oslo Messaging uses the 'round-robin' strategy to select a RabbitMQ host. This strategy provides the least achievable reconnection time and avoids RPC timeout when a node is restarted. It also guarantees that if K of N RabbitMQ hosts are alive, it will take at most N - K + 1 attempts to successfully reconnect to the RabbitMQ cluster.
Clone Of:	1302391
Environment:
Last Closed:	2017-01-19 13:27:11 UTC
Target Upstream Version:
Embargoed:
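
The N - K + 1 bound stated in the Doc Text can be illustrated with a minimal sketch. This is not the oslo.messaging or Kombu code: the function names are hypothetical, the hosts controller-1 and controller-2 are made up for the example, and the 'shuffle' strategy is modelled, as an assumption, as an independent random pick on every attempt, which is what allows the same dead server to be tried several times in a row.

```python
import random


def round_robin_attempts(hosts, alive, start):
    """Count attempts until a live host is reached, walking the list in order.

    `start` is the index of the first host tried (hypothetical parameter).
    """
    n = len(hosts)
    for attempt in range(1, n + 1):
        if hosts[(start + attempt - 1) % n] in alive:
            return attempt
    raise ConnectionError("no RabbitMQ host is alive")


def random_pick_attempts(hosts, alive, rng):
    """Count attempts when every retry picks a host at random.

    Assumption: this models the old behaviour, where a dead host
    may be selected again immediately after a failed attempt.
    """
    if not alive.intersection(hosts):
        raise ConnectionError("no RabbitMQ host is alive")
    attempt = 0
    while True:
        attempt += 1
        if rng.choice(hosts) in alive:
            return attempt


# N = 3 hosts, K = 1 alive (controller-1 and controller-2 are hypothetical).
hosts = ["controller-0", "controller-1", "controller-2"]
alive = {"controller-1"}
bound = len(hosts) - len(alive) + 1  # N - K + 1

# Worst case for round-robin over every possible starting position.
worst_round_robin = max(
    round_robin_attempts(hosts, alive, start) for start in range(len(hosts))
)

# Worst case observed for random picks over many simulated reconnections.
rng = random.Random(0)
worst_random = max(random_pick_attempts(hosts, alive, rng) for _ in range(1000))

print("bound (N - K + 1):", bound)
print("worst round-robin attempts:", worst_round_robin)
print("worst random-pick attempts:", worst_random)
```

In this sketch the round-robin worst case never exceeds the bound, while the random-pick worst case has no fixed upper limit, which is consistent with the unbounded reconnection time described above.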

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1519851	None	None	None	2016-02-24 14:51:50 UTC
OpenStack gerrit	249849	None	MERGED	Use round robin failover strategy for Kombu driver	2021-02-01 02:36:11 UTC
OpenStack gerrit	278462	None	MERGED	Use round robin failover strategy for Kombu driver	2021-02-01 02:36:12 UTC
Red Hat Product Errata	RHBA-2017:0158	normal	SHIPPED_LIVE	Red Hat Enterprise Linux OpenStack Platform 7 Bug Fix and Enhancement Advisory	2017-01-19 18:19:16 UTC

Comment 1 Flavio Percoco 2016-02-24 20:35:53 UTC

This patch doesn't apply cleanly and it seems to conflict with a previous backport. How much of this is really needed for OSP7? And how far down in the releases are we expecting to go?

I'm also not super happy with this backport because it adds a new config option, which is not something we normally do on backports.

Comment 2 Marian Krcmarik 2016-02-24 22:39:58 UTC

(In reply to Flavio Percoco from comment #1)
> This patch doesn't apply cleanly and it seems to conflict with a previous
> backport. How much of this is really needed for OSP7? And how far down in
> the releases are we expecting to go?
> 
> I'm also not super happy with this backport because it adds a new config
> option, which is not something we normally do on backports.

RHOS7 and RHOS8 behave the same in this regard. The impact would be the following: in some situations when one of the controllers goes down (especially controller-0, the first one in the config), RabbitMQ clients that were previously connected to this node *sometimes* take a long time to reconnect, because they keep trying to connect to the dead RabbitMQ server. It does not always happen, and it usually takes up to several minutes (the longest I have personally experienced was almost 5 minutes).
I would leave the decision to the PM (not sure who the right person is).
Honestly I am not sure myself how far down we are supposed to go.

Comment 5 Marian Krcmarik 2016-12-12 09:19:07 UTC

Verified on python-oslo-messaging-1.8.3-6.el7ost

Comment 9 errata-xmlrpc 2017-01-19 13:27:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0158.html