Bug 1302391 - Nonoptimal failover strategy can lead to RPC timeout
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 8.0 (Liberty)
Assigned To: Victor Stinner
QA Contact: Marian Krcmarik
Keywords: AutomationBlocker, TestBlocker
Depends On:
Blocks: 1311597
Reported: 2016-01-27 12:27 EST by Marian Krcmarik
Modified: 2016-04-07 17:26 EDT
Fixed In Version: python-oslo-messaging-2.5.0-5.el7ost
Doc Type: Bug Fix
Doc Text:
Oslo Messaging used the "shuffle" strategy to select a RabbitMQ host from the list of RabbitMQ servers. When a node of the cluster running RabbitMQ was restarted, each OpenStack service connected to this server reconnected to a new RabbitMQ server. Unfortunately, this strategy does not handle dead RabbitMQ servers correctly; it can try to connect to the same dead server multiple times in a row. The strategy also leads to increased reconnection time, and sometimes it may lead to RPC operations timing out because no guarantee is provided on how long the reconnection process will take. With this update, Oslo Messaging uses the "round-robin" strategy to select a RabbitMQ host. This strategy provides the least achievable reconnection time and avoids RPC timeout when a node is restarted. It also guarantees that if K of N RabbitMQ hosts are alive, it will take at most N - K + 1 attempts to successfully reconnect to the RabbitMQ cluster.
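The difference between the two strategies described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the oslo.messaging or kombu code: the function names and host names are hypothetical, "shuffle" is modeled as an independent random pick on every attempt (which is how the same dead host can be retried several times in a row), and "round-robin" is modeled as walking the host list in order from some starting position.

```python
import random


def round_robin_attempts(hosts, alive, start=0):
    """Count attempts until a live host is reached, walking the list in order.

    Hypothetical model: `start` is the index of the first host tried.
    """
    n = len(hosts)
    for attempt in range(1, n + 1):
        host = hosts[(start + attempt - 1) % n]
        if host in alive:
            return attempt
    raise RuntimeError("no live host in the list")


def shuffle_attempts(hosts, alive, rng, max_attempts=10000):
    """Count attempts when every attempt picks a host at random.

    Hypothetical model of "shuffle": picks are independent, so a dead
    host can be chosen again immediately and no upper bound exists.
    """
    for attempt in range(1, max_attempts + 1):
        if rng.choice(hosts) in alive:
            return attempt
    raise RuntimeError("gave up after max_attempts")


if __name__ == "__main__":
    hosts = ["rabbit1", "rabbit2", "rabbit3"]  # N = 3 (illustrative names)
    alive = {"rabbit3"}                        # K = 1
    bound = len(hosts) - len(alive) + 1        # N - K + 1

    worst_rr = max(round_robin_attempts(hosts, alive, s)
                   for s in range(len(hosts)))
    print("round-robin worst case:", worst_rr, "bound:", bound)

    rng = random.Random(0)
    worst_shuffle = max(shuffle_attempts(hosts, alive, rng)
                        for _ in range(1000))
    print("shuffle worst case over 1000 trials:", worst_shuffle)
```

In this model the round-robin walk never needs more than N - K + 1 attempts, because it can meet each of the N - K dead hosts at most once before reaching a live one. The random pick has no such bound, and each failed attempt in a real deployment also costs a connection timeout or retry delay.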
Clones: 1311597
Last Closed: 2016-04-07 17:26:22 EDT
Type: Bug


Attachments
nova-compute log (293.00 KB, text/plain)
2016-01-27 12:27 EST, Marian Krcmarik


External Trackers
Tracker                 ID              Priority  Status        Summary                                            Last Updated
Launchpad               1519851         None      None          None                                               2016-01-27 12:27 EST
OpenStack gerrit        249849          None      None          None                                               2016-02-10 07:44 EST
OpenStack gerrit        278462          None      None          None                                               2016-02-10 10:47 EST
Red Hat Product Errata  RHEA-2016:0603  normal    SHIPPED_LIVE  Red Hat OpenStack Platform 8 Enhancement Advisory  2016-04-07 20:53:53 EDT

Description Marian Krcmarik 2016-01-27 12:27:46 EST
Created attachment 1118849 [details]
nova-compute log

Description of problem:
I can hit the problems described in https://bugs.launchpad.net/oslo.messaging/+bug/1519851 on my RHOS8 setup. I believe I can sometimes see multiple unsuccessful attempts (as many as 11) to reconnect to different AMQP servers. It sometimes takes several minutes to reconnect after the failover of a controller in an HA setup with 3 controllers; see the attached debug log from nova-compute.
I tested the patch from the upstream bug and it seemed to speed up reconnection. I would suggest backporting the patch, which seems to be simple.

Version-Release number of selected component (if applicable):
python-oslo-messaging-2.5.0-1.el7ost.noarch

How reproducible:
Often

Steps to Reproduce:
1. Restart one of the controllers in an HA OpenStack setup.
2. Look at a service log, for example the nova-compute log, and look for a successful reconnection to a different AMQP server.

Actual results:
Reconnection sometimes takes several minutes, during which OpenStack is not operational.

Expected results:
Reconnection should take significantly less time.


Additional info:
Comment 2 Flavio Percoco 2016-02-10 10:47:34 EST
I think it's safe to backport this patch. I've proposed it upstream and I'll do the backport downstream after some feedback is provided on the upstream patch.
Comment 9 errata-xmlrpc 2016-04-07 17:26:22 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0603.html
