Back to bug 1302391

Who When What Removed Added
Red Hat Bugzilla Rules Engine 2016-01-27 17:27:49 UTC Target Release --- 8.0
Michele Baldessari 2016-01-27 18:38:02 UTC CC michele
Ofer Blaut 2016-01-27 19:21:10 UTC Keywords AutomationBlocker
CC oblaut
Severity high urgent
Jon Schlueter 2016-02-10 12:44:44 UTC Priority unspecified high
CC jschluet
Link ID OpenStack gerrit 249849
Flavio Percoco 2016-02-10 15:47:34 UTC Status NEW ASSIGNED
Link ID OpenStack gerrit 278462
Jon Schlueter 2016-02-22 16:35:02 UTC Target Milestone --- ga
Flavio Percoco 2016-02-24 14:45:29 UTC Status ASSIGNED POST
Marian Krcmarik 2016-02-24 14:51:51 UTC Blocks 1311597
Flavio Percoco 2016-03-07 13:07:19 UTC CC hguemar
Flags needinfo?(hguemar)
Fabio Massimo Di Nitto 2016-03-15 17:12:59 UTC Keywords TestBlocker
Priority high urgent
CC fdinitto
Fabio Massimo Di Nitto 2016-03-15 17:14:00 UTC Assignee fpercoco vstinner
Flags needinfo?(hguemar)
Victor Stinner 2016-03-17 14:09:17 UTC Status POST MODIFIED
Fixed In Version python-oslo-messaging-2.5.0-5.el7ost
errata-xmlrpc 2016-03-17 19:56:56 UTC Status MODIFIED ON_QA
Ofer Blaut 2016-03-22 09:50:17 UTC QA Contact ushkalim mkrcmari
errata-xmlrpc 2016-03-28 11:19:52 UTC Status ON_QA VERIFIED
Victor Stinner 2016-04-01 16:05:44 UTC Doc Text Cause:
Oslo Messaging uses a "shuffle" strategy to select a RabbitMQ host from the list of RabbitMQ servers. When a node of the cluster running RabbitMQ is restarted, each OpenStack service connected to this server reconnects to a new RabbitMQ server.

Consequence:
The Oslo Messaging strategy doesn't handle a dead RabbitMQ server well, which can lead to RPC timeouts. The strategy can retry connecting to the same dead server multiple times in a row.

Shuffle strategy used right now leads to increased reconnection time. Sometimes it might lead to RPC operations timeout because the strategy provides no guarantee on how long the reconnection process will take.

Fix:
Oslo Messaging now uses a round-robin strategy to select a RabbitMQ host.

Result:
Round-robin strategy provides the least achievable reconnection time and avoids RPC timeouts when a node is restarted. It also guarantees that if K of N RabbitMQ hosts are alive, it will take at most N - K + 1 attempts to successfully reconnect to the RabbitMQ cluster.
Radek Bíba 2016-04-06 08:06:07 UTC Doc Text Cause:
Oslo Messaging uses a "shuffle" strategy to select a RabbitMQ host from the list of RabbitMQ servers. When a node of the cluster running RabbitMQ is restarted, each OpenStack service connected to this server reconnects to a new RabbitMQ server.

Consequence:
The Oslo Messaging strategy doesn't handle a dead RabbitMQ server well, which can lead to RPC timeouts. The strategy can retry connecting to the same dead server multiple times in a row.

Shuffle strategy used right now leads to increased reconnection time. Sometimes it might lead to RPC operations timeout because the strategy provides no guarantee on how long the reconnection process will take.

Fix:
Oslo Messaging now uses a round-robin strategy to select a RabbitMQ host.

Result:
Round-robin strategy provides the least achievable reconnection time and avoids RPC timeouts when a node is restarted. It also guarantees that if K of N RabbitMQ hosts are alive, it will take at most N - K + 1 attempts to successfully reconnect to the RabbitMQ cluster.
Oslo Messaging used the "shuffle" strategy to select a RabbitMQ host from the list of RabbitMQ servers. When a node of the cluster running RabbitMQ was restarted, each OpenStack service connected to this server reconnected to a new RabbitMQ server. Unfortunately, this strategy does not handle dead RabbitMQ servers correctly; it can try to connect to the same dead server multiple times in a row. The strategy also leads to increased reconnection time, and sometimes it may lead to RPC operations timing out because no guarantee is provided on how long the reconnection process will take.

With this update, Oslo Messaging uses the "round-robin" strategy to select a RabbitMQ host. This strategy provides the least achievable reconnection time and avoids RPC timeout when a node is restarted. It also guarantees that if K of N RabbitMQ hosts are alive, it will take at most N - K + 1 attempts to successfully reconnect to the RabbitMQ cluster.
errata-xmlrpc 2016-04-07 21:26:22 UTC Status VERIFIED CLOSED
Resolution --- ERRATA
Last Closed 2016-04-07 17:26:22 UTC
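
The "at most N - K + 1 attempts" bound stated in the Doc Text can be checked with a small simulation. This is a minimal sketch, not the oslo.messaging code: the host names and function names are hypothetical, and "shuffle" is modelled here as an independent random pick on every attempt, which is an assumption consistent with the description that the same dead server can be retried several times in a row.

```python
import random


def attempts_round_robin(hosts, alive, start=0):
    """Count attempts until a live host is reached, walking the list in order.

    Each host is tried at most once per pass, so with K live hosts out of N
    the worst case is N - K dead hosts followed by one live host.
    """
    n = len(hosts)
    for attempt in range(1, n + 1):
        host = hosts[(start + attempt - 1) % n]
        if host in alive:
            return attempt
    raise ConnectionError("no live RabbitMQ host in the list")


def attempts_shuffle(hosts, alive, rng, max_attempts=10_000):
    """Count attempts when every try picks a host at random (assumed model).

    Nothing prevents the same dead host from being picked repeatedly, so
    there is no fixed upper bound short of the max_attempts safety cap.
    """
    for attempt in range(1, max_attempts + 1):
        if rng.choice(hosts) in alive:
            return attempt
    raise ConnectionError("gave up after max_attempts tries")


# Hypothetical cluster: N = 5 hosts, K = 2 of them alive.
hosts = ["rabbit1", "rabbit2", "rabbit3", "rabbit4", "rabbit5"]
alive = {"rabbit4", "rabbit5"}

# Worst case over every possible starting position in the list.
worst_round_robin = max(
    attempts_round_robin(hosts, alive, start) for start in range(len(hosts))
)
bound = len(hosts) - len(alive) + 1

# Worst case seen over many simulated reconnections with random picks.
rng = random.Random(0)
worst_shuffle = max(attempts_shuffle(hosts, alive, rng) for _ in range(1000))

print("round-robin worst case:", worst_round_robin, "bound:", bound)
print("random-pick worst case observed:", worst_shuffle)
```

In this model the round-robin worst case equals the bound exactly when the walk starts at the first of the dead hosts, while the random-pick worst case depends only on chance.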
