Bug 1323635

Summary: [RFE] Keystone oslo messaging trying to reconnect to the unreachable RabbitMQ node every second
Product: Red Hat OpenStack Reporter: Pablo Caruana <pcaruana>
Component: python-oslo-messagingAssignee: Victor Stinner <vstinner>
Status: CLOSED WONTFIX QA Contact: Udi Shkalim <ushkalim>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.0 (Kilo)CC: abeekhof, apevec, ealcaniz, fdinitto, felipe.alfaro, ipetrova, jbuchta, jdennis, jeckersb, lhh, nkinder, royoung, srevivo, ushkalim, vstinner
Target Milestone: ---Keywords: FutureFeature
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-25 20:30:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1339488    

Description Pablo Caruana 2016-04-04 10:05:58 UTC
Description of problem:

When a controller machine is shutdown completely (e.g. no Layer 2 connectivity), It can be observed the following entries at keystone logs 

2016-03-30 10:20:29.743 14953 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:20:35.765 14953 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:21:42.450 14930 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:21:48.474 14930 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:21:53.777 14935 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:21:59.802 14935 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:22:05.828 14935 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:22:50.446 14941 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:22:56.472 14941 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:23:21.892 14929 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:23:27.915 14929 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:31:39.049 14945 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:31:45.071 14945 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:31:51.095 14945 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:31:57.122 14945 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 2 seconds.
2016-03-30 10:33:19.765 14948 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.17:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:37:35.229 14927 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.17:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:38:53.808 14940 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.

It seems Keystone is trying to reconnect to the unreachable RabbitMQ node every second. First, I don't see any form of exponential backoff, which is critical for any scalable system. Second, it seems to me that Keystone never gives up in trying to connect. Or perhaps the Keystone configuration is not optimal. We need 1) configure exponential backoff in Keystone and 2) put a reasonable limit on how many times Keystone tries to connect to a RabbitMQ before it considers it completely dead and 3) have Keystone consider a RabbitMQ node dead for a reasonable time (e.g. 10 minutes).

Version-Release number of selected component (if applicable):


How reproducible:
 No Layer 2-4 connectivity a a rabbitmq node.

The vanilla Director deployed keystone conf contails similar to those options

oslo_messaging_rabbit]


rabbit_hosts = 172.16.2.7,172.16.2.8,172.16.2.9

rabbit_use_ssl = False

rabbit_userid = guest

rabbit_password = guest


rabbit_virtual_host = /


rabbit_ha_queues = True

heartbeat_timeout_threshold = 60  <---- Number of seconds after which the 
                                        Rabbit broker is considered down if
                                        heartbeat's keep-alive fails

heartbeat_rate = 2


where heartbeat_timeout_threshold=  Number of seconds after which the Rabbit broker is considered down if heartbeat's keep-alive fails (0 disable the heartbeat). EXPERIMENTAL (integer value)

Tcp keepalive are

sysctl -a | grep net.ipv4.tcp_keepalive
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_time = 5

That means that the first keepalive probe will be sent 5 seconds after the connection goes (and remains) idle, and 5 probes will be sent at one second intervals before a connection is terminated.  This will detect failures at the other end of the connection in about 10 seconds (5+5*1).

Rabbitmq.config contains the eepalive, true as
cat /etc/rabbitmq/rabbitmq.config 
% This file managed by Puppet
% Template Path: rabbitmq/templates/rabbitmq.config
[
  {rabbit, [
    {cluster_partition_handling, pause_minority},
    {tcp_listen_options, [binary,{packet, raw},
                                {reuseaddr, true},
                                {backlog, 128},
                                {nodelay, true},
                                {exit_on_close, false},
                                {keepalive, true}]},

Comment 4 Adam Young 2016-04-15 18:11:31 UTC
This is not within Keystone's ability to affect, but there is an upstream bug for exactly this.


Reassigning to Oslo Messaging.  Please confirm that it belongs there.

Comment 5 John Eckersberg 2016-04-15 18:26:02 UTC
Yeah this looks like an oslo.messaging bug.  I'm pretty sure exponential backoff worked at some point in the past so maybe this is a regression.  I'll look into it.

Comment 6 Irina Petrova 2016-05-09 07:43:47 UTC
(In reply to John Eckersberg from comment #5)
> Yeah this looks like an oslo.messaging bug.  I'm pretty sure exponential
> backoff worked at some point in the past so maybe this is a regression. 
> I'll look into it.

Hey John,

Any findings on your recollection?

--Irina

Comment 8 Edu Alcaniz 2016-06-16 07:09:17 UTC
Morning, could you give us an update about the status of this BZ please. 

Thanks
Edu Alcaniz
Senior Technical Account Manager

Comment 9 Udi Shkalim 2016-06-16 07:31:20 UTC
Hi Edu,

The bug is still in "ASSIGNED".
@eck can you assist here?

thx

Comment 10 John Eckersberg 2016-06-16 13:54:44 UTC
Looking into it, sorry I was out on leave and lost track of this one.

Comment 11 John Eckersberg 2016-06-20 19:24:12 UTC
OK, I looked into this and I was wrong, oslo.messaging has never had exponential backoff.  I do see some code that appears to have attempted a linear backoff, but that's completely broken.  It just sleeps whatever kombu_reconnect_delay is configured as (default 1 as seen here).

Comment 12 Edu Alcaniz 2016-06-22 05:51:10 UTC
Thanks very much John for your update. Any track on the code? could we estimate a hotfix or errata, please?

Comment 13 John Eckersberg 2016-06-28 17:01:53 UTC
(In reply to Edu Alcaniz from comment #12)
> Thanks very much John for your update. Any track on the code? could we
> estimate a hotfix or errata, please?

It's not so straightforward.  This has been attempted upstream before and never went anywhere.  It should use the retrying library[1] but there are some shortcomings that need to be fixed first, so there are several dependent improvements that would need to be made.  Also potentially problematic is that the retrying library has not been updated in about 18 months, so it could be difficult to get the necessary fixes upstreamed there.

[1] https://pypi.python.org/pypi/retrying

Comment 14 Edu Alcaniz 2016-09-09 07:24:08 UTC
Hi, could we get an update of this BZ please

Comment 15 Rob Young 2016-10-11 19:40:26 UTC
@jeckersb will this make OSP 10 or should it be moved out to OSP 11?

Comment 16 John Eckersberg 2016-10-22 12:22:24 UTC
Just saw this oslo.messaging review that replaces retrying with tenacity:

https://review.openstack.org/#/c/389729

Of note from https://pypi.python.org/pypi/tenacity/3.2.1 :

- It's being actively updated
- It supports exponential backoff

So that's promising for this use case as well.

Comment 18 Red Hat Bugzilla Rules Engine 2017-06-04 02:13:52 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 19 Andrew Beekhof 2018-05-21 08:15:49 UTC
Hey John, whats the plan for this one?  Redundant?

Comment 20 John Eckersberg 2018-09-25 20:30:28 UTC
I'm going to close this as WONTFIX for OSP7, but for reference this has since been implemented in kombu and invoked from the oslo_messaging rabbit driver via the following config options:

rabbit_retry_interval
rabbit_retry_backoff
rabbit_interval_max
rabbit_max_retries

See https://docs.openstack.org/oslo.messaging/rocky/configuration/opts.html for details