| Summary: | [RFE] Keystone oslo messaging trying to reconnect to the unreachable RabbitMQ node every second | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Pablo Caruana <pcaruana> |
| Component: | python-oslo-messaging | Assignee: | Victor Stinner <vstinner> |
| Status: | CLOSED WONTFIX | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 7.0 (Kilo) | CC: | abeekhof, apevec, ealcaniz, fdinitto, felipe.alfaro, ipetrova, jbuchta, jdennis, jeckersb, lhh, nkinder, royoung, srevivo, ushkalim, vstinner |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-25 20:30:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 1339488 | ||
This is not within Keystone's ability to affect, but there is an upstream bug for exactly this. Reassigning to Oslo Messaging. Please confirm that it belongs there.

Yeah, this looks like an oslo.messaging bug. I'm pretty sure exponential backoff worked at some point in the past, so maybe this is a regression. I'll look into it.

(In reply to John Eckersberg from comment #5)
> Yeah this looks like an oslo.messaging bug. I'm pretty sure exponential
> backoff worked at some point in the past so maybe this is a regression.
> I'll look into it.

Hey John, any findings on your recollection? --Irina

Morning, could you give us an update about the status of this BZ please? Thanks, Edu Alcaniz, Senior Technical Account Manager

Hi Edu, the bug is still in "ASSIGNED". @eck, can you assist here? Thx.

Looking into it; sorry, I was out on leave and lost track of this one.

OK, I looked into this and I was wrong: oslo.messaging has never had exponential backoff. I do see some code that appears to have attempted a linear backoff, but that's completely broken. It just sleeps for whatever kombu_reconnect_delay is configured as (default 1, as seen here).

Thanks very much, John, for your update. Any track on the code? Could we estimate a hotfix or errata, please?

(In reply to Edu Alcaniz from comment #12)
> Thanks very much John for your update. Any track on the code? could we
> estimate a hotfix or errata, please?

It's not so straightforward. This has been attempted upstream before and never went anywhere. It should use the retrying library [1], but there are some shortcomings that need to be fixed first, so there are several dependent improvements that would need to be made. Also potentially problematic is that the retrying library has not been updated in about 18 months, so it could be difficult to get the necessary fixes upstreamed there.

[1] https://pypi.python.org/pypi/retrying

Hi, could we get an update on this BZ please?

@jeckersb, will this make OSP 10 or should it be moved out to OSP 11?

Just saw this oslo.messaging review that replaces retrying with tenacity: https://review.openstack.org/#/c/389729

Of note from https://pypi.python.org/pypi/tenacity/3.2.1:
- It's being actively updated
- It supports exponential backoff

So that's promising for this use case as well.

This bugzilla has been removed from the release and needs to be reviewed and triaged for another Target Release.

Hey John, what's the plan for this one? Redundant?

I'm going to close this as WONTFIX for OSP7, but for reference, this has since been implemented in kombu and is invoked from the oslo_messaging rabbit driver via the following config options:

rabbit_retry_interval
rabbit_retry_backoff
rabbit_interval_max
rabbit_max_retries

See https://docs.openstack.org/oslo.messaging/rocky/configuration/opts.html for details.
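For reference, a hedged example of what those options might look like in the [oslo_messaging_rabbit] section of keystone.conf. The option names come from the comment above; the values and comments below are illustrative only, and the exact semantics and defaults should be taken from the oslo.messaging documentation linked above.

```
[oslo_messaging_rabbit]
# Initial delay before the first reconnect attempt (seconds); example value.
rabbit_retry_interval = 1
# Additional backoff applied between retries, so the delay grows on repeated failures.
rabbit_retry_backoff = 2
# Upper bound on the delay between reconnect attempts (seconds).
rabbit_interval_max = 30
# Maximum number of connection retries; 0 is commonly documented as "retry forever".
rabbit_max_retries = 0
```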
Description of problem:

When a controller machine is shut down completely (e.g. no Layer 2 connectivity), the following entries can be observed in the Keystone logs:

2016-03-30 10:20:29.743 14953 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:20:35.765 14953 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:21:42.450 14930 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:21:48.474 14930 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:21:53.777 14935 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:21:59.802 14935 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:22:05.828 14935 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:22:50.446 14941 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:22:56.472 14941 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:23:21.892 14929 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:23:27.915 14929 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:31:39.049 14945 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:31:45.071 14945 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:31:51.095 14945 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-03-30 10:31:57.122 14945 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: timed out. Trying again in 2 seconds.
2016-03-30 10:33:19.765 14948 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.17:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:37:35.229 14927 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.17:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-30 10:38:53.808 14940 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.24.13.10:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.

It seems Keystone is trying to reconnect to the unreachable RabbitMQ node every second. First, I don't see any form of exponential backoff, which is critical for any scalable system. Second, it seems to me that Keystone never gives up trying to connect. Or perhaps the Keystone configuration is not optimal.
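To make the requested behavior concrete, here is a minimal, hypothetical sketch of capped exponential backoff with a bounded number of attempts. This is not oslo.messaging code; connect_with_backoff and all parameter values are illustrative only.

```
import random
import socket
import time

def connect_with_backoff(host, port,
                         base_delay=1.0,    # first retry after ~1 second
                         max_delay=600.0,   # cap the delay at 10 minutes
                         max_attempts=10):  # give up after a bounded number of tries
    """Hypothetical capped exponential backoff; not oslo.messaging code."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError as exc:
            if attempt == max_attempts:
                raise  # consider the node dead; let the caller pick another host
            # Jitter avoids all clients retrying in lockstep after an outage.
            sleep_for = delay + random.uniform(0, delay / 2)
            print("AMQP server on %s:%s is unreachable: %s. "
                  "Trying again in %.1f seconds." % (host, port, exc, sleep_for))
            time.sleep(sleep_for)
            delay = min(delay * 2, max_delay)
```

With base_delay=1.0 and max_delay=600.0, the retry delays would grow roughly 1, 2, 4, 8, ... seconds up to a 10-minute cap, instead of hammering the dead node every second.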
We need to 1) configure exponential backoff in Keystone, 2) put a reasonable limit on how many times Keystone tries to connect to a RabbitMQ node before it considers it completely dead, and 3) have Keystone consider a RabbitMQ node dead for a reasonable time (e.g. 10 minutes).

Version-Release number of selected component (if applicable):

How reproducible:
No Layer 2-4 connectivity to a RabbitMQ node.

The vanilla Director-deployed Keystone configuration contains options similar to these:

[oslo_messaging_rabbit]
rabbit_hosts = 172.16.2.7,172.16.2.8,172.16.2.9
rabbit_use_ssl = False
rabbit_userid = guest
rabbit_password = guest
rabbit_virtual_host = /
rabbit_ha_queues = True
heartbeat_timeout_threshold = 60   <---- Number of seconds after which the Rabbit broker is considered down if heartbeat's keep-alive fails
heartbeat_rate = 2

where heartbeat_timeout_threshold = Number of seconds after which the Rabbit broker is considered down if heartbeat's keep-alive fails (0 disables the heartbeat). EXPERIMENTAL (integer value)

The TCP keepalive settings are:

sysctl -a | grep net.ipv4.tcp_keepalive
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_time = 5

That means that the first keepalive probe will be sent 5 seconds after the connection goes (and remains) idle, and 5 probes will be sent at one-second intervals before the connection is terminated. This will detect failures at the other end of the connection in about 10 seconds (5 + 5*1).

rabbitmq.config contains keepalive, true, as shown:

cat /etc/rabbitmq/rabbitmq.config
% This file managed by Puppet
% Template Path: rabbitmq/templates/rabbitmq.config
[
  {rabbit, [
    {cluster_partition_handling, pause_minority},
    {tcp_listen_options, [binary, {packet, raw},
                          {reuseaddr, true},
                          {backlog, 128},
                          {nodelay, true},
                          {exit_on_close, false},
                          {keepalive, true}]},
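As a side note, the keepalive tuning shown above can be persisted across reboots via a sysctl.d drop-in. The file name below is an arbitrary example and the values are simply the ones reported above; this is a sketch, not a recommendation for any particular deployment.

```
# /etc/sysctl.d/98-tcp-keepalive.conf  (example file name; values from the report above)
net.ipv4.tcp_keepalive_time = 5      # seconds of idle time before the first probe
net.ipv4.tcp_keepalive_intvl = 1     # seconds between probes
net.ipv4.tcp_keepalive_probes = 5    # unanswered probes before the connection is dropped
```

The file is applied at boot, or immediately with sysctl -p /etc/sysctl.d/98-tcp-keepalive.conf.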