Description of problem: This bug is a clone of the upstream bug >> https://bugs.launchpad.net/oslo.messaging/+bug/1657444

[Impact]
When the heartbeat connection times out, the error is not treated as recoverable and no attempt is made to reconnect by calling ensure_connection(). This leaves the heartbeat thread attempting to reconnect to the same host over and over again.

[Test Case]
* Deploy OpenStack:
  bzr branch lp:openstack-charm-testing
  cd openstack-charm-testing
  juju deployer -c default.yaml -d -v artful-pike
  juju add-unit rabbitmq-server
* Force a timeout using iptables on a rabbitmq-server node:
  sudo iptables -I INPUT -p tcp --dport 5672 -j DROP

Expected result: once the timeout happens, the heartbeat thread reconnects (picking a new rabbit host if needed).
Actual result: the heartbeat thread is left in a loop (connect, socket closed, retry, connect, ...).

[Regression Potential]
Without this patch, when the heartbeat connection times out the thread does not attempt to connect to the next configured rabbit host. The risk is that in situations where the daemons using this library currently manage to reconnect to the same host (e.g. when the disconnection lasts only a few seconds), with this change they will reconnect to the next host instead, so users may see connections flapping between two (or more) rabbit hosts.

[Other Info]
I have a rabbitmq cluster of 3 nodes:

root@47704165d2bb:/# rabbitmqctl cluster_status
Cluster status of node rabbit@47704165d2bb ...
[{nodes,[{disc,[rabbit@0482398a286e,rabbit@3709521b608a,
                rabbit@47704165d2bb]}]},
 {running_nodes,[rabbit@0482398a286e,rabbit@3709521b608a,rabbit@47704165d2bb]},
 {cluster_name,<<"rabbit@47704165d2bb">>},
 {partitions,[]},
 {alarms,[{rabbit@0482398a286e,[]},
          {rabbit@3709521b608a,[]},
          {rabbit@47704165d2bb,[]}]}]

root@47704165d2bb:/# rabbitmqctl list_policies
Listing policies ...
/	ha-all	all	^ha\\.
{"ha-mode":"all"}	0

My oslo.messaging client configuration:

[oslo_messaging_rabbit]
rabbit_hosts=120.0.0.56:5671,120.0.0.57:5671,120.0.0.55:5671
rabbit_userid=cloud
rabbit_password=cloud
rabbit_ha_queues=True
rabbit_retry_interval=1
rabbit_retry_backoff=2
rabbit_max_retries=0
rabbit_durable_queues=False

When I run "service rabbitmq-server stop" on one node to simulate a failure, I get the following error logs, and the consumer can't fail over from the bad node. It reconnects to the failed node forever instead of trying the other nodes. kombu_failover_strategy is at its default value of "round-robin".

2009-01-13 18:32:42.785 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
2009-01-13 18:32:43.819 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
2009-01-13 18:32:43.819 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
2009-01-13 18:32:58.874 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
2009-01-13 18:32:59.907 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
2009-01-13 18:32:59.907 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
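The failover behaviour being asked for can be sketched independently of oslo.messaging as a round-robin retry over the configured hosts. This is a minimal, stdlib-only illustration (not the driver's actual code); the reconnect and fake_connect helpers are hypothetical:

```python
import itertools
import socket

hosts = ["120.0.0.56:5671", "120.0.0.57:5671", "120.0.0.55:5671"]

def reconnect(host_cycle, connect):
    # Try hosts round-robin until one connect() succeeds. A recoverable
    # error (socket.timeout is a subclass of socket.error) moves on to
    # the next host instead of retrying the same one forever.
    # Note: loops indefinitely if every host keeps failing.
    for host in host_cycle:
        try:
            return connect(host)
        except socket.error:
            continue

def fake_connect(host):
    # Simulated connect: the first configured host is down, others accept.
    if host == "120.0.0.56:5671":
        raise socket.timeout("heartbeat timed out")
    return "connected to " + host

print(reconnect(itertools.cycle(hosts), fake_connect))
# -> connected to 120.0.0.57:5671
```

The bug described above is equivalent to the except branch retrying the same host rather than letting the cycle advance.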
I don't think the upstream bug referenced here is applicable to OSP10. On OSP10 we have:

python-oslo-messaging-5.10.2-1.el7ost
python-kombu-3.0.32-2.el7ost
python-amqp-1.4.6-1.el7ost

The upstream patch[1] for the bug explicitly catches socket.timeout in addition to kombu.exceptions.OperationalError. OperationalError, however, is a feature of kombu >= 4.0.0. Because OSP10 has kombu 3.0.32, it works a bit differently.

First, oslo.messaging 5.10.2 does not catch OperationalError at all. Instead, it catches recoverable_errors[2], which is defined[3] as self.connection.recoverable_channel_errors + self.connection.recoverable_connection_errors. Kombu connections delegate both recoverable_connection_errors[4] and recoverable_channel_errors[5] to the transport. In OSP, the kombu transport backend used is py-amqp. Finally, py-amqp defines recoverable_connection_errors[6] as:

    recoverable_connection_errors = (
        RecoverableConnectionError,
        socket.error,
        IOError,
        OSError,
    )

which notably includes both IOError and socket.error. socket.error is already a subclass of IOError, so including it explicitly is redundant:

    >>> issubclass(socket.error, IOError)
    True

More importantly, however, socket.timeout is a subclass of socket.error:

    >>> issubclass(socket.timeout, socket.error)
    True

This means a raised socket.timeout will be caught as part of recoverable_connection_errors, because it is also a socket.error and transitively an IOError. Thus the proposed patch wouldn't have any effect on OSP10; it already handles socket.timeout. If the connection is not failing over to a different RabbitMQ server, it is due to some other reason.
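The class relationships this analysis relies on can be verified end to end with the stdlib alone. The tuple below is re-typed from the py-amqp 1.4.6 source linked at [6], with RecoverableConnectionError stubbed out so the snippet runs without py-amqp installed:

```python
import socket

# The subclass relationships described above.
assert issubclass(socket.error, IOError)
assert issubclass(socket.timeout, socket.error)

# Stand-in for amqp.exceptions.RecoverableConnectionError (stubbed here
# only so this sketch has no dependency on py-amqp).
class RecoverableConnectionError(Exception):
    pass

# Re-typed from py-amqp 1.4.6, see [6].
recoverable_connection_errors = (
    RecoverableConnectionError,
    socket.error,
    IOError,
    OSError,
)

# A socket.timeout is therefore matched by the tuple that oslo.messaging
# 5.10.2 treats as recoverable:
try:
    raise socket.timeout("timed out")
except recoverable_connection_errors:
    caught = True
print(caught)
# -> True
```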
[1] https://review.openstack.org/#/c/519701/1
[2] https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?h=5.10.2#n1004
[3] https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?h=5.10.2#n987
[4] https://github.com/celery/kombu/blob/v3.0.32/kombu/connection.py#L782
[5] https://github.com/celery/kombu/blob/v3.0.32/kombu/connection.py#L795
[6] https://github.com/celery/py-amqp/blob/v1.4.6/amqp/connection.py#L999
Looking at the sosreport, it looks like failover *is* working. As an example:

$ grep a6cb9f8b-e498-4f0c-8299-c62777745308 nova-conductor.log
2018-01-31 05:18:14.549 6386 ERROR oslo.messaging._drivers.impl_rabbit [req-e916d955-37b6-43f8-bf57-81801e7331cd - - - - -] [a6cb9f8b-e498-4f0c-8299-c62777745308] AMQP server on 192.168.1.15:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds. Client port: 49972
2018-01-31 05:18:15.561 6386 ERROR oslo.messaging._drivers.impl_rabbit [req-e916d955-37b6-43f8-bf57-81801e7331cd - - - - -] [a6cb9f8b-e498-4f0c-8299-c62777745308] AMQP server on 192.168.1.15:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: 49972
2018-01-31 05:18:16.691 6386 INFO oslo.messaging._drivers.impl_rabbit [req-e916d955-37b6-43f8-bf57-81801e7331cd - - - - -] [a6cb9f8b-e498-4f0c-8299-c62777745308] Reconnected to AMQP server on 192.168.1.16:5672 via [amqp] client with port 44382.

So the client was originally connected to .15; it attempted to reconnect once to .15, which failed with ECONNREFUSED, and then it successfully re-established the connection to rabbit on .16. Keep in mind that the messaging traffic does not go through haproxy to be load balanced; instead the clients know each rabbit server and will round-robin between all cluster nodes.