Description of problem: This bug is a clone of the upstream bug >> https://bugs.launchpad.net/oslo.messaging/+bug/1657444

[Impact]
When the heartbeat connection times out, the error is not treated as recoverable and no attempt is made to reconnect by calling ensure_connection(). This leaves the heartbeat thread attempting to reconnect to the same host over and over again.

[Test Case]
* Deploy OpenStack:
  bzr branch lp:openstack-charm-testing
  cd openstack-charm-testing
  juju deployer -c default.yaml -d -v artful-pike
  juju add-unit rabbitmq-server
* Force a timeout using iptables on a rabbitmq-server node:
  sudo iptables -I INPUT -p tcp --dport 5672 -j DROP

Expected result: once the timeout happens, the heartbeat thread reconnects (picking a new rabbit host if needed).
Actual result: the heartbeat thread is left in a loop (connect, socket closed, retry, connect, ...).

[Regression Potential]
Without this patch, when the heartbeat connection times out the thread does not attempt to connect to the next configured rabbit host. The risk is that in situations where the daemons using this library currently manage to reconnect to the same host (e.g. when the disconnection lasts only a few seconds), with this change they will reconnect to the next host instead, so users may see connections flapping between two (or more) rabbit hosts.

[Other Info]
I have a rabbitmq cluster of 3 nodes:

root@47704165d2bb:/# rabbitmqctl cluster_status
Cluster status of node rabbit@47704165d2bb ...
[{nodes,[{disc,[rabbit@0482398a286e,rabbit@3709521b608a,
                rabbit@47704165d2bb]}]},
 {running_nodes,[rabbit@0482398a286e,rabbit@3709521b608a,rabbit@47704165d2bb]},
 {cluster_name,<<"rabbit@47704165d2bb">>},
 {partitions,[]},
 {alarms,[{rabbit@0482398a286e,[]},
          {rabbit@3709521b608a,[]},
          {rabbit@47704165d2bb,[]}]}]

root@47704165d2bb:/# rabbitmqctl list_policies
Listing policies ...
/	ha-all	all	^ha\\.
{"ha-mode":"all"}	0

My oslo.messaging client configuration:

[oslo_messaging_rabbit]
rabbit_hosts=120.0.0.56:5671,120.0.0.57:5671,120.0.0.55:5671
rabbit_userid=cloud
rabbit_password=cloud
rabbit_ha_queues=True
rabbit_retry_interval=1
rabbit_retry_backoff=2
rabbit_max_retries=0
rabbit_durable_queues=False

When I run "service rabbitmq-server stop" on one node to simulate a failure, I get the following error logs, and the consumer can't fail over from the bad node. It reconnects to the failed node forever instead of trying the other nodes. kombu_failover_strategy is at its default value of "round-robin".

2009-01-13 18:32:42.785 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
2009-01-13 18:32:43.819 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
2009-01-13 18:32:43.819 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
2009-01-13 18:32:58.874 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
2009-01-13 18:32:59.907 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
2009-01-13 18:32:59.907 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
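The failover behaviour being asked for can be sketched independently of oslo.messaging as a round-robin retry over the configured hosts. This is a minimal, stdlib-only illustration (not the driver's actual code); the reconnect and fake_connect helpers are hypothetical:

```python
import itertools
import socket

hosts = ["120.0.0.56:5671", "120.0.0.57:5671", "120.0.0.55:5671"]

def reconnect(host_cycle, connect):
    # Try hosts round-robin until one connect() succeeds. A recoverable
    # error (socket.timeout is a subclass of socket.error) moves on to
    # the next host instead of retrying the same one forever.
    # Note: loops indefinitely if every host keeps failing.
    for host in host_cycle:
        try:
            return connect(host)
        except socket.error:
            continue

def fake_connect(host):
    # Simulated connect: the first configured host is down, others accept.
    if host == "120.0.0.56:5671":
        raise socket.timeout("heartbeat timed out")
    return "connected to " + host

print(reconnect(itertools.cycle(hosts), fake_connect))
# -> connected to 120.0.0.57:5671
```

The bug described above is equivalent to the except branch retrying the same host rather than letting the cycle advance.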
I don't think the upstream bug referenced here is applicable to OSP10. On OSP10 we have:

python-oslo-messaging-5.10.2-1.el7ost
python-kombu-3.0.32-2.el7ost
python-amqp-1.4.6-1.el7ost

The upstream patch[1] for the bug explicitly catches socket.timeout in addition to kombu.exceptions.OperationalError. OperationalError, however, is a feature of kombu >= 4.0.0. Because OSP10 has kombu 3.0.32, it works a bit differently.

First, oslo.messaging 5.10.2 does not catch OperationalError at all. Instead, it catches recoverable_errors[2], which is defined[3] as self.connection.recoverable_channel_errors + self.connection.recoverable_connection_errors. Kombu connections delegate both recoverable_connection_errors[4] and recoverable_channel_errors[5] to the transport. In OSP, the kombu transport backend used is py-amqp. Finally, py-amqp defines recoverable_connection_errors[6] as:

    recoverable_connection_errors = (
        RecoverableConnectionError,
        socket.error,
        IOError,
        OSError,
    )

which notably includes both IOError and socket.error. socket.error is already a subclass of IOError, so including it explicitly is redundant:

    >>> issubclass(socket.error, IOError)
    True

More importantly, however, socket.timeout is a subclass of socket.error:

    >>> issubclass(socket.timeout, socket.error)
    True

This means a raised socket.timeout will be caught as part of recoverable_connection_errors, because it is also a socket.error and transitively an IOError. Thus the proposed patch wouldn't have any effect on OSP10; it already handles socket.timeout. If the connection is not failing over to a different RabbitMQ server, it is due to some other reason.
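The class relationships this analysis relies on can be verified end to end with the stdlib alone. The tuple below is re-typed from the py-amqp 1.4.6 source linked at [6], with RecoverableConnectionError stubbed out so the snippet runs without py-amqp installed:

```python
import socket

# The subclass relationships described above.
assert issubclass(socket.error, IOError)
assert issubclass(socket.timeout, socket.error)

# Stand-in for amqp.exceptions.RecoverableConnectionError (stubbed here
# only so this sketch has no dependency on py-amqp).
class RecoverableConnectionError(Exception):
    pass

# Re-typed from py-amqp 1.4.6, see [6].
recoverable_connection_errors = (
    RecoverableConnectionError,
    socket.error,
    IOError,
    OSError,
)

# A socket.timeout is therefore matched by the tuple that oslo.messaging
# 5.10.2 treats as recoverable:
try:
    raise socket.timeout("timed out")
except recoverable_connection_errors:
    caught = True
print(caught)
# -> True
```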
[1] https://review.openstack.org/#/c/519701/1
[2] https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?h=5.10.2#n1004
[3] https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?h=5.10.2#n987
[4] https://github.com/celery/kombu/blob/v3.0.32/kombu/connection.py#L782
[5] https://github.com/celery/kombu/blob/v3.0.32/kombu/connection.py#L795
[6] https://github.com/celery/py-amqp/blob/v1.4.6/amqp/connection.py#L999
Looking at the sosreport, it looks like failover *is* working. As an example:

$ grep a6cb9f8b-e498-4f0c-8299-c62777745308 nova-conductor.log
2018-01-31 05:18:14.549 6386 ERROR oslo.messaging._drivers.impl_rabbit [req-e916d955-37b6-43f8-bf57-81801e7331cd - - - - -] [a6cb9f8b-e498-4f0c-8299-c62777745308] AMQP server on 192.168.1.15:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds. Client port: 49972
2018-01-31 05:18:15.561 6386 ERROR oslo.messaging._drivers.impl_rabbit [req-e916d955-37b6-43f8-bf57-81801e7331cd - - - - -] [a6cb9f8b-e498-4f0c-8299-c62777745308] AMQP server on 192.168.1.15:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: 49972
2018-01-31 05:18:16.691 6386 INFO oslo.messaging._drivers.impl_rabbit [req-e916d955-37b6-43f8-bf57-81801e7331cd - - - - -] [a6cb9f8b-e498-4f0c-8299-c62777745308] Reconnected to AMQP server on 192.168.1.16:5672 via [amqp] client with port 44382.

So the client was originally connected to .15; it attempted to reconnect once to .15, which failed with ECONNREFUSED, and then it successfully re-established the connection to rabbit on .16. Keep in mind that the messaging traffic does not go through haproxy to be load balanced; instead the clients know each rabbit server and will round-robin between all cluster nodes.