Bug 1270353 - Failure to reconnect when connection lost during message acknowlegement
Failure to reconnect when connection lost during message acknowlegement
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging (Show other bugs)
7.0 (Kilo)
Unspecified Unspecified
urgent Severity high
: async
: 7.0 (Kilo)
Assigned To: John Eckersberg
Udi Shkalim
: Triaged, ZStream
Depends On:
Blocks: 1278176 1278177
  Show dependency treegraph
 
Reported: 2015-10-09 14:16 EDT by John Eckersberg
Modified: 2017-07-19 20:15 EDT (History)
17 users (show)

See Also:
Fixed In Version: python-oslo-messaging-1.8.3-3.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1278176 1278177 (view as bug list)
Environment:
Last Closed: 2016-01-20 08:12:29 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description John Eckersberg 2015-10-09 14:16:31 EDT
Working with Mark who is doing some scale testing, I think we're seeing roughly this chain of events:

(1) nova-compute is happily consuming messages from rabbitmq

(2) nova-compute receives some message M

(3) the connection to rabbitmq on which M arrived is reset

(4) nova-compute (and oslo.messaging and kombu...) try to acknowledge message M

(5) exception bubbles up for connection reset by peer

(6) connection is *not* re-established

(7) nova-compute no longer consumes messages from rabbitmq

(8) restarting nova-compute service causes messages to be consumed again


Here's the complete backtrace in the nova-compute.log:

2015-10-07 18:03:40.452 13727 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2015-10-07 18:03:40.452 13727 TRACE root Traceback (most recent call last):
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2015-10-07 18:03:40.452 13727 TRACE root     return infunc(*args, **kwargs)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2015-10-07 18:03:40.452 13727 TRACE root     self._dispatch(incoming)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2015-10-07 18:03:40.452 13727 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2015-10-07 18:03:40.452 13727 TRACE root     callback = ctxt.__enter__()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2015-10-07 18:03:40.452 13727 TRACE root     return self.gen.next()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2015-10-07 18:03:40.452 13727 TRACE root     incoming.acknowledge()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2015-10-07 18:03:40.452 13727 TRACE root     self.acknowledge_callback()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 168, in acknowledge
2015-10-07 18:03:40.452 13727 TRACE root     self._raw_message.ack()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2015-10-07 18:03:40.452 13727 TRACE root     self.channel.basic_ack(self.delivery_tag)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2015-10-07 18:03:40.452 13727 TRACE root     self._send_method((60, 80), args)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2015-10-07 18:03:40.452 13727 TRACE root     self.channel_id, method_sig, args, content,
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2015-10-07 18:03:40.452 13727 TRACE root     write_frame(1, channel, payload)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2015-10-07 18:03:40.452 13727 TRACE root     frame_type, channel, size, payload, 0xce,
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall
2015-10-07 18:03:40.452 13727 TRACE root     tail = self.send(data, flags)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2015-10-07 18:03:40.452 13727 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2015-10-07 18:03:40.452 13727 TRACE root error: [Errno 104] Connection reset by peer
2015-10-07 18:03:40.452 13727 TRACE root


I would expect to see a log message about "Reconnecting to AMQP server" after, but no such message is present.
Comment 2 John Eckersberg 2015-10-15 15:55:37 EDT
This seems to be because this is missing in OSP7:

https://review.openstack.org/#/c/213780/

It was merged into stable/kilo after the 1.8.3 tag.
Comment 3 Flavio Percoco 2015-10-16 02:10:33 EDT
@eck Thanks for the analysis. I'll backport the patch and let you know.
Comment 4 John Eckersberg 2015-11-04 16:34:28 EST
https://code.engineering.redhat.com/gerrit/#/c/59535/2
Comment 8 Udi Shkalim 2016-01-11 03:49:50 EST
Hi John,

Wanted to try and reproduce the same scenario mention above.
What will be the best way to do so in order to verify solution?

Thanks,
Udi
Comment 9 John Eckersberg 2016-01-11 17:32:46 EST
(In reply to Udi Shkalim from comment #8)
> Hi John,
> 
> Wanted to try and reproduce the same scenario mention above.
> What will be the best way to do so in order to verify solution?
> 
> Thanks,
> Udi

Since the timing is important here, I've written a patch that will simulate the failure by resetting the underlying connection 10% of the time in order to demonstrate the failure and the recovery.

To install on the controller:

# pushd /usr/lib/python2.7/site-packages; curl http://git.engineering.redhat.com/git/users/jeckersb/oslo.messaging/patch/?id=ackbreaker | patch -p1; popd

Then restart a service (I usually just use openstack-nova-conductor).

Before the fix, you will see...

2016-01-11 17:27:21.599 21973 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-11 17:27:21.599 21973 TRACE root Traceback (most recent call last):
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-11 17:27:21.599 21973 TRACE root     return infunc(*args, **kwargs)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-11 17:27:21.599 21973 TRACE root     self._dispatch(incoming)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-11 17:27:21.599 21973 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-11 17:27:21.599 21973 TRACE root     callback = ctxt.__enter__()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-11 17:27:21.599 21973 TRACE root     return self.gen.next()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-11 17:27:21.599 21973 TRACE root     incoming.acknowledge()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-11 17:27:21.599 21973 TRACE root     self.acknowledge_callback()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 171, in acknowledge
2016-01-11 17:27:21.599 21973 TRACE root     self._raw_message.ack()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-11 17:27:21.599 21973 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-11 17:27:21.599 21973 TRACE root     self._send_method((60, 80), args)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-11 17:27:21.599 21973 TRACE root     self.channel_id, method_sig, args, content,
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-11 17:27:21.599 21973 TRACE root     write_frame(1, channel, payload)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-11 17:27:21.599 21973 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-11 17:27:21.599 21973 TRACE root     tail = self.send(data, flags)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-11 17:27:21.599 21973 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-11 17:27:21.599 21973 TRACE root error: [Errno 32] Broken pipe
2016-01-11 17:27:21.599 21973 TRACE root 

And there will be no attempt to reconnect.  It will basically hang here forever.

Then:

- apply the updated oslo.messaging package
- *repatch the file*!
- restart nova-conductor

And there will be a reconnect after the error:

2016-01-11 17:32:39.015 1070 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-11 17:32:39.015 1070 TRACE root Traceback (most recent call last):
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-11 17:32:39.015 1070 TRACE root     return infunc(*args, **kwargs)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-11 17:32:39.015 1070 TRACE root     self._dispatch(incoming)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-11 17:32:39.015 1070 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-11 17:32:39.015 1070 TRACE root     callback = ctxt.__enter__()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-11 17:32:39.015 1070 TRACE root     return self.gen.next()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-11 17:32:39.015 1070 TRACE root     incoming.acknowledge()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-11 17:32:39.015 1070 TRACE root     self.acknowledge_callback()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 172, in acknowledge
2016-01-11 17:32:39.015 1070 TRACE root     self._raw_message.ack()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-11 17:32:39.015 1070 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-11 17:32:39.015 1070 TRACE root     self._send_method((60, 80), args)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-11 17:32:39.015 1070 TRACE root     self.channel_id, method_sig, args, content,
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-11 17:32:39.015 1070 TRACE root     write_frame(1, channel, payload)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-11 17:32:39.015 1070 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-11 17:32:39.015 1070 TRACE root     tail = self.send(data, flags)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-11 17:32:39.015 1070 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-11 17:32:39.015 1070 TRACE root error: [Errno 32] Broken pipe
2016-01-11 17:32:39.015 1070 TRACE root 
2016-01-11 17:32:40.018 1070 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.0.2.8:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.
2016-01-11 17:32:41.038 1070 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.0.2.8:5672
Comment 10 Udi Shkalim 2016-01-14 10:00:43 EST
Thanks John - worked like a charm.

Verified on python-oslo-messaging-1.8.3-3.el7ost

From steps to reproduce above got the expected error and re-connection.
2016-01-14 09:59:02.248 19142 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-14 09:59:02.248 19142 TRACE root Traceback (most recent call last):
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-14 09:59:02.248 19142 TRACE root     return infunc(*args, **kwargs)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-14 09:59:02.248 19142 TRACE root     self._dispatch(incoming)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-14 09:59:02.248 19142 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-14 09:59:02.248 19142 TRACE root     callback = ctxt.__enter__()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-14 09:59:02.248 19142 TRACE root     return self.gen.next()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-14 09:59:02.248 19142 TRACE root     incoming.acknowledge()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-14 09:59:02.248 19142 TRACE root     self.acknowledge_callback()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 172, in acknowledge
2016-01-14 09:59:02.248 19142 TRACE root     self._raw_message.ack()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-14 09:59:02.248 19142 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-14 09:59:02.248 19142 TRACE root     self._send_method((60, 80), args)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-14 09:59:02.248 19142 TRACE root     self.channel_id, method_sig, args, content,
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-14 09:59:02.248 19142 TRACE root     write_frame(1, channel, payload)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-14 09:59:02.248 19142 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-14 09:59:02.248 19142 TRACE root     tail = self.send(data, flags)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-14 09:59:02.248 19142 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-14 09:59:02.248 19142 TRACE root error: [Errno 32] Broken pipe
2016-01-14 09:59:02.248 19142 TRACE root 
2016-01-14 09:59:03.251 19142 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.0.2.10:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.
2016-01-14 09:59:04.295 19142 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.0.2.10:5672
Comment 12 errata-xmlrpc 2016-01-20 08:12:29 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0047

Note You need to log in before you can comment on or make changes to this bug.