Bug 1270353

Summary: Failure to reconnect when connection lost during message acknowlegement
Product: Red Hat OpenStack Reporter: John Eckersberg <jeckersb>
Component: python-oslo-messagingAssignee: John Eckersberg <jeckersb>
Status: CLOSED ERRATA QA Contact: Udi Shkalim <ushkalim>
Severity: high Docs Contact:
Priority: urgent    
Version: 7.0 (Kilo)CC: apevec, dsafford, fpercoco, hguemar, jeckersb, jtaleric, kgiusti, lhh, lnatapov, mfuruta, mwagner, oblaut, plemenko, sputhenp, vcojot, yeylon
Target Milestone: asyncKeywords: Triaged, ZStream
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-oslo-messaging-1.8.3-3.el7ost Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1278176 1278177 (view as bug list) Environment:
Last Closed: 2016-01-20 13:12:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1278176, 1278177    

Description John Eckersberg 2015-10-09 18:16:31 UTC
Working with Mark who is doing some scale testing, I think we're seeing roughly this chain of events:

(1) nova-compute is happily consuming messages from rabbitmq

(2) nova-compute receives some message M

(3) the connection to rabbitmq on which M arrived is reset

(4) nova-compute (and oslo.messaging and kombu...) try to acknowledge message M

(5) exception bubbles up for connection reset by peer

(6) connection is *not* re-established

(7) nova-compute no longer consumes messages from rabbitmq

(8) restarting nova-compute service causes messages to be consumed again


Here's the complete backtrace in the nova-compute.log:

2015-10-07 18:03:40.452 13727 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2015-10-07 18:03:40.452 13727 TRACE root Traceback (most recent call last):
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2015-10-07 18:03:40.452 13727 TRACE root     return infunc(*args, **kwargs)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2015-10-07 18:03:40.452 13727 TRACE root     self._dispatch(incoming)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2015-10-07 18:03:40.452 13727 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2015-10-07 18:03:40.452 13727 TRACE root     callback = ctxt.__enter__()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2015-10-07 18:03:40.452 13727 TRACE root     return self.gen.next()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2015-10-07 18:03:40.452 13727 TRACE root     incoming.acknowledge()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2015-10-07 18:03:40.452 13727 TRACE root     self.acknowledge_callback()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 168, in acknowledge
2015-10-07 18:03:40.452 13727 TRACE root     self._raw_message.ack()
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2015-10-07 18:03:40.452 13727 TRACE root     self.channel.basic_ack(self.delivery_tag)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2015-10-07 18:03:40.452 13727 TRACE root     self._send_method((60, 80), args)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2015-10-07 18:03:40.452 13727 TRACE root     self.channel_id, method_sig, args, content,
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2015-10-07 18:03:40.452 13727 TRACE root     write_frame(1, channel, payload)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2015-10-07 18:03:40.452 13727 TRACE root     frame_type, channel, size, payload, 0xce,
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall
2015-10-07 18:03:40.452 13727 TRACE root     tail = self.send(data, flags)
2015-10-07 18:03:40.452 13727 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2015-10-07 18:03:40.452 13727 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2015-10-07 18:03:40.452 13727 TRACE root error: [Errno 104] Connection reset by peer
2015-10-07 18:03:40.452 13727 TRACE root


I would expect to see a log message about "Reconnecting to AMQP server" after, but no such message is present.

Comment 2 John Eckersberg 2015-10-15 19:55:37 UTC
This seems to be because this is missing in OSP7:

https://review.openstack.org/#/c/213780/

It was merged into stable/kilo after the 1.8.3 tag.

Comment 3 Flavio Percoco 2015-10-16 06:10:33 UTC
@eck Thanks for the analysis. I'll backport the patch and let you know.

Comment 4 John Eckersberg 2015-11-04 21:34:28 UTC
https://code.engineering.redhat.com/gerrit/#/c/59535/2

Comment 8 Udi Shkalim 2016-01-11 08:49:50 UTC
Hi John,

Wanted to try and reproduce the same scenario mention above.
What will be the best way to do so in order to verify solution?

Thanks,
Udi

Comment 9 John Eckersberg 2016-01-11 22:32:46 UTC
(In reply to Udi Shkalim from comment #8)
> Hi John,
> 
> Wanted to try and reproduce the same scenario mention above.
> What will be the best way to do so in order to verify solution?
> 
> Thanks,
> Udi

Since the timing is important here, I've written a patch that will simulate the failure by resetting the underlying connection 10% of the time in order to demonstrate the failure and the recovery.

To install on the controller:

# pushd /usr/lib/python2.7/site-packages; curl http://git.engineering.redhat.com/git/users/jeckersb/oslo.messaging/patch/?id=ackbreaker | patch -p1; popd

Then restart a service (I usually just use openstack-nova-conductor).

Before the fix, you will see...

2016-01-11 17:27:21.599 21973 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-11 17:27:21.599 21973 TRACE root Traceback (most recent call last):
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-11 17:27:21.599 21973 TRACE root     return infunc(*args, **kwargs)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-11 17:27:21.599 21973 TRACE root     self._dispatch(incoming)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-11 17:27:21.599 21973 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-11 17:27:21.599 21973 TRACE root     callback = ctxt.__enter__()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-11 17:27:21.599 21973 TRACE root     return self.gen.next()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-11 17:27:21.599 21973 TRACE root     incoming.acknowledge()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-11 17:27:21.599 21973 TRACE root     self.acknowledge_callback()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 171, in acknowledge
2016-01-11 17:27:21.599 21973 TRACE root     self._raw_message.ack()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-11 17:27:21.599 21973 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-11 17:27:21.599 21973 TRACE root     self._send_method((60, 80), args)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-11 17:27:21.599 21973 TRACE root     self.channel_id, method_sig, args, content,
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-11 17:27:21.599 21973 TRACE root     write_frame(1, channel, payload)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-11 17:27:21.599 21973 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-11 17:27:21.599 21973 TRACE root     tail = self.send(data, flags)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-11 17:27:21.599 21973 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-11 17:27:21.599 21973 TRACE root error: [Errno 32] Broken pipe
2016-01-11 17:27:21.599 21973 TRACE root 

And there will be no attempt to reconnect.  It will basically hang here forever.

Then:

- apply the updated oslo.messaging package
- *repatch the file*!
- restart nova-conductor

And there will be a reconnect after the error:

2016-01-11 17:32:39.015 1070 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-11 17:32:39.015 1070 TRACE root Traceback (most recent call last):
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-11 17:32:39.015 1070 TRACE root     return infunc(*args, **kwargs)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-11 17:32:39.015 1070 TRACE root     self._dispatch(incoming)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-11 17:32:39.015 1070 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-11 17:32:39.015 1070 TRACE root     callback = ctxt.__enter__()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-11 17:32:39.015 1070 TRACE root     return self.gen.next()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-11 17:32:39.015 1070 TRACE root     incoming.acknowledge()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-11 17:32:39.015 1070 TRACE root     self.acknowledge_callback()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 172, in acknowledge
2016-01-11 17:32:39.015 1070 TRACE root     self._raw_message.ack()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-11 17:32:39.015 1070 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-11 17:32:39.015 1070 TRACE root     self._send_method((60, 80), args)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-11 17:32:39.015 1070 TRACE root     self.channel_id, method_sig, args, content,
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-11 17:32:39.015 1070 TRACE root     write_frame(1, channel, payload)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-11 17:32:39.015 1070 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-11 17:32:39.015 1070 TRACE root     tail = self.send(data, flags)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-11 17:32:39.015 1070 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-11 17:32:39.015 1070 TRACE root error: [Errno 32] Broken pipe
2016-01-11 17:32:39.015 1070 TRACE root 
2016-01-11 17:32:40.018 1070 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.0.2.8:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.
2016-01-11 17:32:41.038 1070 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.0.2.8:5672

Comment 10 Udi Shkalim 2016-01-14 15:00:43 UTC
Thanks John - worked like a charm.

Verified on python-oslo-messaging-1.8.3-3.el7ost

From steps to reproduce above got the expected error and re-connection.
2016-01-14 09:59:02.248 19142 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-14 09:59:02.248 19142 TRACE root Traceback (most recent call last):
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-14 09:59:02.248 19142 TRACE root     return infunc(*args, **kwargs)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-14 09:59:02.248 19142 TRACE root     self._dispatch(incoming)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-14 09:59:02.248 19142 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-14 09:59:02.248 19142 TRACE root     callback = ctxt.__enter__()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-14 09:59:02.248 19142 TRACE root     return self.gen.next()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-14 09:59:02.248 19142 TRACE root     incoming.acknowledge()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-14 09:59:02.248 19142 TRACE root     self.acknowledge_callback()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 172, in acknowledge
2016-01-14 09:59:02.248 19142 TRACE root     self._raw_message.ack()
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-14 09:59:02.248 19142 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-14 09:59:02.248 19142 TRACE root     self._send_method((60, 80), args)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-14 09:59:02.248 19142 TRACE root     self.channel_id, method_sig, args, content,
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-14 09:59:02.248 19142 TRACE root     write_frame(1, channel, payload)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-14 09:59:02.248 19142 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-14 09:59:02.248 19142 TRACE root     tail = self.send(data, flags)
2016-01-14 09:59:02.248 19142 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-14 09:59:02.248 19142 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-14 09:59:02.248 19142 TRACE root error: [Errno 32] Broken pipe
2016-01-14 09:59:02.248 19142 TRACE root 
2016-01-14 09:59:03.251 19142 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.0.2.10:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.
2016-01-14 09:59:04.295 19142 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.0.2.10:5672

Comment 12 errata-xmlrpc 2016-01-20 13:12:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0047