Bug 1278176 - Failure to reconnect when connection lost during message acknowlegement [NEEDINFO]
Failure to reconnect when connection lost during message acknowlegement
Status: CLOSED CURRENTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging (Show other bugs)
6.0 (Juno)
Unspecified Unspecified
urgent Severity urgent
: async
: 6.0 (Juno)
Assigned To: John Eckersberg
Ofer Blaut
: Triaged, ZStream
Depends On: 1270353 1278177 1280016 1280019 1280020 1288634 1296680 1300044
Blocks:
  Show dependency treegraph
 
Reported: 2015-11-04 15:44 EST by John Eckersberg
Modified: 2016-04-26 12:08 EDT (History)
11 users (show)

See Also:
Fixed In Version: python-oslo-messaging-1.4.1-8.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1270353
Environment:
Last Closed: 2016-03-03 12:46:17 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
fpercoco: needinfo? (hguemar)


Attachments (Terms of Use)

  None (edit)
Comment 2 John Eckersberg 2015-11-04 16:25:42 EST
Juno backport - https://review.openstack.org/#/c/195688/
Comment 3 John Eckersberg 2015-11-04 16:37:46 EST
https://code.engineering.redhat.com/gerrit/#/c/60808/
Comment 4 Flavio Percoco 2015-11-10 13:21:14 EST
Adding these 3 to the mix:

https://code.engineering.redhat.com/gerrit/61124 fix memory leak for function _safe_log        
https://code.engineering.redhat.com/gerrit/61125 cleanup connection pool return        
https://code.engineering.redhat.com/gerrit/61126 Fix amqp connection pool leak in ConnectionContext
Comment 5 Flavio Percoco 2015-12-18 08:40:38 EST
This has all the required acks now and the patches have landed.
Comment 6 Udi Shkalim 2016-01-13 09:16:58 EST
John Eckersberg 2016-01-11 17:32:46 EST
(In reply to Udi Shkalim from comment #8)
> Hi John,
> 
> Wanted to try and reproduce the same scenario mention above.
> What will be the best way to do so in order to verify solution?
> 
> Thanks,
> Udi

Since the timing is important here, I've written a patch that will simulate the failure by resetting the underlying connection 10% of the time in order to demonstrate the failure and the recovery.

To install on the controller:

# pushd /usr/lib/python2.7/site-packages; curl http://git.engineering.redhat.com/git/users/jeckersb/oslo.messaging/patch/?id=ackbreaker | patch -p1; popd

Then restart a service (I usually just use openstack-nova-conductor).

Before the fix, you will see...

2016-01-11 17:27:21.599 21973 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-11 17:27:21.599 21973 TRACE root Traceback (most recent call last):
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-11 17:27:21.599 21973 TRACE root     return infunc(*args, **kwargs)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-11 17:27:21.599 21973 TRACE root     self._dispatch(incoming)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-11 17:27:21.599 21973 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-11 17:27:21.599 21973 TRACE root     callback = ctxt.__enter__()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-11 17:27:21.599 21973 TRACE root     return self.gen.next()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-11 17:27:21.599 21973 TRACE root     incoming.acknowledge()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-11 17:27:21.599 21973 TRACE root     self.acknowledge_callback()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 171, in acknowledge
2016-01-11 17:27:21.599 21973 TRACE root     self._raw_message.ack()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-11 17:27:21.599 21973 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-11 17:27:21.599 21973 TRACE root     self._send_method((60, 80), args)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-11 17:27:21.599 21973 TRACE root     self.channel_id, method_sig, args, content,
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-11 17:27:21.599 21973 TRACE root     write_frame(1, channel, payload)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-11 17:27:21.599 21973 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-11 17:27:21.599 21973 TRACE root     tail = self.send(data, flags)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-11 17:27:21.599 21973 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-11 17:27:21.599 21973 TRACE root error: [Errno 32] Broken pipe
2016-01-11 17:27:21.599 21973 TRACE root 

And there will be no attempt to reconnect.  It will basically hang here forever.

Then:

- apply the updated oslo.messaging package
- *repatch the file*!
- restart nova-conductor

And there will be a reconnect after the error:

2016-01-11 17:32:39.015 1070 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-11 17:32:39.015 1070 TRACE root Traceback (most recent call last):
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-11 17:32:39.015 1070 TRACE root     return infunc(*args, **kwargs)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-11 17:32:39.015 1070 TRACE root     self._dispatch(incoming)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-11 17:32:39.015 1070 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-11 17:32:39.015 1070 TRACE root     callback = ctxt.__enter__()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-11 17:32:39.015 1070 TRACE root     return self.gen.next()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-11 17:32:39.015 1070 TRACE root     incoming.acknowledge()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-11 17:32:39.015 1070 TRACE root     self.acknowledge_callback()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 172, in acknowledge
2016-01-11 17:32:39.015 1070 TRACE root     self._raw_message.ack()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-11 17:32:39.015 1070 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-11 17:32:39.015 1070 TRACE root     self._send_method((60, 80), args)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-11 17:32:39.015 1070 TRACE root     self.channel_id, method_sig, args, content,
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-11 17:32:39.015 1070 TRACE root     write_frame(1, channel, payload)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-11 17:32:39.015 1070 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-11 17:32:39.015 1070 TRACE root     tail = self.send(data, flags)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-11 17:32:39.015 1070 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-11 17:32:39.015 1070 TRACE root error: [Errno 32] Broken pipe
2016-01-11 17:32:39.015 1070 TRACE root 
2016-01-11 17:32:40.018 1070 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.0.2.8:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.
2016-01-11 17:32:41.038 1070 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.0.2.8:5672

Note You need to log in before you can comment on or make changes to this bug.