Bug 1278176

Summary: Failure to reconnect when connection lost during message acknowlegement
Product: Red Hat OpenStack Reporter: John Eckersberg <jeckersb>
Component: python-oslo-messagingAssignee: John Eckersberg <jeckersb>
Status: CLOSED CURRENTRELEASE QA Contact: Ofer Blaut <oblaut>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.0 (Juno)CC: apevec, fpercoco, hguemar, jtaleric, karlthered, kgiusti, lhh, mwagner, sclewis, ushkalim, yeylon
Target Milestone: asyncKeywords: Triaged, ZStream
Target Release: 6.0 (Juno)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-oslo-messaging-1.4.1-8.el7ost Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1270353 Environment:
Last Closed: 2016-03-03 17:46:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1270353, 1278177, 1280016, 1280019, 1280020, 1288634, 1296680, 1300044    
Bug Blocks:    

Comment 2 John Eckersberg 2015-11-04 21:25:42 UTC
Juno backport - https://review.openstack.org/#/c/195688/

Comment 3 John Eckersberg 2015-11-04 21:37:46 UTC
https://code.engineering.redhat.com/gerrit/#/c/60808/

Comment 4 Flavio Percoco 2015-11-10 18:21:14 UTC
Adding these 3 to the mix:

https://code.engineering.redhat.com/gerrit/61124 fix memory leak for function _safe_log        
https://code.engineering.redhat.com/gerrit/61125 cleanup connection pool return        
https://code.engineering.redhat.com/gerrit/61126 Fix amqp connection pool leak in ConnectionContext

Comment 5 Flavio Percoco 2015-12-18 13:40:38 UTC
This has all the required acks now and the patches have landed.

Comment 6 Udi Shkalim 2016-01-13 14:16:58 UTC
John Eckersberg 2016-01-11 17:32:46 EST
(In reply to Udi Shkalim from comment #8)
> Hi John,
> 
> Wanted to try and reproduce the same scenario mention above.
> What will be the best way to do so in order to verify solution?
> 
> Thanks,
> Udi

Since the timing is important here, I've written a patch that will simulate the failure by resetting the underlying connection 10% of the time in order to demonstrate the failure and the recovery.

To install on the controller:

# pushd /usr/lib/python2.7/site-packages; curl http://git.engineering.redhat.com/git/users/jeckersb/oslo.messaging/patch/?id=ackbreaker | patch -p1; popd

Then restart a service (I usually just use openstack-nova-conductor).

Before the fix, you will see...

2016-01-11 17:27:21.599 21973 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-11 17:27:21.599 21973 TRACE root Traceback (most recent call last):
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-11 17:27:21.599 21973 TRACE root     return infunc(*args, **kwargs)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-11 17:27:21.599 21973 TRACE root     self._dispatch(incoming)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-11 17:27:21.599 21973 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-11 17:27:21.599 21973 TRACE root     callback = ctxt.__enter__()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-11 17:27:21.599 21973 TRACE root     return self.gen.next()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-11 17:27:21.599 21973 TRACE root     incoming.acknowledge()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-11 17:27:21.599 21973 TRACE root     self.acknowledge_callback()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 171, in acknowledge
2016-01-11 17:27:21.599 21973 TRACE root     self._raw_message.ack()
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-11 17:27:21.599 21973 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-11 17:27:21.599 21973 TRACE root     self._send_method((60, 80), args)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-11 17:27:21.599 21973 TRACE root     self.channel_id, method_sig, args, content,
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-11 17:27:21.599 21973 TRACE root     write_frame(1, channel, payload)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-11 17:27:21.599 21973 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-11 17:27:21.599 21973 TRACE root     tail = self.send(data, flags)
2016-01-11 17:27:21.599 21973 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-11 17:27:21.599 21973 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-11 17:27:21.599 21973 TRACE root error: [Errno 32] Broken pipe
2016-01-11 17:27:21.599 21973 TRACE root 

And there will be no attempt to reconnect.  It will basically hang here forever.

Then:

- apply the updated oslo.messaging package
- *repatch the file*!
- restart nova-conductor

And there will be a reconnect after the error:

2016-01-11 17:32:39.015 1070 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-01-11 17:32:39.015 1070 TRACE root Traceback (most recent call last):
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2016-01-11 17:32:39.015 1070 TRACE root     return infunc(*args, **kwargs)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2016-01-11 17:32:39.015 1070 TRACE root     self._dispatch(incoming)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2016-01-11 17:32:39.015 1070 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2016-01-11 17:32:39.015 1070 TRACE root     callback = ctxt.__enter__()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2016-01-11 17:32:39.015 1070 TRACE root     return self.gen.next()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2016-01-11 17:32:39.015 1070 TRACE root     incoming.acknowledge()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2016-01-11 17:32:39.015 1070 TRACE root     self.acknowledge_callback()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 172, in acknowledge
2016-01-11 17:32:39.015 1070 TRACE root     self._raw_message.ack()
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2016-01-11 17:32:39.015 1070 TRACE root     self.channel.basic_ack(self.delivery_tag)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2016-01-11 17:32:39.015 1070 TRACE root     self._send_method((60, 80), args)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2016-01-11 17:32:39.015 1070 TRACE root     self.channel_id, method_sig, args, content,
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2016-01-11 17:32:39.015 1070 TRACE root     write_frame(1, channel, payload)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2016-01-11 17:32:39.015 1070 TRACE root     frame_type, channel, size, payload, 0xce,
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
2016-01-11 17:32:39.015 1070 TRACE root     tail = self.send(data, flags)
2016-01-11 17:32:39.015 1070 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2016-01-11 17:32:39.015 1070 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2016-01-11 17:32:39.015 1070 TRACE root error: [Errno 32] Broken pipe
2016-01-11 17:32:39.015 1070 TRACE root 
2016-01-11 17:32:40.018 1070 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.0.2.8:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.
2016-01-11 17:32:41.038 1070 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.0.2.8:5672