Working with Mark who is doing some scale testing, I think we're seeing roughly this chain of events: (1) nova-compute is happily consuming messages from rabbitmq (2) nova-compute receives some message M (3) the connection to rabbitmq on which M arrived is reset (4) nova-compute (and oslo.messaging and kombu...) try to acknowledge message M (5) exception bubbles up for connection reset by peer (6) connection is *not* re-established (7) nova-compute no longer consumes messages from rabbitmq (8) restarting nova-compute service causes messages to be consumed again Here's the complete backtrace in the nova-compute.log: 2015-10-07 18:03:40.452 13727 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying. 2015-10-07 18:03:40.452 13727 TRACE root Traceback (most recent call last): 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func 2015-10-07 18:03:40.452 13727 TRACE root return infunc(*args, **kwargs) 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread 2015-10-07 18:03:40.452 13727 TRACE root self._dispatch(incoming) 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch 2015-10-07 18:03:40.452 13727 TRACE root spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool) 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with 2015-10-07 18:03:40.452 13727 TRACE root callback = ctxt.__enter__() 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__ 2015-10-07 18:03:40.452 13727 TRACE root return self.gen.next() 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__ 2015-10-07 18:03:40.452 13727 TRACE root incoming.acknowledge() 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge 2015-10-07 18:03:40.452 13727 TRACE root self.acknowledge_callback() 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 168, in acknowledge 2015-10-07 18:03:40.452 13727 TRACE root self._raw_message.ack() 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack 2015-10-07 18:03:40.452 13727 TRACE root self.channel.basic_ack(self.delivery_tag) 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack 2015-10-07 18:03:40.452 13727 TRACE root self._send_method((60, 80), args) 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method 2015-10-07 18:03:40.452 13727 TRACE root self.channel_id, method_sig, args, content, 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method 2015-10-07 18:03:40.452 13727 TRACE root write_frame(1, channel, payload) 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame 2015-10-07 18:03:40.452 13727 TRACE root frame_type, channel, size, payload, 0xce, 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall 2015-10-07 18:03:40.452 13727 TRACE root tail = self.send(data, flags) 2015-10-07 18:03:40.452 13727 TRACE root File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send 2015-10-07 18:03:40.452 13727 TRACE root total_sent += fd.send(data[total_sent:], flags) 2015-10-07 18:03:40.452 13727 TRACE root error: [Errno 104] Connection reset by peer 2015-10-07 18:03:40.452 13727 TRACE root I would expect to see a log message about "Reconnecting to AMQP server" after, but no such message is present.
This seems to be because this is missing in OSP7: https://review.openstack.org/#/c/213780/ It was merged into stable/kilo after the 1.8.3 tag.
@eck Thanks for the analysis. I'll backport the patch and let you know.
https://code.engineering.redhat.com/gerrit/#/c/59535/2
Hi John, Wanted to try and reproduce the same scenario mention above. What will be the best way to do so in order to verify solution? Thanks, Udi
(In reply to Udi Shkalim from comment #8) > Hi John, > > Wanted to try and reproduce the same scenario mention above. > What will be the best way to do so in order to verify solution? > > Thanks, > Udi Since the timing is important here, I've written a patch that will simulate the failure by resetting the underlying connection 10% of the time in order to demonstrate the failure and the recovery. To install on the controller: # pushd /usr/lib/python2.7/site-packages; curl http://git.engineering.redhat.com/git/users/jeckersb/oslo.messaging/patch/?id=ackbreaker | patch -p1; popd Then restart a service (I usually just use openstack-nova-conductor). Before the fix, you will see... 2016-01-11 17:27:21.599 21973 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying. 2016-01-11 17:27:21.599 21973 TRACE root Traceback (most recent call last): 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func 2016-01-11 17:27:21.599 21973 TRACE root return infunc(*args, **kwargs) 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread 2016-01-11 17:27:21.599 21973 TRACE root self._dispatch(incoming) 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch 2016-01-11 17:27:21.599 21973 TRACE root spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool) 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with 2016-01-11 17:27:21.599 21973 TRACE root callback = ctxt.__enter__() 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__ 2016-01-11 17:27:21.599 21973 TRACE root return self.gen.next() 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__ 2016-01-11 17:27:21.599 21973 TRACE root incoming.acknowledge() 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge 2016-01-11 17:27:21.599 21973 TRACE root self.acknowledge_callback() 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 171, in acknowledge 2016-01-11 17:27:21.599 21973 TRACE root self._raw_message.ack() 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack 2016-01-11 17:27:21.599 21973 TRACE root self.channel.basic_ack(self.delivery_tag) 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack 2016-01-11 17:27:21.599 21973 TRACE root self._send_method((60, 80), args) 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method 2016-01-11 17:27:21.599 21973 TRACE root self.channel_id, method_sig, args, content, 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method 2016-01-11 17:27:21.599 21973 TRACE root write_frame(1, channel, payload) 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame 2016-01-11 17:27:21.599 21973 TRACE root frame_type, channel, size, payload, 0xce, 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall 2016-01-11 17:27:21.599 21973 TRACE root tail = self.send(data, flags) 2016-01-11 17:27:21.599 21973 TRACE root File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send 2016-01-11 17:27:21.599 21973 TRACE root total_sent += fd.send(data[total_sent:], flags) 2016-01-11 17:27:21.599 21973 TRACE root error: [Errno 32] Broken pipe 2016-01-11 17:27:21.599 21973 TRACE root And there will be no attempt to reconnect. It will basically hang here forever. Then: - apply the updated oslo.messaging package - *repatch the file*! - restart nova-conductor And there will be a reconnect after the error: 2016-01-11 17:32:39.015 1070 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying. 2016-01-11 17:32:39.015 1070 TRACE root Traceback (most recent call last): 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func 2016-01-11 17:32:39.015 1070 TRACE root return infunc(*args, **kwargs) 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread 2016-01-11 17:32:39.015 1070 TRACE root self._dispatch(incoming) 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch 2016-01-11 17:32:39.015 1070 TRACE root spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool) 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with 2016-01-11 17:32:39.015 1070 TRACE root callback = ctxt.__enter__() 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__ 2016-01-11 17:32:39.015 1070 TRACE root return self.gen.next() 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__ 2016-01-11 17:32:39.015 1070 TRACE root incoming.acknowledge() 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge 2016-01-11 17:32:39.015 1070 TRACE root self.acknowledge_callback() 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 172, in acknowledge 2016-01-11 17:32:39.015 1070 TRACE root self._raw_message.ack() 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack 2016-01-11 17:32:39.015 1070 TRACE root self.channel.basic_ack(self.delivery_tag) 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack 2016-01-11 17:32:39.015 1070 TRACE root self._send_method((60, 80), args) 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method 2016-01-11 17:32:39.015 1070 TRACE root self.channel_id, method_sig, args, content, 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method 2016-01-11 17:32:39.015 1070 TRACE root write_frame(1, channel, payload) 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame 2016-01-11 17:32:39.015 1070 TRACE root frame_type, channel, size, payload, 0xce, 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall 2016-01-11 17:32:39.015 1070 TRACE root tail = self.send(data, flags) 2016-01-11 17:32:39.015 1070 TRACE root File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send 2016-01-11 17:32:39.015 1070 TRACE root total_sent += fd.send(data[total_sent:], flags) 2016-01-11 17:32:39.015 1070 TRACE root error: [Errno 32] Broken pipe 2016-01-11 17:32:39.015 1070 TRACE root 2016-01-11 17:32:40.018 1070 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.0.2.8:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds. 2016-01-11 17:32:41.038 1070 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.0.2.8:5672
Thanks John - worked like a charm. Verified on python-oslo-messaging-1.8.3-3.el7ost From steps to reproduce above got the expected error and re-connection. 2016-01-14 09:59:02.248 19142 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying. 2016-01-14 09:59:02.248 19142 TRACE root Traceback (most recent call last): 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func 2016-01-14 09:59:02.248 19142 TRACE root return infunc(*args, **kwargs) 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread 2016-01-14 09:59:02.248 19142 TRACE root self._dispatch(incoming) 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch 2016-01-14 09:59:02.248 19142 TRACE root spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool) 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with 2016-01-14 09:59:02.248 19142 TRACE root callback = ctxt.__enter__() 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__ 2016-01-14 09:59:02.248 19142 TRACE root return self.gen.next() 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__ 2016-01-14 09:59:02.248 19142 TRACE root incoming.acknowledge() 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge 2016-01-14 09:59:02.248 19142 TRACE root self.acknowledge_callback() 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 172, in acknowledge 2016-01-14 09:59:02.248 19142 TRACE root self._raw_message.ack() 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack 2016-01-14 09:59:02.248 19142 TRACE root self.channel.basic_ack(self.delivery_tag) 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack 2016-01-14 09:59:02.248 19142 TRACE root self._send_method((60, 80), args) 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method 2016-01-14 09:59:02.248 19142 TRACE root self.channel_id, method_sig, args, content, 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method 2016-01-14 09:59:02.248 19142 TRACE root write_frame(1, channel, payload) 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame 2016-01-14 09:59:02.248 19142 TRACE root frame_type, channel, size, payload, 0xce, 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall 2016-01-14 09:59:02.248 19142 TRACE root tail = self.send(data, flags) 2016-01-14 09:59:02.248 19142 TRACE root File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send 2016-01-14 09:59:02.248 19142 TRACE root total_sent += fd.send(data[total_sent:], flags) 2016-01-14 09:59:02.248 19142 TRACE root error: [Errno 32] Broken pipe 2016-01-14 09:59:02.248 19142 TRACE root 2016-01-14 09:59:03.251 19142 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 192.0.2.10:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds. 2016-01-14 09:59:04.295 19142 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.0.2.10:5672
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:0047
test case added: https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/wiki/HA/HA%20RabbitMQ%20Sanity