Description of problem:

Live-migrated instances lost connectivity. These are the most interesting entries I could find in the logs:

compute04 (destination compute)
--------------------------------
2018-01-16 14:34:43.535 4506 ERROR nova.network.neutronv2.api [req-f764de7e-6b6d-4b74-83a0-305433f1599b fc177f966e6d419bbd08e18866c31023 b5d7e5c1cb2e4f679618eb1503c7d3da - - -] [instance: 7103f8b2-13eb-444d-b712-ec2a841e12ef] Unable to update binding profile for port: b86d1752-a885-4f46-8292-8bcad207b433 due to failure: Request to http://172.1.1.1:9696/v2.0/ports/b86d1752-a885-4f46-8292-8bcad207b433.json timed out

controller3 neutron/server.log
------------------------------
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging [req-b20fe315-b35e-4831-a0aa-a271fd990806 908c461062244c72b6e5129302eb567c 68e6f659d9c641e48fcb89cf213a1583 - - -] Could not send notification to notifications. Payload={'_context_domain': None, '_context_roles': [u'admin'], 'event_type': 'port.update.start', '_context_request_id': 'req-b20fe315-b35e-4831-a0aa-a271fd990806', 'timestamp': u'2018-01-16 13:34:13.659597', '_context_tenant_id': u'68e6f659d9c641e48fcb89cf213a1583', '_context_user': u'908c461062244c72b6e5129302eb567c', '_unique_id': 'c08d035d6f604f6d84b7ffdce18d0fb4', '_context_resource_uuid': None, '_context_tenant_name': u'service', '_context_is_admin_project': True, '_context_user_id': u'908c4610622445129302eb567c', 'payload': {u'port': {u'binding:profile': {u'migrating_to': u'pollux-compute-004.localdomain'}}, 'id': u'b86d1752-a885-4f46-8292-8bcad207b433'}, '_context_project_name': u'service', '_context_user_identity': u'908c46106e5129302eb567c 68e6f659d9fcb89cf213a1583 - - -', '_context_auth_token': 'gAAAAABaXfH5MNE0vPnNF0OzrMDjBe1f1Um8vbnfOwi9OPtcdIKDxNwFElgwtmuQ1yt4oPzWuHOfpEb0Cnf7BKHgeSYabOzMT2RLT_6quY_49gxMqCHwnYjPapYFTr7P1basVCi52Kra2iWL-j4XkH4mOEbK7HmwH5qS0', '_context_show_deleted': False, '_context_tenant': u'68e6f659d9c641e48fcb89cf213a1583', 'priority': 'INFO', '_context_read_only': False, '_context_is_admin': True, '_context_project_id': u'68e6f659d9c641e48fcb89cf213a1583', '_context_project_domain': None, '_context_timestamp': '2018-01-16 13:34:13.653759', '_context_user_domain': None, '_context_user_name': u'neutron', 'publisher_id': 'netwcontroller-3.localdomain', 'message_id': u'f9b01f74-7f8a-473a-9fb2-e851bcb85183'}
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging Traceback (most recent call last):
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/notify/messaging.py", line 70, in notify
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 104, in _send_notification
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 582, in send_notification
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     envelope=(version == 2.0), notify=True, retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 547, in _send
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     conn.notify_send(exchange, target.topic, msg, retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1306, in notify_send
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     exchange, msg, routing_key=topic, retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1170, in _ensure_publishing
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     self.ensure(method, retry=retry, error_callback=_error_callback)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 817, in ensure
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     ret, channel = autoretry_method()
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 436, in _ensured
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     return fun(*args, **kwargs)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 508, in __call__
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     return fun(*args, channel=channels[0], **kwargs), channels[0]
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 793, in execute_method
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     method()
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1234, in _publish_and_creates_default_queue
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     queue.declare()
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 522, in declare
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     self.queue_declare(nowait, passive=False)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 548, in queue_declare
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     nowait=nowait)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1258, in queue_declare
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     (50, 11),  # Channel.queue_declare_ok
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 69, in wait
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     return self.dispatch_method(method_sig, args, content)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 87, in dispatch_method
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     return amqp_method(self, args)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     reply_code, reply_text, (class_id, method_id), ChannelError,
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging
2018-01-16 14:35:16.859 5472 INFO neutron.notifiers.nova [-] Nova event response: {u'status': u'completed', u'tag': u'b86d1752-a885-4f46-8292-8bcad207b433', u'name': u'network-vif-unplugged', u'server_uuid': u'7103f8b2-13eb-444d-b712-ec2a841e12ef', u'code': 200}
2018-01-16 14:35:18.223 5547 INFO neutron.notifiers.nova [-] Nova event response: {u'status': u'completed', u'tag': u'b86d1752-a885-4f46-8292-8bcad207b433', u'name': u'network-vif-plugged', u'server_uuid': u'7103f8b2-13eb-444d-b712-ec2a841e12ef', u'code': 200}
2018-01-16 14:35:29.697 5528 INFO neutron.wsgi [req-c48 ....
2018-01-16 14:36:19.416 5472 INFO neutron.wsgi [req-b20fe315-b35e-4831-a0aa-a271fd990806 908c461062244c72b6e5129302eb567c 68e6f659d9c641e48fcb89cf213a1583 - - -] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 512, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 453, in write
    wfile.flush()
  File "/usr/lib64/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 385, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 379, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 366, in _send_loop
    return send_method(data, *args)
error: [Errno 104] Connection reset by peer

This can be found in the RabbitMQ logs:

=ERROR REPORT==== 16-Jan-2018::14:38:17 ===
Channel error on connection <0.16217.49> (172.16.2.2:35084 -> 172.16.2.2:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout"

=INFO REPORT==== 16-Jan-2018::14:38:18 ===
accepting AMQP connection <0.29543.49> (172.16.2.23:39113 -> 172.16.2.5:5672)

=INFO REPORT==== 16-Jan-2018::14:38:18 ===
Connection <0.29543.49> (172.16.2.23:39113 -> 172.16.2.25:5672) has a client-provided name: neutron-server:5473:645b09ae-9525-4e2f-9dec-0d32c26a5ac9

=INFO REPORT==== 16-Jan-2018::14:39:19 ===
accepting AMQP connection <0.8736.50> (172.16.2.5:36184 -> 172.16.2.5:5672)

=INFO REPORT==== 16-Jan-2018::14:39:19 ===
Connection <0.8736.50> (172.16.2.5:36184 -> 172.16.2.5:5672) has a client-provided name: ceilometer-agent-notification:653307:6077ae6b-ebad-4e11-ab18-c92c2ce42235

=ERROR REPORT==== 16-Jan-2018::14:39:21 ===
Channel error on connection <0.16217.49> (172.16.2.3:35084 -> 172.16.2.5:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 16-Jan-2018::14:40:21 ===
Channel error on connection <0.8736.50> (172.16.2.5:36184 -> 172.16.2.5:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 16-Jan-2018::14:41:23 ===
Channel error on connection <0.8736.50> (172.16.2.5:36184 -> 172.16.2.5:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout"

Version-Release number of selected component (if applicable):

How reproducible:
This is the second time we have encountered this problem.

Steps to Reproduce:
These are the steps the customer performed:
* We updated the system packages.
* We rebooted the controllers.
* We started the VM migration to free up nodes in order to reboot them.
* The first VM migration looked fine, so we migrated two more. After this we realized that none of the three was reachable from the internet, even though their consoles were reachable via Horizon.

Actual results:
The instances lose network connectivity after live migration.

Expected results:
Instances keep network connectivity after live migration.

Additional info:
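The operation that fails in the neutron-server traceback is a plain AMQP queue.declare on 'notifications.info'. The following is only a diagnostic sketch, not part of any fix: it repeats that check passively (so it cannot create the queue) using the same kombu/py-amqp libraries that appear in the traceback. The broker URL and guest credentials are assumptions based on the rabbit log and should be replaced with the real transport_url.

# Diagnostic sketch only. Assumptions (not taken from this bug): the broker
# URL below and guest:guest credentials; adjust to the deployment's
# transport_url before running on a controller.
import socket

import kombu
from amqp.exceptions import NotFound  # py-amqp, same library as in the traceback

AMQP_URL = 'amqp://guest:guest@172.16.2.5:5672//'  # assumed broker address


def check_notifications_queue(url=AMQP_URL, queue='notifications.info'):
    """Check whether the queue exists without creating it."""
    try:
        with kombu.Connection(url, connect_timeout=5) as conn:
            channel = conn.channel()
            # passive=True: the broker answers 404 NOT_FOUND instead of
            # creating the queue -- the same error class seen in the logs.
            name, messages, consumers = channel.queue_declare(
                queue=queue, passive=True)
            print('%s exists: %s messages, %s consumers'
                  % (name, messages, consumers))
    except NotFound as exc:
        print('queue missing or operation failed on the broker: %s' % exc)
    except socket.error as exc:
        print('broker unreachable: %s' % exc)


if __name__ == '__main__':
    check_notifications_queue()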
Neutron's RPC calls are failing because its connection to the AMQP server is broken. Maybe the RabbitMQ DFG can look into why clients are losing their connection to the AMQP server.
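To separate the two symptoms, connection-level health can be checked independently of any particular queue. A small sketch with kombu's retry machinery follows; again, the broker URL is an assumption and should be the transport_url configured for neutron-server.

# Sketch: verify that the AMQP broker accepts connections from the
# neutron-server host. The URL is an assumption, not from this bug.
import kombu


def log_retry(exc, interval):
    # Called by kombu before each reconnection attempt.
    print('connection failed (%s), retrying in %ss' % (exc, interval))


def check_broker(url='amqp://guest:guest@172.16.2.5:5672//'):
    conn = kombu.Connection(url, connect_timeout=5)
    try:
        # Retry a few times with backoff, similar to what oslo.messaging does.
        conn.ensure_connection(errback=log_retry, max_retries=3)
        print('connected: %s' % conn.as_uri())
    finally:
        conn.release()


if __name__ == '__main__':
    check_broker()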
Hello all. I'm still investigating this, but let me give an intermediate report.

I've found an issue between 16-Jan-2018::18:28:30 and 16-Jan-2018::18:28:36 which looks like another reincarnation of https://github.com/rabbitmq/rabbitmq-server/issues/687:
* RabbitMQ at pollux-controller-2 died (18:30).
* RabbitMQ at pollux-controller-1 received a single ack for some message.
* Unfortunately its queue of unacked messages is empty (an unhandled situation).
* RabbitMQ at pollux-controller-1 dies abruptly.
* RabbitMQ at pollux-controller-2 rebooted (18:36).
* This happens while the cluster is in an unhealthy state.

As for the situation mentioned above (~16-Jan-2018::14:30): there was indeed some networking glitch. However, none of the RabbitMQ nodes lost connectivity to each other, and they resumed accepting connections from 16-Jan-2018::14:45 onwards. I can't explain this one, but it looks like something with a cause external to RabbitMQ. I'm still looking into it.
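Not a conclusion, just a way to watch for the unhealthy-cluster state described above: if the RabbitMQ management plugin is enabled, its HTTP API can be polled to see which nodes are running and where 'notifications.info' currently lives. A sketch follows; the management port 15672, guest:guest credentials, the controller address, and the availability of the 'requests' library are all assumptions, not facts from this bug.

# Sketch: poll the RabbitMQ management HTTP API for node and queue state.
# Assumptions: management plugin enabled on port 15672, guest:guest works,
# the controller address below, and 'requests' installed on the host.
import requests

BASE = 'http://172.16.2.5:15672/api'
AUTH = ('guest', 'guest')


def get(path):
    resp = requests.get(BASE + path, auth=AUTH, timeout=5)
    resp.raise_for_status()
    return resp.json()


def report():
    for node in get('/nodes'):
        print('node %s running=%s' % (node['name'], node.get('running')))
    # '%2F' is the URL-encoded default vhost '/'.
    queue = get('/queues/%2F/notifications.info')
    print('notifications.info on %s, state=%s, messages=%s'
          % (queue.get('node'), queue.get('state'), queue.get('messages')))


if __name__ == '__main__':
    report()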
This build (rabbitmq-server-3.6.5-7.el7ost) should fix all RabbitMQ-specific issues found in the mentioned customer cases.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2521