Bug 1538956

Summary: instances lost network connectivity after live migration
Product: Red Hat OpenStack Reporter: Eduard Barrera <ebarrera>
Component: rabbitmq-server    Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA QA Contact: Udi Shkalim <ushkalim>
Severity: medium Docs Contact:
Priority: medium    
Version: 11.0 (Ocata)    CC: abeekhof, amuller, apevec, chjones, chrisw, ebarrera, jamsmith, jeckersb, jlibosva, lhh, mkrcmari, nyechiel, plemenko, ragiman, srevivo, vkommadi
Target Milestone: z3    Keywords: Triaged, ZStream
Target Release: 12.0 (Pike)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rabbitmq-server-3.6.5-7.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-08-20 12:53:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Eduard Barrera 2018-01-26 09:01:28 UTC
Description of problem:

Live-migrated instances lost network connectivity. These are the interesting entries I could find in the logs:

compute04 (destination compute)
--------------------------------

2018-01-16 14:34:43.535 4506 ERROR nova.network.neutronv2.api [req-f764de7e-6b6d-4b74-83a0-305433f1599b fc177f966e6d419bbd08e18866c31023 b5d7e5c1cb2e4f679618eb1503c7d3da - - -] [instance: 7103f8b2-13eb-444d-b712-ec2a841e12ef] Unable to update binding profile for port: b86d1752-a885-4f46-8292-8bcad207b433 due to failure: Request to http://172.1.1.1:9696/v2.0/ports/b86d1752-a885-4f46-8292-8bcad207b433.json timed out
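
For context, the binding-profile update that timed out here is the port update Nova issues against the Neutron API (port 9696) during live migration. A minimal sketch of the equivalent call, assuming python-neutronclient; the credentials and auth endpoint are placeholders, and only the port ID and the 'migrating_to' value come from the log above:

# Illustrative sketch only -- credentials and auth_url are placeholders.
from neutronclient.v2_0 import client

neutron = client.Client(username='admin',
                        password='secret',
                        tenant_name='admin',
                        auth_url='http://keystone.example:5000/v2.0')

# Nova marks the port as migrating to the destination compute host; this is
# the REST call that timed out against http://172.1.1.1:9696 in the log above.
neutron.update_port(
    'b86d1752-a885-4f46-8292-8bcad207b433',
    {'port': {'binding:profile': {
        'migrating_to': 'pollux-compute-004.localdomain'}}})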


controller3 neutron/server.log
------------------------------

2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging [req-b20fe315-b35e-4831-a0aa-a271fd990806 908c461062244c72b6e5129302eb567c 68e6f659d9c641e48fcb89cf213a1583 - - -] Could not send notification to notifications. Payload={'_context_domain': None, '_context_roles': [u'admin'], 'event_type': 'port.update.start', '_context_request_id': 'req-b20fe315-b35e-4831-a0aa-a271fd990806', 'timestamp': u'2018-01-16 13:34:13.659597', '_context_tenant_id': u'68e6f659d9c641e48fcb89cf213a1583', '_context_user': u'908c461062244c72b6e5129302eb567c', '_unique_id': 'c08d035d6f604f6d84b7ffdce18d0fb4', '_context_resource_uuid': None, '_context_tenant_name': u'service', '_context_is_admin_project': True, '_context_user_id': u'908c4610622445129302eb567c', 'payload': {u'port': {u'binding:profile': {u'migrating_to': u'pollux-compute-004.localdomain'}}, 'id': u'b86d1752-a885-4f46-8292-8bcad207b433'}, '_context_project_name': u'service', '_context_user_identity': u'908c46106e5129302eb567c 68e6f659d9fcb89cf213a1583 - - -', '_context_auth_token': 
'gAAAAABaXfH5MNE0vPnNF0OzrMDjBe1f1Um8vbnfOwi9OPtcdIKDxNwFElgwtmuQ1yt4oPzWuHOfpEb0Cnf7BKHgeSYabOzMT2RLT_6quY_49gxMqCHwnYjPapYFTr7P1basVCi52Kra2iWL-j4XkH4mOEbK7HmwH5qS0', '_context_show_deleted': False, '_context_tenant': u'68e6f659d9c641e48fcb89cf213a1583', 'priority': 'INFO', '_context_read_only': False, '_context_is_admin': True, '_context_project_id': u'68e6f659d9c641e48fcb89cf213a1583', '_context_project_domain': None, '_context_timestamp': '2018-01-16 13:34:13.653759', '_context_user_domain': None, '_context_user_name': u'neutron', 'publisher_id': 'netwcontroller-3.localdomain', 'message_id': u'f9b01f74-7f8a-473a-9fb2-e851bcb85183'}

2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging Traceback (most recent call last):
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/notify/messaging.py", line 70, in notify
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 104, in _send_notification
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 582, in send_notification
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     envelope=(version == 2.0), notify=True, retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 547, in _send
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     conn.notify_send(exchange, target.topic, msg, retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1306, in notify_send
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     exchange, msg, routing_key=topic, retry=retry)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1170, in _ensure_publishing
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     self.ensure(method, retry=retry, error_callback=_error_callback)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 817, in ensure
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     ret, channel = autoretry_method()
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 436, in _ensured
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     return fun(*args, **kwargs)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 508, in __call__
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     return fun(*args, channel=channels[0], **kwargs), channels[0]
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 793, in execute_method
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     method()
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1234, in _publish_and_creates_default_queue
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     queue.declare()
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 522, in declare
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     self.queue_declare(nowait, passive=False)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 548, in queue_declare
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     nowait=nowait)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1258, in queue_declare
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     (50, 11),  # Channel.queue_declare_ok
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 69, in wait
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     return self.dispatch_method(method_sig, args, content)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 87, in dispatch_method
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     return amqp_method(self, args)
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging     reply_code, reply_text, (class_id, method_id), ChannelError,
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout
2018-01-16 14:35:15.727 5472 ERROR oslo_messaging.notify.messaging
2018-01-16 14:35:16.859 5472 INFO neutron.notifiers.nova [-] Nova event response: {u'status': u'completed', u'tag': u'b86d1752-a885-4f46-8292-8bcad207b433', u'name': u'network-vif-unplugged', u'server_uuid': u'7103f8b2-13eb-444d-b712-ec2a841e12ef', u'code': 200}
2018-01-16 14:35:18.223 5547 INFO neutron.notifiers.nova [-] Nova event response: {u'status': u'completed', u'tag': u'b86d1752-a885-4f46-8292-8bcad207b433', u'name': u'network-vif-plugged', u'server_uuid': u'7103f8b2-13eb-444d-b712-ec2a841e12ef', u'code': 200}
2018-01-16 14:35:29.697 5528 INFO neutron.wsgi [req-c48

....
2018-01-16 14:36:19.416 5472 INFO neutron.wsgi [req-b20fe315-b35e-4831-a0aa-a271fd990806 908c461062244c72b6e5129302eb567c 68e6f659d9c641e48fcb89cf213a1583 - - -] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 512, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 453, in write
    wfile.flush()
  File "/usr/lib64/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 385, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 379, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 366, in _send_loop
    return send_method(data, *args)
error: [Errno 104] Connection reset by peer



The following can be found in the RabbitMQ logs:



=ERROR REPORT==== 16-Jan-2018::14:38:17 ===
Channel error on connection <0.16217.49> (172.16.2.2:35084 -> 172.16.2.2:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout"

=INFO REPORT==== 16-Jan-2018::14:38:18 ===
accepting AMQP connection <0.29543.49> (172.16.2.23:39113 -> 172.16.2.5:5672)

=INFO REPORT==== 16-Jan-2018::14:38:18 ===
Connection <0.29543.49> (172.16.2.23:39113 -> 172.16.2.25:5672) has a client-provided name: neutron-server:5473:645b09ae-9525-4e2f-9dec-0d32c26a5ac9

=INFO REPORT==== 16-Jan-2018::14:39:19 ===
accepting AMQP connection <0.8736.50> (172.16.2.5:36184 -> 172.16.2.5:5672)

=INFO REPORT==== 16-Jan-2018::14:39:19 ===
Connection <0.8736.50> (172.16.2.5:36184 -> 172.16.2.5:5672) has a client-provided name: ceilometer-agent-notification:653307:6077ae6b-ebad-4e11-ab18-c92c2ce42235

=ERROR REPORT==== 16-Jan-2018::14:39:21 ===
Channel error on connection <0.16217.49> (172.16.2.3:35084 -> 172.16.2.5:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 16-Jan-2018::14:40:21 ===
Channel error on connection <0.8736.50> (172.16.2.5:36184 -> 172.16.2.5:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 16-Jan-2018::14:41:23 ===
Channel error on connection <0.8736.50> (172.16.2.5:36184 -> 172.16.2.5:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout"
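
The failing operation can be probed directly against the broker. A minimal diagnostic sketch using kombu (the same library as in the Neutron traceback above); the broker URL and credentials are assumed values, not taken from the customer's configuration:

# Diagnostic sketch only -- broker URL and credentials are assumptions.
from kombu import Connection, Queue

with Connection('amqp://guest:guest@172.16.2.5:5672//') as conn:
    channel = conn.channel()
    try:
        # Passive declare: check that the queue exists without (re)creating it.
        # oslo.messaging performs a similar queue.declare before publishing a
        # notification, which is the call failing with NOT_FOUND above.
        Queue('notifications.info', channel=channel).queue_declare(passive=True)
        print("queue 'notifications.info' is reachable")
    except Exception as exc:
        print("queue.declare failed:", exc)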


Version-Release number of selected component (if applicable):


How reproducible:
This is the second time we have encountered this problem.

Steps to Reproduce:


These are the steps the customer performed:

* We updated the system packages.
* We rebooted the controllers.
* We started the VM migration to free up nodes in order to reboot them.
* The first VM migration looked fine, so we migrated two more. After this we realized that none of the three was reachable from the internet, even though their consoles were reachable via Horizon.


Actual results:
After live migration, the instances lost network connectivity.

Expected results:
Instances retain network connectivity after live migration.

Additional info:

Comment 4 anil venkata 2018-02-13 11:49:24 UTC
Neutron's RPC calls are failing as its connection to the AMQP server is broken. Maybe the RabbitMQ DFG can look into why clients are losing their connections to the AMQP server.

Comment 5 Peter Lemenkov 2018-02-15 15:35:16 UTC
Hello All.

I'm still investigating this, but let me do an intermediate report.

I've found an issue between 16-Jan-2018::18:28:30 and 16-Jan-2018::18:28:36 that looks like another reincarnation of https://github.com/rabbitmq/rabbitmq-server/issues/687.

* RabbitMQ at pollux-controller-2 died (18:30).
* RabbitMQ at pollux-controller-1 received a single ack for some message.
* Unfortunately, its queue of unacked messages was empty (an unhandled situation).
* RabbitMQ at pollux-controller-1 died abruptly.
* RabbitMQ at pollux-controller-2 rebooted (18:36).
* This happened while the cluster was in an unhealthy state.

As for the situation mentioned above (~ 16-Jan-2018::14:30): indeed, there was some networking glitch. However, none of the RabbitMQ nodes lost connectivity to each other, and they resumed accepting connections from 16-Jan-2018::14:45 onwards. I can't explain this one, but it looks like something with a cause external to RabbitMQ.
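
One way to check which node currently holds the master for the 'notifications.info' queue, and whether its mirrors are synchronised, is the RabbitMQ management API. A minimal sketch, assuming the management plugin is enabled on its default port with the guest account (host, port and credentials are placeholders):

# Sketch only -- host, port and credentials are assumptions about this
# deployment (management plugin enabled, default guest account).
import requests

resp = requests.get(
    'http://172.16.2.5:15672/api/queues/%2F/notifications.info',
    auth=('guest', 'guest'), timeout=10)
resp.raise_for_status()
queue = resp.json()

# For a mirrored queue in 3.6.x, 'node' is the master and
# 'synchronised_slave_nodes' lists the in-sync mirrors; a missing or dead
# master would explain the queue.declare timeouts seen in the logs.
print('master node:         ', queue.get('node'))
print('mirrors:             ', queue.get('slave_nodes'))
print('synchronised mirrors:', queue.get('synchronised_slave_nodes'))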

I'm looking into this.

Comment 14 Peter Lemenkov 2018-06-14 13:02:29 UTC
This build (rabbitmq-server-3.6.5-7.el7ost) should fix all RabbitMQ-specific issues found in the mentioned customer's cases.

Comment 30 errata-xmlrpc 2018-08-20 12:53:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2521