Bug 1446825 - OSP10 -> OSP11 upgrade: nova live migration fails before/after upgrading compute node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 11.0 (Ocata)
Assignee: Marios Andreou
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-04-29 10:53 UTC by Marius Cornea
Modified: 2017-05-18 05:55 UTC
CC: 13 users

Fixed In Version: openstack-tripleo-heat-templates-6.0.0-10.el7ost
Doc Type: Known Issue
Doc Text:
A design flaw was found in the Red Hat OpenStack Platform director's use of TripleO to enable libvirtd-based live migration. TripleO had no support for secure live migration, and director took no additional steps to lock down the libvirtd deployment. By default, director deploys libvirtd listening on 0.0.0.0 (all interfaces) with no authentication or encryption. Anyone able to make a TCP connection to any compute host IP address, including 127.0.0.1, other loopback interface addresses, or in some cases addresses exposed beyond the management interface, could use this to open a virsh session to the libvirtd instance and gain control of virtual machine instances or possibly take over the host. Note that without the presence of additional flaws, this should not be accessible from tenant or external networks. Users upgrading from Red Hat OpenStack Platform 10 to Red Hat OpenStack Platform 11 should first apply the relevant update that resolves this issue. Red Hat OpenStack Platform 11 already contains this update as of general availability, and no subsequent update is required. For more information about this flaw and the accompanying resolution, see https://access.redhat.com/solutions/3022771.
Clone Of:
Environment:
Last Closed: 2017-05-17 20:24:14 UTC




Links:
Red Hat Product Errata RHEA-2017:1245 (Priority: normal, Status: SHIPPED_LIVE): Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory, last updated 2017-05-17 23:01:50 UTC

Description Marius Cornea 2017-04-29 10:53:15 UTC
Description of problem:
OSP10 -> OSP11 upgrade: nova live migration fails before/after upgrading compute node with errors like:

2017-04-29 09:19:40.273 60594 ERROR nova.virt.libvirt.driver [req-6c05cab3-172d-440c-bc38-d5e748ecb972 997449a88f7d4116afeeb4822f34f16c 0d3a76f69f8f444783385464e590414f - - -] [instance: bdfc5839-dc68-4832-8a46-19030e55d90e] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+tcp://overcloud-compute-1.localdomain/system: unable to connect to server at 'overcloud-compute-1.localdomain:16509': Connection refused

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-6.0.0-10.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 2 compute nodes
2. Run major-upgrade-composable-steps.yaml OSP11 upgrade step
3. Before upgrading one of the compute nodes, live migrate the instances running on it:
nova host-evacuate-live overcloud-compute-1.localdomain
Wait for the node to be quiesced and verify that no instances are running on it.
4. Upgrade compute node:
upgrade-non-controller.sh --upgrade overcloud-compute-1
5. Reboot compute node
6. Wait for nova-compute to be up on overcloud-compute-1.localdomain 
7. Live migrate instances back to overcloud-compute-1.localdomain:
nova live-migration st-provinstance-ryetobxqen63-my_instance-tfvwjp427pbs overcloud-compute-1.localdomain
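Given the "Connection refused" failure described below, a quick pre-flight check is to probe the libvirtd TCP port (16509) on the migration target before evacuating instances to it. This helper is not part of the original report, just a minimal sketch; `probe_libvirt_tcp` is a name chosen here, and the hostname is the target node from the steps above:

```shell
# probe_libvirt_tcp HOST [PORT]: report whether libvirtd's TCP socket accepts
# connections. Prints "open" or "closed". Defaults to port 16509, the port
# named in the nova error message.
probe_libvirt_tcp() {
    host="$1"
    port="${2:-16509}"
    # /dev/tcp is a bash redirection feature, hence the explicit bash -c.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

# Target compute node of the live migration in this report:
probe_libvirt_tcp overcloud-compute-1.localdomain
```

On a node hit by this bug the probe prints "closed", matching the Connection refused in the nova log; once libvirtd is actually listening on TCP it prints "open".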

Actual results:
The instance is not migrated to the upgraded node because no service is listening on port 16509, and the migration fails with:

2017-04-29 09:19:40.273 60594 ERROR nova.virt.libvirt.driver [req-6c05cab3-172d-440c-bc38-d5e748ecb972 997449a88f7d4116afeeb4822f34f16c 0d3a76f69f8f444783385464e590414f - - -] [instance: bdfc5839-dc68-4832-8a46-19030e55d90e] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+tcp://overcloud-compute-1.localdomain/system: unable to connect to server at 'overcloud-compute-1.localdomain:16509': Connection refused
2017-04-29 09:19:40.353 60594 ERROR nova.virt.libvirt.driver [req-6c05cab3-172d-440c-bc38-d5e748ecb972 997449a88f7d4116afeeb4822f34f16c 0d3a76f69f8f444783385464e590414f - - -] [instance: bdfc5839-dc68-4832-8a46-19030e55d90e] Migration operation has aborted

Expected results:
Instance live migration works during the upgrade process, so workloads can be moved to nodes that are not being upgraded, minimizing the risk of failures during the upgrade.

Additional info:

listen_tcp is set to 0 in /etc/libvirt/libvirtd.conf

[root@overcloud-compute-1 heat-admin]# grep listen_tcp /etc/libvirt/libvirtd.conf
listen_tcp = 0

[root@overcloud-compute-1 heat-admin]# ps axu | grep libvirt
root        1823  0.0  0.2 1440240 21596 ?       Ssl  09:12   0:01 /usr/sbin/libvirtd --listen

After setting listen_tcp = 1 in /etc/libvirt/libvirtd.conf and restarting libvirtd, the migration completes successfully.
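The manual workaround above amounts to flipping listen_tcp and restarting libvirtd. The sketch below is not from the report; it is exercised on a scratch copy of the file so it can run anywhere, whereas on a real compute node the target would be /etc/libvirt/libvirtd.conf followed by `systemctl restart libvirtd`. Note the Doc Text caveat: an unauthenticated TCP listener is the security flaw behind this bug, so the fixed openstack-tripleo-heat-templates package is the proper resolution.

```shell
# Enable libvirtd's TCP listener by flipping listen_tcp, demonstrated on a
# scratch copy of the config. On a real node: edit /etc/libvirt/libvirtd.conf,
# then run 'systemctl restart libvirtd'.
conf=$(mktemp)
printf 'listen_tls = 0\nlisten_tcp = 0\ntcp_port = "16509"\n' > "$conf"

# The same edit the reporter made by hand:
sed -i 's/^listen_tcp = 0$/listen_tcp = 1/' "$conf"

setting=$(grep '^listen_tcp' "$conf")
echo "$setting"
rm -f "$conf"
```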

Note: in the scenario described above, the nova control plane services are running on a custom role. When the nova control plane services run on the monolithic controller, the test fails at step 3: when running nova host-evacuate-live, the instances are not migrated from the host, failing with the same error.

Comment 24 errata-xmlrpc 2017-05-17 20:24:14 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245

