Bug 1446825

Summary: OSP10 -> OSP11 upgrade: nova live migration fails before/after upgrading compute node
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA QA Contact: Marius Cornea <mcornea>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 11.0 (Ocata) CC: aschultz, dbecker, jcoufal, lbopf, mandreou, mburns, morazi, owalsh, rhel-osp-director-maint, sasha, sclewis, sgordon, slinaber
Target Milestone: rc Keywords: TestOnly, Triaged
Target Release: 11.0 (Ocata)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-6.0.0-10.el7ost Doc Type: Known Issue
Doc Text:
A design flaw was found in the way Red Hat OpenStack Platform director uses TripleO to enable libvirtd-based live migration. TripleO did not support secure live migration, and director took no additional steps to lock down the libvirtd deployment. By default, director deploys libvirtd listening on 0.0.0.0 (all interfaces) with no authentication or encryption. Anyone able to make a TCP connection to any compute host IP address, including 127.0.0.1, other loopback interface addresses, or in some cases possibly addresses that have been exposed beyond the management interface, could use this to open a virsh session to the libvirtd instance and gain control of virtual machine instances or possibly take over the host. Note that without the presence of additional flaws, this should not be accessible from tenant or external networks. Users who are upgrading to Red Hat OpenStack Platform 11 from Red Hat OpenStack Platform 10 should first apply the relevant update that resolves this issue. Red Hat OpenStack Platform 11 already contains this update as of general availability and no subsequent update is required. For more information about this flaw and the accompanying resolution, see https://access.redhat.com/solutions/3022771.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-17 20:24:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Marius Cornea 2017-04-29 10:53:15 UTC
Description of problem:
OSP10 -> OSP11 upgrade: nova live migration fails before/after upgrading compute node with errors like:

2017-04-29 09:19:40.273 60594 ERROR nova.virt.libvirt.driver [req-6c05cab3-172d-440c-bc38-d5e748ecb972 997449a88f7d4116afeeb4822f34f16c 0d3a76f69f8f444783385464e590414f - - -] [instance: bdfc5839-dc68-4832-8a46-19030e55d90e] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+tcp://overcloud-compute-1.localdomain/system: unable to connect to server at 'overcloud-compute-1.localdomain:16509': Connection refused
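When triaging many nova-compute log lines of this kind, it can help to extract the target host and port mechanically. The following is a minimal sketch, assuming the message wording matches the single line quoted above; the function name `parse_connect_error` and the regular expression are illustrative and are not part of nova or libvirt. It does not handle IPv6 literals.

```python
import re

# Pattern modeled on the one error message quoted in this report; other
# libvirt versions may word the message differently (assumption).
_CONNECT_ERROR = re.compile(
    r"Failed to connect to remote libvirt URI (?P<uri>\S+): "
    r"unable to connect to server at '(?P<host>[^':]+):(?P<port>\d+)': "
    r"(?P<reason>.+)$"
)


def parse_connect_error(line):
    """Return a dict with uri, host, port and reason, or None if no match."""
    match = _CONNECT_ERROR.search(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["port"] = int(info["port"])  # port as an integer for later probing
    return info
```

Applied to the line above, this would yield the host `overcloud-compute-1.localdomain` and port 16509, which is the endpoint to check on the destination compute node.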

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-6.0.0-10.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 2 compute nodes
2. Run major-upgrade-composable-steps.yaml OSP11 upgrade step
3. Before upgrading one of the compute nodes, live migrate the instances running on it with:
nova host-evacuate-live overcloud-compute-1.localdomain
Wait until the node is quiesced and no instances are running on it.
4. Upgrade compute node:
upgrade-non-controller.sh --upgrade overcloud-compute-1
5. Reboot compute node
6. Wait for nova-compute to be up on overcloud-compute-1.localdomain 
7. Live migrate instances back to overcloud-compute-1.localdomain:
nova live-migration st-provinstance-ryetobxqen63-my_instance-tfvwjp427pbs overcloud-compute-1.localdomain

Actual results:
Instance doesn't get migrated to the upgraded node because there's no service listening on port 16509 and migration fails with:

2017-04-29 09:19:40.273 60594 ERROR nova.virt.libvirt.driver [req-6c05cab3-172d-440c-bc38-d5e748ecb972 997449a88f7d4116afeeb4822f34f16c 0d3a76f69f8f444783385464e590414f - - -] [instance: bdfc5839-dc68-4832-8a46-19030e55d90e] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+tcp://overcloud-compute-1.localdomain/system: unable to connect to server at 'overcloud-compute-1.localdomain:16509': Connection refused
2017-04-29 09:19:40.353 60594 ERROR nova.virt.libvirt.driver [req-6c05cab3-172d-440c-bc38-d5e748ecb972 997449a88f7d4116afeeb4822f34f16c 0d3a76f69f8f444783385464e590414f - - -] [instance: bdfc5839-dc68-4832-8a46-19030e55d90e] Migration operation has aborted
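The "Connection refused" result can be confirmed independently of nova by probing the libvirtd TCP port from the source compute node. The following is a minimal sketch using only the Python standard library; the helper name `tcp_port_open` and the 3 second timeout are assumptions made for illustration.

```python
import socket

LIBVIRT_TCP_PORT = 16509  # port named in the error message in this report


def tcp_port_open(host, port=LIBVIRT_TCP_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host name and attempts the connect;
        # refusal, timeout and resolution failures all raise OSError subclasses.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_port_open("overcloud-compute-1.localdomain")` run from another compute node would be expected to return False while the problem is present, since nothing is listening on that port.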

Expected results:
Instance live migration works during the upgrade process, so workloads can be moved to nodes that are not being upgraded, minimizing the risk of failures during the upgrade.

Additional info:

listen_tcp is set to 0 in /etc/libvirt/libvirtd.conf

[root@overcloud-compute-1 heat-admin]# grep listen_tcp /etc/libvirt/libvirtd.conf
listen_tcp = 0

[root@overcloud-compute-1 heat-admin]# ps axu | grep libvirt
root        1823  0.0  0.2 1440240 21596 ?       Ssl  09:12   0:01 /usr/sbin/libvirtd --listen

After setting listen_tcp = 1 in /etc/libvirt/libvirtd.conf and restarting libvirtd, migration completes fine.
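The grep check above can be scripted when several compute nodes need to be inspected. The following is a minimal sketch that reads the setting from the text of a libvirtd.conf file; the function names are hypothetical, and the parser is deliberately simple (it treats `#` as starting a comment anywhere on a line, so it is not a full implementation of the libvirt configuration syntax).

```python
def read_libvirtd_setting(conf_text, key):
    """Return the last uncommented value assigned to key, or None if unset."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out settings
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            # drop any trailing comment and surrounding double quotes
            value = rest.split("#", 1)[0].strip().strip('"')
    return value


def tcp_listener_enabled(conf_text):
    """True only when listen_tcp is explicitly set to 1 in the given text."""
    return read_libvirtd_setting(conf_text, "listen_tcp") == "1"
```

With the file contents shown in this report (`listen_tcp = 0`), `tcp_listener_enabled` returns False; after the manual change to `listen_tcp = 1` it returns True. The file text could be obtained with, for example, `open("/etc/libvirt/libvirtd.conf").read()` on the node itself.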

Note: in the scenario described above, the nova control plane services are running on a custom role. When the nova control plane services are running on the monolithic controller, the test fails at step 3: when running nova host-evacuate-live, the instances are not migrated from the host, failing with the same error.

Comment 24 errata-xmlrpc 2017-05-17 20:24:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245