Bug 1729114 - nova migration can leave shut off libvirt instances under specific conditions
Summary: nova migration can leave shut off libvirt instances under specific conditions
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Assignee: Stephen Finucane
QA Contact: nova-maint
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-07-11 12:34 UTC by Andreas Karis
Modified: 2019-11-14 15:15 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-09 14:56:29 UTC
Target Upstream Version:



Description Andreas Karis 2019-07-11 12:34:30 UTC
Description of problem:
nova migration can leave shut off libvirt instances under specific conditions

This can be reproduced when using "nova reset-state --active" instead of "nova resize-confirm"

"shut off" instances also count against nova's resource allocations, claiming hypervisor resources that are not actually in use
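A quick way to spot such leftovers on a compute node is to filter `virsh list --all` for domains in the "shut off" state. This is a hypothetical helper, not part of the reproducer; the awk field positions assume the default `virsh list` table layout shown in the transcripts below.

```shell
# Print the names of libvirt domains that are defined but "shut off" on this
# compute node (default virsh table layout: Id, Name, State).
virsh list --all | awk '$NF == "off" && $(NF - 1) == "shut" { print $2 }'
```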

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

### Baseline ###

~~~
[root@overcloud-compute-0 ~]# date
Thu Jul 11 12:14:39 UTC 2019
~~~

~~~
[stack@undercloud-7 ~]$ nova list
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 9     instance-00000005              running
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
~~~

~~~
[stack@undercloud-7 ~]$ nova migrate rhel-test
[stack@undercloud-7 ~]$ nova list
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status        | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | VERIFY_RESIZE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000005              shut off
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 5     instance-00000005              running
~~~

~~~
[stack@undercloud-7 ~]$ nova resize-confirm rhel-test
[stack@undercloud-7 ~]$ nova list
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 5     instance-00000005              running
~~~

~~~
[stack@undercloud-7 ~]$ nova migrate rhel-test
[stack@undercloud-7 ~]$ nova list
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status        | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | VERIFY_RESIZE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
[stack@undercloud-7 ~]$ 
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 10    instance-00000005              running
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000005              shut off
~~~

~~~
[stack@undercloud-7 ~]$ nova resize-confirm rhel-test
[stack@undercloud-7 ~]$ nova list
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 10    instance-00000005              running
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
~~~

Repeated the same cycle a second time to be sure that everything works as expected.

~~~
[root@overcloud-controller-0 ~]# mysqldump nova > nova.after_baseline.dump.sql
[root@overcloud-controller-0 ~]# mysqldump nova_api > nova_api.after_baseline.dump.sql
~~~

### Reproducing reset-state --active issue ###

~~~
[stack@undercloud-7 ~]$ nova migrate rhel-test
[stack@undercloud-7 ~]$ nova list 
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status        | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | VERIFY_RESIZE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000005              shut off
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 7     instance-00000005              running
~~~

~~~
[stack@undercloud-7 ~]$ nova reset-state rhel-test --active
Reset state for server rhel-test succeeded; new state is active
[stack@undercloud-7 ~]$ nova list 
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000005              shut off
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 7     instance-00000005              running
~~~

~~~
[stack@undercloud-7 ~]$ nova migrate rhel-test
[stack@undercloud-7 ~]$ nova list 
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status        | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | VERIFY_RESIZE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 12    instance-00000005              running
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000005              shut off
~~~

~~~
[stack@undercloud-7 ~]$ nova resize-confirm rhel-test
[stack@undercloud-7 ~]$ nova list 
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 12    instance-00000005              running
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000005              shut off
~~~

~~~
[stack@undercloud-7 ~]$ nova migrate rhel-test
ERROR (BadRequest): No valid host was found. No valid host found for cold migrate (HTTP 400) (Request-ID: req-4a30ab21-2dc1-4859-9aa6-dcb472a2f05b)
~~~
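One way to see whether stale claims are behind the scheduler failure is to check what nova believes is in use on the suspect hypervisor. This is a suggested check, not from the original report; the hypervisor name is taken from this environment and may need a domain suffix in yours.

```shell
# Show nova's view of claimed resources on the hypervisor; leftover claims
# from the unconfirmed migration inflate the *_used fields.
nova hypervisor-show overcloud-compute-1 | grep -E 'vcpus|memory_mb|running_vms'
```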

From this moment on, the instance will always leave a "shut off" virtual machine upon migration, no matter what one does.

~~~
[root@overcloud-controller-0 ~]# mysqldump nova > nova.after_reproducer.dump.sql
[root@overcloud-controller-0 ~]# mysqldump nova_api > nova_api.after_reproducer.dump.sql
~~~
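The migration records themselves can also be inspected directly in the database dumped above. This is an assumption about a useful check, not part of the original reproducer; the column names match the Newton-era nova schema, but verify against your deployment. A record stuck in a pre-confirmed status would correspond to the leftover domain.

```shell
# List migration records for the affected instance; a record that never
# reached 'confirmed' status is the suspect.
mysql nova -e "SELECT id, source_compute, dest_compute, status, created_at
               FROM migrations
               WHERE instance_uuid = 'c10f8c10-5a4b-4a98-9a07-484ce4a51717'
               ORDER BY id;"
```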

Comment 1 Andreas Karis 2019-07-11 12:35:40 UTC
rpm -qa | grep nova

[root@overcloud-compute-1 ~]# rpm -qa | grep nova
openstack-nova-console-14.1.0-35.el7ost.noarch
puppet-nova-9.6.0-9.el7ost.noarch
python-novaclient-6.0.2-2.el7ost.noarch
openstack-nova-common-14.1.0-35.el7ost.noarch
openstack-nova-migration-14.1.0-35.el7ost.noarch
openstack-nova-api-14.1.0-35.el7ost.noarch
python-nova-14.1.0-35.el7ost.noarch
openstack-nova-compute-14.1.0-35.el7ost.noarch
openstack-nova-cert-14.1.0-35.el7ost.noarch
openstack-nova-conductor-14.1.0-35.el7ost.noarch
openstack-nova-scheduler-14.1.0-35.el7ost.noarch
openstack-nova-novncproxy-14.1.0-35.el7ost.noarch
[root@overcloud-compute-1 ~]#

Comment 2 Andreas Karis 2019-07-11 12:38:38 UTC
I admit that my lab is a few versions behind, but I cannot see anything that's related to this in the changelog:

2019-05-15 Sylvain Bauza <sbauza@redhat.com> 1:14.1.0-50

    - Add functional test for live migrate with anti-affinity group (rhbz#1640624)
    - OSP10-only: Fix live-migrate checking affinity rules (rhbz#1640624)
2019-04-30 Lee Yarwood <lyarwood@redhat.com> 1:14.1.0-49

    - Ensure rbd auth fallback uses matching credentials (rhbz#1700534)
    - libvirt: handle missing rbd_secret_uuid from old connection info (rhbz#1700534)
    - Add missing libvirt exception during device detach (rhbz#1703441)
2019-04-25 Lee Yarwood <lyarwood@redhat.com> 1:14.1.0-48

    - Avoid exploding if guest refuses to detach a volume (rhbz#1669225)
2019-04-12 Lee Yarwood <lyarwood@redhat.com> 1:14.1.0-47

    - Delete instance_id_mappings record in instance_destroy (rhbz#1696757)
2019-04-02 Stephen Finucane <sfinucan@redhat.com> 1:14.1.0-46

    - [Stable Only] hardware: Handle races during pinning (rhbz#1686511)
2019-04-01 Rajesh Tailor <ratailor@redhat.com> 1:14.1.0-45

    - Enforce case-sensitive hostnames in aggregate host add (rhbz#1694152)
2019-03-29 Stephen Finucane <sfinucan@redhat.com> 1:14.1.0-44

    - Handle unbound vif plug errors on compute restart (rhbz#1578028)
    - Handle binding_failed vif plug errors on compute restart (rhbz#1578028)
2019-03-28 Sylvain Bauza <sbauza@redhat.com> 1:14.1.0-42

    - Fix typo (rhbz#1664702)
    - libvirt: Report the virtual size of RAW disks (rhbz#1685343)
2019-03-28 Stephen Finucane <sfinucan@redhat.com> 1:14.1.0-43

    - Fix overcommit for NUMA-based instances (rhbz#1664702)
2019-03-26 Rajesh Tailor <ratailor@redhat.com> 1:14.1.0-41

    - Return 400 when compute host is not found (rhbz#1496718)
    - Fix host validity check for live-migration (rhbz#1496718)
2019-03-08 Matthew Booth <mbooth@redhat.com> 1:14.1.0-40

    - Only attempt a rebuild claim for an evacuation to a new host (rhbz#1619987)
2019-01-28 Lee Yarwood <lyarwood@redhat.com> 1:14.1.0-39

    - Add secret=true to fixed_key configuration parameter (rhbz#1657276)
2019-01-11 Lee Yarwood <lyarwood@redhat.com> 1:14.1.0-38

    - libvirt: Add workaround to cleanup instance dir when using rbd (rhbz#1456718)
2019-01-10 Artom Lifshitz <alifshit@redhat.com> 1:14.1.0-36

    - Stop _undefine_domain erroring if domain not found (rhbz#1636280)
2019-01-10 Artom Lifshitz <alifshit@redhat.com> 1:14.1.0-37

    - Rollback instance.image_ref on failed rebuild (rhbz#1540369)
2018-11-28 Lee Yarwood <lyarwood@redhat.com> 1:14.1.0-34

    - Support qemu >= 2.10 (rhbz#1646382)

Comment 4 Matthew Booth 2019-07-12 12:23:33 UTC
My initial reaction is: nova reset-state is an admin action which can break things if used improperly. If I've understood correctly, the reproducer steps circumvent a state transition that explicitly does a bunch of cleanup, so that cleanup is skipped. If the admin wants to do that, the admin gets to clean up after it.

I'm going to ask for a second opinion in case I missed something obvious here, but I'm expecting to close this NOTABUG.

Comment 5 Matthew Booth 2019-07-12 15:11:08 UTC
I ran this past the team, and I can confirm that we don't consider this a bug.

Comment 6 Andreas Karis 2019-07-30 22:41:22 UTC
Hi,

Sorry, I was on PTO for a few weeks.

Reopening this: The problem is not that the admin needs to clean up *once*. The problem is that the cleanup then has to be done manually after every single valid, correctly executed migration from this point on. Hence, this *is* a valid bug.

>> From this moment on, the instance will always leave a "shut off" virtual machine upon migration, no matter what one does.

This here is the critical part:

~~~
[stack@undercloud-7 ~]$ nova reset-state rhel-test --active
Reset state for server rhel-test succeeded; new state is active
[stack@undercloud-7 ~]$ nova list 
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000005              shut off
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 7     instance-00000005              running
~~~

--------> I agree that up here, we're in the territory that admins should clean up stuff manually. 100% agreed.
--------> The problem is what follows:

~~~
[stack@undercloud-7 ~]$ nova migrate rhel-test
[stack@undercloud-7 ~]$ nova list 
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status        | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | VERIFY_RESIZE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+---------------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 12    instance-00000005              running
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000005              shut off
~~~

-------------> Above, you can see that I migrated the instance from compute-1 back to compute-0 using the supported procedure. The "shut off" instance on compute-0 was correctly replaced by the new running domain. The "shut off" instance on compute-1 is still there and should be cleaned up by resize-confirm, along with the allocations on compute-1.

~~~
[stack@undercloud-7 ~]$ nova resize-confirm rhel-test
[stack@undercloud-7 ~]$ nova list 
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                                           |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
| c10f8c10-5a4b-4a98-9a07-484ce4a51717 | rhel-test | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe98:cbb9, 10.0.0.105 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-0 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 12    instance-00000005              running
~~~

~~~
[root@overcloud-compute-1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000005              shut off
~~~

--------------------> But here, we can see that this did not happen. What I did not show in the above output: I can even virsh undefine / virsh destroy the shut off domain and then run another couple of migrations, and from then on every migration still leaves a "shut off" instance as a trace. Deleting the instance with virsh and restarting openstack-nova-compute clears the allocations, but only until the next couple of otherwise valid migrations. Hence, in my tests, the instance was effectively broken for migration: it always left traces behind, even when I executed migrations exactly as documented.
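The manual cleanup mentioned above can be sketched as follows. These steps are an assumption based on this comment, not a supported or verified procedure; the domain name is taken from this reproducer, and per the comment the leftover reappears after subsequent migrations.

```shell
# On the compute node holding the stale domain: stop and remove it, then
# restart nova-compute so it re-reports its resource usage.
virsh destroy instance-00000005 2>/dev/null || true  # no-op if already shut off
virsh undefine instance-00000005                     # drop the stale definition
systemctl restart openstack-nova-compute
```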

Comment 8 Stephen Finucane 2019-08-09 14:56:29 UTC
We had a look at this again today. While I agree that the instance should not perpetually leave stale domains behind on the source, the fact remains that something went behind nova's back and interfered with the heuristics we rely on for accurate lifecycle management. 'nova reset-state' is an admin-only operation precisely because things like this can happen, and care is needed to avoid them. A number of suggestions were raised that might help reset the instance to a good state, including reverting the second migration or stopping and starting the instance, and these might be worth investigating. However, given all of the above, it is very unlikely that we will be able to prioritize a larger fix any time soon. I'm going to close this again, this time as WONTFIX, rather than give any impression to the contrary.

