Bug 1567606 - Nova reports overcloud instance in error state after failed double compute failover instance-ha evacuation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: Upstream M3
Target Release: 14.0 (Rocky)
Assignee: Lee Yarwood
QA Contact: nova-maint
URL:
Whiteboard:
: 1562747 (view as bug list)
Depends On: 1562747
Blocks: 1264181 1570429 1582827 1649253
 
Reported: 2018-04-15 12:01 UTC by pkomarov
Modified: 2019-09-09 14:33 UTC (History)
18 users

Fixed In Version: openstack-nova-18.0.0-0.20180726110729.1222c87.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1562747
: 1570429 1649253 (view as bug list)
Environment:
Last Closed: 2019-01-11 11:49:31 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1764883 0 None None None 2018-04-17 22:34:15 UTC
OpenStack gerrit 562072 0 'None' MERGED Add regression test for bug #1764883 2021-02-09 21:38:56 UTC
OpenStack gerrit 562284 0 'None' MERGED compute: Ensure pre-migrating instances are destroyed during init_host 2021-02-09 21:38:57 UTC
Red Hat Knowledge Base (Solution) 3993881 0 Troubleshoot None Nova Reports Instance in ERROR Post Double Compute Failover 2019-03-18 17:46:50 UTC
Red Hat Product Errata RHEA-2019:0045 0 None None None 2019-01-11 11:49:53 UTC

Comment 1 pkomarov 2018-04-15 13:08:57 UTC
Nova log view of one instance in ERROR (not sorted):

http://pastebin.test.redhat.com/576512

Comment 2 pkomarov 2018-04-15 13:17:54 UTC
Link to google drive dir containing all overcloud nodes :

https://drive.google.com/drive/folders/1IHazKxvWdXl4KQRqtGjWw8Q4Sf0pXgUd?usp=sharing

Comment 3 pkomarov 2018-04-15 13:32:56 UTC
Noticed that this happens more often with a larger number of spawned instances (8 vs. 4) when bouncing them between computes using instance-ha (IHA).

Steps to Reproduce:
1) Deploy OSP13 with an IHA env (3 controllers + 2 computes).
2) Deploy instance-ha with the non_shared_storage option using the following procedure:
https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/instance_ha.html

3) Boot 8 instances with FIPs (no volumes attached).
4) Fail over the compute hypervisors (echo b > /proc/sysrq-trigger), two times each; after each failover, wait for evacuation of the instances.
5) After both computes have failed over twice, instances go into ERROR state.

Comment 4 Andrew Beekhof 2018-04-15 21:57:32 UTC
PSA: Keep in mind that syslog and file logs will be off by 4 hours


Taking this error as a reference point:

2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [req-fdd1d356-1fd5-49d6-8acd-84a4d210a18a 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Setting instance vm_state to ERROR: InstanceExists: Instance instance-000001bc already exists.

req-fdd1d356-1fd5-49d6-8acd-84a4d210a18a happened in sosreport-overcloud-novacompute-1-20180415125609/var/log/containers/nova/nova-compute.log and was initiated at:

2018-04-15 12:31:43.294 1 INFO nova.compute.manager [req-fdd1d356-1fd5-49d6-8acd-84a4d210a18a 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Rebuilding instance


Looking for fencing of overcloud-novacompute-0, we see the following in ./sosreport-controller-0-20180415125608/var/log/cluster/corosync.log:

Apr 15 12:31:19 [20104] controller-0 stonith-ng:     info: call_remote_stonith:	Requesting that 'controller-1' perform op 'overcloud-novacompute-0 off' with 'stonith-fence_ipmilan-5254008be2cc' for crmd.20108 (72s)
Apr 15 12:31:22 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-0 off' on behalf of crmd.20108@controller-0: OK (0)

Apr 15 12:31:25 [20104] controller-0 stonith-ng:     info: call_remote_stonith:	Requesting that 'controller-1' perform op 'overcloud-novacompute-0 on' with 'stonith-fence_ipmilan-5254008be2cc' for crmd.20108 (72s)
Apr 15 12:31:28 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-0 on' on behalf of crmd.20108@controller-0: OK (0)

Apr 15 12:32:44 [20104] controller-0 stonith-ng:     info: call_remote_stonith:	Requesting that 'controller-1' perform op 'overcloud-novacompute-0 on' with 'stonith-fence_ipmilan-5254008be2cc' for crmd.20108 (72s)
Apr 15 12:32:44 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-0 on' on behalf of crmd.20108@controller-0: OK (0)

Prior to that, the previous activity for overcloud-novacompute-0 was significantly earlier but also appears to have succeeded: 

Apr 15 12:19:37 [20104] controller-0 stonith-ng:     info: call_remote_stonith:	Requesting that 'controller-1' perform op 'overcloud-novacompute-0 off' with 'stonith-fence_ipmilan-5254008be2cc' for crmd.20108 (72s)
Apr 15 12:19:40 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-0 off' on behalf of crmd.20108@controller-0: OK (0)

Apr 15 12:19:44 [20104] controller-0 stonith-ng:     info: call_remote_stonith:	Requesting that 'controller-1' perform op 'overcloud-novacompute-0 on' with 'stonith-fence_ipmilan-5254008be2cc' for crmd.20108 (72s)
Apr 15 12:19:46 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-0 on' on behalf of crmd.20108@controller-0: OK (0)

Apr 15 12:21:02 [20104] controller-0 stonith-ng:     info: call_remote_stonith:	Requesting that 'controller-2' perform op 'overcloud-novacompute-0 on' with 'stonith-fence_ipmilan-5254008be2cc' for crmd.20108 (72s)
Apr 15 12:21:02 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-0 on' on behalf of crmd.20108@controller-0: OK (0)

So power fencing appears to be working and correlates both times with:

./sosreport-controller-1-20180415125608/var/log/cluster/corosync.log:Apr 15 12:31:28  NovaEvacuate(nova-evacuate)[546861]:    NOTICE: Initiating evacuation of overcloud-novacompute-0.localdomain with fence_evacuate
./sosreport-controller-1-20180415125608/var/log/cluster/corosync.log:Apr 15 12:31:41  NovaEvacuate(nova-evacuate)[546861]:    NOTICE: Completed evacuation of overcloud-novacompute-0.localdomain

and

./sosreport-controller-1-20180415125608/var/log/cluster/corosync.log:Apr 15 12:19:46  NovaEvacuate(nova-evacuate)[485642]:    NOTICE: Initiating evacuation of overcloud-novacompute-0.localdomain with fence_evacuate
./sosreport-controller-1-20180415125608/var/log/cluster/corosync.log:Apr 15 12:19:57  NovaEvacuate(nova-evacuate)[485642]:    NOTICE: Completed evacuation of overcloud-novacompute-0.localdomain



Timeline:

./sosreport-controller-0-20180415125608/var/log/cluster/corosync.log:Apr 15 12:31:22 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-0 off' on behalf of crmd.20108@controller-0: OK (0)
./sosreport-controller-0-20180415125608/var/log/cluster/corosync.log:Apr 15 12:31:25 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_compute-fence-nova for 'overcloud-novacompute-0 off' on behalf of crmd.20108@controller-0: OK (0)
./sosreport-controller-0-20180415125608/var/log/cluster/corosync.log:Apr 15 12:31:28 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-0 on' on behalf of crmd.20108@controller-0: OK (0)

At this point we've turned the compute's power off, used the force-down API, disabled it, and then allowed the node to power up again.

./sosreport-controller-1-20180415125608/var/log/cluster/corosync.log:Apr 15 12:31:28  NovaEvacuate(nova-evacuate)[546861]:    NOTICE: Initiating evacuation of overcloud-novacompute-0.localdomain with fence_evacuate
./sosreport-controller-1-20180415125608/var/log/cluster/corosync.log:Apr 15 12:31:41  NovaEvacuate(nova-evacuate)[546861]:    NOTICE: Completed evacuation of overcloud-novacompute-0.localdomain

all evacuation calls have now been made

2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [req-fdd1d356-1fd5-49d6-8acd-84a4d210a18a 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Setting instance vm_state to ERROR: InstanceExists: Instance instance-000001bc already exists.

nova throws the error

./sosreport-controller-0-20180415125608/var/log/cluster/corosync.log:Apr 15 12:32:44 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-0 on' on behalf of crmd.20108@controller-0: OK (0)

we've ensured the power is back on

./sosreport-controller-0-20180415125608/var/log/cluster/corosync.log:Apr 15 12:32:47 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_compute-fence-nova for 'overcloud-novacompute-0 on' on behalf of crmd.20108@controller-0: OK (0)

and finally, we've undone the force down API call and enabled the compute in nova.

Comment 5 Andrew Beekhof 2018-04-15 22:51:41 UTC
Focusing on the nova logs for instance-000001bc aka. a773f1f7-5121-4eab-bfef-2deeb53c10c3 and correlating with evacuation events, I see:

12:13:38 - Start compute-0
12:19:46 - Evacuate compute-0
12:19:58 - Rebuilding on compute-1
12:20:44 - Deleted compute-0
12:25:25 - Evacuate compute-1
12:25:40 - Rebuilding on compute-0
*** No "Deleting instance as it has been evacuated from this host" message for compute-1 ***
12:31:28 - Evacuate compute-0
12:31:43 - Rebuilding on compute-1
12:31:44.682 Setting instance vm_state to ERROR: InstanceExists: Instance instance-000001bc already exists.

So compute-1 didn't have a chance to clean up (and/or didn't think it needed to clean up) its record of a773f1f7-5121-4eab-bfef-2deeb53c10c3 after the 12:25:25 evacuation, which seems like a plausible root cause of the "already exists" error.

-- Andrew
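The missing-cleanup check above can be automated. Here is a hypothetical helper (assumed names, not any shipped tooling) that, given nova-compute log lines from the destination and the evacuated source host, flags instances that were rebuilt elsewhere but never logged the "Deleting instance as it has been evacuated from this host" cleanup on the source:

```python
# Hypothetical log-correlation helper; assumes the nova-compute log
# format shown in this bug ("[instance: <uuid>] <message>").
import re

REBUILD = re.compile(r"\[instance: (?P<uuid>[0-9a-f-]+)\] Rebuilding instance")
CLEANUP = re.compile(
    r"\[instance: (?P<uuid>[0-9a-f-]+)\] "
    r"Deleting instance as it has been evacuated from this host")

def missing_cleanup(dest_log_lines, source_log_lines):
    """UUIDs rebuilt on the destination with no cleanup logged on the source."""
    rebuilt = {m.group("uuid") for line in dest_log_lines
               if (m := REBUILD.search(line))}
    cleaned = {m.group("uuid") for line in source_log_lines
               if (m := CLEANUP.search(line))}
    return sorted(rebuilt - cleaned)
```

Any UUID this returns for a given evacuation window is a candidate for the InstanceExists failure on the next failover.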

Fencing did occur:

Apr 15 12:25:19 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-525400fca120 for 'overcloud-novacompute-1 off' on behalf of crmd.20108@controller-0: OK (0)
Apr 15 12:25:23 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_compute-fence-nova for 'overcloud-novacompute-1 off' on behalf of crmd.20108@controller-0: OK (0)
Apr 15 12:25:25 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-525400fca120 for 'overcloud-novacompute-1 on' on behalf of crmd.20108@controller-0: OK (0)
Apr 15 12:26:43 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_ipmilan-525400fca120 for 'overcloud-novacompute-1 on' on behalf of crmd.20108@controller-0: OK (0)
Apr 15 12:26:48 [20104] controller-0 stonith-ng:   notice: process_remote_stonith_exec:	Call to stonith-fence_compute-fence-nova for 'overcloud-novacompute-1 on' on behalf of crmd.20108@controller-0: OK (0)



Chronological Nova plus IHA evacuation logs:


compute-0:2018-04-15 12:13:38.937 1 INFO nova.compute.claims [req-c7eb107c-2fcf-4409-a6d6-88d655f83faa 81b1b95db70a46098e1dd21e4fe5e11b 39574f3ccbb4493e9d7e48b32ed0a1d9 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Claim successful on node overcloud-novacompute-0.localdomain
compute-0:2018-04-15 12:13:40.074 1 INFO nova.virt.libvirt.driver [req-c7eb107c-2fcf-4409-a6d6-88d655f83faa 81b1b95db70a46098e1dd21e4fe5e11b 39574f3ccbb4493e9d7e48b32ed0a1d9 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Creating image
compute-0:2018-04-15 12:13:53.397 1 INFO nova.compute.manager [req-8519d642-59f2-4619-b9c1-e06aa2b97289 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] VM Started (Lifecycle Event)
compute-0:2018-04-15 12:13:53.464 1 INFO nova.compute.manager [req-8519d642-59f2-4619-b9c1-e06aa2b97289 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] VM Paused (Lifecycle Event)
compute-0:2018-04-15 12:13:53.577 1 INFO nova.compute.manager [req-8519d642-59f2-4619-b9c1-e06aa2b97289 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] During sync_power_state the instance has a pending task (spawning). Skip.
compute-0:2018-04-15 12:13:56.961 1 INFO nova.compute.manager [req-381abaf9-c98c-4757-83ea-ba056269eddf 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] VM Resumed (Lifecycle Event)
compute-0:2018-04-15 12:13:56.968 1 INFO nova.virt.libvirt.driver [req-381abaf9-c98c-4757-83ea-ba056269eddf 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Instance spawned successfully.
compute-0:2018-04-15 12:13:56.968 1 INFO nova.compute.manager [req-381abaf9-c98c-4757-83ea-ba056269eddf 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Took 16.90 seconds to spawn the instance on the hypervisor.
compute-0:2018-04-15 12:13:57.115 1 INFO nova.compute.manager [req-381abaf9-c98c-4757-83ea-ba056269eddf 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] During sync_power_state the instance has a pending task (spawning). Skip.
compute-0:2018-04-15 12:13:57.116 1 INFO nova.compute.manager [req-381abaf9-c98c-4757-83ea-ba056269eddf 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] VM Resumed (Lifecycle Event)
compute-0:2018-04-15 12:13:57.168 1 INFO nova.compute.manager [req-381abaf9-c98c-4757-83ea-ba056269eddf 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Took 18.27 seconds to build instance.

Apr 15 12:19:46  NovaEvacuate(nova-evacuate)[485642]:    NOTICE: Initiating evacuation of overcloud-novacompute-0.localdomain with fence_evacuate
Apr 15 12:19:57  NovaEvacuate(nova-evacuate)[485642]:    NOTICE: Completed evacuation of overcloud-novacompute-0.localdomain

compute-1:2018-04-15 12:19:58.111 1 INFO nova.compute.manager [req-770e7702-0ad2-45d1-a8fb-92a73814a5b7 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Rebuilding instance
compute-1:2018-04-15 12:19:58.269 1 INFO nova.compute.claims [req-770e7702-0ad2-45d1-a8fb-92a73814a5b7 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Claim successful on node overcloud-novacompute-1.localdomain
compute-1:2018-04-15 12:19:58.665 1 INFO nova.compute.resource_tracker [req-770e7702-0ad2-45d1-a8fb-92a73814a5b7 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] Updating from migration a773f1f7-5121-4eab-bfef-2deeb53c10c3
compute-1:2018-04-15 12:20:06.152 1 INFO nova.network.neutronv2.api [req-6299550c-9b0f-4a49-9057-e16bd841c17f 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Updating port ed902164-72db-4b3b-95d0-402b90df81e9 with attributes {'binding:profile': {}, 'binding:host_id': 'overcloud-novacompute-1.localdomain'}
compute-1:2018-04-15 12:20:06.268 1 WARNING nova.compute.manager [req-d29a4a9b-50b4-44fb-9486-81a77d340d15 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-unplugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state active and task_state rebuilding.
compute-1:2018-04-15 12:20:09.898 1 INFO nova.virt.libvirt.driver [req-4cbf4baa-e191-4006-9def-c816fb8cb1b2 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Creating image
compute-1:2018-04-15 12:20:27.623 1 INFO nova.compute.manager [req-dc0bb653-1348-42bb-a19f-039890e08990 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] VM Resumed (Lifecycle Event)
compute-1:2018-04-15 12:20:27.634 1 INFO nova.virt.libvirt.driver [req-dc0bb653-1348-42bb-a19f-039890e08990 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Instance spawned successfully.
compute-1:2018-04-15 12:20:27.918 1 INFO nova.compute.manager [req-dc0bb653-1348-42bb-a19f-039890e08990 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] During the sync_power process the instance has moved from host overcloud-novacompute-0.localdomain to host overcloud-novacompute-1.localdomain
compute-1:2018-04-15 12:20:27.919 1 INFO nova.compute.manager [req-dc0bb653-1348-42bb-a19f-039890e08990 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] VM Started (Lifecycle Event)
compute-1:2018-04-15 12:20:28.171 1 INFO nova.compute.manager [req-dc0bb653-1348-42bb-a19f-039890e08990 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] During the sync_power process the instance has moved from host overcloud-novacompute-0.localdomain to host overcloud-novacompute-1.localdomain
compute-1:2018-04-15 12:20:35.893 1 WARNING nova.compute.manager [req-01d642bd-f0ed-4233-a031-76b59d51e87d 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-plugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state active and task_state None.

compute-0:2018-04-15 12:20:44.543 1 INFO nova.compute.manager [req-36072b4c-cd5a-4738-931c-2015c00d47f6 - - - - -] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Deleting instance as it has been evacuated from this host
compute-0:2018-04-15 12:20:49.129 1 INFO nova.virt.libvirt.driver [req-36072b4c-cd5a-4738-931c-2015c00d47f6 - - - - -] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Instance destroyed successfully.
compute-0:2018-04-15 12:20:49.981 1 INFO nova.virt.libvirt.driver [req-36072b4c-cd5a-4738-931c-2015c00d47f6 - - - - -] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Deleting instance files /var/lib/nova/instances/a773f1f7-5121-4eab-bfef-2deeb53c10c3_del
compute-0:2018-04-15 12:20:49.982 1 INFO nova.virt.libvirt.driver [req-36072b4c-cd5a-4738-931c-2015c00d47f6 - - - - -] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Deletion of /var/lib/nova/instances/a773f1f7-5121-4eab-bfef-2deeb53c10c3_del complete
compute-0:2018-04-15 12:21:08.694 1 WARNING nova.compute.manager [req-d29a4a9b-50b4-44fb-9486-81a77d340d15 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-unplugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state active and task_state rebuilding.

Apr 15 12:25:25  NovaEvacuate(nova-evacuate)[1004018]:    NOTICE: Initiating evacuation of overcloud-novacompute-1.localdomain with fence_evacuate
Apr 15 12:25:39  NovaEvacuate(nova-evacuate)[1004018]:    NOTICE: Completed evacuation of overcloud-novacompute-1.localdomain

compute-0:2018-04-15 12:25:40.770 1 INFO nova.compute.manager [req-eaf096b1-4e03-4152-9734-e3e4bad86d6f 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Rebuilding instance
compute-0:2018-04-15 12:25:41.236 1 INFO nova.compute.claims [req-a1db886f-bb46-4af2-8068-b1fad2b28fc3 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Claim successful on node overcloud-novacompute-0.localdomain
compute-0:2018-04-15 12:25:41.521 1 INFO nova.compute.resource_tracker [req-a1db886f-bb46-4af2-8068-b1fad2b28fc3 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] Updating from migration a773f1f7-5121-4eab-bfef-2deeb53c10c3
compute-0:2018-04-15 12:25:49.467 1 WARNING nova.compute.manager [req-cc0e8a3c-1b43-40f7-b6f1-f804c307c58b 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-unplugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state active and task_state rebuilding.
compute-0:2018-04-15 12:25:49.519 1 INFO nova.network.neutronv2.api [req-cc0e8a3c-1b43-40f7-b6f1-f804c307c58b 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Updating port ed902164-72db-4b3b-95d0-402b90df81e9 with attributes {'binding:profile': {}, 'binding:host_id': 'overcloud-novacompute-0.localdomain'}
compute-0:2018-04-15 12:25:51.590 1 WARNING nova.compute.manager [req-cf61552c-80f7-4803-b4a8-cf9cccf7e938 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-plugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state active and task_state rebuilding.
compute-0:2018-04-15 12:25:54.040 1 WARNING nova.compute.manager [req-69b6c516-e5f5-4159-b39e-fc1883bbc919 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-unplugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state active and task_state rebuilding.
compute-0:2018-04-15 12:25:57.336 1 INFO nova.virt.libvirt.driver [req-1578042c-2f40-4b51-a794-b1de8255b820 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Creating image
compute-0:2018-04-15 12:26:29.031 1 INFO nova.compute.manager [req-7323ef16-afb3-4dcf-bc2b-365034ae17a9 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] VM Resumed (Lifecycle Event)
compute-0:2018-04-15 12:26:29.096 1 INFO nova.virt.libvirt.driver [req-7323ef16-afb3-4dcf-bc2b-365034ae17a9 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Instance spawned successfully.
compute-0:2018-04-15 12:26:29.332 1 INFO nova.compute.manager [req-7323ef16-afb3-4dcf-bc2b-365034ae17a9 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] During the sync_power process the instance has moved from host overcloud-novacompute-1.localdomain to host overcloud-novacompute-0.localdomain
compute-0:2018-04-15 12:26:29.333 1 INFO nova.compute.manager [req-7323ef16-afb3-4dcf-bc2b-365034ae17a9 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] VM Started (Lifecycle Event)
compute-0:2018-04-15 12:26:29.581 1 INFO nova.compute.manager [req-7323ef16-afb3-4dcf-bc2b-365034ae17a9 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] During the sync_power process the instance has moved from host overcloud-novacompute-1.localdomain to host overcloud-novacompute-0.localdomain
compute-0:2018-04-15 12:26:31.396 1 WARNING nova.compute.manager [req-9033065b-c0c3-4151-abf6-b0377e17bed8 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-plugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state error and task_state None.

compute-1:2018-04-15 12:26:39.345 1 WARNING nova.compute.manager [req-cc0e8a3c-1b43-40f7-b6f1-f804c307c58b 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-unplugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state active and task_state rebuilding.
compute-1:2018-04-15 12:26:39.380 1 WARNING nova.compute.manager [req-cf61552c-80f7-4803-b4a8-cf9cccf7e938 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-plugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state active and task_state rebuilding.
compute-1:2018-04-15 12:26:39.389 1 WARNING nova.compute.manager [req-69b6c516-e5f5-4159-b39e-fc1883bbc919 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-unplugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state active and task_state rebuilding.

Apr 15 12:31:28  NovaEvacuate(nova-evacuate)[546861]:    NOTICE: Initiating evacuation of overcloud-novacompute-0.localdomain with fence_evacuate
Apr 15 12:31:41  NovaEvacuate(nova-evacuate)[546861]:    NOTICE: Completed evacuation of overcloud-novacompute-0.localdomain

compute-1:2018-04-15 12:31:43.294 1 INFO nova.compute.manager [req-fdd1d356-1fd5-49d6-8acd-84a4d210a18a 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Rebuilding instance
compute-1:2018-04-15 12:31:43.374 1 INFO nova.compute.claims [req-fdd1d356-1fd5-49d6-8acd-84a4d210a18a 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Claim successful on node overcloud-novacompute-1.localdomain
compute-1:2018-04-15 12:31:43.585 1 INFO nova.compute.resource_tracker [req-fdd1d356-1fd5-49d6-8acd-84a4d210a18a 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] Updating from migration a773f1f7-5121-4eab-bfef-2deeb53c10c3
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [req-fdd1d356-1fd5-49d6-8acd-84a4d210a18a 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Setting instance vm_state to ERROR: InstanceExists: Instance instance-000001bc already exists.
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Traceback (most recent call last):
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7418, in _error_out_instance_on_exception
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]     yield
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2880, in rebuild_instance
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]     migration, request_spec)
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2941, in _do_rebuild_instance_with_claim
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]     self._do_rebuild_instance(*args, **kwargs)
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2969, in _do_rebuild_instance
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]     self._check_instance_exists(context, instance)
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1371, in _check_instance_exists
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3]     raise exception.InstanceExists(name=instance.name)
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] InstanceExists: Instance instance-000001bc already exists.
compute-1:2018-04-15 12:31:44.682 1 ERROR nova.compute.manager [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] 
compute-1:2018-04-15 12:31:45.099 1 INFO nova.compute.manager [req-fdd1d356-1fd5-49d6-8acd-84a4d210a18a 295ff13c59b64dd09f771f11d01243a0 4590303de6c54cd6920d521d0a72480d - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Successfully reverted task state from rebuilding on failure for instance.


compute-0:2018-04-15 12:32:35.102 1 WARNING nova.compute.manager [req-4e5e9fb7-6af3-4042-b4a5-2944059a7b74 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-unplugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state error and task_state None.
compute-0:2018-04-15 12:33:10.654 1 INFO nova.compute.manager [req-4e5e9fb7-6af3-4042-b4a5-2944059a7b74 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
compute-0:2018-04-15 12:44:21.883 1 INFO nova.compute.manager [req-aaae3e30-80d6-4f1b-9f7e-1616e79da414 81b1b95db70a46098e1dd21e4fe5e11b 39574f3ccbb4493e9d7e48b32ed0a1d9 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Terminating instance
compute-0:2018-04-15 12:44:21.903 1 INFO nova.virt.libvirt.driver [req-aaae3e30-80d6-4f1b-9f7e-1616e79da414 81b1b95db70a46098e1dd21e4fe5e11b 39574f3ccbb4493e9d7e48b32ed0a1d9 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Instance destroyed successfully.
compute-0:2018-04-15 12:44:22.632 1 INFO nova.virt.libvirt.driver [req-aaae3e30-80d6-4f1b-9f7e-1616e79da414 81b1b95db70a46098e1dd21e4fe5e11b 39574f3ccbb4493e9d7e48b32ed0a1d9 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Deleting instance files /var/lib/nova/instances/a773f1f7-5121-4eab-bfef-2deeb53c10c3_del
compute-0:2018-04-15 12:44:22.634 1 INFO nova.virt.libvirt.driver [req-aaae3e30-80d6-4f1b-9f7e-1616e79da414 81b1b95db70a46098e1dd21e4fe5e11b 39574f3ccbb4493e9d7e48b32ed0a1d9 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Deletion of /var/lib/nova/instances/a773f1f7-5121-4eab-bfef-2deeb53c10c3_del complete
compute-0:2018-04-15 12:44:22.787 1 INFO nova.compute.manager [req-aaae3e30-80d6-4f1b-9f7e-1616e79da414 81b1b95db70a46098e1dd21e4fe5e11b 39574f3ccbb4493e9d7e48b32ed0a1d9 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Took 0.89 seconds to destroy the instance on the hypervisor.
compute-0:2018-04-15 12:44:25.435 1 INFO nova.compute.manager [req-aaae3e30-80d6-4f1b-9f7e-1616e79da414 81b1b95db70a46098e1dd21e4fe5e11b 39574f3ccbb4493e9d7e48b32ed0a1d9 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Took 2.65 seconds to deallocate network for instance.
compute-0:2018-04-15 12:44:27.271 1 INFO nova.scheduler.client.report [req-44fb51de-9c21-42d8-88fe-ded02a01a550 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] Deleted allocation for instance a773f1f7-5121-4eab-bfef-2deeb53c10c3



compute-1:2018-04-15 12:32:35.101 1 WARNING nova.compute.manager [req-4e5e9fb7-6af3-4042-b4a5-2944059a7b74 270bde522ab94be7a82dea654d901ef6 7d69b2e8b221427ab3738b0bdfb1f788 - default default] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Received unexpected event network-vif-unplugged-ed902164-72db-4b3b-95d0-402b90df81e9 for instance with vm_state error and task_state None.
compute-1:2018-04-15 12:38:09.577 1 INFO nova.compute.manager [req-5b29c33c-39e1-426f-b752-4d9c55339a52 - - - - -] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Deleting instance as it has been evacuated from this host
compute-1:2018-04-15 12:38:13.429 1 INFO nova.virt.libvirt.driver [req-5b29c33c-39e1-426f-b752-4d9c55339a52 - - - - -] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Instance destroyed successfully.
compute-1:2018-04-15 12:38:15.219 1 INFO nova.virt.libvirt.driver [req-5b29c33c-39e1-426f-b752-4d9c55339a52 - - - - -] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Deleting instance files /var/lib/nova/instances/a773f1f7-5121-4eab-bfef-2deeb53c10c3_del
compute-1:2018-04-15 12:38:15.221 1 INFO nova.virt.libvirt.driver [req-5b29c33c-39e1-426f-b752-4d9c55339a52 - - - - -] [instance: a773f1f7-5121-4eab-bfef-2deeb53c10c3] Deletion of /var/lib/nova/instances/a773f1f7-5121-4eab-bfef-2deeb53c10c3_del complete
compute-1:2018-04-15 12:38:16.822 1 WARNING nova.scheduler.client.report [req-5b29c33c-39e1-426f-b752-4d9c55339a52 - - - - -] Expected to find allocations referencing resource provider d5d93bab-058b-4527-a1a4-bdf38904c00e for a773f1f7-5121-4eab-bfef-2deeb53c10c3, but found none.

Comment 6 Lee Yarwood 2018-04-16 11:02:52 UTC
Thanks Andrew, the above is really helpful! I've added a line for the n-cpu service restart below :

(In reply to Andrew Beekhof from comment #5)
> Focusing on the nova logs for instance-000001bc aka.
> a773f1f7-5121-4eab-bfef-2deeb53c10c3 and correlating with evacuation events,
> I see:
> 
> 12:13:38 - Start compute-0
> 12:19:46 - Evacuate compute-0
> 12:19:58 - Rebuilding on compute-1
> 12:20:44 - Deleted compute-0
> 12:25:25 - Evacuate compute-1
> 12:25:40 - Rebuilding on compute-0

12:26:13 - compute-1 n-cpu restarts

> *** No "Deleting instance as it has been evacuated from this host" message
> for compute-1 ***
> 12:31:28 - Evacuate compute-0
> 12:31:43 - Rebuilding on compute-1
> 12:31:44.682 Setting instance vm_state to ERROR: InstanceExists: Instance
> instance-000001bc already exists.
> 
> So compute-1 didn't have a chance (and/or think it needed) to clean up its
> record of a773f1f7-5121-4eab-bfef-2deeb53c10c3 after the 12:25:25 evacuation
> which seems like a plausible root cause of the "already exists" error.

So I previously assumed that we were creating transient Libvirt domains that wouldn't persist a reboot. That isn't the case, and it's the reason for the failures when we attempt to rebuild on compute-1 _without_ first destroying the instance and thus undefining the domain on the host. compute-1 has ~5 minutes in which to destroy the domain of the previously evacuated instance after it starts back up @ 12:26:13, but makes no attempt to do so before compute-0 is evacuated and a rebuild request is again made to compute-1 @ 12:31:44.

This suggests that there was something wrong with the migration record for the evacuation of the instance from compute-1 to compute-0 @ 12:25:25. We would expect the migration record to have a status of `accepted` (evacuation/rebuild in progress) or `done` (evacuation/rebuild complete) at this point; this status is then combined with a local lookup of defined active or inactive domains on compute-1 to confirm which instances need to be destroyed, their domains undefined, etc.
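That cleanup decision can be sketched as a small pure function (hypothetical names; in Nova the real logic runs out of init_host against the migrations table):

```python
def instances_to_destroy(local_instance_uuids, migrations, this_host):
    """Decide which locally defined domains belong to instances that
    were evacuated away from this_host and so must be destroyed and
    undefined.

    `migrations` is an iterable of records with instance_uuid,
    migration_type, source_compute and status attributes, mirroring
    rows in the nova_cell1.migrations table.
    """
    evacuated = {
        m.instance_uuid
        for m in migrations
        if m.migration_type == 'evacuation'
        and m.source_compute == this_host
        and m.status in ('accepted', 'done')  # in progress or complete
    }
    # Any domain still defined locally (active or inactive) for an
    # evacuated instance must not be left behind, or a later rebuild on
    # this host fails with "already exists".
    return set(local_instance_uuids) & evacuated
```

If the migration record never reached `accepted`/`done` (as suspected above), the instance falls outside the set and the stale domain survives.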

Unfortunately the logging in this area is pretty woeful, so we are going to need to reproduce this while gathering some additional data on a compute node where we don't see `Deleting instance as it has been evacuated from this host` in the logs once n-cpu has restarted after an evacuation :

1. Current state of the migration records for a given instance :

# mysql nova_cell1 -e "select * from migrations where instance_uuid='$instance_uuid'"

2. Current state of the domains on a compute host _after_ n-cpu has been restarted and has had a chance to destroy previously evacuated instances :

# sudo virsh list --all

Pini, I'd be happy to help collect the above from an environment where we are seeing this, so feel free to point me at a system if you have one.

Finally given the above I'm going to move this back to DFG:Compute.

Cheers,

Lee

[1] https://github.com/openstack/nova/blob/fda768b304e05821f7479f9698c59d18bf3d3516/nova/compute/manager.py#L627-L656

Comment 8 Lee Yarwood 2018-04-17 13:47:26 UTC
Pini provided access to a test environment where we ran through another example of this. AFAICT instance HA is allowing the source compute to come back up *before* we complete the rebuild on the destination host. While this should work from a migration-records point of view, there appears to be a race in which the source compute sets the task_state for the rebuilding instance to None ahead of the rebuild completing on the destination. This results in a failure on the destination that doesn't appear to clean up the instance, leading to failures with all future evacuation attempts, as the Libvirt domain is still present on the destination.

On the Nova side I'm going to continue looking for the underlying task_state race between the source and destination and also try to understand why we aren't cleaning up on the destination fully after the initial failure.

Andrew, has Instance HA always allowed the source compute to come back up during a rebuild or is that something unique to these tests? Would ensuring the source compute is down during the rebuild be possible if we can't correct the race in Nova?

[ notes ]

MariaDB [nova]> select instance_uuid, created_at, updated_at, status, source_compute, dest_compute from migrations where migration_type = 'evacuation' and instance_uuid = 'd9419b05-025e-4193-b3f7-7f0efc23593b';
+--------------------------------------+---------------------+---------------------+--------+-------------------------------------+-------------------------------------+
| instance_uuid                        | created_at          | updated_at          | status | source_compute                      | dest_compute                        |
+--------------------------------------+---------------------+---------------------+--------+-------------------------------------+-------------------------------------+
| d9419b05-025e-4193-b3f7-7f0efc23593b | 2018-04-17 11:10:22 | 2018-04-17 11:11:12 | failed | overcloud-novacompute-0.localdomain | overcloud-novacompute-1.localdomain |
| d9419b05-025e-4193-b3f7-7f0efc23593b | 2018-04-17 11:21:31 | 2018-04-17 11:21:36 | failed | overcloud-novacompute-0.localdomain | overcloud-novacompute-1.localdomain |
+--------------------------------------+---------------------+---------------------+--------+-------------------------------------+-------------------------------------+
2 rows in set (0.00 sec)

Looking at the first evacuation I can see that the source compute-0 comes back online while we are still rebuilding on the destination compute-1 :

overcloud-novacompute-0.localdomain dies:

2018-04-17 11:08:24.804 1 INFO nova.compute.resource_tracker [req-cafdc005-ceac-4534-83a2-2c57a6926ea8 da4d95c480c343c5bf6abe3b789f4c17 d2c2437b7f6642b4a1d5907fa5f373a9 - default default] Final resource view: name=overcloud-novacompute-0.localdomain phys_ram=8191MB used_ram=4352MB phys_disk=29GB used_disk=4GB total_vcpus=2 used_vcpus=4 pci_stats=[]

overcloud-novacompute-1.localdomain starts the rebuild of instance:

2018-04-17 11:10:24.518 1 INFO nova.compute.manager [req-6faafc7c-7d9a-4c58-81c6-910f2d34d7fd e8ad72fef968482ea34be01399151145 7f263badcbbd4752a9e8921b2cee73e3 - default default] [instance: d9419b05-025e-4193-b3f7-7f0efc23593b] Rebuilding instance

overcloud-novacompute-0.localdomain comes back online and starts running through init_host:

2018-04-17 11:10:59.750 1 INFO nova.service [req-c0b8632c-44d8-4d31-800b-c654d6f66cc2 - - - - -] Starting compute node (version 17.0.1-0.20180302144923.9ace6ed.el7ost)
[..]
2018-04-17 11:11:09.849 1 WARNING nova.compute.resource_tracker [req-c0b8632c-44d8-4d31-800b-c654d6f66cc2 - - - - -] [instance: d9419b05-025e-4193-b3f7-7f0efc23593b] Instance not resizing, skipping migration.

overcloud-novacompute-1.localdomain completes the rebuild but can't update the DB as the task_state for the instance is now None:

2018-04-17 11:11:12.406 1 INFO nova.virt.libvirt.driver [req-ac20c023-9abf-412f-987f-2981c7837c57 da4d95c480c343c5bf6abe3b789f4c17 d2c2437b7f6642b4a1d5907fa5f373a9 - default default] [instance: d9419b05-025e-4193-b3f7-7f0efc23593b] Instance spawned successfully.
[..]
2018-04-17 11:11:12.999 1 ERROR nova.compute.manager [req-ac20c023-9abf-412f-987f-2981c7837c57 da4d95c480c343c5bf6abe3b789f4c17 d2c2437b7f6642b4a1d5907fa5f373a9 - default default] [instance: d9419b05-025e-4193-b3f7-7f0efc23593b] Setting instance vm_state to ERROR: 
UnexpectedTaskStateError_Remote: Conflict updating instance d9419b05-025e-4193-b3f7-7f0efc23593b. Expected: {'task_state': [u'rebuild_spawning']}. Actual: {'task_state': None}
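The conflict in that traceback comes from Nova's guarded instance save: an update is applied only if the row still carries the task_state the caller expects. A toy illustration of the race (hypothetical helper names, not Nova's actual code):

```python
class UnexpectedTaskStateError(Exception):
    pass

def guarded_save(instance_row, updates, expected_task_state):
    """Apply `updates` only if the row's current task_state is one of
    the expected values, mimicking Nova's expected_task_state check."""
    if instance_row['task_state'] not in expected_task_state:
        raise UnexpectedTaskStateError(
            "Expected: %s. Actual: %s"
            % (expected_task_state, instance_row['task_state']))
    instance_row.update(updates)

# The race from the logs above, compressed:
row = {'task_state': 'rebuild_spawning', 'vm_state': 'active'}

# 1. The restarted source compute resets the task_state during init_host.
row['task_state'] = None

# 2. ...so the destination's final save, issued after the domain has
#    spawned successfully, no longer matches and fails.
try:
    guarded_save(row, {'task_state': None, 'vm_state': 'active'},
                 ['rebuild_spawning'])
except UnexpectedTaskStateError:
    pass  # the destination sets vm_state to ERROR instead
```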

Comment 9 Lee Yarwood 2018-04-17 13:50:39 UTC
Oops, missed the additional output showing the second evacuation failing due to the domain still being present on the destination:

2018-04-17 11:21:33.973 1 INFO nova.compute.manager [req-731ed24c-5c7a-45ff-be2d-edd8472ba250 e8ad72fef968482ea34be01399151145 7f263badcbbd4752a9e8921b2cee73e3 - default default] [instance: d9419b05-025e-4193-b3f7-7f0efc23593b] Rebuilding instance
[..]
2018-04-17 11:21:36.423 1 ERROR nova.compute.manager [req-4866dab4-c486-41bd-a9ec-50ebf6e6ec46 e8ad72fef968482ea34be01399151145 7f263badcbbd4752a9e8921b2cee73e3 - default default] [instance: d9419b05-025e-4193-b3f7-7f0efc23593b] Setting instance vm_state to ERROR: InstanceExists: Instance instance-0000000e already exists.

Comment 10 Dan Smith 2018-04-17 14:41:53 UTC
We should see this log message be emitted from the original source compute when it starts back up:

            LOG.info('Deleting instance as it has been evacuated from '
                     'this host', instance=instance)

If not, then it's not properly nuking the local remnants of the instance during recovery.

Comment 11 Andrew Beekhof 2018-04-18 06:29:48 UTC
(In reply to Lee Yarwood from comment #8)
> Andrew, has Instance HA always allowed the source compute to come back up
> during a rebuild or is that something unique to these tests? Would ensuring
> the source compute is down during the rebuild be possible if we can't
> correct the race in Nova?

In the past we inhibited the nova process from starting on the compute until after evacuations completed.
However after consulting with Dan Smith we reached the conclusion that it should be sufficiently safe to allow the nova process to start earlier PROVIDED the compute remained disabled on the control plane and the force-down flag was still in effect.

This then allowed us to improve how IHA was delivered (including a smaller delta compared to the non-IHA architecture).
We could look at undoing this if absolutely necessary but it would mean missing this release.

Comment 12 Dan Smith 2018-04-18 16:30:55 UTC
Andrew, I remember that conversation and just reviewed it to make sure I hadn't told you something incorrect. Here's the salient quote from that conversation:

> IMHO, if you want to be airtight here, you fence the node, evacuate
> everything that you will or could evacuate, and only unfence after
> that. Never allow the possibility for the node to come back up until
> you've evacuated everything you might possibly evacuate. Anything else
> is playing with fire, as far as I'm concerned.

Hopefully that's sufficiently clear that allowing the source node to come back up is not okay. If not, this later quote from the same conversation addresses the fact that telling nova that the source is down does not prevent it from doing damage if it's allowed to restart:

> Marking it down through the API is independent of the
> nova-discovered liveness check and it should be sticky. Meaning, it
> should stay forced down until you un-forced-down it, even if the
> compute node checks back in. Thus, no need to pollute the
> instance-focused evacuate call with host-focused details.
> 
> However, it's only for the evacuate check, so failure to fence the
> node such that it comes back online before you're ready still means
> you can hit all the badness that the check is supposed to prevent.

We also just discussed two potential changes we could entertain in the future, but no promises.

1. We could make nova-compute hang at startup, going into a poll loop on any instances it thinks have been evacuated from it, but are still in the process (or have not completed).

2. We could make nova-compute hang at startup, waiting for its force-down flag to be removed before continuing.

In both cases, we'd just poll once a minute, logging a warning that we're in a holding pattern until the situation is resolved. I don't know how these would be received upstream, and I tend to think that this is more within the realm of the fencing agents to handle. That said, it doesn't seem unreasonable to at least propose this to help prevent nova from corrupting its own data.
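Option 2 could be sketched roughly as follows (hedged: `wait_for_unforce` and its wiring are hypothetical, not an existing Nova interface; `sleep` and `log` are injected so the holding pattern is testable):

```python
import time

def wait_for_unforce(is_forced_down, poll_interval=60,
                     log=print, sleep=time.sleep):
    """Hold nova-compute startup until the service's force-down flag is
    cleared (option 2 above).

    `is_forced_down` is a callable returning the current flag, e.g. a
    lookup against the os-services API.
    """
    while is_forced_down():
        log('nova-compute start held: service is still forced down; '
            'retrying in %ds' % poll_interval)
        sleep(poll_interval)
```

Once the loop exits, startup (and the evacuation cleanup in init_host) would proceed as normal.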

Comment 13 Andrew Beekhof 2018-04-19 02:33:26 UTC
(In reply to Dan Smith from comment #12)
> Andrew, I remember that conversation and just reviewed it to make sure I
> hadn't told  you something incorrect. Here's the salient quote from that
> conversation:
> 
> > IMHO, if you want to be airtight here, you fence the node, evacuate
> > everything that you will or could evacuate, and only unfence after
> > that. Never allow the possibility for the node to come back up until
> > you've evacuated everything you might possibly evacuate. Anything else
> > is playing with fire, as far as I'm concerned.
> 
> Hopefully that's sufficiently clear that allowing the source node to come
> back up is not okay. If not, this later quote from the same conversation
> addresses the fact that telling nova that the source is down does not
> prevent it from doing damage if it's allowed to restart:

I went back to read the thread again and we were definitely on the same page at this point.  I think the disconnect happened in the "I know better, do the evacuation!" flag sub-thread.

You talked about possible bad things but didn't elaborate and things seemed to JustWork(tm) when I artificially created situations where nova started on the fenced compute node before the evacuations were initiated.

Do we think that the ordering change is triggering this bug, or an orthogonal (but equally important) issue? Personally I've got a suspicion that I've seen the effects of this bug even with the old arrangement (but never had the data to back it up).

> 
> > Marking it down through the API is independent of the
> > nova-discovered liveness check and it should be sticky. Meaning, it
> > should stay forced down until you un-forced-down it, even if the
> > compute node checks back in. Thus, no need to pollute the
> > instance-focused evacuate call with host-focused details.
> > 
> > However, it's only for the evacuate check, so failure to fence the
> > node such that it comes back online before you're ready still means
> > you can hit all the badness that the check is supposed to prevent.
> 
> We also just discussed two potential changes we could entertain in the
> future, but no promises.
> 
> 1. We could make nova-compute hang at startup, going into a poll loop on any
> instances it thinks have been evacuated from it, but are still in the
> process (or have not completed).
> 
> 2. We could make nova-compute hang at startup, waiting for its force-down
> flag to be removed before continuing.

Option 2 seems like it might be the more generically useful feature and in the spirit of the force down flag (making it more palatable upstream?)

> 
> In both cases, we'd just poll once a minute, logging a warning that we're in
> a holding pattern until the situation is resolved. I don't know how these
> would be received upstream, and I tend to think that this is more within the
> realm of the fencing agents to handle. That said, it doesn't seem
> unreasonable to at least propose this to help prevent nova from corrupting
> its own data.

We can go back to having pacemaker own the nova-compute service (so it can enforce the correct ordering), but we'd need to revisit the issue again once we move to k8s and pacemaker is out of the picture.

I like the idea of option 2 above.

Comment 14 Lee Yarwood 2018-04-19 10:40:35 UTC
tl;dr - For IHA on 12 & 13 our safest option is for the PIDONE DFG to ensure the source compute service remains stopped until evacuations are marked as done.

(In reply to Andrew Beekhof from comment #13)
> (In reply to Dan Smith from comment #12)
> > Andrew, I remember that conversation and just reviewed it to make sure I
> > hadn't told  you something incorrect. Here's the salient quote from that
> > conversation:
> > 
> > > IMHO, if you want to be airtight here, you fence the node, evacuate
> > > everything that you will or could evacuate, and only unfence after
> > > that. Never allow the possibility for the node to come back up until
> > > you've evacuated everything you might possibly evacuate. Anything else
> > > is playing with fire, as far as I'm concerned.
> > 
> > Hopefully that's sufficiently clear that allowing the source node to come
> > back up is not okay. If not, this later quote from the same conversation
> > addresses the fact that telling nova that the source is down does not
> > prevent it from doing damage if it's allowed to restart:
> 
> I went back to read the thread again and we were definitely on the same page
> at this point.  I think the disconnect happened in the "I know better, do
> the evacuation!" flag sub-thread.
> 
> You talked about possible bad things but didn't elaborate and things seemed
> to JustWork(tm) when I artificially created situations where nova started on
> the fenced compute node before the evacuations were initiated.

Yeah, starting the forced-down source compute before calling for evacuations is different: in that case instances would be destroyed (shut down) and the source compute service wouldn't race the destination in setting task_state, etc.
 
> Do we think that the ordering change is triggering this bug or an orthogonal
> (but equally important) issue?  Personally I've got a suspicion that I've
> seen the effects of this bug even with the old arrangement (but never had
> the data to back it up).

The latent race [1] seen in c#8 is caused by the forced down source compute starting _during_ the evacuation and rebuild on the destination. If we can eliminate that during IHA by fencing the source service and/or host until the evacuation migration records are `done` [2] then we can avoid the above issues and any other edge cases.

[1] https://review.openstack.org/#/c/562072/2/nova/tests/functional/regressions/test_bug_1764883.py
[2] https://developer.openstack.org/api-ref/compute/#list-migrations

> > > Marking it down through the API is independent of the
> > > nova-discovered liveness check and it should be sticky. Meaning, it
> > > should stay forced down until you un-forced-down it, even if the
> > > compute node checks back in. Thus, no need to pollute the
> > > instance-focused evacuate call with host-focused details.
> > > 
> > > However, it's only for the evacuate check, so failure to fence the
> > > node such that it comes back online before you're ready still means
> > > you can hit all the badness that the check is supposed to prevent.
> > 
> > We also just discussed two potential changes we could entertain in the
> > future, but no promises.
> > 
> > 1. We could make nova-compute hang at startup, going into a poll loop on any
> > instances it thinks have been evacuated from it, but are still in the
> > process (or have not completed).
> > 
> > 2. We could make nova-compute hang at startup, waiting for its force-down
> > flag to be removed before continuing.
> 
> Option 2 seems like it might be the more generically useful feature and in
> the spirit of the force down flag (making it more palatable upstream?)

Yeah, I agree.

> > In both cases, we'd just poll once a minute, logging a warning that we're in
> > a holding pattern until the situation is resolved. I don't know how these
> > would be received upstream, and I tend to think that this is more within the
> > realm of the fencing agents to handle. That said, it doesn't seem
> > unreasonable to at least propose this to help prevent nova from corrupting
> > its own data.
> 
> we can go back to having pacemaker own the nova-compute service (so it can
> enforce the correct ordering) but we'd need to revisit the issue again once
> we move to k8s and pacemaker is out of the picture.

I think we have to do this in the short term to ensure we don't hit this race and any other weird edge cases with IHA if the source compute comes back during the evacuations.

> i like the idea of option 2 above

Yup, same, but that's something to target at Rocky and 14 IMHO; I don't think we can deliver that any earlier.

Comment 15 Andrew Beekhof 2018-04-20 01:41:00 UTC
(In reply to Lee Yarwood from comment #14)
> tl;dr - For IHA on 12 & 13 our safest option is for the PIDONE DFG to ensure
> the source 
>         compute service remains stopped until evacuations are marked as done.

Can you be more specific about what done means to you?
Done as in the evacuation calls have been made, or done as in nova has finished performing the evacuations?

Because we've never tracked the latter and having to start now would be a major upheaval.

Comment 16 Andrew Beekhof 2018-04-20 02:16:10 UTC
(In reply to Lee Yarwood from comment #14)
> The latent race [1] seen in c#8 is caused by the forced down source compute
> starting _during_ the evacuation and rebuild on the destination. If we can
> eliminate that during IHA by fencing the source service and/or host until
> the evacuation migration records are `done` [2] then we can avoid the above
> issues and any other edge cases.
> 
> [1]
> https://review.openstack.org/#/c/562072/2/nova/tests/functional/regressions/
> test_bug_1764883.py
> [2] https://developer.openstack.org/api-ref/compute/#list-migrations

The docs suggest this is for live migrations. Will our kind show up there too?
Also, we need to support IHA back to OSP9... when was microversion 2.23 introduced?  (I tried googling)


> > i like the idea of option 2 above
> 
> Yup same but that's something to target at Rocky and 14 IMHO, I don't think
> we can deliver that any earlier.

Apart from a resourcing perspective, are you also worried about unintended consequences or just the realities of getting it merged in time?

Comment 17 Lee Yarwood 2018-04-20 10:52:01 UTC
(In reply to Andrew Beekhof from comment #15)
> (In reply to Lee Yarwood from comment #14)
> > tl;dr - For IHA on 12 & 13 our safest option is for the PIDONE DFG to ensure
> > the source 
> >         compute service remains stopped until evacuations are marked as done.
> 
> Can you be more specific about what done means to you?
> Done as in the evacuation calls have been made or done as in nova has
> finished performing the evacuations?

The actual migration record for an evacuation will move to `done` once the instance has been successfully rebuilt on the destination host [1]. Once that happens we can start the source compute service, which will clean up and finally mark the migration as `completed` [2].

[1] https://github.com/openstack/nova/blob/8a407bd288bb3116e50ea3b29a125caa036204cb/nova/compute/manager.py#L2952-L2954
[2] https://github.com/openstack/nova/blob/8a407bd288bb3116e50ea3b29a125caa036204cb/nova/compute/manager.py#L701-L702

> Because we've never tracked the latter and having to start now would be a
> major upheaval. 

Yeah understood, it's just the cleanest way to work around this latent race in Nova at the moment.

(In reply to Andrew Beekhof from comment #16)
> (In reply to Lee Yarwood from comment #14)
> > The latent race [1] seen in c#8 is caused by the forced down source compute
> > starting _during_ the evacuation and rebuild on the destination. If we can
> > eliminate that during IHA by fencing the source service and/or host until
> > the evacuation migration records are `done` [2] then we can avoid the above
> > issues and any other edge cases.
> > 
> > [1]
> > https://review.openstack.org/#/c/562072/2/nova/tests/functional/regressions/
> > test_bug_1764883.py
> > [2] https://developer.openstack.org/api-ref/compute/#list-migrations
> 
> The docs suggest this is for live migrations. Will our kind show up there
> too?

Yes, you will just need to provide migration_type: "evacuation"; you can also filter by source_compute and status.

> Also, we need to support IHA back to OSP9... when was microversion 2.23
> introduced?  (I tried googling)

Ia92ecbe3c99082e3a34adf4fd29041b1a95ef21e [1] introduced 2.23 in nova 13.0.0 AKA OSP 9 so we should be okay to use it for IHA.

[1] https://review.openstack.org/#/c/258771/

> > > i like the idea of option 2 above
> > 
> > Yup same but that's something to target at Rocky and 14 IMHO, I don't think
> > we can deliver that any earlier.
> 
> Apart from a resourcing perspective, are you also worried about unintended
> consequences or just the realities of getting it merged in time?

Changing this behaviour in master ahead of M2 shouldn't be an issue, but getting it backported into upstream stable and downstream patches all the way to Mitaka would be challenging.

Comment 18 Andrew Beekhof 2018-04-20 12:19:01 UTC
(In reply to Lee Yarwood from comment #17)
> (In reply to Andrew Beekhof from comment #15)
> > (In reply to Lee Yarwood from comment #14)
> > > tl;dr - For IHA on 12 & 13 our safest option is for the PIDONE DFG to ensure
> > > the source 
> > >         compute service remains stopped until evacuations are marked as done.
> > 
> > Can you be more specific about what done means to you?
> > Done as in the evacuation calls have been made or done as in nova has
> > finished performing the evacuations?
> 
> The actual migration record for an evacuation will move to done once the
> instance has been successfully rebuilt on the destination host [1]. Once
> that happens we can then start the source compute service that will clean up
> and finally mark the migration as completed [2].

Ack

> 
> [1]
> https://github.com/openstack/nova/blob/
> 8a407bd288bb3116e50ea3b29a125caa036204cb/nova/compute/manager.py#L2952-L2954
> [2]
> https://github.com/openstack/nova/blob/
> 8a407bd288bb3116e50ea3b29a125caa036204cb/nova/compute/manager.py#L701-L702
> 
> > Because we've never tracked the latter and having to start now would be a
> > major upheaval. 
> 
> Yeah understood, it's just the cleanest way to workaround this latent race
> in Nova at the moment.

Ok. I think we have a plan to simulate option 2 with some python in the heat template (before kolla spins up the nova-compute process). All I need now is a way to query the state of the force-down flag.

> 
> (In reply to Andrew Beekhof from comment #16)
> > (In reply to Lee Yarwood from comment #14)
> > > The latent race [1] seen in c#8 is caused by the forced down source compute
> > > starting _during_ the evacuation and rebuild on the destination. If we can
> > > eliminate that during IHA by fencing the source service and/or host until
> > > the evacuation migration records are `done` [2] then we can avoid the above
> > > issues and any other edge cases.
> > > 
> > > [1]
> > > https://review.openstack.org/#/c/562072/2/nova/tests/functional/regressions/
> > > test_bug_1764883.py
> > > [2] https://developer.openstack.org/api-ref/compute/#list-migrations
> > 
> > The docs suggest this is for live migrations. Will our kind show up there
> > too?
> 
> Yes, you will just need to provide migration_type: "evacuation", you can
> also filter by source_compute and status.
> 
> > Also, we need to support IHA back to OSP9... when was microversion 2.23
> > introduced?  (I tried googling)
> 
> Ia92ecbe3c99082e3a34adf4fd29041b1a95ef21e [1] introduced 2.23 in nova 13.0.0
> AKA OSP 9 so we should be okay to use it for IHA.

great

> 
> [1] https://review.openstack.org/#/c/258771/
> 
> > > > i like the idea of option 2 above
> > > 
> > > Yup same but that's something to target at Rocky and 14 IMHO, I don't think
> > > we can deliver that any earlier.
> > 
> > Apart from a resourcing perspective, are you also worried about unintended
> > consequences or just the realities of getting it merged in time?
> 
> Changing this behaviour in master ahead of M2 shouldn't be an issue but
> getting it backported into upstream stable and downstream patches all the
> way to Mitaka would be challenging.

For now my concern is getting OSP13 right.
Without customer reports, it's not clear that closing this race is going to happen for older releases.

Comment 19 Lee Yarwood 2018-04-20 13:49:48 UTC
(In reply to Andrew Beekhof from comment #18)
> (In reply to Lee Yarwood from comment #17)
> > Yeah understood, it's just the cleanest way to work around this latent race
> > in Nova at the moment.
> 
> Ok. I think we have a plan to simulate option 2 with some python in the heat
> template (before kolla spins up the nova compute process).  All I need now
> is a way to query the state of the force-down flag.

os-services will allow you to query that :

https://developer.openstack.org/api-ref/compute/#list-compute-services
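With the service records in hand, the check itself is trivial. A sketch (the helper name is ours; it assumes microversion >= 2.11, which added the forced_down field to os-services responses):

```python
def host_forced_down(services, hostname, binary='nova-compute'):
    """Given service records from GET /os-services, report whether the
    compute service on `hostname` is still forced down.

    `services` is an iterable of objects with host, binary and
    forced_down attributes, as returned by novaclient's services.list().
    """
    for svc in services:
        if svc.host == hostname and svc.binary == binary:
            return bool(svc.forced_down)
    raise LookupError('no %s service found on %s' % (binary, hostname))
```

Wired up, `services` would come from something like `client.services.list(host=hostname, binary='nova-compute')` on a novaclient session negotiated at 2.11 or later (hedged; check the signature in the client version you ship).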

Comment 20 Lee Yarwood 2018-04-20 13:58:57 UTC
Marking this as Triaged from an openstack-nova POV, Andrew did you want to use this for the IHA changes? Happy to clone this out to a separate bug for openstack-nova if you do.

Comment 21 Andrew Beekhof 2018-04-22 22:14:57 UTC
I've created a script: https://bugzilla.redhat.com/attachment.cgi?id=1425440

Could someone run their eye over it to make sure we're doing it right?

Comment 22 Lee Yarwood 2018-04-25 12:33:00 UTC
Apologies Andrew, 

I was sure I had already posted this, but in any case:

18 def evacuations_done(connection, hostname): 
[..]
25         migrations = connection.migrations.list(host=hostname)
[..]
32         for migration in migrations:

I'd suggest further filtering these records by migration_type as we only care about 'evacuations' here and not 'live-migration' records.

55                 if migration.status == "running":                               
56                         return False        

I'd also flip this logic and return True only if the status is 'done'.

The forced_down logic looks good.
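Applying both suggestions, the relevant part of the script might look like this (a sketch written as a pure function over the migration records the attachment already fetches, so the filtering is explicit):

```python
def evacuations_done(migrations, hostname):
    """Return True only when every 'evacuation' record leaving
    `hostname` has reached status 'done'.

    Live-migration records and evacuations from other hosts are
    ignored; any evacuation not yet 'done' (running, accepted, failed)
    keeps the host held.
    """
    evacuations = [
        m for m in migrations
        if getattr(m, 'migration_type', None) == 'evacuation'
        and m.source_compute == hostname
    ]
    return all(m.status == 'done' for m in evacuations)
```

With the original script's connection object this would be called as something like `evacuations_done(connection.migrations.list(host=hostname), hostname)` (hedged against the attachment's actual variable names).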

Comment 27 Lee Yarwood 2018-05-08 08:36:40 UTC
*** Bug 1562747 has been marked as a duplicate of this bug. ***

Comment 38 errata-xmlrpc 2019-01-11 11:49:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

