Bug 1698050 - Octavia VM instance remains on host even though it's gone from Nova
Summary: Octavia VM instance remains on host even though it's gone from Nova
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Stephen Finucane
QA Contact: Alexander Stafeyev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-09 14:05 UTC by Darin Sorrentino
Modified: 2019-09-10 14:12 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-29 14:08:48 UTC
Target Upstream Version:
Embargoed:



Description Darin Sorrentino 2019-04-09 14:05:44 UTC
Description of problem:

When I shut down a compute node hosting an Octavia LB instance, the LB goes into ERROR state and the VM is no longer listed in "openstack server list --all". However, when I restart the compute node and use virsh to look at the VM instances on it, the LB instance is still there.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. openstack loadbalancer create...
2. openstack server list --all # identify the VM
3. openstack server show -c OS-EXT-SRV-ATTR:host -c OS-EXT-SRV-ATTR:instance_name <UUID OF LB SERVER INSTANCE> # Note the host and the libvirt instance name
4. source ~/stackrc # switch to undercloud to shut host down
5. openstack server list | grep <NAME OF HOST> # get UUID of server instance
6. openstack baremetal node list | grep <UUID of server instance> # Identify baremetal node
7. openstack baremetal node power off <UUID OF NODE>
8. source ~/overcloudrc
9. openstack loadbalancer list

Wait until the loadbalancer goes into error state

10. openstack server list --all # Note VM is gone
11. source ~/stackrc
12. openstack baremetal node power on <UUID OF NODE>

Wait for node to power back up

13. ssh heat-admin@<NODE IP>
14. sudo virsh list --all

You'll see the instance name from step 3 listed as "shut off".
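For illustration, a hedged sketch of what steps 13-14 might look like on the compute node; the domain name below is a placeholder for the instance name recorded in step 3:

$ sudo virsh list --all
 Id    Name                 State
-----------------------------------
 -     instance-0000002a    shut off

The domain still exists in libvirt even though "openstack server list --all" no longer returns the amphora.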

Even if you delete the loadbalancer in the overcloud, the instance remains.

Actual results:

After the compute node comes back up, the libvirt domain for the load balancer instance is still present on the hypervisor (listed as "shut off" by virsh), even though Nova no longer lists the instance and even after the load balancer is deleted.

Expected results:

Some sort of consistency between OpenStack and the hypervisor: either the VM should remain in OpenStack and stay connected to the instance on the host, or the host should be cleaned up when the instance is removed from OpenStack. Maintaining the connection between OpenStack and the hypervisor and allowing the instance to be restarted would be the ideal situation.


Additional info:

Comment 1 Carlos Goncalves 2019-04-17 14:35:12 UTC
Octavia issued a Nova delete, but Nova timed out trying to delete the instance because the compute node hosting it was down. Nova gave up, yet the instance still lived on that compute node. This is an instance life-cycle issue in Nova that should be handled there; there is nothing else Octavia could have done to fix it. In some circumstances, Octavia tries to kill "zombie" amphorae, but since Nova has no record of the instance, Octavia cannot issue an instance delete either.

Compute DFG, is there a bug report tracking this behavior that could be linked?

Comment 3 Stephen Finucane 2019-04-24 15:12:26 UTC
This should be handled by nova automatically if the 'running_deleted_instance_action' config option is configured to 'reap' (the default) [1]. It is worth noting that this is a periodic task and that period is configured by the 'running_deleted_instance_poll_interval' config option which defaults to 1800 seconds (30 minutes) [2]. Did you wait sufficient time for nova to clean this up?

[1] https://docs.openstack.org/nova/rocky/configuration/config.html#DEFAULT.running_deleted_instance_action
[2] https://docs.openstack.org/nova/rocky/configuration/config.html#DEFAULT.running_deleted_instance_poll_interval
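For reference, a sketch of the corresponding nova.conf settings on the compute node, assuming the documented defaults (this is not a dump of this environment's configuration):

[DEFAULT]
# Reap (delete) guests that are still present on the hypervisor but are
# already marked deleted in the Nova database.
running_deleted_instance_action = reap
# Interval in seconds for the periodic task that performs the check above.
running_deleted_instance_poll_interval = 1800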

Comment 4 Stephen Finucane 2019-05-10 13:47:15 UTC
Any updates, Darin?

Comment 6 Stephen Finucane 2019-05-16 16:42:04 UTC
OK, is it possible to reproduce, in that case? This will be tough to diagnose without some form of log. In addition, could you check whether an instance with the name reported by 'virsh' (likely 'instance-NNNN') exists in the 'instances' table of the nova database? It should be present and have the 'deleted' and possibly the 'cleaned' fields set. A dump of that row would be appreciated, if so.
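For illustration, a hedged sketch of the kind of query that could dump that row; the connection details depend on the deployment, and with the default instance_name_template ('instance-%08x') the hex suffix of the virsh domain name corresponds to the instances.id column:

# 0x2a = 42, taken from the placeholder domain name instance-0000002a
$ mysql nova -e "SELECT uuid, display_name, host, vm_state, deleted, cleaned FROM instances WHERE id = 42\G"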

Comment 8 Darin Sorrentino 2019-05-28 15:08:58 UTC
I was unable to reproduce this in my lab.  I've reached out to the customer to see if I can get data from the environment that had this issue.  If I don't hear back from them by EOW, I will request to close this BZ as unable to reproduce.

Leaving needinfo until then.

Comment 9 Darin Sorrentino 2019-05-29 14:08:48 UTC
I checked the output from the script I provided to the customer, and the instances are no longer there.  I am closing this BZ as insufficient data.

