Bug 1698050 - Octavia VM instance remains on host even though it's gone from Nova
Summary: Octavia VM instance remains on host even though it's gone from Nova
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Stephen Finucane
QA Contact: Alexander Stafeyev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-09 14:05 UTC by Darin Sorrentino
Modified: 2019-09-10 14:12 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-29 14:08:48 UTC
Target Upstream Version:
Embargoed:



Description Darin Sorrentino 2019-04-09 14:05:44 UTC
Description of problem:

When I shut down a compute node hosting an Octavia LB instance, the LB goes into ERROR state and the VM is no longer listed in "openstack server list --all". However, when I restart the compute node and use virsh to look at the VM instances on it, the LB instance is still there.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. openstack loadbalancer create...
2. openstack server list --all # identify the VM
3. openstack server show -c OS-EXT-SRV-ATTR:host -c OS-EXT-SRV-ATTR:instance_name <UUID OF LB SERVER INSTANCE> # Note the host and the libvirt instance name
4. source ~/stackrc # switch to undercloud to shut host down
5. openstack server list | grep <NAME OF HOST> # get UUID of server instance
6. openstack baremetal node list | grep <UUID of server instance> # Identify baremetal node
7. openstack baremetal node power off <UUID OF NODE>
8. source ~/overcloudrc
9. openstack loadbalancer list

Wait until the loadbalancer goes into error state

10. openstack server list --all # Note VM is gone
11. source ~/stackrc
12. openstack baremetal node power on <UUID OF NODE>

Wait for node to power back up

13. ssh heat-admin@<NODE IP>
14. sudo virsh list --all

You'll see the instance name from step 3 listed as "shut off".
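For illustration, a hedged sketch of what steps 13-14 might look like on the compute node; the domain name below is a placeholder for the instance name recorded in step 3:

$ sudo virsh list --all
 Id    Name                 State
-----------------------------------
 -     instance-0000002a    shut off

The domain still exists in libvirt even though "openstack server list --all" no longer returns the amphora.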

Even if you delete the loadbalancer in the overcloud, the instance remains.

Actual results:

After the compute node comes back up, the libvirt domain for the load balancer instance is still present on the hypervisor (listed as "shut off" by virsh), even though Nova no longer lists the instance and even after the load balancer is deleted.

Expected results:

Some sort of consistency between OpenStack and the hypervisor: either the VM should remain in OpenStack and stay connected to the instance on the host, or the host should be cleaned up when the instance is removed from OpenStack. Maintaining the connection between OpenStack and the hypervisor and allowing the instance to be restarted would be the ideal situation.


Additional info:

Comment 1 Carlos Goncalves 2019-04-17 14:35:12 UTC
Octavia issued a Nova delete, but Nova timed out trying to delete the instance because the compute node hosting it was down. Nova gave up, yet the instance still lived on that compute node. This is an instance life-cycle issue in Nova that should be handled there; there is nothing else Octavia could have done to fix it. In some circumstances, Octavia tries to kill "zombie" amphorae, but since Nova has no record of the instance, Octavia cannot issue an instance delete either.

Compute DFG, is there a bug report tracking this behavior that could be linked?

Comment 3 Stephen Finucane 2019-04-24 15:12:26 UTC
This should be handled by nova automatically if the 'running_deleted_instance_action' config option is configured to 'reap' (the default) [1]. It is worth noting that this is a periodic task and that period is configured by the 'running_deleted_instance_poll_interval' config option which defaults to 1800 seconds (30 minutes) [2]. Did you wait sufficient time for nova to clean this up?

[1] https://docs.openstack.org/nova/rocky/configuration/config.html#DEFAULT.running_deleted_instance_action
[2] https://docs.openstack.org/nova/rocky/configuration/config.html#DEFAULT.running_deleted_instance_poll_interval
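For reference, a sketch of the corresponding nova.conf settings on the compute node, assuming the documented defaults (this is not a dump of this environment's configuration):

[DEFAULT]
# Reap (delete) guests that are still present on the hypervisor but are
# already marked deleted in the Nova database.
running_deleted_instance_action = reap
# Interval in seconds for the periodic task that performs the check above.
running_deleted_instance_poll_interval = 1800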

Comment 4 Stephen Finucane 2019-05-10 13:47:15 UTC
Any updates, Darin?

Comment 6 Stephen Finucane 2019-05-16 16:42:04 UTC
OK, is it possible to reproduce, in that case? This will be tough to diagnose without some form of log. In addition, could you check whether an instance with the name reported by 'virsh' (likely 'instance-NNNN') exists in the 'instances' table of the nova database? It should be present and have the 'deleted' and possibly the 'cleaned' fields set. A dump of that row would be appreciated, if so.
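For illustration, a hedged sketch of the kind of query that could dump that row; the connection details depend on the deployment, and with the default instance_name_template ('instance-%08x') the hex suffix of the virsh domain name corresponds to the instances.id column:

# 0x2a = 42, taken from the placeholder domain name instance-0000002a
$ mysql nova -e "SELECT uuid, display_name, host, vm_state, deleted, cleaned FROM instances WHERE id = 42\G"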

Comment 8 Darin Sorrentino 2019-05-28 15:08:58 UTC
I was unable to reproduce this in my lab.  I've reached out to the customer to see if I can get data from the environment that had this issue.  If I don't hear back from them by EOW, I will request to close this BZ as unable to reproduce.

Leaving needinfo until then.

Comment 9 Darin Sorrentino 2019-05-29 14:08:48 UTC
I checked the output from the script I provided to the customer, and the instances are no longer there.  I am closing this BZ as insufficient data.

