Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1902393

Summary: Several VMs are running on hypervisor nodes while their Nova instances no longer exist
Product: Red Hat OpenStack
Reporter: Alex Stupnikov <astupnik>
Component: openstack-nova
Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED DUPLICATE
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 13.0 (Queens)
CC: amaron, dasmith, eglynn, jhakimra, kchamart, mwitt, sbauza, sgordon, vromanso
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-12-09 17:12:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alex Stupnikov 2020-11-28 12:40:49 UTC
Description of problem:

A customer with RHOSP 13 deployments reported a situation in which multiple VMs exist on hypervisor nodes in running or shut-off states, but no longer exist from Nova's perspective.

We tried to understand the problem better and asked the customer to provide more information about the affected VMs, sosreports from the affected hypervisors, and DB dumps. The provided data was helpful, but we were not able to isolate the root cause. Please find the summary below:

- there are no related logs: they were rotated, and the customer doesn't use any centralized logging system
- from the virsh perspective, the VMs are valid and have one ephemeral root disk drive and one or two NICs
- records related to the affected VMs could only be found in the consumers table of the nova_api DB

We asked the customer to check whether the corresponding Neutron ports were deleted and to provide other related sosreports. I will provide more information about the provided data privately.

Comment 3 Arieh Maron 2020-12-03 11:48:41 UTC
I have encountered a similar situation in OSP 13, in which there is an inconsistency between the virsh view of the environment and the Nova view:

I stopped the servers controller-1 and controller-2 using virsh on the HW host:
[root@panther13 ~]# virsh list --all
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 6     undercloud-0                   running
 18    controller-0                   running
 20    compute-1                      running
 21    compute-0                      running
 -     controller-1                   shut off
 -     controller-2                   shut off

But when I check their status from the undercloud VM, they are all reported as ACTIVE:

[stack@undercloud-0 ~]$ . stackrc ;openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 50e68413-790a-409d-bc94-6c453eb0ceca | controller-1 | ACTIVE | ctlplane=192.168.24.14 | overcloud-full | controller |
| 321c345f-cbda-42d0-96f7-b2b32cf54d5d | compute-0    | ACTIVE | ctlplane=192.168.24.29 | overcloud-full | compute    |
| 77f95b40-daeb-4ba4-b077-f64dec86f2be | compute-1    | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | compute    |
| 8f1e0b3e-9687-416b-8721-a5ab22e4b0d7 | controller-2 | ACTIVE | ctlplane=192.168.24.38 | overcloud-full | controller |
| a13beae1-7d6b-4f4c-9021-83ac51b67eed | controller-0 | ACTIVE | ctlplane=192.168.24.15 | overcloud-full | controller |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

And when I try to ping the controller hosts at the IPs displayed for each, I only get a response from the controller-2 IP, even though that node is supposedly offline, while controller-0, the only controller that is supposedly active, did not respond.

Comment 4 Alex Stupnikov 2020-12-03 11:55:13 UTC
Arieh, thank you for bringing this up: your comment shows how ambiguous the provided data was. I used this bug to report a situation in which an instance no longer exists from Nova's perspective: there are no related DB records, the instance is not listed in "openstack server list --all" output, etc. Your situation is different and seems to be caused by overcloud controller scheduling: by default, in our virtualized labs we don't map overcloud hostnames to baremetal server names. As a result, the server names in "nova list" output are the same as in "openstack baremetal node list" or "virsh list" on the hypervisor, but they are not related, and the controller-0 instance could be running on the controller-2 baremetal node...

Kind Regards, Alex.

Comment 6 melanie witt 2020-12-08 22:02:06 UTC
Hi, in the absence of log data covering the time when the nova instances corresponding to the orphaned or zombie VMs were deleted, we can only guess at the root cause of how the VMs became orphaned. There is one known way they can become orphaned that has been fixed in OSP13z11 [1] -- it is not clear to me which version of OSP13 the customer is currently running.

In [1], VMs can become orphaned if their nova instances were deleted by the user while the nova-compute service hosting them was seen as "down" by the system. When a user requests deletion of a nova instance while the VM's nova-compute service is "down", nova performs what we call a local delete. This means that nova deletes the instance from a database perspective while the VM guest continues to exist or run on the compute host itself. If and when nova-compute comes back "up", a periodic task [2][3][4] will by default "reap" the VMs by destroying their libvirt domains and deleting the related instance files on the compute host. By default this task runs every 30 minutes.
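The reap behavior described above is controlled by the nova.conf options referenced in [2][3][4]. A minimal sketch showing the Queens defaults (verify against the deployed nova.conf before relying on them):

```ini
# nova.conf on the compute host -- Queens defaults for the periodic task
# that cleans up guests whose instances were deleted in the database.
[DEFAULT]
# What to do with a running guest whose instance is deleted in the DB:
# 'reap' destroys the libvirt domain and deletes its instance files.
running_deleted_instance_action = reap
# How often the periodic task runs, in seconds (30 minutes).
running_deleted_instance_poll_interval = 1800
# How many times to retry a failed instance delete/cleanup.
maximum_instance_delete_attempts = 5
```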

Now, if the user deleted an instance while nova-compute was showing as "down" (this can happen because of a network partition, etc.), the instance will be deleted from a database perspective, and **IF** the 'nova-manage db archive_deleted_rows' cron command happens to run while nova-compute is still "down", the deleted nova instance database records will be entirely swept away, such that nova no longer has any knowledge of them. If this happens, the VMs running on the compute host will never be removed, because nova doesn't know they exist.
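For deployments already in this state, a rough way to spot such orphans is to cross-check the libvirt domains against nova. A minimal sketch, assuming the libvirt driver (where the domain UUID equals the nova instance UUID) and sourced overcloud credentials, with virsh and the openstack CLI on the path:

```shell
# List libvirt domains on this compute host that nova no longer knows about.
command -v virsh >/dev/null 2>&1 || exit 0   # skip quietly where virsh is absent
for uuid in $(virsh list --all --uuid); do
    # If 'openstack server show' fails, nova has no record of this domain.
    if ! openstack server show "$uuid" >/dev/null 2>&1; then
        echo "orphan candidate: $uuid"
    fi
done
```

Any UUID printed should be confirmed against the nova_api database before deleting the domain by hand.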

This is not really a bug/defect but an effect of running the 'nova-manage db archive_deleted_rows' command at an improper time. To address it, [1] added the '--before' option to the 'nova-manage db archive_deleted_rows' command we provide in the nova cron job. This way, there is a minimum time window during which we will not sweep away deleted nova instance records, giving the system time to reap the hosted VM guests once nova-compute comes back up.
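As a sketch of the fixed invocation (the exact command line shipped in the z11 cron job may differ; the 90-day window mirrors the archive-age default), the '--before' cutoff can be built like this:

```shell
# Build a cutoff date 90 days in the past and print the archive command;
# run the printed command on a controller node where nova-manage is available.
CUTOFF="$(date --date='90 days ago' '+%Y-%m-%d')"
echo "nova-manage db archive_deleted_rows --before \"${CUTOFF}\" --until-complete"
```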

So, if the customer is on a release earlier than z11, you will want to apply a minor update to get the improved nova cron command for the database archive, which will prevent the problem. Note that the default age for archiving deleted rows is 90 days; this can be adjusted via the NovaCronArchiveDeleteRowsAge tripleo parameter [5].
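Adjusting that parameter would look something like the following tripleo environment file snippet (a sketch; the parameter name is from [5], the file name is illustrative, and 90 is the documented default):

```yaml
# custom-nova-cron.yaml (illustrative name) -- pass to the overcloud
# deploy/update command with '-e'.
parameter_defaults:
  # Only archive deleted rows older than this many days.
  NovaCronArchiveDeleteRowsAge: 90
```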

All of that said, if after updating to z11 the customer still gets newly orphaned VMs, that will indicate a bug, and we will need log data for the time range before and after the corresponding nova instances were requested to be deleted, so we can scrutinize everything that happened leading up to the instance deletion and the details around the VM guest not being removed at the same time.

Hope this helps.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1763329
[2] https://docs.openstack.org/nova/queens/configuration/config.html#DEFAULT.running_deleted_instance_action
[3] https://docs.openstack.org/nova/queens/configuration/config.html#DEFAULT.running_deleted_instance_poll_interval
[4] https://docs.openstack.org/nova/queens/configuration/config.html#DEFAULT.maximum_instance_delete_attempts
[5] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/overcloud_parameters/index

Comment 7 Alex Stupnikov 2020-12-09 10:18:01 UTC
Hi Melanie.

Thank you very much for the detailed and thorough reply. I think you have referenced the correct bug: the customer is running RHOSP 13 z8. We will share your follow-up and tell them that it is OK to delete the orphaned instances manually.

Thank you, Alex.

Comment 8 melanie witt 2020-12-09 17:12:19 UTC
Great, thank you Alex. I'm going to close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1763329 based on the current assessment. Please reopen this bug if the customer encounters newly orphaned VM guests after applying the minor update to z11. When reopening, please provide DEBUG-level logs for nova-api and for nova-compute on the host the VM is on, beginning at the delete request and covering at least a few hours afterward (assuming the customer has DEFAULT.running_deleted_instance_poll_interval set to the default of 30 minutes).

*** This bug has been marked as a duplicate of bug 1763329 ***