Bug 1902393
| Summary: | Several VMs are running on hypervisor nodes while the Nova instances no longer exist | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alex Stupnikov <astupnik> |
| Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> |
| Status: | CLOSED DUPLICATE | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 13.0 (Queens) | CC: | amaron, dasmith, eglynn, jhakimra, kchamart, mwitt, sbauza, sgordon, vromanso |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-09 17:12:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alex Stupnikov
2020-11-28 12:40:49 UTC
I have encountered a similar situation in OSP 13 in which there is an inconsistency between the virsh view of the network and the Nova view. I stopped the servers controller-1 and controller-2 using virsh on the HW host:

```
[root@panther13 ~]# virsh list --all
setlocale: No such file or directory
 Id    Name            State
----------------------------------------------------
 6     undercloud-0    running
 18    controller-0    running
 20    compute-1       running
 21    compute-0       running
 -     controller-1    shut off
 -     controller-2    shut off
```

But when I check their status on the undercloud VM, they are all reported as active:

```
[stack@undercloud-0 ~]$ . stackrc; openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 50e68413-790a-409d-bc94-6c453eb0ceca | controller-1 | ACTIVE | ctlplane=192.168.24.14 | overcloud-full | controller |
| 321c345f-cbda-42d0-96f7-b2b32cf54d5d | compute-0    | ACTIVE | ctlplane=192.168.24.29 | overcloud-full | compute    |
| 77f95b40-daeb-4ba4-b077-f64dec86f2be | compute-1    | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | compute    |
| 8f1e0b3e-9687-416b-8721-a5ab22e4b0d7 | controller-2 | ACTIVE | ctlplane=192.168.24.38 | overcloud-full | controller |
| a13beae1-7d6b-4f4c-9021-83ac51b67eed | controller-0 | ACTIVE | ctlplane=192.168.24.15 | overcloud-full | controller |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
```

And when I try to ping the controller hosts at the IP displayed for each, I only get a response from the controller-2 IP, even though it is supposedly offline, while controller-0, which did not respond, is supposedly active.

Arieh, thank you for bringing this up: your comment showed how ambiguous the provided data was. I used this bug to report a situation where the instance no longer exists from Nova's perspective: there are no related DB records, the instance is not listed in the "openstack server list --all" output, etc. It is different from your situation, which seems to be caused by overcloud controller scheduling: by default in our virtualized labs we don't map overcloud hostnames to baremetal server names. As a result, the server names in the "nova list" output are the same as in "openstack baremetal node list" or "virsh list" on the hypervisor, but they are not related, and the controller-0 instance could be running on the controller-2 baremetal node...

Kind Regards, Alex.

Hi, in the absence of log data that encompasses the time when the nova instances corresponding to the orphaned or zombie VMs were deleted, we can only guess at the root cause of how the VMs became orphaned. There is one known way they can become orphaned that has been fixed in OSP13 z11 [1]; it is not clear to me what version of OSP13 the customer is currently running. In [1], the scenario where VMs can become orphaned is when their nova instances were deleted by the user while the nova-compute service hosting them was seen as "down" by the system. When a user requests deletion of a nova instance while the VM's nova-compute service is "down", nova will do something we call a local delete. This means that nova will delete the instance from a database perspective while the VM guest continues to exist or run on the compute host itself.
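To make the "local delete" scenario described above concrete, here is a hedged sketch of how it can be observed from the CLI; the host and server names are hypothetical and the console output is abbreviated:

```
# 1. nova-compute on compute-0 is currently reported "down"
#    (e.g. due to a network partition):
$ openstack compute service list --service nova-compute
# ... | compute-0 | nova-compute | enabled | down | ...

# 2. Deleting an instance hosted on compute-0 now triggers a local
#    delete: Nova removes the instance's database records but cannot
#    reach the host to destroy the guest.
$ openstack server delete my-instance

# 3. On compute-0 itself, the libvirt domain is still present/running:
$ virsh list --all
# ... instance-0000002a   running ...
```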
If and when nova-compute comes back "up", there is a periodic task [2][3][4] that will by default "reap" the VMs by destroying their libvirt domains and deleting the related instance files on the compute host. By default this task runs every 30 minutes.

Now, if the user deleted an instance while nova-compute was showing as "down" (this can happen because of a network partition, etc.), then the instance will be deleted from a database perspective, and **if** the 'nova-manage db archive_deleted_rows' cron command happens to run while nova-compute is still "down", the deleted nova instance database records will be entirely swept away such that nova no longer has any knowledge of them. If this happens, the VMs running on the compute host will never be removed, because nova doesn't know they exist. This is not really a bug/defect but an effect of running the 'nova-manage db archive_deleted_rows' command at an improper time.

To address this [1], we added the '--before' option to the 'nova-manage db archive_deleted_rows' command we provide in the nova cron job (a sketch of such an invocation follows at the end of this thread). This way, there is a minimum time window during which we will not sweep away deleted nova instance records, giving the system time to reap the hosted VM guests when nova-compute comes back up.

So, if the customer is on a release earlier than z11, you will want to apply a minor update to get the improved nova cron command for the database archive, which will prevent the problem. Note that the default age for archiving deleted rows is 90 days, and this can be adjusted via the NovaCronArchiveDeleteRowsAge tripleo parameter [5].

All of that said, if the customer still gets newly orphaned VMs after updating to z11, that will indicate a bug, and we will need log data covering the time range before and after the corresponding nova instances were requested to be deleted, so we can scrutinize everything that happened leading up to the instance deletion and the details around the VM guest not being removed at the same time.

Hope this helps.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1763329
[2] https://docs.openstack.org/nova/queens/configuration/config.html#DEFAULT.running_deleted_instance_action
[3] https://docs.openstack.org/nova/queens/configuration/config.html#DEFAULT.running_deleted_instance_poll_interval
[4] https://docs.openstack.org/nova/queens/configuration/config.html#DEFAULT.maximum_instance_delete_attempts
[5] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/overcloud_parameters/index

Hi Melanie. Thank you very much for the detailed and thorough reply. I think you have provided a reference to the correct bug: the customer is running RHOSP 13 z8. We will share your follow-up and tell him that it is OK to delete the orphaned instances manually.

Thank you, Alex.

Great, thank you Alex. I'm going to close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1763329 based on the current assessment. Please reopen this bug if the customer encounters newly orphaned VM guests after applying the minor update to z11. When reopening, please provide DEBUG-level logs for nova-api and for nova-compute on the host the VM is on, beginning at the delete request and covering at least a few hours after it (assuming the customer has DEFAULT.running_deleted_instance_poll_interval set to the default of 30 minutes).

*** This bug has been marked as a duplicate of bug 1763329 ***
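For reference, a hedged sketch of what the improved archive invocation could look like; the exact flags and the cron wrapper shipped with OSP 13 z11 may differ:

```
# Archive deleted rows, but only rows deleted before a cutoff date, so
# that recently deleted instances stay visible long enough for the
# nova-compute periodic reaper to clean up their guests. The 90-day
# window mirrors the NovaCronArchiveDeleteRowsAge default noted above.
$ nova-manage db archive_deleted_rows --until-complete \
    --before "$(date --date='90 days ago' '+%Y-%m-%d')"
```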