Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1722846

Summary: Incorrect instance count on some of the compute nodes.
Product: Red Hat OpenStack
Reporter: sawaghma
Component: openstack-nova
Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED EOL
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: high
Priority: high
Version: 10.0 (Newton)
CC: dasmith, dbarhate, eglynn, gkadam, jhakimra, kchamart, lyarwood, mark.a.sloan, mburns, nlevinki, sbauza, sgordon, smooney, vromanso
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: All
OS: Linux
Last Closed: 2021-07-07 09:44:58 UTC
Type: Bug

Description sawaghma 2019-06-21 13:39:09 UTC
Description of problem:
A couple of the compute nodes are not able to sync their resources, which causes an incorrect "running VMs" count in both the UI and the CLI. This issue occurs only on some of the compute nodes.

Version-Release number of selected component (if applicable):


How reproducible:
Launch a number of instances on RHOSP 10 and compare against the actual instance count on the hypervisor.

Steps to Reproduce:
1.
2.
3.

Actual results:
On some hosts, the instance count is lower than the number of instances actually running when checked via the operations below:
a) UI:
   Horizon -> Hypervisors section

b) Database:
   MariaDB [nova]> select hypervisor_hostname, running_vms from compute_nodes;

c) CLI
   openstack hypervisor show <hypervisor_hostname>

Expected results:
The instance count reported by each of the operations below should match the number of running instances:
a) UI:
   Horizon -> Hypervisors section

b) Database:
   MariaDB [nova]> select hypervisor_hostname, running_vms from compute_nodes;

c) CLI
   openstack hypervisor show <hypervisor_hostname>
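
The three views above can also be cross-checked against libvirt on the compute node itself. A minimal sketch (the hypervisor hostname is a placeholder; the actual commands must be run in the deployed environment):

```shell
# On the affected compute node, count the guests libvirt actually runs:
#   virsh list --state-running --name | sed '/^$/d' | wc -l
# Compare with nova's view of the same host:
#   openstack hypervisor show <hypervisor_hostname> -f value -c running_vms

# Helper: succeed only when the two counts agree.
counts_match() {
    [ "$1" -eq "$2" ]
}
```

If the two numbers diverge, the nova-side count (DB, CLI, Horizon) is the stale one described in this bug.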
 

Additional info:
A new instance launch test did show the updated instance count for some time before it reverted to the old one. We observed an issue with the resource tracker: the compute service is not able to refresh/update the resource/instance count.

Comment 1 sawaghma 2019-06-27 06:27:26 UTC
Hi Team,

Any update on registered issue?

Regards,
Sagar W

Comment 6 Artom Lifshitz 2019-07-12 14:04:36 UTC
Might be related to [1] and its bug [2], in which the periodic task doesn't catch an exception raised from the hardware module.

[1] https://review.opendev.org/#/c/661208/
[2] https://launchpad.net/bugs/1829349

Comment 8 sawaghma 2019-07-22 12:27:54 UTC
Hi Team,

Any update?

Regards,
Sagar W

Comment 9 sawaghma 2019-07-29 06:41:57 UTC
Hi Team,

We are waiting for response!

Regards,
Sagar W

Comment 12 smooney 2019-08-13 12:53:17 UTC
the exception indicates the host has a vm that is requesting pinning to cores 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 20, 21, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 103, 104, 105, 106, 108, 109]
however only the following cores are free
[0, 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 88, 89, 91, 92, 93, 94, 95, 96, 97, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109]

this indicates that either the vm was live migrated and the cores used on the source host were not free on the destination host, or the vcpu pin set was
modified on a host with running vms.
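
The overlap can be checked directly with a set difference; a minimal sketch, with both lists copied verbatim from the exception output above:

```shell
# Cores the VM requests for pinning (from the exception).
requested="0 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 17 18 20 21 88 89 90 91 92 93 94 95 96 97 98 99 100 101 103 104 105 106 108 109"
# Cores currently free on the host (from the exception).
free="0 1 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18 19 20 21 88 89 91 92 93 94 95 96 97 99 100 101 102 103 104 105 106 107 108 109"

# comm -23 keeps lines unique to the first input; both inputs must
# share the same (lexical) sort order for comm to work.
missing=$(comm -23 <(printf '%s\n' $requested | sort) \
                   <(printf '%s\n' $free | sort) | sort -n | xargs)
echo "$missing"   # 2 10 90 98
```

The requested-but-not-free cores are 2 and 10 plus, most likely, their SMT siblings 90 and 98 (a sibling offset of 88 on this host), consistent with cores 2 and 10 having been removed from the pin set together with their siblings.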

in the former case we have a downstream-only check to prevent live migration with cpu pinning if the cpus are not free on the destination host.
we also added a workaround config option to disable that downstream-only behavior; however, if you disable it and continue
to live migrate numa instances, the operator is required to ensure the destination host has the same cpus free.

similarly, if the admin modifies the vcpu_pin_set without first removing all pinned vms from the host, it is their responsibility to ensure that the new vcpu_pin_set
is valid for all vms currently on the host.


it is unclear whether the upstream patch is a correct fix, as incorrect numa topology information could result in vms being killed if there is not enough ram available
to support a new instance. as such it is not clear that the upstream review should be merged or backported.

in this specific case i think we need to confirm with the customer whether they have set cpu_pinning_migration_quick_fail=false in their nova.conf,
and whether they have performed live migrations of pinned guests or modified the vcpu_pin_set to remove cores 2 and 10.

Comment 13 sawaghma 2019-08-15 06:10:26 UTC
Hello Sean,

Please note that the user is not able to find the "cpu_pinning_migration_quick_fail=false" entry in nova.conf.

Also, the user has performed live migrations for a few instances, though he is not sure how to verify whether they were pinned instances.

Kindly let us know how to find out whether the instances are pinned.
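
For reference, one way to check is via the flavor extra specs: a pinned instance is normally booted from a flavor that carries hw:cpu_policy='dedicated'. A sketch (the instance UUID, flavor name, and libvirt domain name are placeholders):

```shell
# 1) Find the flavor the instance was booted from:
#      openstack server show <instance-uuid> -f value -c flavor
# 2) A pinned instance's flavor properties include hw:cpu_policy='dedicated':
#      openstack flavor show <flavor-name> -c properties
# 3) On the compute node itself, a pinned guest's libvirt XML carries
#    explicit <vcpupin> entries, one per vCPU:
#      virsh dumpxml <libvirt-domain> | grep -c '<vcpupin'

# Helper: succeed if flavor properties (read from stdin) request pinning.
is_pinned() {
    grep -q "hw:cpu_policy='dedicated'" -
}
```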

Regards,
Sagar W

Comment 14 dbarhate 2020-04-21 15:55:36 UTC
Hello Team,

We would appreciate it if someone could provide the current status of this bug.