Bug 2094064

Summary: CPUUnpinningUnknown exception thrown after failed Live Migration for instance with dedicated CPUs
Product: Red Hat OpenStack
Reporter: James Parker <jparker>
Component: openstack-nova
Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: NEW
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: high
Priority: medium
Version: 17.0 (Wallaby)
CC: alifshit, bgibizer, dasmith, eglynn, jhakimra, kchamart, osp-dfg-compute, sbauza, sgordon, vromanso
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Flags: bgibizer: needinfo-
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description James Parker 2022-06-06 17:58:42 UTC
Description of problem: CPUUnpinningUnknown is raised after a failed live migration, leaving the guest stuck in ERROR state and impossible to delete.

2022-05-31 20:25:21.043 2 ERROR nova.compute.manager [req-e5bf5fea-3a44-4b7f-ba16-23cbea06506c f3517376835a44b28518009181dcb6ae 66b514c5acba4bdfb168ac7b08414aa8 - default default] [instance: 73317ed9-6f16-4a05-82e1-74e5be808e18] Setting instance vm_state to ERROR: nova.exception.CPUUnpinningUnknown: CPU set to unpin [24, 36] must be a subset of known CPU set [4, 6, 8, 10, 12, 14, 16, 18]

What I believe is happening is that the guest has vCPUs pinned to dedicated pCPUs and its live migration fails. The target host has a different range of pCPUs than the source:
[root@computesriov-1 nova]# crudini --get /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf compute cpu_shared_set
0,1,2,3
[root@computesriov-1 nova]# crudini --get /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf compute cpu_dedicated_set
4-19

[root@computesriov-0 heat-admin]# crudini --get /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf compute cpu_shared_set
20-23
[root@computesriov-0 heat-admin]# crudini --get /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf compute cpu_dedicated_set
24-39
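
For illustration only (this is not nova's code), here is a minimal sketch of the kind of subset check behind this error, using the values from the log above; the helper name is made up:

class CPUUnpinningUnknown(Exception):
    pass

def unpin_cpus(known_cpus, cpus_to_unpin):
    # Refuse to unpin CPUs that this host has never pinned.
    if not set(cpus_to_unpin) <= set(known_cpus):
        raise CPUUnpinningUnknown(
            "CPU set to unpin %s must be a subset of known CPU set %s"
            % (sorted(cpus_to_unpin), sorted(known_cpus)))
    return set(known_cpus) - set(cpus_to_unpin)

# The source host's NUMA cell only knows CPUs 4-18 (even), but the instance
# now carries the destination pins 24 and 36, so this raises.
unpin_cpus([4, 6, 8, 10, 12, 14, 16, 18], [24, 36])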

I think the XML generated for the target host is kept when the failed migration is rolled back. This results in the failure logged above and in the instance failing to be cleaned up:
(overcloud) [stack@undercloud-0 tempest-dir]$ openstack server list --all-projects
/usr/lib/python3.9/site-packages/ansible/_vendor/__init__.py:42: UserWarning: One or more Python packages bundled by this ansible-core distribution were already loaded (pyparsing). This may result in undefined behavior.
  warnings.warn('One or more Python packages bundled by this ansible-core distribution were already '
+--------------------------------------+---------------------------------------------+--------+----------+------------------------------+--------------------------------------------+
| ID                                   | Name                                        | Status | Networks | Image                        | Flavor                                     |
+--------------------------------------+---------------------------------------------+--------+----------+------------------------------+--------------------------------------------+
| 73317ed9-6f16-4a05-82e1-74e5be808e18 | tempest-LiveMigrationBase-server-1863882178 | ERROR  |          | cirros-0.5.2-x86_64-disk.img | tempest-LiveMigrationBase-flavor-682811067 |
+--------------------------------------+---------------------------------------------+--------+----------+------------------------------+--------------------------------------------+


Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220519.n.1

How reproducible:
100%

Steps to Reproduce:
1. Live migrate a guest with an SR-IOV port and hit [1]
2. Attempt to clean up the guest after the failure

Actual results:
After the live migration rollback, nova-compute repeatedly reports CPUUnpinningUnknown and the guest cannot be deleted

Expected results:
After the rollback, the guest is in ACTIVE state and can be deleted

Additional info:
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2089520
Cleanup failure: https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/job/DFG-compute-nova-17.0_director-1cont-2comp-ipv4-vxlan-sriov-vgpu-hybrid-phase3/70/testReport/(root)/(empty)/tearDownClass__whitebox_tempest_plugin_api_compute_test_live_migration_LiveMigrationBase_/
Testbed logs: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/staging/DFG-compute-nova-17.0_director-1cont-2comp-ipv4-vxlan-sriov-vgpu-hybrid-phase3/70/

Comment 1 Artom Lifshitz 2022-06-06 21:40:26 UTC
Looks like you're right.

Looking at the job test results posted in the description [1], the server that failed to delete was f2526ddb-3c71-4ffc-807c-9afb0a7cb6e9 (the UUID in the description must come from a previous run).

2022-05-24 23:19:43.594 [./DFG-compute-nova-17.0_director-1cont-2comp-ipv4-vxlan-sriov-vgpu-hybrid-phase3-70/computesriov-1/var/log/containers/nova/nova-compute.log] 2 ERROR nova.compute.manager [req-fab93c7d-7df9-40bc-9e43-b99215454597 edf747c75ef54378a1ac4388519de590 9887205d75674571afc0de0286d790b5 - default default] [instance: f2526ddb-3c71-4ffc-807c-9afb0a7cb6e9] Setting instance vm_state to ERROR: nova.exception.CPUUnpinningUnknown: CPU set to unpin [16, 4] must be a subset of known CPU set [32, 34, 36, 38, 24, 26, 28, 30]


DEBUG nova.virt.libvirt.migration [-] _update_numa_xml input xml=<domain type="kvm">
[ ... ]
                                          <cputune>
                                            <vcpupin vcpu="0" cpuset="36"/>
                                            <vcpupin vcpu="1" cpuset="24"/>
                                            <emulatorpin cpuset="24,36"/>
                                          </cputune>

DEBUG nova.virt.libvirt.migration [-] _update_numa_xml output xml=<domain type="kvm">
[ ... ]
                                          <cputune>
                                            <vcpupin vcpu="0" cpuset="16"/>
                                            <vcpupin vcpu="1" cpuset="4"/>
                                            <emulatorpin cpuset="4,16"/>
                                          </cputune>
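
For illustration (not nova's actual implementation), this is roughly the kind of cputune rewrite that debug output shows, with the vCPU-to-pCPU mapping taken from the values above; the helper and mapping here are hypothetical:

import xml.etree.ElementTree as ET

def repin_cputune(domain_xml, vcpu_to_pcpu):
    # Rewrite <vcpupin>/<emulatorpin> cpuset attributes for the destination host.
    root = ET.fromstring(domain_xml)
    cputune = root.find('cputune')
    for pin in cputune.findall('vcpupin'):
        pin.set('cpuset', str(vcpu_to_pcpu[int(pin.get('vcpu'))]))
    cputune.find('emulatorpin').set(
        'cpuset', ','.join(str(c) for c in sorted(vcpu_to_pcpu.values())))
    return ET.tostring(root, encoding='unicode')

src_xml = """<domain type="kvm">
  <cputune>
    <vcpupin vcpu="0" cpuset="36"/>
    <vcpupin vcpu="1" cpuset="24"/>
    <emulatorpin cpuset="24,36"/>
  </cputune>
</domain>"""

# Destination pinning from the output XML above: vCPU 0 -> pCPU 16, vCPU 1 -> pCPU 4.
print(repin_cputune(src_xml, {0: 16, 1: 4}))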

What's weird is that we never use the XML as a source of truth (except when modifying it before the live migration). So if the delete request is hitting this bug, it means we've accidentally applied the migration context to the instance and saved its destination PCPUs. There is a call to instance.apply_migration_context() in the _update_available_resource() periodic task, but if we were hitting that it would be a race, and not consistently 100% reproducible, which... is it?
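
To make that suspicion concrete, here is a hypothetical sketch of the suspected failure mode (illustrative objects, not nova's), using the pCPU values from the error above:

SOURCE_KNOWN = {24, 26, 28, 30, 32, 34, 36, 38}   # dedicated CPUs the source cell knows about

class FakeInstance:
    def __init__(self, pinned_pcpus, new_pinned_pcpus):
        self.pinned_pcpus = pinned_pcpus          # source pins while the guest runs there
        self.new_pinned_pcpus = new_pinned_pcpus  # pins chosen for the destination

    def apply_migration_context(self):
        # Correct after a *successful* migration; wrong if it was rolled back.
        self.pinned_pcpus = self.new_pinned_pcpus

inst = FakeInstance(pinned_pcpus={24, 36}, new_pinned_pcpus={4, 16})

# Suspected bug: the migration fails and is rolled back, but the destination
# topology gets applied anyway (e.g. via the _update_available_resource periodic).
inst.apply_migration_context()

# A later delete on the source host then tries to unpin CPUs that host never pinned:
if not inst.pinned_pcpus <= SOURCE_KNOWN:
    raise Exception("CPU set to unpin %s must be a subset of known CPU set %s"
                    % (sorted(inst.pinned_pcpus), sorted(SOURCE_KNOWN)))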

[1] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/staging/DFG-compute-nova-17.0_director-1cont-2comp-ipv4-vxlan-sriov-vgpu-hybrid-phase3/70/test_results/tempest-results-whitebox_plugin.1.html

Comment 2 Artom Lifshitz 2022-06-08 14:38:20 UTC
Might be https://bugs.launchpad.net/nova/+bug/1894095 ?