Bug 2066222

Summary: Large scale VM migration is failing due to different HT configurations on nodes
Product: Container Native Virtualization (CNV)
Reporter: Boaz <bbenshab>
Component: Virtualization
Assignee: Barak <bmordeha>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Kedar Bidarkar <kbidarka>
Severity: high
Priority: high
Docs Contact:
Version: 4.9.2
CC: acardace, akamra, danken, fdeutsch, sgott, ycui
Target Milestone: ---
Keywords: Scale
Target Release: 4.13.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-02-27 15:53:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Boaz 2022-03-21 09:42:46 UTC
Some background:
-------------------------
I'm running a scale OpenShift setup with 100 OpenShift nodes in preparation for a customer-requested environment, with 47 RHCS 5.0 hosts serving as an external storage cluster.

This setup is currently running 3000 VMs:
1500 RHEL 8.5 persistent-storage VMs
500 Windows 10 persistent-storage VMs
1000 Fedora ephemeral-storage VMs

The workers are divided into 3 zones:
worker000 - worker031 = Zone0
worker032 - worker062 = Zone1
worker063 - worker096 = Zone2

I start the migration by applying an empty MachineConfig to the zone,
which then causes its nodes to start draining (a minimal sketch of such a MachineConfig follows).
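
A minimal sketch of that kind of "empty" MachineConfig, assuming the zone's MachineConfigPool uses a role label such as "zone0" (the name and label here are assumptions, not taken from this cluster); applying it makes the Machine Config Operator roll through the pool, cordoning and draining each node in turn:
----
# Assumed pool role label "zone0"; any rendered-config change triggers the rollout.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-zone0-empty-trigger-drain
  labels:
    machineconfiguration.openshift.io/role: zone0
spec:
  config:
    ignition:
      version: 3.2.0
EOF
----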

This is my migration config:
----
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 11
    parallelOutboundMigrationsPerNode: 22
    progressTimeout: 150
  workloads: {}
----
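
These settings live under spec.liveMigrationConfig of the HyperConverged CR, so, as a sketch assuming the default CNV object name and namespace, they can be inspected and tuned in place:
----
# Assumes the default HCO CR name/namespace shipped with CNV.
oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv \
  -o jsonpath='{.spec.liveMigrationConfig}'

oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge \
  -p '{"spec":{"liveMigrationConfig":{"parallelMigrationsPerCluster":11,"parallelOutboundMigrationsPerNode":22}}}'
----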

Another thing worth mentioning is that I'm running a custom KubeletConfig, which is required due to the additional 21,400 pods on the cluster (the full CR is sketched after the snippet):
----
spec:
  kubeletConfig:
    kubeAPIBurst: 200
    kubeAPIQPS: 100
    maxPods: 500
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled
----
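
A sketch of the complete KubeletConfig CR wrapping the spec above (the metadata name is an assumption); note that the target MachineConfigPool has to carry the matching "custom-kubelet: enabled" label for the selector to pick it up:
----
# Assumed CR name; the spec fields are the ones quoted above.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet-max-pods
spec:
  kubeletConfig:
    kubeAPIBurst: 200
    kubeAPIQPS: 100
    maxPods: 500
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled
EOF
----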

Issue number 1:
The first problem I encountered was that right after starting the migration,
it got stuck for a few hours and nothing happened.
I also tried to manually run virtctl migrate for a few of the VMs that were scheduled on cordoned nodes, and the CLI was failing due to timeouts.
I resolved that by patching virt-api to run additional replicas; this issue is already discussed at https://github.com/kubevirt/kubevirt/issues/7101

----
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  annotations:
    deployOVS: "false"
    kubevirt.kubevirt.io/jsonpatch: '[{"op": "add", "path": "/spec/customizeComponents/patches",
      "value": [{"resourceType": "Deployment", "resourceName": "virt-api", "type":
      "json", "patch": "[{\"op\": \"replace\", \"path\": \"/spec/replicas\", \"value\":
      5}]"}]}]'
----
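
As a usage sketch, the same annotation can be set on the existing HyperConverged CR with oc annotate (assuming the default CNV object name and namespace); HCO then propagates the customizeComponents patch and the virt-api Deployment scales out to 5 replicas:
----
# Assumes the default HCO CR name/namespace; the annotation value is the one quoted above.
oc annotate --overwrite hyperconverged kubevirt-hyperconverged -n openshift-cnv \
  kubevirt.kubevirt.io/jsonpatch='[{"op": "add", "path": "/spec/customizeComponents/patches", "value": [{"resourceType": "Deployment", "resourceName": "virt-api", "type": "json", "patch": "[{\"op\": \"replace\", \"path\": \"/spec/replicas\", \"value\": 5}]"}]}]'

# Verify the scale-out took effect.
oc get deployment virt-api -n openshift-cnv
----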

Issue number 2:
Once the migration started running I hoped that was it; however, a few VMs are currently failing to migrate for various reasons. These are the VMs:

rhel82-vm0074   3d23h   Migrating   True
rhel82-vm0188   3d22h   Migrating   True
rhel82-vm0253   3d21h   Migrating   True
rhel82-vm0443   3d19h   Migrating   True
rhel82-vm0451   3d19h   Migrating   True
rhel82-vm0611   3d18h   Migrating   True
rhel82-vm0784   3d17h   Migrating   True
rhel82-vm1184   3d14h   Migrating   True
rhel82-vm1428   3d12h   Migrating   True

Here are a few examples:


VM rhel82-vm0451 - running on worker031, failing due to an assertion in kvm_buf_set_msrs
---------------------------------------
  Type     Reason            Age                  From                         Message
  ----     ------            ----                 ----                         -------
  Normal   SuccessfulCreate  148m                 disruptionbudget-controller  Created Migration kubevirt-evacuation-mvr7r
  Normal   PreparingTarget   143m                 virt-handler                 Migration Target is listening at 10.131.44.5, on ports: 36763,37373
  Normal   PreparingTarget   143m (x24 over 11h)  virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Warning  Migrated          143m                 virt-handler                 VirtualMachineInstance migration uid b9fd0b54-26a5-4063-bd46-7b1e5dbeddd5 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
  Normal   SuccessfulCreate  89m                  disruptionbudget-controller  Created Migration kubevirt-evacuation-nvjbk
  Normal   PreparingTarget   85m                  virt-handler                 Migration Target is listening at 10.130.2.5, on ports: 37595,32775
  Normal   PreparingTarget   85m (x12 over 10h)   virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Warning  Migrated          80m                  virt-handler                 VirtualMachineInstance migration uid 7c9f1f23-43e3-4af7-b423-6b0088cf563f failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
  Normal   SuccessfulCreate  35m                  disruptionbudget-controller  Created Migration kubevirt-evacuation-7fstf
  Normal   PreparingTarget   33m                  virt-handler                 Migration Target is listening at 10.128.44.6, on ports: 38759,34661
  Warning  Migrated          27m                  virt-handler                 VirtualMachineInstance migration uid 5b8e1b88-1b03-4668-9e1a-755989f7c868 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-03-21T08:48:41.882103Z qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff
qemu-kvm: ../target/i386/kvm.c:2701: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.')
  Normal   SuccessfulCreate  27m                    disruptionbudget-controller  Created Migration kubevirt-evacuation-ts6nn
  Normal   PreparingTarget   23m                    virt-handler                 Migration Target is listening at 10.128.44.6, on ports: 41015,35881
  Warning  Migrated          17m                    virt-handler                 VirtualMachineInstance migration uid 74324d10-c606-41b2-8c8c-baeb94ccaa04 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
  Normal   SuccessfulCreate  14m                    disruptionbudget-controller  Created Migration kubevirt-evacuation-6sqfb
  Normal   SuccessfulUpdate  13m (x32 over 11h)     virtualmachine-controller    Expanded PodDisruptionBudget kubevirt-disruption-budget-cnqj5
  Normal   PreparingTarget   8m53s (x2 over 8m53s)  virt-handler                 Migration Target is listening at 10.128.44.6, on ports: 46429,34129
  Normal   PreparingTarget   8m52s (x13 over 33m)   virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Normal   Migrating         8m52s (x116 over 11h)  virt-handler                 VirtualMachineInstance is migrating.
  Normal   SuccessfulUpdate  7m47s (x32 over 11h)   disruptionbudget-controller  shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-cnqj5)
  Warning  SyncFailed        3m40s (x32 over 10h)   virt-handler                 server error. command Migrate failed: "migration job already executed"
  Warning  Migrated          3m40s                  virt-handler                 VirtualMachineInstance migration uid ed604e03-8658-4739-9875-95b88f2e0dd0 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')


VM rhel82-vm0660 - running on worker031, failing due to what seems to be a race condition
---------------------------------------
  Type     Reason            Age                    From                         Message
  ----     ------            ----                   ----                         -------
  Normal   SuccessfulCreate  151m                   disruptionbudget-controller  Created Migration kubevirt-evacuation-6zlsd
  Normal   PreparingTarget   145m                   virt-handler                 Migration Target is listening at 10.131.0.7, on ports: 45093,37935
  Normal   PreparingTarget   145m (x12 over 7h47m)  virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Warning  Migrated          140m                   virt-handler                 VirtualMachineInstance migration uid 09832eb9-bed5-403c-9020-3e2f586a41e7 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
  Normal   SuccessfulCreate  92m                    disruptionbudget-controller  Created Migration kubevirt-evacuation-n8267
  Normal   SuccessfulUpdate  91m (x11 over 11h)     virtualmachine-controller    Expanded PodDisruptionBudget kubevirt-disruption-budget-twdcv
  Normal   PreparingTarget   88m                    virt-handler                 Migration Target is listening at 10.128.4.6, on ports: 39099,40877
  Normal   Migrating         88m (x35 over 11h)     virt-handler                 VirtualMachineInstance is migrating.
  Normal   PreparingTarget   88m (x8 over 8h)       virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Normal   SuccessfulUpdate  87m (x11 over 11h)     disruptionbudget-controller  shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-twdcv)
  Warning  SyncFailed        83m (x11 over 11h)     virt-handler                 server error. command Migrate failed: "migration job already executed"
  Warning  Migrated          83m                    virt-handler                 VirtualMachineInstance migration uid fb9e3a06-21ab-4e54-8e6e-861f44bbee36 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
---------------------------------------



VM rhel82-vm0836 - running on worker024, failing again due to what seems to be a race condition and failed API calls.
---------------------------------------


  Type     Reason            Age                    From                         Message
  ----     ------            ----                   ----                         -------
  Normal   SuccessfulCreate  168m                   disruptionbudget-controller  Created Migration kubevirt-evacuation-ddrdp
  Normal   PreparingTarget   165m                   virt-handler                 Migration Target is listening at 10.131.0.7, on ports: 46719,42575
  Normal   PreparingTarget   165m (x12 over 7h43m)  virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Warning  Migrated          159m                   virt-handler                 VirtualMachineInstance migration uid 61b4e7c8-1348-4bac-9768-6a6be9c129e0 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
  Normal   SuccessfulCreate  140m                   disruptionbudget-controller  Created Migration kubevirt-evacuation-pnbpm
  Normal   PreparingTarget   135m (x9 over 3h34m)   virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Normal   PreparingTarget   135m (x2 over 135m)    virt-handler                 Migration Target is listening at 10.130.30.5, on ports: 35207,37223
  Warning  Migrated          129m                   virt-handler                 VirtualMachineInstance migration uid acca7d9e-723e-4f3c-adad-97d432db3a1b failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-03-21T07:18:04.563275Z qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff
qemu-kvm: ../target/i386/kvm.c:2701: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.')
  Normal   SuccessfulCreate  124m                 disruptionbudget-controller  Created Migration kubevirt-evacuation-frtqg
  Normal   PreparingTarget   120m                 virt-handler                 Migration Target is listening at 10.131.44.5, on ports: 45403,37683
  Normal   PreparingTarget   120m (x8 over 8h)    virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Warning  Migrated          120m                 virt-handler                 VirtualMachineInstance migration uid 0f6f18b5-3fde-4beb-8a9c-c112b7f8da02 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
  Normal   SuccessfulCreate  107m                 disruptionbudget-controller  Created Migration kubevirt-evacuation-wsjrc
  Normal   PreparingTarget   102m (x17 over 11h)  virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Normal   PreparingTarget   102m                 virt-handler                 Migration Target is listening at 10.130.2.5, on ports: 35511,35451
  Warning  Migrated          95m                  virt-handler                 VirtualMachineInstance migration uid ab614074-848a-4ee4-8c37-07d4fcfbd872 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-03-21T07:51:10.157656Z qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff
qemu-kvm: ../target/i386/kvm.c:2701: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.')
  Normal   SuccessfulCreate  11m                    disruptionbudget-controller  Created Migration kubevirt-evacuation-zv85q
  Normal   SuccessfulUpdate  10m (x38 over 16h)     virtualmachine-controller    Expanded PodDisruptionBudget kubevirt-disruption-budget-ln2kq
  Normal   PreparingTarget   6m39s                  virt-handler                 Migration Target is listening at 10.128.44.6, on ports: 42829,45929
  Normal   Migrating         6m38s (x126 over 16h)  virt-handler                 VirtualMachineInstance is migrating.
  Normal   PreparingTarget   6m38s (x4 over 6m39s)  virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Normal   SuccessfulUpdate  5m14s (x38 over 16h)   disruptionbudget-controller  shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-ln2kq)
  Warning  SyncFailed        95s (x38 over 16h)     virt-handler                 server error. command Migrate failed: "migration job already executed"
  Warning  Migrated          95s                    virt-handler                 VirtualMachineInstance migration uid 1d146437-b031-47f7-accb-9ca42e960025 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
---------------------------------------

Versions of all relevant components:
CNV	4.9.2
RHCS	5.0
OCP     4.9.15


CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather-failed-migration.tar.gz

Comment 1 Boaz 2022-03-22 15:42:12 UTC
I'm currently exploring different options in order to get as safe and fast a migration as possible; currently testing:

  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 20
    parallelOutboundMigrationsPerNode: 4
    progressTimeout: 150



and

  maxUnavailable: 10
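
The maxUnavailable knob lives on the MachineConfigPool; as a sketch (the pool name "zone0" is an assumption), raising it to 10 lets up to 10 nodes in that pool be cordoned and drained in parallel:
----
# Assumed pool name; adjust per zone.
oc patch machineconfigpool zone0 --type=merge -p '{"spec":{"maxUnavailable":10}}'
----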


Unfortunately, nmstate is failing (and recovering) for various reasons during the migration:

[root@e26-h01-000-r640 logs]# oc get pod -n openshift-cnv|grep nmstate-handler|egrep '[0-9]m ago'
nmstate-handler-2hcp9                                 1/1     Running             6 (160m ago)     16d
nmstate-handler-4wlpt                                 1/1     Running             6 (131m ago)     16d
nmstate-handler-595zt                                 1/1     Running             8 (27m ago)      16d
nmstate-handler-5crvq                                 1/1     Running             8 (38m ago)      16d
nmstate-handler-66w2m                                 1/1     Running             8 (45m ago)      16d
nmstate-handler-7nzwf                                 1/1     Running             12 (158m ago)    16d
nmstate-handler-9x2pp                                 1/1     Running             6 (137m ago)     16d
nmstate-handler-btrc5                                 1/1     Running             6 (170m ago)     16d
nmstate-handler-cdxzm                                 1/1     Running             8 (38m ago)      16d
nmstate-handler-d4jtt                                 1/1     Running             8 (23m ago)      16d
nmstate-handler-d75x7                                 1/1     Running             8 (46m ago)      16d
nmstate-handler-fff7f                                 1/1     Running             8 (40m ago)      16d
nmstate-handler-g49cf                                 1/1     Running             8 (29m ago)      16d
nmstate-handler-gpg28                                 1/1     Running             8 (29m ago)      16d
nmstate-handler-gtpwl                                 1/1     Running             14 (73m ago)     16d
nmstate-handler-h4hk4                                 1/1     Running             8 (26m ago)      16d
nmstate-handler-j66dd                                 1/1     Running             8 (39m ago)      16d
nmstate-handler-k4jmv                                 1/1     Running             8 (52m ago)      16d
nmstate-handler-k6xxs                                 1/1     Running             8 (46m ago)      16d
nmstate-handler-m8bzx                                 1/1     Running             8 (52m ago)      16d
nmstate-handler-mg9xs                                 1/1     Running             8 (39m ago)      16d
nmstate-handler-n5xzx                                 1/1     Running             6 (3h6m ago)     16d
nmstate-handler-n7v7b                                 1/1     Running             6 (163m ago)     16d
nmstate-handler-nvhmg                                 1/1     Running             8 (73m ago)      16d
nmstate-handler-sg2nm                                 1/1     Running             8 (45m ago)      16d
nmstate-handler-x6rxg                                 1/1     Running             14 (40m ago)     16d
nmstate-handler-zmh5z                                 1/1     Running             6 (127m ago)     16d

Comment 3 sgott 2022-03-23 12:37:14 UTC
Hi Boaz,

https://bugzilla.redhat.com/show_bug.cgi?id=1439078 has a similar log message, which leads us to believe there might be a difference in the way the BIOS is configured between nodes in this cluster. In particular, check whether hyperthreading is configured the same way on all nodes (a quick check is sketched below).

If you're not specifying a default CPU type, the node's own CPU model will be used (host-model), and the flags could differ even between otherwise identical CPUs.

It appears that worker031 is the preferred target for the scheduler in this case, and since it's having trouble, it keeps being re-selected for each subsequent migration (and failure).
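
A sketch of one way to compare this across nodes from the cluster side (assuming oc debug access to the hosts; the node selector may need adjusting):
----
# "Thread(s) per core: 1" means hyperthreading/SMT is off on that node.
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "== ${node}"
  oc debug "${node}" -- chroot /host lscpu 2>/dev/null | grep -F 'Thread(s) per core'
done
----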

Comment 4 Boaz 2022-03-23 12:52:17 UTC
Hey Stu, 
I checked and indeed 5/100 nodes have hyperthreading disabled; worker031 is one of them.

Comment 5 Boaz 2022-04-25 07:39:15 UTC
Hey @sgott, now that we know this issue was caused by the non-homogeneous nodes (same model but hyperthreading disabled), the question remains:
is the kube-scheduler to blame, because it decided to schedule almost all the migrating VMs onto the specific nodes that had fewer cores available due to hyperthreading being disabled,
or the virt controller, which attempted to migrate VMs to those nodes?
This is a classic chicken-and-egg situation.

Comment 6 Boaz 2022-04-25 07:41:03 UTC
Just for the record, the workaround was to verify hyperthreading is enabled on all the nodes. I also reduced the severity because of the workaround.

Comment 7 Fabian Deutsch 2022-05-18 14:47:56 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1439078 seems to be related

Comment 8 sgott 2022-05-26 17:35:27 UTC
Per a conversation with the virt team, it should be noted that the CPU capabilities reported by libvirt can tell us whether hyperthreading is enabled on a node, by observing the siblings listed for each CPU (see the sketch below).
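
As a sketch, the host topology in libvirt's capabilities XML lists each CPU with a siblings attribute; with HT on, a core shows two sibling ids, and with HT off only one. One hedged way to look at it is via virsh inside a virt-launcher compute container on the node in question (the pod name and namespace here are placeholders, and virsh being available in that container is an assumption):
----
# Placeholder namespace/pod name; virsh availability in the compute container is assumed.
VM_NS=default
LAUNCHER_POD=virt-launcher-rhel82-vm0451-xxxxx
oc exec -n "${VM_NS}" "${LAUNCHER_POD}" -c compute -- virsh -r capabilities | grep -F "siblings="
# With HT on, expect entries like siblings='0,40'; with HT off, single ids like siblings='0'.
----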

Comment 9 sgott 2022-06-03 15:52:31 UTC
Deferring to the next release, as the workaround is to ensure hyperthreading matches on each node.

Comment 10 sgott 2022-08-16 14:27:48 UTC
Deferring this to the next release because, with testing so far, we've been unable to reproduce it. Without a clear reproducer, we don't have a path forward.

Boaz, can you please confirm the steps we need to take to reproduce this? We can't reproduce it with what has been given so far.

Comment 11 Barak 2022-08-17 13:08:21 UTC
I couldn't reproduce the problem on either a bare-metal cluster or a virtualized cluster (kubevirt-ci nodes).

When I migrate a VM from a node with HT enabled/disabled to a node with HT disabled/enabled, the migration succeeds in both cases.

The way I turn off HT:
Bare Metal cluster:
echo off > /sys/devices/system/cpu/smt/control

kubevirt-ci cluster:
Change the -smp flag of the qemu command for just one of the nodes:
-smp ${CPU}
   -> 
-smp 12,sockets=1,cores=6,threads=2
 

The way I verify whether HT is enabled:

Check whether lscpu on the node reports "Thread(s) per core: 1", which means HT is disabled.

Am I missing something?

Comment 12 Boaz 2022-11-08 16:06:46 UTC
(In reply to Barak from comment #11)
> Am I missing something?

Hey Barak,
I believe you have to disable HT in the BIOS in order to reproduce this (a note on telling the two states apart follows).
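
As a hedged sketch of why the distinction might matter when reproducing: the kernel's SMT control file reports the two states differently (this is an assumption about generic kernel behaviour, not something verified on these nodes):
----
# On the node under test:
cat /sys/devices/system/cpu/smt/control
#   "off"          -> SMT toggled off at runtime (echo off > .../smt/control)
#   "notsupported" -> SMT disabled in the BIOS, or the CPU has no SMT (assumed behaviour)
cat /sys/devices/system/cpu/smt/active    # 0 when no sibling threads are online
----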

Comment 14 Barak 2023-02-27 15:11:41 UTC
I disabled HT in the BIOS using IPMI over LAN and I still can't reproduce it.

lscpu of a node without HT:
[root@zeus35 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
BIOS Model name:     Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
Stepping:            7
CPU MHz:             800.076
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            28160K
NUMA node0 CPU(s):   0
NUMA node1 CPU(s):   1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

lscpu of a node with HT enabled:
[root@zeus32 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
BIOS Model name:     Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
Stepping:            7
CPU MHz:             800.058
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            28160K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

I tried migrating from both nodes and it worked.

Comment 15 Barak 2023-02-27 15:13:56 UTC
[bmordeha@fedora Work]$ kubectl get po -owide
NAME                                  READY   STATUS      RESTARTS   AGE     IP              NODE                             NOMINATED NODE   READINESS GATES
virt-launcher-vmi-migratable-98k6n    0/2     Completed   0          22m     192.168.75.8    zeus32.lab.eng.tlv2.redhat.com   <none>           1/1
virt-launcher-vmi-migratable-xc668    2/2     Running     0          16s     192.168.40.7    zeus35.lab.eng.tlv2.redhat.com   <none>           1/1
virt-launcher-vmi-migratable1-2dmzx   2/2     Running     0          8m29s   192.168.75.13   zeus32.lab.eng.tlv2.redhat.com   <none>           1/1
virt-launcher-vmi-migratable1-jw5bc   0/2     Completed   0          8m52s   192.168.40.6    zeus35.lab.eng.tlv2.redhat.com   <none>           1/1
[bmordeha@fedora Work]$ 
[bmordeha@fedora Work]$ kubectl get vmim 
NAME             PHASE       VMI
migration-job    Succeeded   vmi-migratable1
migration-job1   Succeeded   vmi-migratable

Comment 16 Barak 2023-02-27 15:53:00 UTC
Closing this for the following reasons:
- We don't have enough data to reproduce it.
- This happened with the old 4.9 version and there haven't been any similar bugs for a while.
This might already be fixed; let's see if someone else complains about a similar issue in the future.