Bug 2066222
Summary: | Large scale VMs Migration is failing due to different HT configurations on nodes | |
---|---|---|---
Product: | Container Native Virtualization (CNV) | Reporter: | Boaz <bbenshab>
Component: | Virtualization | Assignee: | Barak <bmordeha>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Kedar Bidarkar <kbidarka>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.9.2 | CC: | acardace, akamra, danken, fdeutsch, sgott, ycui
Target Milestone: | --- | Keywords: | Scale
Target Release: | 4.13.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-02-27 15:53:00 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Boaz
2022-03-21 09:42:46 UTC
I'm currently exploring different options in order to get migrations as safe and fast as possible. I am currently testing liveMigrationConfig with completionTimeoutPerGiB: 800, parallelMigrationsPerCluster: 20, parallelOutboundMigrationsPerNode: 4, progressTimeout: 150, and maxUnavailable: 10. Unfortunately, nmstate is failing (and recovering) for various reasons during migration:

    [root@e26-h01-000-r640 logs]# oc get pod -n openshift-cnv | grep nmstate-handler | egrep '[0-9]m ago'
    nmstate-handler-2hcp9   1/1   Running   6 (160m ago)   16d
    nmstate-handler-4wlpt   1/1   Running   6 (131m ago)   16d
    nmstate-handler-595zt   1/1   Running   8 (27m ago)    16d
    nmstate-handler-5crvq   1/1   Running   8 (38m ago)    16d
    nmstate-handler-66w2m   1/1   Running   8 (45m ago)    16d
    nmstate-handler-7nzwf   1/1   Running   12 (158m ago)  16d
    nmstate-handler-9x2pp   1/1   Running   6 (137m ago)   16d
    nmstate-handler-btrc5   1/1   Running   6 (170m ago)   16d
    nmstate-handler-cdxzm   1/1   Running   8 (38m ago)    16d
    nmstate-handler-d4jtt   1/1   Running   8 (23m ago)    16d
    nmstate-handler-d75x7   1/1   Running   8 (46m ago)    16d
    nmstate-handler-fff7f   1/1   Running   8 (40m ago)    16d
    nmstate-handler-g49cf   1/1   Running   8 (29m ago)    16d
    nmstate-handler-gpg28   1/1   Running   8 (29m ago)    16d
    nmstate-handler-gtpwl   1/1   Running   14 (73m ago)   16d
    nmstate-handler-h4hk4   1/1   Running   8 (26m ago)    16d
    nmstate-handler-j66dd   1/1   Running   8 (39m ago)    16d
    nmstate-handler-k4jmv   1/1   Running   8 (52m ago)    16d
    nmstate-handler-k6xxs   1/1   Running   8 (46m ago)    16d
    nmstate-handler-m8bzx   1/1   Running   8 (52m ago)    16d
    nmstate-handler-mg9xs   1/1   Running   8 (39m ago)    16d
    nmstate-handler-n5xzx   1/1   Running   6 (3h6m ago)   16d
    nmstate-handler-n7v7b   1/1   Running   6 (163m ago)   16d
    nmstate-handler-nvhmg   1/1   Running   8 (73m ago)    16d
    nmstate-handler-sg2nm   1/1   Running   8 (45m ago)    16d
    nmstate-handler-x6rxg   1/1   Running   14 (40m ago)   16d
    nmstate-handler-zmh5z   1/1   Running   6 (127m ago)   16d

Hi Boaz,

https://bugzilla.redhat.com/show_bug.cgi?id=1439078 has a similar log message, which leads us to believe there might be a difference in the way the BIOS is configured between nodes in this cluster. In particular, check whether hyperthreading is configured the same way on all nodes. If you're not specifying a default CPU type, the node's type will be used (host model), and the CPU flags could differ even between otherwise identical CPUs. It appears that worker031 may be the preferred target for the scheduler in this case; since it's having trouble, it keeps being re-selected for each subsequent migration (and failure).

Hey Stu, I checked, and indeed 5 of the 100 nodes have hyperthreading disabled; worker031 is one of them.

Hey @sgott, now that we know this issue was caused by non-homogeneous nodes (same model, but hyperthreading disabled on some), the remaining question is who is to blame: the Kube scheduler, which went ahead and scheduled almost all of the migrating VMs onto the specific nodes that had fewer cores available because hyperthreading was disabled, or the virt controller, which attempted to migrate VMs to those nodes. This is a classic chicken-and-egg situation. Just for the record, the workaround was to verify that hyperthreading is enabled on all of the nodes; I also reduced the severity because of the workaround.

https://bugzilla.redhat.com/show_bug.cgi?id=1439078 seems to be related.

Per conversation with the virt team, it should be noted that the CPU capabilities reported by libvirt can tell us whether hyperthreading is enabled on a node, by observing the siblings of each CPU. Deferring to the next release, as the workaround is to ensure hyperthreading matches on each node.
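For reference, a minimal sketch of applying the live-migration tuning quoted at the top of this comment, assuming those knobs sit under spec.liveMigrationConfig of the HyperConverged CR; the kubevirt-hyperconverged name and openshift-cnv namespace are the usual CNV defaults and are not taken from this report:

    # Sketch only: patch the assumed HyperConverged CR with the values quoted above.
    # maxUnavailable is a separate setting (not part of liveMigrationConfig) and is omitted here.
    oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type merge -p \
      '{"spec":{"liveMigrationConfig":{"completionTimeoutPerGiB":800,"parallelMigrationsPerCluster":20,"parallelOutboundMigrationsPerNode":4,"progressTimeout":150}}}'

And a hedged sketch of the workaround check, assuming cluster-admin access and the standard "oc debug node" pattern: print "Thread(s) per core" for every node so that hosts with hyperthreading disabled (value 1) stand out from the rest (value 2):

    # Sketch only: report the SMT state of every node via a debug pod on the host.
    for node in $(oc get nodes -o name); do
      echo "== ${node}"
      oc debug "${node}" -- chroot /host lscpu 2>/dev/null | grep 'Thread(s) per core'
    done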
Deferring this to the next release because, with testing so far, we've been unable to reproduce. Without a clear reproducer, we don't have a path forward. Boaz, can you please confirm again the steps we need to take to reproduce this, because we can't with what has been given so far.

I couldn't reproduce the problem with either a Bare Metal cluster or a virtualized cluster (kubevirt-ci's nodes).

When I migrate a VM from a node with HT enabled/disabled to a node with HT disabled/enabled, the migration succeeded in both cases.

The way I turn off HT:
Bare Metal cluster:
    echo off > /sys/devices/system/cpu/smt/control
kubevirt-ci cluster: change the -smp flag of the qemu command for just one of the nodes:
    -smp ${CPU}  ->  -smp 12,sockets=1,cores=6,threads=2

The way I verify whether HT is enabled: check for "Thread(s) per core: 1" (disabled) when running lscpu on the nodes.

Am I missing something?

(In reply to Barak from comment #11)
> I couldn't reproduce the problem with both Bare Metal cluster and
> virtualized cluster (kubevirt-ci's nodes).
>
> When I migrate a VM from a node with HT enabled\disabled to a node with HT
> disabled\enabled the migration succeeded in both cases.
>
> The way I turn off the HT:
> Bare Metal cluster:
> echo off > /sys/devices/system/cpu/smt/control
>
> kubevirt-ci cluster:
> Change the -smp flag of qemu command just for one of the nodes
> -smp ${CPU}
> ->
> -smp 12,sockets=1,cores=6,threads=2
>
> The way I verify if HT is enabled:
>
> check if
> Thread(s) per core: 1 (disabled)
> when running lscpu in the nodes.
>
> Am I missing something?

Hey Barak, I believe you have to disable HT from the BIOS in order to reproduce this.
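For clarity, a condensed sketch of the bare-metal reproduction attempt described in the quoted comment; the VMI name is taken from the listing further down, and virtctl migrate is only one possible way to trigger the migration (the comment does not say which method was used):

    # Sketch only: on the target bare-metal node, disable SMT and confirm it.
    echo off > /sys/devices/system/cpu/smt/control
    lscpu | grep 'Thread(s) per core'      # expect "1" once SMT is off

    # Sketch only: from a host with cluster access, migrate a running VMI and
    # watch the resulting VirtualMachineInstanceMigration object.
    virtctl migrate vmi-migratable
    kubectl get vmim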
I disabled HT from the BIOS using IPMI LAN and still I can't reproduce it.

lscpu of the node without HT:

    [root@zeus35 ~]# lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              2
    On-line CPU(s) list: 0,1
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           2
    NUMA node(s):        2
    Vendor ID:           GenuineIntel
    BIOS Vendor ID:      Intel
    CPU family:          6
    Model:               85
    Model name:          Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
    BIOS Model name:     Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
    Stepping:            7
    CPU MHz:             800.076
    BogoMIPS:            4200.00
    Virtualization:      VT-x
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            1024K
    L3 cache:            28160K
    NUMA node0 CPU(s):   0
    NUMA node1 CPU(s):   1
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

lscpu of the node with HT enabled:

    [root@zeus32 ~]# lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              80
    On-line CPU(s) list: 0-79
    Thread(s) per core:  2
    Core(s) per socket:  20
    Socket(s):           2
    NUMA node(s):        2
    Vendor ID:           GenuineIntel
    BIOS Vendor ID:      Intel
    CPU family:          6
    Model:               85
    Model name:          Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
    BIOS Model name:     Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
    Stepping:            7
    CPU MHz:             800.058
    BogoMIPS:            4200.00
    Virtualization:      VT-x
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            1024K
    L3 cache:            28160K
    NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
    NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

I tried migrating from both nodes and it worked.

    [bmordeha@fedora Work]$ kubectl get po -owide
    NAME                                  READY   STATUS      RESTARTS   AGE     IP              NODE                             NOMINATED NODE   READINESS GATES
    virt-launcher-vmi-migratable-98k6n    0/2     Completed   0          22m     192.168.75.8    zeus32.lab.eng.tlv2.redhat.com   <none>           1/1
    virt-launcher-vmi-migratable-xc668    2/2     Running     0          16s     192.168.40.7    zeus35.lab.eng.tlv2.redhat.com   <none>           1/1
    virt-launcher-vmi-migratable1-2dmzx   2/2     Running     0          8m29s   192.168.75.13   zeus32.lab.eng.tlv2.redhat.com   <none>           1/1
    virt-launcher-vmi-migratable1-jw5bc   0/2     Completed   0          8m52s   192.168.40.6    zeus35.lab.eng.tlv2.redhat.com   <none>           1/1

    [bmordeha@fedora Work]$ kubectl get vmim
    NAME             PHASE       VMI
    migration-job    Succeeded   vmi-migratable1
    migration-job1   Succeeded   vmi-migratable

Closing this for the following reasons:
- We don't have enough data to reproduce.
- This happened on the old 4.9 version and there haven't been any similar bugs for a while.

This might already be fixed; let's see if someone else reports a similar issue in the future.
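For reference, the migration-job entries in the "kubectl get vmim" listing above are VirtualMachineInstanceMigration objects. A minimal hedged sketch of one such object, using the standard KubeVirt apiVersion/kind and the names from the listing:

    # Sketch only: create a migration object equivalent to "virtctl migrate vmi-migratable".
    cat <<'EOF' | kubectl apply -f -
    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstanceMigration
    metadata:
      name: migration-job1
    spec:
      vmiName: vmi-migratable
    EOF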