Description of problem:
Can't use SSBD on a 4.3 upgrade without clearing the libvirt cache [/var/cache/libvirt/qemu/capabilities/*.xml].

Reported on the ovirt-users list: https://lists.ovirt.org/archives/list/users@ovirt.org/thread/4Y4X7UGDEYSB5JK45TLDERNM7IMTHIYY/

"""
So I tried making a new cluster with a 4.2 compatibility level and moving one of my EPYC hosts into it. I then updated the host to 4.3, switched the cluster version to 4.3, and set the cluster CPU to the new AMD EPYC IBPB SSBD (also tried plain AMD EPYC). It still fails to make the host operational, complaining that 'CPU type is not supported in this cluster compatibility version or is not supported at all'. I tried a few iterations of updating, moving, activating, reinstalling, etc., but none of them seem to work. The hosts are running CentOS Linux release 7.6.1810 (Core), and all packages are up to date. I checked my CPU flags and can't see anything missing.
...
Got a host activated!
1. Update host to 4.3
2. rm /var/cache/libvirt/qemu/capabilities/*.xml
3. systemctl restart libvirtd
4. Activate host
Seems like some kind of stuck state going from 4.2 -> 4.3
"""

Version-Release number of selected component (if applicable):
4.3

How reproducible:
? [not sure if the original poster tried setting SSBD before or after the upgrade]

Steps to Reproduce:
See https://lists.ovirt.org/archives/list/users@ovirt.org/thread/4Y4X7UGDEYSB5JK45TLDERNM7IMTHIYY/

Actual results:
'CPU type is not supported in this cluster compatibility version or is not supported at all' (but with an SSBD-capable CPU, it should be OK).

Expected results:
Host upgrades to 4.3 and works, with no error about incompatibility.
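The reporter's workaround (steps 2-4 above) can be sketched as a script. As a hedge for illustration, CACHE_DIR here defaults to a temp directory so the sequence can be dry-run; on a real host it would be /var/cache/libvirt/qemu/capabilities, followed by a libvirtd restart and host activation from the engine.

```shell
#!/bin/sh
# Dry-run sketch of the cache-clearing workaround. On a real host CACHE_DIR
# would be /var/cache/libvirt/qemu/capabilities; here we simulate with a
# temp directory (an assumption for illustration only).
CACHE_DIR="${CACHE_DIR:-$(mktemp -d)}"
touch "$CACHE_DIR/3000-0.xml"     # stand-in for a stale capabilities file
rm -f "$CACHE_DIR"/*.xml          # step 2: drop the cached capabilities
# step 3 on a real host: systemctl restart libvirtd
# step 4: activate the host from the engine
remaining=$(ls "$CACHE_DIR" | wc -l)
echo "remaining cache files: $remaining"
```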
Seems libvirt-related if the capabilities were reported wrongly. Maybe it can also happen on a firmware upgrade, since the cache appears to be cleared only on a libvirt upgrade. Also, 4.2 doesn't have EPYC and it was added to the db manually by the user, so that's not a valid path unless it's reproduced cleanly.
I did some investigation on the issue at Michal's request. We believe the AMD EPYC CPU types work on 4.2 and earlier because they use model_Opteron_G3, which we deprecated in 4.3. The CPU flags are reported by VDSM and ultimately come from libvirt. To verify this, please run this command on the host running with the AMD EPYC CPU type and provide the output:

vdsm-client Host getCapabilities

This will give us the host's "cpuFlags". If it does not include the flag "model_EPYC", the host will not support AMD EPYC. The problem itself may be that the libvirt version does not support this CPU type.
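A quick way to check for that flag in the output. The JSON below is an abbreviated stand-in so the pipeline can be shown offline; on a real host you would pipe the output of `vdsm-client Host getCapabilities` instead.

```shell
#!/bin/sh
# CAPS is a trimmed sample of getCapabilities output (assumption: the real
# output is JSON containing a comma-separated "cpuFlags" string).
CAPS='{"cpuFlags": "ssbd,ibpb,svm,avic,model_EPYC,model_EPYC-IBPB"}'
if printf '%s' "$CAPS" | grep -q 'model_EPYC'; then
  result="EPYC model reported"
else
  result="model_EPYC missing: host cannot run the AMD EPYC cluster CPU type"
fi
echo "$result"
```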
There are differences in the CPU flags between Opteron G4/G5 and EPYC. The upstream reporter here saw differences between avic and x2apic. Let's see which flags are exposed.
I asked Ryan Bullock to run the command: vdsm-client Host getCapabilities

Without avic=1, the proper CPU flags are returned, including model_EPYC; with avic=1, no model_ flags are included at all. Here are the details Ryan provided:

Without avic=1 (works fine):
"cpuFlags": "fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,mmx,fxsr,sse,sse2,ht,syscall,nx,mmxext,fxsr_opt,pdpe1gb,rdtscp,lm,constant_tsc,art,rep_good,nopl,nonstop_tsc,extd_apicid,amd_dcm,aperfmperf,eagerfpu,pni,pclmulqdq,monitor,ssse3,fma,cx16,sse4_1,sse4_2,movbe,popcnt,aes,xsave,avx,f16c,rdrand,lahf_lm,cmp_legacy,svm,extapic,cr8_legacy,abm,sse4a,misalignsse,3dnowprefetch,osvw,skinit,wdt,tce,topoext,perfctr_core,perfctr_nb,bpext,perfctr_l2,cpb,hw_pstate,sme,retpoline_amd,ssbd,ibpb,vmmcall,fsgsbase,bmi1,avx2,smep,bmi2,rdseed,adx,smap,clflushopt,sha_ni,xsaveopt,xsavec,xgetbv1,clzero,irperf,xsaveerptr,arat,npt,lbrv,svm_lock,nrip_save,tsc_scale,vmcb_clean,flushbyasid,decodeassists,pausefilter,pfthreshold,avic,v_vmsave_vmload,vgif,overflow_recov,succor,smca,model_Opteron_G3,model_Opteron_G2,model_kvm32,model_kvm64,model_Westmere,model_Nehalem,model_Conroe,model_EPYC-IBPB,model_Opteron_G1,model_SandyBridge,model_qemu32,model_Penryn,model_pentium2,model_486,model_qemu64,model_cpu64-rhel6,model_EPYC,model_pentium,model_pentium3"

With avic=1 (problem configuration):
"cpuFlags": "fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,mmx,fxsr,sse,sse2,ht,syscall,nx,mmxext,fxsr_opt,pdpe1gb,rdtscp,lm,constant_tsc,art,rep_good,nopl,nonstop_tsc,extd_apicid,amd_dcm,aperfmperf,eagerfpu,pni,pclmulqdq,monitor,ssse3,fma,cx16,sse4_1,sse4_2,movbe,popcnt,aes,xsave,avx,f16c,rdrand,lahf_lm,cmp_legacy,svm,extapic,cr8_legacy,abm,sse4a,misalignsse,3dnowprefetch,osvw,skinit,wdt,tce,topoext,perfctr_core,perfctr_nb,bpext,perfctr_l2,cpb,hw_pstate,sme,retpoline_amd,ssbd,ibpb,vmmcall,fsgsbase,bmi1,avx2,smep,bmi2,rdseed,adx,smap,clflushopt,sha_ni,xsaveopt,xsavec,xgetbv1,clzero,irperf,xsaveerptr,arat,npt,lbrv,svm_lock,nrip_save,tsc_scale,vmcb_clean,flushbyasid,decodeassists,pausefilter,pfthreshold,avic,v_vmsave_vmload,vgif,overflow_recov,succor,smca"

We do have an open report regarding the avic=1 setting, which seems to require the x2apic flag: https://bugzilla.redhat.com/show_bug.cgi?id=1675030 It seems this is a qemu-kvm issue and depends on that report.
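The difference between the two dumps can be computed mechanically as a sorted set difference. The lists below are abbreviated stand-ins for the full dumps above, kept short for illustration:

```shell
#!/bin/sh
# Abbreviated flag lists from the two configurations above (truncated to a
# few entries each; the real lists are much longer).
GOOD="ssbd,ibpb,avic,model_Opteron_G3,model_EPYC-IBPB,model_EPYC"
BAD="ssbd,ibpb,avic"
g=$(mktemp); b=$(mktemp)
printf '%s' "$GOOD" | tr ',' '\n' | sort > "$g"
printf '%s' "$BAD"  | tr ',' '\n' | sort > "$b"
# Flags present without avic=1 but absent with avic=1:
missing=$(comm -23 "$g" "$b")
echo "$missing"
rm -f "$g" "$b"
```

With the full dumps this isolates exactly the model_ entries that disappear when avic=1 is set.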
So far this hasn't been reproduced on base RHEL 7.6. Thinking this might be an issue with the version of qemu-ev that oVirt installs from the CentOS Virt SIG (http://mirror.centos.org/centos/7/virt/x86_64/kvm-common/). Here is my original CentOS bug, which I just updated: https://bugs.centos.org/view.php?id=15814 Not sure whether the oVirt group is part of the maintenance of those packages?
We talked about this being caused by oVirt Node, actually: because it doesn't upgrade the libvirt rpm, nothing clears out the libvirt cache. This should be reassigned to Yuval then, though Ryan, you said you may post the patch?
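A minimal sketch of what such a fix could look like on oVirt Node: an idempotent post-upgrade step that drops the cached capabilities, since an image-based upgrade never runs the rpm scriptlet that would normally do it. This is not the actual imgbased patch; the hook shape is an assumption, and CACHE_DIR defaults to a temp directory here so it can run outside a host.

```shell
#!/bin/sh
# Hypothetical post-upgrade hook (not real imgbased code). On a host the
# target would be /var/cache/libvirt/qemu/capabilities; a temp dir stands
# in for it here.
CACHE_DIR="${CACHE_DIR:-$(mktemp -d)}"
touch "$CACHE_DIR/4000-0.xml"            # simulate a stale pre-upgrade cache
find "$CACHE_DIR" -name '*.xml' -type f -delete
left=$(find "$CACHE_DIR" -name '*.xml' | wc -l)
echo "xml files left: $left"
# a real hook would then restart libvirtd so capabilities are re-probed
```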
I discussed with Yuval to see if this had been fixed in the meantime, but neither of us could find an appropriate bug (and after reviewing what's changed in imgbased over the last few months, I don't see a solution there either).
The root problem here is that if AVIC is enabled on AMD processors, oVirt reports them as unsupported. Based on the findings in https://bugzilla.redhat.com/show_bug.cgi?id=1675030, it seems oVirt doesn't like that the x2apic flag is removed when AVIC is enabled. This is benign for qemu/libvirt, which will still run VMs as expected; it just seems to be a problem with how oVirt checks host compatibility. Perhaps oVirt should allow missing x2apic support on AMD processors? Or check whether AVIC is enabled and then ignore x2apic? Should this be opened as a new/different bug report?
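The suggested relaxation can be sketched as follows. The flag set and helper are hypothetical, not actual engine code: treat the x2apic requirement as satisfied on an AMD host whenever avic is present, since KVM strips x2apic in that mode.

```shell
#!/bin/sh
# Hypothetical host-compatibility check: sample AMD flag set where enabling
# AVIC has removed x2apic (as described in bug 1675030).
FLAGS="ssbd,ibpb,svm,avic"
has_flag() {
  case ",$FLAGS," in *",$1,"*) return 0 ;; *) return 1 ;; esac
}
if has_flag x2apic || has_flag avic; then
  verdict="x2apic requirement satisfied (directly or via AVIC)"
else
  verdict="host unsupported: x2apic missing"
fi
echo "$verdict"
```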
Well, the real question is: "did clearing the cache resolve it?" If it did, then we can also treat this as benign in oVirt and instead deal with clearing the cache on oVirt Node upgrades.
From my testing, you sometimes need to clear the cache between enabling and disabling AVIC for the capabilities to be re-detected. But if you enable AVIC (and make sure it gets re-detected that way), oVirt will report an unsupported CPU type for the host.

The problem here was that we had AVIC enabled under 4.2 without issue, and then after the upgrade to 4.3 our hosts showed as unsupported. While troubleshooting we disabled AVIC at some point, but the capabilities weren't re-detected until after we cleared the cache. So the underlying issue is that AMD CPUs with AVIC enabled now show as unsupported, while libvirt's capabilities caching muddled the troubleshooting.

Does that make sense? I am trying to explain the issue as concisely as possible. In all honesty this ticket would be better titled 'CPU shown as unsupported after upgrade to 4.3 on AMD CPU with AVIC enabled', but the capabilities caching made the initial report a bit confusing.
I'd suggest that it should really be: [RFE] Support AVIC on AMD CPUs, which is likely a 4.3 target, or possibly 4.3.z somewhere. In general, modifying CPU flags is going to be NOTABUG, and if clearing the cache on Node upgrades (without AVIC enabled) works, then that's this bug. Enabling additional flags should be a different RFE.
QE does not have AMD EPYC machines; I tested with another AMD model without AVIC enabled (AMD Opteron Processor 6376) and could not reproduce the issue. But according to Comment 12, the bug fix is clearing the libvirt cache on Node upgrades (without AVIC enabled), so I tested that accordingly.

Test version:
From: rhvh-4.2-20190219
To: rhvh-4.3.0.5-0.20190328.0
imgbased-1.1.6-0.1.el7ev.noarch

Test steps:
1. Install rhvh-4.2-20190219 on an AMD machine (without AVIC enabled), and add the RHVH host to RHVM
2. Once the host is up in RHVM, check the libvirt cache: /var/cache/libvirt/qemu/capabilities/*.xml
3. Upgrade the host to rhvh-4.3-20190318
4. Reboot into the new build, and check the libvirt cache again: /var/cache/libvirt/qemu/capabilities/*.xml
5. rm /var/cache/libvirt/qemu/capabilities/*.xml
6. systemctl restart libvirtd
7. Activate host

Test results:
1. The libvirt cache file in step 4 differs from the one in step 2; it is refreshed after the upgrade
2. After step 7, the host is up in RHVM

So this bug is fixed according to Comment 12; moving the status to VERIFIED. For "[RFE] Support AVIC on AMD CPUs", please file another bug to track it.
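The before/after cache comparison in steps 2 and 4 can be done with a checksum. Sketched here against a temp directory standing in for the real cache path, with made-up file contents simulating the pre- and post-upgrade states:

```shell
#!/bin/sh
# Simulate comparing the capabilities cache across an upgrade. The XML
# contents are stand-ins; on a host you would checksum the real files in
# /var/cache/libvirt/qemu/capabilities before and after rebooting.
DIR=$(mktemp -d)
printf '<caps build="rhvh-4.2"/>\n' > "$DIR/3000-0.xml"
before=$(cat "$DIR"/*.xml | md5sum | cut -d' ' -f1)
printf '<caps build="rhvh-4.3"/>\n' > "$DIR/3000-0.xml"  # refreshed cache
after=$(cat "$DIR"/*.xml | md5sum | cut -d' ' -f1)
if [ "$before" != "$after" ]; then
  echo "cache refreshed after upgrade"
fi
```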
This bugzilla is included in the oVirt 4.3.3 release, published on April 16th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.3, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.