Bug 1674265 - Can't use AMD EPYC IBPB SSBD on 4.3 upgrade without clearing libvirt cache
Summary: Can't use AMD EPYC IBPB SSBD on 4.3 upgrade without clearing libvirt cache
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: imgbased
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.3.3
Target Release: ---
Assignee: Yuval Turgeman
QA Contact: Huijuan Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-10 16:04 UTC by Greg Sheremeta
Modified: 2019-04-16 13:58 UTC
CC: 17 users

Fixed In Version: imgbased-1.1.6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-16 13:58:11 UTC
oVirt Team: Node
Embargoed:
rule-engine: ovirt-4.3+
cshao: testing_ack?


Attachments: none


Links
System        ID     Branch     Status     Summary                                                    Last Updated
oVirt gerrit  97739  master     ABANDONED  core: Re-added AMD Opteron G3 CPU Type                     2021-01-09 22:02:37 UTC
oVirt gerrit  98746  master     MERGED     osupdater: don't copy the libvirt/qemu cache on upgrades   2021-01-09 22:02:35 UTC
oVirt gerrit  98936  ovirt-4.3  MERGED     osupdater: don't copy the libvirt/qemu cache on upgrades   2021-01-09 22:03:14 UTC

Description Greg Sheremeta 2019-02-10 16:04:10 UTC
Description of problem:
Can't use SSBD on 4.3 upgrade without clearing libvirt cache

[ /var/cache/libvirt/qemu/capabilities/*.xml ]

Reported on ovirt users list:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/4Y4X7UGDEYSB5JK45TLDERNM7IMTHIYY/

"""
So I tried making a new cluster with a 4.2 compatibility level and moving
one of my EPYC hosts into it. I then updated the host to 4.3 and switched
the cluster version to 4.3 + set the cluster cpu to the new AMD EPYC IBPB SSBD
(also tried plain AMD EPYC). It still fails to make the host operational
complaining that 'CPU type is not supported in this cluster compatibility
version or is not supported at all'.

I tried a few iterations of updating, moving, activating, reinstalling,
etc, but none of them seem to work.

The hosts are running CentOS Linux release 7.6.1810 (Core), all packages
are up to date.

I checked my CPU flags, and I can't see anything missing.

...

Got a host activated!

1. Update host to 4.3
2. rm /var/cache/libvirt/qemu/capabilities/*.xml
3. systemctl restart libvirtd
4. Activate host

Seems like some kind of stuck state going from 4.2 -> 4.3
"""

Version-Release number of selected component (if applicable):
4.3

How reproducible:
Unknown (not sure whether the original poster tried setting SSBD before or after the upgrade)

Steps to Reproduce:
see
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/4Y4X7UGDEYSB5JK45TLDERNM7IMTHIYY/

Actual results:
'CPU type is not supported in this cluster compatibility version or is not supported at all'. (But with an SSBD CPU type selected, this should be supported.)

Expected results:
Host upgrades to 4.3 and works, with no error about incompatibility.

Comment 1 Michal Skrivanek 2019-02-11 07:30:10 UTC
Seems libvirt-related, if the capabilities were reported wrongly. Maybe it can also happen on a firmware upgrade, since the cache seems to be cleared only on a libvirt upgrade.

Also, 4.2 doesn't have EPYC; it was added manually to the db by the user, so that's not a valid path unless it can be reproduced cleanly.

Comment 2 Steven Rosenberg 2019-02-14 17:23:08 UTC
I did some investigation on the issue at Michal's request. 

We believe that AMD EPYC CPU types work on 4.2 and earlier because they use model_Opteron_G3, which was deprecated in version 4.3.

The CPU flags are reported by VDSM, which in turn gets them from libvirt.

To verify this, please run this command on the host that is running with the AMD EPYC CPU Type and provide the output:

vdsm-client Host getCapabilities

This will give us the host's "cpuFlags". If it does not include the flag "model_EPYC", the host will not support the AMD EPYC CPU type.
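For a quick check, something like this should work (a sketch, assuming jq is available on the host; "cpuFlags" is a top-level key in the getCapabilities JSON output):

# List only the model_* flags; model_EPYC should appear on a supported host
vdsm-client Host getCapabilities | jq -r '.cpuFlags' | tr ',' '\n' | grep '^model_'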

The problem itself may be that the libvirt version does not support this CPU type.

Comment 3 Ryan Barry 2019-02-14 17:34:26 UTC
There are differences in the CPU flags between Opteron G4/G5 and EPYC. The upstream reporter here saw differences involving avic and x2apic.

Let's see which flags are exposed

Comment 4 Steven Rosenberg 2019-02-18 07:37:15 UTC
I asked Ryan Bullock to run the command:

vdsm-client Host getCapabilities

Without avic=1, the proper CPU flags are returned, including model_EPYC; with avic=1, the proper CPU flags are not returned (no model_ flags are included at all).

Here are the details Ryan provided:

Without avic=1 (Works Fine):
    "cpuFlags": "fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,mmx,fxsr,sse,sse2,ht,syscall,nx,mmxext,fxsr_opt,pdpe1gb,rdtscp,lm,constant_tsc,art,rep_good,nopl,nonstop_tsc,extd_apicid,amd_dcm,aperfmperf,eagerfpu,pni,pclmulqdq,monitor,ssse3,fma,cx16,sse4_1,sse4_2,movbe,popcnt,aes,xsave,avx,f16c,rdrand,lahf_lm,cmp_legacy,svm,extapic,cr8_legacy,abm,sse4a,misalignsse,3dnowprefetch,osvw,skinit,wdt,tce,topoext,perfctr_core,perfctr_nb,bpext,perfctr_l2,cpb,hw_pstate,sme,retpoline_amd,ssbd,ibpb,vmmcall,fsgsbase,bmi1,avx2,smep,bmi2,rdseed,adx,smap,clflushopt,sha_ni,xsaveopt,xsavec,xgetbv1,clzero,irperf,xsaveerptr,arat,npt,lbrv,svm_lock,nrip_save,tsc_scale,vmcb_clean,flushbyasid,decodeassists,pausefilter,pfthreshold,avic,v_vmsave_vmload,vgif,overflow_recov,succor,smca,model_Opteron_G3,model_Opteron_G2,model_kvm32,model_kvm64,model_Westmere,model_Nehalem,model_Conroe,model_EPYC-IBPB,model_Opteron_G1,model_SandyBridge,model_qemu32,model_Penryn,model_pentium2,model_486,model_qemu64,model_cpu64-rhel6,model_EPYC,model_pentium,model_pentium3"

With avic=1 (Problem Configuration):
"cpuFlags": "fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,mmx,fxsr,sse,sse2,ht,syscall,nx,mmxext,fxsr_opt,pdpe1gb,rdtscp,lm,constant_tsc,art,rep_good,nopl,nonstop_tsc,extd_apicid,amd_dcm,aperfmperf,eagerfpu,pni,pclmulqdq,monitor,ssse3,fma,cx16,sse4_1,sse4_2,movbe,popcnt,aes,xsave,avx,f16c,rdrand,lahf_lm,cmp_legacy,svm,extapic,cr8_legacy,abm,sse4a,misalignsse,3dnowprefetch,osvw,skinit,wdt,tce,topoext,perfctr_core,perfctr_nb,bpext,perfctr_l2,cpb,hw_pstate,sme,retpoline_amd,ssbd,ibpb,vmmcall,fsgsbase,bmi1,avx2,smep,bmi2,rdseed,adx,smap,clflushopt,sha_ni,xsaveopt,xsavec,xgetbv1,clzero,irperf,xsaveerptr,arat,npt,lbrv,svm_lock,nrip_save,tsc_scale,vmcb_clean,flushbyasid,decodeassists,pausefilter,pfthreshold,avic,v_vmsave_vmload,vgif,overflow_recov,succor,smca"


We do have an open report regarding the avic=1 setting which seems to require the x2apic flag:

https://bugzilla.redhat.com/show_bug.cgi?id=1675030 

It seems this is a qemu-kvm issue and depends on the above report.

Comment 5 Ryan 2019-02-22 17:58:17 UTC
So far this hasn't been reproduced on base RHEL 7.6.

Thinking this might be an issue with the version of qemu-ev that oVirt installs from the CentOS Virt SIG (http://mirror.centos.org/centos/7/virt/x86_64/kvm-common/)

Here is my original CentOS bug that I just updated:
https://bugs.centos.org/view.php?id=15814

Not sure whether the oVirt group is involved in maintaining those packages?

Comment 7 Michal Skrivanek 2019-03-21 09:41:00 UTC
We talked about this actually being caused by ovirt-node: because it does not upgrade the libvirt rpm, nothing clears out the libvirt cache. It should be reassigned to Yuval then, though Ryan, you said you might post the patch?

Comment 8 Ryan Barry 2019-03-21 10:59:25 UTC
I discussed this with Yuval to see whether it had been fixed in the meantime, but neither of us could find an appropriate bug (and after reviewing what has changed in imgbased over the last few months, I don't see a solution there either).

Comment 9 Ryan 2019-03-21 17:30:47 UTC
The root problem here was that if AVIC is enabled on AMD processors oVirt reports them as unsupported.

Based on the findings in https://bugzilla.redhat.com/show_bug.cgi?id=1675030, it seems like oVirt doesn't like that the x2apic flag is removed when AVIC is enabled. This is benign for qemu/libvirt, which will still run VMs as expected. Just seems to be a problem with how oVirt is checking host compatibility. 

Perhaps oVirt should allow for missing x2apic support with AMD processors? Or check if AVIC is enabled and then ignore x2apic?
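
For what it's worth, both conditions are easy to check on a host (a sketch; the avic path is the standard kvm_amd module parameter, and the cache path is the one from this bug):

# Is AVIC enabled for kvm_amd? (prints 1 or Y when avic=1 is set)
cat /sys/module/kvm_amd/parameters/avic
# Do the cached qemu capabilities still advertise x2apic?
grep -l x2apic /var/cache/libvirt/qemu/capabilities/*.xml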

Should this be opened as a new/different bug report?

Comment 10 Ryan Barry 2019-03-21 18:35:29 UTC
Well, the real question is "did clearing the cache resolve it?" If it did, then we can also treat this as benign in oVirt, and instead deal with clearing the cache on oVirt Node on upgrades
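
(That is the direction the fix eventually took; see gerrit 98746, "osupdater: don't copy the libvirt/qemu cache on upgrades", in the Links section. A minimal sketch of the idea, assuming the updater copies /var between layers with rsync; the mount points here are hypothetical:)

# Skip the qemu capabilities cache when building the new layer's /var,
# so libvirt re-probes qemu on first boot into the new image
rsync -aAX --exclude='cache/libvirt/qemu/' /mnt/old_layer/var/ /mnt/new_layer/var/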

Comment 11 Ryan 2019-03-21 18:54:45 UTC
From my testing, you sometimes need to clear the cache between enabling and disabling AVIC for the capabilities to be re-detected. However, if you enable AVIC (and make sure the capabilities are re-detected that way), oVirt will report an unsupported CPU type for the host.

The problem here was that we had AVIC enabled under 4.2 without issue, and then after the upgrade to 4.3 our hosts were showing as unsupported. While troubleshooting, we disabled AVIC at some point, but the capabilities didn't get re-detected until after we cleared the cache. The underlying issue was that AVIC-enabled AMD CPUs were now showing as unsupported, but libvirt's capabilities caching muddled the troubleshooting.

Does that make sense? I am trying to explain the issue as concisely as possible. In all honesty, this ticket would be better titled 'CPU shown as unsupported after upgrade to 4.3 on AMD CPU with AVIC enabled', but the capabilities caching made the initial reporting a bit confusing.

Comment 12 Ryan Barry 2019-03-21 19:32:51 UTC
I'd suggest that it should really be:

[RFE] Support AVIC on AMD CPUs

Which is likely to be a 4.3 target, or possibly 4.3.z somewhere...

In general, modifying CPU flags is going to be NOTABUG, and if clearing the cache on Node upgrades (without AVIC enabled) works, then that's this bug.

Enabling additional flags should be a different RFE

Comment 13 Huijuan Zhao 2019-04-01 10:27:57 UTC
QE does not have AMD EPYC machines, so I tested with another AMD model without AVIC enabled (AMD Opteron Processor 6376) and could not reproduce the issue.

But according to Comment 12, the bug fix is clearing the libvirt cache on Node upgrades (without AVIC enabled), so I tested that accordingly.

Test version:
From: rhvh-4.2-20190219
To:   rhvh-4.3.0.5-0.20190328.0
      imgbased-1.1.6-0.1.el7ev.noarch

Test steps:
1. Install rhvh-4.2-20190219 on an AMD machine (without AVIC enabled), and add the rhvh host to rhvm
2. Once the rhvh host is up in rhvm, check the libvirt cache: /var/cache/libvirt/qemu/capabilities/*.xml
3. Upgrade rhvh to rhvh-4.3-20190318
4. Reboot rhvh into the new build, and check the libvirt cache: /var/cache/libvirt/qemu/capabilities/*.xml
5. rm /var/cache/libvirt/qemu/capabilities/*.xml
6. systemctl restart libvirtd
7. Activate host 

Test results:
1. The libvirt cache file in step 4 differs from the one in step 2; it is refreshed after the upgrade
2. After step 7, the host is up in rhvm.
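
(For reference, one way to make the step 2 vs. step 4 comparison concrete -- a sketch; the snapshot file name is arbitrary:)

# Before upgrade (step 2)
md5sum /var/cache/libvirt/qemu/capabilities/*.xml > /root/caps.before
# After reboot into the new build (step 4); any output means the cache changed
md5sum /var/cache/libvirt/qemu/capabilities/*.xml | diff /root/caps.before -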

So this bug is fixed according to Comment 12; moving the status to VERIFIED.
For "[RFE] Support AVIC on AMD CPUs", please file a separate bug to track it.

Comment 14 Sandro Bonazzola 2019-04-16 13:58:11 UTC
This bugzilla is included in oVirt 4.3.3 release, published on April 16th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

