Bug 2031795
Summary: | KVM Fedora 35 guest x86 programs randomly crash in signal handler | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Miroslav Lichvar <mlichvar> | |
Component: | qemu-kvm | Assignee: | Dr. David Alan Gilbert <dgilbert> | |
qemu-kvm sub component: | CPU Models | QA Contact: | liunana <nanliu> | |
Status: | CLOSED CURRENTRELEASE | Docs Contact: | ||
Severity: | high | |||
Priority: | medium | CC: | ailan, cohuck, coli, dgilbert, jinzhao, juzhang, mlevitsk, nanliu, nilal, peterx, virt-maint, wei.huang2, ymankad | |
Version: | 8.5 | Keywords: | Reopened, Triaged, ZStream | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2043121 2043122 2065230 | Environment: | ||
Last Closed: | 2022-02-08 20:27:21 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2043121, 2043122, 2052198, 2052199, 2065230, 2065235, 2065239 |
Description
Miroslav Lichvar
2021-12-13 13:13:47 UTC
After some tests with different libvirt configurations it seems this is related to the "xsaves" feature of the CPU. The default CPU with the "host-model" cpu is EPYC Milan, but if I change it to EPYC Rome, or just disable the xsaves feature in the libvirt xml, the crashes stop. I'm still not sure in what component this needs to be fixed. Please reassign as appropriate.

Amnon, assigning this to you. Maybe Dave can help us with the triaging? Thanks

Maxim - does anything about this look familiar to you? Since you've debugged AMD CPU issues previously...

Moving the BZ back to virt-maint so that this keeps showing as a BZ that is not triaged yet. Thanks, John, for adding Maxim. Also, clearing needinfo from Amnon.

I can reproduce this bug on a RHEL.8.5 host with three kinds of guests.

Test Env:
dell-per7525-10.lab.eng.pek2.redhat.com
4.18.0-348.7.1.el8_5.x86_64
qemu-kvm-6.0.0-33.module+el8.5.0+13740+349232b6.2
Model name: AMD EPYC 7313 16-Core Processor

Guest:
Fedora 35: 5.15.12-200.fc35.x86_64
RHEL.8.6 : 4.18.0-357.el8.x86_64
RHEL.8.5 : 4.18.0-348.7.1.el8_5.x86_64

And I can't reproduce this bug with the combination (RHEL.8.5 host + qemu-kvm-6.2.0-2.module+el8.6.0+13738+17338784.x86_64) with three kinds of guests.
CPU configuration xml:
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>EPYC-Milan</model>
  <vendor>AMD</vendor>
  <feature policy='require' name='x2apic'/>
  <feature policy='require' name='tsc-deadline'/>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='tsc_adjust'/>
  <feature policy='require' name='vaes'/>
  <feature policy='require' name='vpclmulqdq'/>
  <feature policy='require' name='spec-ctrl'/>
  <feature policy='require' name='stibp'/>
  <feature policy='require' name='arch-capabilities'/>
  <feature policy='require' name='ssbd'/>
  <feature policy='require' name='cmp_legacy'/>
  <feature policy='require' name='virt-ssbd'/>
  <feature policy='require' name='rdctl-no'/>
  <feature policy='require' name='skip-l1dfl-vmentry'/>
  <feature policy='require' name='mds-no'/>
  <feature policy='require' name='pschange-mc-no'/>
  <feature policy='disable' name='erms'/>
  <feature policy='disable' name='fsrm'/>
  <feature policy='require' name='xsaves'/>
  <feature policy='disable' name='svm'/>
  <feature policy='require' name='topoext'/>
  <feature policy='disable' name='npt'/>
  <feature policy='disable' name='nrip-save'/>
  <feature policy='disable' name='svme-addr-chk'/>
</cpu>

It seems to be a qemu-related BZ. Could you please try qemu 6.2 on a RHEL 8.5 host to see if it works? Please help to check this, thanks.

Best regards
Liu Nana

Yes, upgrading to qemu-kvm-6.2.0 fixes the issue for me. Reassigning the bug to the qemu-kvm component. Thanks.

Hi Connie,

Since this has now been moved to qemu-kvm, do you think this is something that you might be able to look into it?
If not, I think Igor should be able to help.

Also adding Dave to the loop.

Thanks

(In reply to Nitesh Narayan Lal from comment #7)
> Hi Connie,
>
> Since this has now been moved to qemu-kvm, do you think this is something
> that you might be able to look into it?
> If not, I think Igor should be able to help.
>
> Also adding Dave to the loop.
>
> Thanks

I'm afraid I'm not really familiar with the cpu models here; it seems that it has been fixed at some time before 6.2? 7bde6b18575d ("target/i386: Add CPU model versions supporting 'xsaves'") might be a candidate to look at, just going by the fact that it mentions 'xsaves'.

Maybe; although if I'm reading that right it's adding xsaves to other cpu models; but Milan already has xsaves which is happening on the reporter's host.
We might have to bisect between ~@6.0 and 6.2 to find out when it got fixed.

(In reply to Dr. David Alan Gilbert from comment #9)
> Maybe; although if I'm reading that right it's adding xsaves to other cpu
> models; but Milan already has xsaves which is happening on the reporter's
> host.
> We might have to bisect between ~@6.0 and 6.2 to find out when it got fixed.

Hi,

I tested three qemu versions; it seems this was fixed in the early qemu 6.1 version.

Test Env:
4.18.0-348.7.1.el8_5.x86_64
amd-milan-11.khw1.lab.eng.bos.redhat.com

Guest:
4.18.0-348.el8.x86_64

qemu versions and test results:

Version                                                       | Result
------------------------------------------------------------------------------
qemu-kvm-6.0.0-33.module+el8.5.0+13740+349232b6.2             | crash
qemu-kvm-6.0.0-29.module+el8.6.0+12490+ec3e565c               | crash
qemu-kvm-6.1.0-1.rc0.scrmod+el8.5.0+11915+5290fd16.wrb210721  | No crash
------------------------------------------------------------------------------

Could you help to check this? Thanks.

Best regards
Liu Nana

Thanks! I'll try and bisect it further.

I can confirm this is OK on a Rome machine; going to find a Milan to try it on.

bisected:

6.0.0 fails
1500 0c7af1a778d036402ec0829783afee1ce6ea942c bad
 750 9bef7ea9d93ee6b6297a5be6cb5a557f7d1764c9 bad
 743 dd52af17ec947332dfe45bd5f098c94c6ec0baa3 bad!
 738 5aa10ab1a08e4123dee214a2f854909efb07b45b bad!
 737 3568987f78faff90829ea6c885bbdd5b083dc86c bad!
 736 fea4500841024195ec701713e05b92ebf667f192 good
 734 f08b65b651bca2eac543de694f866049e48fb242 good
 700 af4ba0ec8f017c402c239f2888ef62f63770ba8b good
 650 2044969f0b27fa67f2b69bc710eaef45998cb6fb good
 550 f6b12dfd80f3b0d6fbaf982718946e5ad72a543e good
 375 2e3e3da3c2ad559d1255a9a3bf3df0782c2cf231 good!
6.1.0-rc1 good

This is a series by David Edmondson 'Derive XSAVE state component offsets from CPUID leaf 0xd where possible' and the cover letter has the following smoking gun:

|The offset of XSAVE state components within the XSAVE state area is
|currently hard-coded via reference to the X86XSaveArea structure. This
|structure is accurate for Intel systems at the time of writing, but
|incorrect for newer AMD systems, as the state component for protection
|keys is located differently (offset 0x980 rather than offset 0xa80).

I'm having a crack at backporting that to 8.5.0; it's not quite trivial because there's a rearrangement of the cpu structures by Claudio in between.

Still, now that we know what the fix is, and that it's in 8.6, we should close this as 'fixed current version' and ask for z? for the backport.

Asking for Z stream because corrupt state makes me nervous; maybe there are other less obscure cases that crash; we should fix it.

(My first attempt at backporting that set isn't that successful; I get a reliable crash in the test rather than an intermittent one!)

OK, I seem to have a working world.

Note, this also triggers on 8.4.0 - so I think we might have to Z this further back. Copying in Wei for information.

(Clearing needinfo on Maxim since we know the problem)

Yash: Anything else I need to do to z-streamify this?

*** Bug 2065230 has been marked as a duplicate of this bug. ***
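
For readers following the root cause, the mismatch described in the cover letter can be illustrated with a small sketch. This is not QEMU code: the table names and the helpers `find_offset_mismatches` and `component_offset` are hypothetical, and the only numbers taken from this report are the two protection-key offsets (0xa80 hard-coded, 0x980 on newer AMD systems). It also assumes that protection keys (PKRU) are XSAVE state component 9 and that CPUID leaf 0xd reports each component's offset in EBX of the matching sub-leaf.

```python
# Illustrative only: models the class of bug, not QEMU's actual data structures.

PKRU_COMPONENT = 9  # assumed XSAVE state-component index for protection keys

# Offset that a fixed, Intel-shaped structure would assume (value from the report).
HARDCODED_OFFSETS = {PKRU_COMPONENT: 0xA80}

# Offsets as a host could report them via CPUID leaf 0xd (EBX of sub-leaf n).
# Hypothetical table; the value is the AMD one quoted in the cover letter.
CPUID_OFFSETS_NEWER_AMD = {PKRU_COMPONENT: 0x980}


def find_offset_mismatches(hardcoded, reported):
    """Return {component: (hardcoded_offset, reported_offset)} where they differ."""
    return {
        comp: (hardcoded[comp], off)
        for comp, off in reported.items()
        if comp in hardcoded and hardcoded[comp] != off
    }


def component_offset(component, reported, hardcoded):
    """Prefer the CPUID-reported offset; fall back to the hard-coded one."""
    return reported.get(component, hardcoded[component])


if __name__ == "__main__":
    mismatches = find_offset_mismatches(HARDCODED_OFFSETS, CPUID_OFFSETS_NEWER_AMD)
    for comp, (fixed, real) in mismatches.items():
        print(f"component {comp}: hard-coded {fixed:#x}, CPUID reports {real:#x}")
```

Reading or writing a component at a fixed offset that disagrees with the one the host CPU uses would place that state at the wrong location in the save area, which is consistent with the corrupt state suspected in this bug. The `component_offset` helper mirrors the idea in the series title: use the CPUID-derived offset where one is available.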