Bug 1695596
| Summary: | Instance shutdown without any reason | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | vivek koul <vkoul> |
| Component: | qemu-kvm-rhev | Assignee: | Bandan Das <bdas> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | nlevinki <nlevinki> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.4 | CC: | bdas, dgilbert, mbooth, mburns, pbonzini, ribarry, sbandyop, virt-maint |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-01-13 12:48:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
vivek koul
2019-04-03 11:59:09 UTC
I don't see any bug here - either in Nova or in Libvirt. I can't speak to the cause of the "shutting down, reason=destroyed" line in the QEMU log because I don't have what came before it, but it could be something as simple as someone running `virsh destroy <instance>` on the host, or someone shutting down the instance (for example with a `poweroff` command) from within the guest.

(In reply to Artom Lifshitz from comment #3)
> I don't see any bug here - either in Nova or in Libvirt. I can't speak to
> the cause of the "shutting down, reason=destroyed" line in the QEMU log
> because I don't have what came before it, but it could be something as simple
> as someone running `virsh destroy <instance>` on the host, or someone
> shutting down the instance (for example with a `poweroff` command) from
> within the guest.

I retract that. I opened the instance's QEMU log in the sosreports, and there's a KVM error/failure right before the instance is shut down. Currently talking to the virt folks to figure out the best course of action.
./60-sosreport-DOldham.02347882-20190401143137.tar.xz/sosreport-DOldham.02347882-20190401143137/var/log/libvirt/qemu/instance-00001a59.log

2019-01-29T15:40:01.625174Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/13 (label charserial1)
KVM: entry failed, hardware error 0x5
RAX=0000000000000040 RBX=ffffd00096fc0180 RCX=0000000000000082 RDX=0000000000000000
RSI=00000000ffffffff RDI=ffffe00183df3010 RBP=ffffd00093393ab9 RSP=ffffd000933939d8
R8 =00000000ffffffff R9 =0000000000000000 R10=0000000000000002 R11=0000000000000001
R12=0000000000000001 R13=0000000000000000 R14=00000d1e5028b06c R15=0000000000000000
RIP=fffff800f5ba581f RFL=00000286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]
CS =0010 0000000000000000 00000000 00209b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]
FS =0053 000000007ffe0000 0000bc00 0040f300 DPL=3 DS [-WA]
GS =002b ffffd00096fc0000 ffffffff 00c0f300 DPL=3 DS [-WA]
LDT=0000 0000000000000000 ffffffff 00000000
TR =0040 ffffd00096fd2000 00000067 00008b00 DPL=0 TSS64-busy
GDT=     ffffd00096fd3000 0000007f
IDT=     ffffd00096fd1000 00000fff
CR0=80050031 CR2=000000300027ef70 CR3=00000000001aa000 CR4=001506f8
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=00 00 00 00 00 48 83 ec 28 e8 57 36 ff ff 48 83 c4 28 fb f4 <c3> cc cc cc cc cc cc 66 66 0f 1f 84 00 00 00 00 00 0f 20 d0 0f 22 d0 c3 cc cc cc cc cc cc
2019-03-18T10:05:35.655670Z qemu-kvm: terminating on signal 15 from pid 5388 (<unknown process>)
2019-03-18 10:05:38.657+0000: shutting down, reason=destroyed

Since this looks like a legitimate bug in KVM (or at least warrants more investigation), but is not a Nova bug, I've moved it to the RHEL7.4 product, keeping the qemu-kvm-rhev component.
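For checking other instance logs for the same failure, here is a minimal Python sketch that scans libvirt QEMU log text for the entry-failure line and pulls out the error number, the RIP, and the byte QEMU marks in angle brackets on the `Code=` line. The function name and the returned keys are hypothetical, and the patterns are based only on the excerpt above, so other QEMU versions may need adjustments.

```python
import re

# Line printed by QEMU when KVM reports a failed VM entry (format taken from the log above).
ENTRY_FAILED = re.compile(r"KVM: entry failed, hardware error (0x[0-9a-fA-F]+)")
# 64-bit instruction pointer from the register dump that follows the failure line.
RIP = re.compile(r"\bRIP=([0-9a-fA-F]{16})")
# The byte shown in angle brackets on the Code= line.
MARKED_BYTE = re.compile(r"Code=[^<\n]*<([0-9a-fA-F]{2})>")


def find_entry_failures(log_text):
    """Return one dict per 'KVM: entry failed' occurrence in log_text (hypothetical helper)."""
    failures = []
    for match in ENTRY_FAILED.finditer(log_text):
        # Only look at the text between this failure and the next one, if any.
        tail = log_text[match.end():]
        following = ENTRY_FAILED.search(tail)
        if following:
            tail = tail[:following.start()]
        rip = RIP.search(tail)
        marked = MARKED_BYTE.search(tail)
        failures.append({
            "error": int(match.group(1), 16),
            "rip": rip.group(1) if rip else None,
            "marked_byte": marked.group(1) if marked else None,
        })
    return failures


if __name__ == "__main__":
    import sys
    # Usage: python scan_qemu_log.py /var/log/libvirt/qemu/<instance>.log
    with open(sys.argv[1], errors="replace") as handle:
        for failure in find_entry_failures(handle.read()):
            print(failure)
```

For the excerpt above, the marked byte is `c3` (a `ret`), and the two bytes before it are `fb f4` (`sti; hlt`). That suggests the vCPU was coming back from a halt when the entry failed, though this is an inference from the opcodes rather than something stated in the log.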
Similarity to upstream kernel bug report: https://bugzilla.kernel.org/show_bug.cgi?id=197813

So in our case it always looks like the same host? And most of the Code lines seem the same each time. Very odd.

Error 0x5 is "VMRESUME with non-launched VMCS". What host processor is this? I found this erratum from 2012:

> BF168. VM Entries That Return From SMM Using VMLAUNCH May Not Update The Launch State of the VMCS
>
> Problem: Successful VM entries using the VMLAUNCH instruction should set the launch state of the
> VMCS to "launched". Due to this erratum, such a VM entry may not update the launch state of the
> current VMCS if the VM entry is returning from SMM.
>
> Implication: Subsequent VM entries using the
> VMRESUME instruction with this VMCS will fail. RFLAGS.ZF is set to 1 and the value 5 (indicating
> VMRESUME with non-launched VMCS) is stored in the VM-instruction error field. This erratum applies
> only if dual monitor treatment of SMI and SMM is active.
>
> Workaround: None identified.
>
> Status: For the steppings affected, see the Summary Table of Changes

And also a similar erratum BK85 where the workaround is "It is possible for the BIOS to contain a workaround for this erratum".

Since this only appeared twice in several years, it is not unreasonable that it could be this erratum or a similar one.

(In reply to Paolo Bonzini from comment #10)
> What host processor is this? I found this erratum from 2012:

/proc/cpuinfo is in the sosreports for more info, but the model is this:

model name : Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz

> > BF168. VM Entries That Return From SMM Using VMLAUNCH May Not Update The Launch State of the VMCS
> >
> > Problem: Successful VM entries using the VMLAUNCH instruction should set the launch state of the
> > VMCS to "launched". Due to this erratum, such a VM entry may not update the launch state of the
> > current VMCS if the VM entry is returning from SMM.
> >
> > Implication: Subsequent VM entries using the
> > VMRESUME instruction with this VMCS will fail. RFLAGS.ZF is set to 1 and the value 5 (indicating
> > VMRESUME with non-launched VMCS) is stored in the VM-instruction error field. This erratum applies
> > only if dual monitor treatment of SMI and SMM is active.
> >
> > Workaround: None identified.
> >
> > Status: For the steppings affected, see the Summary Table of Changes
>
> And also a similar erratum BK85 where the workaround is "It is possible for
> the BIOS to contain a workaround for this erratum".
>
> Since this only appeared twice in several years, it is not unreasonable that
> it could be this erratum or a similar one.

Based on comment 10, is it possible to check whether a BIOS update is available for the affected system, and can it be applied?

I couldn't find the erratum in the document for Xeon E5 v3 processors (https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-spec-update.html). It was there in the Xeon E5 (Sandy Bridge) update, where it is listed as BT48. https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf

The upstream report was for a Xeon E5 v4 (Broadwell) processor. I agree with Bandan that a BIOS update is still a good idea.

Bandan, you probably should have it pr_err the current VMCS address before crashing. In the meanwhile David's suggested experiment is a great one!

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
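As a reading aid for the "hardware error 0x5" value discussed in this thread, here is a small Python sketch that maps a VM-instruction error number to its description. The table is a deliberately partial subset of the VM-instruction error numbers in the Intel SDM, and the dictionary and function names are hypothetical. It assumes, as the comments above do, that the number QEMU prints after "hardware error" is the VM-instruction error field.

```python
# Partial subset of Intel VMX VM-instruction error numbers (assumption: the value
# QEMU logs as "hardware error" is this field, as discussed in the thread).
VM_INSTRUCTION_ERRORS = {
    4: "VMLAUNCH with non-clear VMCS",
    5: "VMRESUME with non-launched VMCS",
    6: "VMRESUME after VMXOFF",
    7: "VM entry with invalid control field(s)",
    8: "VM entry with invalid host-state field(s)",
}


def describe_vm_instruction_error(code):
    """Return a description for a VM-instruction error number (hypothetical helper)."""
    return VM_INSTRUCTION_ERRORS.get(code, "not listed in this sketch (error %d)" % code)


print(describe_vm_instruction_error(0x5))  # prints "VMRESUME with non-launched VMCS"
```

Numbers outside this subset fall through to the "not listed" message rather than being guessed.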