Bug 1695596

Summary: Instance shutdown without any reason
Product: Red Hat Enterprise Linux 7
Reporter: vivek koul <vkoul>
Component: qemu-kvm-rhev
Assignee: Bandan Das <bdas>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: nlevinki <nlevinki>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.4
CC: bdas, dgilbert, mbooth, mburns, pbonzini, ribarry, sbandyop, virt-maint
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-13 12:48:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description vivek koul 2019-04-03 11:59:09 UTC
Description of problem:
The instance is being paused without any reason

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.4 (Maipo)
libvirt-3.2.0-14.el7_4.7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.7.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
qemu-img-rhev-2.9.0-16.el7_4.13.x86_64
qemu-kvm-common-rhev-2.9.0-16.el7_4.13.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
erlang-kernel-18.3.4.1-1.el7ost.x86_64
kernel-3.10.0-693.17.1.el7.x86_64
kernel-3.10.0-693.el7.x86_64
kernel-devel-3.10.0-693.17.1.el7.x86_64
kernel-devel-3.10.0-693.el7.x86_64
kernel-headers-3.10.0-693.17.1.el7.x86_64
kernel-tools-3.10.0-693.17.1.el7.x86_64
kernel-tools-libs-3.10.0-693.17.1.el7.x86_64

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: 
VM is in a paused state

Expected results: 
VM should not be in a paused state

Additional info:
The VM got paused on Monday, 18th March.
The affected instance has been stable since it was restarted on the day of the issue.

Comment 3 Artom Lifshitz 2019-04-11 17:33:35 UTC
I don't see any bug here - either in Nova or in Libvirt. I can't speak to the cause of the "shutting down, reason=destroyed" line in the QEMU log because I don't have what came before it, but it could be something as simple as someone running `virsh destroy <instance>` on the host, or someone shutting down the instance (for example with a `poweroff` command) from within the guest.
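
For reference, a minimal sketch of what `virsh destroy <instance>` amounts to through the libvirt C API: virDomainDestroy() forcefully terminates the QEMU process, which is what normally shows up as "shutting down, reason=destroyed" in the per-domain log. The domain name below is only an example taken from this report's log path; the snippet is illustrative, not from this environment.

/*
 * Illustrative only: roughly what `virsh destroy <instance>` does through
 * the libvirt C API, i.e. the call that normally ends with a
 * "shutting down, reason=destroyed" line in the per-domain QEMU log.
 */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to qemu:///system\n");
        return 1;
    }

    /* Example domain name, matching the instance in this report's log path. */
    virDomainPtr dom = virDomainLookupByName(conn, "instance-00001a59");
    if (dom) {
        /* Forcefully terminates the QEMU process, like `virsh destroy`. */
        if (virDomainDestroy(dom) == 0)
            printf("domain destroyed\n");
        virDomainFree(dom);
    }

    virConnectClose(conn);
    return 0;
}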

Comment 4 Artom Lifshitz 2019-04-11 17:45:12 UTC
(In reply to Artom Lifshitz from comment #3)
> I don't see any bug here - either in Nova or in Libvirt. I can't speak to
> the cause of the "shutting down, reason=destroyed" line in the QEMU log
> because I don't have what came before it, but it could be something as simple
> as someone running `virsh destroy <instance>` on the host, or someone
> shutting down the instance (for example with a `poweroff` command) from
> within the guest.

I retract that. I opened the instance's QEMU log in the sosreports, and there's a KVM error/failure right before the instance is shut down. Currently talking to the virt folks to figure out the best course of action.

./60-sosreport-DOldham.02347882-20190401143137.tar.xz/sosreport-DOldham.02347882-20190401143137/var/log/libvirt/qemu/instance-00001a59.log

2019-01-29T15:40:01.625174Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/13 (label charserial1)
KVM: entry failed, hardware error 0x5
RAX=0000000000000040 RBX=ffffd00096fc0180 RCX=0000000000000082 RDX=0000000000000000
RSI=00000000ffffffff RDI=ffffe00183df3010 RBP=ffffd00093393ab9 RSP=ffffd000933939d8
R8 =00000000ffffffff R9 =0000000000000000 R10=0000000000000002 R11=0000000000000001
R12=0000000000000001 R13=0000000000000000 R14=00000d1e5028b06c R15=0000000000000000
RIP=fffff800f5ba581f RFL=00000286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
CS =0010 0000000000000000 00000000 00209b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
FS =0053 000000007ffe0000 0000bc00 0040f300 DPL=3 DS   [-WA]
GS =002b ffffd00096fc0000 ffffffff 00c0f300 DPL=3 DS   [-WA]
LDT=0000 0000000000000000 ffffffff 00000000
TR =0040 ffffd00096fd2000 00000067 00008b00 DPL=0 TSS64-busy
GDT=     ffffd00096fd3000 0000007f
IDT=     ffffd00096fd1000 00000fff
CR0=80050031 CR2=000000300027ef70 CR3=00000000001aa000 CR4=001506f8
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=00 00 00 00 00 48 83 ec 28 e8 57 36 ff ff 48 83 c4 28 fb f4 <c3> cc cc cc cc cc cc 66 66 0f 1f 84 00 00 00 00 00 0f 20 d0 0f 22 d0 c3 cc cc cc cc cc cc
2019-03-18T10:05:35.655670Z qemu-kvm: terminating on signal 15 from pid 5388 (<unknown process>)
2019-03-18 10:05:38.657+0000: shutting down, reason=destroyed
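
For context, the "KVM: entry failed, hardware error 0x5" line and the register dump above are what QEMU prints when the KVM_RUN ioctl returns a KVM_EXIT_FAIL_ENTRY exit. Below is a minimal sketch (not QEMU's actual code) of how a KVM userspace sees that exit; the vCPU fd and the mmap'ed kvm_run structure are assumed to already exist.

/*
 * Minimal sketch of how a KVM userspace observes the failure logged above:
 * KVM_RUN returns with exit_reason == KVM_EXIT_FAIL_ENTRY, and the hardware
 * reason (0x5 here) is in run->fail_entry.hardware_entry_failure_reason.
 */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vcpu_fd: an already-created vCPU file descriptor.
 * run: the mmap'ed struct kvm_run for that vCPU. */
static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
    if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
        perror("KVM_RUN");
        return -1;
    }

    if (run->exit_reason == KVM_EXIT_FAIL_ENTRY) {
        /* The condition behind "KVM: entry failed, hardware error 0x5". */
        fprintf(stderr, "entry failed, hardware error 0x%llx\n",
                (unsigned long long)run->fail_entry.hardware_entry_failure_reason);
        return -1;
    }

    return 0;
}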

Comment 6 Artom Lifshitz 2019-04-11 18:16:55 UTC
Since this looks like a legitimate bug in KVM (or at least warrants more investigation), but is not a Nova bug, I've moved it to the RHEL 7.4 product, keeping the qemu-kvm-rhev component.

Comment 7 Dr. David Alan Gilbert 2019-04-11 19:01:17 UTC
This is similar to an upstream kernel bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=197813

So in our case it always looks like the same host?
And most of the Code lines seem the same each time.

Very odd.

Comment 9 Paolo Bonzini 2019-04-12 12:50:52 UTC
Error 0x5 is "VMRESUME with non-launched VMCS".
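
For readability, here is a small illustrative helper mapping the nearby VM-instruction error numbers to their Intel SDM names; only the value 5 is the one actually observed here, and the kernel has its own symbolic names for these codes (VMXERR_*) in its VMX headers.

/* Illustrative mapping of a few VMX VM-instruction error numbers. */
const char *vmx_insn_error_name(unsigned int err)
{
    switch (err) {
    case 4:  return "VMLAUNCH with non-clear VMCS";
    case 5:  return "VMRESUME with non-launched VMCS"; /* the error seen here */
    case 6:  return "VMRESUME after VMXOFF";
    default: return "other/unknown VM-instruction error";
    }
}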

Comment 10 Paolo Bonzini 2019-04-12 13:00:43 UTC
What host processor is this?  I found this erratum from 2012:

> BF168. VM Entries That Return From SMM Using VMLAUNCH May Not Update The Launch State of the VMCS
>
> Problem: Successful VM entries using the VMLAUNCH instruction should set the launch state of the
> VMCS to "launched". Due to this erratum, such a VM entry may not update the launch state of the
> current VMCS if the VM entry is returning from SMM. Implication: Subsequent VM entries using the
> VMRESUME instruction with this VMCS will fail. RFLAGS.ZF is set to 1 and the value 5 (indicating
> VMRESUME with non-launched VMCS) is stored in the VM-instruction error field. This erratum applies
> only if dual monitor treatment of SMI and SMM is active.
>
> Workaround:  None identified.
>
> Status: For the steppings affected, see the Summary Table of Changes

And also a similar erratum BK85 where the workaround is "It is possible for the BIOS to contain a workaround for this erratum".

Since this only appeared twice in several years, it is not unreasonable that it could be this erratum or a similar one.
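
To spell out the rule the erratum is about, here is conceptual pseudocode (not real hardware or KVM code): a successful VMLAUNCH is supposed to move the current VMCS to the "launched" state, and if that update is missed, as the erratum describes for entries returning from SMM, every later VMRESUME on that VMCS fails with error 5.

/* Conceptual model of the VMCS launch-state rule behind the erratum. */
enum vmcs_launch_state { VMCS_CLEAR, VMCS_LAUNCHED };

struct vmcs_model {
    enum vmcs_launch_state launch_state;
};

/* Returns 0 on success, or the VM-instruction error number on failure. */
int do_vmlaunch(struct vmcs_model *current_vmcs)
{
    if (current_vmcs->launch_state != VMCS_CLEAR)
        return 4; /* "VMLAUNCH with non-clear VMCS" */
    /* Erratum: this update may be missed when the entry returns from SMM. */
    current_vmcs->launch_state = VMCS_LAUNCHED;
    return 0;
}

int do_vmresume(struct vmcs_model *current_vmcs)
{
    if (current_vmcs->launch_state != VMCS_LAUNCHED)
        return 5; /* "VMRESUME with non-launched VMCS" */
    /* ... VM entry proceeds ... */
    return 0;
}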

Comment 11 Artom Lifshitz 2019-04-12 13:08:40 UTC
(In reply to Paolo Bonzini from comment #10)
> What host processor is this?  I found this erratum from 2012:

/proc/cpuinfo is in the sosreports for more info, but the model is this:

model name      : Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz

> 
> > BF168. VM Entries That Return From SMM Using VMLAUNCH May Not Update The Launch State of the VMCS
> >
> > Problem: Successful VM entries using the VMLAUNCH instruction should set the launch state of the
> > VMCS to "launched". Due to this erratum, such a VM entry may not update the launch state of the
> > current VMCS if the VM entry is returning from SMM. Implication: Subsequent VM entries using the
> > VMRESUME instruction with this VMCS will fail. RFLAGS.ZF is set to 1 and the value 5 (indicating
> > VMRESUME with non-launched VMCS) is stored in the VM-instruction error field. This erratum applies
> > only if dual monitor treatment of SMI and SMM is active.
> >
> > Workaround:  None identified.
> >
> > Status: For the steppings affected, see the Summary Table of Changes
> 
> And also a similar erratum BK85 where the workaround is "It is possible for
> the BIOS to contain a workaround for this erratum".
> 
> Since this only appeared twice in several years, it is not unreasonable that
> it could be this erratum or a similar one.

Comment 12 Bandan Das 2019-04-12 13:29:13 UTC
Based on comment 10, is it possible to check whether a BIOS update is available for the affected system, and can it be applied?

Comment 13 Paolo Bonzini 2019-04-12 15:11:17 UTC
I couldn't find the erratum in the document for Xeon E5 v3 processors (https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-spec-update.html). It was there in the Xeon E5 (Sandy Bridge) update, where it is listed as BT48.

https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf

The upstream report was for a Xeon E5 v4 (Broadwell) processor.

I agree with Bandan that a BIOS update is still a good idea.

Comment 36 Paolo Bonzini 2019-07-18 21:18:21 UTC
Bandan, you probably should have it pr_err the current VMCS address before crashing.

In the meantime, David's suggested experiment is a great one!
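
As a sketch of the kind of one-off debug patch being suggested: a fragment meant to sit inside arch/x86/kvm/vmx.c, where struct vcpu_vmx and struct loaded_vmcs are already defined. The field names follow the upstream loaded_vmcs bookkeeping; the exact hook point and names in the RHEL 7 kernel would need to be verified before use.

/*
 * Hypothetical debug fragment for arch/x86/kvm/vmx.c: log which VMCS was
 * current, and whether KVM believed it was launched, right before reporting
 * the failed entry.
 */
static void vmx_dump_failed_entry(struct vcpu_vmx *vmx)
{
    pr_err("KVM: failed VM entry: loaded_vmcs=%p (phys %llx), launched=%d, cpu=%d\n",
           vmx->loaded_vmcs->vmcs,
           (unsigned long long)__pa(vmx->loaded_vmcs->vmcs),
           vmx->loaded_vmcs->launched,
           vmx->loaded_vmcs->cpu);
}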

Comment 47 Red Hat Bugzilla 2023-09-15 00:16:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days