Bug 1695596 - Instance shutdown without any reason
Summary: Instance shutdown without any reason
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Bandan Das
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-03 11:59 UTC by vivek koul
Modified: 2023-09-15 00:16 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-13 12:48:07 UTC
Target Upstream Version:
Embargoed:



Description vivek koul 2019-04-03 11:59:09 UTC
Description of problem:
The instance gets paused without any apparent reason

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.4 (Maipo)
libvirt-3.2.0-14.el7_4.7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.7.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
qemu-img-rhev-2.9.0-16.el7_4.13.x86_64
qemu-kvm-common-rhev-2.9.0-16.el7_4.13.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
erlang-kernel-18.3.4.1-1.el7ost.x86_64
kernel-3.10.0-693.17.1.el7.x86_64
kernel-3.10.0-693.el7.x86_64
kernel-devel-3.10.0-693.17.1.el7.x86_64
kernel-devel-3.10.0-693.el7.x86_64
kernel-headers-3.10.0-693.17.1.el7.x86_64
kernel-tools-3.10.0-693.17.1.el7.x86_64
kernel-tools-libs-3.10.0-693.17.1.el7.x86_64

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: 
VM is in a paused state

Expected results: 
VM should not be in a paused state

Additional info:
The VM got paused on Monday, 18 March.
The affected instance has been stable since it was restarted on the day of the issue.

Comment 3 Artom Lifshitz 2019-04-11 17:33:35 UTC
I don't see any bug here - either in Nova or in Libvirt. I can't speak to the cause of the "shutting down, reason=destroyed" line in the QEMU log because I don't have what came before it, but it could be something as simple as someone running `virsh destroy <instance>` on the host, or someone shutting down the instance (for example with a `poweroff` command) from within the guest.

Comment 4 Artom Lifshitz 2019-04-11 17:45:12 UTC
(In reply to Artom Lifshitz from comment #3)
> I don't see any bug here - either in Nova or in Libvirt. I can't speak to
> the cause of the "shutting down, reason=destroyed" line in the QEMU log
> because I don't have what came before it, but it could be something as simple
> as someone running `virsh destroy <instance>` on the host, or someone
> shutting down the instance (for example with a `poweroff` command) from
> within the guest.

I retract that. I opened the instance's QEMU log in the sosreports, and there's a KVM error/failure right before the instance is shut down. Currently talking to the virt folks to figure out the best course of action.

./60-sosreport-DOldham.02347882-20190401143137.tar.xz/sosreport-DOldham.02347882-20190401143137/var/log/libvirt/qemu/instance-00001a59.log

2019-01-29T15:40:01.625174Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/13 (label charserial1)
KVM: entry failed, hardware error 0x5
RAX=0000000000000040 RBX=ffffd00096fc0180 RCX=0000000000000082 RDX=0000000000000000
RSI=00000000ffffffff RDI=ffffe00183df3010 RBP=ffffd00093393ab9 RSP=ffffd000933939d8
R8 =00000000ffffffff R9 =0000000000000000 R10=0000000000000002 R11=0000000000000001
R12=0000000000000001 R13=0000000000000000 R14=00000d1e5028b06c R15=0000000000000000
RIP=fffff800f5ba581f RFL=00000286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
CS =0010 0000000000000000 00000000 00209b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
FS =0053 000000007ffe0000 0000bc00 0040f300 DPL=3 DS   [-WA]
GS =002b ffffd00096fc0000 ffffffff 00c0f300 DPL=3 DS   [-WA]
LDT=0000 0000000000000000 ffffffff 00000000
TR =0040 ffffd00096fd2000 00000067 00008b00 DPL=0 TSS64-busy
GDT=     ffffd00096fd3000 0000007f
IDT=     ffffd00096fd1000 00000fff
CR0=80050031 CR2=000000300027ef70 CR3=00000000001aa000 CR4=001506f8
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=00 00 00 00 00 48 83 ec 28 e8 57 36 ff ff 48 83 c4 28 fb f4 <c3> cc cc cc cc cc cc 66 66 0f 1f 84 00 00 00 00 00 0f 20 d0 0f 22 d0 c3 cc cc cc cc cc cc
2019-03-18T10:05:35.655670Z qemu-kvm: terminating on signal 15 from pid 5388 (<unknown process>)
2019-03-18 10:05:38.657+0000: shutting down, reason=destroyed
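
For context, the "KVM: entry failed, hardware error 0x5" line above is what QEMU prints when the kernel hands it a KVM_EXIT_FAIL_ENTRY exit; QEMU then dumps the guest register state seen in the rest of the log. The snippet below is a minimal, compilable stand-in for that reporting path, assuming simplified versions of the KVM UAPI structures (the real definitions live in <linux/kvm.h>); it is not the actual QEMU source.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the relevant part of struct kvm_run (<linux/kvm.h>). */
struct fake_kvm_run {
    uint32_t exit_reason;                       /* e.g. KVM_EXIT_FAIL_ENTRY */
    struct {
        uint64_t hardware_entry_failure_reason; /* 0x5 in the log above */
    } fail_entry;
};

#define FAKE_KVM_EXIT_FAIL_ENTRY 9u  /* the real constant is defined in <linux/kvm.h> */

/* Mimics how a KVM userspace reports a failed VM entry to its log. */
static void report_exit(const struct fake_kvm_run *run)
{
    if (run->exit_reason == FAKE_KVM_EXIT_FAIL_ENTRY) {
        fprintf(stderr, "KVM: entry failed, hardware error 0x%" PRIx64 "\n",
                run->fail_entry.hardware_entry_failure_reason);
        /* A real VMM would follow this with a dump of the guest CPU state,
         * which is where the RAX=/RBX=... block above comes from. */
    }
}

int main(void)
{
    struct fake_kvm_run run = {
        .exit_reason = FAKE_KVM_EXIT_FAIL_ENTRY,
        .fail_entry  = { .hardware_entry_failure_reason = 0x5 },
    };
    report_exit(&run);
    return 0;
}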

Comment 6 Artom Lifshitz 2019-04-11 18:16:55 UTC
Since this looks like a legitimate bug in KVM (or at least warrants more investigation), but is not a Nova bug, I've moved it to the RHEL 7.4 product, keeping the qemu-kvm-rhev component.

Comment 7 Dr. David Alan Gilbert 2019-04-11 19:01:17 UTC
Similarity to upstream kernel bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=197813

So in our case it always looks like the same host?
And most of the Code lines seem the same each time.

Very odd.

Comment 9 Paolo Bonzini 2019-04-12 12:50:52 UTC
Error 0x5 is "VMRESUME with non-launched VMCS".

Comment 10 Paolo Bonzini 2019-04-12 13:00:43 UTC
What host processor is this?  I found this erratum from 2012:

> BF168. VM Entries That Return From SMM Using VMLAUNCH May Not Update The Launch State of the VMCS
>
> Problem: Successful VM entries using the VMLAUNCH instruction should set the launch state of the
> VMCS to "launched". Due to this erratum, such a VM entry may not update the launch state of the
> current VMCS if the VM entry is returning from SMM.
>
> Implication: Subsequent VM entries using the VMRESUME instruction with this VMCS will fail.
> RFLAGS.ZF is set to 1 and the value 5 (indicating VMRESUME with non-launched VMCS) is stored in
> the VM-instruction error field. This erratum applies only if dual monitor treatment of SMI and
> SMM is active.
>
> Workaround: None identified.
>
> Status: For the steppings affected, see the Summary Table of Changes

And also a similar erratum BK85 where the workaround is "It is possible for the BIOS to contain a workaround for this erratum".

Since this only appeared twice in several years, it is not unreasonable that it could be this erratum or a similar one.
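
To make the failure mode described by this erratum concrete: VMLAUNCH is supposed to flip the VMCS launch state to "launched", and if that update is lost (as the erratum allows when the entry returns from SMM), the next VMRESUME fails with VM-instruction error 5, which is exactly the "hardware error 0x5" in the instance log. The following is a small, self-contained C model of that rule only; the names are invented for illustration and this is not KVM code.

#include <stdbool.h>
#include <stdio.h>

enum launch_state { VMCS_CLEAR, VMCS_LAUNCHED };

/* Per comment 9: VM-instruction error 5 = "VMRESUME with non-launched VMCS". */
#define VMERR_VMRESUME_NON_LAUNCHED 5

struct vmcs_model { enum launch_state launch; };

/* VMLAUNCH should set the launch state to "launched"; the erratum says this
 * update can be missed when the VM entry is returning from SMM. */
static void do_vmlaunch(struct vmcs_model *v, bool erratum_hit)
{
    if (!erratum_hit)
        v->launch = VMCS_LAUNCHED;   /* normal, architected behaviour */
    /* erratum_hit: the entry itself succeeds but the state stays "clear" */
}

/* VMRESUME requires a launched VMCS; otherwise it fails with error 5. */
static int do_vmresume(const struct vmcs_model *v)
{
    return (v->launch == VMCS_LAUNCHED) ? 0 : VMERR_VMRESUME_NON_LAUNCHED;
}

int main(void)
{
    struct vmcs_model v = { .launch = VMCS_CLEAR };

    do_vmlaunch(&v, true);            /* erratum: launch state not updated */
    int err = do_vmresume(&v);        /* the next VMRESUME now fails */
    if (err)
        printf("VM-instruction error %d (VMRESUME with non-launched VMCS)\n", err);
    return 0;
}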

Comment 11 Artom Lifshitz 2019-04-12 13:08:40 UTC
(In reply to Paolo Bonzini from comment #10)
> What host processor is this?  I found this erratum from 2012:

/proc/cpuinfo is in the sosreports for more info, but the model is this:

model name      : Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz

> 
> > BF168. VM Entries That Return From SMM Using VMLAUNCH May Not Update The Launch State of the VMCS
> >
> > Problem: Successful VM entries using the VMLAUNCH instruction should set the launch state of the
> > VMCS to "launched". Due to this erratum, such a VM entry may not update the launch state of the
> > current VMCS if the VM entry is returning from SMM.
> >
> > Implication: Subsequent VM entries using the VMRESUME instruction with this VMCS will fail.
> > RFLAGS.ZF is set to 1 and the value 5 (indicating VMRESUME with non-launched VMCS) is stored in
> > the VM-instruction error field. This erratum applies only if dual monitor treatment of SMI and
> > SMM is active.
> >
> > Workaround: None identified.
> >
> > Status: For the steppings affected, see the Summary Table of Changes
> 
> And also a similar erratum BK85 where the workaround is "It is possible for
> the BIOS to contain a workaround for this erratum".
> 
> Since this only appeared twice in several years, it is not unreasonable that
> it could be this erratum or a similar one.

Comment 12 Bandan Das 2019-04-12 13:29:13 UTC
Based on comment 10, is it possible to check whether there's a BIOS update available for the affected system, and can it be applied?

Comment 13 Paolo Bonzini 2019-04-12 15:11:17 UTC
I couldn't find the erratum in the document for Xeon E5 v3 processors (https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-spec-update.html). It was there in the Xeon E5 (Sandy Bridge) update, where it is listed as BT48.

https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf

The upstream report was for a Xeon E5 v4 (Broadwell) processor.

I agree with Bandan that a BIOS update is still a good idea.

Comment 36 Paolo Bonzini 2019-07-18 21:18:21 UTC
Bandan, you probably should have it pr_err the current VMCS address before crashing.

In the meantime, David's suggested experiment is a great one!
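
A rough illustration of the diagnostic Paolo is suggesting in comment 36: log the address of the currently loaded VMCS when the failed entry is detected, so repeated failures can be correlated with a specific VMCS before the guest is torn down. The snippet is a compilable userspace sketch with stand-in types loosely named after KVM's vcpu_vmx/loaded_vmcs bookkeeping; it is not an actual kernel patch.

#include <stdio.h>

/* Userspace stand-in for the kernel's pr_err(); here it just writes to stderr. */
#define pr_err(fmt, ...) fprintf(stderr, fmt, __VA_ARGS__)

/* Simplified stand-ins for KVM's VMX bookkeeping; not the real structures. */
struct loaded_vmcs { void *vmcs; };
struct vcpu_vmx    { struct loaded_vmcs *loaded_vmcs; int fail; };

/* The suggested diagnostic: when a VM entry has failed, report which VMCS
 * was current at the time, before the guest is torn down. */
static void report_failed_entry(const struct vcpu_vmx *vmx)
{
    if (vmx->fail)
        pr_err("KVM: failed VM entry, current VMCS at %p\n",
               vmx->loaded_vmcs->vmcs);
}

int main(void)
{
    int fake_vmcs_page;                              /* stands in for the VMCS region */
    struct loaded_vmcs lv  = { .vmcs = &fake_vmcs_page };
    struct vcpu_vmx    vmx = { .loaded_vmcs = &lv, .fail = 1 };

    report_failed_entry(&vmx);
    return 0;
}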

Comment 47 Red Hat Bugzilla 2023-09-15 00:16:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

