1539330 – System crash on nested virt since Meltdown/Spectre patches

Summary: System crash on nested virt since Meltdown/Spectre patches

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	28
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-01-28 00:00 UTC by Adam Williamson
Modified:	2019-01-30 08:29 UTC
CC List:	21 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-01-30 08:29:29 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Adam Williamson 2018-01-28 00:00:54 UTC

Since the Spectre/Meltdown fixes landed in the kernel (we think), it seems like any attempt at nested virt will cause the host VM to crash some relatively short time after the virt guest is launched.

tflink has run into this with taskotron, and I have with openqa. Tim talked about it a little with Patrick, and Patrick seemed to think it was likely related to the Spectre/Meltdown fixes (not sure which).

This is why openqa01 and openqa-stg01 have died shortly after 4am every day for the last little while; that's when createhdds kicks in and tries to create some disk images...which involves launching a VM (via virt-install).

Unfortunately the kernel trace does not appear to make it to the system logs, so I don't have one to attach ATM, but it looks like this should be relatively easy to reproduce.

Comment 1 Laura Abbott 2018-01-28 06:31:47 UTC

Can you try booting with nopti on the kernel command line in some combination of host/guest?
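
A minimal sketch for confirming that the parameter actually reached the kernel command line on a given machine (host or guest) before testing. It assumes a Linux system that exposes /proc/cmdline; the function names are hypothetical and chosen only for illustration. It treats `nopti` and the equivalent `pti=off` as requests to disable page table isolation, and it only inspects the command line, not the kernel's runtime state.

```python
from pathlib import Path


def pti_disabled(cmdline: str) -> bool:
    """Return True if the command line asks the kernel to disable PTI.

    Only exact parameters are matched, so a value such as "foo=nopti"
    does not count.
    """
    params = cmdline.split()
    return "nopti" in params or "pti=off" in params


def host_pti_disabled(path: str = "/proc/cmdline") -> bool:
    """Check the command line of the running kernel (Linux only)."""
    return pti_disabled(Path(path).read_text())
```

Running `host_pti_disabled()` in each layer (host, nested host, guest) shows which combination is actually being tested.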

Comment 2 Adam Williamson 2018-01-28 10:15:39 UTC

Will do - I'll just have to set up a reproducer somewhere outside of one of our automated testing systems :P

Comment 3 Justin M. Forbes 2018-01-29 13:20:13 UTC

I am assuming these are running fedora all the way down the chain? host/virt host/guest?

Comment 4 Tim Flink 2018-01-29 13:27:04 UTC

In my case, I was running an F27 VM inside an F27 VM on an EL7 virthost. As far as I know, Adam was running Fedora for all of the systems.

Comment 5 Adam Williamson 2018-01-29 20:54:55 UTC

Mine would be F26/F27 in F27 in ??? - I don't know what infra runs openqa01 and openqa-stg01 VMs (which are the 'F27' in that chain) on.

Sorry I still didn't get the reproducer, I was going to do it in the Super Secret Meeting this morning but we had to close laptops, then this afternoon I had a ton of meetings 'n' stuff. I'll do it soon, really I promise.

Comment 6 Adam Williamson 2018-02-13 21:50:34 UTC

Sorry for the delay, but I just tried this, and the crash still seems to happen even with 'nopti' on the kernel command line...this with kernel 4.14.16-300.fc27.

Comment 7 Adam Williamson 2018-02-22 22:21:36 UTC

Trying with 4.15.3 now.

Comment 8 Adam Williamson 2018-02-22 23:21:09 UTC

This looks to be fixed in 4.15.3! openQA staging has rebuilt 3 disk images without crashing so far. With 4.14 it would never even get one done. Marking as fixed, will re-open if it reoccurs.

Comment 9 Adam Williamson 2018-02-23 00:43:16 UTC

Spoke too soon, I just got lucky :( First attempt with 4.15.3 got through two images but crashed halfway through the third. After rebooting, another try crashed halfway through the first image.

Comment 10 Adam Williamson 2018-04-17 23:08:03 UTC

Note: still happening with 4.15.14-300.fc27.x86_64 at least. I just crashed the production openQA server three times by trying to re-generate an image on it, having forgotten about this bug...

Comment 11 Justin M. Forbes 2018-07-23 15:02:38 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.

Fedora 27 has now been rebased to 4.17.7-100.fc27.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 28, and are still experiencing this issue, please change the version to Fedora 28.

If you experience different issues, please open a new bug report for those.

Comment 12 Adam Williamson 2018-07-24 19:29:36 UTC

This is still valid with 4.17.7-200.fc28.x86_64. Just tested it on openqa-stg01, attempted two runs, system crashed both times.

Comment 13 Justin M. Forbes 2019-01-29 16:24:26 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.20.5-100.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.

If you experience different issues, please open a new bug report for those.

Comment 14 Adam Williamson 2019-01-30 08:29:29 UTC

I do believe this really did go away recently, as I'm pretty sure I did a disk image build on one of the servers by mistake and it worked. So I'm gonna close it again. Will reopen again if it turns out I was wrong again. :)
