1430043 – VMs with virtio-scsi devices often crash during boot with traceback running through scsi code since kernel-4.11.0-0.rc1.git0.1.fc27

Keywords:
Status:	NEW
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	RejectedBlocker,AcceptedFreezeException
Depends On:
Blocks:	F26AlphaBlocker

Reported:	2017-03-07 18:06 UTC by Adam Williamson
Modified:	2020-01-17 22:32 UTC
CC List:	10 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type:	Bug
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	194837	0	None	None	None	2017-03-10 00:41:58 UTC
Linux Kernel	194911	0	None	None	None	2017-03-17 19:47:30 UTC

Description Adam Williamson 2017-03-07 18:06:50 UTC

In today's Rawhide openQA tests, several tests failed to boot properly at some point during the test. Some just show a blank screen (because they're doing a quiet mode graphical boot, I think) but several show kernel tracebacks running through SCSI code. I've seen three variants so far. Two have identical-looking tracebacks but a slightly different error message:

https://openqa.fedoraproject.org/tests/60571#step/_console_wait_login/7
https://openqa.fedoraproject.org/tests/60572#step/_console_wait_login/6

Note that one error is 'unable to handle kernel paging request' and the other is 'unable to handle kernel NULL pointer dereference', but the tracebacks look very similar.

Another case shows a somewhat different traceback:

https://openqa.fedoraproject.org/tests/60574#step/_console_wait_login/4

but doesn't show an error message (it may just be in the backscroll; unfortunately, there's no way to recover it from that test now). The traceback still appears to be in the same general area, though, so this may be the same problem.

I'm pretty sure this is new with kernel-4.11.0-0.rc1.git0.1.fc27; the tests from the 20170306.n.0 compose don't show any of the same crashes.

Proposing as an Alpha blocker as a conditional violation of any basic boot or install criterion; if the kernel crashes during boot, obviously those are all violated. This may be virt-only, I guess; I haven't confirmed yet.

Comment 1 Adam Williamson 2017-03-07 18:09:32 UTC

Actually not proposing as an Alpha blocker yet, as F26 Alpha is now in freeze and Bodhi is activated, so the new kernel isn't in F26 composes yet and wouldn't be unless it got a freeze exception.

Comment 2 Adam Williamson 2017-03-07 18:14:26 UTC

Actually, nirik says this will probably get into the next F26 compose, so I will propose it as an Alpha blocker.

Comment 3 Laura Abbott 2017-03-07 18:46:48 UTC

There's at least one SCSI bug which is still being discussed (https://marc.info/?l=linux-kernel&m=148891054317472), so this _might_ be related to that?

Comment 4 Adam Williamson 2017-03-07 18:50:36 UTC

Picture my head. Are you picturing it? Good. Now look up. Now look further up. Now look so far up, your neck starts hurting. That's about where that discussion is. :P

So...I guess we just wait on upstream, are you saying?

Comment 5 Adam Williamson 2017-03-09 19:32:39 UTC

Bit more on this: it seems to be linked to having a virtio-scsi device. openQA attaches ISOs to a virtio-scsi optical drive:

16:58:05.1134 9173 starting: /usr/bin/qemu-kvm -serial file:serial0 -soundhw ac97 -global isa-fdc.driveA= -vga qxl -m 2048 -cpu Nehalem -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=raid/l1,cache=unsafe,if=none,id=hd1,format=qcow2 -drive media=cdrom,if=none,id=cd0,format=raw,file=/var/lib/openqa/share/factory/iso/Fedora-Server-dvd-x86_64-26-20170309.n.0.iso -device scsi-cd,drive=cd0,bus=scsi0.0 -boot once=d,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 2 -enable-kvm -no-shutdown -vnc :95,share=force-shared -qmp unix:qmp_socket,server,nowait -monitor unix:hmp_socket,server,nowait -S -monitor telnet:127.0.0.1:20052,server,nowait

Note the '-device virtio-scsi-pci,id=scsi0' and '-drive media=cdrom,if=none,id=cd0,format=raw,file=/var/lib/openqa/share/factory/iso/Fedora-Server-dvd-x86_64-26-20170309.n.0.iso -device scsi-cd,drive=cd0,bus=scsi0.0' in there - taken together, those three args set up a virtio-scsi optical drive with the ISO file attached to it.
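For reference, here is a minimal sketch that isolates just those three arguments from the full command line above. The function name `scsi_cd_args` and the placeholder ISO path are made up for illustration; the option strings themselves are copied from the openQA log. It only prints the arguments and does not start a VM.

```shell
# Placeholder ISO path (an assumption for illustration, not from the log).
ISO="${ISO:-/path/to/image.iso}"

# Print, one per line, the three option/value pairs that together create
# a virtio-scsi optical drive:
#   1. a virtio-scsi controller named scsi0 (its bus is scsi0.0)
#   2. a CD-ROM backing drive named cd0, not bound to any interface
#   3. a scsi-cd device that attaches drive cd0 to bus scsi0.0
# $1 is the path of the ISO file backing the drive.
scsi_cd_args() {
  printf '%s\n' \
    "-device" "virtio-scsi-pci,id=scsi0" \
    "-drive" "media=cdrom,if=none,id=cd0,format=raw,file=$1" \
    "-device" "scsi-cd,drive=cd0,bus=scsi0.0"
}

scsi_cd_args "$ISO"
```

The printed arguments would still need to be combined with the rest of a qemu-kvm invocation (memory, display, system disk and so on), as in the full openQA command.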

In local testing, I can't reproduce this bug with a VM that uses an *IDE* optical drive (which is, I think, the default setup for virt-manager - I'm not sure about Boxes), but I can reproduce it with a VM that uses a *SCSI* optical drive. That is, in virt-manager, I removed the 'IDE CDROM 1' device and added a new device, with 'Device type' set to 'CDROM device' and 'Bus type' set to 'SCSI'. After doing that, I can reproduce the bug just by booting a few times. There does not actually need to be an ISO attached to the drive; it just has to exist.

So this may not be an Alpha blocker, since openQA is using a configuration which doesn't match the virt-manager default. Still seems like a problem, though.

Comment 6 Mike Ruckman 2017-03-13 18:34:39 UTC

Discussed in today's Blocker review meeting: Rejected as a Blocker but Accepted as an FE. This isn't considered serious enough to block release because it only seems to affect VMs booted with a virtio-scsi device, and other methods work. We would consider a patch to fix this before release provided it's tested and focused on just this issue.

Comment 7 Thorsten Leemhuis 2017-03-17 19:46:54 UTC

Just to give a status update here:
* there were bugs in virtio-scsi that afaics have been fixed in rc2. See https://bugzilla.kernel.org/show_bug.cgi?id=194837 for details
* the big vhost/virtio merge during the merge window of 4.11 also contained a problem that still shows up in my setup; see https://bugzilla.kernel.org/show_bug.cgi?id=194911 for details. First bad commit is "virtio_pci: use shared interrupts for virtqueues" (5c34d002dcc7). 

Side note: I wonder if I'm the only one seeing the remaining problem. Adam, did you by chance give rc2 or later a try with the openQA setup where you saw problems earlier?

Comment 8 Adam Williamson 2017-03-17 20:08:35 UTC

I've seen a couple of hangs in local testing which I think may be what you're talking about. I don't think any of the openQA tests have hit them.
