In today's Rawhide openQA tests, several tests failed to boot properly at some point during the test. Some just show a blank screen (because they're doing a quiet mode graphical boot, I think), but several show kernel tracebacks running through SCSI code. I've seen three variants so far. Two have identical-looking tracebacks but slightly different error messages: https://openqa.fedoraproject.org/tests/60571#step/_console_wait_login/7 and https://openqa.fedoraproject.org/tests/60572#step/_console_wait_login/6 (note that one error is 'unable to handle kernel paging request' and the other is 'unable to handle kernel NULL pointer dereference', but the tracebacks look very similar). A third case shows a somewhat different traceback: https://openqa.fedoraproject.org/tests/60574#step/_console_wait_login/4 but it doesn't show an error message (it may just be in the backscroll; unfortunately there's no way to recover it from that test now). The traceback still appears to be in the same general area, though, so it may be the same problem. I'm pretty sure this is new with kernel-4.11.0-0.rc1.git0.1.fc27; the tests from the 20170306.n.0 compose don't show any of the same crashes. Proposing as an Alpha blocker as a conditional violation of any basic boot or install criterion; if the kernel crashes during boot, obviously those are all violated. This may be virt-only, I guess; I haven't confirmed that yet.
Actually not proposing as an Alpha blocker yet, as F26 Alpha is now in freeze and Bodhi is activated, so the new kernel isn't in F26 composes yet and wouldn't be unless it got a freeze exception.
Actually, nirik says this will probably get into the next F26 compose, so I will propose it as an Alpha blocker.
There's at least one SCSI bug which is still being discussed upstream: https://marc.info/?l=linux-kernel&m=148891054317472 so this _might_ be related to that?
Picture my head. Are you picturing it? Good. Now look up. Now look further up. Now look so far up, your neck starts hurting. That's about where that discussion is. :P So...I guess we just wait on upstream, is that what you're saying?
Bit more on this: it seems to be linked to having a virtio-scsi device. openQA attaches ISOs to a virtio-scsi optical drive:

16:58:05.1134 9173 starting: /usr/bin/qemu-kvm -serial file:serial0 -soundhw ac97 -global isa-fdc.driveA= -vga qxl -m 2048 -cpu Nehalem -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=raid/l1,cache=unsafe,if=none,id=hd1,format=qcow2 -drive media=cdrom,if=none,id=cd0,format=raw,file=/var/lib/openqa/share/factory/iso/Fedora-Server-dvd-x86_64-26-20170309.n.0.iso -device scsi-cd,drive=cd0,bus=scsi0.0 -boot once=d,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 2 -enable-kvm -no-shutdown -vnc :95,share=force-shared -qmp unix:qmp_socket,server,nowait -monitor unix:hmp_socket,server,nowait -S -monitor telnet:127.0.0.1:20052,server,nowait

Note the '-device virtio-scsi-pci,id=scsi0' and '-drive media=cdrom,if=none,id=cd0,format=raw,file=/var/lib/openqa/share/factory/iso/Fedora-Server-dvd-x86_64-26-20170309.n.0.iso -device scsi-cd,drive=cd0,bus=scsi0.0' in there - taken together, those three args set up a virtio-scsi optical drive with the ISO file attached to it. In local testing, I can't reproduce this bug with a VM that uses an *IDE* optical drive (which is, I think, the default setup for virt-manager - I'm not sure about boxes), but I can reproduce it with a VM that uses a *SCSI* optical drive. That is, in virt-manager, I removed the 'IDE CDROM 1' device and added a new device, 'Device type' set to 'CDROM device', 'Bus type' set to 'SCSI'. After doing that, I can reproduce the bug just by booting a few times. There does not actually need to be an ISO attached to the drive, it just has to exist. So this may not be an Alpha blocker, since openQA is using a configuration which doesn't match the virt-manager default. Still seems like a problem, though.
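For anyone who wants to check whether one of their own VMs has the affected setup, here is a rough Python sketch that scans a qemu command line for a 'scsi-cd' device sitting on a 'virtio-scsi-pci' controller, i.e. the same three-argument combination described above. The function names are made up for illustration, and the parsing is deliberately naive: it assumes no escaped commas (',,') in option values and only looks at 'scsi-cd' devices that carry an explicit 'bus=' property.

```python
import shlex


def parse_props(value):
    """Split 'a,key=val,...' into a dict of the key=val items (naive: no ',,' escapes)."""
    return dict(item.split("=", 1) for item in value.split(",") if "=" in item)


def virtio_scsi_cdroms(cmdline):
    """Return (drive id, backing file or None) for each scsi-cd on a virtio-scsi-pci bus."""
    args = shlex.split(cmdline)
    controllers = set()   # ids of virtio-scsi-pci controllers
    drives = {}           # -drive definitions keyed by their id
    cdroms = []           # properties of each scsi-cd device
    for flag, value in zip(args, args[1:]):
        if flag == "-device":
            driver = value.split(",", 1)[0]
            props = parse_props(value)
            if driver == "virtio-scsi-pci" and "id" in props:
                controllers.add(props["id"])
            elif driver == "scsi-cd":
                cdroms.append(props)
        elif flag == "-drive":
            props = parse_props(value)
            if "id" in props:
                drives[props["id"]] = props
    found = []
    for cd in cdroms:
        # 'bus=scsi0.0' refers to bus 0 of the controller with id 'scsi0'
        controller = cd.get("bus", "").split(".", 1)[0]
        if controller in controllers:
            drive_id = cd.get("drive")
            found.append((drive_id, drives.get(drive_id, {}).get("file")))
    return found
```

Feeding it the openQA command line quoted above should report the 'cd0' drive together with the ISO path, while a VM that only has an IDE optical drive should give an empty list.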
Discussed in today's Blocker review meeting: Rejected as a Blocker but Accepted as an FE. This isn't considered serious enough to block release because it only seems to affect VMs booted with a SCSI virtio device, and other methods work. We would consider a patch to fix this before release provided it's tested and focused on just this issue.
Just to give a status update here:
* there were bugs in virtio-scsi that afaics have been fixed in rc2. See https://bugzilla.kernel.org/show_bug.cgi?id=194837 for details
* the big vhost/virtio merge during the merge window of 4.11 also contained a problem that still shows up in my setup; see https://bugzilla.kernel.org/show_bug.cgi?id=194911 for details. First bad commit is "virtio_pci: use shared interrupts for virtqueues" (5c34d002dcc7).
Side note: I wonder if I'm the only one seeing the remaining problem. Adam, did you by chance give rc2 or later a try with the openqa setup where you saw problems earlier?
I've seen a couple of hangs in local testing which I think may be what you're talking about. I don't think any of the openQA tests have hit them.