Description of problem:
100 Windows VMs are failing to start; Windows shows a BSOD saying “inaccessible boot device”.

Version-Release number of selected component (if applicable):
2.5.4

How reproducible:
Unknown. It happened once.

Steps to Reproduce:
1. Bounce the OSDs backing the VM disks (the customer states the issue occurred after he bounced 4 OSDs).

Actual results:
100 VMs were impacted and they fail to start since they cannot find the boot device.

Expected results:
VMs should not be impacted. (The perception is that there may be corruption on the disks and hence Windows is not booting despite reaching the boot stage. It is not clear whether bouncing the OSDs is what caused the issue/corruption.)

Additional info:
On Monday 5/3, Ceph was in a warning state:

  cluster:
    id:     ec3fa060-3c4f-4754-88e4-5e1a09b8d69c
    health: HEALTH_WARN
            5 slow ops, oldest one blocked for 184764 sec, mon.a has slow ops

  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: a(active, since 13d)
    osd: 96 osds: 96 up (since 2d), 96 in (since 2w)

In the mon logs the customer noticed timeouts and bounced 4 OSDs (osd.66, osd.65, osd.67, osd.7); after that, the status went to OK:

sh-4.4# ceph -s
  cluster:
    id:     ec3fa060-3c4f-4754-88e4-5e1a09b8d69c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: a(active, since 13d)
    osd: 96 osds: 96 up (since 64s), 96 in (since 2w)

After the above, the customer started seeing 100 VMs crashing. From the Windows VM perspective it seems obvious that they are crashing because they cannot find the boot disk (“inaccessible boot device”).

From the must-gather I see that 96 OSDs are running, and it is a bit weird that all of them have been running for either "8d" or "19d". There are other failed OSDs, but since the customer reported 96 in a healthy state, I am assuming we can ignore those failed ones for this issue.

What we need to find out:
Why are the VMs not booting?
Is it due to disk corruption? (If so, would the customer bouncing 4 OSDs while the VMs were running cause corruption?)

The SFDC case the customer opened for OpenShift is 02932868 and it has OCS and CNV logs. SFDC case 02887571 is for the mon issue they noticed.
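For whoever picks this up next: a few commands that could capture more detail about the slow ops and the bounced OSDs. These are standard Ceph CLI calls (run from the rook-ceph toolbox, or wherever the ceph CLI is available); treat them as a suggestion rather than something already collected for this case:

  ceph health detail   # lists each slow op and which daemon reports it
  ceph osd tree        # confirms the up/in state of osd.7, osd.65, osd.66 and osd.67
  ceph osd perf        # per-OSD commit/apply latency, useful for spotting a slow device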
Could you please share the full QEMU log and the libvirt log (ideally with debug logging turned on) for the failed VM?
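In case it helps, a minimal sketch of the standard libvirt daemon settings used to capture debug-level logs (these are plain libvirtd options; how exactly they get injected into the virt-launcher environment is not shown here and may differ per CNV version):

  # libvirtd.conf (or virtqemud.conf on modular daemons)
  log_filters="1:qemu 1:libvirt 3:object 3:json 3:event 1:util"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"

The per-domain QEMU log normally ends up in /var/log/libvirt/qemu/<domain>.log.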
Oh, thanks to Jarda, who had better eyes when looking at the command line, we have a possibly interesting new point for the investigation: the VM is configured to boot from the network. Note the bootindex= option on both the block and the net device:

-device virtio-blk-pci,scsi=off,bus=pcie.0,addr=0x2,drive=libvirt-1-format,id=ua-os-disk,bootindex=2,write-cache=on
-device virtio-net-pci,mq=on,vectors=6,host_mtu=1500,netdev=hostua-vnic0,id=ua-vnic0,mac=00:50:56:01:de:8e,bus=pcie.0,addr=0x3,bootindex=1

Is this expected?
Yes, it is. The VM boots iPXE first, which then falls back to local boot.
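For reference, those bootindex= values correspond to per-device boot order elements in the libvirt domain XML, roughly like this (a sketch, trimmed to the boot-relevant bits of the command line above):

  <interface type='...'>
    ...
    <boot order='1'/>   <!-- network/iPXE tried first -->
  </interface>
  <disk type='...' device='disk'>
    ...
    <boot order='2'/>   <!-- ua-os-disk tried second, i.e. the local-boot fallback -->
  </disk>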
The fact that the Windows bootloader executed could indicate that netboot succeeded and that the issue occurred when the Windows bootloader attempted to access the virtio-blk device. Is it possible to confirm that network boot succeeded by checking the network boot server logs to see whether the VM successfully downloaded the PXE image? If netboot did not occur then the problem is more complicated than an inaccessible disk.

The QEMU virtio-blk configuration is straightforward and looks fine. It is unlikely that bouncing the OSDs caused corruption at the QEMU level. QEMU is simply performing preadv(2)/pwritev(2)/fdatasync(2) system calls against the host block device. There is no disk image file with metadata that could become corrupted here.

If you want to check that the VM's disk is accessible, try running

  dd if=/dev/os-disk of=/dev/null bs=64k count=10

from inside the VM's container, or an equivalent command on the node. If the host kernel rbd driver is healthy it will read 640 KB from the start of the disk successfully.

> How reproducible:
>
> Unknown. It happened once

Did the VMs boot successfully after restarting them again? What about a cold boot (stopping them and then starting them again)? If the answer to either question is yes then there is no persistent disk corruption.
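A hedged example of how that read test could be run without entering the guest, going through the virt-launcher pod instead (the namespace, pod name and the "compute" container name are typical for a CNV deployment and should be adjusted to the failing VM):

  oc exec -n <namespace> virt-launcher-<vm-name>-<hash> -c compute -- \
      dd if=/dev/os-disk of=/dev/null bs=64k count=10
  # 10 records in/out (640 KB) means the host rbd mapping is readable;
  # an I/O error here points at the storage path rather than the guest.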
David Gilbert mentioned that setting the disk error policy could help in the future. It's a recent feature: https://github.com/kubevirt/kubevirt/issues/4799

QEMU's default is to report I/O errors to the guest. The exception is ENOSPC on write, which pauses the guest so the administrator can provision additional storage and then resume the guest. If the VM should be isolated from network storage issues then an error policy that pauses the guest could be helpful.

There is also ongoing work in upstream QEMU to implement a retry timeout so that requests that fail against network storage are retried automatically. Currently a retry only happens when a paused guest is resumed explicitly by the user or by management software.
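For illustration, the QEMU-level knobs behind such a policy are the werror/rerror properties on the block device. A sketch of what a "pause on I/O error" configuration would look like for the virtio-blk device quoted earlier (only the last two properties are new, the rest is copied from the command line above):

  -device virtio-blk-pci,scsi=off,bus=pcie.0,addr=0x2,drive=libvirt-1-format,id=ua-os-disk,bootindex=2,write-cache=on,werror=stop,rerror=stop

With werror=stop/rerror=stop, QEMU pauses the guest and emits a BLOCK_IO_ERROR QMP event instead of completing the request with an error inside the guest.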
Please attach the PR related to this issue.
PR https://github.com/kubevirt/kubevirt/pull/4840 is attached.

I agree with Stefan and David. Changing the error policy to 'stop' means that QEMU will pause VM execution when the hypervisor encounters I/O errors. When this happens, KubeVirt will attempt to resume the VM automatically. The effect is that hypervisor I/O errors are not passed to the VM operating system. The guest will still notice the pause; that is a remaining rough edge that should be handled better.
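For completeness, a sketch of what the 'stop' policy looks like at the libvirt layer (these are standard libvirt disk driver attributes; the exact XML generated by virt-launcher once the PR is in place may differ):

  <disk type='block' device='disk'>
    <driver name='qemu' type='raw' error_policy='stop' rerror_policy='stop'/>
    ...
  </disk>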
Verified that the VM is paused when there is a hypervisor I/O error:

Events:
  Type     Reason            Age                 From                       Message
  ----     ------            ----                ----                       -------
  Normal   SuccessfulCreate  78s                 virtualmachine-controller  Created virtual machine pod virt-launcher-vm-block-966z4
  Normal   Started           67s                 virt-handler               VirtualMachineInstance started.
  Warning  IOerror           36s (x5 over 59s)   virt-handler               VM Paused due to IO error at the volume: ioerror-disk
  Normal   Created           13s (x13 over 67s)  virt-handler               VirtualMachineInstance defined.

Also ran a round of regression tests for this fix; no other issues were found. Moving the bug to VERIFIED.
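In case someone wants to repeat the check, a hedged example of the commands for observing the paused state and resuming the VMI once the storage path is healthy again (the VMI name vm-block is taken from the test above):

  kubectl describe vmi vm-block     # conditions and events show the IOerror / Paused state
  virtctl unpause vmi vm-block      # resume manually once storage has recovered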
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2920