Bug 1957423 - 100 Windows VMs are failing to start with a Windows BSOD saying “inaccessible boot device”
Summary: 100 Windows VMs are failing to start with a Windows BSOD saying “inaccessible ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.5.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Adam Litke
QA Contact: Yan Du
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-05 18:32 UTC by Anand Paladugu
Modified: 2024-12-20 20:00 UTC
CC List: 17 users

Fixed In Version: virt-controller v4.8.0-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 14:31:30 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 4840 0 None closed Generate K8s events on IO errors 2021-05-12 17:45:30 UTC
Red Hat Product Errata RHSA-2021:2920 0 None None None 2021-07-27 14:32:08 UTC

Description Anand Paladugu 2021-05-05 18:32:40 UTC
Description of problem: 100 Windows VMs are failing to start with a Windows BSOD saying “inaccessible boot device”


Version-Release number of selected component (if applicable): 2.5.4


How reproducible:

Unknown. It happened once


Steps to Reproduce:
1. Bounce the OSDs serving the VM disks (the customer states that the issue occurred after they bounced 4 OSDs)
2.
3.

Actual results:

100 VMs were impacted and they fail to start since they cannot find the boot device.

Expected results:

VMs should not be impacted.

(The perception is that there may be corruption on the disks, and hence Windows is not booting despite getting to the boot stage. It is not clear whether bouncing the OSDs is what caused the issue/corruption.)

Additional info:

On Monday 5/3, Ceph was in warning state.


cluster:
    id:     ec3fa060-3c4f-4754-88e4-5e1a09b8d69c
    health: HEALTH_WARN
            5 slow ops, oldest one blocked for 184764 sec, mon.a has slow ops


  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: a(active, since 13d)
    osd: 96 osds: 96 up (since 2d), 96 in (since 2w)



In the mon logs the customer noticed timeouts and bounced 4 OSDs (osd.66, osd.65, osd.67, osd.7); after that, the status went back to HEALTH_OK.


sh-4.4# ceph -s
  cluster:
    id:     ec3fa060-3c4f-4754-88e4-5e1a09b8d69c
    health: HEALTH_OK


  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: a(active, since 13d)
    osd: 96 osds: 96 up (since 64s), 96 in (since 2w)



After the above, the customer started seeing 100 VMs crashing. From the Windows VM perspective, it seems obvious that they are crashing because they cannot find the boot disk (“inaccessible boot device”).

From the must-gather I see that 96 OSDs are running, and it is a bit odd that all of them have been running for either "8d" or "19d". There are other failed OSDs, but since the customer reported 96 in a healthy state, I am assuming we can ignore the failed ones for this issue.
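As a side note, a quick way to cross-check those OSD uptimes is to look at the rook-ceph OSD pod ages; the namespace, label, and toolbox pod name below are the usual OCS defaults, so treat them as assumptions:

# OSD pod ages (assumes the default OCS namespace and rook labels)
oc get pods -n openshift-storage -l app=rook-ceph-osd

# Ceph's own view, run from the toolbox pod (pod name is a placeholder)
oc -n openshift-storage rsh <rook-ceph-tools-pod> ceph osd stat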


What we need to find out:

Why are the VMs not booting? Is it due to disk corruption? (If so, could bouncing 4 OSDs while the VMs were running have caused the corruption?)

The SFDC case the customer opened for OpenShift is 02932868; it has the OCS and CNV logs. SFDC case 02887571 is for the mon issue they noticed.

Comment 6 Jiri Denemark 2021-05-06 09:55:23 UTC
Could you please share the full QEMU log and the libvirt log (ideally with debug
turned on) for the failed VM?

Comment 13 Jiri Denemark 2021-05-06 13:10:07 UTC
Oh, thanks to Jarda, who had better eyes when looking at the command line, we
have a new, possibly interesting point for the investigation: the VM is
configured to boot from the network. Note the bootindex= option on both the
block and the net device:

-device virtio-blk-pci,scsi=off,bus=pcie.0,addr=0x2,drive=libvirt-1-format,id=ua-os-disk,bootindex=2,write-cache=on
-device virtio-net-pci,mq=on,vectors=6,host_mtu=1500,netdev=hostua-vnic0,id=ua-vnic0,mac=00:50:56:01:de:8e,bus=pcie.0,addr=0x3,bootindex=1

Is this expected?

Comment 14 Fabian Deutsch 2021-05-06 13:11:51 UTC
Yes, it is.

It boots iPXE, which then falls back to local boot.

Comment 15 Stefan Hajnoczi 2021-05-06 14:21:44 UTC
The fact that the Windows bootloader executed could indicate that netboot succeeded and then the issue occurred when the Windows bootloader attempted to access the virtio-blk device. Is it possible to confirm that network boot succeeded by checking the network boot server logs to see whether the VM successfully downloaded the PXE image? If netboot did not occur, then the problem is more complicated than an inaccessible disk.

The QEMU virtio-blk configuration is straightforward and looks fine. It is unlikely that bouncing the OSDs caused corruption at the QEMU level. QEMU is simply performing preadv(2)/pwritev(2)/fdatasync(2) system calls to the host block device. There is no disk image file with metadata that could become corrupted here.

If you want to check that the VM's disk is accessible, try running dd if=/dev/os-disk of=/dev/null bs=64k count=10 from inside the VM's container or an equivalent command on the node. If the host kernel rbd driver is healthy it will read 640 KB from the start of the disk successfully.
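For example, something along these lines should work from the virt-launcher pod (the namespace, pod name, and the "compute" container name are placeholders/assumptions, not taken from this case):

# Read the first 640 KB of the guest's boot disk from inside the virt-launcher pod
oc exec -n <vm-namespace> <virt-launcher-pod> -c compute -- \
    dd if=/dev/os-disk of=/dev/null bs=64k count=10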

> How reproducible:
> 
> Unknown. It happened once

Did the VMs boot successfully after restarting them again? What about cold boot (stopping them and then starting them again)? If the answer to either question is yes then there is no persistent disk corruption.
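If it helps, both variants can be driven with virtctl, roughly like this (the VM name is a placeholder):

# Warm reboot of the guest
virtctl restart <vm-name>

# Cold boot: stop the VM completely, then start it again
virtctl stop <vm-name>
virtctl start <vm-name>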

Comment 16 Stefan Hajnoczi 2021-05-06 14:30:26 UTC
David Gilbert mentioned that setting the disk error policy could help in the future. It's a recent feature:
https://github.com/kubevirt/kubevirt/issues/4799

QEMU's default is to report I/O errors to the guest. The exception is ENOSPC on write, which pauses the guest so the administrator can provision additional storage and then resume the guest.

If the VM should be isolated from network storage issues then an error policy that pauses the guest could be helpful.
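In terms of the command line quoted in comment 13, that would look roughly like the following; this is a sketch of the generic QEMU/libvirt knobs, not the exact options KubeVirt generates:

# Default behaviour: read/write errors are reported to the guest
-device virtio-blk-pci,drive=libvirt-1-format,id=ua-os-disk,...

# Error policy that pauses the guest on I/O errors instead of reporting them
-device virtio-blk-pci,drive=libvirt-1-format,id=ua-os-disk,werror=stop,rerror=stop,...

# libvirt expresses the same thing with error_policy='stop' on the disk <driver> element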

There is also ongoing work in upstream QEMU to implement a retry time so that network storage automatically retries failed requests. Currently retry only happens when a paused guest is resumed explicitly by the user or management software.

Comment 17 Yan Du 2021-05-12 12:22:29 UTC
Please attach the PR related to this issue.

Comment 18 Adam Litke 2021-05-12 18:02:26 UTC
PR https://github.com/kubevirt/kubevirt/pull/4840 is attached. I agree with Stefan and David. Changing the error policy to 'stop' means that QEMU will pause VM execution when the hypervisor encounters I/O errors. When this happens, KubeVirt will attempt to resume the VM automatically. The effect is that hypervisor I/O errors are not passed to the VM operating system. The VM will notice the pause, but the guest should handle that better than an I/O error.
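One way to confirm the new policy actually lands on a running VM is to dump the generated libvirt domain XML from the virt-launcher pod and look for error_policy (the namespace, pod, and domain names below are placeholders):

# Dump the generated domain XML and check the disk driver attributes
oc exec -n <vm-namespace> <virt-launcher-pod> -c compute -- \
    virsh dumpxml <domain-name> | grep error_policy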

Comment 23 Yan Du 2021-07-06 10:45:46 UTC
We can see the VM get paused when there is a hypervisor I/O error:
Events:
  Type     Reason            Age                 From                       Message
  ----     ------            ----                ----                       -------
  Normal   SuccessfulCreate  78s                 virtualmachine-controller  Created virtual machine pod virt-launcher-vm-block-966z4
  Normal   Started           67s                 virt-handler               VirtualMachineInstance started.
  Warning  IOerror           36s (x5 over 59s)   virt-handler               VM Paused due to IO error at the volume: ioerror-disk
  Normal   Created           13s (x13 over 67s)  virt-handler               VirtualMachineInstance defined.
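A quick way to pull just these events from the CLI (the namespace and VMI name are placeholders):

# All events for the VMI, including the IOerror warnings from virt-handler
oc describe vmi <vmi-name> -n <namespace>

# Or filter the namespace events by reason
oc get events -n <namespace> --field-selector reason=IOerror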


Also did a round of regression tests for this fix; no further issues were found.

Moving the bug to VERIFIED.

Comment 26 errata-xmlrpc 2021-07-27 14:31:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

