The GRUB configuration for boot disk RAID assumes that if the primary disk fails, it'll drop off the bus entirely. If the disk enumerates but fails I/O, GRUB will fail when reading data from the first disk. We need to reconfigure GRUB to treat /boot as a RAID, rather than reading directly from the first replica. We can entirely fix this on UEFI and reduce the exposure window on BIOS. Fixing it entirely on BIOS will require bootupd to support reinstalling BIOS GRUB, and is out of scope for this bug.
Unable to simulate disk I/O error so just verified RAID /boot and grub.cfg file contains the correct bits.

[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://8e87a86b9444784ab29e7917fa82e00d5e356f18b19449946b687ee8dc27c51a
                   Version: 47.83.202101161239-0 (2021-01-16T12:43:01Z)

[core@cosa-devsh ~]$ lsblk -f
NAME        FSTYPE            LABEL       UUID                                 MOUNTPOINT
sr0
vda
|-vda1
|-vda2      vfat              esp-1       925B-A4E7
|-vda3      linux_raid_member any:md-boot 719af5c2-ad77-c76d-5bf7-386f2615494c
| `-md127   ext4              boot        7b8a382d-3039-4910-bc03-82b2775c2a64 /boot
`-vda4      linux_raid_member any:md-root fc5fb428-9c1c-15ee-c51e-45258bc646fe
  `-md126   xfs               root        0f752e48-64b5-4db0-907a-e736e1d2313e /sysroot
vdb
|-vdb1
|-vdb2      vfat              esp-2       925B-FC96
|-vdb3      linux_raid_member any:md-boot 719af5c2-ad77-c76d-5bf7-386f2615494c
| `-md127   ext4              boot        7b8a382d-3039-4910-bc03-82b2775c2a64 /boot
`-vdb4      linux_raid_member any:md-root fc5fb428-9c1c-15ee-c51e-45258bc646fe
  `-md126   xfs               root        0f752e48-64b5-4db0-907a-e736e1d2313e /sysroot
vdc
|-vdc1
|-vdc2      vfat              EFI-SYSTEM  F811-ED3D
|-vdc3      ext4              boot        07ca1891-f27a-421d-a2f9-70326ca46858
`-vdc4      xfs               root        910678ff-f77e-4a7d-8d53-86f2ac47a823

[core@cosa-devsh ~]$ cat /boot/grub2/grub.cfg
set pager=1
# petitboot doesn't support -e and doesn't support an empty path part
if [ -d (md/md-boot)/grub2 ]; then
  # fcct currently creates /boot RAID with superblock 1.0, which allows
  # component partitions to be read directly as filesystems. This is
  # necessary because transposefs doesn't yet rerun grub2-install on BIOS,
  # so GRUB still expects /boot to be a partition on the first disk.
  #
  # There are two consequences:
  # 1. On BIOS and UEFI, the search command might pick an individual RAID
  #    component, but we want it to use the full RAID in case there are bad
  #    sectors etc. The undocumented --hint option is supposed to support
  #    this sort of override, but it doesn't seem to work, so we set $boot
  #    directly.
  # 2. On BIOS, the "normal" module has already been loaded from an
  #    individual RAID component, and $prefix still points there. We want
  #    future module loads to come from the RAID, so we reset $prefix.
  #    (On UEFI, the stub grub.cfg has already set $prefix properly.)
  set boot=md/md-boot
  set prefix=($boot)/grub2
else
  search --label boot --set boot
fi
set root=$boot

if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
  load_env
fi

if [ x"${feature_menuentry_id}" = xy ]; then
  menuentry_id_option="--id"
else
  menuentry_id_option=""
fi

function load_video {
  if [ x$feature_all_video_module = xy ]; then
    insmod all_video
  else
    insmod efi_gop
    insmod efi_uga
    insmod ieee1275_fb
    insmod vbe
    insmod vga
    insmod video_bochs
    insmod video_cirrus
  fi
}

serial --speed=115200
terminal_input serial console
terminal_output serial console
if [ x$feature_timeout_style = xy ] ; then
  set timeout_style=menu
  set timeout=1
# Fallback normal timeout code in case the timeout_style feature is
# unavailable.
else
  set timeout=1
fi

# Determine if this is a first boot and set the ${ignition_firstboot} variable
# which is used in the kernel command line.
set ignition_firstboot=""
if [ -f "/ignition.firstboot" ]; then
    # Default networking parameters to be used with ignition.
    set ignition_network_kcmdline=''

    # Source in the `ignition.firstboot` file which could override the
    # above $ignition_network_kcmdline with static networking config.
    # This override feature is also by coreos-installer to persist static
    # networking config provided during install to the first boot of the machine.
    source "/ignition.firstboot"

    set ignition_firstboot="ignition.firstboot ${ignition_network_kcmdline}"
fi

blscfg

[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://8e87a86b9444784ab29e7917fa82e00d5e356f18b19449946b687ee8dc27c51a
                   Version: 47.83.202101161239-0 (2021-01-16T12:43:01Z)
[core@cosa-devsh ~]$
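For anyone repeating this verification, the two manual checks above (grub.cfg points at the RAID, and /boot is mounted from an md device) could be scripted roughly as follows. This is only a sketch, not part of any CoreOS tooling: the function names grub_cfg_uses_raid and is_md_device are made up for illustration, and it assumes the RAID is named md-boot as in the grub.cfg shown here. Like the manual check, it does not exercise the disk I/O failure path.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Succeeds if the given grub.cfg sets $boot to the md-boot RAID and
# resets $prefix to the grub2 directory on that RAID.
grub_cfg_uses_raid() {
    grep -Eq '^[[:space:]]*set boot=md/md-boot[[:space:]]*$' "$1" &&
        grep -Fq 'set prefix=($boot)/grub2' "$1"
}

# Succeeds if the given mount source looks like an md RAID device
# (for example /dev/md127), as opposed to a single partition.
is_md_device() {
    case "$1" in
        /dev/md*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a running system the helpers might be combined like this, assuming findmnt is available:

grub_cfg_uses_raid /boot/grub2/grub.cfg && is_md_device "$(findmnt -n -o SOURCE /boot)" && echo "RAID /boot looks OK"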
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633