Description of problem:

I'm testing the partition saving feature for a customer, using the latest OCP 4.6. The OpenShift cluster is deployed via UPI, and the following karg was added to the PXE boot parameters:

coreos.inst.save_partindex=5-

The partition is indeed saved (confirmed via the rescue image), but the nodes become unbootable. After the disk image is written and the system attempts to reboot, it gets stuck trying to boot from the drive.

Version-Release number of selected component (if applicable):

rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-live-initramfs.x86_64.img
rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-live-kernel-x86_64

```
$ openshift-install version
openshift-install 4.6.0-0.nightly-2020-09-16-114952
built from commit 3c130f21348caddc37f4458378e6bf288b00d69e
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:bbd795884df7e5a200f2ae68bfa362c09b11136569fa612baa457effa5776e8e
```

How reproducible:

Always

Steps to Reproduce:

1. Deploy the nodes with RHEL 7, using the following custom kickstart partitioning script:

```
bid=sda
uefi_size=512
bootsz=512
biosbootsz=2
rootsz=122880

sgdisk --zap-all /dev/${bid}
sgdisk -n 1:2048:+${uefi_size}M \
       -n 2:0:+${bootsz}M \
       -n 3:0:+${biosbootsz}M \
       -n 4:0:+${rootsz}M \
       -n 5:0:0 \
       -t 1:EF00 \
       -t 2:0700 \
       -t 3:EF02 \
       -t 4:8E00 \
       -t 5:8E00 /dev/${bid}
partprobe

cat <<EOF > /tmp/diskpart.cfg
bootloader --append="nofb quiet splash=quiet crashkernel=auto" --location=mbr --boot-drive=sda
part /boot/efi --fstype=efi --asprimary --onpart=/dev/${bid}1
part /boot --fstype=ext3 --asprimary --onpart=/dev/${bid}2
part biosboot --fstype=biosboot --onpart=/dev/${bid}3
part / --fstype="xfs" --onpart=/dev/${bid}4
part /mnt/datastore --fstype="xfs" --onpart=/dev/${bid}5
ignoredisk --only-use=sda
EOF
```

2. Attempt to deploy OCP on the nodes.

Actual results:

The partition is saved, but the disk is not bootable.
Expected results:

The partition is saved, and the node boots into the OS.

Additional info:

Here's a video of how/when it gets stuck: https://drive.google.com/file/d/1SG0o1Q0P0_EsLQ2UeeUApJKGpLmyaPL8/view?usp=sharing
@Dusty @Glenn could you have a look at this BZ and see if you can provide some guidance?
I have duplicated the issue on my cluster, and it's repeatable every time.
I found a sensitivity in the RHEL 7.x kickstart file.

This works:

```
uefi_size=384
bootsz=127
```

This does not:

```
uefi_size=512
bootsz=512
```

I suspect that something in the bootloader is not getting updated somewhere when RHCOS is installed. Note this only "fails" when a partition is preserved.
RHEL 7 disk before install:

```
bootstrap:/home/cloud# sfdisk -d /dev/sda
label: gpt
label-id: A5711EF8-8620-4DB7-970D-B47BC5293380
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 1258291166
sector-size: 512

/dev/sda1 : start=2048, size=1048576, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=23DE8DBB-921A-47C1-98BF-D030725AF826
/dev/sda2 : start=1050624, size=1048576, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=EC64B55D-E37A-4BD2-8F0E-2E2E12860DDF
/dev/sda3 : start=2099200, size=2048, type=21686148-6449-6E6F-744E-656564454649, uuid=B27DEF94-90BF-474A-92F8-B066EEA7CC9E
/dev/sda4 : start=2101248, size=246169600, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=24134628-1379-45A7-BB08-BE9CBBF5D05B
/dev/sda5 : start=248270848, size=1010020319, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=640FCD96-6D9F-48E3-B143-B741C6847206
```

Failing RHCOS install:

```
bootstrap:/home/cloud# sfdisk -d /dev/sda
label: gpt
label-id: 00000000-0000-4000-A000-000000000001
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 1258291166
sector-size: 512

/dev/sda1 : start=2048, size=786432, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=32BEC487-012D-4527-9CBD-683442A0C4AB, name="boot"
/dev/sda2 : start=788480, size=260096, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=90A11FB5-7A53-4CE0-8276-5C0BE7EF7542, name="EFI-SYSTEM"
/dev/sda3 : start=1048576, size=2048, type=21686148-6449-6E6F-744E-656564454649, uuid=C825940C-4C63-4535-A38C-27CD83AAF336, name="BIOS-BOOT"
/dev/sda4 : start=1050624, size=6346719, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=DC847CB7-962C-4D7F-83C7-6C3ECB7B4151, name="luks_root"
/dev/sda5 : start=248270848, size=1010020319, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=26150E01-BF86-4EC1-A6FE-0FB7C512A66C
```
(In reply to Glenn West from comment #3)
> I found a sensitivity:
> On the RHEL 7.x kickstart file:
> This works:
> uefi_size=384
> bootsz=127
> This does not:
> uefi_size=512
> bootsz=512
>
> I suspect that something in the bootloader is not getting updated somewhere
> when rhcos is installed.
> Note this only "fails" when partition is preserved.

I can confirm that using uefi_size=384 and bootsz=127 does not show the problem, while uefi_size=512 and bootsz=512 in the kickstart does.

One other bit I've found this morning: after the install is done, if I overwrite the first 512 bytes with the contents of the raw disk image, it fixes the boot problem:

```
zcat ./rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-metal.x86_64.raw.gz | dd bs=512 count=1 of=/dev/sda status=progress
```

Still investigating.
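As a side note (a file-based sketch of my own, not part of the original comment): the first sector splits at byte 440, with BIOS boot code before that offset and the disk signature plus (protective) MBR partition table after it. The toy below uses throwaway temp files in place of the image and the disk to show that a `dd bs=440 count=1 conv=notrunc` copy replaces only the boot-code portion:

```shell
img=$(mktemp)   # stands in for sector 0 of the new install image
disk=$(mktemp)  # stands in for sector 0 of the already-installed disk
# fill the 440-byte boot-code area and the 72-byte signature/table area
# with distinct marker bytes so we can see what the copy touches
{ head -c 440 /dev/zero | tr '\0' 'N'; head -c 72 /dev/zero | tr '\0' 'T'; } > "$img"
{ head -c 440 /dev/zero | tr '\0' 'o'; head -c 72 /dev/zero | tr '\0' 't'; } > "$disk"
# copy only the boot code; conv=notrunc leaves the rest of the file alone
dd if="$img" of="$disk" bs=440 count=1 conv=notrunc 2>/dev/null
head -c 1 "$disk"               # 'N': boot code now comes from the image
tail -c 72 "$disk" | head -c 1  # 't': table bytes are untouched
```

The bs=512 variant above rewrites the whole sector; the narrower bs=440 copy in the next comment confirms that the boot code alone is what matters.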
OK, it turns out that the early boot code is getting left over from the RHEL 7 install. If we look at the hexdump from before and after we write the boot code (the first 440 bytes) from the original image, here is what we see:

```
[root@localhost ~]# zcat ./rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-metal.x86_64.raw.gz | dd bs=440 count=1 of=/dev/sda
1+0 records in
1+0 records out
440 bytes copied, 0.00472143 s, 93.2 kB/s
[root@localhost ~]# dd if=/dev/sda bs=512 count=1 | hexdump -C > hexdump-after.txt
1+0 records in
1+0 records out
512 bytes copied, 0.00045035 s, 1.1 MB/s
[root@localhost ~]# diff hexdump-afterinstall-beforeboot.txt hexdump-after.txt
4c4
< 00000050  00 00 00 00 00 00 00 00  00 00 00 80 00 08 20 00  |.............. .|
---
> 00000050  00 00 00 00 00 00 00 00  00 00 00 80 00 00 10 00  |................|
```

So the diff between what it should be and what it was: `00 10 00` vs `08 20 00`. Correcting that fixes the problem.
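For the curious (my reading, worth double-checking against the GRUB source): the changed bytes sit at offset 0x5c, which is where GRUB's i386-pc boot.img stores the little-endian sector number of core.img (GRUB_BOOT_MACHINE_KERNEL_SECTOR). `00 08 20 00` decodes to 2099200, the start of the RHEL 7 BIOS-BOOT partition in the sfdisk dumps above, while `00 00 10 00` decodes to 1048576, the start of the RHCOS one. A self-contained decode against a synthetic file (a real node would read /dev/sda instead):

```shell
mbr=$(mktemp)
dd if=/dev/zero of="$mbr" bs=512 count=1 2>/dev/null
# plant the stale bytes from the diff above (00 08 20 00) at offset 0x5c (92)
printf '\000\010\040\000' | dd of="$mbr" bs=1 seek=92 conv=notrunc 2>/dev/null
# decode as an unsigned little-endian 32-bit integer (od uses host byte order; x86 here)
lba=$(od -An -j 92 -N 4 -t u4 "$mbr" | tr -d ' ')
echo "core.img sector: $lba"   # 2099200 = start of the old RHEL 7 BIOS-BOOT
```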
Note I can't reproduce this with UEFI, which makes sense because it doesn't use the early boot code like legacy BIOS boot does.
Thanks Dusty. This is a regression in https://github.com/coreos/coreos-installer/commit/66ffb81a8d32. When partition saving is enabled, we're taking the first sector verbatim from the original disk contents, rather than copying it from the install image. The 384/127 sizes in comment 3 work because the BIOS-BOOT partition is in the same place after install, so the old grub MBR correctly chains into the new grub BIOS-BOOT code.
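To spell out the 384/127 coincidence with a little arithmetic (my own back-of-the-envelope check, using the partition starts from the sfdisk dumps above): with 512-byte sectors, a size of N MiB is N*2048 sectors, so BIOS-BOOT lands at 2048 + uefi_size*2048 + bootsz*2048.

```shell
# BIOS-BOOT start = first partition start + the two preceding partition sizes
start=2048
biosboot() { echo $(( start + $1 * 2048 + $2 * 2048 )); }  # args: uefi_size bootsz (MiB)
biosboot 384 127   # 1048576 -> same LBA as RHCOS's BIOS-BOOT, so the stale MBR still chains correctly
biosboot 512 512   # 2099200 -> the old MBR points here, where no GRUB core.img lives anymore
```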
Fix landed in RHCOS 46.82.202009212140-0.
@David would it be possible for you to test with the RHCOS image noted in comment #10 and confirm that the fix is working for you?
I've done a single-node test of a RHEL 7 install followed by RHCOS 46.82.202009220041-0; the partition was saved, and the installed RHCOS booted normally. The problem appears to be resolved in this limited test. I will verify the 40 version as well, and then a full cluster. Further verification is proceeding.
RHCOS 46.82.202009212140-0 has been verified to solve the problem in a single-node test.
I was able to redeploy my 4.6 cluster with the new images, and everything works great. Thanks for the quick turnaround!
Once the boot image bump is merged and shows up in a nightly release payload, we can move this to VERIFIED: https://github.com/openshift/installer/pull/4206
The boot image bump (https://github.com/openshift/installer/pull/4206) was verified in https://bugzilla.redhat.com/show_bug.cgi?id=1881487. Closing this as verified in 4.6.0-0.nightly-2020-09-25-085318.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196