Bug 1879690
Summary: | nodes become unbootable when partition saving is enabled | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | David Critch <dcritch> |
Component: | RHCOS | Assignee: | Benjamin Gilbert <bgilbert> |
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.6 | CC: | alitke, bbreard, bgilbert, danken, dornelas, dustymabe, gwest, imcleod, jligon, miabbott, nstielau, smilner, walters |
Target Milestone: | --- | ||
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2020-10-27 16:41:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
David Critch
2020-09-16 19:21:46 UTC
@Dusty @Glenn could you have a look at this BZ and see if you can provide some guidance? I have duplicated the issue on my cluster, and it's repeatable every time.

I found a sensitivity in the RHEL 7.x kickstart file.

This works:

```
uefi_size=384
bootsz=127
```

This does not:

```
uefi_size=512
bootsz=512
```

I suspect that something in the bootloader is not getting updated when RHCOS is installed. Note this only "fails" when a partition is preserved.

RHEL 7 disk before install:

```
bootstrap:/home/cloud# sfdisk -d /dev/sda
label: gpt
label-id: A5711EF8-8620-4DB7-970D-B47BC5293380
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 1258291166
sector-size: 512
/dev/sda1 : start= 2048, size= 1048576, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=23DE8DBB-921A-47C1-98BF-D030725AF826
/dev/sda2 : start= 1050624, size= 1048576, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=EC64B55D-E37A-4BD2-8F0E-2E2E12860DDF
/dev/sda3 : start= 2099200, size= 2048, type=21686148-6449-6E6F-744E-656564454649, uuid=B27DEF94-90BF-474A-92F8-B066EEA7CC9E
/dev/sda4 : start= 2101248, size= 246169600, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=24134628-1379-45A7-BB08-BE9CBBF5D05B
/dev/sda5 : start= 248270848, size= 1010020319, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=640FCD96-6D9F-48E3-B143-B741C6847206
```

Failing RHCOS install:

```
bootstrap:/home/cloud# sfdisk -d /dev/sda
label: gpt
label-id: 00000000-0000-4000-A000-000000000001
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 1258291166
sector-size: 512
/dev/sda1 : start= 2048, size= 786432, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=32BEC487-012D-4527-9CBD-683442A0C4AB, name="boot"
/dev/sda2 : start= 788480, size= 260096, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=90A11FB5-7A53-4CE0-8276-5C0BE7EF7542, name="EFI-SYSTEM"
/dev/sda3 : start= 1048576, size= 2048, type=21686148-6449-6E6F-744E-656564454649, uuid=C825940C-4C63-4535-A38C-27CD83AAF336, name="BIOS-BOOT"
/dev/sda4 : start= 1050624, size= 6346719, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=DC847CB7-962C-4D7F-83C7-6C3ECB7B4151, name="luks_root"
/dev/sda5 : start= 248270848, size= 1010020319, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=26150E01-BF86-4EC1-A6FE-0FB7C512A66C
```

(In reply to Glenn West from comment #3)
> I found a sensitivity:
> On the RHEL 7.x kickstart file:
> This works:
> uefi_size=384
> bootsz=127
> This does not:
> uefi_size=512
> bootsz=512
>
> I suspect that something in the bootloader is not getting updated somewhere
> when rhcos is installed.
> Note this only "fails" when partition is preserved.

I can confirm that using uefi_size=384 and bootsz=127 does not show the problem, while uefi_size=512 and bootsz=512 in the kickstart does.

One other bit I've found this morning: after the install is done, if I overwrite the first 512 bytes with the contents of the raw disk image, it fixes the boot problem:

```
zcat ./rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-metal.x86_64.raw.gz | dd bs=512 count=1 of=/dev/sda status=progress
```

Still investigating.

OK, it turns out that the early boot code is getting left over from the RHEL 7 install.
If we look at the hexdump of the first sector from before and after we write those 512 bytes from the original image, here is what we see:
```
[root@localhost ~]# zcat ./rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-metal.x86_64.raw.gz | dd bs=440 count=1 of=/dev/sda
1+0 records in
1+0 records out
440 bytes copied, 0.00472143 s, 93.2 kB/s
[root@localhost ~]# dd if=/dev/sda bs=512 count=1 | hexdump -C > hexdump-after.txt
1+0 records in
1+0 records out
512 bytes copied, 0.00045035 s, 1.1 MB/s
[root@localhost ~]# diff hexdump-afterinstall-beforeboot.txt hexdump-after.txt
4c4
< 00000050 00 00 00 00 00 00 00 00 00 00 00 80 00 08 20 00 |.............. .|
---
> 00000050 00 00 00 00 00 00 00 00 00 00 00 80 00 00 10 00 |................|
```
So the difference between what the boot code should be and what it actually was comes down to `00 00 10 00` vs. `00 08 20 00`. Correcting those bytes fixes the problem.
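As background (my reading, not stated explicitly in the thread): in GRUB's MBR boot code (boot.img), the little-endian value at offset 0x5C records the start sector of core.img, which lives in the BIOS-BOOT partition. Decoding the four bytes that differ in the diff above under that assumption yields exactly the BIOS boot partition start sectors from the two sfdisk dumps:

```python
# Hedged sketch: interpret the 4 bytes that changed at offset 0x5C of the
# first sector. Assumption: GRUB's boot.img stores the little-endian start
# sector of core.img at offset 0x5C, so these bytes are a disk LBA.
stale = bytes([0x00, 0x08, 0x20, 0x00])  # left on disk after the RHCOS install
fixed = bytes([0x00, 0x00, 0x10, 0x00])  # what the RHCOS install image carries

stale_lba = int.from_bytes(stale, "little")
fixed_lba = int.from_bytes(fixed, "little")

print(stale_lba)  # 2099200 -- start of the RHEL 7 BIOS boot partition (sda3)
print(fixed_lba)  # 1048576 -- start of the RHCOS BIOS-BOOT partition (sda3)
```

This would explain the symptom precisely: the stale boot code chain-loads core.img from sector 2099200, which after the RHCOS install no longer contains GRUB's BIOS-BOOT payload.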
Note I can't reproduce this with UEFI, which makes sense because it doesn't use the early boot code the way legacy BIOS boot does.

Thanks Dusty. This is a regression in https://github.com/coreos/coreos-installer/commit/66ffb81a8d32. When partition saving is enabled, we're taking the first sector verbatim from the original disk contents, rather than copying it from the install image. The 384/127 sizes in comment 3 work because the BIOS-BOOT partition is in the same place after install, so the old GRUB MBR correctly chains into the new GRUB BIOS-BOOT code.

Fix landed in RHCOS 46.82.202009212140-0.

@David, would it be possible for you to test with the RHCOS image noted in comment #10 and confirm that the fix is working for you?

I've done a single-node test of a RHEL 7 install followed by RHCOS version 46.82.202009220041-0; the partition was saved and the installed RHCOS booted normally. The problem appears to be resolved in this limited test. I will verify the 40 version as well, and then a full cluster.

Further verification is proceeding. RHCOS 46.82.202009212140-0 has been verified to have the problem solved in a single-node test.

I was able to redeploy my 4.6 cluster with the new images, and everything works great. Thanks for the quick turnaround!

Once the boot image bump is merged and shows up in a nightly release payload, we can move this to VERIFIED: https://github.com/openshift/installer/pull/4206

The boot image bump (https://github.com/openshift/installer/pull/4206) was verified in https://bugzilla.redhat.com/show_bug.cgi?id=1881487. Closing this as verified in 4.6.0-0.nightly-2020-09-25-085318.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
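For readers following the `dd bs=440 count=1` command earlier in the thread: copying only the first 440 bytes is enough because of the classic MBR layout, where the boot code occupies bytes 0-439 and the partition table and signature follow. A minimal sketch of that layout (offsets from the standard MBR format, not from this thread):

```python
# Sketch of the 512-byte MBR layout, illustrating why `dd bs=440 count=1`
# replaces the boot code without touching the (protective) partition table.
MBR_FIELDS = [
    ("boot code (GRUB boot.img on BIOS systems)", 0, 440),
    ("disk signature", 440, 4),
    ("reserved", 444, 2),
    ("partition table (4 x 16-byte entries)", 446, 64),
    ("boot signature 0x55 0xAA", 510, 2),
]

offset = 0
for name, start, size in MBR_FIELDS:
    assert start == offset, "fields must be contiguous"
    offset += size
    print(f"{start:3d}..{start + size - 1:3d}  {name}")

assert offset == 512  # exactly one sector
```

On a GPT disk like the ones above, the table at offset 446 is only the protective MBR entry, but overwriting it would still confuse tools that validate it, so restricting the copy to 440 bytes is the conservative choice.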