Bug 1879690
Summary: | nodes become unbootable when partition saving is enabled | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | David Critch <dcritch> |
Component: | RHCOS | Assignee: | Benjamin Gilbert <bgilbert> |
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.6 | CC: | alitke, bbreard, bgilbert, danken, dornelas, dustymabe, gwest, imcleod, jligon, miabbott, nstielau, smilner, walters |
Target Milestone: | --- | ||
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2020-10-27 16:41:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
David Critch
2020-09-16 19:21:46 UTC
@Dusty @Glenn could you have a look at this BZ and see if you can provide some guidance? I have duplicated the issue on my cluster, and it's repeatable every time.

I found a sensitivity in the RHEL 7.x kickstart file.

This works:

```
uefi_size=384
bootsz=127
```

This does not:

```
uefi_size=512
bootsz=512
```

I suspect that something in the bootloader is not getting updated when RHCOS is installed. Note this only "fails" when a partition is preserved.

RHEL 7 disk before install:

```
bootstrap:/home/cloud# sfdisk -d /dev/sda
label: gpt
label-id: A5711EF8-8620-4DB7-970D-B47BC5293380
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 1258291166
sector-size: 512
/dev/sda1 : start= 2048, size= 1048576, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=23DE8DBB-921A-47C1-98BF-D030725AF826
/dev/sda2 : start= 1050624, size= 1048576, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=EC64B55D-E37A-4BD2-8F0E-2E2E12860DDF
/dev/sda3 : start= 2099200, size= 2048, type=21686148-6449-6E6F-744E-656564454649, uuid=B27DEF94-90BF-474A-92F8-B066EEA7CC9E
/dev/sda4 : start= 2101248, size= 246169600, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=24134628-1379-45A7-BB08-BE9CBBF5D05B
/dev/sda5 : start= 248270848, size= 1010020319, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=640FCD96-6D9F-48E3-B143-B741C6847206
```

Failing RHCOS install:

```
bootstrap:/home/cloud# sfdisk -d /dev/sda
label: gpt
label-id: 00000000-0000-4000-A000-000000000001
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 1258291166
sector-size: 512
/dev/sda1 : start= 2048, size= 786432, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=32BEC487-012D-4527-9CBD-683442A0C4AB, name="boot"
/dev/sda2 : start= 788480, size= 260096, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=90A11FB5-7A53-4CE0-8276-5C0BE7EF7542, name="EFI-SYSTEM"
/dev/sda3 : start= 1048576, size= 2048, type=21686148-6449-6E6F-744E-656564454649, uuid=C825940C-4C63-4535-A38C-27CD83AAF336, name="BIOS-BOOT"
/dev/sda4 : start= 1050624, size= 6346719, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=DC847CB7-962C-4D7F-83C7-6C3ECB7B4151, name="luks_root"
/dev/sda5 : start= 248270848, size= 1010020319, type=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7, uuid=26150E01-BF86-4EC1-A6FE-0FB7C512A66C
```

(In reply to Glenn West from comment #3)
> I found a sensitivity:
> On the RHEL 7.x kickstart file:
> This works:
> uefi_size=384
> bootsz=127
> This does not:
> uefi_size=512
> bootsz=512
>
> I suspect that something in the bootloader is not getting updated somewhere
> when rhcos is installed.
> Note this only "fails" when partition is preserved.

I can confirm that using uefi_size=384 and bootsz=127 does not show the problem, while uefi_size=512 and bootsz=512 in the kickstart does.

One other bit I've found this morning: after the install is done, if I overwrite the first 512 bytes with the contents of the raw disk image, it fixes the boot problem:

```
zcat ./rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-metal.x86_64.raw.gz | dd bs=512 count=1 of=/dev/sda status=progress
```

Still investigating.

OK, it turns out that the early boot code is getting left over from the RHEL 7 install.
If we look at the hexdump of the first sector from before and after we write those 512 bytes from the original image, here is what we see:
```
[root@localhost ~]# zcat ./rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-metal.x86_64.raw.gz | dd bs=440 count=1 of=/dev/sda
1+0 records in
1+0 records out
440 bytes copied, 0.00472143 s, 93.2 kB/s
[root@localhost ~]# dd if=/dev/sda bs=512 count=1 | hexdump -C > hexdump-after.txt
1+0 records in
1+0 records out
512 bytes copied, 0.00045035 s, 1.1 MB/s
[root@localhost ~]# diff hexdump-afterinstall-beforeboot.txt hexdump-after.txt
4c4
< 00000050 00 00 00 00 00 00 00 00 00 00 00 80 00 08 20 00 |.............. .|
---
> 00000050 00 00 00 00 00 00 00 00 00 00 00 80 00 00 10 00 |................|
```
So the difference between what the boot code should be and what it actually was comes down to `00 00 10 00` vs. `00 08 20 00`. Correcting those bytes fixes the problem.
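As background (my reading, not stated explicitly in the thread): in GRUB's MBR boot code (boot.img), the little-endian value at offset 0x5C records the start sector of core.img, which lives in the BIOS-BOOT partition. Decoding the four bytes that differ in the diff above under that assumption yields exactly the BIOS boot partition start sectors from the two sfdisk dumps:

```python
# Hedged sketch: interpret the 4 bytes that changed at offset 0x5C of the
# first sector. Assumption: GRUB's boot.img stores the little-endian start
# sector of core.img at offset 0x5C, so these bytes are a disk LBA.
stale = bytes([0x00, 0x08, 0x20, 0x00])  # left on disk after the RHCOS install
fixed = bytes([0x00, 0x00, 0x10, 0x00])  # what the RHCOS install image carries

stale_lba = int.from_bytes(stale, "little")
fixed_lba = int.from_bytes(fixed, "little")

print(stale_lba)  # 2099200 -- start of the RHEL 7 BIOS boot partition (sda3)
print(fixed_lba)  # 1048576 -- start of the RHCOS BIOS-BOOT partition (sda3)
```

This would explain the symptom precisely: the stale boot code chain-loads core.img from sector 2099200, which after the RHCOS install no longer contains GRUB's BIOS-BOOT payload.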
Note I can't reproduce this with UEFI, which makes sense because it doesn't use the early boot code the way legacy BIOS boot does.

Thanks Dusty. This is a regression in https://github.com/coreos/coreos-installer/commit/66ffb81a8d32. When partition saving is enabled, we're taking the first sector verbatim from the original disk contents, rather than copying it from the install image. The 384/127 sizes in comment 3 work because the BIOS-BOOT partition is in the same place after install, so the old GRUB MBR correctly chains into the new GRUB BIOS-BOOT code.

Fix landed in RHCOS 46.82.202009212140-0.

@David, would it be possible for you to test with the RHCOS image noted in comment #10 and confirm that the fix is working for you?

I've done a single-node test of a RHEL 7 install followed by RHCOS version 46.82.202009220041-0; the partition was saved and the installed RHCOS booted normally. The problem appears to be resolved in this limited test. I will verify the 40 version as well, and then a full cluster.

Further verification is proceeding. RHCOS 46.82.202009212140-0 has been verified to have the problem solved in a single-node test.

I was able to redeploy my 4.6 cluster with the new images, and everything works great. Thanks for the quick turnaround!

Once the boot image bump is merged and shows up in a nightly release payload, we can move this to VERIFIED: https://github.com/openshift/installer/pull/4206

The boot image bump (https://github.com/openshift/installer/pull/4206) was verified in https://bugzilla.redhat.com/show_bug.cgi?id=1881487. Closing this as verified in 4.6.0-0.nightly-2020-09-25-085318.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
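For readers following the `dd bs=440 count=1` command earlier in the thread: copying only the first 440 bytes is enough because of the classic MBR layout, where the boot code occupies bytes 0-439 and the partition table and signature follow. A minimal sketch of that layout (offsets from the standard MBR format, not from this thread):

```python
# Sketch of the 512-byte MBR layout, illustrating why `dd bs=440 count=1`
# replaces the boot code without touching the (protective) partition table.
MBR_FIELDS = [
    ("boot code (GRUB boot.img on BIOS systems)", 0, 440),
    ("disk signature", 440, 4),
    ("reserved", 444, 2),
    ("partition table (4 x 16-byte entries)", 446, 64),
    ("boot signature 0x55 0xAA", 510, 2),
]

offset = 0
for name, start, size in MBR_FIELDS:
    assert start == offset, "fields must be contiguous"
    offset += size
    print(f"{start:3d}..{start + size - 1:3d}  {name}")

assert offset == 512  # exactly one sector
```

On a GPT disk like the ones above, the table at offset 446 is only the protective MBR entry, but overwriting it would still confuse tools that validate it, so restricting the copy to 440 bytes is the conservative choice.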