Bug 2053527
Summary: | RHOSP17 Node provisioning failing - Timeout waiting for provisioned nodes to become available. | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Sandeep Yadav <sandyada>
Component: | diskimage-builder | Assignee: | Steve Baker <sbaker>
Status: | CLOSED ERRATA | QA Contact: |
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 17.0 (Wallaby) | CC: | apevec, cjeanner, grosenbe, hjensas, iwienand, jkreger, jparoly, pweeks, sbaker
Target Milestone: | beta | Keywords: | Triaged
Target Release: | 17.0 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | diskimage-builder-3.19.2-0.20220301083325.41c21e9.el8ost | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-09-21 12:18:59 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Sandeep Yadav
2022-02-11 13:25:02 UTC
Looking at the image shows most of /boot/efi is missing:

```
# tree boot/efi/
boot/efi/
└── EFI
    ├── BOOT
    └── redhat
        ├── grub.cfg
        ├── grubenv
        └── grubx64.efi
```

This is because the base rhel-9 image has the grub2-efi and shim packages pre-installed on a /boot/efi partition, but diskimage-builder only mounts the root partition when it extracts "all" of the image content. This means image building happens with an empty /boot/efi, and nothing gets installed there because rpm treats grub2-efi and shim as already installed.

To fix this I've proposed the following to diskimage-builder, which mounts all discovered partitions during extract-image:

https://review.opendev.org/c/openstack/diskimage-builder/+/828617

I can now build, upload and UEFI boot images which replicate this issue:

```
error: ../../grub-core/fs/fshelp.c:257:file `/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64' not found.
```

This happens even with my /boot/efi fix, and it looks like the /boot/loader/entries/*.conf has not been refreshed during the 50-bootloader run. The CentOS-9 base image has a special workaround for this, and I think the rhel-9 base image will also need a workaround, though it might be slightly different. Now that I have a dev->replication process I'll come up with a fix.

I now have a fix which allows me to boot an overcloud-hardened-uefi-full.qcow2 on a UEFI-enabled virtual machine. The failure is caused by the base rhel-9 image having a separate boot partition, while overcloud-hardened-uefi-full (and most other images) has /boot as a directory in the root partition. This means the kernel/initramfs paths in the /boot/loader/entries/*.conf are incorrect, so the boot fails. The proposed fix[1] does the same *.conf machine-id rename as for centos-9-stream, but also seds the paths in the entry conf file to ensure they include /boot. I think the extract-image fix is still required; without it there would be a different boot failure once this one is fixed.

[1] https://review.opendev.org/c/openstack/diskimage-builder/+/829620
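Purely as an illustration of the approach described above (the authoritative change is in the reviews linked), a BLS-entry cleanup along those lines could look roughly like the sketch below. The $TMP_MOUNT_PATH variable and the exact sed expressions are assumptions for the example, not the actual element code.

```bash
# Illustrative sketch only; see the diskimage-builder reviews above for the real change.
# Assumes the image filesystem is mounted at $TMP_MOUNT_PATH and that
# $TMP_MOUNT_PATH/etc/machine-id already holds the image's machine-id.
new_machine_id=$(cat "$TMP_MOUNT_PATH/etc/machine-id")

for entry in "$TMP_MOUNT_PATH"/boot/loader/entries/*.conf; do
    # Rename the entry so its prefix matches the image's machine-id,
    # as the centos-9-stream element already does.
    base=$(basename "$entry")
    renamed="$(dirname "$entry")/${new_machine_id}-${base#*-}"
    [ "$entry" != "$renamed" ] && mv "$entry" "$renamed"

    # The base rhel-9 cloud image has a separate /boot partition, so its BLS
    # entries point at /vmlinuz-* and /initramfs-*; in the flattened image,
    # /boot is a directory on the root filesystem, so prefix the paths.
    sed -i -e 's|^linux /vmlinuz|linux /boot/vmlinuz|' \
           -e 's|^initrd /initramfs|initrd /boot/initramfs|' "$renamed"
done
```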
Hello Steve, I could test a UEFI build using both patches (extract-image + your new one), but it fails to boot; the following error is shown:

```
error: ../../grub-core/fs/fshelp.c:257:file `/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64' not found.
error: ../../grub-core/fs/fshelp.c:257:file `/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64' not found.
error: ../../grub-core/loader/i386/efi/linux.c:208:you need to load the kernel first.
error: ../../grub-core/loader/i386/efi/linux.c:208:you need to load the kernel first.
```

After checking the content of the vg-lv_root LVM partition, I can see two loader entries:

```
ls mount/boot/loader/entries/
d851058d2fc9482cdc6a55bea203d869-5.14.0-42.el9.x86_64.conf
ffffffffffffffffffffffffffffffff-5.14.0-1.7.1.el9.x86_64.conf
```

While the first one looks correct:

```
cat mount/boot/loader/entries/d851058d2fc9482cdc6a55bea203d869-5.14.0-42.el9.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-42.el9.x86_64) 9.0 (Plow)
version 5.14.0-42.el9.x86_64
linux /boot/vmlinuz-5.14.0-42.el9.x86_64
initrd /boot/initramfs-5.14.0-42.el9.x86_64.img
options root=LABEL=img-rootfs ro console=tty0 console=ttyS0,115200n8 no_timer_check crashkernel=auto console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 console=ttyS0,115200 audit=1 nousb
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
```

the second one seems incorrect, at least on the "options" line:

```
cat mount/boot/loader/entries/ffffffffffffffffffffffffffffffff-5.14.0-1.7.1.el9.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-1.7.1.el9.x86_64) 9.0 (Plow)
version 5.14.0-1.7.1.el9.x86_64
linux /boot/vmlinuz-5.14.0-1.7.1.el9.x86_64
initrd /boot/initramfs-5.14.0-1.7.1.el9.x86_64.img
options root=UUID=b0bb50ab-82ac-45de-bbd8-51a4314e7719 console=tty0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
```

We're still pointing to "root=UUID=....". I'm wondering how this is possible when reading your 03-reset-bls-entries; we're supposed to end up with only one file in there, aren't we?

Also, here's the content of /boot:

```
ls -l mount/boot/
total 78760
-rw-r--r--. 1 root root   212901 Jan 13 21:48 config-5.14.0-42.el9.x86_64
drwxr-xr-x. 3 root root    16384 Jan  1  1970 efi
drwx------. 5 root root       79 Feb 17 09:09 grub2
-rw-------. 1 root root 64086578 Feb 17 09:10 initramfs-5.14.0-42.el9.x86_64.img
drwxr-xr-x. 3 root root       21 Oct 26 16:57 loader
lrwxrwxrwx. 1 root root       44 Feb 17 09:06 symvers-5.14.0-42.el9.x86_64.gz -> /lib/modules/5.14.0-42.el9.x86_64/symvers.gz
-rw-------. 1 root root  5233256 Jan 13 21:48 System.map-5.14.0-42.el9.x86_64
-rwxr-xr-x. 1 root root 11096016 Jan 13 21:48 vmlinuz-5.14.0-42.el9.x86_64
```

Note: the disk layout seems to be as follows:

```
Device           Start       End   Sectors  Size Type
/dev/nbd0p1       2048     34815     32768   16M EFI System
/dev/nbd0p2      34816     51199     16384    8M BIOS boot
/dev/nbd0p3      51200  11769855  11718656  5.6G Linux filesystem
/dev/nbd0p4  209582080 209715166    133087   65M Linux filesystem
```

p3 holds the LVM volumes and is divided as follows:

```
ls /dev/vg -1
lv_audit
lv_home
lv_log
lv_root
lv_srv
lv_tmp
lv_var
```

The /etc/fstab is:

```
cat mount/etc/fstab
LABEL=img-rootfs / xfs rw,relatime 0 1
LABEL=MKFS_ESP /boot/efi vfat defaults 0 2
LABEL=fs_tmp /tmp xfs rw,nosuid,nodev,noexec,relatime 0 2
LABEL=fs_var /var xfs rw,relatime 0 2
LABEL=fs_log /var/log xfs rw,relatime 0 2
LABEL=fs_audit /var/log/audit xfs rw,relatime 0 2
LABEL=fs_home /home xfs rw,nodev,relatime 0 2
LABEL=fs_srv /srv xfs rw,nodev,relatime 0 2
```

So everything seems to be just fine, except for that dual loader entry thing; it's a bit weird.

The fix is now in the RHOS-17.0-RHEL-8-20220314.n.2 compose, so this should be propagating into built overcloud-hardened-uefi-full images.

Do we know why this BZ is stuck in MODIFIED?

The BZ should be moved to ON_QA once we get all the acks; I'll follow up on that.
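For reference, the kind of image inspection shown above can be repeated against a rebuilt overcloud-hardened-uefi-full.qcow2 when verifying the fix. The sketch below is only an outline of that workflow; the image path, the VG name "vg", and the use of /dev/nbd0 are taken from the output above and may differ on another system.

```bash
# Rough outline of the inspection workflow used in this report.
# Assumes overcloud-hardened-uefi-full.qcow2 is the image under test and
# /dev/nbd0 is not already in use.
sudo modprobe nbd max_part=16
sudo qemu-nbd --connect=/dev/nbd0 overcloud-hardened-uefi-full.qcow2

# The root filesystem lives on LVM inside p3; activate the volume group.
sudo vgscan
sudo vgchange -ay vg

mkdir -p mount
sudo mount /dev/vg/lv_root mount
ls mount/boot/loader/entries/          # expect a single, /boot-prefixed BLS entry

# The EFI system partition is p1; check that shim and grub are populated.
sudo mount /dev/nbd0p1 mount/boot/efi
ls -R mount/boot/efi/EFI

# Clean up.
sudo umount mount/boot/efi mount
sudo vgchange -an vg
sudo qemu-nbd --disconnect /dev/nbd0
```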
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543
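Not part of the original report, but a quick way to confirm that an image built with the fixed diskimage-builder actually boots under UEFI is to launch it in QEMU with OVMF firmware. The firmware paths below are the usual edk2-ovmf locations on RHEL/CentOS and are an assumption; adjust them for your distribution.

```bash
# Illustrative UEFI boot smoke test; firmware paths vary by distribution.
cp /usr/share/edk2/ovmf/OVMF_VARS.fd ovmf_vars.fd

qemu-system-x86_64 \
    -machine q35,accel=kvm -cpu host -m 4096 \
    -drive if=pflash,format=raw,readonly=on,file=/usr/share/edk2/ovmf/OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=ovmf_vars.fd \
    -drive file=overcloud-hardened-uefi-full.qcow2,format=qcow2,if=virtio \
    -nographic
```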