Description of problem:
We are trying RHOSP 17 on RHEL 9. Our OVB job is failing on node provisioning with "Timeout waiting for provisioned nodes to become available".

Version-Release number of selected component (if applicable):
RHOSP17

How reproducible:
Every time

Steps to Reproduce:
1. In an OVB-based environment, attempt overcloud node provisioning.

Actual results:
Node provisioning fails. overcloud_node_provision.log:
~~~
PLAY [Overcloud Node Grow Volumes] *********************************************
2022-02-11 01:34:33.640972 | fa163e47-d34e-191b-51e5-00000000000c | TASK | Wait for provisioned nodes to boot
2022-02-11 01:44:35.920089 | | DEPRECATED | Distribution rhel 9.0 on host overcloud-controller-1 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.11/reference_appendices/interpreter_discovery.html for more information
2022-02-11 01:44:35.922981 | fa163e47-d34e-191b-51e5-00000000000c | FATAL | Wait for provisioned nodes to boot | overcloud-controller-1 | error={"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "elapsed": 601, "msg": "Timeout waiting for provisioned nodes to become available"}
2022-02-11 01:44:35.924169 | fa163e47-d34e-191b-51e5-00000000000c | TIMING | Wait for provisioned nodes to boot | overcloud-controller-1 | 0:10:02.313786 | 602.26s
2022-02-11 01:44:35.924911 | | DEPRECATED | Distribution rhel 9.0 on host overcloud-controller-2 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.11/reference_appendices/interpreter_discovery.html for more information
2022-02-11 01:44:35.925485 | fa163e47-d34e-191b-51e5-00000000000c | FATAL | Wait for provisioned nodes to boot | overcloud-controller-2 | error={"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "elapsed": 601, "msg": "Timeout waiting for provisioned nodes to become available"}
2022-02-11 01:44:35.926144 | fa163e47-d34e-191b-51e5-00000000000c | TIMING | Wait for provisioned nodes to boot | overcloud-controller-2 | 0:10:02.315810 | 602.25s
2022-02-11 01:44:35.926780 | | DEPRECATED | Distribution rhel 9.0 on host overcloud-controller-0 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.11/reference_appendices/interpreter_discovery.html for more information
2022-02-11 01:44:35.927307 | fa163e47-d34e-191b-51e5-00000000000c | FATAL | Wait for provisioned nodes to boot | overcloud-controller-0 | error={"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "elapsed": 601, "msg": "Timeout waiting for provisioned nodes to become available"}
2022-02-11 01:44:35.928059 | fa163e47-d34e-191b-51e5-00000000000c | TIMING | Wait for provisioned nodes to boot | overcloud-controller-0 | 0:10:02.317724 | 602.29s
~~~

Expected results:
Node provisioning should pass.
Additional info:
The following traceback is noticed in ironic-conductor.log:
~~~
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall [-] Dynamic backoff interval looping call 'ironic.conductor.utils.node_wait_for_power_state.<locals>._wait' failed: oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 49.06 seconds
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall Traceback (most recent call last):
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall   File "/usr/lib/python3.9/site-packages/oslo_service/loopingcall.py", line 154, in _run_loop
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall     idle = idle_for_func(result, self._elapsed(watch))
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall   File "/usr/lib/python3.9/site-packages/oslo_service/loopingcall.py", line 349, in _idle_for
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall     raise LoopingCallTimeOut(
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 49.06 seconds
~~~
Looking at the image shows most of /boot/efi is missing:

~~~
# tree boot/efi/
boot/efi/
└── EFI
    ├── BOOT
    └── redhat
        ├── grub.cfg
        ├── grubenv
        └── grubx64.efi
~~~

This is because the base rhel-9 image has the grub2-efi and shim packages pre-installed on a /boot/efi partition, but diskimage-builder only mounts the root partition when it extracts "all" of the image content. This means image building happens with an empty /boot/efi, and nothing gets installed there because rpm treats grub2-efi and shim as already installed.

To fix this I've proposed the following change to diskimage-builder, which mounts all discovered partitions during extract-image:

https://review.opendev.org/c/openstack/diskimage-builder/+/828617
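To make the symptom above concrete, here's a small demo of the check I'd apply to an extracted ESP (hedged sketch: the shimx64.efi / BOOTX64.EFI file names are my assumption of what a complete RHEL 9 Secure Boot ESP carries; the demo tree just mimics the broken image from this report):

```shell
# Recreate the incomplete ESP tree seen in the broken image, then look for
# the shim binaries that should be present alongside grubx64.efi.
esp=/tmp/esp-demo
mkdir -p "$esp/EFI/BOOT" "$esp/EFI/redhat"
touch "$esp/EFI/redhat/grub.cfg" "$esp/EFI/redhat/grubenv" "$esp/EFI/redhat/grubx64.efi"
for f in EFI/redhat/shimx64.efi EFI/BOOT/BOOTX64.EFI; do
    if [ -e "$esp/$f" ]; then
        echo "present: $f"
    else
        echo "missing: $f"
    fi
done
```

On the broken image both checks report "missing", matching the tree output above where only the grub files survived.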
I can now build, upload, and UEFI-boot images which replicate this issue:

~~~
error: ../../grub-core/fs/fshelp.c:257:file `/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64' not found.
~~~

This happens even with my /boot/efi fix, and it looks like /boot/loader/entries/*.conf has not been refreshed during the 50-bootloader run. The CentOS 9 base image has a special workaround for this, and I think the rhel-9 base image will also need a workaround, though it might be slightly different. Now that I have a dev->replication process I'll come up with a fix.
I now have a fix which allows me to boot an overcloud-hardened-uefi-full.qcow2 on a UEFI-enabled virtual machine. The problem is caused by the base rhel-9 image having a separate boot partition, while overcloud-hardened-uefi-full (and most other images) have /boot as a directory on the root partition. This means the kernel/initramfs paths in the /boot/loader/entries/*.conf are incorrect, so the boot fails. The proposed fix[1] does the same *.conf machine-id rename as for centos-9-stream, but also seds the paths in the entry conf files to ensure they include the /boot prefix. I think the extract-image fix is still required; without it, a different boot failure would occur once this one is fixed.

[1] https://review.opendev.org/c/openstack/diskimage-builder/+/829620
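The path rewrite can be sketched roughly like this (a minimal demo of the idea, not the actual diskimage-builder code; the sample conf file and paths are mine): when /boot is a directory on the root partition, BLS entries that were written relative to a separate boot partition need /boot prepended to their linux and initrd lines.

```shell
# Demo BLS entry written as if /boot were its own partition (kernel paths
# relative to that partition's root).
mkdir -p /tmp/bls-demo/loader/entries
cat > /tmp/bls-demo/loader/entries/sample.conf <<'EOF'
title Red Hat Enterprise Linux (5.14.0-42.el9.x86_64) 9.0 (Plow)
linux /vmlinuz-5.14.0-42.el9.x86_64
initrd /initramfs-5.14.0-42.el9.x86_64.img
options root=LABEL=img-rootfs ro
EOF
# Prepend /boot to the kernel and initramfs paths so they resolve when /boot
# is just a directory on the root filesystem.
sed -i -E 's#^(linux|initrd) /#\1 /boot/#' /tmp/bls-demo/loader/entries/sample.conf
cat /tmp/bls-demo/loader/entries/sample.conf
```

After the sed, the entry points at /boot/vmlinuz-... and /boot/initramfs-..., which is what grub needs to find on a combined root+/boot image.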
Hello Steve, I could test a UEFI build using both patches (extract-image + your new one), but it fails to boot - the following errors are shown:

~~~
error: ../../grub-core/fs/fshelp.c:257:file `/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64' not found.
error: ../../grub-core/fs/fshelp.c:257:file `/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64' not found.
error: ../../grub-core/loader/i386/efi/linux.c:208:you need to load the kernel first.
error: ../../grub-core/loader/i386/efi/linux.c:208:you need to load the kernel first.
~~~

After checking the content of the vg-lv_root LVM partition, I can see two loader entries:

~~~
ls mount/boot/loader/entries/
d851058d2fc9482cdc6a55bea203d869-5.14.0-42.el9.x86_64.conf
ffffffffffffffffffffffffffffffff-5.14.0-1.7.1.el9.x86_64.conf
~~~

While the first one looks correct:

~~~
cat mount/boot/loader/entries/d851058d2fc9482cdc6a55bea203d869-5.14.0-42.el9.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-42.el9.x86_64) 9.0 (Plow)
version 5.14.0-42.el9.x86_64
linux /boot/vmlinuz-5.14.0-42.el9.x86_64
initrd /boot/initramfs-5.14.0-42.el9.x86_64.img
options root=LABEL=img-rootfs ro console=tty0 console=ttyS0,115200n8 no_timer_check crashkernel=auto console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 console=ttyS0,115200 audit=1 nousb
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
~~~

the second one seems incorrect, at least on the "options" line:

~~~
cat mount/boot/loader/entries/ffffffffffffffffffffffffffffffff-5.14.0-1.7.1.el9.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-1.7.1.el9.x86_64) 9.0 (Plow)
version 5.14.0-1.7.1.el9.x86_64
linux /boot/vmlinuz-5.14.0-1.7.1.el9.x86_64
initrd /boot/initramfs-5.14.0-1.7.1.el9.x86_64.img
options root=UUID=b0bb50ab-82ac-45de-bbd8-51a4314e7719 console=tty0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
~~~

We're still pointing to "root=UUID=....".
I'm wondering how this is possible - reading your 03-reset-bls-entries, we're supposed to end up with only one file in there, aren't we?

Also, here's the content of /boot:

~~~
ls -l mount/boot/
total 78760
-rw-r--r--. 1 root root   212901 Jan 13 21:48 config-5.14.0-42.el9.x86_64
drwxr-xr-x. 3 root root    16384 Jan  1  1970 efi
drwx------. 5 root root       79 Feb 17 09:09 grub2
-rw-------. 1 root root 64086578 Feb 17 09:10 initramfs-5.14.0-42.el9.x86_64.img
drwxr-xr-x. 3 root root       21 Oct 26 16:57 loader
lrwxrwxrwx. 1 root root       44 Feb 17 09:06 symvers-5.14.0-42.el9.x86_64.gz -> /lib/modules/5.14.0-42.el9.x86_64/symvers.gz
-rw-------. 1 root root  5233256 Jan 13 21:48 System.map-5.14.0-42.el9.x86_64
-rwxr-xr-x. 1 root root 11096016 Jan 13 21:48 vmlinuz-5.14.0-42.el9.x86_64
~~~

Note: the disk layout seems to be as follows:

~~~
Device          Start       End  Sectors  Size Type
/dev/nbd0p1      2048     34815    32768   16M EFI System
/dev/nbd0p2     34816     51199    16384    8M BIOS boot
/dev/nbd0p3     51200  11769855 11718656  5.6G Linux filesystem
/dev/nbd0p4 209582080 209715166   133087   65M Linux filesystem
~~~

p3 holds the LVM things, and is divided as follows:

~~~
ls /dev/vg -1
lv_audit
lv_home
lv_log
lv_root
lv_srv
lv_tmp
lv_var
~~~

The /etc/fstab is:

~~~
cat mount/etc/fstab
LABEL=img-rootfs /              xfs  rw,relatime                     0 1
LABEL=MKFS_ESP   /boot/efi      vfat defaults                        0 2
LABEL=fs_tmp     /tmp           xfs  rw,nosuid,nodev,noexec,relatime 0 2
LABEL=fs_var     /var           xfs  rw,relatime                     0 2
LABEL=fs_log     /var/log       xfs  rw,relatime                     0 2
LABEL=fs_audit   /var/log/audit xfs  rw,relatime                     0 2
LABEL=fs_home    /home          xfs  rw,nodev,relatime               0 2
LABEL=fs_srv     /srv           xfs  rw,nodev,relatime               0 2
~~~

So all seems to be just fine. Just.... that dual loader entry thing - it's a bit weird.
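For reference, the machine-id rename I understood from that step can be sketched like this (a hedged reconstruction of the idea, not the actual 03-reset-bls-entries script; the demo tree and the stale ffffffff... prefix mirror the listing above). If the rename covers every *.conf, no file with a mismatched prefix should survive:

```shell
# Demo: one stale BLS entry whose prefix doesn't match the image machine-id.
demo=/tmp/bls-rename-demo
mkdir -p "$demo/loader/entries"
echo "d851058d2fc9482cdc6a55bea203d869" > "$demo/machine-id"
touch "$demo/loader/entries/ffffffffffffffffffffffffffffffff-5.14.0-1.7.1.el9.x86_64.conf"
machine_id=$(cat "$demo/machine-id")
# Rename every entry so its prefix is the image's machine-id. A mismatched
# file still present after this would mean an entry was created (or copied
# in) after the reset step ran.
for conf in "$demo"/loader/entries/*.conf; do
    entry=$(basename "$conf")
    suffix=${entry#*-}   # drop the old machine-id prefix
    mv "$conf" "$demo/loader/entries/${machine_id}-${suffix}"
done
ls "$demo/loader/entries/"
```

So seeing both a correct d851058d...-prefixed entry and a stale ffffffff...-prefixed one side by side suggests the second entry appeared after the rename ran, rather than the rename missing it.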
The fix is now in RHOS-17.0-RHEL-8-20220314.n.2 compose, so this should be propagating into built overcloud-hardened-uefi-full images.
Do we know why this BZ is stuck in MODIFIED?
The BZ should be moved to ON_QA once we get all the acks; I'll follow up on that.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543