Description of problem:
We are trying RHOSP 17 on RHEL 9. Our OVB job is failing on node provisioning with "Timeout waiting for provisioned nodes to become available".

Version-Release number of selected component (if applicable):
RHOSP17

How reproducible:
Every time

Steps to Reproduce:
1. In an OVB-based environment, attempt overcloud node provisioning.

Actual results:
Node provisioning fails. overcloud_node_provision.log:
~~~
PLAY [Overcloud Node Grow Volumes] *********************************************
2022-02-11 01:34:33.640972 | fa163e47-d34e-191b-51e5-00000000000c | TASK | Wait for provisioned nodes to boot
2022-02-11 01:44:35.920089 | | DEPRECATED | Distribution rhel 9.0 on host overcloud-controller-1 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.11/reference_appendices/interpreter_discovery.html for more information
2022-02-11 01:44:35.922981 | fa163e47-d34e-191b-51e5-00000000000c | FATAL | Wait for provisioned nodes to boot | overcloud-controller-1 | error={"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "elapsed": 601, "msg": "Timeout waiting for provisioned nodes to become available"}
2022-02-11 01:44:35.924169 | fa163e47-d34e-191b-51e5-00000000000c | TIMING | Wait for provisioned nodes to boot | overcloud-controller-1 | 0:10:02.313786 | 602.26s
2022-02-11 01:44:35.924911 | | DEPRECATED | Distribution rhel 9.0 on host overcloud-controller-2 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.11/reference_appendices/interpreter_discovery.html for more information
2022-02-11 01:44:35.925485 | fa163e47-d34e-191b-51e5-00000000000c | FATAL | Wait for provisioned nodes to boot | overcloud-controller-2 | error={"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "elapsed": 601, "msg": "Timeout waiting for provisioned nodes to become available"}
2022-02-11 01:44:35.926144 | fa163e47-d34e-191b-51e5-00000000000c | TIMING | Wait for provisioned nodes to boot | overcloud-controller-2 | 0:10:02.315810 | 602.25s
2022-02-11 01:44:35.926780 | | DEPRECATED | Distribution rhel 9.0 on host overcloud-controller-0 should use /usr/libexec/platform-python, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.11/reference_appendices/interpreter_discovery.html for more information
2022-02-11 01:44:35.927307 | fa163e47-d34e-191b-51e5-00000000000c | FATAL | Wait for provisioned nodes to boot | overcloud-controller-0 | error={"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "elapsed": 601, "msg": "Timeout waiting for provisioned nodes to become available"}
2022-02-11 01:44:35.928059 | fa163e47-d34e-191b-51e5-00000000000c | TIMING | Wait for provisioned nodes to boot | overcloud-controller-0 | 0:10:02.317724 | 602.29s
~~~

Expected results:
Node provisioning should pass.
Additional info:
The following traceback is noticed in ironic-conductor.log:
~~~
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall [-] Dynamic backoff interval looping call 'ironic.conductor.utils.node_wait_for_power_state.<locals>._wait' failed: oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 49.06 seconds
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall Traceback (most recent call last):
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall   File "/usr/lib/python3.9/site-packages/oslo_service/loopingcall.py", line 154, in _run_loop
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall     idle = idle_for_func(result, self._elapsed(watch))
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall   File "/usr/lib/python3.9/site-packages/oslo_service/loopingcall.py", line 349, in _idle_for
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall     raise LoopingCallTimeOut(
2022-02-11 01:14:47.566 2 ERROR oslo.service.loopingcall oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 49.06 seconds
~~~
Looking at the image shows most of /boot/efi is missing:

~~~
# tree boot/efi/
boot/efi/
└── EFI
    ├── BOOT
    └── redhat
        ├── grub.cfg
        ├── grubenv
        └── grubx64.efi
~~~

This is because the base rhel-9 image has the grub2-efi and shim packages pre-installed on a /boot/efi partition, but diskimage-builder only mounts the root partition when it extracts "all" of the image content. This means image building happens with an empty /boot/efi, and nothing gets installed there because rpm treats grub2-efi and shim as already installed.

To fix this I've proposed the following change to diskimage-builder, which mounts all discovered partitions during extract-image:

https://review.opendev.org/c/openstack/diskimage-builder/+/828617
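To make the symptom above concrete, here's a small demo of the check I'd apply to an extracted ESP (hedged sketch: the shimx64.efi / BOOTX64.EFI file names are my assumption of what a complete RHEL 9 Secure Boot ESP carries; the demo tree just mimics the broken image from this report):

```shell
# Recreate the incomplete ESP tree seen in the broken image, then look for
# the shim binaries that should be present alongside grubx64.efi.
esp=/tmp/esp-demo
mkdir -p "$esp/EFI/BOOT" "$esp/EFI/redhat"
touch "$esp/EFI/redhat/grub.cfg" "$esp/EFI/redhat/grubenv" "$esp/EFI/redhat/grubx64.efi"
for f in EFI/redhat/shimx64.efi EFI/BOOT/BOOTX64.EFI; do
    if [ -e "$esp/$f" ]; then
        echo "present: $f"
    else
        echo "missing: $f"
    fi
done
```

On the broken image both checks report "missing", matching the tree output above where only the grub files survived.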
I can now build, upload, and UEFI-boot images which replicate this issue:

~~~
error: ../../grub-core/fs/fshelp.c:257:file `/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64' not found.
~~~

This happens even with my /boot/efi fix, and it looks like /boot/loader/entries/*.conf has not been refreshed during the 50-bootloader run. The CentOS 9 base image has a special workaround for this, and I think the rhel-9 base image will also need a workaround, though it might be slightly different. Now that I have a dev->replication process I'll come up with a fix.
I now have a fix which allows me to boot an overcloud-hardened-uefi-full.qcow2 on a UEFI-enabled virtual machine. The problem is caused by the base rhel-9 image having a separate boot partition, while overcloud-hardened-uefi-full (and most other images) have /boot as a directory on the root partition. This means the kernel/initramfs paths in the /boot/loader/entries/*.conf are incorrect, so the boot fails. The proposed fix[1] does the same *.conf machine-id rename as for centos-9-stream, but also seds the paths in the entry conf files to ensure they include the /boot prefix. I think the extract-image fix is still required; without it, a different boot failure would occur once this one is fixed.

[1] https://review.opendev.org/c/openstack/diskimage-builder/+/829620
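The path rewrite can be sketched roughly like this (a minimal demo of the idea, not the actual diskimage-builder code; the sample conf file and paths are mine): when /boot is a directory on the root partition, BLS entries that were written relative to a separate boot partition need /boot prepended to their linux and initrd lines.

```shell
# Demo BLS entry written as if /boot were its own partition (kernel paths
# relative to that partition's root).
mkdir -p /tmp/bls-demo/loader/entries
cat > /tmp/bls-demo/loader/entries/sample.conf <<'EOF'
title Red Hat Enterprise Linux (5.14.0-42.el9.x86_64) 9.0 (Plow)
linux /vmlinuz-5.14.0-42.el9.x86_64
initrd /initramfs-5.14.0-42.el9.x86_64.img
options root=LABEL=img-rootfs ro
EOF
# Prepend /boot to the kernel and initramfs paths so they resolve when /boot
# is just a directory on the root filesystem.
sed -i -E 's#^(linux|initrd) /#\1 /boot/#' /tmp/bls-demo/loader/entries/sample.conf
cat /tmp/bls-demo/loader/entries/sample.conf
```

After the sed, the entry points at /boot/vmlinuz-... and /boot/initramfs-..., which is what grub needs to find on a combined root+/boot image.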
Hello Steve, I could test a UEFI build using both patches (extract-image + your new one), but it fails to boot - the following errors are shown:

~~~
error: ../../grub-core/fs/fshelp.c:257:file `/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64' not found.
error: ../../grub-core/fs/fshelp.c:257:file `/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64' not found.
error: ../../grub-core/loader/i386/efi/linux.c:208:you need to load the kernel first.
error: ../../grub-core/loader/i386/efi/linux.c:208:you need to load the kernel first.
~~~

After checking the content of the vg-lv_root LVM partition, I can see two loader entries:

~~~
ls mount/boot/loader/entries/
d851058d2fc9482cdc6a55bea203d869-5.14.0-42.el9.x86_64.conf
ffffffffffffffffffffffffffffffff-5.14.0-1.7.1.el9.x86_64.conf
~~~

While the first one looks correct:

~~~
cat mount/boot/loader/entries/d851058d2fc9482cdc6a55bea203d869-5.14.0-42.el9.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-42.el9.x86_64) 9.0 (Plow)
version 5.14.0-42.el9.x86_64
linux /boot/vmlinuz-5.14.0-42.el9.x86_64
initrd /boot/initramfs-5.14.0-42.el9.x86_64.img
options root=LABEL=img-rootfs ro console=tty0 console=ttyS0,115200n8 no_timer_check crashkernel=auto console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 console=ttyS0,115200 audit=1 nousb
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
~~~

the second one seems incorrect, at least on the "options" line:

~~~
cat mount/boot/loader/entries/ffffffffffffffffffffffffffffffff-5.14.0-1.7.1.el9.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-1.7.1.el9.x86_64) 9.0 (Plow)
version 5.14.0-1.7.1.el9.x86_64
linux /boot/vmlinuz-5.14.0-1.7.1.el9.x86_64
initrd /boot/initramfs-5.14.0-1.7.1.el9.x86_64.img
options root=UUID=b0bb50ab-82ac-45de-bbd8-51a4314e7719 console=tty0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
~~~

We're still pointing to "root=UUID=....".
I'm wondering how this is possible - reading your 03-reset-bls-entries, we're supposed to end up with only one file in there, aren't we?

Also, here's the content of /boot:

~~~
ls -l mount/boot/
total 78760
-rw-r--r--. 1 root root   212901 Jan 13 21:48 config-5.14.0-42.el9.x86_64
drwxr-xr-x. 3 root root    16384 Jan  1  1970 efi
drwx------. 5 root root       79 Feb 17 09:09 grub2
-rw-------. 1 root root 64086578 Feb 17 09:10 initramfs-5.14.0-42.el9.x86_64.img
drwxr-xr-x. 3 root root       21 Oct 26 16:57 loader
lrwxrwxrwx. 1 root root       44 Feb 17 09:06 symvers-5.14.0-42.el9.x86_64.gz -> /lib/modules/5.14.0-42.el9.x86_64/symvers.gz
-rw-------. 1 root root  5233256 Jan 13 21:48 System.map-5.14.0-42.el9.x86_64
-rwxr-xr-x. 1 root root 11096016 Jan 13 21:48 vmlinuz-5.14.0-42.el9.x86_64
~~~

Note: the disk layout seems to be as follows:

~~~
Device          Start       End  Sectors  Size Type
/dev/nbd0p1      2048     34815    32768   16M EFI System
/dev/nbd0p2     34816     51199    16384    8M BIOS boot
/dev/nbd0p3     51200  11769855 11718656  5.6G Linux filesystem
/dev/nbd0p4 209582080 209715166   133087   65M Linux filesystem
~~~

p3 holds the LVM things, and is divided as follows:

~~~
ls /dev/vg -1
lv_audit
lv_home
lv_log
lv_root
lv_srv
lv_tmp
lv_var
~~~

The /etc/fstab is:

~~~
cat mount/etc/fstab
LABEL=img-rootfs /              xfs  rw,relatime                     0 1
LABEL=MKFS_ESP   /boot/efi      vfat defaults                        0 2
LABEL=fs_tmp     /tmp           xfs  rw,nosuid,nodev,noexec,relatime 0 2
LABEL=fs_var     /var           xfs  rw,relatime                     0 2
LABEL=fs_log     /var/log       xfs  rw,relatime                     0 2
LABEL=fs_audit   /var/log/audit xfs  rw,relatime                     0 2
LABEL=fs_home    /home          xfs  rw,nodev,relatime               0 2
LABEL=fs_srv     /srv           xfs  rw,nodev,relatime               0 2
~~~

So all seems to be just fine. Just.... that dual loader entry thing - it's a bit weird.
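For reference, the machine-id rename I understood from that step can be sketched like this (a hedged reconstruction of the idea, not the actual 03-reset-bls-entries script; the demo tree and the stale ffffffff... prefix mirror the listing above). If the rename covers every *.conf, no file with a mismatched prefix should survive:

```shell
# Demo: one stale BLS entry whose prefix doesn't match the image machine-id.
demo=/tmp/bls-rename-demo
mkdir -p "$demo/loader/entries"
echo "d851058d2fc9482cdc6a55bea203d869" > "$demo/machine-id"
touch "$demo/loader/entries/ffffffffffffffffffffffffffffffff-5.14.0-1.7.1.el9.x86_64.conf"
machine_id=$(cat "$demo/machine-id")
# Rename every entry so its prefix is the image's machine-id. A mismatched
# file still present after this would mean an entry was created (or copied
# in) after the reset step ran.
for conf in "$demo"/loader/entries/*.conf; do
    entry=$(basename "$conf")
    suffix=${entry#*-}   # drop the old machine-id prefix
    mv "$conf" "$demo/loader/entries/${machine_id}-${suffix}"
done
ls "$demo/loader/entries/"
```

So seeing both a correct d851058d...-prefixed entry and a stale ffffffff...-prefixed one side by side suggests the second entry appeared after the rename ran, rather than the rename missing it.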
The fix is now in RHOS-17.0-RHEL-8-20220314.n.2 compose, so this should be propagating into built overcloud-hardened-uefi-full images.
Do we know why this BZ is stuck in MODIFIED?
The BZ should be moved to ON_QA once we get all the acks; I'll follow up on that.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543