Description of problem:

All undercloud and overcloud nodes are KVM guests using BIOS boot. In this environment, "openstack overcloud node provision" fails at the "Wait for provisioned nodes to boot" task when overcloud-hardened-uefi-full.raw is used.

~~~
(undercloud) [stack@undercloud ~]$ openstack overcloud image upload --image-path /home/stack/images/
Image "file:///var/lib/ironic/images/overcloud-hardened-uefi-full.raw" was copied.
+----------------------------------------------------------------+------------------------------+------------+
| Path                                                           | Name                         | Size       |
+----------------------------------------------------------------+------------------------------+------------+
| file:///var/lib/ironic/images/overcloud-hardened-uefi-full.raw | overcloud-hardened-uefi-full | 6442450944 |
+----------------------------------------------------------------+------------------------------+------------+

(undercloud) [stack@undercloud ~]$ openstack overcloud node unprovision --all --stack overcloud --network-ports /home/stack/templates/overcloud-baremetal-deploy.yaml
:
PLAY [Overcloud Node Grow Volumes] *********************************************
2022-10-27 09:28:23.001772 | 5254005a-676a-cf85-a24a-00000000000d | TASK | Wait for provisioned nodes to boot
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-controller-1: Failed to connect to the host via ssh: ssh: connect to host 192.168.24.20 port 22: No route to host
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-controller-0: Failed to connect to the host via ssh: ssh: connect to host 192.168.24.18 port 22: No route to host
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-controller-2: Failed to connect to the host via ssh: ssh: connect to host 192.168.24.23 port 22: No route to host
2022-10-27 09:38:33.699850 | 5254005a-676a-cf85-a24a-00000000000d | FATAL | Wait for provisioned nodes to boot | overcloud-controller-0 | error={"changed": false, "elapsed": 610, "msg": "timed out waiting for ping module test: Data could not be sent to remote host \"192.168.24.18\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.18 port 22: No route to host\r\n"}
2022-10-27 09:38:33.702294 | 5254005a-676a-cf85-a24a-00000000000d | TIMING | Wait for provisioned nodes to boot | overcloud-controller-0 | 0:10:10.729095 | 610.70s
2022-10-27 09:38:33.703025 | 5254005a-676a-cf85-a24a-00000000000d | FATAL | Wait for provisioned nodes to boot | overcloud-controller-1 | error={"changed": false, "elapsed": 610, "msg": "timed out waiting for ping module test: Data could not be sent to remote host \"192.168.24.20\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.20 port 22: No route to host\r\n"}
2022-10-27 09:38:33.703538 | 5254005a-676a-cf85-a24a-00000000000d | TIMING | Wait for provisioned nodes to boot | overcloud-controller-1 | 0:10:10.730368 | 610.69s
2022-10-27 09:38:33.704214 | 5254005a-676a-cf85-a24a-00000000000d | FATAL | Wait for provisioned nodes to boot | overcloud-controller-2 | error={"changed": false, "elapsed": 610, "msg": "timed out waiting for ping module test: Data could not be sent to remote host \"192.168.24.23\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.23 port 22: No route to host\r\n"}
2022-10-27 09:38:33.704827 | 5254005a-676a-cf85-a24a-00000000000d | TIMING | Wait for provisioned nodes to boot | overcloud-controller-2 | 0:10:10.731656 | 610.68s

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
overcloud-controller-0     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
overcloud-controller-1     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
overcloud-controller-2     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
~~~

The overcloud nodes' consoles show the following error message and drop to a grub rescue prompt:

~~~
Booting from Hard Disk...
..
error: ../../grub-core/kern/disk.c:236:disk `lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H` not found.
Entering rescue mode...
grub rescue>
~~~

I noticed that the VG UUID in grub.cfg does not match the actual VG UUID inside the overcloud-hardened-uefi-full.raw image, while the LV UUID is correct.
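For reference, grub's lvmid paths appear to be the 32-character LVM UUIDs with hyphens inserted in a 6-4-4-4-4-4-6 grouping (evident from the LV UUID in this environment), so the mismatch can be checked mechanically. A quick Python sketch using the UUIDs from this report (the helper name is mine, not part of any tool):

```python
def lvm_uuid_hyphenate(uuid32):
    """Insert hyphens into a 32-character LVM UUID using the
    6-4-4-4-4-4-6 grouping that grub's lvmid/ paths use."""
    groups = (6, 4, 4, 4, 4, 4, 6)
    parts, pos = [], 0
    for g in groups:
        parts.append(uuid32[pos:pos + g])
        pos += g
    return "-".join(parts)

# UUIDs reported by guestfish inside the image
vg_uuid = "W0krhKqeboSeZYKx63VOtmaNgHYhoeO2"
lv_uuid = "G6A04kBfSNRwISLIOFOW6u688nUILD1H"

# lvmid path embedded in the image's grub.cfg
grub_vg = "QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY"
grub_lv = "G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H"

print(lvm_uuid_hyphenate(lv_uuid) == grub_lv)  # True: the LV UUID matches
print(lvm_uuid_hyphenate(vg_uuid) == grub_vg)  # False: the VG UUID is stale
```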
~~~
(undercloud) [stack@undercloud images]$ guestfish --rw -a /var/lib/ironic/images/overcloud-hardened-uefi-full.raw -i
><fs> vgs-full
[0] = {
  vg_name: vg
  vg_uuid: W0krhKqeboSeZYKx63VOtmaNgHYhoeO2
  :
><fs> lvs-full
[0] = {
  lv_name: lv_root
  lv_uuid: G6A04kBfSNRwISLIOFOW6u688nUILD1H
  :
><fs> grep lvmid /boot/grub2/grub.cfg
set root='lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H'
search --no-floppy --fs-uuid --set=root --hint='lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H' 2ef3f2e5-ad4d-448a-b641-254514b34b01
set boot='lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H'
search --no-floppy --fs-uuid --set=boot --hint='lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H' 2ef3f2e5-ad4d-448a-b641-254514b34b01
~~~

As a workaround, "openstack overcloud node provision" succeeds when the overcloud-full image is used instead:

~~~
$ sudo dnf install rhosp-director-images -y
$ tar xvf /usr/share/rhosp-director-images/overcloud-full-latest-17.0-x86_64.tar -C ~/images
$ openstack overcloud image upload --image-path /home/stack/images/
$ sudo rm /var/lib/ironic/images/overcloud-hardened-uefi-full.raw
$ openstack overcloud node provision \
    --stack overcloud \
    --network-config \
    --output /home/stack/templates/overcloud-baremetal-deployed.yaml \
    /home/stack/templates/overcloud-baremetal-deploy.yaml
~~~

I'm not sure whether this is a bug or expected behavior. Can overcloud-hardened-uefi-full not be used in a BIOS environment?

Version-Release number of selected component (if applicable):
RHOSP 17.0 GA

How reproducible:

Steps to Reproduce:
1. Create undercloud and overcloud nodes as KVM guests with BIOS boot
2. Run "openstack overcloud node provision" with overcloud-hardened-uefi-full

~~~
(undercloud) [stack@director images]$ for i in /usr/share/rhosp-director-images/ironic-python-agent-latest.tar /usr/share/rhosp-director-images/overcloud-hardened-uefi-full-latest.tar; do tar -xvf $i; done
(undercloud) [stack@director images]$ openstack overcloud image upload --image-path /home/stack/images/
(undercloud) [stack@undercloud ~]$ openstack overcloud node provision --stack overcloud --network-config --output /home/stack/templates/overcloud-baremetal-deployed.yaml /home/stack/templates/overcloud-baremetal-deploy.yaml
~~~

Actual results:
"openstack overcloud node provision" fails.

Expected results:
"openstack overcloud node provision" succeeds.
Generally I wouldn't expect this to work, since the image is intended for use with UEFI machines, but I think @sbaker may have mentioned something recently about grub support for LVM in BIOS boot mode. I'm going to needinfo him, and from there we can figure out whether the images should "just work" in this misconfigured case, or whether there is a legitimate bug hiding here.
Could you please provide the output of the following for a <node> which shows this issue?

~~~
baremetal node show <node> -o yaml
~~~

This will confirm whether the node has boot mode uefi even though the VM has boot mode bios. Meanwhile, we'll need an environment which replicates this issue; we'll start by setting up a CI job.
> Could you please provide the output of the following for a <node> which shows this issue?
>
> baremetal node show <node> -o yaml

I reproduced the issue again. I ran the following command to set boot_mode to bios after introspection:

~~~
(undercloud) [stack@undercloud ~]$ openstack baremetal node list -f value -c UUID | while read NODE; do openstack baremetal node set --property capabilities="boot_mode:bios,$(openstack baremetal node show $NODE -f json -c properties | jq -r .properties.capabilities | sed "s/boot_mode:[^,]*,//g")" $NODE; done
~~~

Then I tried "openstack overcloud node provision", but it fails with the same error message:

~~~
(undercloud) [stack@undercloud ~]$ openstack overcloud node provision \
    --stack overcloud \
    --network-config \
    --output /home/stack/templates/overcloud-baremetal-deployed.yaml \
    /home/stack/templates/overcloud-baremetal-deploy.yaml
~~~

The following is the result of "baremetal node show" after the failure of "overcloud node provision":

~~~
(undercloud) [stack@undercloud ~]$ openstack baremetal node show controller0 -f yaml
allocation_uuid: a9a06bf9-9b43-4570-bcfe-322203f46b9d
automated_clean: null
bios_interface: no-bios
boot_interface: ipxe
chassis_uuid: null
clean_step: {}
conductor: undercloud.yatanaka.example.com
conductor_group: ''
console_enabled: false
console_interface: ipmitool-socat
created_at: '2022-11-01T01:28:33+00:00'
deploy_interface: direct
deploy_step: {}
description: null
driver: ipmi
driver_info:
  deploy_kernel: file:///var/lib/ironic/httpboot/agent.kernel
  deploy_ramdisk: file:///var/lib/ironic/httpboot/agent.ramdisk
  ipmi_address: 192.168.24.254
  ipmi_password: '******'
  ipmi_port: '6230'
  ipmi_username: admin
  rescue_kernel: file:///var/lib/ironic/httpboot/agent.kernel
  rescue_ramdisk: file:///var/lib/ironic/httpboot/agent.ramdisk
driver_internal_info:
  agent_cached_clean_steps_refreshed: '2022-11-01 01:37:02.647434'
  agent_cached_deploy_steps_refreshed: '2022-11-01 01:46:29.068391'
  agent_continue_if_ata_erase_failed: false
  agent_continue_if_secure_erase_failed: false
  agent_enable_ata_secure_erase: true
  agent_enable_nvme_secure_erase: true
  agent_erase_devices_iterations: 1
  agent_erase_devices_zeroize: true
  agent_erase_skip_read_only: false
  agent_last_heartbeat: '2022-11-01T01:50:48.986160'
  agent_version: 7.0.3.dev18
  clean_steps: null
  deploy_boot_mode: uefi
  deploy_steps: null
  disk_erasure_concurrency: 1
  hardware_manager_version:
    generic_hardware_manager: '1.1'
  is_whole_disk_image: true
  last_power_state_change: '2022-11-01T01:51:14.419960'
  root_uuid_or_disk_id: '0x00000000'
extra:
  metalsmith_attached_ports:
  - d6307c4b-4f4c-4a49-a174-c1202f374e0c
  metalsmith_created_ports:
  - d6307c4b-4f4c-4a49-a174-c1202f374e0c
fault: null
inspect_interface: inspector
inspection_finished_at: null
inspection_started_at: '2022-11-01T01:29:10+00:00'
instance_info:
  capabilities:
    boot_option: local
  configdrive: '******'
  display_name: overcloud-controller-0
  image_checksum: null
  image_disk_format: raw
  image_os_hash_algo: sha256
  image_os_hash_value: 3913a3db0d9fd1d3cc014af1a0959e1f02471ecef90461fe8e52c1bd2a50cf57
  image_source: file:///var/lib/ironic/images/overcloud-hardened-uefi-full.raw
  image_type: whole-disk-image
  image_url: '******'
  root_gb: 98
  traits: []
instance_uuid: a9a06bf9-9b43-4570-bcfe-322203f46b9d
last_error: null
lessee: null
maintenance: false
maintenance_reason: null
management_interface: ipmitool
name: controller0
network_data: {}
network_interface: flat
owner: null
power_interface: ipmitool
power_state: power on
properties:
  capabilities: boot_mode:bios,cpu_vt:true,cpu_aes:true,cpu_hugepages:true,cpu_hugepages_1g:true
  cpu_arch: x86_64
  cpus: '8'
  local_gb: '99'
  memory_mb: '32768'
  vendor: unknown
protected: false
protected_reason: null
provision_state: active
provision_updated_at: '2022-11-01T01:51:25+00:00'
raid_config: {}
raid_interface: no-raid
rescue_interface: agent
reservation: null
resource_class: baremetal
retired: false
retired_reason: null
storage_interface: noop
target_power_state: null
target_provision_state: null
target_raid_config: {}
traits: []
updated_at: '2022-11-01T01:51:25+00:00'
uuid: 2f463858-ef1e-463a-9772-225d8b6f38b4
vendor_interface: ipmitool
~~~
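For clarity, the shell one-liner above rewrites the boot_mode entry inside the node's comma-separated capabilities property. The same transformation in Python (illustrative sketch only; the function name is mine):

```python
def set_boot_mode(capabilities, mode):
    """Replace (or add) the boot_mode entry in an Ironic-style
    comma-separated capabilities string, keeping all other entries."""
    kept = [c for c in capabilities.split(",")
            if c and not c.startswith("boot_mode:")]
    return ",".join(["boot_mode:" + mode] + kept)

# Capabilities string as reported by "baremetal node show" above
caps = "boot_mode:uefi,cpu_vt:true,cpu_aes:true,cpu_hugepages:true,cpu_hugepages_1g:true"
print(set_boot_mode(caps, "bios"))
# prints: boot_mode:bios,cpu_vt:true,cpu_aes:true,cpu_hugepages:true,cpu_hugepages_1g:true
```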
Could you provide us with the output of `virsh dumpxml <vm_id>` from the hypervisor? The deployment obviously completes based on the data in the node show output, but the bootloader setup logic path differs depending on the running state versus the requested state of the VM. Specifically, VMs are static: their operating mode is not changed the way most hardware can be changed. The fact that you're manually changing the boot mode on a VM likely doesn't help this situation; in fact, it might be the root cause of the configuration difference here. If you can extract the deployment logs from your undercloud, that would give us a full picture of what is going on.
I've found the root cause of this. virt-sysprep is run during the RPM build of rhosp-director-images (see overcloud-uefi.tdl), and one of the default operations being run is lvm-uuids:

~~~
lvm-uuids
Change LVM2 PV and VG UUIDs.

On Linux guests that have LVM2 physical volumes (PVs) or volume groups (VGs), new random UUIDs are generated and assigned to those PVs and VGs.
~~~

This operation should be excluded from the defaults, so the virt-sysprep call would become:

~~~
virt-sysprep --operations defaults,-customize,-lvm-uuids --format qcow2 -a /image-build/overcloud-hardened-uefi-full.qcow2
~~~

Could DFG:PCD be responsible for making this change in rhosp-director-images?
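Because the lvm-uuids operation rewrites only the on-disk LVM metadata and leaves the lvmid references in grub.cfg untouched, an already-built image can be checked for this problem by comparing the two. A rough Python sketch (hypothetical helper; the grub.cfg text and the actual VG UUID would come from guestfish, as shown earlier in this bug):

```python
import re

def grub_lvmid_vgs(grub_cfg_text):
    """Collect the hyphenated VG UUIDs referenced by lvmid/<VG>/<LV>
    paths in a grub.cfg (32 UUID characters plus 6 hyphens = 38)."""
    return {m.group(1)
            for m in re.finditer(r"lvmid/([0-9A-Za-z-]{38})/", grub_cfg_text)}

# Sample line taken from the affected image's grub.cfg
cfg = ("set root='lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/"
       "G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H'")

# Actual VG UUID reported by guestfish vgs-full inside the image
actual_vg_uuid = "W0krhKqeboSeZYKx63VOtmaNgHYhoeO2"

# Any grub.cfg VG reference that does not match the real VG UUID is stale
stale = [vg for vg in grub_lvmid_vgs(cfg)
         if vg.replace("-", "") != actual_vg_uuid]
print("image affected:", bool(stale))  # prints: image affected: True
```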
*** Bug 2135615 has been marked as a duplicate of this bug. ***
Targeting this to 17.0 z1; since we're publishing an invalid image, it would be good to correct it.
My verification failed, I'm proposing a follow-up change.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 17.0.1 director image RPMs), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:0278