Bug 2138046 - openstack overcloud node provision fails when overcloud-hardened-uefi-full.raw is used on BIOS environment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director-images
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z1
Target Release: 17.0
Assignee: Steve Baker
QA Contact: David Rosenfeld
URL:
Whiteboard:
Duplicates: 2135615
Depends On:
Blocks:
 
Reported: 2022-10-27 02:35 UTC by yatanaka
Modified: 2023-01-25 12:28 UTC
CC: 11 users

Fixed In Version: rhosp-director-images-17.0-20221118.2.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, when you used the whole disk image `overcloud-hardened-uefi-full` to boot overcloud nodes, nodes that used the Legacy BIOS boot mode failed to boot because the `lvmid` of the root volume was different to the `lvmid` referenced in `grub.cfg`. With this update, the `virt-sysprep` task to reset the `lvmid` has been disabled, and nodes with Legacy BIOS boot mode can now be booted with the whole disk image.
Clone Of:
Environment:
Last Closed: 2023-01-25 12:28:27 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-19738 0 None None None 2022-10-27 02:55:07 UTC
Red Hat Product Errata RHBA-2023:0278 0 None None None 2023-01-25 12:28:28 UTC

Description yatanaka 2022-10-27 02:35:53 UTC
Description of problem:

All undercloud and overcloud nodes are KVM guests that use BIOS boot.
In this environment, openstack overcloud node provision fails at the "Wait for provisioned nodes to boot" TASK when overcloud-hardened-uefi-full.raw is used.

~~~
(undercloud) [stack@undercloud ~]$ openstack overcloud image upload --image-path /home/stack/images/
Image "file:///var/lib/ironic/images/overcloud-hardened-uefi-full.raw" was copied.
+----------------------------------------------------------------+------------------------------+------------+
|                              Path                              |             Name             |    Size    |
+----------------------------------------------------------------+------------------------------+------------+
| file:///var/lib/ironic/images/overcloud-hardened-uefi-full.raw | overcloud-hardened-uefi-full | 6442450944 |
+----------------------------------------------------------------+------------------------------+------------+


(undercloud) [stack@undercloud ~]$ openstack overcloud node unprovision --all   --stack overcloud   --network-ports   /home/stack/templates/overcloud-baremetal-deploy.yaml

  :

PLAY [Overcloud Node Grow Volumes] *********************************************
2022-10-27 09:28:23.001772 | 5254005a-676a-cf85-a24a-00000000000d |       TASK | Wait for provisioned nodes to boot
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-1: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.20 port 22: No route to host
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-0: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.18 port 22: No route to host
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-2: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.23 port 22: No route to host
2022-10-27 09:38:33.699850 | 5254005a-676a-cf85-a24a-00000000000d |      FATAL | Wait for provisioned nodes to boot | overcloud-controller-0 | error={"changed": false, "elapsed": 610, "msg": "timed out waiting for ping module test: Data could not be sent to remote host \"192.168.24.18\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.18 port 22: No route to host\r\n"}
2022-10-27 09:38:33.702294 | 5254005a-676a-cf85-a24a-00000000000d |     TIMING | Wait for provisioned nodes to boot | overcloud-controller-0 | 0:10:10.729095 | 610.70s
2022-10-27 09:38:33.703025 | 5254005a-676a-cf85-a24a-00000000000d |      FATAL | Wait for provisioned nodes to boot | overcloud-controller-1 | error={"changed": false, "elapsed": 610, "msg": "timed out waiting for ping module test: Data could not be sent to remote host \"192.168.24.20\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.20 port 22: No route to host\r\n"}
2022-10-27 09:38:33.703538 | 5254005a-676a-cf85-a24a-00000000000d |     TIMING | Wait for provisioned nodes to boot | overcloud-controller-1 | 0:10:10.730368 | 610.69s
2022-10-27 09:38:33.704214 | 5254005a-676a-cf85-a24a-00000000000d |      FATAL | Wait for provisioned nodes to boot | overcloud-controller-2 | error={"changed": false, "elapsed": 610, "msg": "timed out waiting for ping module test: Data could not be sent to remote host \"192.168.24.23\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.23 port 22: No route to host\r\n"}
2022-10-27 09:38:33.704827 | 5254005a-676a-cf85-a24a-00000000000d |     TIMING | Wait for provisioned nodes to boot | overcloud-controller-2 | 0:10:10.731656 | 610.68s

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
overcloud-controller-0     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
overcloud-controller-1     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
overcloud-controller-2     : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
~~~

The overcloud nodes' consoles show the following error message and drop to the grub rescue console.

~~~
Booting from Hard Disk...
..
error: ../../grub-core/kern/disk.c:236:disk `lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H` not found.
Entering rescue mode...
grub rescue> 
~~~

I noticed that the VG UUID referenced in grub.cfg does not match the actual VG UUID in the overcloud-hardened-uefi-full.raw image.
The LV UUID, on the other hand, does match.

~~~
(undercloud) [stack@undercloud images]$ guestfish --rw -a /var/lib/ironic/images/overcloud-hardened-uefi-full.raw  -i

><fs> vgs-full
[0] = {
  vg_name: vg
  vg_uuid: W0krhKqeboSeZYKx63VOtmaNgHYhoeO2
    :

><fs> lvs-full
[0] = {
  lv_name: lv_root
  lv_uuid: G6A04kBfSNRwISLIOFOW6u688nUILD1H
    :

><fs> grep lvmid /boot/grub2/grub.cfg 
set root='lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H'
  search --no-floppy --fs-uuid --set=root --hint='lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H'  2ef3f2e5-ad4d-448a-b641-254514b34b01
set boot='lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H'
  search --no-floppy --fs-uuid --set=boot --hint='lvmid/QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY/G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H'  2ef3f2e5-ad4d-448a-b641-254514b34b01
~~~
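The mismatch can be checked mechanically: grub.cfg references the root volume as `lvmid/<VG_UUID>/<LV_UUID>` with hyphenated UUIDs, while guestfish's `vgs-full`/`lvs-full` report the same UUIDs without hyphens. A minimal sketch comparing the two forms, using the values pasted above:

```shell
# UUIDs copied from the grub.cfg and guestfish output above.
grub_vg_uuid="QOu03J-psX3-51ct-lt17-E8sK-MbYn-uQMSVY"   # VG part of the lvmid path
image_vg_uuid="W0krhKqeboSeZYKx63VOtmaNgHYhoeO2"        # vgs-full vg_uuid
grub_lv_uuid="G6A04k-BfSN-RwIS-LIOF-OW6u-688n-UILD1H"   # LV part of the lvmid path
image_lv_uuid="G6A04kBfSNRwISLIOFOW6u688nUILD1H"        # lvs-full lv_uuid

# grub.cfg stores UUIDs hyphenated; guestfish reports them without hyphens,
# so strip the hyphens before comparing.
if [ "$(printf '%s' "$grub_vg_uuid" | tr -d '-')" = "$image_vg_uuid" ]; then
  vg_result=match
else
  vg_result=mismatch
fi
if [ "$(printf '%s' "$grub_lv_uuid" | tr -d '-')" = "$image_lv_uuid" ]; then
  lv_result=match
else
  lv_result=mismatch
fi
echo "VG UUID: $vg_result, LV UUID: $lv_result"
```

With these values the VG UUID mismatches while the LV UUID matches, which is consistent with grub failing to resolve the `lvmid/...` path even though the root LV itself is intact.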

As a workaround, openstack overcloud node provision succeeds when the overcloud-full image is used instead.

~~~
$ sudo dnf install rhosp-director-images -y
$ tar xvf /usr/share/rhosp-director-images/overcloud-full-latest-17.0-x86_64.tar -C ~/images
$ openstack overcloud image upload --image-path /home/stack/images/
$ sudo rm  /var/lib/ironic/images/overcloud-hardened-uefi-full.raw
$ openstack overcloud node provision \
--stack overcloud \
--network-config \
--output /home/stack/templates/overcloud-baremetal-deployed.yaml \
/home/stack/templates/overcloud-baremetal-deploy.yaml
~~~

I'm not sure if this is a bug or expected behavior.
Can overcloud-hardened-uefi-full not be used in a BIOS environment?


Version-Release number of selected component (if applicable):
RHOSP 17.0 GA


How reproducible:

Steps to Reproduce:
1. Create undercloud and overcloud nodes as KVM guests with BIOS boot
2. run "openstack overcloud node provision" with overcloud-hardened-uefi-full
  ~~~
  (undercloud) [stack@director images]$ for i in /usr/share/rhosp-director-images/ironic-python-agent-latest.tar /usr/share/rhosp-director-images/overcloud-hardened-uefi-full-latest.tar; do tar -xvf $i; done
  (undercloud) [stack@director images]$ openstack overcloud image upload --image-path /home/stack/images/
  (undercloud) [stack@undercloud ~]$ openstack overcloud node provision --stack overcloud --network-config --output /home/stack/templates/overcloud-baremetal-deployed.yaml /home/stack/templates/overcloud-baremetal-deploy.yaml
  ~~~



Actual results:
openstack overcloud node provision fails

Expected results:
openstack overcloud node provision succeeds

Comment 1 Julia Kreger 2022-10-27 16:09:45 UTC
Generally I wouldn't expect this to work, since the image was intended for use with UEFI machines, but I think @sbaker may have mentioned something about grub support of LVM in BIOS boot mode recently. I'm going to needinfo him, and from there we can figure out if the images should "just kind of work" in this misconfigured case, or if there is a legitimate bug hiding here.

Comment 3 Steve Baker 2022-10-31 19:45:41 UTC
Could you please provide the output of the following for a <node> which shows this issue?

   baremetal node show <node> -o yaml

This will confirm whether the node has boot mode uefi, even though the VM has boot mode bios.

Meanwhile, we'll need an environment which replicates this issue, we'll start by setting up a CI job.

Comment 4 yatanaka 2022-11-01 02:07:57 UTC
> Could you please provide the output of the following for a <node> which shows this issue?
> 
>   baremetal node show <node> -o yaml

I reproduced the issue again.
I ran the following command to set boot_mode to bios after an introspection.

  (undercloud) [stack@undercloud ~]$ openstack baremetal node list -f value -c UUID| while read NODE; do openstack baremetal node set --property capabilities="boot_mode:bios,$(openstack baremetal node show $NODE -f json -c properties | jq -r .properties.capabilities | sed "s/boot_mode:[^,]*,//g")" $NODE;done 
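The capability rewrite inside that one-liner can be shown in isolation (the sample capabilities string below is hypothetical): the sed expression strips any existing `boot_mode:<value>,` entry, and `boot_mode:bios,` is prepended.

```shell
# Hypothetical capabilities string as returned for one node.
caps="boot_mode:uefi,cpu_vt:true,cpu_aes:true,cpu_hugepages:true"

# Same rewrite as in the one-liner above: drop the old boot_mode entry,
# then prepend boot_mode:bios.
new_caps="boot_mode:bios,$(printf '%s' "$caps" | sed 's/boot_mode:[^,]*,//g')"
echo "$new_caps"
```

Note that the sed pattern only matches a `boot_mode` entry followed by a comma; if `boot_mode` happened to be the last capability, it would be left in place alongside the new entry.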

And then I tried openstack overcloud node provision but it fails with the same error message.

  (undercloud) [stack@undercloud ~]$ 
  openstack overcloud node provision \
  --stack overcloud \
  --network-config \
  --output /home/stack/templates/overcloud-baremetal-deployed.yaml \
  /home/stack/templates/overcloud-baremetal-deploy.yaml

The following is the result of "baremetal node show" after the failure of "overcloud node provision".

~~~
(undercloud) [stack@undercloud ~]$ openstack baremetal node show controller0 -f yaml
allocation_uuid: a9a06bf9-9b43-4570-bcfe-322203f46b9d
automated_clean: null
bios_interface: no-bios
boot_interface: ipxe
chassis_uuid: null
clean_step: {}
conductor: undercloud.yatanaka.example.com
conductor_group: ''
console_enabled: false
console_interface: ipmitool-socat
created_at: '2022-11-01T01:28:33+00:00'
deploy_interface: direct
deploy_step: {}
description: null
driver: ipmi
driver_info:
  deploy_kernel: file:///var/lib/ironic/httpboot/agent.kernel
  deploy_ramdisk: file:///var/lib/ironic/httpboot/agent.ramdisk
  ipmi_address: 192.168.24.254
  ipmi_password: '******'
  ipmi_port: '6230'
  ipmi_username: admin
  rescue_kernel: file:///var/lib/ironic/httpboot/agent.kernel
  rescue_ramdisk: file:///var/lib/ironic/httpboot/agent.ramdisk
driver_internal_info:
  agent_cached_clean_steps_refreshed: '2022-11-01 01:37:02.647434'
  agent_cached_deploy_steps_refreshed: '2022-11-01 01:46:29.068391'
  agent_continue_if_ata_erase_failed: false
  agent_continue_if_secure_erase_failed: false
  agent_enable_ata_secure_erase: true
  agent_enable_nvme_secure_erase: true
  agent_erase_devices_iterations: 1
  agent_erase_devices_zeroize: true
  agent_erase_skip_read_only: false
  agent_last_heartbeat: '2022-11-01T01:50:48.986160'
  agent_version: 7.0.3.dev18
  clean_steps: null
  deploy_boot_mode: uefi
  deploy_steps: null
  disk_erasure_concurrency: 1
  hardware_manager_version:
    generic_hardware_manager: '1.1'
  is_whole_disk_image: true
  last_power_state_change: '2022-11-01T01:51:14.419960'
  root_uuid_or_disk_id: '0x00000000'
extra:
  metalsmith_attached_ports:
  - d6307c4b-4f4c-4a49-a174-c1202f374e0c
  metalsmith_created_ports:
  - d6307c4b-4f4c-4a49-a174-c1202f374e0c
fault: null
inspect_interface: inspector
inspection_finished_at: null
inspection_started_at: '2022-11-01T01:29:10+00:00'
instance_info:
  capabilities:
    boot_option: local
  configdrive: '******'
  display_name: overcloud-controller-0
  image_checksum: null
  image_disk_format: raw
  image_os_hash_algo: sha256
  image_os_hash_value: 3913a3db0d9fd1d3cc014af1a0959e1f02471ecef90461fe8e52c1bd2a50cf57
  image_source: file:///var/lib/ironic/images/overcloud-hardened-uefi-full.raw
  image_type: whole-disk-image
  image_url: '******'
  root_gb: 98
  traits: []
instance_uuid: a9a06bf9-9b43-4570-bcfe-322203f46b9d
last_error: null
lessee: null
maintenance: false
maintenance_reason: null
management_interface: ipmitool
name: controller0
network_data: {}
network_interface: flat
owner: null
power_interface: ipmitool
power_state: power on
properties:
  capabilities: boot_mode:bios,cpu_vt:true,cpu_aes:true,cpu_hugepages:true,cpu_hugepages_1g:true
  cpu_arch: x86_64
  cpus: '8'
  local_gb: '99'
  memory_mb: '32768'
  vendor: unknown
protected: false
protected_reason: null
provision_state: active
provision_updated_at: '2022-11-01T01:51:25+00:00'
raid_config: {}
raid_interface: no-raid
rescue_interface: agent
reservation: null
resource_class: baremetal
retired: false
retired_reason: null
storage_interface: noop
target_power_state: null
target_provision_state: null
target_raid_config: {}
traits: []
updated_at: '2022-11-01T01:51:25+00:00'
uuid: 2f463858-ef1e-463a-9772-225d8b6f38b4
vendor_interface: ipmitool
~~~

Comment 5 Julia Kreger 2022-11-01 18:04:10 UTC
Could you provide us with the output of `virsh dumpxml <vm_id>` from the hypervisor host?

The deployment obviously completes, based on the data in the node show output, but the bootloader setup logic path differs depending on the running state versus the requested state of the VM. Specifically, VMs are static: their operating mode is not changed at runtime the way most hardware can be changed.

The fact that you're manually changing the boot mode state on a VM likely doesn't help this situation. In fact, it might actually be the root cause of the configuration difference here.

If you can extract the deployment logs from your undercloud, that would likely give us a full picture of what is going on.

Comment 8 Steve Baker 2022-11-02 00:21:41 UTC
I've found the root cause of this: virt-sysprep is run during the RPM build of rhosp-director-images (see overcloud-uefi.tdl), and one of the default operations it runs is lvm-uuids:

    Change LVM2 PV and VG UUIDs.

    On Linux guests that have LVM2 physical volumes (PVs) or volume groups (VGs), new random UUIDs are generated and assigned to those PVs and VGs.

This operation should be excluded from the defaults, so the virt-sysprep call would become:

    virt-sysprep --operations defaults,-customize,-lvm-uuids --format qcow2 -a /image-build/overcloud-hardened-uefi-full.qcow2

Could DFG:PCD be responsible for making this change in rhosp-director-images?
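For reference, `--operations` takes a comma-separated list in which a leading `-` excludes an operation from the set built so far. A small sketch of that semantics using only string handling (so it runs without virt-sysprep installed), checking that `lvm-uuids` is explicitly excluded in the proposed invocation:

```shell
# Operations list proposed above: the defaults, minus customize,
# minus lvm-uuids (a leading '-' excludes an operation).
ops="defaults,-customize,-lvm-uuids"

# Distinguish "explicitly excluded" from "explicitly enabled" from "not listed".
case ",$ops," in
  *,-lvm-uuids,*) lvm_uuids=excluded ;;
  *,lvm-uuids,*)  lvm_uuids=enabled ;;
  *)              lvm_uuids=unlisted ;;
esac
echo "lvm-uuids: $lvm_uuids"
```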

Comment 9 Steve Baker 2022-11-02 00:22:17 UTC
*** Bug 2135615 has been marked as a duplicate of this bug. ***

Comment 13 Steve Baker 2022-11-15 21:18:03 UTC
Targeting this to 17.0 z1; since we're publishing an invalid image, it would be good to correct it.

Comment 16 Steve Baker 2022-11-18 00:20:34 UTC
My verification failed; I'm proposing a follow-up change.

Comment 27 errata-xmlrpc 2023-01-25 12:28:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.0.1 director image RPMs), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0278

