Description of problem:

OCP 4.6.0 deployment using baremetal-deploy (openshift-installer) on Supermicro nodes is not working as expected. Once openshift-installer kicks in, the node goes through a couple of "PXE" boots in UEFI mode for the RHCOS installation and should finally boot from persistent "disk" in UEFI mode. This does not work properly on Supermicro nodes (especially the 1029P). The logs were captured from the bootstrap VM during installation and verified via ipmitool.

The issue is that the final reboot with "disk" goes through "PXE" (the server's default boot order) and performs the OS installation again. This can only be avoided by interrupting the boot manually via the console and changing the boot order to disk. The same flow works fine on Dell servers in UEFI mode.

Ironic conductor logs
======================
2020-09-30 00:13:13.766 1 DEBUG ironic.common.utils [req-fd37b81f-fc6f-498e-beb4-77900f7b8b45 - - - - -] Execution completed, command line is "ipmitool -I lanplus -H mgmt-f06-h14-000-1029p.rdu2.scalelab.redhat.com -L ADMINISTRATOR -p 623 -U quads -R 1 -N 1 -f /tmp/tmp500y8ha3 chassis bootdev pxe options=efiboot" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:77
2020-09-30 00:06:30.874 1 DEBUG ironic.common.utils [req-579626f8-3b40-4824-ae41-ac4036196f22 - - - - -] Execution completed, command line is "ipmitool -I lanplus -H mgmt-f06-h14-000-1029p.rdu2.scalelab.redhat.com -L ADMINISTRATOR -p 623 -U quads -R 1 -N 1 -f /tmp/tmpdzwtq6pz chassis bootdev pxe options=efiboot" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:77
2020-09-30 00:19:46.761 1 DEBUG oslo_concurrency.processutils [req-14dae137-9147-40d4-bbf6-48244f15d950 - - - - -] Running cmd (subprocess): ipmitool -I lanplus -H mgmt-f06-h14-000-1029p.rdu2.scalelab.redhat.com -L ADMINISTRATOR -p 623 -U quads -R 1 -N 1 -f /tmp/tmp0e5_f3xu raw 0x00 0x08 0x05 0xe0 0x08 0x00 0x00 0x00 execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:372

IPMI tool on boot status
========================
While PXE boot

# ipmitool -I lanplus -U xxxxx -P xxxxx -H mgmt-f06-h14-000-1029p.rdu2.scalelab.redhat.com chassis bootparam get 5
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: a004000000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to only next boot
   - BIOS EFI boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

While Disk boot

# ipmitool -I lanplus -U xxxxx -P xxxxx -H mgmt-f06-h14-000-1029p.rdu2.scalelab.redhat.com chassis bootparam get 5
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: e008000000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to all future boots
   - BIOS EFI boot
   - Boot Device Selector : Force Boot from default Hard-Drive
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

Version-Release number of selected component (if applicable):
OCP 4.6.0

How reproducible:
Consistently on Supermicro nodes

Steps to Reproduce:
1. Deploy a fresh cluster on Supermicro nodes using baremetal-deploy or openshift-installer
2. Observe the final boot order and the behavior on the console

Actual results:
The final boot uses PXE, and the PXE server performs the OS installation again.

Expected results:
The final boot should be from disk, as on Dell servers.

Additional info:
Hardware - Supermicro 1029P
Firmware Revision: 01.71.17
BIOS Version: 3.0a
Redfish Version: 1.0.1
I've never used Supermicro before, but I have some questions that may help us understand a bit of what is happening.

1- Just to confirm, are you using the ipmi driver?
2- Have you tried using BIOS instead of UEFI? I'm wondering if the same problem persists or not.
3- This sounds a bit like a firmware version problem. Maybe try updating and testing with BIOS and UEFI to see how it goes?

Thanks!
(In reply to Iury Gregory Melo Ferreira from comment #1)
> I've never used supermicro before, but I have some questions that may help
> us understand a bit of what is happening.
>
> 1- Just to confirm you are using the ipmi driver?

Yes, ipmi driver.

> 2- Have you tried using BIOS instead of UEFI? I'm wondering if the same
> problem persists or not.

BIOS works.

> 3- This sounds a bit like a firmware version problem, maybe try updating and
> testing with bios and uefi to see how it goes?
>
> Thanks!
openshift-installer 4.6.0 also fails to boot Supermicro nodes in UEFI mode, so I have updated this ticket to OCP 4.6.0. The behavior is the same as in the previous version: with the IPMI driver, UEFI fails but BIOS mode works.
Failed again with updated firmware and BIOS. Firmware Revision: 01.71.19 BIOS Version: 3.3
Looks like an issue installing the whole disk image via UEFI. On the conductor I see:

2020-10-05 20:27:26.325 1 INFO ironic.drivers.modules.agent_base [req-3e8e212e-e1bc-429f-bbe9-8a77605c9791 - - - - -] Could not install bootloader for whole disk image for node 44ab8813-3d0e-47f8-b3b7-244b4c75b4fa, Error: No partition with UUID 0x00000000 found on device /dev/sda"

This error is repeated for 3 nodes. In the corresponding ramdisk image for node 44ab8813-3d0e-47f8-b3b7-244b4c75b4fa we get:

Oct 05 16:27:20 master-0 ironic-python-agent[2176]: 2020-10-05 16:27:20.423 2176 ERROR root [-] Command failed: install_bootloader, error: Error finding the disk or partition device to deploy the image onto: No partition with UUID 0x00000000 found on device /dev/sda: ironic_python_agent.errors.DeviceNotFound: Error finding the disk or partition device to deploy the image onto: No partition with UUID 0x00000000 found on device /dev/sda
2020-10-05 16:27:20.423 2176 ERROR root Traceback (most recent call last):
2020-10-05 16:27:20.423 2176 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 163, in run
2020-10-05 16:27:20.423 2176 ERROR root     result = self.execute_method(**self.command_params)
2020-10-05 16:27:20.423 2176 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py", line 753, in install_bootloader
2020-10-05 16:27:20.423 2176 ERROR root     target_boot_mode=target_boot_mode)
2020-10-05 16:27:20.423 2176 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py", line 503, in _install_grub2
2020-10-05 16:27:20.423 2176 ERROR root     root_partition = _get_partition(device, uuid=root_uuid)
2020-10-05 16:27:20.423 2176 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py", line 132, in _get_partition
2020-10-05 16:27:20.423 2176 ERROR root     raise errors.DeviceNotFound(error_msg)
2020-10-05 16:27:20.423 2176 ERROR root ironic_python_agent.errors.DeviceNotFound: Error finding the disk or partition device to deploy the image onto: No partition with UUID 0x00000000 found on device /dev/sda
2020-10-05 16:27:20.423 2176 ERROR root

This is why booting off disk is failing when using UEFI.

What's the version of the ironic-conductor pkg being used? I realize this is 4.5, but can you get the pkg version from the ironic container?
Thanks Bob for the update. These are from the 4.6 build:

openstack-ironic-api-15.1.1-0.20200724075308.3e92fd0.el8ost.noarch
python3-ironic-lib-4.3.0-0.20200605221931.df238ba.el8ost.noarch
openstack-ironic-common-15.1.1-0.20200724075308.3e92fd0.el8ost.noarch
openstack-ironic-conductor-15.1.1-0.20200724075308.3e92fd0.el8ost.noarch
python3-ironic-prometheus-exporter-0.0.1-0.20190712090404.f7e9344.el8ost.noarch
Talking with Julia, Sai, and Murali - Julia thought based on the errors in Comment 6 that the Supermicro is actually in BIOS mode and ironic-python-agent attempts to boot with UEFI. We were able to confirm that in the IPA logs here: Oct 05 16:25:14 master-0 ironic-python-agent[2176]: 2020-10-05 16:25:14.946 2176 DEBUG root [-] The current boot mode is bios get_boot_info /usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py:1149 Which is set here https://github.com/openstack/ironic-python-agent/blob/99dee5067ea4f06d3083170d801e600f46842170/ironic_python_agent/hardware.py#L1189 based on '/sys/firmware/efi' not being present. We'll need to get into the BIOS configuration on the Supermicro and set it for UEFI. Julia found these notes from Supermicro that may be useful - https://www.supermicro.com/support/faqs/faq.cfm?faq=22208.
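To make the check above concrete, here is a minimal sketch of detecting the boot mode from the presence of '/sys/firmware/efi'. The function name `detect_boot_mode` and its `efi_dir` parameter are assumptions made for illustration; this is not the ironic-python-agent code itself.

```python
import os


def detect_boot_mode(efi_dir="/sys/firmware/efi"):
    """Return 'uefi' if the EFI directory exists, otherwise 'bios'.

    Hypothetical helper: the kernel only exposes this directory when the
    machine was booted through UEFI firmware, so its absence implies a
    legacy (BIOS) boot.
    """
    return "uefi" if os.path.isdir(efi_dir) else "bios"


if __name__ == "__main__":
    print("The current boot mode is", detect_boot_mode())
```

Running this inside the deploy ramdisk on an affected node would be one quick way to confirm which mode the firmware actually used, independent of what was requested over IPMI.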
> ironic_python_agent.errors.DeviceNotFound: Error finding the disk or partition device to deploy the image onto: No partition with UUID 0x00000000 found on device /dev/sda

This is a red herring, ignore it. At some point we should just stop calling the code that can never succeed.

> Which is why booting off disk is failing when using UEFI.

Whole disk images can boot even if grub installation fails (it always does exactly the way you show).
While the error is a red herring, I do agree with Bob's conclusions. OpenShift requests UEFI boot, but the node is actually booted in legacy mode. Since IPMI cannot change the boot mode, it has to be done manually. Could you try it? If you cannot make such modifications, you need to use legacy (BIOS) boot.
Created attachment 1719855 [details] ramdisk log after BOOT Mode changed to UEFI in BIOS config
We made some progress by changing to UEFI mode in the BIOS configuration: the bootloader install works now. However, we are still not booting off the hard drive, even though Ironic sets it via ipmitool and it appears to be set correctly according to ipmitool.

In the BIOS configuration, Kambiz changed the Boot Mode from dual to UEFI? After that we see the current boot mode detected correctly in IPA:

Oct 07 15:10:57 f06-h15-000-1029p.rdu2.scalelab.redhat.com ironic-python-agent[2175]: 2020-10-07 15:10:57.639 2175 DEBUG root [-] The current boot mode is uefi get_boot_info /usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py:1149

Oct 07 15:12:35 f06-h15-000-1029p.rdu2.scalelab.redhat.com ironic-python-agent[2175]: 2020-10-07 15:12:35.261 2175 DEBUG ironic_lib.utils [-] Command stdout is: "Model: ATA INTEL SSDSC2BB48 (scsi)
Disk /dev/sda: 480GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name        Flags
 1      1049kB  404MB   403MB   ext4         boot
 2      404MB   537MB   133MB   fat16        EFI-SYSTEM  boot, esp
 3      537MB   538MB   1049kB               BIOS-BOOT   bios_grub
 4      538MB   3553MB  3015MB               luks_root
 5      480GB   480GB   68.0MB

Ironic sets the boot device by:

2020-10-07 19:12:55.272 1 DEBUG oslo_concurrency.processutils [req-829d1ae7-dc1a-4f3b-b9f2-f5ec45701931 - - - - -] CMD "ipmitool -I lanplus -H mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com -L ADMINISTRATOR -p 623 -U quads -R 1 -N 5 -f /tmp/tmpb7xfpvoq raw 0x00 0x08 0x05 0xe0 0x08 0x00 0x00 0x00" returned: 0 in 0.045s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:416

And it looks like it's set correctly:

ipmitool -I lanplus -U quads -P rdu2@479 -H mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com chassis bootparam get 5
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: a004000000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to only next boot
   - BIOS EFI boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

But the next boot is PXE, and dnsmasq ends up providing inspector.ipxe.

I've attached the ramdisk log for this node.
Actually, I grabbed the wrong output from ipmitool. It should be this one, which shows it's set to boot off the hard drive:

# ipmitool -I lanplus -U quads -P rdu2@479 -H mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com chassis bootparam get 5
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: e008000000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to all future boots
   - BIOS EFI boot
   - Boot Device Selector : Force Boot from default Hard-Drive
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST
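For anyone comparing the raw "Boot parameter data" values in this thread, the following is a small sketch that decodes the first two bytes of IPMI boot parameter 5 (boot flags). The function name and the selector table are written for illustration only and cover just a few device selectors; the bit positions follow the IPMI 2.0 boot flags layout as I understand it.

```python
# Partial table of boot device selectors (bits 5:2 of the second data byte).
# The first three labels mirror ipmitool's wording; the rest are my own.
DEVICE_SELECTORS = {
    0b0000: "No override",
    0b0001: "Force PXE",
    0b0010: "Force Boot from default Hard-Drive",
    0b0101: "Force boot from CD/DVD",
    0b0110: "Force boot into BIOS Setup",
}


def decode_boot_flags(data_hex):
    """Decode the 'Boot parameter data' hex string printed by ipmitool."""
    data = bytes.fromhex(data_hex)
    flags, device = data[0], data[1]
    selector = (device >> 2) & 0x0F
    return {
        "valid": bool(flags & 0x80),       # bit 7: boot flags valid
        "persistent": bool(flags & 0x40),  # bit 6: apply to all future boots
        "efi": bool(flags & 0x20),         # bit 5: EFI boot type
        "device": DEVICE_SELECTORS.get(selector, "selector 0x%x" % selector),
    }


if __name__ == "__main__":
    for value in ("a004000000", "e008000000"):
        print(value, decode_boot_flags(value))
```

Decoded this way, a004000000 is a one-time EFI PXE override and e008000000 is a persistent EFI hard-drive override, which matches the ipmitool output quoted in this bug.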
We also tested a manual reboot via IPMI with the settings as in Comment 13 (Force Boot from default Hard-Drive), and this also resulted in a PXE boot.
Would replacing IPMI by Redfish to manage this node work?
(In reply to Ramon Acedo from comment #15)
> Would replacing IPMI by Redfish to manage this node work?

I think they don't have the Redfish license, but it would be really good to see if it would work with Redfish + UEFI.
Do you have ramdisk logs from comment 12? The ones in comment 11 still show "The current boot mode is bios".
Changed BIOS configuration to boot UEFI mode for "mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com" host only, other hosts are still in BIOS mode. Oct 07 15:10:57 f06-h15-000-1029p.rdu2.scalelab.redhat.com ironic-python-agent[2175]: 2020-10-07 15:10:57.639 2175 DEBUG root [-] The current boot mode is uefi get_boot_info /usr/lib/python3.6/site-packages/ironic_python_agent/hardware.py:1149
Created attachment 1720006 [details] ramdisk in uefi mode
The boot order seems correct for the last run:

Oct 07 15:12:36 f06-h15-000-1029p.rdu2.scalelab.redhat.com ironic-python-agent[2175]: 2020-10-07 15:12:36.014 2175 DEBUG ironic_lib.utils [-] Execution completed, command line is "efibootmgr -c -d /dev/sda -p 2 -w -L ironic1 -l \EFI\BOOT\BOOTX64.EFI" execute /usr/lib/python3.6/site-packages/ironic_lib/utils.py:101
Oct 07 15:12:36 f06-h15-000-1029p.rdu2.scalelab.redhat.com ironic-python-agent[2175]: 2020-10-07 15:12:36.027 2175 DEBUG ironic_lib.utils [-] Command stdout is: "BootCurrent: 0007
Timeout: 1 seconds
BootOrder: 0000,0004,0005,0006,0007,0008,0009,000A,000B,000C,000D,000E,000F,0003,0001
Boot0001  Hard Drive
Boot0003* UEFI: Built-in EFI Shell
Boot0004* (B28/D0/F0) UEFI: PXE IPv4 Intel(R) Ethernet Connection X722 for 1GbE(MAC:0cc47afa192a)
Boot0005* (B28/D0/F1) UEFI: PXE IPv4 Intel(R) Ethernet Connection X722 for 1GbE(MAC:0cc47afa192b)
Boot0006* (B94/D0/F0) UEFI: PXE IPv4 Intel(R) Ethernet Controller X710 for 10GbE SFP+(MAC:ac1f6b2d19d4)
Boot0007* (B94/D0/F1) UEFI: PXE IPv4 Intel(R) Ethernet Controller X710 for 10GbE SFP+(MAC:ac1f6b2d19d5)
Boot0008* (B94/D0/F2) UEFI: PXE IPv4 Intel(R) Ethernet Controller X710 for 10GbE SFP+(MAC:ac1f6b2d19d6)
Boot0009* (B94/D0/F3) UEFI: PXE IPv4 Intel(R) Ethernet Controller X710 for 10GbE SFP+(MAC:ac1f6b2d19d7)
Boot000A* (B28/D0/F0) UEFI: PXE IPv6 Intel(R) Ethernet Connection X722 for 1GbE(MAC:0cc47afa192a)
Boot000B* (B28/D0/F1) UEFI: PXE IPv6 Intel(R) Ethernet Connection X722 for 1GbE(MAC:0cc47afa192b)
Boot000C* (B94/D0/F0) UEFI: PXE IPv6 Intel(R) Ethernet Controller X710 for 10GbE SFP+(MAC:ac1f6b2d19d4)
Boot000D* (B94/D0/F1) UEFI: PXE IPv6 Intel(R) Ethernet Controller X710 for 10GbE SFP+(MAC:ac1f6b2d19d5)
Boot000E* (B94/D0/F2) UEFI: PXE IPv6 Intel(R) Ethernet Controller X710 for 10GbE SFP+(MAC:ac1f6b2d19d6)
Boot000F* (B94/D0/F3) UEFI: PXE IPv6 Intel(R) Ethernet Controller X710 for 10GbE SFP+(MAC:ac1f6b2d19d7)
Boot0000* ironic1
" execute /usr/lib/python3.6/site-packages/ironic_lib/utils.py:103

The current hypothesis is that calling `ipmitool bootdev` after that confuses the EFI firmware and resets the boot order to something else. A potential fix is to call ipmitool before efibootmgr; https://review.opendev.org/#/c/756881/ achieves exactly that. We need to find a way to test it.
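One possible way to test the hypothesis is to capture the efibootmgr output before and after the ipmitool call and compare which entry is first in BootOrder. The sketch below parses that output; `first_boot_entry` and `SAMPLE_OUTPUT` are hypothetical names, and the sample is an abbreviated version of the output above, not a full capture.

```python
import re

# Abbreviated sample based on the efibootmgr output in this comment.
SAMPLE_OUTPUT = """BootCurrent: 0007
Timeout: 1 seconds
BootOrder: 0000,0003,0001
Boot0001  Hard Drive
Boot0003* UEFI: Built-in EFI Shell
Boot0000* ironic1
"""


def first_boot_entry(efibootmgr_output):
    """Return (boot number, label) of the first BootOrder entry, or None."""
    order = re.search(r"^BootOrder:\s*([0-9A-Fa-f,]+)",
                      efibootmgr_output, re.MULTILINE)
    if order is None:
        return None
    first = order.group(1).split(",")[0]
    # Entries look like "Boot0000* label"; the asterisk marks active entries.
    entry = re.search(r"^Boot%s\*?\s+(.+)$" % re.escape(first),
                      efibootmgr_output, re.MULTILINE)
    return first, (entry.group(1).strip() if entry else None)


if __name__ == "__main__":
    print(first_boot_entry(SAMPLE_OUTPUT))
```

If the first entry is "ironic1" right after efibootmgr runs but something else after the IPMI boot device call, that would support the theory that the BMC or firmware rewrites the order.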
We tried using Redfish on these models, and the node rebooted properly from disk + EFI after deployment.

Ironic conductor logs, first PXE boot:

2020-10-09 17:50:32.334 1 DEBUG sushy.connector [req-06df134d-6bef-4af9-9cd4-b4ce00e5a77a - - - - -] HTTP request: PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; headers: {'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Pxe', 'BootSourceOverrideEnabled': 'Once'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:102
2020-10-09 17:50:32.543 1 DEBUG sushy.connector [req-06df134d-6bef-4af9-9cd4-b4ce00e5a77a - - - - -] HTTP response for PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1: status code: 200 _op /usr/lib/python3.6/site-packages/sushy/connector.py:156

Reboot after deploy:

2020-10-09 18:01:24.471 1 DEBUG sushy.connector [req-9edbf591-f8f5-4ce4-884f-9c391253ddd2 - - - - -] HTTP request: PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; headers: {'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Hdd'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:102
2020-10-09 18:01:24.716 1 DEBUG sushy.connector [req-9edbf591-f8f5-4ce4-884f-9c391253ddd2 - - - - -] HTTP response for PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1: status code: 200 _op /usr/lib/python3.6/site-packages/sushy/connector.py:156

However, we observed an inconsistency when using Redfish + UEFI on a node being re-deployed (or a node that had the persistent disk option set in the BIOS): the first PXE boot set by Redfish is not recognized by the server, which continues to boot from disk. We had to manually reset the boot option to PXE + UEFI before re-deployment. This is reproducible on this server model.
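For reference, the two PATCH bodies in the logs above can be reproduced with a few lines of Python. This is only a sketch: `boot_override_body` is a hypothetical helper, not part of sushy or Ironic, and it builds the JSON without sending any request.

```python
import json


def boot_override_body(target, enabled=None):
    """Build a Redfish boot override body like the ones seen in the logs.

    target:  a BootSourceOverrideTarget value such as "Pxe" or "Hdd".
    enabled: optional BootSourceOverrideEnabled value such as "Once";
             omitted from the body when None, as in the "Hdd" request above.
    """
    boot = {"BootSourceOverrideTarget": target}
    if enabled is not None:
        boot["BootSourceOverrideEnabled"] = enabled
    return {"Boot": boot}


if __name__ == "__main__":
    # First PXE boot, then the reboot after deploy.
    print(json.dumps(boot_override_body("Pxe", "Once")))
    print(json.dumps(boot_override_body("Hdd")))
```

Sending these bodies by hand to /redfish/v1/Systems/1 (for example with curl) could help separate BMC behavior from Ironic behavior when reproducing the re-deployment inconsistency.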
(In reply to Murali Krishnasamy from comment #24) > We tried using redfish on these models and it rebooted properly with disk + > efi on reboot, Ironic conductor logs, > first PXE boot, > 2020-10-09 17:50:32.334 1 DEBUG sushy.connector > [req-06df134d-6bef-4af9-9cd4-b4ce00e5a77a - - - - -] HTTP request: PATCH > https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; > headers: {'OData-Version': '4.0'}; body: {'Boot': > {'BootSourceOverrideTarget': 'Pxe', 'BootSourceOverrideEnabled': 'Once'}}; > blocking: False; timeout: 60; session arguments: {}; _op > /usr/lib/python3.6/site-packages/sushy/connector.py:102 > 2020-10-09 17:50:32.543 1 DEBUG sushy.connector > [req-06df134d-6bef-4af9-9cd4-b4ce00e5a77a - - - - -] HTTP response for PATCH > https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1: > status code: 200 _op /usr/lib/python3.6/site-packages/sushy/connector.py:156 > > reboot after deploy, > 2020-10-09 18:01:24.471 1 DEBUG sushy.connector > [req-9edbf591-f8f5-4ce4-884f-9c391253ddd2 - - - - -] HTTP request: PATCH > https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; > headers: {'OData-Version': '4.0'}; body: {'Boot': > {'BootSourceOverrideTarget': 'Hdd'}}; blocking: False; timeout: 60; session > arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:102 > 2020-10-09 18:01:24.716 1 DEBUG sushy.connector > [req-9edbf591-f8f5-4ce4-884f-9c391253ddd2 - - - - -] HTTP response for PATCH > https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1: > status code: 200 _op /usr/lib/python3.6/site-packages/sushy/connector.py:156 > > But observed some inconsistency while using redfish + uefi on a re-deploying > node(or a node which had persistent disk option set in BIOS), that first PXE > boot set by redfish is not recognized by the server, it continued to boot > via disk. We had to manually reset the boot option to pxe + uefi before > re-deployment. 
It is reproducible in this server model.

Please open a specific bug covering the Redfish behavior issues you've observed. Due to the cross-product nature, we'll need to duplicate the bugs and track them independently through testing to final resolution.

Could you also provide the output of a direct curl of https://bmc_ip/redfish/v1/Systems/1 before and after, as well as the exact steps that were taken to reset it to UEFI + PXE? Ideally the BMC should have refused the initial request, but clearly there is some sort of bug in the behavior that we need to sort out.
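Once the before and after curl outputs are available, a short script can highlight what changed in the "Boot" section. The sketch below assumes both captures are saved as JSON strings; `boot_changes` is a hypothetical helper and the sample values are made up for illustration, not taken from the affected BMC.

```python
import json


def boot_changes(before_json, after_json):
    """Return {key: (before, after)} for Boot properties that differ."""
    before = json.loads(before_json).get("Boot", {})
    after = json.loads(after_json).get("Boot", {})
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


if __name__ == "__main__":
    # Made-up sample captures of /redfish/v1/Systems/1, for illustration only.
    before = '{"Boot": {"BootSourceOverrideTarget": "Hdd", "BootSourceOverrideEnabled": "Continuous"}}'
    after = '{"Boot": {"BootSourceOverrideTarget": "Pxe", "BootSourceOverrideEnabled": "Once"}}'
    print(boot_changes(before, after))
```

An empty result after a PATCH that returned status code 200 would indicate that the BMC accepted the request without applying it.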
Regarding the Redfish issue in Comment 25, I created https://bugzilla.redhat.com/show_bug.cgi?id=1888072 and one upstream: https://storyboard.openstack.org/#!/story/2008252.
Removing NeedInfo as Redfish bug has been created and this IPMI issue is understood.
Upstream patch is still under review.
Backported to Victoria.
Patch has merged and is included in tagged package openstack-ironic-16.0.3-0.20201219231205.4ae5375.el8 https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1424819
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633