Bug 1918558 - Supermicro nodes boot to PXE upon reboot after successful deployment to disk
Summary: Supermicro nodes boot to PXE upon reboot after successful deployment to disk
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Bob Fournier
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-21 03:04 UTC by Bob Fournier
Modified: 2021-02-24 15:55 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When set to boot via Redfish, Supermicro nodes require that BootSourceOverrideEnabled be set whenever BootSourceOverrideTarget is set; otherwise the default for BootSourceOverrideEnabled (Once) is used. This differs from other vendors, which require that BootSourceOverrideEnabled not be set if the value is the same as the current setting.
Consequence: Supermicro nodes boot to HDD only one time and then revert to booting using PXE.
Fix: On Supermicro nodes, always set BootSourceOverrideEnabled when setting BootSourceOverrideTarget.
Result: Supermicro nodes boot to disk persistently after deployment.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:55:11 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ironic-image pull 145 | 0 | None | closed | Bug 1918558: Update ironic version with Supermicro boot fix | 2021-02-15 19:28:58 UTC
OpenStack gerrit | 772239 | 0 | None | MERGED | For Supermicro BMCs set enable when changing boot device | 2021-02-15 19:28:58 UTC
OpenStack gerrit | 773656 | 0 | None | MERGED | For Supermicro BMCs set enable when changing boot device | 2021-02-15 19:28:58 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:55:35 UTC
Storyboard | 2008547 | 0 | None | None | None | 2021-01-21 21:15:17 UTC

Description Bob Fournier 2021-01-21 03:04:40 UTC
Description of problem:

When metal3 deploys Supermicro nodes with Redfish and UEFI, the deployment initially goes fine: the nodes run inspection via fast-track, the RHCOS image is installed on the node, and the nodes boot to that image on disk. On the subsequent reboot, however, the node attempts to PXE boot instead of booting to disk, loads the inspector image, and eventually stays looping in IPA. The deployment fails.

The cause appears to be that the Redfish request to boot to disk persistently ('BootSourceOverrideEnabled': 'Continuous'):
2021-01-20 23:54:49.103 1 DEBUG sushy.connector [req-d3cc400a-5752-400d-b77d-f8b2b672de4c - - - - -] HTTP request: PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; headers: {'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Hdd', 'BootSourceOverrideEnabled': 'Continuous'}}


is then followed by a request to boot to disk WITHOUT BootSourceOverrideEnabled set:
2021-01-20 23:54:52.045 1 DEBUG sushy.connector [req-d3cc400a-5752-400d-b77d-f8b2b672de4c - - - - -] HTTP request: PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; headers: {'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Hdd'}}

This is then followed by a request to set the boot mode (to fix https://bugzilla.redhat.com/show_bug.cgi?id=1888072):
2021-01-20 23:54:53.360 1 DEBUG sushy.connector [req-d3cc400a-5752-400d-b77d-f8b2b672de4c - - - - -] HTTP request: PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; headers: {'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideMode': 'UEFI'}}

As a result, before the boot to disk the Redfish settings are:
{
  "BootSourceOverrideEnabled": "Once",
  "BootSourceOverrideMode": "UEFI",
  "BootSourceOverrideTarget": "Hdd",

This works fine for the first boot, but after that the Redfish settings read back are:
  "BootSourceOverrideEnabled": "Disabled",
  "BootSourceOverrideMode": "Legacy",
  "BootSourceOverrideTarget": "None",

So all subsequent boots go to PXE, which is the default. Note that we tried changing the boot order in the BIOS so that HDD is first. We expected that boot order to be used when BootSourceOverrideTarget is not set, but it didn't work.
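
The sequence above can be replayed with a small toy model. This is only a sketch of the behavior described in this report, not Supermicro firmware logic: the class name and its methods are hypothetical, and the rule that a target-only PATCH falls back to "Once" is taken from the observations above.

```python
class ToySupermicroBmc:
    """Toy model of the boot-override behavior described in this report."""

    DEFAULTS = {
        "BootSourceOverrideEnabled": "Disabled",
        "BootSourceOverrideMode": "Legacy",
        "BootSourceOverrideTarget": "None",
    }

    def __init__(self):
        self.boot = dict(self.DEFAULTS)

    def patch(self, body):
        changes = dict(body["Boot"])
        # Assumed quirk: setting the target without an explicit "enabled"
        # value falls back to the default, "Once".
        if ("BootSourceOverrideTarget" in changes
                and "BootSourceOverrideEnabled" not in changes):
            changes["BootSourceOverrideEnabled"] = "Once"
        self.boot.update(changes)

    def reboot(self):
        """Return the device booted from, expiring a one-time override."""
        enabled = self.boot["BootSourceOverrideEnabled"]
        if enabled == "Disabled":
            return "Pxe"  # default boot device on these nodes, per the report
        device = self.boot["BootSourceOverrideTarget"]
        if enabled == "Once":
            self.boot = dict(self.DEFAULTS)
        return device


bmc = ToySupermicroBmc()
# The three PATCH bodies from the log excerpts above, in order.
bmc.patch({"Boot": {"BootSourceOverrideTarget": "Hdd",
                    "BootSourceOverrideEnabled": "Continuous"}})
bmc.patch({"Boot": {"BootSourceOverrideTarget": "Hdd"}})
bmc.patch({"Boot": {"BootSourceOverrideMode": "UEFI"}})
boots = [bmc.reboot(), bmc.reboot()]
print(boots)  # ['Hdd', 'Pxe']
```

In this model the second PATCH silently downgrades the override from "Continuous" to "Once", which is why only the first boot goes to disk.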


Version-Release number of selected component (if applicable):

This is seen on 4.7 build from 12/18

# rpm -qa | grep ironic
python3-ironic-prometheus-exporter-0.0.1-0.20190712090404.f7e9344.el8ost.noarch
python3-ironic-lib-4.4.1-0.20201120173800.e1b5e12.el8.noarch
openstack-ironic-common-16.0.3-0.20201219231205.4ae5375.el8.noarch
openstack-ironic-conductor-16.0.3-0.20201219231205.4ae5375.el8.noarch
openstack-ironic-api-16.0.3-0.20201219231205.4ae5375.el8.noarch

How reproducible:

Happens every time.

Steps to Reproduce:
1. Run ansible install script to deploy the cluster.

Actual results:  Nodes boot to PXE on the reboot that follows the first boot to disk after the image is installed.


Expected results: Nodes boot to disk every time after image is installed.


Additional info:
It's likely the issue is here, in the ironic Redfish management code that calls sushy - 
https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/management.py#L96

BootSourceOverrideEnabled won't be sent if it's being set to the same value as the current setting.  However, Supermicro requires it to be set each time BootSourceOverrideTarget is set.
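
A minimal sketch of that decision, to show why the second PATCH in the logs lacks BootSourceOverrideEnabled. The function name and signature are hypothetical and this is not the ironic source; the "Disabled" starting value for the first call is also an assumption.

```python
def build_boot_patch(target, desired_enabled, current_enabled):
    """Build a PATCH body, omitting 'enabled' when it appears unchanged."""
    boot = {"BootSourceOverrideTarget": target}
    # Only send BootSourceOverrideEnabled when it differs from what the
    # BMC currently reports (the behavior other vendors expect).
    if desired_enabled != current_enabled:
        boot["BootSourceOverrideEnabled"] = desired_enabled
    return {"Boot": boot}


# First request: the BMC does not yet report "Continuous" (assumed).
first = build_boot_patch("Hdd", "Continuous", "Disabled")
# Second request: the BMC now reports "Continuous", so it is left out.
second = build_boot_patch("Hdd", "Continuous", "Continuous")
print(first)
print(second)
```

The second body matches the target-only PATCH seen in the description, which a Supermicro BMC treats as a request for a one-time override.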

Comment 1 Bob Fournier 2021-01-21 13:28:14 UTC
This is related to https://review.opendev.org/c/openstack/ironic/+/731644 which references http://lists.openstack.org/pipermail/openstack-discuss/2020-April/014543.html and provides context on how BootSourceOverrideEnabled should be used.

Comment 2 Bob Fournier 2021-01-21 21:16:46 UTC
We tried setting pxe.enable_netboot_fallback in ironic.conf but that had no effect and the Supermicro node still booted to PXE after an external reboot.

We then changed this line https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/management.py#L96 to not check whether BootSourceOverrideEnabled is already set, and it works fine. The 2nd PATCH request matches the 1st PATCH request and sets Continuous.  After RHCOS reboots the node (outside of Ironic) the node boots to disk.
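
One way to express a vendor-specific version of that change is sketched below. It is an illustration under assumptions, not the merged patch: the function name, the `vendor` parameter, and the substring match on "supermicro" are all hypothetical.

```python
def build_boot_patch_vendor_aware(target, desired_enabled, current_enabled,
                                  vendor=None):
    """Like the unchanged-value check, but always send 'enabled' on Supermicro."""
    boot = {"BootSourceOverrideTarget": target}
    # Hypothetical vendor detection; how the vendor is really determined
    # is not stated in this report.
    is_supermicro = bool(vendor) and "supermicro" in vendor.lower()
    if is_supermicro or desired_enabled != current_enabled:
        boot["BootSourceOverrideEnabled"] = desired_enabled
    return {"Boot": boot}


supermicro_body = build_boot_patch_vendor_aware(
    "Hdd", "Continuous", "Continuous", vendor="Supermicro")
other_body = build_boot_patch_vendor_aware(
    "Hdd", "Continuous", "Continuous", vendor="ExampleVendor")
print(supermicro_body)
print(other_body)
```

Keeping the unchanged-value check for other vendors preserves the behavior they require, while Supermicro nodes get BootSourceOverrideEnabled on every request that sets the target.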

Comment 3 Bob Fournier 2021-02-03 13:05:07 UTC
Fix has merged and is available in brew https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1486293

Comment 5 Murali Krishnasamy 2021-02-16 14:55:55 UTC
Got a chance to test this patch on Supermicro 1029Ps using 4.7.0-0.nightly-2021-02-04-031352 and it is verified.

The logs show the HDD boot being set with the "Continuous" flag:

2021-02-15 21:22:01.566 1 DEBUG sushy.connector [req-d8db6171-f384-48c6-b6ac-afe873710e9b - - - - -] HTTP request: PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; headers: {'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Hdd', 'BootSourceOverrideEnabled': 'Continuous'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:102
2021-02-15 21:22:03.056 1 DEBUG sushy.connector [req-d8db6171-f384-48c6-b6ac-afe873710e9b - - - - -] HTTP request: PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; headers: {'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideMode': 'UEFI'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:102

and on the second request as well:
2021-02-15 21:22:04.501 1 DEBUG sushy.connector [req-d8db6171-f384-48c6-b6ac-afe873710e9b - - - - -] HTTP request: PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; headers: {'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Hdd', 'BootSourceOverrideEnabled': 'Continuous'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:102
2021-02-15 21:22:05.959 1 DEBUG sushy.connector [req-d8db6171-f384-48c6-b6ac-afe873710e9b - - - - -] HTTP request: PATCH https://mgmt-f06-h15-000-1029p.rdu2.scalelab.redhat.com/redfish/v1/Systems/1; headers: {'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideMode': 'UEFI'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:102
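
When checking many such lines, the PATCH body can be pulled out of a sushy.connector DEBUG line with a small helper. This is a sketch that assumes the line layout shown above (a `body: ` field, optionally followed by `; blocking:`); the helper name and the shortened sample line with the `bmc.example` host are made up for illustration.

```python
import ast


def extract_patch_body(log_line):
    """Return the request body dict from a sushy.connector log line, or None."""
    marker = "body: "
    start = log_line.find(marker)
    if start == -1:
        return None
    rest = log_line[start + len(marker):]
    # Newer log lines append "; blocking: ..." after the body; older ones
    # end right after it.
    body_text = rest.split("; blocking:", 1)[0].strip()
    # The body is logged as a Python dict repr, so literal_eval can read it.
    return ast.literal_eval(body_text)


sample = ("DEBUG sushy.connector HTTP request: PATCH "
          "https://bmc.example/redfish/v1/Systems/1; "
          "headers: {'OData-Version': '4.0'}; "
          "body: {'Boot': {'BootSourceOverrideTarget': 'Hdd', "
          "'BootSourceOverrideEnabled': 'Continuous'}}; "
          "blocking: False; timeout: 60")
body = extract_patch_body(sample)
print(body["Boot"].get("BootSourceOverrideEnabled"))  # Continuous
```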

The Redfish boot config shows the same settings, and the node booted properly:

{
  "BootSourceOverrideEnabled": "Continuous",
  "BootSourceOverrideMode": "UEFI",
  "BootSourceOverrideTarget": "Hdd",
  "BootSourceOverrideTarget@Redfish.AllowableValues": [
    "None",
    "Pxe",
    "Floppy",
    "Cd",
    "Usb",
    "Hdd",
    "BiosSetup"
  ]
}
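
The same check can be scripted against the Boot settings once they have been fetched. A minimal sketch, assuming the settings are already available as a dict; the function name is hypothetical and fetching the data from the BMC is out of scope here.

```python
def is_persistent_disk_boot(boot):
    """True if the Boot settings describe a persistent boot-to-disk override."""
    return (boot.get("BootSourceOverrideEnabled") == "Continuous"
            and boot.get("BootSourceOverrideTarget") == "Hdd")


# Settings reported before the fix (from the description) and after it.
before_fix = {
    "BootSourceOverrideEnabled": "Once",
    "BootSourceOverrideMode": "UEFI",
    "BootSourceOverrideTarget": "Hdd",
}
after_fix = {
    "BootSourceOverrideEnabled": "Continuous",
    "BootSourceOverrideMode": "UEFI",
    "BootSourceOverrideTarget": "Hdd",
}
print(is_persistent_disk_boot(before_fix))  # False
print(is_persistent_disk_boot(after_fix))   # True
```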

Thanks Bob for working on this.

Comment 8 errata-xmlrpc 2021-02-24 15:55:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

