Bug 1917751 - vmedia/redfish deployment fails with `Error: Setting boot mode to uefi`
Summary: vmedia/redfish deployment fails with `Error: Setting boot mode to uefi`
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.6
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
: 4.8.0
Assignee: Dmitry Tantsur
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks: dit
TreeView+ depends on / blocked
 
Reported: 2021-01-19 10:25 UTC by Yuval Kashtan
Modified: 2021-02-05 20:40 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-05 20:40:02 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
install-config.yaml and conductor.log from cluster (2.21 MB, application/gzip)
2021-01-19 10:25 UTC, Yuval Kashtan
no flags Details
pending job (90.32 KB, image/png)
2021-01-19 10:27 UTC, Yuval Kashtan
no flags Details
pending bios config change (113.04 KB, image/png)
2021-01-19 10:28 UTC, Yuval Kashtan
no flags Details
ironic and conductor logs (251.22 KB, application/gzip)
2021-01-25 11:38 UTC, Yuval Kashtan
no flags Details

Description Yuval Kashtan 2021-01-19 10:25:35 UTC
Created attachment 1748684 [details]
install-config.yaml and conductor.log from cluster

Description of problem:
when trying to deploy a new baremetal, where one of the worker nodes is being deployed with redfish,
that node deployment fails

I think that the relevant part from conductor log is:
2021-01-19 06:45:37.443 1 ERROR ironic.conductor.manager [req-9c2d2a82-059a-450b-84a2-89e771a0844d ironic-user - - - -] Failed to inspect node ba246ba2-3820-4a50-a14d-8409e598bc03: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Setting boot mode to uefi failed for node ba246ba2-3820-4a50-a14d-8409e598bc03. Error: HTTP PATCH https://10.46.61.145/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Pending configuration values are already committed, unable to perform another set operation. Extended information: [{'Message': 'Pending configuration values are already committed, unable to perform another set operation.', 'MessageArgs': ['BootSourceOverrideMode'], 'MessageArgs': 1, 'MessageId': 'IDRAC.2.1.SYS011', 'RelatedProperties': ['BootSourceOverrideMode'], 'RelatedProperties': 1, 'Resolution': 'Wait for the scheduled job to complete or delete the configuration jobs before attempting more set attribute operations.', 'Severity': 'Warning'}]: ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Setting boot mode to uefi failed for node ba246ba2-3820-4a50-a14d-8409e598bc03. Error: HTTP PATCH https://10.46.61.145/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Pending configuration values are already committed, unable to perform another set operation. Extended information: [{'Message': 'Pending configuration values are already committed, unable to perform another set operation.', 'MessageArgs': ['BootSourceOverrideMode'], 'MessageArgs': 1, 'MessageId': 'IDRAC.2.1.SYS011', 'RelatedProperties': ['BootSourceOverrideMode'], 'RelatedProperties': 1, 'Resolution': 'Wait for the scheduled job to complete or delete the configuration jobs before attempting more set attribute operations.', 'Severity': 'Warning'}]

the node is Dell r640, firmware version 4.22.00.00

Version-Release number of selected component (if applicable):
4.6

attached my install-config.yaml and conductor log from the cluster

Comment 1 Yuval Kashtan 2021-01-19 10:27:42 UTC
Created attachment 1748687 [details]
pending job

pending job from BMC

Comment 2 Yuval Kashtan 2021-01-19 10:28:28 UTC
Created attachment 1748688 [details]
pending bios config change

Comment 3 Yuval Kashtan 2021-01-19 10:29:20 UTC
I think it's because of the pending config change job.
nothing triggers applying it.

Comment 4 Yuval Kashtan 2021-01-19 11:19:42 UTC
possible a hardware malfunction.
seems server refuses to power-on, entirely.

I'm investigating.

will provide more info once this is resolved, or if I can recreate issue on a different server (same, r640, fw 4.22.0.0)

Comment 5 Dmitry Tantsur 2021-01-19 17:07:34 UTC
We're looking into resetting iDRAC early in the deployment process, but that's a large chunk of work that will likely be delivered as a feature for 4.8.

The only thing we can do for 4.6 and 4.7 is to document the iDRAC reset. What do you think?

Comment 6 Dmitry Tantsur 2021-01-19 17:44:29 UTC
Oh, actually, you're using redfish:// for Dell nodes, this won't work. You need idrac:// or idrac-virtual-media://

Could you try that?

Comment 7 Dmitry Tantsur 2021-01-19 17:46:53 UTC
Hmm, actually no, redfish:// should be fine (redfish-virtual-media:// would not be). So yeah, it probably boils down to resetting the iDRAC. Could you do it via the web UI and retry?

Comment 9 Yuval Kashtan 2021-01-25 11:37:41 UTC
ok, so I rerun the whole thing this morning:

BMH:
```
# oc get bmh -A
NAMESPACE               NAME               STATUS   PROVISIONING STATUS      CONSUMER                       BMC                                                           HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdt16-master-0   OK       externally provisioned   cnfdt16-c8mqf-master-0         ipmi://10.46.55.219:6230                                                         true     
openshift-machine-api   cnfdt16-master-1   OK       externally provisioned   cnfdt16-c8mqf-master-1         ipmi://10.46.55.219:6231                                                         true     
openshift-machine-api   cnfdt16-master-2   OK       externally provisioned   cnfdt16-c8mqf-master-2         ipmi://10.46.55.219:6232                                                         true     
openshift-machine-api   cnfdt16-worker-0   OK       provisioned              cnfdt16-c8mqf-worker-0-hhf8h   ipmi://10.46.55.219:6240                                      unknown            true     
openshift-machine-api   cnfdt16-worker-1   error    inspecting                                              redfish://10.46.61.143/redfish/v1/Systems/System.Embedded.1                      true     Introspection timeout
```

I'll attach conductor and inspector logs

iDRAC 4.22.00.00
and I was running a clean installation

I can attach the install-config too if needed, but the important part:
```
    - name: cnfdt16-worker-1
      role: worker
      bmc:
        address: redfish://<hidden>/redfish/v1/Systems/System.Embedded.1
        disableCertificateVerification: True
        username: <hidden>
        password: <hidden>
      bootMACAddress: <hidden>
      hardwareProfile: unknown
```

Comment 10 Yuval Kashtan 2021-01-25 11:38:22 UTC
Created attachment 1750489 [details]
ironic and conductor logs

Comment 11 Tomas Sedovic 2021-01-25 14:38:55 UTC
Yuval, have you tried resetting the iDRAC job queue as Dmitry asked too? I hadn't seen that in the report.

Comment 12 Yuval Kashtan 2021-01-25 20:04:32 UTC
yes, this system was pre-configured for UEFI 
and there are no pending jobs in the queue

Comment 15 Dmitry Tantsur 2021-01-28 12:10:59 UTC
Hi!

Did you intend to use idrac-virtualmedia instead of redfish? You bug title says "vmedia", but you're using redfish with network boot, not virtual media.

If the intention was to use redfish, you can work around the issue by using IPMI.

Comment 17 Dmitry Tantsur 2021-01-28 12:25:33 UTC
Additionally, your second set of logs don't show any Redfish issues, just a generic introspection timeout, that may be caused by networking misconfiguration. Could you check what is happening on the machine by logging into its iDRAC and checking its virtual console?

Your nodes have the same IPMI address, are they virtual?

Comment 18 Yuval Kashtan 2021-01-28 12:37:17 UTC
the uefi/redfish node is physical

the rest are virtual (but uses ipmi)

iDRAC shows machine is turned off
and there are no pending jobs

bios is pre-set to uefi boot.

Comment 19 Dmitry Tantsur 2021-01-28 13:54:01 UTC
Did you check after the failure or in the process? I'm afraid that something may power off the machine after the timeout is hit.

Do you still see the initial failure somewhere or only the timeout?

Comment 20 Yuval Kashtan 2021-01-28 15:21:27 UTC
I did not check during the process
just before and after

BTW: that same node deployed previously (over bios mode, with IPMI) just fine.

I'll try a manual rerun, by deleteing and recreating the bmh and secret, and will try to monitor more closely

Comment 23 Dmitry Tantsur 2021-02-04 11:12:34 UTC
The node is going legacy (not UEFI) boot, which is weird (we default to UEFI). You mentioned using BIOS mode previously, did you set it up this way? Was the same install-config used as attached previously?

Could you also paste the dnsmasq and up-to-date ironic-conductor logs?

Comment 25 Dmitry Tantsur 2021-02-04 16:55:19 UTC
The IP range does not even match. Are you sure your node is booting from the right NIC? Unfortunately, the order can be only specified by entering the boot menu and changing the NIC settings.

Comment 26 Yuval Kashtan 2021-02-05 20:40:02 UTC
so apparently it wasnt booting from the right NIC
because UEFI uses a different set of configuration
so while it was configured to PXE boot from the right NIC in bios mode,
for UEFI it was configured on an incorrect NIC.

fixing that, fixed the issue.

BTW: NIC itself was configured properly

Thanks everyone for helping
and especially Dmitry and Bob


Note You need to log in before you can comment on or make changes to this bug.