Created attachment 1748684 [details]
install-config.yaml and conductor.log from cluster

Description of problem:

When trying to deploy a new baremetal cluster where one of the worker nodes is deployed with redfish, that node's deployment fails.

I think the relevant part of the conductor log is:

2021-01-19 06:45:37.443 1 ERROR ironic.conductor.manager [req-9c2d2a82-059a-450b-84a2-89e771a0844d ironic-user - - - -] Failed to inspect node ba246ba2-3820-4a50-a14d-8409e598bc03: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Setting boot mode to uefi failed for node ba246ba2-3820-4a50-a14d-8409e598bc03. Error: HTTP PATCH https://10.46.61.145/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Pending configuration values are already committed, unable to perform another set operation. Extended information: [{'Message': 'Pending configuration values are already committed, unable to perform another set operation.', 'MessageArgs': ['BootSourceOverrideMode'], 'MessageArgs': 1, 'MessageId': 'IDRAC.2.1.SYS011', 'RelatedProperties': ['BootSourceOverrideMode'], 'RelatedProperties': 1, 'Resolution': 'Wait for the scheduled job to complete or delete the configuration jobs before attempting more set attribute operations.', 'Severity': 'Warning'}]: ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Setting boot mode to uefi failed for node ba246ba2-3820-4a50-a14d-8409e598bc03. Error: HTTP PATCH https://10.46.61.145/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Pending configuration values are already committed, unable to perform another set operation. Extended information: [{'Message': 'Pending configuration values are already committed, unable to perform another set operation.', 'MessageArgs': ['BootSourceOverrideMode'], 'MessageArgs': 1, 'MessageId': 'IDRAC.2.1.SYS011', 'RelatedProperties': ['BootSourceOverrideMode'], 'RelatedProperties': 1, 'Resolution': 'Wait for the scheduled job to complete or delete the configuration jobs before attempting more set attribute operations.', 'Severity': 'Warning'}]

The node is a Dell R640, firmware version 4.22.00.00.

Version-Release number of selected component (if applicable):
4.6

Attached my install-config.yaml and the conductor log from the cluster.
Created attachment 1748687 [details]
pending job

pending job from BMC
Created attachment 1748688 [details]
pending bios config change
I think it's because of the pending config change job. Nothing triggers applying it.
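For reference, the pending job queue can also be inspected and cleared from the command line rather than the BMC web UI. This is only a sketch, assuming remote racadm is installed; the IP and credentials are placeholders, and the Redfish jobs path can differ between iDRAC firmware versions:

```
# List the iDRAC job queue for this node (BMC address and credentials are placeholders).
racadm -r <idrac-ip> -u <user> -p <password> jobqueue view

# Delete everything from the job queue; JID_CLEARALL clears all pending jobs.
racadm -r <idrac-ip> -u <user> -p <password> jobqueue delete -i JID_CLEARALL

# Roughly the same information over Redfish (the collection path may vary by firmware).
curl -k -u <user>:<password> https://<idrac-ip>/redfish/v1/Managers/iDRAC.Embedded.1/Jobs
```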
Possibly a hardware malfunction; the server seems to refuse to power on entirely. I'm investigating and will provide more info once this is resolved, or if I can recreate the issue on a different server (same model, R640, fw 4.22.0.0).
We're looking into resetting iDRAC early in the deployment process, but that's a large chunk of work that will likely be delivered as a feature for 4.8. The only thing we can do for 4.6 and 4.7 is to document the iDRAC reset. What do you think?
Oh, actually, you're using redfish:// for Dell nodes; this won't work. You need idrac:// or idrac-virtualmedia://. Could you try that?
Hmm, actually no, redfish:// should be fine (redfish-virtualmedia:// would not be). So yeah, it probably boils down to resetting the iDRAC. Could you do it via the web UI and retry?
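If the web UI is inconvenient, a soft iDRAC reset can also be triggered with remote racadm. A sketch only, with placeholder address and credentials:

```
# Soft-reset the iDRAC (same effect as the "Reset iDRAC" action in the web UI).
# The BMC will be unreachable for a few minutes while it restarts.
racadm -r <idrac-ip> -u <user> -p <password> racreset soft
```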
OK, so I reran the whole thing this morning.

BMH:
```
# oc get bmh -A
NAMESPACE               NAME               STATUS   PROVISIONING STATUS      CONSUMER                       BMC                                                            HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdt16-master-0   OK       externally provisioned   cnfdt16-c8mqf-master-0         ipmi://10.46.55.219:6230                                                          true
openshift-machine-api   cnfdt16-master-1   OK       externally provisioned   cnfdt16-c8mqf-master-1         ipmi://10.46.55.219:6231                                                          true
openshift-machine-api   cnfdt16-master-2   OK       externally provisioned   cnfdt16-c8mqf-master-2         ipmi://10.46.55.219:6232                                                          true
openshift-machine-api   cnfdt16-worker-0   OK       provisioned              cnfdt16-c8mqf-worker-0-hhf8h   ipmi://10.46.55.219:6240                                       unknown            true
openshift-machine-api   cnfdt16-worker-1   error    inspecting                                              redfish://10.46.61.143/redfish/v1/Systems/System.Embedded.1                       true     Introspection timeout
```

I'll attach the conductor and inspector logs. This is iDRAC 4.22.00.00 and I was running a clean installation. I can attach the install-config too if needed, but the important part is:
```
    - name: cnfdt16-worker-1
      role: worker
      bmc:
        address: redfish://<hidden>/redfish/v1/Systems/System.Embedded.1
        disableCertificateVerification: True
        username: <hidden>
        password: <hidden>
      bootMACAddress: <hidden>
      hardwareProfile: unknown
```
Created attachment 1750489 [details]
ironic and conductor logs
Yuval, have you also tried resetting the iDRAC and clearing its job queue, as Dmitry asked? I hadn't seen that in the report.
Yes, this system was pre-configured for UEFI and there are no pending jobs in the queue.
Hi! Did you intend to use idrac-virtualmedia instead of redfish? Your bug title says "vmedia", but you're using redfish with network boot, not virtual media. If the intention was to use redfish, you can work around the issue by using IPMI.
Additionally, your second set of logs doesn't show any Redfish issues, just a generic introspection timeout, which may be caused by a networking misconfiguration. Could you check what is happening on the machine by logging into its iDRAC and checking its virtual console? Your nodes have the same IPMI address; are they virtual?
The UEFI/redfish node is physical; the rest are virtual (but use IPMI). The iDRAC shows the machine is turned off and there are no pending jobs. The BIOS is pre-set to UEFI boot.
Did you check after the failure or in the process? I'm afraid that something may power off the machine after the timeout is hit. Do you still see the initial failure somewhere or only the timeout?
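One way to watch this during inspection rather than only before/after is to poll the power state over Redfish. A rough sketch with placeholder credentials; PowerState is a standard Redfish property of the System resource, and jq is only used for readability:

```
# Poll the node's Redfish power state every 30 seconds while inspection runs.
while true; do
  curl -sk -u <user>:<password> \
    https://<idrac-ip>/redfish/v1/Systems/System.Embedded.1 | jq -r .PowerState
  sleep 30
done
```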
I did not check during the process, just before and after. BTW: that same node previously deployed just fine (in BIOS mode, with IPMI). I'll try a manual rerun by deleting and recreating the BMH and secret, and will try to monitor more closely.
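Roughly what I plan to run for the manual rerun; the BMH name comes from the output above, while the secret name and the manifest file name are assumptions on my side:

```
# Delete the failed host and its BMC credentials secret
# (secret name assumed to follow the <host>-bmc-secret pattern).
oc -n openshift-machine-api delete bmh cnfdt16-worker-1
oc -n openshift-machine-api delete secret cnfdt16-worker-1-bmc-secret

# Recreate both from a saved manifest and watch inspection restart
# (cnfdt16-worker-1-bmh.yaml is a local file containing the BMH and secret).
oc -n openshift-machine-api apply -f cnfdt16-worker-1-bmh.yaml
oc -n openshift-machine-api get bmh cnfdt16-worker-1 -w
```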
The node is doing a legacy (not UEFI) boot, which is weird (we default to UEFI). You mentioned using BIOS mode previously; did you set it up this way? Was the same install-config used as attached previously? Could you also paste the dnsmasq and up-to-date ironic-conductor logs?
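In case it helps, this is roughly how both can be pulled from the metal3 pod; the container names below are assumptions and may differ between releases, so list them first:

```
# Find the metal3 pod and list its containers (names vary between releases).
POD=$(oc -n openshift-machine-api get pods -o name | grep metal3 | head -n 1)
oc -n openshift-machine-api get "$POD" -o jsonpath='{.spec.containers[*].name}'; echo

# Dump the conductor and dnsmasq containers (container names assumed here).
oc -n openshift-machine-api logs "$POD" -c metal3-ironic-conductor > conductor.log
oc -n openshift-machine-api logs "$POD" -c metal3-dnsmasq > dnsmasq.log
```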
The IP range does not even match. Are you sure your node is booting from the right NIC? Unfortunately, the order can only be specified by entering the boot menu and changing the NIC settings.
So apparently it wasn't booting from the right NIC: UEFI uses a separate set of boot configuration, so while the node was configured to PXE boot from the right NIC in BIOS mode, for UEFI an incorrect NIC was configured. Fixing that fixed the issue. BTW: the NIC itself was configured properly.

Thanks everyone for helping, especially Dmitry and Bob.