Bug 1917751
Summary: | vmedia/redfish deployment fails with `Error: Setting boot mode to uefi` | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Yuval Kashtan <ykashtan> |
Component: | Bare Metal Hardware Provisioning | Assignee: | Dmitry Tantsur <dtantsur> |
Bare Metal Hardware Provisioning sub component: | ironic | QA Contact: | Amit Ugol <augol> |
Status: | CLOSED NOTABUG | Docs Contact: | |
Severity: | high | | |
Priority: | medium | CC: | athomas, bfournie, lshilin, pablo.iranzo, sdasu, tsedovic |
Version: | 4.6 | Keywords: | Triaged |
Target Milestone: | --- | | |
Target Release: | 4.8.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2021-02-05 20:40:02 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1831748 | | |
Attachments: | | | |
Description
Yuval Kashtan
2021-01-19 10:25:35 UTC
Created attachment 1748687 [details]
pending job
pending job from BMC
Created attachment 1748688 [details]
pending bios config change
I think it's because of the pending config change job. Nothing triggers applying it.

Possibly a hardware malfunction. The server seems to refuse to power on entirely. I'm investigating and will provide more info once this is resolved, or if I can recreate the issue on a different server (same, r640, fw 4.22.0.0).

We're looking into resetting iDRAC early in the deployment process, but that's a large chunk of work that will likely be delivered as a feature for 4.8. The only thing we can do for 4.6 and 4.7 is to document the iDRAC reset. What do you think?

Oh, actually, you're using redfish:// for Dell nodes, this won't work. You need idrac:// or idrac-virtual-media://. Could you try that?

Hmm, actually no, redfish:// should be fine (redfish-virtual-media:// would not be). So yeah, it probably boils down to resetting the iDRAC. Could you do it via the web UI and retry?

OK, so I reran the whole thing this morning.

BMH:

```
# oc get bmh -A
NAMESPACE               NAME               STATUS   PROVISIONING STATUS      CONSUMER                       BMC                                                           HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   cnfdt16-master-0   OK       externally provisioned   cnfdt16-c8mqf-master-0         ipmi://10.46.55.219:6230                                                         true
openshift-machine-api   cnfdt16-master-1   OK       externally provisioned   cnfdt16-c8mqf-master-1         ipmi://10.46.55.219:6231                                                         true
openshift-machine-api   cnfdt16-master-2   OK       externally provisioned   cnfdt16-c8mqf-master-2         ipmi://10.46.55.219:6232                                                         true
openshift-machine-api   cnfdt16-worker-0   OK       provisioned              cnfdt16-c8mqf-worker-0-hhf8h   ipmi://10.46.55.219:6240                                      unknown            true
openshift-machine-api   cnfdt16-worker-1   error    inspecting                                              redfish://10.46.61.143/redfish/v1/Systems/System.Embedded.1                      true     Introspection timeout
```

I'll attach the conductor and inspector logs.

iDRAC is at 4.22.00.00 and I was running a clean installation.

I can attach the install-config too if needed, but here is the important part:

```
- name: cnfdt16-worker-1
  role: worker
  bmc:
    address: redfish://<hidden>/redfish/v1/Systems/System.Embedded.1
    disableCertificateVerification: True
    username: <hidden>
    password: <hidden>
  bootMACAddress: <hidden>
  hardwareProfile: unknown
```

Created attachment 1750489 [details]
ironic and conductor logs
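Since part of the discussion above turns on which BMC address scheme each host uses and which host is stuck, a small helper can make the same listing easier to scan. This is a minimal sketch, not part of the original report: it assumes the listing was saved with `oc get bmh -A -o json`, and that the BareMetalHost fields `spec.bmc.address`, `status.provisioning.state`, and `status.errorMessage` are populated as in the Metal3 CRD. The function name `summarize_hosts` and the rule "a scheme containing `virtual` means virtual media" are illustrative assumptions.

```python
from urllib.parse import urlsplit


def summarize_hosts(bmh_list):
    """Summarize each BareMetalHost in a parsed `oc get bmh -o json` list.

    Hypothetical helper: field names are assumed from the Metal3
    BareMetalHost CRD and may be absent, so every lookup has a default.
    """
    summaries = []
    for item in bmh_list.get("items", []):
        address = item.get("spec", {}).get("bmc", {}).get("address", "")
        # The URL scheme (ipmi, redfish, idrac-virtualmedia, ...) selects the driver.
        scheme = urlsplit(address).scheme
        status = item.get("status", {})
        summaries.append({
            "name": item.get("metadata", {}).get("name", ""),
            "state": status.get("provisioning", {}).get("state", ""),
            "scheme": scheme,
            # Assumption: virtual media schemes contain the word "virtual";
            # everything else is treated as network (PXE) boot.
            "virtual_media": "virtual" in scheme,
            "error": status.get("errorMessage", ""),
        })
    return summaries
```

To use it, load the saved JSON with `json.load` and print the entries whose `error` field is non-empty. For the listing above, `cnfdt16-worker-1` would show scheme `redfish` with `virtual_media` false, i.e. network boot rather than virtual media.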
Yuval, have you tried resetting the iDRAC job queue as Dmitry asked too? I hadn't seen that in the report.

Yes, this system was pre-configured for UEFI and there are no pending jobs in the queue.

Hi! Did you intend to use idrac-virtualmedia instead of redfish? Your bug title says "vmedia", but you're using redfish with network boot, not virtual media. If the intention was to use redfish, you can work around the issue by using IPMI. Additionally, your second set of logs doesn't show any Redfish issues, just a generic introspection timeout that may be caused by a networking misconfiguration. Could you check what is happening on the machine by logging into its iDRAC and checking its virtual console?

Your nodes have the same IPMI address, are they virtual?

The uefi/redfish node is physical; the rest are virtual (but use ipmi). iDRAC shows the machine is turned off and there are no pending jobs. The BIOS is pre-set to UEFI boot.

Did you check after the failure or in the process? I'm afraid that something may power off the machine after the timeout is hit. Do you still see the initial failure somewhere or only the timeout?

I did not check during the process, just before and after. BTW: that same node deployed previously (in BIOS mode, with IPMI) just fine. I'll try a manual rerun, by deleting and recreating the bmh and secret, and will try to monitor more closely.

The node is going into legacy (not UEFI) boot, which is weird (we default to UEFI). You mentioned using BIOS mode previously, did you set it up this way? Was the same install-config used as attached previously? Could you also paste the dnsmasq and up-to-date ironic-conductor logs?

The IP range does not even match. Are you sure your node is booting from the right NIC? Unfortunately, the order can only be specified by entering the boot menu and changing the NIC settings.
So apparently it wasn't booting from the right NIC, because UEFI uses a different set of configuration: while it was configured to PXE boot from the right NIC in BIOS mode, for UEFI it was configured on an incorrect NIC. Fixing that fixed the issue. BTW: the NIC itself was configured properly.

Thanks everyone for helping, and especially Dmitry and Bob.
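For anyone chasing a similar legacy-versus-UEFI mismatch, one quick check is to ask the BMC what it reports for the power state and boot mode. The following is a rough sketch under stated assumptions, not something used in this bug: it reads the standard Redfish ComputerSystem properties `PowerState`, `Boot.BootSourceOverrideMode`, and `Boot.BootSourceOverrideTarget` from the system URL (for example the `/redfish/v1/Systems/System.Embedded.1` path shown earlier). The function names and the basic-auth approach are illustrative choices, and a given BMC may omit any of these properties. It does not show which NIC is selected for PXE in each boot mode, which is what actually had to be corrected here.

```python
import base64
import json
import ssl
import urllib.request


def fetch_system(url, username, password, verify_tls=True):
    """GET a Redfish ComputerSystem resource and return it as a dict.

    Hypothetical helper: uses HTTP basic auth; pass verify_tls=False only
    when the BMC certificate is knowingly untrusted.
    """
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })
    context = ssl.create_default_context()
    if not verify_tls:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return json.load(response)


def boot_summary(system):
    """Extract power state and boot override settings; missing keys become None."""
    boot = system.get("Boot", {})
    return {
        "power_state": system.get("PowerState"),
        "boot_mode": boot.get("BootSourceOverrideMode"),
        "boot_target": boot.get("BootSourceOverrideTarget"),
    }
```

Calling `boot_summary(fetch_system(url, user, password, verify_tls=False))` before and during a deployment attempt would show whether the BMC reports `Legacy` or `UEFI` at each point.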