Description of problem:

Using Assisted Service to install SNO on a baremetal server. The InfraEnv shows the ISO was generated, and the BMH attaches the ISO via iDRAC and boots the server. The agent never connects back to Assisted Service, and the conditions of the AgentClusterInstall show:

conditions:
- lastProbeTime: "2021-06-28T21:15:56Z"
  lastTransitionTime: "2021-06-28T21:15:56Z"
  message: The Spec has been successfully applied
  reason: SyncOK
  status: "True"
  type: SpecSynced
- lastProbeTime: "2021-06-28T21:15:56Z"
  lastTransitionTime: "2021-06-28T21:15:56Z"
  message: The cluster is not ready to begin the installation
  reason: ClusterNotReady
  status: "False"
  type: RequirementsMet
- lastProbeTime: "2021-06-28T21:15:56Z"
  lastTransitionTime: "2021-06-28T21:15:56Z"
  message: 'The cluster''s validations are failing: Single-node clusters must have a single master node and no workers.'
  reason: ValidationsFailing
  status: "False"
  type: Validated
- lastProbeTime: "2021-06-28T21:15:56Z"
  lastTransitionTime: "2021-06-28T21:15:56Z"
  message: The installation has not yet started
  reason: InstallationNotStarted
  status: "False"
  type: Completed
- lastProbeTime: "2021-06-28T21:15:56Z"
  lastTransitionTime: "2021-06-28T21:15:56Z"
  message: The installation has not failed
  reason: InstallationNotFailed
  status: "False"
  type: Failed
- lastProbeTime: "2021-06-28T21:15:56Z"
  lastTransitionTime: "2021-06-28T21:15:56Z"
  message: The installation is waiting to start or in progress
  reason: InstallationNotStopped
  status: "False"
  type: Stopped

InfraEnv reports the ISO was generated and is available at URL:
https://assisted-service-open-cluster-management.apps.ran-acm02.ptp.lab.eng.bos.redhat.com/api/assisted-install/v1/clusters/0bcb67f7-67bd-4b10-beec-f1762d46a97f/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiMGJjYjY3ZjctNjdiZC00YjEwLWJlZWMtZjE3NjJkNDZhOTdmIn0.kIPi2W45KYbXZ8eIW4pw5vY4Swx2e4IzF2PkoIp-tFF6o00eM-ETlj_rk-XUsDa_fr_5yKUtJXOH5bRS3-myvw

BMH attaches the ISO to the node via iDRAC:
http://10.16.231.227:6180/redfish/boot-0f2b3367-f922-4e87-aea4-cb07a97991ec.iso?filename=tmp3ujubhn9.iso

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Always (against a specific baremetal server)

Steps to Reproduce:
1. Delete the namespace containing the installation CRs (ClusterDeployment, InfraEnv, BMH, AgentClusterInstall, and NMStateConfig)
2. Create the namespace and installation CRs
3. Check the state of the AgentClusterInstall (see the sketch under Additional info), watch the console for a reboot

Actual results:
Installation stuck.

Expected results:
Installation starts.

Additional info:
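For reference, a minimal sketch of step 3 above -- reading the AgentClusterInstall conditions with the Kubernetes Python client. The namespace and resource name are placeholders, and it assumes the extensions.hive.openshift.io/v1beta1 API group served by the assisted-service operator:

# Minimal sketch: dump AgentClusterInstall conditions from the hub cluster.
# Namespace and name below are placeholders, not taken from this bug.
from kubernetes import client, config

config.load_kube_config()          # kubeconfig pointing at the hub cluster
api = client.CustomObjectsApi()

aci = api.get_namespaced_custom_object(
    group="extensions.hive.openshift.io",
    version="v1beta1",
    namespace="mycluster",              # placeholder namespace
    plural="agentclusterinstalls",
    name="mycluster",                   # placeholder AgentClusterInstall name
)
for cond in aci.get("status", {}).get("conditions", []):
    print(f"{cond['type']}: {cond['status']} ({cond['reason']}) - {cond['message']}")

The same conditions can also be read with plain "oc get agentclusterinstall -o yaml" in the cluster namespace.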
Root cause was found to be an issue with the baremetal server failing to actually mount the ISO (despite success in attaching it). When the server was booting (as initiated by the BMH), a message was seen briefly on the screen reporting failure to boot from the ISO:

Booting from Virtual Optical Drive: Red Hat Enterprise Linux
Boot Failed: Virtual Optical Drive: Red Hat Enterprise Linux
Booting from Virtual Optical Drive: Red Hat Enterprise Linux
Boot Failed: Virtual Optical Drive: Red Hat Enterprise Linux
Booting from RAID Controller in Slot 5: Red Hat Enterprise Linux

The boot eventually fell back to the previously installed image on disk. Attempts to manually mount the ISO (via iDRAC), as well as various other "known good" (non-OpenShift) ISO images, all resulted in the same behavior -- failure to boot from the ISO.

An issue with the BIOS/iDRAC version was identified which can prevent the ISO from correctly mounting. The installed BIOS version was 2.8.2. BIOS version 2.11.2 contains a fix: "- Fix Booting to ISO image failed when the iDRAC USB port is Configure as All Ports OFF/AllPortsOff (Dynamic)."

The issue was resolved by:
- Updating the BIOS to 2.11.2 (and iDRAC to 4.40.40.00)
- In the iDRAC dashboard: Configuration -> Virtual Console -> Plug-In Type, set to HTML5 (not eHTML5)
- Clearing out previous UEFI boot images (unknown if this step was required)
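For anyone hitting something similar, here is a minimal sketch of confirming the BIOS and iDRAC firmware versions over Redfish before retrying the install. The BMC address and credentials are placeholders, and the Dell resource paths (System.Embedded.1, iDRAC.Embedded.1) are assumed to match the layout shown elsewhere in this bug -- adjust for your environment:

# Minimal sketch: read BIOS and iDRAC firmware versions via standard Redfish
# properties. The BMC address and credentials below are placeholders.
import requests
requests.packages.urllib3.disable_warnings()   # BMCs commonly use self-signed certs

BMC = "https://xxx.xxx.xxx.xxx"    # iDRAC address (placeholder)
AUTH = ("root", "calvin")          # credentials (placeholder)

def get(path):
    r = requests.get(BMC + path, auth=AUTH, verify=False, timeout=30)
    r.raise_for_status()
    return r.json()

system = get("/redfish/v1/Systems/System.Embedded.1")     # Dell system resource
manager = get("/redfish/v1/Managers/iDRAC.Embedded.1")    # Dell manager resource
print("BIOS version:  ", system.get("BiosVersion"))
print("iDRAC firmware:", manager.get("FirmwareVersion"))

BiosVersion and FirmwareVersion are standard Redfish properties, so the same check should work on other BMCs that implement them.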
> I would suggest to move it on the ironic team?

What makes you think it's related to Ironic? The following message suggests the opposite:

> Attempts to manually mount the ISO (via idrac) as well as various other "known good" (non OpenShift) ISO images all resulted in the same behavior -- failure to boot from the ISO.

If the BIOS upgrade solved the problem, what else is left to be solved?
(In reply to Ian Miller from comment #1)
> The issue was resolved by:
> ...
> - Cleared out previous UEFI boot images (unknown if this step was required)

If this occurs again, can you confirm whether this step alone solves the problem? It's possible this is another occurrence of this bug, https://storyboard.openstack.org/#!/story/2008763, where the first CD boot entry in the list is selected when vmedia boot is requested (rather than the generic one).
I ran into this issue again today on several Single Node OpenShift clusters that I was installing. Following Comment 6, the only step necessary to recover the systems was editing the UEFI boot order to remove the entries that correlate to old installations.

Details from one of the clusters where this happened:

hub cluster:
OCP 4.8.12
ACM 2.3.2

cnfdf03:
Installation was stuck, with the AgentClusterInstall reporting "The cluster's validations are failing: Single-node clusters must have a single master node and no workers."
ssh to the node confirmed it was booted into the old on-disk image.
iDRAC version 4.40.00.00
BIOS version 2.10.0

UEFI boot order -- entries marked with (x) are the ones I removed, which allowed installation to proceed:
(x) Unavailable: Red Hat Enterprise Linux
(x) RAID Controller in Slot 6: Red Hat Enterprise Linux
Virtual Optical Drive
PXE Device 1: Integrated NIC 1 Port 1 Partition 1
Virtual Floppy Drive

Once the 2 marked entries were removed from the UEFI boot order (via the iDRAC web UI) and the node rebooted, the system booted from the ISO and cluster installation proceeded normally.
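In case it helps the next person triage this, a minimal sketch of listing the UEFI boot entries over Redfish to spot stale "Red Hat Enterprise Linux" entries before editing the boot order in the iDRAC UI. The BMC address and credentials are placeholders, and it assumes the firmware exposes the standard Redfish BootOptions collection:

# Minimal sketch: list UEFI boot options via the standard Redfish BootOptions
# collection to identify stale entries. Address/credentials are placeholders.
import requests
requests.packages.urllib3.disable_warnings()

BMC = "https://xxx.xxx.xxx.xxx"    # iDRAC address (placeholder)
AUTH = ("root", "calvin")          # credentials (placeholder)

def get(path):
    r = requests.get(BMC + path, auth=AUTH, verify=False, timeout=30)
    r.raise_for_status()
    return r.json()

options = get("/redfish/v1/Systems/System.Embedded.1/BootOptions")
for member in options.get("Members", []):
    opt = get(member["@odata.id"])
    print(opt.get("BootOptionReference"), "-", opt.get("DisplayName"))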
As @rfreiman said, the boot order is not related to AI. Can you please check the BIOS version? If that doesn't help, then someone from the Ironic team needs to take a look.
I upgraded to the latest versions of the BIOS (2.12.2) and iDRAC (5.00.10.20). Installation CRs were created on the hub cluster to restart the deployment. The issue was the same: the node booted into the existing install on the disk rather than the attached ISO image. The console flashed by the following messages:

Booting from Virtual Optical Drive: Red Hat Enterprise Linux
Boot Failed: Virtual Optical Drive: Red Hat Enterprise Linux
Booting from RAID Controller in Slot 6: Red Hat Enterprise Linux
@dtantsur Can someone from your team please take a look?
In this case, Ian, does it fail the same way if you attach an ISO manually? Have you tried resetting iDRAC? P.S. Please use Slack for asking for help.
Could you please check the driver in the created BMH? It has to be idrac-virtualmedia, not just redfish-virtualmedia. Otherwise, if it does not work when attaching an ISO manually, it's a hardware problem, not something we can help with.
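For reference, a minimal sketch of pulling that field with the Kubernetes Python client (the namespace and BMH name are placeholders):

# Minimal sketch: print the BMC address of a BareMetalHost to check the driver
# prefix. Namespace and name below are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

bmh = api.get_namespaced_custom_object(
    group="metal3.io",
    version="v1alpha1",
    namespace="mycluster",          # placeholder namespace
    plural="baremetalhosts",
    name="mycluster-bmh",           # placeholder BMH name
)
print(bmh["spec"]["bmc"]["address"])   # should start with idrac-virtualmedia+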
From the BareMetalHost CR (some data obfuscated):

spec:
  automatedCleaningMode: disabled
  bmc:
    address: idrac-virtualmedia+https://xxx.xxx.xxx.xxx/redfish/v1/Systems/System.Embedded.1
    credentialsName: bmh-secret
    disableCertificateVerification: true
  bootMACAddress: xx:xx:xx:xx:xx:xx
  bootMode: UEFI
  image:
    format: live-iso
    url: xxxxxxxxx
  online: true
  rootDeviceHints:
    deviceName: /dev/sda
Marking this ticket as a duplicate of BZ 2011306.

*** This bug has been marked as a duplicate of bug 2011306 ***