Bug 1978314 - Installation stuck at "The cluster is not ready to begin the installation"
Summary: Installation stuck at "The cluster is not ready to begin the installation"
Keywords:
Status: CLOSED DUPLICATE of bug 2011306
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Michael Filanov
QA Contact: Yuri Obshansky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-01 14:16 UTC by Ian Miller
Modified: 2021-11-09 07:39 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-28 17:38:51 UTC
Target Upstream Version:
Embargoed:



Description Ian Miller 2021-07-01 14:16:20 UTC
Description of problem:
Using Assisted Service to install SNO on a baremetal server. The InfraEnv shows the ISO was generated, and the BMH attaches the ISO via idrac and boots the server. The agent never connects back to Assisted Service, and the conditions of the AgentClusterInstall show:

    conditions:
    - lastProbeTime: "2021-06-28T21:15:56Z"
      lastTransitionTime: "2021-06-28T21:15:56Z"
      message: The Spec has been successfully applied
      reason: SyncOK
      status: "True"
      type: SpecSynced
    - lastProbeTime: "2021-06-28T21:15:56Z"
      lastTransitionTime: "2021-06-28T21:15:56Z"
      message: The cluster is not ready to begin the installation
      reason: ClusterNotReady
      status: "False"
      type: RequirementsMet
    - lastProbeTime: "2021-06-28T21:15:56Z"
      lastTransitionTime: "2021-06-28T21:15:56Z"
      message: 'The cluster''s validations are failing: Single-node clusters must have a single master node and no workers.'
      reason: ValidationsFailing
      status: "False"
      type: Validated
    - lastProbeTime: "2021-06-28T21:15:56Z"
      lastTransitionTime: "2021-06-28T21:15:56Z"
      message: The installation has not yet started
      reason: InstallationNotStarted
      status: "False"
      type: Completed
    - lastProbeTime: "2021-06-28T21:15:56Z"
      lastTransitionTime: "2021-06-28T21:15:56Z"
      message: The installation has not failed
      reason: InstallationNotFailed
      status: "False"
      type: Failed
    - lastProbeTime: "2021-06-28T21:15:56Z"
      lastTransitionTime: "2021-06-28T21:15:56Z"
      message: The installation is waiting to start or in progress
      reason: InstallationNotStopped
      status: "False"
      type: Stopped
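
The failing RequirementsMet/Validated conditions above appear to follow from the agent never registering: with no discovered host, the cluster has zero master nodes, so the single-master/no-workers requirement for SNO cannot be satisfied. For reference, a minimal AgentClusterInstall spec for SNO looks roughly like the sketch below; the names, image set, and network values are hypothetical, since the actual CRs used here are not attached to this bug:

    apiVersion: extensions.hive.openshift.io/v1beta1
    kind: AgentClusterInstall
    metadata:
      name: sno-cluster                 # hypothetical
      namespace: sno-cluster            # hypothetical
    spec:
      clusterDeploymentRef:
        name: sno-cluster
      imageSetRef:
        name: openshift-v4.8.0          # hypothetical ClusterImageSet
      networking:
        clusterNetwork:
        - cidr: 10.128.0.0/14
          hostPrefix: 23
        serviceNetwork:
        - 172.30.0.0/16
      provisionRequirements:
        controlPlaneAgents: 1           # SNO: exactly one master
        workerAgents: 0                 # ...and no workers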

InfraEnv reports ISO generated and available at URL: https://assisted-service-open-cluster-management.apps.ran-acm02.ptp.lab.eng.bos.redhat.com/api/assisted-install/v1/clusters/0bcb67f7-67bd-4b10-beec-f1762d46a97f/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiMGJjYjY3ZjctNjdiZC00YjEwLWJlZWMtZjE3NjJkNDZhOTdmIn0.kIPi2W45KYbXZ8eIW4pw5vY4Swx2e4IzF2PkoIp-tFF6o00eM-ETlj_rk-XUsDa_fr_5yKUtJXOH5bRS3-myvw
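
That URL is the one the InfraEnv publishes in its status once the discovery ISO has been generated; roughly (field name per the agent-install.openshift.io API, value as quoted above):

    status:
      isoDownloadURL: <the URL quoted above>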

BMH attaches ISO to node via idrac: http://10.16.231.227:6180/redfish/boot-0f2b3367-f922-4e87-aea4-cb07a97991ec.iso?filename=tmp3ujubhn9.iso

Version-Release number of selected component (if applicable): 4.8


How reproducible: Always (against a specific baremetal server)


Steps to Reproduce:
1. Delete namespace containing installation CRs (ClusterDeployment, InfraEnv, BMH, AgentClusterInstall, and NMStateConfig)
2. Create namespace and installation CRs
3. Check state of AgentClusterInstall, watch console for reboot

Actual results: Installation stuck.


Expected results: Installation starts


Additional info:
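For reference, the installation CRs from step 2 are not attached to this bug; a minimal InfraEnv sketch of the general shape used (all names, secrets, and labels hypothetical) would be:

    apiVersion: agent-install.openshift.io/v1beta1
    kind: InfraEnv
    metadata:
      name: sno-cluster                          # hypothetical
      namespace: sno-cluster                     # hypothetical
    spec:
      clusterRef:
        name: sno-cluster                        # matches the ClusterDeployment/AgentClusterInstall
        namespace: sno-cluster
      pullSecretRef:
        name: assisted-deployment-pull-secret    # hypothetical Secret name
      sshAuthorizedKey: ssh-rsa AAAA...          # hypothetical key
      nmStateConfigLabelSelector:
        matchLabels:
          nmstate-config: sno-cluster            # selects the NMStateConfig CR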

Comment 1 Ian Miller 2021-07-01 14:29:52 UTC
The root cause was found to be an issue with the baremetal server failing to actually mount the ISO (despite it being attached successfully). When the server was booting (as initiated by the BMH), a message was seen briefly on the screen reporting a failure to boot from the ISO:

    Booting from Virtual Optical Drive: Red Hat Enterprise Linux
    Boot Failed: Virtual Optical Drive: Red Hat Enterprise Linux

    Booting from Virtual Optical Drive: Red Hat Enterprise Linux
    Boot Failed: Virtual Optical Drive: Red Hat Enterprise Linux

    Booting from RAID Controller in Slot 5: Red Hat Enterprise Linux

The boot eventually fell back to the previously installed image on disk.

Attempts to manually mount the ISO (via idrac) as well as various other "known good" (non OpenShift) ISO images all resulted in the same behavior -- failure to boot from the ISO.

An issue with the BIOS/idrac version was identified which can prevent the ISO from mounting correctly. The installed BIOS version was 2.8.2.
BIOS version 2.11.2 contains a fix: "Fix Booting to ISO image failed when the iDRAC USB port is Configure as All Ports OFF/AllPortsOff (Dynamic)."

The issue was resolved by:
- Update BIOS to 2.11.2 (and idrac to 4.40.40.00)
- In idrac dashboard: Configuration -> Virtual Console -> Plug-In Type, set to HTML5 (not eHTML5)
- Cleared out previous UEFI boot images (unknown if this step was required)

Comment 4 Dmitry Tantsur 2021-07-07 12:34:13 UTC
> I would suggest to move it on the ironic team?

What makes you think it's related to ironic? The following message suggests the opposite:

> Attempts to manually mount the ISO (via idrac) as well as various other "known good" (non OpenShift) ISO images all resulted in the same behavior -- failure to boot from the ISO.

If the BIOS upgrade solved the problem, what else is left to be solved?

Comment 6 Derek Higgins 2021-07-07 14:02:50 UTC
(In reply to Ian Miller from comment #1)
> The issue was resolved by:
> ...
> - Cleared out previous UEFI boot images (unknown if this step was required)

If this occurs again, can you confirm whether this step alone solves the problem?
It's possible this is another occurrence of this bug: https://storyboard.openstack.org/#!/story/2008763
where the first CD boot entry in the list is selected when vmedia boot is requested (rather than the generic one).

Comment 7 Ian Miller 2021-10-15 19:23:50 UTC
I ran into this issue again today on several Single Node OpenShift clusters that I was installing. Following Comment 6, the only step necessary to recover the systems was editing the UEFI boot order to remove the entries that correspond to old installations. Details from one of the clusters where this happened:

hub cluster
OCP 4.8.12
ACM 2.3.2

cnfdf03
Installation was stuck, with the AgentClusterInstall reporting "The cluster's validations are failing: Single-node clusters must have a single master node and no workers."
SSH to the node confirmed it had booted into the old on-disk image.

idrac version 4.40.00.00
BIOS Version 2.10.0
UEFI boot order - the entries marked with (x) are the ones I removed, which allowed the installation to proceed:
  (x) Unavailable: Red Hat Enterprise Linux
  (x) RAID Controller in Slot 6: Red Hat Enterprise Linux
  Virtual Optical Drive
  PXE Device 1: Integrated NIC 1 Port 1 Partition 1
  Virtual Floppy Drive

Once the 2 marked entries were removed from the UEFI boot order (via the idrac web UI) and the node was rebooted, the system booted from the ISO and the cluster installation proceeded normally.

Comment 8 Michael Filanov 2021-10-17 06:44:10 UTC
As @rfreiman said, the boot order is not related to AI. Can you please check the BIOS version?
If that doesn't help, then someone from Ironic needs to take a look.

Comment 9 Ian Miller 2021-10-18 14:53:59 UTC
I upgraded to the latest versions of the BIOS (2.12.2) and iDRAC (5.00.10.20). Installation CRs were created on the hub cluster to restart the deployment. The issue was the same: the node booted into the existing install on the disk rather than the attached ISO. The following messages flashed by on the console:

Booting from Virtual Optical Drive: Red Hat Enterprise Linux
Boot Failed: Virtual Optical Drive: Red Hat Enterprise Linux

Booting from RAID Controller in Slot 6: Red Hat Enterprise Linux

Comment 10 Michael Filanov 2021-10-18 19:54:10 UTC
@dtantsur Can someone from your team please take a look?

Comment 11 Dmitry Tantsur 2021-10-19 07:51:18 UTC
In this case, Ian, does it fail the same way if you attach an ISO manually? Have you tried resetting iDRAC?

P.S.
Please use Slack to ask for help.

Comment 17 Dmitry Tantsur 2021-10-28 13:36:24 UTC
Could you please check the driver in the created BMH? It has to be idrac-virtualmedia, not just redfish-virtualmedia.

Otherwise, if it does not work when attaching an ISO manually, it's a hardware problem, not something we can help with.

Comment 18 Ian Miller 2021-10-28 15:44:28 UTC
From the BareMetalHost CR (some data obfuscated):

spec:                     
  automatedCleaningMode: disabled
  bmc:             
    address: idrac-virtualmedia+https://xxx.xxx.xxx.xxx/redfish/v1/Systems/System.Embedded.1
    credentialsName: bmh-secret
    disableCertificateVerification: true
  bootMACAddress: xx:xx:xx:xx:xx:xx
  bootMode: UEFI
  image:
    format: live-iso
    url: xxxxxxxxx
  online: true
  rootDeviceHints:
    deviceName: /dev/sda
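
The bmc.address above does use the Dell-specific driver Dmitry asked about; for contrast, the generic Redfish variant he warned against would take the form (same host, shown only for comparison):

    address: redfish-virtualmedia+https://xxx.xxx.xxx.xxx/redfish/v1/Systems/System.Embedded.1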

Comment 19 Ian Miller 2021-10-28 17:38:51 UTC
Marking this ticket as a duplicate of BZ 2011306.

*** This bug has been marked as a duplicate of bug 2011306 ***

