Description of the problem:

Multi-node 4.11 disconnected spoke installation fails because hosts fail to reboot within the timeout. Two of the master hosts are stuck in the "Rebooting" stage, and the AgentClusterInstall is stuck in "installing-pending-user-action".

$ oc get agentclusterinstalls.extensions.hive.openshift.io -n spoke-0 spoke-0
NAME      CLUSTER   STATE
spoke-0   spoke-0   installing-pending-user-action

$ oc get agent -n infraenv-0
NAME                                   CLUSTER   APPROVED   ROLE     STAGE
06cac740-01a6-40e3-b588-cfbd74d7153a   spoke-0   true       worker   Waiting for control plane
785d0473-ebd8-4977-ae43-e2a32da26096   spoke-0   true       master   Rebooting
880de015-df05-41a1-8889-f6e55993aba8   spoke-0   true       master   Waiting for control plane
c106a889-c9bb-4275-8342-2b5fd6d531cb   spoke-0   true       worker   Waiting for control plane
eb0aa75c-e718-4229-95c9-e34fac6b5e5e   spoke-0   true       master   Rebooting

Rebooting the hosts from the ISO doesn't seem to work.

Release version:

On the hub cluster:
$ oc version
Client Version: 4.10.0-0.nightly-2022-06-08-150219
Server Version: 4.10.0-0.nightly-2022-06-08-150219
Kubernetes Version: v1.23.5+3afdacb

Operator snapshot version: ACM 2.5.1-DOWNSTREAM-2022-06-27-22-23-39

OCP version: 4.11.0-0.nightly-2022-07-01-065600

Steps to reproduce:
1. On a 4.10 disconnected hub, install the ACM operator (2.5.1-DOWNSTREAM-2022-06-27-22-23-39 snapshot)
2. Create an infraenv with 3 master hosts and 2 workers; create the AgentClusterInstall, ClusterDeployment and BMH resources
3. Bind the hosts to the cluster (late binding flow)

Actual results:
Spoke cluster installation is stuck in the "installing-pending-user-action" state because hosts fail to reboot within the timeout.

Expected results:
Spoke cluster completes installation successfully.
Pending user action usually points to a wrong boot order. Is that the case here? If it's not, what happens after the host manages to boot?
These are actually KVM virtual machines, and in the domain XML there is only one boot option, so I'm not sure whether it can still be skipping the boot from the infraenv ISO. BTW, at the beginning I thought it was an issue similar to BZ2074483 and its duplicate BZ2093486. I asked there (https://bugzilla.redhat.com/show_bug.cgi?id=2074483#c21), but I can't find any error in the libvirt logs or in the boot logs of the VM. After booting the VM I can ssh into it, and I can see that the agent service is running without errors. Is there something else I could check?
When rebooting during installation the host should boot from the disk, not the discovery ISO. @itsoiref, are you aware of any libvirt configuration that should be checked in this case?
@epassaro, can you please get the events from the cluster? You should have a link in the AgentClusterInstall. Logs from the service would help as well.
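For reference, the events link can usually be read from the AgentClusterInstall status; a minimal sketch using the namespace/name from this report (the `status.debugInfo.eventsURL` field is assumed from the assisted-service CRD):

```shell
# Read the events URL from the AgentClusterInstall status
# (status.debugInfo.eventsURL is assumed from the assisted-service CRD)
EVENTS_URL=$(oc get agentclusterinstall spoke-0 -n spoke-0 \
  -o jsonpath='{.status.debugInfo.eventsURL}')

# Download the cluster events (-k in case the hub uses a self-signed cert)
curl -k "$EVENTS_URL"
```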
The boot order we use is "disk,cdrom": it lets the host start from the cdrom while the disk is clean, and from the disk after we write RHCOS. Can you please share your dumpxml output?
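For comparison, a libvirt domain with that boot order would show something roughly like this in the `virsh dumpxml` output (a sketch of only the relevant `<os>` element; machine type and arch are assumptions):

```xml
<os>
  <type arch='x86_64' machine='q35'>hvm</type>
  <!-- disk first, then cdrom: the host only falls through to the
       discovery ISO while the disk has no bootable image yet -->
  <boot dev='hd'/>
  <boot dev='cdrom'/>
</os>
```

If the domain XML instead lists a single `<boot>` entry, or uses per-device `<boot order='...'/>` attributes on the `<disk>` devices, that would explain a host not falling back to the expected device.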
Actually, it is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2100456#c3
*** This bug has been marked as a duplicate of bug 2100456 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days