2103744 – [ACM 2.5.1] Multi-node 4.11 spoke installation failing as hosts fail to reboot within timeout

Keywords:
Status:	CLOSED DUPLICATE of bug 2100456
Alias:	None
Product:	Red Hat Advanced Cluster Management for Kubernetes
Classification:	Red Hat
Component:	Infrastructure Operator
Sub Component:
Version:	rhacm-2.5.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Igal Tsoiref
QA Contact:	Chad Crum
Docs Contact:	Derek
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-07-04 16:59 UTC by epassaro
Modified:	2023-09-15 01:56 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-07-06 15:05:02 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	stolostron backlog issues 24018	0	None	None	None	2022-07-04 18:39:53 UTC
Red Hat Issue Tracker	MGMTBUGSM-457	0	None	None	None	2022-07-04 17:05:11 UTC

Description epassaro 2022-07-04 16:59:49 UTC

Description of the problem: 

Multi-node 4.11 disconnected spoke installation fails as hosts fail to reboot within timeout. 

Two of the master hosts are stuck in the "Rebooting" stage, and the AgentClusterInstall is stuck in the "installing-pending-user-action" state.

$ oc get agentclusterinstalls.extensions.hive.openshift.io -n spoke-0 spoke-0 
NAME      CLUSTER   STATE
spoke-0   spoke-0   installing-pending-user-action


$ oc get agent -n infraenv-0
NAME                                   CLUSTER   APPROVED   ROLE     STAGE
06cac740-01a6-40e3-b588-cfbd74d7153a   spoke-0   true       worker   Waiting for control plane
785d0473-ebd8-4977-ae43-e2a32da26096   spoke-0   true       master   Rebooting
880de015-df05-41a1-8889-f6e55993aba8   spoke-0   true       master   Waiting for control plane
c106a889-c9bb-4275-8342-2b5fd6d531cb   spoke-0   true       worker   Waiting for control plane
eb0aa75c-e718-4229-95c9-e34fac6b5e5e   spoke-0   true       master   Rebooting

Rebooting the hosts from the ISO doesn't seem to work.
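
A minimal sketch, not part of the original report, of how the agent listing above could be filtered for hosts stuck in a given stage. It assumes the plain-text table layout shown above, with no spaces inside the first four columns; the names `agents_in_stage` and `SAMPLE` are hypothetical.

```python
# Hypothetical helper: filter the text output of `oc get agent` by STAGE.
# SAMPLE is copied from the listing in this report.
SAMPLE = """\
NAME                                   CLUSTER   APPROVED   ROLE     STAGE
06cac740-01a6-40e3-b588-cfbd74d7153a   spoke-0   true       worker   Waiting for control plane
785d0473-ebd8-4977-ae43-e2a32da26096   spoke-0   true       master   Rebooting
880de015-df05-41a1-8889-f6e55993aba8   spoke-0   true       master   Waiting for control plane
c106a889-c9bb-4275-8342-2b5fd6d531cb   spoke-0   true       worker   Waiting for control plane
eb0aa75c-e718-4229-95c9-e34fac6b5e5e   spoke-0   true       master   Rebooting
"""


def agents_in_stage(table, stage="Rebooting"):
    """Return the agent names whose STAGE column equals `stage`."""
    lines = table.strip().splitlines()[1:]  # skip the header row
    # Split into at most 5 fields so a multi-word STAGE stays in one piece.
    rows = [line.split(None, 4) for line in lines if line.strip()]
    return [r[0] for r in rows if len(r) == 5 and r[4].strip() == stage]


if __name__ == "__main__":
    for name in agents_in_stage(SAMPLE):
        print(name)
```

For the listing in this report, the helper selects the two master agents that are in the Rebooting stage.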


Release version: 

On hub cluster:

$ oc version
Client Version: 4.10.0-0.nightly-2022-06-08-150219
Server Version: 4.10.0-0.nightly-2022-06-08-150219
Kubernetes Version: v1.23.5+3afdacb


Operator snapshot version: ACM 2.5.1-DOWNSTREAM-2022-06-27-22-23-39

OCP version: 4.11.0-0.nightly-2022-07-01-065600

Steps to reproduce:

1. On 4.10 disconnected hub, install ACM operator (2.5.1-DOWNSTREAM-2022-06-27-22-23-39 snapshot)

2. Create infraenv with 3 master hosts and 2 workers, create agentclusterinstall, clusterdeployment and BMH resources

3. Bind hosts to cluster (late-binding flow)

Actual results:
Spoke cluster installation is stuck in "installing-pending-user-action" state as hosts fail to reboot within timeout

Expected results: 
Spoke cluster completes installation successfully

Comment 2 Michael Filanov 2022-07-05 06:19:20 UTC

Pending user action usually points to a wrong boot order.
Is that the case here? If not, what happens after the host manages to boot?

Comment 3 epassaro 2022-07-05 09:34:01 UTC

These are actually KVM virtual machines, and the XML contains only one boot option, so I'm not sure whether it could still be skipping the boot from the infraenv ISO.

BTW, at the beginning I thought it was an issue similar to BZ2074483 and its duplicate BZ2093486. I've asked there (https://bugzilla.redhat.com/show_bug.cgi?id=2074483#c21), but I can't find any errors in the libvirt logs or in the boot logs of the VM.


After booting the VM I can ssh to it, and I can see that the agent service is running without errors. Is there anything else I could check?

Comment 4 Michael Filanov 2022-07-05 09:38:32 UTC

When rebooting during installation, the host should boot into the disk and not the discovery ISO.
@itsoiref are you aware of any libvirt configuration that should be checked in this case?

Comment 5 Michael Filanov 2022-07-05 09:49:56 UTC

@epassaro can you please get the events from the cluster? You should have a link in the agent cluster install.
Logs from the service will help as well.

Comment 6 Igal Tsoiref 2022-07-05 09:51:07 UTC

The boot order we use is "disk,cdrom": it allows the host to start from the cdrom when the disk is clean, and to start from the disk after we write RHCOS.
Can you please show your dumpxml?
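
A minimal sketch, not from the original thread, of how the boot order could be read from the output of `virsh dumpxml`. The comment above does not say how the order is expressed in the domain XML, so the sketch assumes libvirt's two standard forms: `<boot dev='...'/>` elements under `<os>`, or per-device `<boot order='N'/>` elements. The function name `boot_order` and both sample documents are hypothetical. In the `<os>` form, libvirt calls the hard disk `hd`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: boot order declared under <os>.
OS_LEVEL = """<domain type='kvm'>
  <name>example-master-0</name>
  <os>
    <type arch='x86_64'>hvm</type>
    <boot dev='hd'/>
    <boot dev='cdrom'/>
  </os>
  <devices/>
</domain>"""

# Hypothetical sample: boot order declared per device.
PER_DEVICE = """<domain type='kvm'>
  <name>example-master-1</name>
  <devices>
    <disk type='file' device='cdrom'><boot order='2'/></disk>
    <disk type='file' device='disk'><boot order='1'/></disk>
  </devices>
</domain>"""


def boot_order(domain_xml):
    """Return the boot devices of a libvirt domain, first choice first."""
    root = ET.fromstring(domain_xml)
    per_device = []
    for dev in root.findall("./devices/*"):
        boot = dev.find("boot")
        if boot is not None and boot.get("order"):
            # Disks carry a device attribute; other devices fall back to the tag.
            per_device.append((int(boot.get("order")), dev.get("device", dev.tag)))
    if per_device:
        return [name for _, name in sorted(per_device)]
    # Otherwise use the <boot dev='...'/> entries in document order.
    return [b.get("dev") for b in root.findall("./os/boot")]


if __name__ == "__main__":
    print(boot_order(OS_LEVEL))
    print(boot_order(PER_DEVICE))
```

A result with the hard disk listed before the cdrom matches the "disk,cdrom" order described above. A domain with a single boot entry, as mentioned in comment 3, would come back as a one-element list.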

Comment 8 Igal Tsoiref 2022-07-05 19:23:05 UTC

Actually, it is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2100456#c3

Comment 9 Michael Filanov 2022-07-06 15:05:02 UTC


*** This bug has been marked as a duplicate of bug 2100456 ***

Comment 11 Red Hat Bugzilla 2023-09-15 01:56:35 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days
