2096106 – Assisted service triggers the worker nodes re-provisioning on the hub cluster when the converged flow is enabled

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Advanced Cluster Management for Kubernetes
Classification:	Red Hat
Component:	Infrastructure Operator
Sub Component:
Version:	rhacm-2.6
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	rhacm-2.6
Assignee:	Eran Cohen
QA Contact:	Chad Crum
Docs Contact:	Derek
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-06-12 21:26 UTC by tali@redhat.com
Modified:	2022-09-06 22:31 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-09-06 22:30:54 UTC
Target Upstream Version:
Embargoed:
Flags:	cbynum: rhacm-2.6+ cbynum: rhacm-2.6.z+

Links
System	ID	Priority	Status	Summary	Last Updated
Github	stolostron backlog issues 23177	None	None	None	2022-06-13 02:50:59 UTC
Red Hat Issue Tracker	MGMTBUGSM-431	None	None	None	2022-06-12 21:34:49 UTC
Red Hat Product Errata	RHSA-2022:6370	None	None	None	2022-09-06 22:31:11 UTC

Description tali@redhat.com 2022-06-12 21:26:27 UTC

Description of the problem:
Deploying the assisted service operator bundle on a 4.11 hub cluster caused the two already-provisioned worker nodes to go through deprovisioning and then reprovisioning. Both workers eventually got stuck in the provisioning state.

This was due to BMAC setting the customDeploy method to "start_assisted_install" on the two BMHs (when the converged flow is enabled). 


	oc get bmh -n openshift-machine-api cnfdt08-worker-0 -o jsonpath='{.spec.customDeploy}' | jq
	{
	  "method": "start_assisted_install"
	}

	oc get bmh -n openshift-machine-api cnfdt08-worker-1 -o jsonpath='{.spec.customDeploy}' | jq
	{
	  "method": "start_assisted_install"
	}

	oc get bmh -A
	NAMESPACE               NAME               STATE                    CONSUMER                       ONLINE   ERROR   AGE
	openshift-machine-api   cnfdt08-master-0   externally provisioned   cnfdt08-2qtb2-master-0         true             15h
	openshift-machine-api   cnfdt08-master-1   externally provisioned   cnfdt08-2qtb2-master-1         true             15h
	openshift-machine-api   cnfdt08-master-2   externally provisioned   cnfdt08-2qtb2-master-2         true             15h
	openshift-machine-api   cnfdt08-worker-0   provisioning             cnfdt08-2qtb2-worker-0-z8cdr   true             15h
	openshift-machine-api   cnfdt08-worker-1   provisioning             cnfdt08-2qtb2-worker-0-tc6rc   true             15h
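
To run the same check across every host at once, something like the following could be used. This is only a sketch: the helper names `load_bmhs` and `hosts_with_custom_deploy` are made up for illustration, and it assumes a logged-in `oc` client plus the standard BareMetalHost fields `spec.customDeploy.method` and `status.provisioning.state`.

```python
import json
import subprocess


def load_bmhs(namespace="openshift-machine-api"):
    """Fetch the BareMetalHost objects of a namespace as parsed JSON.

    Hypothetical helper; requires a logged-in `oc` client on PATH.
    """
    out = subprocess.run(
        ["oc", "get", "bmh", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["items"]


def hosts_with_custom_deploy(bmhs):
    """Map BMH name -> (customDeploy method, provisioning state).

    Only hosts that actually set spec.customDeploy.method are returned.
    """
    flagged = {}
    for bmh in bmhs:
        custom_deploy = (bmh.get("spec") or {}).get("customDeploy") or {}
        method = custom_deploy.get("method")
        if method:
            provisioning = (bmh.get("status") or {}).get("provisioning") or {}
            flagged[bmh["metadata"]["name"]] = (method, provisioning.get("state"))
    return flagged
```

Calling `hosts_with_custom_deploy(load_bmhs())` on the hub would list only the hosts that have a customDeploy method set, together with their current provisioning state.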


Release version:
- Latest upstream assisted-service-operator (with converged flow enabled)
- OCP 4.11 on hub (4.11.0-0.nightly-2022-06-06-201913)


Steps to reproduce:
1. Deploy a 4.11 hub cluster with 3 masters and 2 workers
2. Deploy the assisted service operator bundle with converged flow enabled


Actual results:
The hub cluster is in a bad state and many pods (including the assisted-service pods) can't find an available host.

Expected results:
The hub cluster is in a healthy state that allows deploying a spoke cluster

Additional info:

Comment 1 Eran Cohen 2022-06-13 11:03:32 UTC

I'm not sure whether the start_assisted_install customDeploy method is what triggered the deprovisioning, or whether it's something else we do (perhaps we updated the preprovisioning image with a new ISO URL).

BMAC (the BMH controller in assisted-service) is reconciling BMH CRs that belong to an InfraEnv.
Can you attach:
1. The full workers BMH yaml?
2. All the preprovisioningimages you see in the cluster?
3. The assisted-service and metal3 logs
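
A quick way to see which hosts BMAC would consider its own is to look for the InfraEnv association on each BMH. The sketch below assumes that association is expressed through the `infraenvs.agent-install.openshift.io` label; that key, and the function name `hosts_claimed_by_infraenv`, are assumptions here and should be checked against the deployed assisted-service version.

```python
# Assumed label key linking a BareMetalHost to an InfraEnv; verify against
# the assisted-service version in use before relying on it.
INFRAENV_LABEL = "infraenvs.agent-install.openshift.io"


def hosts_claimed_by_infraenv(bmhs, label=INFRAENV_LABEL):
    """Map BMH "namespace/name" -> InfraEnv name for hosts carrying the label.

    `bmhs` is the "items" list from `oc get bmh -A -o json`, already parsed.
    Hosts without the label are skipped.
    """
    claimed = {}
    for bmh in bmhs:
        meta = bmh.get("metadata") or {}
        labels = meta.get("labels") or {}
        if label in labels:
            key = f"{meta.get('namespace', '')}/{meta.get('name', '')}"
            claimed[key] = labels[label]
    return claimed
```

If the hub's own workers in openshift-machine-api do not appear in the result, the controller would not be expected to touch them, which makes the full BMH YAML a useful thing to compare against.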

Comment 4 tali@redhat.com 2022-06-13 14:20:02 UTC

I have attached the full workers BMH yaml, preprovisioningimages and BMO logs.
The assisted-service pods were running on the worker node and the logs were not available.

oc get PreprovisioningImage  -A
NAMESPACE               NAME               READY   REASON
openshift-machine-api   cnfdt08-worker-0   True    ImageSuccess
openshift-machine-api   cnfdt08-worker-1   True    ImageSuccess

oc logs -n assisted-installer assisted-service-684bd8cd48-brpbb -c assisted-service
Error from server: Get "https://10.46.55.22:10250/containerLogs/assisted-installer/assisted-service-684bd8cd48-brpbb/assisted-service": dial tcp 10.46.55.22:10250: connect: no route to host

Comment 5 tali@redhat.com 2022-06-29 21:02:13 UTC

I re-tested with the latest upstream operator and this issue has been resolved.

Comment 6 Chad Crum 2022-07-19 17:17:40 UTC

Verified per the previous comment. QE also does not see this in CI with recent builds.

Comment 9 errata-xmlrpc 2022-09-06 22:30:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6370
