Bug 2117577

Summary:	Assisted Installer fails to deploy spoke compact cluster
Product:	Red Hat Advanced Cluster Management for Kubernetes	Reporter:	Gurenko Alex <agurenko>
Component:	Infrastructure Operator	Assignee:	Michael Filanov <mfilanov>
Status:	CLOSED UPSTREAM	QA Contact:	Chad Crum <ccrum>
Severity:	high	Docs Contact:	Derek <dcadzow>
Priority:	unspecified
Version:	rhacm-2.4	CC:	ccrum, efried, trwest, yfirst
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-11 15:49:06 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Gurenko Alex 2022-08-11 11:21:16 UTC

Description of the problem:

As part of the Altiostar deployment, deploying the spoke compact cluster (3 masters only) fails with a timeout. Investigation shows that the hive-operator pod is crashing and the nodes are reported as unreachable. If the agents are manually restarted and the hive-operator pod is re-created, a new attempt to deploy the cluster starts, but it lasts only until hive-operator crashes again.

Release version:

Operator snapshot version: 2.4.6

OCP version: 4.8.43

Browser Info:

Steps to reproduce:
1. Start spoke compact cluster deployment
2. Wait for agents to register

Actual results:

Deployment does not proceed and reports:

The cluster has hosts that are not ready to install.

hive-operator is crashing with the following in its log:

time="2022-08-10T11:29:30Z" level=info msg="reconcile complete" controller=hive elapsedMillis=620 elapsedMillisGT=0 outcome=unspecified
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x123e5fc]

NAME                                READY   STATUS             RESTARTS   AGE
assisted-service-56d7794d8c-kgslh   2/2     Running            14         13h
hive-operator-66dd64b6b7-qfkbk      0/1     CrashLoopBackOff   108        13h

Expected results:

Deployment continues

Additional info:

With the previous OCP 4.8.34 I saw this issue once or twice, and re-deployment solved it. With 4.8.43 I am now stuck: it has reproduced in 3 out of 3 complete redeployments.

Comment 1 Michael Filanov 2022-08-11 12:30:41 UTC

Can you please attach the cluster deployment and the agent cluster install?

Comment 3 Eric Fried 2022-08-11 15:04:53 UTC

This seems to be caused by building with github.com/modern-go/reflect2 < v1.0.2 under go1.18. That is why this only showed up now, despite no code changes in hive's ocm-2.4 branch in six months: ACM's build recently switched to go1.18.

Hive is upgrading the dependency via https://issues.redhat.com/browse/HIVE-1997

After that, ACM will need to pick up the change and respin its build.

This will need to be done for 2.3 as well.
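
For anyone checking whether a given binary was built with the combination described above, here is a minimal Go sketch. It is not taken from hive or ACM; the helper names (parseVersion, less, affected) are hypothetical, and it assumes that Go releases after 1.18 are affected in the same way as 1.18 itself. It reads the build information embedded in the running binary, so it would have to be compiled into (or adapted for) the binary under inspection, and it ignores module replace directives.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"strconv"
	"strings"
)

// parseVersion extracts up to three numeric components from strings such as
// "go1.18.3" or "v1.0.2". Missing or non-numeric components are treated as 0.
func parseVersion(s string) [3]int {
	s = strings.TrimPrefix(strings.TrimPrefix(s, "go"), "v")
	var out [3]int
	for i, part := range strings.SplitN(s, ".", 3) {
		// Keep only the leading digits, so suffixes like "2-rc1" still parse.
		end := 0
		for end < len(part) && part[end] >= '0' && part[end] <= '9' {
			end++
		}
		n, _ := strconv.Atoi(part[:end])
		out[i] = n
	}
	return out
}

// less reports whether version a sorts strictly before version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// affected reports whether the toolchain is go1.18 or newer (assumption: newer
// releases behave like 1.18) while reflect2 is older than v1.0.2.
func affected(goVersion, reflect2Version string) bool {
	go118 := [3]int{1, 18, 0}
	fixed := [3]int{1, 0, 2}
	return !less(parseVersion(goVersion), go118) &&
		less(parseVersion(reflect2Version), fixed)
}

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info available")
		return
	}
	for _, dep := range info.Deps {
		if dep.Path == "github.com/modern-go/reflect2" {
			fmt.Printf("%s with reflect2 %s: affected=%v\n",
				info.GoVersion, dep.Version, affected(info.GoVersion, dep.Version))
			return
		}
	}
	fmt.Println("reflect2 is not a dependency of this binary")
}
```

The version comparison is hand-rolled only to keep the sketch free of external modules; a real check would more likely rely on a proper semver library.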

Comment 4 Eric Fried 2022-08-16 15:16:48 UTC

Hive side is all done here.

Comment 6 Michael Filanov 2022-08-17 14:57:31 UTC

Yes, this one is irrelevant; please follow up on the Jira ticket.