Bug 2117577

Summary: Assisted Installer fails to deploy spoke compact cluster
Product: Red Hat Advanced Cluster Management for Kubernetes Reporter: Gurenko Alex <agurenko>
Component: Infrastructure Operator Assignee: Michael Filanov <mfilanov>
Status: CLOSED UPSTREAM QA Contact: Chad Crum <ccrum>
Severity: high Docs Contact: Derek <dcadzow>
Priority: unspecified    
Version: rhacm-2.4 CC: ccrum, efried, trwest, yfirst
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-11 15:49:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Gurenko Alex 2022-08-11 11:21:16 UTC
Description of the problem:

As part of the Altiostar deployment, deploying the spoke compact cluster (3 masters only) fails with a timeout. Investigation shows that the hive-operator pod is crashing and the nodes are reported as unreachable. If the agents are manually restarted and the hive-operator pod is re-created, a new deployment attempt starts and runs until hive-operator crashes again.

Release version:

Operator snapshot version: 2.4.6

OCP version: 4.8.43

Browser Info:

Steps to reproduce:
1. Start spoke compact cluster deployment
2. Wait for agents to register

Actual results:

Deployment does not proceed, and the following message is shown:

The cluster has hosts that are not ready to install.

hive-operator is crashing with the following in its log:

time="2022-08-10T11:29:30Z" level=info msg="reconcile complete" controller=hive elapsedMillis=620 elapsedMillisGT=0 outcome=unspecified
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x123e5fc]

NAME                                                             READY   STATUS             RESTARTS   AGE
assisted-service-56d7794d8c-kgslh                                2/2     Running            14         13h
hive-operator-66dd64b6b7-qfkbk                                   0/1     CrashLoopBackOff   108        13h

Expected results:

Deployment continues

Additional info:

With the previous OCP 4.8.34 I saw this issue once or twice, and re-deployment solved it. With 4.8.43 I am now stuck: it reproduced in 3 out of 3 complete redeployments.

Comment 1 Michael Filanov 2022-08-11 12:30:41 UTC
Can you please attach the ClusterDeployment and AgentClusterInstall resources?

Comment 3 Eric Fried 2022-08-11 15:04:53 UTC
This seems to be caused by building with github.com/modern-go/reflect2 < v1.0.2 under go1.18. That's why this only showed up now, despite no code changes in hive's ocm-2.4 branch in six months: ACM's build recently switched to go1.18.

Hive is upgrading the dependency via https://issues.redhat.com/browse/HIVE-1997

After that, ACM will need to pick up the change and respin its build.

This will need to be done for 2.3 as well.
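
For anyone checking other branches (for example 2.3) for the same combination, here is a rough sketch of the check described in this comment. It is illustrative only and not part of the hive or ACM tooling. The helper names (parse_version, reflect2_version, is_affected) are hypothetical, the toolchain string has to be supplied by the caller, and treating every toolchain from go1.18 onward as affected is an assumption; the comment only mentions go1.18. It reads plain require lines in go.mod and does not handle replace directives or pseudo-versions.

```python
import re

# Module and fixed version as named in the comment above.
MODULE = "github.com/modern-go/reflect2"
FIXED = (1, 0, 2)
# Assumption: go1.18 and anything newer triggers the problem.
AFFECTED_GO = (1, 18)


def parse_version(text):
    """Return the leading numeric components of 'v1.0.1' or 'go1.18.3' as a tuple."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def reflect2_version(go_mod_text):
    """Return the reflect2 version required in the go.mod text, or None if absent."""
    pattern = re.compile(
        r"^\s*(?:require\s+)?" + re.escape(MODULE) + r"\s+(v\S+)",
        re.MULTILINE,
    )
    match = pattern.search(go_mod_text)
    return match.group(1) if match else None


def is_affected(go_mod_text, toolchain):
    """True if go.mod pins reflect2 below v1.0.2 and the toolchain is go1.18 or newer."""
    version = reflect2_version(go_mod_text)
    if version is None:
        return False
    return parse_version(toolchain) >= AFFECTED_GO and parse_version(version) < FIXED


if __name__ == "__main__":
    # Made-up go.mod content for illustration; not copied from the hive repository.
    sample = "module example\n\nrequire (\n\tgithub.com/modern-go/reflect2 v1.0.1 // indirect\n)\n"
    print(is_affected(sample, "go1.18.3"))
```

The toolchain string can be taken from the build environment, for example from the output of `go version`.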

Comment 4 Eric Fried 2022-08-16 15:16:48 UTC
Hive side is all done here.

Comment 6 Michael Filanov 2022-08-17 14:57:31 UTC
Yes, this one is irrelevant now; please follow up on the Jira ticket.