Bug 2117577 - Assisted Installer fails to deploy spoke compact cluster
Summary: Assisted Installer fails to deploy spoke compact cluster
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Infrastructure Operator
Version: rhacm-2.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Michael Filanov
QA Contact: Chad Crum
Docs Contact: Derek
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-08-11 11:21 UTC by Gurenko Alex
Modified: 2022-08-17 14:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-11 15:49:06 UTC
Target Upstream Version:
Embargoed:


Links
System                            ID             Private  Priority  Status  Summary  Last Updated
Github stolostron backlog issues  25051          0        None      None    None     2022-08-11 11:54:07 UTC
Red Hat Issue Tracker             MGMT-11570     0        None      None    None     2022-08-11 11:21:15 UTC
Red Hat Issue Tracker             MGMTBUGSM-518  0        None      None    None     2022-08-11 11:24:45 UTC

Description Gurenko Alex 2022-08-11 11:21:16 UTC
Description of the problem:

As part of the Altiostar deployment, deploying the spoke compact cluster (3 masters only) fails (times out). Investigation shows that the hive-operator pod is crashing and the nodes report as unreachable. If the agents are manually restarted and the hive-operator pod is re-created, a new attempt to deploy the cluster starts, until hive-operator crashes again.

Release version:

Operator snapshot version: 2.4.6

OCP version: 4.8.43

Browser Info:

Steps to reproduce:
1. Start spoke compact cluster deployment
2. Wait for agents to register

Actual results:

Deployment does not proceed, with the message:

The cluster has hosts that are not ready to install.

hive-operator is crashing with the following in its log:

time="2022-08-10T11:29:30Z" level=info msg="reconcile complete" controller=hive elapsedMillis=620 elapsedMillisGT=0 outcome=unspecified
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x123e5fc]

NAME                                                             READY   STATUS             RESTARTS   AGE
assisted-service-56d7794d8c-kgslh                                2/2     Running            14         13h
hive-operator-66dd64b6b7-qfkbk                                   0/1     CrashLoopBackOff   108        13h

Expected results:

Deployment continues

Additional info:

With the previous OCP 4.8.34 I saw this issue once or twice, and re-deployment solved it. With 4.8.43 I am now really stuck, as it reproduced in 3 out of 3 complete redeployments.

Comment 1 Michael Filanov 2022-08-11 12:30:41 UTC
Can you please attach the ClusterDeployment and AgentClusterInstall resources?

Comment 3 Eric Fried 2022-08-11 15:04:53 UTC
This seems to be caused by building with github.com/modern-go/reflect2 < v1.0.2 under go1.18. That is why this just showed up despite no code changes in hive's ocm-2.4 branch in 6 months: ACM's build recently switched to using go1.18.

Hive is upgrading the dependency via https://issues.redhat.com/browse/HIVE-1997

After that, ACM will need to pick up the change and respin its build.

This will need to be done for 2.3 as well.
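
For anyone who wants to check whether a given Go binary was built with the combination described above, here is a minimal sketch. It assumes the diagnosis in this comment (reflect2 older than v1.0.2 plus a Go toolchain of 1.18 or newer) and only inspects the binary it is compiled into. The helper names parseVersion, olderThan and affected are made up for illustration and are not part of hive or ACM. The version comparison is deliberately naive: it ignores pre-release and build suffixes and does not follow module replace directives.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"strings"
)

// leadingInt reads the leading decimal digits of s ("18rc1" -> 18, "" -> 0).
func leadingInt(s string) int {
	n := 0
	for _, r := range s {
		if r < '0' || r > '9' {
			break
		}
		n = n*10 + int(r-'0')
	}
	return n
}

// parseVersion (hypothetical helper) turns "v1.0.2" or "go1.18.3" into
// numeric parts. Pre-release and build suffixes are dropped, not compared.
func parseVersion(v string) []int {
	v = strings.TrimPrefix(v, "go")
	v = strings.TrimPrefix(v, "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	var parts []int
	for _, p := range strings.Split(v, ".") {
		parts = append(parts, leadingInt(p))
	}
	return parts
}

// olderThan reports whether version a is numerically lower than version b.
// Missing components count as zero, so "go1.18" equals "go1.18.0".
func olderThan(a, b string) bool {
	pa, pb := parseVersion(a), parseVersion(b)
	for i := 0; i < len(pa) || i < len(pb); i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x != y {
			return x < y
		}
	}
	return false
}

// affected encodes the assumption from this comment: the crash needs
// Go >= 1.18 together with reflect2 < v1.0.2.
func affected(goVersion, reflect2Version string) bool {
	return !olderThan(goVersion, "go1.18") && olderThan(reflect2Version, "v1.0.2")
}

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info embedded in this binary")
		return
	}
	for _, dep := range info.Deps {
		if dep.Path == "github.com/modern-go/reflect2" {
			fmt.Printf("go=%s reflect2=%s affected=%v\n",
				runtime.Version(), dep.Version,
				affected(runtime.Version(), dep.Version))
			return
		}
	}
	fmt.Println("reflect2 is not a dependency of this binary")
}
```

To inspect an already built binary instead, `go version -m <binary>` lists the Go version and module versions it was built with.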

Comment 4 Eric Fried 2022-08-16 15:16:48 UTC
Hive side is all done here.

Comment 6 Michael Filanov 2022-08-17 14:57:31 UTC
Yes, this one is irrelevant; please follow up on the Jira ticket.

