Bug 2028003 - Multi-node spoke cluster deployments fail intermittently due to failure to reach kube-API on internal network
Summary: Multi-node spoke cluster deployments fail intermittently due to failure to reach kube-API on internal network
Keywords:
Status: CLOSED DUPLICATE of bug 2095264
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Infrastructure Operator
Version: rhacm-2.4.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Mat Kowalski
QA Contact: bjacot
Docs Contact: Derek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-01 09:42 UTC by Yona First
Modified: 2022-07-25 12:32 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-25 12:32:35 UTC
Target Upstream Version:
Embargoed:
Flags: yuhe: needinfo+




Links
Github open-cluster-management backlog issue 18296 (last updated 2021-12-05 21:27:18 UTC)
Red Hat Issue Tracker MGMTBUGSM-177 (last updated 2022-03-10 02:26:54 UTC)

Description Yona First 2021-12-01 09:42:24 UTC
Description of the problem:

When deploying multi-node spoke clusters (OCP 4.9 / RHACM 2.4) with RHACM, the installation times out after 1h while the worker agents are waiting for the control plane. The control plane never comes online, and the installation fails.

This issue appears intermittently across different network types: IPv4/IPv6, dual-stack VLAN, single-stack bond.

After developers inspected a live environment, the issue appears to be a networking problem that prevents the assisted-controller pod, which runs on the host network, from reaching the kube-API service on the internal (service) network.

From the logs on one of the masters:

[root@spoke-master-0-0 core]# oc logs assisted-installer-controller1--1-blsrg -n assisted-installer
time="2021-11-30T16:34:56Z" level=info msg="Start running Assisted-Controller. Configuration is:\n struct ControllerConfig {\n\tClusterID: \"d05d7fdd-ec97-44d7-b751-b374dafae622\",\n\tURL: \"https://assisted-service-rhacm.apps.ocp-edge-cluster-assisted-0.qe.lab.redhat.com\",\n\tPullSecretToken: <SECRET>,\n\tSkipCertVerification: false,\n\tCACertPath: \"/etc/assisted-service/service-ca-cert.crt\",\n\tNamespace: \"assisted-installer\",\n\tOpenshiftVersion: \"4.9.9\",\n\tHighAvailabilityMode: \"Full\",\n\tWaitForClusterVersion: true,\n\tMustGatherImage: \"registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/openshift-release-dev@sha256:1fd487f1ce9a40c982f9284d7404029e280d503a787e9bb46a8b6a7d2cb64bda\",\n}"
W1130 16:34:56.467008       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2021/11/30 16:34:56 Failed to create k8 client failed to create runtime client: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host
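The "no route to host" above is the host-networked controller failing to reach the kube-API through the default service VIP (172.30.0.1). A quick way to confirm this class of failure on a live environment is to compare the service-network path with a direct connection to the API server; the commands below are a sketch, not taken from this report, and the node IP is a placeholder:

# On one of the spoke masters, in the host network namespace:
ip route get 172.30.0.1                                       # route the host would use for the service VIP
curl -k --connect-timeout 5 https://172.30.0.1:443/version    # kube-API via the service network (the failing path)
# For comparison, reach the API server directly on a control-plane node IP, bypassing the service network:
curl -k --connect-timeout 5 https://<master-node-ip>:6443/version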

RHACM snapshot version: quay.io/acm-d/acm-custom-registry:2.4.1-DOWNSTREAM-2021-11-23-15-19-17
OCP version on hub: 4.9.7

Steps to reproduce:

We have been seeing this issue intermittently, so it may not be reproducible every time.   

1. Deploy a hub cluster + RHACM
2. Deploy a multi-node spoke cluster via the Assisted Installer (a minimal sketch of the relevant CR follows below)
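For reference, step 2 boils down to creating the usual Assisted Installer CRs on the hub (ClusterDeployment, InfraEnv, AgentClusterInstall, ...). Below is a minimal AgentClusterInstall sketch; the names, CIDRs and VIPs are placeholders and are not taken from the failing environment:

apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  name: spoke-0
  namespace: spoke-0
spec:
  clusterDeploymentRef:
    name: spoke-0
  imageSetRef:
    name: openshift-v4.9
  networking:
    networkType: OVNKubernetes          # the CNI this bug was observed with
    clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
    serviceNetwork:
      - 172.30.0.0/16                   # the 172.30.0.1 VIP in the controller log comes from this range
    machineNetwork:
      - cidr: 192.168.123.0/24
  apiVIP: 192.168.123.5
  ingressVIP: 192.168.123.10
  provisionRequirements:
    controlPlaneAgents: 3               # multi-node spoke: 3 masters plus workers
    workerAgents: 2
  sshPublicKey: "ssh-rsa AAAA... user@host"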

Actual results:

Installation fails after 2 of the master agents are 'Joined', the bootstrap master is stuck at 'Waiting for controller', and the workers are stuck at 'Waiting for control plane'.

[kni@provisionhost-0-0 ~]$ oc get agents -n spoke-0 
NAME                                   CLUSTER   APPROVED   ROLE     STAGE
10750fd8-0cf9-4d85-b0a8-fa23230001c6   spoke-0   true       master   Waiting for controller
3e8926a7-b814-4f82-9a31-056183858a44   spoke-0   true       master   Joined
4a0807dd-cf36-4a9f-ab4e-fdbc6a1e9c35   spoke-0   true       worker   Waiting for control plane
8da94ccd-2a36-4092-b0b0-5316b997f5fc   spoke-0   true       worker   Waiting for control plane
daed0544-c65f-4f95-a373-5b0d708e39a3   spoke-0   true       master   Joined
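When the installation hangs in this state, a few additional commands help confirm whether it is this controller-connectivity problem (a sketch; namespaces and names follow the output above):

# On the hub cluster: overall install status, conditions and requirements
oc get agentclusterinstall -n spoke-0 -o yaml
# Cluster events recorded in the spoke's namespace on the hub
oc get events -n spoke-0 --sort-by=.lastTimestamp
# On a spoke master (SSH/console): find and dump the controller log shown in the description
oc get pods -n assisted-installer
oc logs -n assisted-installer <assisted-installer-controller-pod>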


Expected results:

Cluster should install normally.

Additional info:

See relevant slack thread: https://coreos.slack.com/archives/CUPJTHQ5P/p1638281441015200

Comment 2 Michael Filanov 2021-12-01 09:55:11 UTC
@aconstan another issue that may require your help

Comment 4 Igal Tsoiref 2021-12-02 15:31:22 UTC
The issue is that the pod network cannot be reached from the host, or from pods running with hostNetwork=true. This causes our controller pod to fail to reach the kube-API. It looks like a bug in OVN: with SDN this traffic is allowed, and with OVN on 4.9.0 it works as well.
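Both halves of that can be checked quickly on an affected spoke; the commands below are a sketch (the pod IP is a placeholder) and are not part of the original comment:

# Confirm the spoke is running OVN-Kubernetes rather than OpenShift SDN
oc get network.config cluster -o jsonpath='{.status.networkType}{"\n"}'
# From a master's host network namespace, try to reach an arbitrary pod IP on the cluster network
oc get pods -n openshift-dns -o wide          # dns-default pods have cluster-network IPs
ping -c3 <pod-ip>                             # fails from the host in the broken state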

Comment 11 Michael Filanov 2022-01-15 13:04:46 UTC
@yfirst looks like we will need to see a live system to debug it

Comment 12 Yona First 2022-01-16 09:49:32 UTC
Ok, I will try to reproduce with a live env and notify you accordingly.

Comment 14 Yuanyuan He 2022-03-10 02:20:25 UTC
@

Comment 15 Yuanyuan He 2022-03-10 02:21:02 UTC
@

Comment 16 Yuanyuan He 2022-03-10 02:22:12 UTC
@yfirst is there any update on this? Thanks!

Comment 17 Yona First 2022-03-10 09:42:10 UTC
Hi, I still have not been able to recreate this on a live env.

Comment 18 Rom Freiman 2022-03-30 07:30:12 UTC
If you manage to reproduce it, please reopen.

Comment 19 Yona First 2022-06-14 10:33:42 UTC
Reproduced on live env.

Hub OCP version: 4.9.0-0.nightly-2022-06-08-150705
MCE/ACM version: 2.5.1-DOWNSTREAM-2022-06-09-17-48-38
Spoke OCP version: quay.io/openshift-release-dev/ocp-release:4.9.38-x86_64

4.9 IPv4 connected hub, dualstack bond spoke.

Comment 22 Yona First 2022-06-14 15:21:44 UTC
See related slack thread here: https://coreos.slack.com/archives/CUPJTHQ5P/p1655201180120019

Issue appears to actually be connected to this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2095264

Comment 23 Michael Filanov 2022-07-25 12:32:35 UTC

*** This bug has been marked as a duplicate of bug 2095264 ***

