Description of the problem:

When deploying 4.9/2.4 multi-node spoke clusters with RHACM, the installation times out after 1h while the worker agents are waiting for the control plane. The control plane never comes online, and the installation fails. This issue appears intermittently on different network types: IPv4/IPv6, dual-stack VLAN, single-stack bond.

After developers inspected a live environment, the issue appears to be that a networking problem is preventing the assisted-controller pod, which runs on the host network, from reaching the kube-api on the internal service network. From the logs on one of the masters:

[root@spoke-master-0-0 core]# oc logs assisted-installer-controller1--1-blsrg -n assisted-installer
time="2021-11-30T16:34:56Z" level=info msg="Start running Assisted-Controller. Configuration is:\n struct ControllerConfig {\n\tClusterID: \"d05d7fdd-ec97-44d7-b751-b374dafae622\",\n\tURL: \"https://assisted-service-rhacm.apps.ocp-edge-cluster-assisted-0.qe.lab.redhat.com\",\n\tPullSecretToken: <SECRET>,\n\tSkipCertVerification: false,\n\tCACertPath: \"/etc/assisted-service/service-ca-cert.crt\",\n\tNamespace: \"assisted-installer\",\n\tOpenshiftVersion: \"4.9.9\",\n\tHighAvailabilityMode: \"Full\",\n\tWaitForClusterVersion: true,\n\tMustGatherImage: \"registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5000/openshift-release-dev@sha256:1fd487f1ce9a40c982f9284d7404029e280d503a787e9bb46a8b6a7d2cb64bda\",\n}"
W1130 16:34:56.467008 1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2021/11/30 16:34:56 Failed to create k8 client failed to create runtime client: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host

RHACM snapshot version: quay.io/acm-d/acm-custom-registry:2.4.1-DOWNSTREAM-2021-11-23-15-19-17
OCP version on hub: 4.9.7

Steps to reproduce:

We have been seeing this issue intermittently, so it may not be reproducible every time.
1. Deploy a hub cluster + RHACM
2. Deploy a multi-node spoke cluster via AI

Actual results:

Installation fails after 2 of the master agents are 'Joined', the bootstrap is 'Waiting for controller', and the workers are 'Waiting for control plane'.

[kni@provisionhost-0-0 ~]$ oc get agents -n spoke-0
NAME                                   CLUSTER   APPROVED   ROLE     STAGE
10750fd8-0cf9-4d85-b0a8-fa23230001c6   spoke-0   true       master   Waiting for controller
3e8926a7-b814-4f82-9a31-056183858a44   spoke-0   true       master   Joined
4a0807dd-cf36-4a9f-ab4e-fdbc6a1e9c35   spoke-0   true       worker   Waiting for control plane
8da94ccd-2a36-4092-b0b0-5316b997f5fc   spoke-0   true       worker   Waiting for control plane
daed0544-c65f-4f95-a373-5b0d708e39a3   spoke-0   true       master   Joined

Expected results:

Cluster should install normally.

Additional info:

See relevant Slack thread: https://coreos.slack.com/archives/CUPJTHQ5P/p1638281441015200
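For anyone triaging this, a quick way to confirm the symptom is to try the kube-api service VIP directly from the host network namespace of one of the masters, since that is the path the assisted-controller uses. This is only a sketch: the node name and the 172.30.0.1 VIP are taken from the logs above, and /version is just an arbitrary apiserver path; adjust for your environment.

[root@spoke-master-0-0 core]# curl -ksS --connect-timeout 5 https://172.30.0.1:443/version
# Any HTTP response (even a 401/403 JSON body) means the host can route to the service network;
# a "No route to host" error matches the assisted-controller failure above.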
@aconstan another issue that may require your help
The issue is that the pod network can't be reached from the host, nor from pods running with hostNetwork=true. This causes our controller pod to fail to reach the kube-api, and it looks like a bug in OVN: with SDN this traffic is allowed, and with OVN on 4.9.0 it works too.
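For reference, a minimal reproducer along these lines, independent of the assisted-installer bits, might look like the sketch below. The pod name, namespace, and image are arbitrary (any image with curl works), and it assumes the pod is created by a cluster-admin so SCC admission permits hostNetwork; with hostNetwork: true the container shares the node's network namespace, so it depends on the host being able to route to the service CIDR.

# Create a throwaway host-network pod that just curls the kube-api service VIP:
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: hostnet-check
  namespace: default
spec:
  hostNetwork: true
  restartPolicy: Never
  containers:
  - name: check
    image: registry.access.redhat.com/ubi8/ubi
    command: ["curl", "-ksS", "--connect-timeout", "5", "https://172.30.0.1:443/version"]
EOF

# Once the pod has finished, check its output:
oc logs -n default hostnet-check
# With SDN (or OVN on 4.9.0) this returns a response from the kube-apiserver;
# on the affected OVN builds it fails with "No route to host", as described above.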
@yfirst it looks like we will need access to a live system in order to debug this.
Ok, I will try to reproduce with a live env and notify you accordingly.
@yfirst is there any update on this? Thanks!
Hi, I have not been able to recreate this on a live environment yet.
If you manage to reproduce it, please reopen.
Reproduced on a live environment.
Hub OCP version: 4.9.0-0.nightly-2022-06-08-150705
MCE/ACM version: 2.5.1-DOWNSTREAM-2022-06-09-17-48-38
Spoke OCP version: quay.io/openshift-release-dev/ocp-release:4.9.38-x86_64
4.9 IPv4 connected hub, dual-stack bond spoke.
See the related Slack thread here: https://coreos.slack.com/archives/CUPJTHQ5P/p1655201180120019
The issue actually appears to be connected to this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2095264
*** This bug has been marked as a duplicate of bug 2095264 ***