Description of problem:

When creating an HA 4.11 Hypershift cluster as a guest of a 4.10 management cluster, the ovnkube-master StatefulSet does not roll out properly. The first replica eventually transitions to `Running` after some time and a couple of restarts, but the other two replicas are stuck in `ContainerCreating`:

```
ovnkube-master-0   6/6   Running             2 (3m7s ago)   6m49s
ovnkube-master-1   0/6   ContainerCreating   0              6m49s
ovnkube-master-2   0/6   ContainerCreating   0              6m49s
```

From what I can tell, this is caused by outbound traffic from these pods not working at all. Because the pods are stuck in ContainerCreating (presumably because their PostStartHook never completes?), `oc exec` does not work; instead `oc debug node` has to be used to get onto the node, and then `crictl` to get into the container:

```
oc debug node/ip-10-0-165-48.ec2.internal
chroot /host
crictl ps | grep nbdb
72f147f01a722       594b3aa6152048a605704680672a05aceedf458a01ab54e2b462e3a9189777f2
crictl exec -it 72f147f01a722 bash

# Curl another pod
curl https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641 -k -v --connect-timeout 5
* Rebuilt URL to: https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641/
* Resolving timed out after 5000 milliseconds
* Could not resolve host: ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local
* Closing connection 0
curl: (28) Resolving timed out after 5000 milliseconds

# Try to curl a public google.com server by IP to check whether DNS is the only issue
curl https://172.217.1.14 -k -v --connect-timeout 5
* Rebuilt URL to: https://172.217.1.14/
*   Trying 172.217.1.14...
* TCP_NODELAY set
* Connection timed out after 5000 milliseconds
* Closing connection 0
curl: (28) Connection timed out after 5000 milliseconds
```

Executing the same commands in a different pod works perfectly fine, so the issue seems to be specific to the ovnkube-master pods:

```
k exec -it kube-controller-manager-69d46c5f58-9745r bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "kube-controller-manager" out of: kube-controller-manager, token-minter, availability-prober (init)

# Curl to the pod fails with a TLS error (client certificate required), but it does reach the endpoint
bash-4.4$ curl https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641 -k
curl: (56) OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0

# Curl to the public google.com server using its IP also works
curl https://172.217.1.14 -k --connect-timeout 5
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
```

Some details about the management cluster that might or might not be relevant:

* It is a 4.10 Hypershift cluster with SDN
* It uses a custom --service-cidr
* The issue is not reproducible when using a standard self-hosted 4.10 cluster as the management cluster


Version-Release number of selected component (if applicable):

Management cluster uses quay.io/openshift-release-dev/ocp-release:4.10.17-x86_64
Guest cluster uses registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-06-02-09270


How reproducible:

100%


Actual results:


Expected results:


Additional info:

I can provide access to a management cluster where this happens to aid in further debugging.
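For anyone picking this up, here is a rough sketch of commands that could be used to gather more data. This assumes the guest control plane runs in the `clusters-alvaro-test` namespace on the management cluster (as suggested by the service hostname above); availability of these tools inside the nbdb container is an assumption on my part:

```
# On the management cluster: check why the stuck replicas never leave ContainerCreating
# (the pod events should show whether the PostStartHook is the blocker)
oc -n clusters-alvaro-test describe pod ovnkube-master-1
oc -n clusters-alvaro-test get events --sort-by=.lastTimestamp | grep ovnkube-master

# Inside the nbdb container (reached via `oc debug node` + crictl as above):
# confirm which resolver the pod is using before re-testing name resolution
cat /etc/resolv.conf
curl -k -v --connect-timeout 5 https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641
```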
*** This bug has been marked as a duplicate of bug 2096456 ***