Bug 2093057

Summary: ovnkube-master statefulset doesn't roll out in nested Hypershift setup
Product: OpenShift Container Platform
Component: Networking
Sub component: ovn-kubernetes
Version: 4.11
Reporter: aaleman
Assignee: Patryk Diak <pdiak>
QA Contact: Anurag saxena <anusaxen>
CC: murali, pdiak
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2022-09-07 09:43:07 UTC

Description aaleman 2022-06-02 20:04:35 UTC
Description of problem:

When creating an HA 4.11 Hypershift cluster as a guest of a 4.10 management cluster, the ovnkube-master StatefulSet does not roll out properly. The first replica eventually transitions to `Running` after some time and a few failures, but the other two replicas remain stuck in `ContainerCreating`:

```
ovnkube-master-0                                  6/6     Running             2 (3m7s ago)   6m49s
ovnkube-master-1                                  0/6     ContainerCreating   0              6m49s
ovnkube-master-2                                  0/6     ContainerCreating   0              6m49s
```
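
For anyone reproducing this, the pod events can be checked with something like the following (a sketch, not output from this cluster; the namespace is inferred from the service DNS name quoted below and may differ in other setups):

```
# Inspect why the replicas are stuck; namespace inferred from the service DNS name
oc -n clusters-alvaro-test describe pod ovnkube-master-1
oc -n clusters-alvaro-test get events --sort-by=.lastTimestamp | grep ovnkube-master
```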

From what I can tell, this is caused by outbound traffic from these pods not working at all. Because the pods are stuck in `ContainerCreating` (presumably because their postStart hooks never complete?), `oc exec` does not work; instead, `oc debug node` has to be used to get onto the node, followed by `crictl` to get into the container:

```
oc debug node/ip-10-0-165-48.ec2.internal
chroot /host
crictl ps|grep nbdb
72f147f01a722       594b3aa6152048a605704680672a05aceedf458a01ab54e2b462e3a9189777f2
crictl exec -it 72f147f01a722 bash

# Curl another pod
curl https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641 -k  -v --connect-timeout 5
* Rebuilt URL to: https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641/
* Resolving timed out after 5000 milliseconds
* Could not resolve host: ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local
* Closing connection 0
curl: (28) Resolving timed out after 5000 milliseconds

# Try to curl a public google.com server to check if DNS is the issue
curl https://172.217.1.14 -k -v --connect-timeout 5 
* Rebuilt URL to: https://172.217.1.14/
*   Trying 172.217.1.14...
* TCP_NODELAY set
* Connection timed out after 5000 milliseconds
* Closing connection 0
curl: (28) Connection timed out after 5000 millisecond
```
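
For completeness, a couple of extra checks that could be run from the same `crictl exec` shell to confirm the pod-level network config looks sane (a sketch only; the exact tools available inside the ovnkube image may vary):

```
cat /etc/resolv.conf   # should point at the cluster DNS service
ip addr show eth0      # confirm the pod has an address on its primary interface
ip route               # confirm a default route exists
```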

Executing the same commands from a different pod works perfectly fine, so the issue seems to be specific to the ovnkube-master pods:

```
 k exec -it kube-controller-manager-69d46c5f58-9745r bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "kube-controller-manager" out of: kube-controller-manager, token-minter, availability-prober (init)

# Curl to the pod fails at the TLS layer (the server requires a client certificate), but it reaches the endpoint
bash-4.4$ curl https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641 -k    
curl: (56) OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0

# Curl to the public google.com server using its IP also works
 curl https://172.217.1.14 -k --connect-timeout 5 
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
```

Some details about the management cluster that might or might not be relevant (a quick way to confirm the first two is sketched after the list):
* It is a 4.10 Hypershift cluster with SDN
* It uses a custom --service-cidr
* The issue is not reproducible when using a standard self-hosted 4.10 cluster as the management cluster
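
For reference, the network plugin and service CIDR of the management cluster can be confirmed with something like the following (standard fields on the `network.config.openshift.io/cluster` object; not output from this cluster):

```
oc get network.config/cluster -o jsonpath='{.spec.networkType}{"\n"}'
oc get network.config/cluster -o jsonpath='{.spec.serviceNetwork}{"\n"}'
```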


Version-Release number of selected component (if applicable):

Management cluster uses quay.io/openshift-release-dev/ocp-release:4.10.17-x86_64
Guest cluster uses registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-06-02-09270


How reproducible:

100%



Actual results:

ovnkube-master-1 and ovnkube-master-2 remain stuck in ContainerCreating; only ovnkube-master-0 eventually reaches Running.

Expected results:

All three ovnkube-master replicas roll out and reach Running.

Additional info:

I can provide access to a management cluster where this happens, to aid in further debugging.

Comment 2 Patryk Diak 2022-09-07 09:43:07 UTC

*** This bug has been marked as a duplicate of bug 2096456 ***