2093057 – ovnkube-master statefulset doesn't roll out in nested Hypershift setup

Bug 2093057 - ovnkube-master statefulset doesn't roll out in nested Hypershift setup

Summary: ovnkube-master statefulset doesn't roll out in nested Hypershift setup

Keywords:
Status:	CLOSED DUPLICATE of bug 2096456
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Patryk Diak
QA Contact:	Anurag saxena
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2022-06-02 20:04 UTC by aaleman
Modified:	2022-09-07 09:43 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-09-07 09:43:07 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 1512	0	None	open	[release-4.11] Bug 2096456: Add init container to ensure that Status.podIP is set before postStart hooks run	2022-07-20 08:59:15 UTC

Description aaleman 2022-06-02 20:04:35 UTC

Description of problem:

When creating a HA 4.11 Hypershift cluster as guest of a 4.10 management cluster, the ovnkube-master Statefulset doesn't seem to properly rollout. The first replica eventually transitions to `Running` after some time and failures, but the other two replicas are stuck in `ContainerCreating`:

```
ovnkube-master-0                                  6/6     Running             2 (3m7s ago)   6m49s
ovnkube-master-1                                  0/6     ContainerCreating   0              6m49s
ovnkube-master-2                                  0/6     ContainerCreating   0              6m49s
```

From what I can tell, this seems to be caused by outbound traffic from these pods not working at all. Due to the pod being stuck in ContainerCreating (presumably caused by their PostStartHook not completing?), `oc exec` does not work and instead `oc debug node` has to be used to get to the node, then `crictl` to get into the container:

```
oc debug node/ip-10-0-165-48.ec2.internal
chroot /host
crictl ps|grep nbdb
72f147f01a722       594b3aa6152048a605704680672a05aceedf458a01ab54e2b462e3a9189777f2
crictl exec -it 72f147f01a722 bash

# Curl another pod
curl https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641 -k  -v --connect-timeout 5
* Rebuilt URL to: https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641/
* Resolving timed out after 5000 milliseconds
* Could not resolve host: ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local
* Closing connection 0
curl: (28) Resolving timed out after 5000 milliseconds

# Try to curl a public google.com server to check if DNS is the issue
curl https://172.217.1.14 -k -v --connect-timeout 5 
* Rebuilt URL to: https://172.217.1.14/
*   Trying 172.217.1.14...
* TCP_NODELAY set
* Connection timed out after 5000 milliseconds
* Closing connection 0
curl: (28) Connection timed out after 5000 millisecond
```

Executing the same commands in a different pod works perfectly fine, so the issue seems to be somehow specific to the ovnkube-master pods:

```
 k exec -it kube-controller-manager-69d46c5f58-9745r bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "kube-controller-manager" out of: kube-controller-manager, token-minter, availability-prober (init)

# Curl to pod errors due to invalid protocol, but reaches the endpoint
bash-4.4$ curl https://ovnkube-master-0.ovnkube-master-internal.clusters-alvaro-test.svc.cluster.local:9641 -k    
curl: (56) OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0

# Curl to the public google.com server using it's IP also works
 curl https://172.217.1.14 -k --connect-timeout 5 
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML
```

Some details about the management cluster that might or might not be relevant:
* It is a 4.10 Hypershift cluster with SDN
* It uses a custom --service-cidr
* The isuse is not reproducible when using a standard self-hosted 4.10 cluster as management cluster


Version-Release number of selected component (if applicable):

Management cluster uses quay.io/openshift-release-dev/ocp-release:4.10.17-x86_64
Guest cluster uses registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-06-02-09270


How reproducible:

100%



Actual results:


Expected results:


Additional info:

I can provide access to a management cluster where this happens to aid in further debugging

Comment 2 Patryk Diak 2022-09-07 09:43:07 UTC


*** This bug has been marked as a duplicate of bug 2096456 ***

Note You need to log in before you can comment on or make changes to this bug.