Bug 2001078

Summary:	bootstrap apiserver fails to ready in IPv6 environment after using cri-o 1.22
Product:	OpenShift Container Platform	Reporter:	Micah Abbott <miabbott>
Component:	Node	Assignee:	Peter Hunt <pehunt>
Node sub component:	CRI-O	QA Contact:	Sunil Choudhary <schoudha>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	urgent
Priority:	unspecified	CC:	aguclu, aos-bugs, miabbott, nagrawal
Version:	4.9
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-09-09 15:32:20 UTC	Type:	Bug

Description Micah Abbott 2021-09-03 17:56:06 UTC

In https://github.com/openshift/installer/pull/5168, we bumped the RHCOS boot images used by the installer, which brought in a number of package changes, including a cri-o bump from 1.21 to 1.22.

The bare metal IPI folks noticed failures in CI jobs involving IPv6 networking that coincided with the RHCOS boot image bump. See https://bugzilla.redhat.com/show_bug.cgi?id=1998643

Sippy search showing the fall off on the jobs - https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22ipv6%22%7D%2C%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22dualstack%22%7D%5D%2C%22linkOperator%22%3A%22or%22%7D

A list of failed CI jobs - https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22ipi-ovn-ipv6%22%7D%2C%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22ipi-ovn-dualstack%22%7D%5D%2C%22linkOperator%22%3A%22or%22%7D&sort=desc&sortField=timestamp

---

In https://github.com/openshift/installer/pull/5180, the boot image bump was reverted, bringing us back to cri-o 1.21.  This improved the success rate of the CI jobs.

We were advised to try another boot image bump that included the most recent cri-o (cri-o-1.22.0-68.rhaos4.9.git011c10a.el8), but testing with that PR continues to show the failure.  See https://github.com/openshift/installer/pull/5192

Investigation has turned up a few error states, but it is not clear whether any of them is the root cause of the failure.

For example:

```
$ crictl logs e6d65fe7c3563
E0902 10:11:30.001696   36592 remote_runtime.go:334] "ContainerStatus from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: permission denied\"" containerID="e6d65fe7c3563"
FATA[0000] rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: permission denied" 
```

Another symptom: after a container fails to start, the pod restart gets stuck in SANDBOX_NOTREADY. @Arda Guclu also noticed that podman-auto-update.service was failing after the update (no clues as to why in the logs, though).

The only thing that comes to mind for the SANDBOX_NOTREADY issue is https://github.com/openshift/machine-config-operator/pull/2210

bootkube is restarted more than once because of timeouts (in the green jobs I checked, the count is 1). The bootstrap apiserver does not become ready before the timeout, so the run fails. Locally, if we wait long enough, it eventually heals itself.


```
Sep 03 16:23:02 localhost kubelet.sh[2826]: I0903 16:23:02.987782    2861 scope.go:110] "RemoveContainer" containerID="6d3f8c2d05b4f2f3eb4b210cfe56500ea8f5bce4b524dad8936f19e10a86506a"
Sep 03 16:23:02 localhost kubelet.sh[2826]: E0903 16:23:02.988613    2861 pod_workers.go:765] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-localhost_kube-system(18ba2e2b46a10fd746523fa6e4ce6d33)\"" pod="kube-system/bootstrap-kube-controller-manager-localhost" podUID=18ba2e2b46a10fd746523fa6e4ce6d33
```
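
When scanning a long kubelet journal for entries like the one above, a small filter can help pick out which containers are in CrashLoopBackOff. The following is only a minimal sketch and not part of the original investigation: the regular expression is an assumption derived from the single sample line above (the kubelet message format is not guaranteed to be stable), and `find_crashloops` is a hypothetical helper name.

```python
import re

# Assumed pattern, based only on the one "Error syncing pod" line quoted above.
# The backslash before the quote is optional so unescaped variants also match.
CRASHLOOP_RE = re.compile(
    r'CrashLoopBackOff: \\?"back-off (?P<backoff>\S+) restarting failed '
    r'container=(?P<container>\S+) pod=(?P<pod>[^(\s]+)\('
)

# The kubelet line from the excerpt above, kept verbatim as a raw string.
SAMPLE = r'''Sep 03 16:23:02 localhost kubelet.sh[2826]: E0903 16:23:02.988613    2861 pod_workers.go:765] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-localhost_kube-system(18ba2e2b46a10fd746523fa6e4ce6d33)\"" pod="kube-system/bootstrap-kube-controller-manager-localhost" podUID=18ba2e2b46a10fd746523fa6e4ce6d33'''


def find_crashloops(lines):
    """Return (container, pod, backoff) tuples for lines reporting CrashLoopBackOff."""
    hits = []
    for line in lines:
        match = CRASHLOOP_RE.search(line)
        if match:
            hits.append(
                (match.group("container"), match.group("pod"), match.group("backoff"))
            )
    return hits


if __name__ == "__main__":
    # Print one summary line per CrashLoopBackOff entry found in the sample.
    for container, pod, backoff in find_crashloops([SAMPLE]):
        print(f"{container} in {pod}: back-off {backoff}")
```

In practice the input would be the lines of a saved journal file rather than the single `SAMPLE` string; lines that do not match the assumed pattern are skipped.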

---


We think we need eyes from networking + API server to help triage the current failures on https://github.com/openshift/installer/pull/5192

Comment 1 Micah Abbott 2021-09-03 19:59:24 UTC

Current working theory is that this may be related to https://github.com/kubernetes/kubernetes/issues/104648 

See https://bugzilla.redhat.com/show_bug.cgi?id=1999133

Comment 3 Peter Hunt 2021-09-09 15:32:20 UTC

yeah, marking this as a dup

*** This bug has been marked as a duplicate of bug 1999133 ***