Bug 2001078

Summary: bootstrap apiserver fails to become ready in IPv6 environments after moving to cri-o 1.22
Product: OpenShift Container Platform
Component: Node
Sub component: CRI-O
Version: 4.9
Reporter: Micah Abbott <miabbott>
Assignee: Peter Hunt <pehunt>
QA Contact: Sunil Choudhary <schoudha>
CC: aguclu, aos-bugs, miabbott, nagrawal
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-09-09 15:32:20 UTC

Description Micah Abbott 2021-09-03 17:56:06 UTC
In https://github.com/openshift/installer/pull/5168, we bumped the RHCOS boot images used by the installer which brought in a number of package changes, including a cri-o bump from 1.21 -> 1.22

The bare metal IPI folks noticed failures in CI jobs involving IPv6 networking that coincided with the RHCOS boot image bump.  See https://bugzilla.redhat.com/show_bug.cgi?id=1998643

Sippy search showing the fall-off in the jobs - https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22ipv6%22%7D%2C%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22dualstack%22%7D%5D%2C%22linkOperator%22%3A%22or%22%7D

A list of failed CI jobs - https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22ipi-ovn-ipv6%22%7D%2C%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22ipi-ovn-dualstack%22%7D%5D%2C%22linkOperator%22%3A%22or%22%7D&sort=desc&sortField=timestamp


In https://github.com/openshift/installer/pull/5180, the boot image bump was reverted, bringing us back to cri-o 1.21.  This improved the success rate of the CI jobs.

We were advised to try another boot image bump that included the most recent cri-o (cri-o-1.22.0-68.rhaos4.9.git011c10a.el8), but testing with that PR continues to show the failure.  See https://github.com/openshift/installer/pull/5192

Investigation into the failure has turned up a few error states, but it is not entirely clear whether any of them is the root cause.

For example:

$ crictl logs e6d65fe7c3563
E0902 10:11:30.001696   36592 remote_runtime.go:334] "ContainerStatus from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: permission denied\"" containerID="e6d65fe7c3563"
FATA[0000] rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: permission denied" 
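The "permission denied" above comes from crictl failing to open the CRI-O socket, which points at either socket ownership/mode or an SELinux denial rather than a runtime crash. A read-only diagnostic sketch for a node hitting this (the socket path is the stock CRI-O default; the script degrades to a message on hosts without CRI-O):

```shell
#!/usr/bin/env bash
# Check the CRI-O socket that crictl is failing to dial.
SOCK=/var/run/crio/crio.sock

if [ -S "$SOCK" ]; then
  # Owner, group, mode, and SELinux context: a non-root crictl fails exactly
  # like the log above if it cannot open this socket.
  ls -lZ "$SOCK"
  # On SELinux-enforcing hosts (the RHCOS default), a denial leaves an AVC
  # record in the audit log.
  ausearch -m avc -ts recent 2>/dev/null | grep -i crio \
    || echo "no recent crio AVC records (or audit tooling unavailable)"
else
  echo "no CRI-O socket at $SOCK (not a CRI-O node, or crio is not running)"
fi
```

If the socket permissions look sane but SELinux shows AVC denials, the cri-o 1.21 -> 1.22 package change becomes a plausible trigger, since the new package may ship different labels.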

Another symptom: after a container fails to start, the pod restart gets stuck in SANDBOX_NOTREADY. @Arda Guclu also noticed that podman-auto-update.service was failing after the update (no clues as to why in the logs, though).
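Both symptoms above can be inspected directly from the node with read-only commands (a sketch; it assumes crictl and journalctl are present and prints a message when they are not):

```shell
#!/usr/bin/env bash
# List pod sandboxes stuck in NotReady, then pull recent logs from the
# failing podman-auto-update unit. Both checks are read-only.
if command -v crictl >/dev/null 2>&1; then
  crictl pods --state NotReady
else
  echo "crictl not installed on this host"
fi

if command -v journalctl >/dev/null 2>&1; then
  # Current boot only, to match the post-update window described above.
  journalctl -u podman-auto-update.service -b --no-pager 2>/dev/null | tail -n 20
else
  echo "journalctl not available"
fi
DONE=yes
```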

`The only thing that comes to my mind for the SANDBOX_NOTREADY issue is https://github.com/openshift/machine-config-operator/pull/2210`

`bootkube is restarted more than once due to timeouts (in the green jobs I checked, it restarts only once). The bootstrap apiserver is not ready before the timeout, so bootkube fails. Locally, if we wait long enough, it eventually heals itself.`

Sep 03 16:23:02 localhost kubelet.sh[2826]: I0903 16:23:02.987782    2861 scope.go:110] "RemoveContainer" containerID="6d3f8c2d05b4f2f3eb4b210cfe56500ea8f5bce4b524dad8936f19e10a86506a"
Sep 03 16:23:02 localhost kubelet.sh[2826]: E0903 16:23:02.988613    2861 pod_workers.go:765] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-localhost_kube-system(18ba2e2b46a10fd746523fa6e4ce6d33)\"" pod="kube-system/bootstrap-kube-controller-manager-localhost" podUID=18ba2e2b46a10fd746523fa6e4ce6d33
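When a bootstrap pod crash-loops like this, the exit state and final log lines of the previous container attempt are usually the fastest signal. A sketch using the container ID from the kubelet log above (on a live node the current ID would come from `crictl ps -a`, and the container may already have been garbage-collected):

```shell
#!/usr/bin/env bash
# Inspect a crash-looping container by ID; read-only. The ID below is the one
# from the kubelet log above and will differ on any other node.
CID=6d3f8c2d05b4f2f3eb4b210cfe56500ea8f5bce4b524dad8936f19e10a86506a

if command -v crictl >/dev/null 2>&1; then
  # Exit code and finish time of the last attempt:
  crictl inspect "$CID" 2>/dev/null | grep -E '"exitCode"|"finishedAt"' \
    || echo "container $CID not found (already garbage-collected?)"
  # Its final log lines, if the container still exists:
  crictl logs --tail 20 "$CID" 2>/dev/null || true
else
  echo "crictl not installed on this host"
fi
```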


We think we need eyes from the networking and API server teams to help triage the current failures on https://github.com/openshift/installer/pull/5192

Comment 1 Micah Abbott 2021-09-03 19:59:24 UTC
Current working theory is that this may be related to https://github.com/kubernetes/kubernetes/issues/104648 

See https://bugzilla.redhat.com/show_bug.cgi?id=1999133

Comment 3 Peter Hunt 2021-09-09 15:32:20 UTC
yeah, marking this as a dup

*** This bug has been marked as a duplicate of bug 1999133 ***