In https://github.com/openshift/installer/pull/5168, we bumped the RHCOS boot images used by the installer, which brought in a number of package changes, including a cri-o bump from 1.21 -> 1.22.

The bare metal IPI folks noticed failures in the CI jobs involving IPv6 networking that coincided with the RHCOS boot image bump. See https://bugzilla.redhat.com/show_bug.cgi?id=1998643

Sippy search showing the fall-off in those jobs:
https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22ipv6%22%7D%2C%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22dualstack%22%7D%5D%2C%22linkOperator%22%3A%22or%22%7D

A list of failed CI jobs:
https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22ipi-ovn-ipv6%22%7D%2C%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22ipi-ovn-dualstack%22%7D%5D%2C%22linkOperator%22%3A%22or%22%7D&sort=desc&sortField=timestamp

---

In https://github.com/openshift/installer/pull/5180, the boot image bump was reverted, bringing us back to cri-o 1.21. This improved the success rate of the CI jobs.

We were advised to try another boot image bump that included the most recent cri-o (cri-o-1.22.0-68.rhaos4.9.git011c10a.el8), but testing with that PR continues to show the failure. See https://github.com/openshift/installer/pull/5192

Investigation into the root cause has turned up a few error states, but it is not entirely clear whether any of them is actually responsible. For example, crictl cannot even reach the cri-o socket:

```
$ crictl logs e6d65fe7c3563
E0902 10:11:30.001696 36592 remote_runtime.go:334] "ContainerStatus from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: permission denied\"" containerID="e6d65fe7c3563"
FATA[0000] rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: permission denied"
```

Another symptom: after a container fails to start, the pod restart gets stuck in SANDBOX_NOTREADY. @Arda Guclu also noticed that podman-auto-update.service was failing after the update (no clues as to why in the logs, though).

> The only thing that comes to my mind for the SANDBOX_NOTREADY issue is https://github.com/openshift/machine-config-operator/pull/2210
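None of this is confirmed as the root cause yet, but for anyone reproducing on a node, this is roughly the set of checks we have been running. It is a sketch only: `<pod-id>` is a placeholder, and the SELinux/AVC angle is just a guess at why the socket connect would be denied.

```
# Run as root on an affected node.

# Who owns the cri-o socket and what SELinux label does it carry?
# "connect: permission denied" usually means either file permissions or an AVC denial.
ls -laZ /var/run/crio/crio.sock

# Any recent SELinux denials mentioning cri-o?
ausearch -m AVC -ts recent | grep -i crio

# Find sandboxes stuck in NotReady and inspect one of them.
crictl pods | grep NotReady
crictl inspectp <pod-id> | head -n 40

# The failing podman-auto-update.service, for whatever hints it holds.
systemctl status podman-auto-update.service --no-pager
journalctl -u podman-auto-update.service --no-pager | tail -n 50
```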
> bootkube is restarted more than once because it hits a timeout (when I checked the green jobs, it is restarted only once). The bootstrap apiserver is not ready before the timeout, and so it fails. On our local runs, if we wait long enough, it eventually heals itself.

While this is happening, the kubelet on the bootstrap node shows the bootstrap kube-controller-manager crash-looping:

```
Sep 03 16:23:02 localhost kubelet.sh[2826]: I0903 16:23:02.987782 2861 scope.go:110] "RemoveContainer" containerID="6d3f8c2d05b4f2f3eb4b210cfe56500ea8f5bce4b524dad8936f19e10a86506a"
Sep 03 16:23:02 localhost kubelet.sh[2826]: E0903 16:23:02.988613 2861 pod_workers.go:765] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-localhost_kube-system(18ba2e2b46a10fd746523fa6e4ce6d33)\"" pod="kube-system/bootstrap-kube-controller-manager-localhost" podUID=18ba2e2b46a10fd746523fa6e4ce6d33
```

---

We think we need eyes from the networking and API server teams to help triage the current failures on https://github.com/openshift/installer/pull/5192
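In the meantime, here is a rough sketch of what we check on the bootstrap node to see whether the bootstrap control plane is coming up. The unit names, the 6443 port, and the NRestarts property are assumptions about a standard IPI bootstrap host, not part of any confirmed diagnosis.

```
# Run on the bootstrap node.

# Has bootkube been restarted more than once this boot?
systemctl show -p NRestarts bootkube.service
journalctl -b -u bootkube.service --no-pager | tail -n 40

# Is the bootstrap kube-apiserver container running, and is it answering?
# (curl may return 401/403 without credentials; any HTTP response at all
#  still shows the apiserver is up and listening.)
crictl ps -a --name kube-apiserver
curl -ks https://localhost:6443/readyz; echo

# The crash-looping bootstrap kube-controller-manager, for comparison.
crictl ps -a --name kube-controller-manager
```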
---

Current working theory is that this may be related to https://github.com/kubernetes/kubernetes/issues/104648

See https://bugzilla.redhat.com/show_bug.cgi?id=1999133
---

Yeah, marking this as a dup.

*** This bug has been marked as a duplicate of bug 1999133 ***