[sig-arch] Check if alerts are firing during or after upgrade success is failing periodically in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-arch%5D%20Check%20if%20alerts%20are%20firing%20during%20or%20after%20upgrade%20success

This bug is specific to:

alert KubePodNotReady fired for 60 seconds with labels: {namespace="openshift-authentication", pod="oauth-openshift-85d55cb75f-r7mbk", severity="warning"}

Search results:
https://search.ci.openshift.org/?search=alert+KubePodNotReady.*authentication&maxAge=48h&context=1&type=junit&name=4.10.*aws.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
This is failing in 10% of AWS upgrades:
https://search.ci.openshift.org/?search=alert+KubePodNotReady+fired.*authentication&maxAge=48h&context=1&type=junit&name=4.10.*aws.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Bumping priority to high so we can free up CI.
The events for this pod suggest a possible race condition between the kubelet and OVN, where the readiness probes appear to be executed before the network interface has been added to the pod.
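To make that ordering easier to check on a live cluster, here is a minimal client-go sketch (not existing tooling; the namespace and pod name are just the ones from this bug and would need substituting) that prints a pod's events in timestamp order, so "Unhealthy" probe events can be lined up against the sandbox/network events:

// probe_vs_network.go: list the events for a single pod sorted by timestamp,
// so probe-failure events ("Unhealthy") can be compared against network-setup
// events ("AddedInterface" / "FailedCreatePodSandBox").
package main

import (
	"context"
	"fmt"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig; adjust as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Placeholder namespace/pod from this bug report.
	ns, pod := "openshift-authentication", "oauth-openshift-85d55cb75f-r7mbk"
	events, err := client.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{
		FieldSelector: fmt.Sprintf("involvedObject.name=%s", pod),
	})
	if err != nil {
		panic(err)
	}

	sort.Slice(events.Items, func(i, j int) bool {
		return events.Items[i].LastTimestamp.Before(&events.Items[j].LastTimestamp)
	})
	for _, e := range events.Items {
		fmt.Printf("%s %-7s %-25s %s\n", e.LastTimestamp.Format("15:04:05"), e.Type, e.Reason, e.Message)
	}
}

If the probe failures consistently predate the interface-add/sandbox events, that supports the race-condition theory; events do get aged out, so this only works on a recent occurrence.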
Examining another occurrence of this in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade/1446112191311777792

Presumably similar to what Michal was looking at:

25m  Warning  FailedCreatePodSandBox  pod/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal_openshift-kube-apiserver_a004db65-7082-4872-8275-57e32bafa212_0(40ac4a7dfb35737489489754b8f9036bfaaebd2b992e99d1194860dfa5ecc4bd): error adding pod openshift-kube-apiserver_revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal to CNI network "multus-cni-network": [openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal 40ac4a7dfb35737489489754b8f9036bfaaebd2b992e99d1194860dfa5ecc4bd] [openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal 40ac4a7dfb35737489489754b8f9036bfaaebd2b992e99d1194860dfa5ecc4bd] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
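For context on the "timed out waiting for annotations" part: the CNI side of ovn-kubernetes waits for ovnkube-master to write the pod-networks annotation before it can wire up the sandbox. A rough, simplified sketch of that wait is below, assuming the k8s.ovn.org/pod-networks annotation key and an illustrative helper name; this is not the actual ovn-kubernetes source.

// Sketch of the wait that produces "timed out waiting for annotations": poll
// the pod object until ovnkube-master has written the pod-networks annotation,
// bounded by a context deadline. Illustrative only.
package cniwait

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const podNetworksAnnotation = "k8s.ovn.org/pod-networks" // annotation set by ovnkube-master

func waitForPodAnnotation(ctx context.Context, client kubernetes.Interface, ns, name string) (string, error) {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()
	for {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err == nil {
			if v, ok := pod.Annotations[podNetworksAnnotation]; ok {
				return v, nil // master has assigned IPs/MAC; the CNI ADD can proceed
			}
		}
		select {
		case <-ctx.Done():
			// This is the failure mode reported in the sandbox-creation event above.
			return "", fmt.Errorf("timed out waiting for annotations: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

So the sandbox failure here is a symptom: the annotation never showed up because ovnkube-master never managed to set it, which is what the master logs below show.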
I1007 16:00:30.875960 1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 25 items received
E1007 16:00:30.876101 1 kube.go:76] Error in setting annotation on pod openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal: Patch "https://api-int.ci-op-1shyf3hw-effab.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-apiserver/pods/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal": read tcp 10.0.255.155:49312->10.0.140.33:6443: read: connection reset by peer
I1007 16:00:30.876189 1 pods.go:370] Released IPs: 10.130.0.6 for node: ip-10-0-191-10.us-west-2.compute.internal

I guess that's not surprising: the apiserver wasn't working at this point in the upgrade, so ovnkube-master got an error. What is surprising is why it didn't retry the pod creation, which it's supposed to do; addLogicalPort() prints the "Released IPs" message, so we know the defer() cleaned up, but the error should also get returned to the caller.
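To make that expectation concrete, here is a hedged sketch of the pattern being described, with invented helper names (allocateIPs, setPodAnnotation, releaseIPs), not the real addLogicalPort(): the defer releases the IPs when the function fails, and the error has to propagate back to the caller so the pod add is retried.

// Illustrative sketch of the cleanup/retry contract, not ovn-kubernetes code.
package ovnsketch

import (
	"errors"
	"fmt"
)

// Stand-ins for the real allocator / annotation code (hypothetical).
func allocateIPs(pod string) ([]string, error)       { return []string{"10.130.0.6"}, nil }
func releaseIPs(ips []string)                        {}
func setPodAnnotation(pod string, ips []string) error { return errors.New("connection reset by peer") }

func addLogicalPortSketch(pod string) (err error) {
	ips, err := allocateIPs(pod)
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			// Mirrors the "Released IPs: ..." log line: cleanup runs on failure.
			fmt.Printf("Released IPs: %v for pod: %s\n", ips, pod)
			releaseIPs(ips)
		}
	}()

	// Patching the pod annotation fails while the apiserver is unavailable
	// (as during the upgrade above); the error must be returned so the caller
	// re-queues the pod add, otherwise the pod never gets its interface.
	if err = setPodAnnotation(pod, ips); err != nil {
		return fmt.Errorf("error in setting annotation on pod %s: %w", pod, err)
	}
	return nil
}

The logs show the cleanup half of this happening ("Released IPs"), so the open question is whether the error made it back to the caller's retry path or got swallowed somewhere along the way.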
Based on the original ci search link in this bug report, I do not think the merged fix has helped the situation.
This may be related to, and fixed by the same PR as, https://bugzilla.redhat.com/show_bug.cgi?id=2013222
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056