Bug 2011386 - [sig-arch] Check if alerts are firing during or after upgrade success --- alert KubePodNotReady fired for 60 seconds with labels
Summary: [sig-arch] Check if alerts are firing during or after upgrade success --- alert KubePodNotReady fired for 60 seconds with labels
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-06 14:17 UTC by Devan Goodwin
Modified: 2022-03-10 16:17 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:17:19 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/ovn-kubernetes pull 787 (Merged): "Bug 2011386: pods: fix overwriting returned error from defer()" (last updated 2022-02-08 18:01:29 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:17:44 UTC)

Description Devan Goodwin 2021-10-06 14:17:42 UTC
[sig-arch] Check if alerts are firing during or after upgrade success

is failing periodically in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-arch%5D%20Check%20if%20alerts%20are%20firing%20during%20or%20after%20upgrade%20success

This bug is specific to: alert KubePodNotReady fired for 60 seconds with labels: {namespace="openshift-authentication", pod="oauth-openshift-85d55cb75f-r7mbk", severity="warning"}

Search results:

https://search.ci.openshift.org/?search=alert+KubePodNotReady.*authentication&maxAge=48h&context=1&type=junit&name=4.10.*aws.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 2 Michal Fojtik 2021-10-06 15:30:15 UTC
The events for this pod suggest a possible race condition between kubelet and ovn-kubernetes, where the readiness probes appear to run before the network interface has been added to the pod.

Comment 3 Devan Goodwin 2021-10-07 17:46:34 UTC
Examining another occurrence of this in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade/1446112191311777792

Presumably similar to what Michal was looking at:

25m        Warning  FailedCreatePodSandBox                  pod/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal_openshift-kube-apiserver_a004db65-7082-4872-8275-57e32bafa212_0(40ac4a7dfb35737489489754b8f9036bfaaebd2b992e99d1194860dfa5ecc4bd): error adding pod openshift-kube-apiserver_revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal to CNI network "multus-cni-network": [openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal 40ac4a7dfb35737489489754b8f9036bfaaebd2b992e99d1194860dfa5ecc4bd] [openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal 40ac4a7dfb35737489489754b8f9036bfaaebd2b992e99d1194860dfa5ecc4bd] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

Comment 4 Dan Williams 2021-10-07 21:46:27 UTC
I1007 16:00:30.875960       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 25 items received
E1007 16:00:30.876101       1 kube.go:76] Error in setting annotation on pod openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal: Patch "https://api-int.ci-op-1shyf3hw-effab.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-apiserver/pods/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal": read tcp 10.0.255.155:49312->10.0.140.33:6443: read: connection reset by peer
I1007 16:00:30.876189       1 pods.go:370] Released IPs: 10.130.0.6 for node: ip-10-0-191-10.us-west-2.compute.internal

Not surprising in itself: the apiserver wasn't working at this point in the upgrade, so ovnkube-master got an error. What is surprising is that it didn't retry the pod creation, which it's supposed to do. addLogicalPort() prints the "Released IPs" message, so we know the defer() cleaned up, but the error should also have been returned to the caller.
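
[Editorial note] The linked PR title ("pods: fix overwriting returned error from defer()") points at a classic Go pitfall: a deferred function that reassigns the function's named return error, silently replacing the original failure. The following is a minimal illustrative sketch of that pattern and its fix, not the actual ovn-kubernetes code; the function names addLogicalPortBuggy, addLogicalPortFixed, and releaseIPs are hypothetical.

package main

import (
	"errors"
	"fmt"
)

// releaseIPs stands in for the deferred cleanup; it succeeds and returns nil.
func releaseIPs() error {
	fmt.Println("Released IPs")
	return nil
}

// Buggy pattern: the deferred cleanup assigns its own (nil) result to the
// named return value err, so the original annotation error is swallowed and
// the caller sees success, skipping the retry.
func addLogicalPortBuggy() (err error) {
	defer func() {
		if err != nil {
			err = releaseIPs() // overwrites the original error with nil
		}
	}()
	return errors.New("failed to get pod annotation: context deadline exceeded")
}

// Fixed pattern: keep the original error; only append the cleanup error if
// the cleanup itself fails.
func addLogicalPortFixed() (err error) {
	defer func() {
		if err != nil {
			if cleanupErr := releaseIPs(); cleanupErr != nil {
				err = fmt.Errorf("%v; also failed to release IPs: %v", err, cleanupErr)
			}
		}
	}()
	return errors.New("failed to get pod annotation: context deadline exceeded")
}

func main() {
	fmt.Println("buggy:", addLogicalPortBuggy()) // prints "buggy: <nil>" -- error lost, no retry
	fmt.Println("fixed:", addLogicalPortFixed()) // original error is preserved for the caller
}

In the buggy variant the caller gets nil even though pod setup failed, which matches the observed behavior: the defer cleaned up and logged "Released IPs", but no error propagated and no retry was scheduled.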

Comment 8 Devan Goodwin 2021-10-12 12:40:13 UTC
Based on the original CI search link in this bug report, I do not think the merged fix has helped the situation.

Comment 9 Devan Goodwin 2021-10-12 13:33:48 UTC
This may be related to, and fixed by the same PR as, https://bugzilla.redhat.com/show_bug.cgi?id=2013222

Comment 13 errata-xmlrpc 2022-03-10 16:17:19 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

