Bug 2011386 - [sig-arch] Check if alerts are firing during or after upgrade success --- alert KubePodNotReady fired for 60 seconds with labels
Summary: [sig-arch] Check if alerts are firing during or after upgrade success --- alert KubePodNotReady fired for 60 seconds with labels
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-06 14:17 UTC by Devan Goodwin
Modified: 2022-03-10 16:17 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:17:19 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/ovn-kubernetes pull 787 (Merged): "Bug 2011386: pods: fix overwriting returned error from defer()" (last updated 2022-02-08 18:01:29 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:17:44 UTC)

Description Devan Goodwin 2021-10-06 14:17:42 UTC
[sig-arch] Check if alerts are firing during or after upgrade success

is failing periodically in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-arch%5D%20Check%20if%20alerts%20are%20firing%20during%20or%20after%20upgrade%20success

This bug is specific to: alert KubePodNotReady fired for 60 seconds with labels: {namespace="openshift-authentication", pod="oauth-openshift-85d55cb75f-r7mbk", severity="warning"}

Search results:

https://search.ci.openshift.org/?search=alert+KubePodNotReady.*authentication&maxAge=48h&context=1&type=junit&name=4.10.*aws.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 2 Michal Fojtik 2021-10-06 15:30:15 UTC
The events for this pod suggest a possible race condition between kubelet and ovn-kubernetes, where the readiness probes appear to run before the network interface has been added to the pod.

Comment 3 Devan Goodwin 2021-10-07 17:46:34 UTC
Examining another occurrence of this in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade/1446112191311777792

Presumably similar to what Michal was looking at:

25m        Warning  FailedCreatePodSandBox                  pod/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal_openshift-kube-apiserver_a004db65-7082-4872-8275-57e32bafa212_0(40ac4a7dfb35737489489754b8f9036bfaaebd2b992e99d1194860dfa5ecc4bd): error adding pod openshift-kube-apiserver_revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal to CNI network "multus-cni-network": [openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal 40ac4a7dfb35737489489754b8f9036bfaaebd2b992e99d1194860dfa5ecc4bd] [openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal 40ac4a7dfb35737489489754b8f9036bfaaebd2b992e99d1194860dfa5ecc4bd] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

Comment 4 Dan Williams 2021-10-07 21:46:27 UTC
I1007 16:00:30.875960       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 25 items received
E1007 16:00:30.876101       1 kube.go:76] Error in setting annotation on pod openshift-kube-apiserver/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal: Patch "https://api-int.ci-op-1shyf3hw-effab.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-apiserver/pods/revision-pruner-9-ip-10-0-191-10.us-west-2.compute.internal": read tcp 10.0.255.155:49312->10.0.140.33:6443: read: connection reset by peer
I1007 16:00:30.876189       1 pods.go:370] Released IPs: 10.130.0.6 for node: ip-10-0-191-10.us-west-2.compute.internal

Not surprising in itself: the apiserver wasn't working at this point in the upgrade, so ovnkube-master got an error. What is surprising is that it didn't retry the pod creation, which it's supposed to do. addLogicalPort() prints the "Released IPs" message, so we know the defer() cleaned up, but the error should also have been returned to the caller.
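
[Editorial note] The linked PR title ("pods: fix overwriting returned error from defer()") points at a classic Go pitfall: a deferred function that reassigns the function's named return error, silently replacing the original failure. The following is a minimal illustrative sketch of that pattern and its fix, not the actual ovn-kubernetes code; the function names addLogicalPortBuggy, addLogicalPortFixed, and releaseIPs are hypothetical.

package main

import (
	"errors"
	"fmt"
)

// releaseIPs stands in for the deferred cleanup; it succeeds and returns nil.
func releaseIPs() error {
	fmt.Println("Released IPs")
	return nil
}

// Buggy pattern: the deferred cleanup assigns its own (nil) result to the
// named return value err, so the original annotation error is swallowed and
// the caller sees success, skipping the retry.
func addLogicalPortBuggy() (err error) {
	defer func() {
		if err != nil {
			err = releaseIPs() // overwrites the original error with nil
		}
	}()
	return errors.New("failed to get pod annotation: context deadline exceeded")
}

// Fixed pattern: keep the original error; only append the cleanup error if
// the cleanup itself fails.
func addLogicalPortFixed() (err error) {
	defer func() {
		if err != nil {
			if cleanupErr := releaseIPs(); cleanupErr != nil {
				err = fmt.Errorf("%v; also failed to release IPs: %v", err, cleanupErr)
			}
		}
	}()
	return errors.New("failed to get pod annotation: context deadline exceeded")
}

func main() {
	fmt.Println("buggy:", addLogicalPortBuggy()) // prints "buggy: <nil>" -- error lost, no retry
	fmt.Println("fixed:", addLogicalPortFixed()) // original error is preserved for the caller
}

In the buggy variant the caller gets nil even though pod setup failed, which matches the observed behavior: the defer cleaned up and logged "Released IPs", but no error propagated and no retry was scheduled.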

Comment 8 Devan Goodwin 2021-10-12 12:40:13 UTC
Based on the original CI search link in this bug report, I do not think the merged fix has helped the situation.

Comment 9 Devan Goodwin 2021-10-12 13:33:48 UTC
This may be related to, and fixed by the same PR as, https://bugzilla.redhat.com/show_bug.cgi?id=2013222

Comment 13 errata-xmlrpc 2022-03-10 16:17:19 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

