Bug 1877346

Summary: KubeAPIErrorsHigh firing due to "too large resource version"
Product: OpenShift Container Platform
Reporter: Lukasz Szaszkiewicz <lszaszki>
Component: kube-apiserver
Assignee: Lukasz Szaszkiewicz <lszaszki>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: high
Priority: high
Version: 4.5
CC: aabhishe, abhinkum, agabriel, aivaras.laimikis, alchan, amsingh, andbartl, aos-bugs, apjagtap, aprajapa, ChetRHosey, cruhm, dahernan, fhirtz, fshaikh, hgomes, jappleii, jrosenta, jseunghw, lars.erhardt.extern, mfojtik, mjahangi, naoto30, oarribas, palonsor, pkanthal, rekhan, rkshirsa, rsandu, sbhavsar, scott.worthington, sferguso, shishika, shsaxena, sparpate, xxia, yhe
Keywords: Reopened
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: A watch cache (in the Kube API server) is initialized from the global revision (etcd) and might stay on it for an undefined period if no changes (add, modify) were made.
Consequence: A client can get a resource version (RV) from a server that has observed a newer RV, disconnect from it (due to a network error), and reconnect to a server that is behind, resulting in "Too large resource version" errors.
Fix: The reflector was fixed so that it can recover from "Too large resource version" errors.
Result: Operators that use the client-go library for getting notifications from the server can recover and make progress upon receiving a "Too large resource version" error.
Cloned To: 1877367 (view as bug list)
Last Closed: 2021-03-03 04:40:29 UTC
Type: Bug
Bug Depends On: 1877367    
Bug Blocks: 1879901    

Description Lukasz Szaszkiewicz 2020-09-09 12:33:44 UTC
If you are on a 4.5 cluster and seeing KubeAPIErrorsHigh firing, followed by "Timeout: Too large resource version" errors reported by a kubelet, then you have hit https://github.com/kubernetes/kubernetes/issues/91073

Note that this BZ was created from https://bugzilla.redhat.com/show_bug.cgi?id=1748434 and it is meant to keep track of issues that are specific to a 4.5 cluster.
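
For reference, both sides of the symptom can be checked directly. A minimal sketch, assuming oc adm node-logs access to the nodes; the node name, namespace, and the deliberately huge resource version are placeholders:

$ # look for the error in a node's kubelet journal
$ oc adm node-logs <node-name> -u kubelet | grep 'Too large resource version'

$ # a LIST pinned to an RV the watch cache has not observed makes the
$ # API server wait and then fail with "Timeout: Too large resource version",
$ # which is what affected clients saw from lagging servers
$ oc get --raw '/api/v1/namespaces/default/pods?resourceVersion=999999999999'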

Comment 22 hgomes 2020-09-22 12:39:15 UTC
Hi Lukasz,

My customer, after upgrading to 4.5, is currently experiencing the alerts below. They are similar to, but not the same as, the description, though they match other linked cases.
- Should I use this BZ to track the issue, or open a new one?

~~~
API server is returning errors for 100% of requests for LIST networkpolicies.
The API server is burning too much error budget
API server is returning errors for 100% of requests for LIST csidrivers .
~~~

Comment 23 Lukasz Szaszkiewicz 2020-09-23 06:58:36 UTC
(In reply to hgomes from comment #22)
> Hi Lukasz,
> 
> My customer, after upgrading to 4.5, is currently experiencing the alerts
> below. They are similar to, but not the same as, the description, though
> they match other linked cases.
> - Should I use this BZ to track the issue, or open a new one?
> 
> ~~~
> API server is returning errors for 100% of requests for LIST networkpolicies.
> The API server is burning too much error budget
> API server is returning errors for 100% of requests for LIST csidrivers .
> ~~~

Please check the audit logs. If they contain HTTP 504 responses for LIST networkpolicies and csidrivers, then the customer ran into the same issue.
You should also be able to tell which pods caused those errors; in that case, make sure their logs contain "Timeout: Too large resource version".
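
A sketch of such an audit-log check, assuming the standard JSON audit format and the usual kube-apiserver audit log location on the masters; the grep patterns are approximate:

$ # LIST requests for the affected resources that timed out with HTTP 504
$ oc adm node-logs --role=master --path=kube-apiserver/audit.log \
    | grep '"verb":"list"' | grep '"code":504' \
    | grep -E '"resource":"(networkpolicies|csidrivers)"'

$ # count the offending clients by user agent
$ oc adm node-logs --role=master --path=kube-apiserver/audit.log \
    | grep '"verb":"list"' | grep '"code":504' \
    | grep -o '"userAgent":"[^"]*"' | sort | uniq -c | sort -rn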

Comment 26 Ke Wang 2020-10-09 04:30:14 UTC
Verified with OCP 4.5.0-0.nightly-2020-10-08-211214. Following the reproduction steps from the upstream issue https://github.com/kubernetes/kubernetes/issues/91073, disconnected a node from the network for a few minutes.

$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# cat > test.sh <<EOF
# take the node's primary interface down for five minutes, then restore it
ifconfig ens3 down
sleep 300
ifconfig ens3 up
EOF
sh-4.4# bash ./test.sh &    # run in the background; taking ens3 down drops the debug session

After the connection recovered, reconnected to the node and checked whether the error messages from this bug appear in the kubelet logs:
$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# journalctl -b -u kubelet |  grep -i 'Too large resource version'

The error messages no longer appear in the kubelet logs.

Comment 32 errata-xmlrpc 2020-10-19 14:54:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4228

Comment 33 Lukasz Szaszkiewicz 2020-11-04 14:06:29 UTC
I'm reopening this BZ as not all operators have been fixed. 
https://bugzilla.redhat.com/show_bug.cgi?id=1879901 is a top-level issue that captures all operators that should be fixed.
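
One brute-force way to spot operators that are still affected is to scan all pod logs for the error. A sketch, not an official procedure; it is slow on large clusters, and log rotation can hide older occurrences:

$ oc get pods -A --no-headers | awk '{print $1, $2}' | while read ns pod; do
    oc logs -n "$ns" "$pod" --all-containers 2>/dev/null \
      | grep -q 'Too large resource version' && echo "$ns/$pod"
  done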

Comment 51 Lukasz Szaszkiewicz 2021-02-05 13:25:05 UTC
Moving this BZ to QE since all dependent BZs in https://bugzilla.redhat.com/show_bug.cgi?id=1879901 have been verified.

Comment 53 Ke Wang 2021-02-08 10:06:07 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2021-02-05-192721   True        False         5h47m   Cluster version is 4.5.0-0.nightly-2021-02-05-192721

To verify, follow the reproduction steps from the upstream issue https://github.com/kubernetes/kubernetes/issues/91073 and disconnect a node from the network for a few minutes.

$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# cat > test.sh <<EOF
# take the node's primary interface down for five minutes, then restore it
ifconfig ens3 down
sleep 300
ifconfig ens3 up
EOF
sh-4.4# bash ./test.sh &    # run in the background; taking ens3 down drops the debug session

After the connection recovered, reconnected to the node and checked whether the error messages from this bug appear in the kubelet logs:
$ oc debug node/ip-xx-0-xxx-41.us-east-2.compute.internal
sh-4.4# journalctl -b -u kubelet |  grep -i 'Too large resource version'

The error messages no longer appear in the kubelet logs.

Check all pod log files:
sh-4.4# cd /var/log
sh-4.4# ls
audit  btmp  chrony  containers  crio  glusterfs  journal  kube-apiserver  lastlog  openshift-apiserver  openvswitch  pods  private  samba  sssd  wtmp

sh-4.4# grep -nr 'Too large resource version' | grep -v debug

The error messages from this bug no longer appear. Based on the above test results, moving the bug to VERIFIED.

Comment 56 errata-xmlrpc 2021-03-03 04:40:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.5.33 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0428