Bug 1877346 - KubeAPIErrorsHigh firing due to "too large resource version"
Summary: KubeAPIErrorsHigh firing due to "too large resource version"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Lukasz Szaszkiewicz
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 1877367
Blocks: 1879901
 
Reported: 2020-09-09 12:33 UTC by Lukasz Szaszkiewicz
Modified: 2024-03-25 16:27 UTC
CC List: 37 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A watch cache (in the Kube API server) is initialized from the global revision (etcd) and might stay on it for an indefinite period if no changes (adds or modifications) occur.
Consequence: A client can get a resource version (RV) from a server that has observed a newer RV, disconnect from it due to a network error, and reconnect to a server that is behind, resulting in "Too large resource version" errors.
Fix: Fix the reflector so that it can recover from "Too large resource version" errors.
Result: Operators that use the client-go library to get notifications from the server can recover and make progress upon receiving a "Too large resource version" error.
Clone Of:
Cloned To: 1877367
Environment:
Last Closed: 2021-03-03 04:40:29 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
- GitHub openshift/origin pull 25489 (closed): Bug 1877346: KubeAPIErrorsHigh firing due to too large resource version (last updated 2021-02-19 11:15:34 UTC)
- GitHub openshift/origin pull 25490 (closed): Bug 1877346: Fix bug for inconsistent lists served from etcd (last updated 2021-02-19 11:15:34 UTC)
- Red Hat Knowledge Base Solution 5392711 (last updated 2020-09-11 15:20:03 UTC)
- Red Hat Product Errata RHBA-2020:4228 (last updated 2020-10-19 14:54:48 UTC)
- Red Hat Product Errata RHSA-2021:0428 (last updated 2021-03-03 04:40:56 UTC)

Description Lukasz Szaszkiewicz 2020-09-09 12:33:44 UTC
If you are on a 4.5 cluster and see KubeAPIErrorsHigh followed by "Timeout: Too large resource version" errors reported by a kubelet, then you have hit https://github.com/kubernetes/kubernetes/issues/91073

Note that this BZ was created from https://bugzilla.redhat.com/show_bug.cgi?id=1748434 and is meant to track issues that are specific to 4.5 clusters.
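
For context, the fix described in the Doc Text lands in client-go's reflector, which every informer-based controller uses, so affected components pick it up once they are rebuilt against a patched client-go. A minimal sketch of such a consumer follows; watching pods is an arbitrary illustrative choice, not something specific to this bug:

~~~
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; outside a pod you would load a kubeconfig instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A ListerWatcher for pods in all namespaces.
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.Everything())

	// The reflector LISTs once, then WATCHes, remembering the last seen
	// resource version. Roughly speaking, before the fix a reflector that
	// received "Too large resource version" from a lagging watch cache kept
	// retrying with its stale resource version; with the fix it falls back
	// to a fresh LIST and makes progress again.
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)
	r := cache.NewReflector(lw, &v1.Pod{}, store, 0)

	stop := make(chan struct{})
	go r.Run(stop)
	time.Sleep(time.Minute) // a real controller would consume the store here
	close(stop)
}
~~~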

Comment 22 hgomes 2020-09-22 12:39:15 UTC
Hi Lukasz,

My customer, after upgrading to 4.5, is currently experiencing the alerts below. They are similar to, but not the same as, the ones in the description, though they do match other linked cases.
- Should I track this with this BZ too, or open a new one?

~~~
API server is returning errors for 100% of requests for LIST networkpolicies.
The API server is burning too much error budget
API server is returning errors for 100% of requests for LIST csidrivers.
~~~

Comment 23 Lukasz Szaszkiewicz 2020-09-23 06:58:36 UTC
(In reply to hgomes from comment #22)
> Hi Lukasz,
> 
> My customer, after upgrading to 4.5, is currently experiencing the
> alerts below. They are similar to, but not the same as, the ones in the
> description, though they do match other linked cases.
> - Should I track this with this BZ too, or open a new one?
> 
> ~~~
> API server is returning errors for 100% of requests for LIST networkpolicies.
> The API server is burning too much error budget
> API server is returning errors for 100% of requests for LIST csidrivers.
> ~~~

Please check the audit logs. If they contain HTTP 504 responses for LIST networkpolicies and csidrivers, then the customer ran into the same issue.
You should also be able to tell which pods caused those errors; in that case, make sure their logs contain "Timeout: Too large resource version".
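
As an illustration, one way to do that check is to pull the kube-apiserver audit log and keep only LIST requests that were answered with HTTP 504. The small filter below is a sketch (the program and its name are mine, not part of this bug); it assumes the audit log is in the standard JSON-lines audit format with verb, objectRef, responseStatus and user fields:

~~~
// filter504.go: print user and resource for LIST requests that got HTTP 504.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// event holds only the audit-event fields this filter cares about.
type event struct {
	Verb      string `json:"verb"`
	ObjectRef struct {
		Resource string `json:"resource"`
	} `json:"objectRef"`
	ResponseStatus struct {
		Code int `json:"code"`
	} `json:"responseStatus"`
	User struct {
		Username string `json:"username"`
	} `json:"user"`
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // audit lines can be long

	for sc.Scan() {
		var e event
		if json.Unmarshal(sc.Bytes(), &e) != nil {
			continue // skip lines that are not valid audit events
		}
		if e.Verb == "list" && e.ResponseStatus.Code == 504 {
			fmt.Printf("%s LIST %s\n", e.User.Username, e.ObjectRef.Resource)
		}
	}
}
~~~

Feed it with something like "oc adm node-logs --role=master --path=kube-apiserver/audit.log | go run filter504.go" and look for networkpolicies and csidrivers in the output.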

Comment 26 Ke Wang 2020-10-09 04:30:14 UTC
Verified with OCP 4.5.0-0.nightly-2020-10-08-211214. Following the reproducer from https://github.com/kubernetes/kubernetes/issues/91073, I disconnected a node from the network for a few minutes.

$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# cat > test.sh <<EOF
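# Take the node's primary interface down for five minutes, then bring it
# back up, so clients on the node are forced to disconnect and reconnect.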
ifconfig ens3 down
sleep 300
ifconfig ens3 up
EOF
sh-4.4# bash ./test.sh &

After the connection recovered, I reconnected to the node and checked whether the error messages from this bug appear in the kubelet logs:
$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# journalctl -b -u kubelet |  grep -i 'Too large resource version'

The error messages no longer appear in the kubelet logs.

Comment 32 errata-xmlrpc 2020-10-19 14:54:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4228

Comment 33 Lukasz Szaszkiewicz 2020-11-04 14:06:29 UTC
I'm reopening this BZ as not all operators have been fixed. 
https://bugzilla.redhat.com/show_bug.cgi?id=1879901 is a top-level issue that captures all operators that should be fixed.

Comment 51 Lukasz Szaszkiewicz 2021-02-05 13:25:05 UTC
Moving this BZ to QE since all dependent BZs in https://bugzilla.redhat.com/show_bug.cgi?id=1879901 have been verified.

Comment 53 Ke Wang 2021-02-08 10:06:07 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2021-02-05-192721   True        False         5h47m   Cluster version is 4.5.0-0.nightly-2021-02-05-192721

To verify, following the reproducer from https://github.com/kubernetes/kubernetes/issues/91073, we disconnected a node from the network for a few minutes.

$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# cat > test.sh <<EOF
ifconfig ens3 down
sleep 300
ifconfig ens3 up
EOF
sh-4.4# bash ./test.sh &

After the connection recovered, I reconnected to the node and checked whether the error messages from this bug appear in the kubelet logs:
$ oc debug node/ip-xx-0-xxx-41.us-east-2.compute.internal
sh-4.4# journalctl -b -u kubelet |  grep -i 'Too large resource version'

The error messages no longer appear in the kubelet logs.

Check all pod log files:
sh-4.4# cd /var/log
sh-4.4# ls
audit  btmp  chrony  containers  crio  glusterfs  journal  kube-apiserver  lastlog  openshift-apiserver  openvswitch  pods  private  samba  sssd  wtmp

sh-4.4# grep -nr 'Too large resource version' | grep -v debug

The error messages from this bug no longer appear. Based on the above test results, moving the bug to VERIFIED.

Comment 56 errata-xmlrpc 2021-03-03 04:40:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.5.33 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0428

