Bug 1877346 - KubeAPIErrorsHigh firing due to "too large resource version"
Summary: KubeAPIErrorsHigh firing due to "too large resource version"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Lukasz Szaszkiewicz
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 1877367
Blocks: 1879901
 
Reported: 2020-09-09 12:33 UTC by Lukasz Szaszkiewicz
Modified: 2024-03-25 16:27 UTC
CC List: 37 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A watch cache (in the Kube API server) is initialized from the global revision (etcd) and might stay on it for an indefinite period if no changes (adds or modifications) occur.
Consequence: A client can get a resource version (RV) from a server that has observed a newer RV, disconnect from it due to a network error, and reconnect to a server that is behind, resulting in "Too large resource version" errors.
Fix: Fix the reflector so that it can recover from "Too large resource version" errors.
Result: Operators that use the client-go library to get notifications from the server can recover and make progress upon receiving a "Too large resource version" error.
Clone Of:
Cloned To: 1877367
Environment:
Last Closed: 2021-03-03 04:40:29 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
- GitHub openshift/origin pull 25489 (closed): Bug 1877346: KubeAPIErrorsHigh firing due to too large resource version (last updated 2021-02-19 11:15:34 UTC)
- GitHub openshift/origin pull 25490 (closed): Bug 1877346: Fix bug for inconsistent lists served from etcd (last updated 2021-02-19 11:15:34 UTC)
- Red Hat Knowledge Base Solution 5392711 (last updated 2020-09-11 15:20:03 UTC)
- Red Hat Product Errata RHBA-2020:4228 (last updated 2020-10-19 14:54:48 UTC)
- Red Hat Product Errata RHSA-2021:0428 (last updated 2021-03-03 04:40:56 UTC)

Description Lukasz Szaszkiewicz 2020-09-09 12:33:44 UTC
If you are on a 4.5 cluster and see KubeAPIErrorsHigh followed by "Timeout: Too large resource version" errors reported by a kubelet, then you have hit https://github.com/kubernetes/kubernetes/issues/91073

Note that this BZ was created from https://bugzilla.redhat.com/show_bug.cgi?id=1748434 and is meant to track issues that are specific to 4.5 clusters.
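
For context, the fix described in the Doc Text lands in client-go's reflector, which every informer-based controller uses, so affected components pick it up once they are rebuilt against a patched client-go. A minimal sketch of such a consumer follows; watching pods is an arbitrary illustrative choice, not something specific to this bug:

~~~
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; outside a pod you would load a kubeconfig instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A ListerWatcher for pods in all namespaces.
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.Everything())

	// The reflector LISTs once, then WATCHes, remembering the last seen
	// resource version. Roughly speaking, before the fix a reflector that
	// received "Too large resource version" from a lagging watch cache kept
	// retrying with its stale resource version; with the fix it falls back
	// to a fresh LIST and makes progress again.
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)
	r := cache.NewReflector(lw, &v1.Pod{}, store, 0)

	stop := make(chan struct{})
	go r.Run(stop)
	time.Sleep(time.Minute) // a real controller would consume the store here
	close(stop)
}
~~~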

Comment 22 hgomes 2020-09-22 12:39:15 UTC
Hi Lukasz,

My customer, after upgrading to 4.5, is currently experiencing the alerts below. They are similar to, but not the same as, the ones in the description, though they do match other linked cases.
- Should I track this with this BZ too, or open a new one?

~~~
API server is returning errors for 100% of requests for LIST networkpolicies.
The API server is burning too much error budget
API server is returning errors for 100% of requests for LIST csidrivers.
~~~

Comment 23 Lukasz Szaszkiewicz 2020-09-23 06:58:36 UTC
(In reply to hgomes from comment #22)
> Hi Lukasz,
> 
> My customer, after upgrading to 4.5, is currently experiencing the
> alerts below. They are similar to, but not the same as, the ones in the
> description, though they do match other linked cases.
> - Should I track this with this BZ too, or open a new one?
> 
> ~~~
> API server is returning errors for 100% of requests for LIST networkpolicies.
> The API server is burning too much error budget
> API server is returning errors for 100% of requests for LIST csidrivers.
> ~~~

Please check the audit logs. If they contain HTTP 504 responses for LIST networkpolicies and csidrivers, then the customer ran into the same issue.
You should also be able to tell which pods caused those errors; in that case, make sure their logs contain "Timeout: Too large resource version".
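
As an illustration, one way to do that check is to pull the kube-apiserver audit log and keep only LIST requests that were answered with HTTP 504. The small filter below is a sketch (the program and its name are mine, not part of this bug); it assumes the audit log is in the standard JSON-lines audit format with verb, objectRef, responseStatus and user fields:

~~~
// filter504.go: print user and resource for LIST requests that got HTTP 504.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// event holds only the audit-event fields this filter cares about.
type event struct {
	Verb      string `json:"verb"`
	ObjectRef struct {
		Resource string `json:"resource"`
	} `json:"objectRef"`
	ResponseStatus struct {
		Code int `json:"code"`
	} `json:"responseStatus"`
	User struct {
		Username string `json:"username"`
	} `json:"user"`
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // audit lines can be long

	for sc.Scan() {
		var e event
		if json.Unmarshal(sc.Bytes(), &e) != nil {
			continue // skip lines that are not valid audit events
		}
		if e.Verb == "list" && e.ResponseStatus.Code == 504 {
			fmt.Printf("%s LIST %s\n", e.User.Username, e.ObjectRef.Resource)
		}
	}
}
~~~

Feed it with something like "oc adm node-logs --role=master --path=kube-apiserver/audit.log | go run filter504.go" and look for networkpolicies and csidrivers in the output.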

Comment 26 Ke Wang 2020-10-09 04:30:14 UTC
Verified with OCP 4.5.0-0.nightly-2020-10-08-211214. Following the reproducer from https://github.com/kubernetes/kubernetes/issues/91073, I disconnected a node from the network for a few minutes.

$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# cat > test.sh <<EOF
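# Take the node's primary interface down for five minutes, then bring it
# back up, so clients on the node are forced to disconnect and reconnect.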
ifconfig ens3 down
sleep 300
ifconfig ens3 up
EOF
sh-4.4# bash ./test.sh &

After the connection recovered, I reconnected to the node and checked whether the error messages from this bug appear in the kubelet logs:
$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# journalctl -b -u kubelet |  grep -i 'Too large resource version'

The error messages no longer appear in the kubelet logs.

Comment 32 errata-xmlrpc 2020-10-19 14:54:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4228

Comment 33 Lukasz Szaszkiewicz 2020-11-04 14:06:29 UTC
I'm reopening this BZ as not all operators have been fixed. 
https://bugzilla.redhat.com/show_bug.cgi?id=1879901 is a top-level issue that captures all operators that should be fixed.

Comment 51 Lukasz Szaszkiewicz 2021-02-05 13:25:05 UTC
Moving this BZ to QE since all dependent BZs in https://bugzilla.redhat.com/show_bug.cgi?id=1879901 have been verified.

Comment 53 Ke Wang 2021-02-08 10:06:07 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2021-02-05-192721   True        False         5h47m   Cluster version is 4.5.0-0.nightly-2021-02-05-192721

To verify, following the reproducer from https://github.com/kubernetes/kubernetes/issues/91073, we disconnected a node from the network for a few minutes.

$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# cat > test.sh <<EOF
ifconfig ens3 down
sleep 300
ifconfig ens3 up
EOF
sh-4.4# bash ./test.sh &

After the connection recovered, I reconnected to the node and checked whether the error messages from this bug appear in the kubelet logs:
$ oc debug node/ip-xx-0-xxx-41.us-east-2.compute.internal
sh-4.4# journalctl -b -u kubelet |  grep -i 'Too large resource version'

The error messages no longer appear in the kubelet logs.

Check all pod log files:
sh-4.4# cd /var/log
sh-4.4# ls
audit  btmp  chrony  containers  crio  glusterfs  journal  kube-apiserver  lastlog  openshift-apiserver  openvswitch  pods  private  samba  sssd  wtmp

sh-4.4# grep -nr 'Too large resource version' | grep -v debug

The error messages from this bug no longer appear. Based on the above test results, moving the bug to VERIFIED.

Comment 56 errata-xmlrpc 2021-03-03 04:40:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.5.33 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0428

