Bug 1879901

Summary: [release 4.5] Top-level cross-platform: Fix bug in reflector not recovering from "Too large resource version"
Product: OpenShift Container Platform
Reporter: Lukasz Szaszkiewicz <lszaszki>
Component: kube-apiserver
Assignee: Lukasz Szaszkiewicz <lszaszki>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: high
Docs Contact:
Priority: medium
Version: 4.5
CC: abhinkum, abraj, adeshpan, agarcial, aivaras.laimikis, aos-bugs, ChetRHosey, dahernan, fhirtz, hgomes, lars.erhardt.extern, mcalizo, mfojtik, micmurph, mrhodes, naoto30, oarribas, palonsor, sferguso, sople, ssadhale, sttts, wking, xxia
Target Milestone: ---
Flags: mfojtik: needinfo?
Target Release: 4.5.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Fix bug in reflector that couldn't recover from "Too large resource version" errors. The root cause of the issue is that the watch cache is initialized from the global (etcd) revision and might stay on it for an undefined period if no changes (adds or modifies) are made. That means the watch cache may be out of sync across server instances. This can lead to a situation in which a client gets a resource version from a server that has observed a newer revision, disconnects from it (due to a network error), and reconnects to a server that is behind, resulting in "Too large resource version" errors.
Story Points: ---
Clone Of:
: 1879991 (view as bug list)
Environment:
Last Closed: 2021-03-03 04:40:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1877346, 1879991, 1880301, 1880304, 1880307, 1880309, 1880311, 1880313, 1880314, 1880315, 1880318, 1880320, 1880322, 1880324, 1880326, 1880327, 1880333, 1880341, 1880343, 1880344, 1880348, 1880353, 1880357, 1880359, 1880360, 1880366, 1880368, 1880369, 1881043, 1881077, 1881103, 1881108, 1881111, 1881134, 1881819, 1882071, 1882073, 1882077, 1882379, 1882448, 1892583, 1892585, 1892587, 1892590, 1893637, 1894666, 1894667    
Bug Blocks:    

Description Lukasz Szaszkiewicz 2020-09-17 10:29:25 UTC
A recent fix in the reflector/informer, https://github.com/kubernetes/kubernetes/pull/92688, prevents components/operators from entering a hot loop and getting stuck.

There are already reported cases of components that ran into this issue and were stuck for hours or even days; see, for example, https://bugzilla.redhat.com/show_bug.cgi?id=1877346.

The root cause of the issue is that the watch cache is initialized from the global (etcd) revision and might stay on it for an undefined period if no changes (adds or modifies) are made.
That means the watch cache may be out of sync across server instances.
This can lead to a situation in which a client gets a resource version from a server that has observed a newer revision, disconnects from it (due to a network error), and reconnects to a server that is behind, resulting in "Too large resource version" errors.

More details can be found in https://github.com/kubernetes/kubernetes/issues/91073 and https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1904-efficient-watch-resumption.
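
For illustration, here is a minimal Go sketch of the detection side of that fix. This is a simplified illustration, not the upstream reflector code; it assumes the k8s.io/apimachinery error helpers from v0.19+, where the ResourceVersionTooLarge status cause constant exists. The apiserver reports "Too large resource version" as a Timeout status with a ResourceVersionTooLarge cause, and a fixed reflector reacts by dropping its last-synced resource version and re-listing with resourceVersion="" instead of retrying the stale version forever:

package main

import (
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isTooLargeResourceVersionError returns true when the apiserver rejected the
// request because the requested resourceVersion is newer than anything the
// serving instance has observed. The server reports this as a Timeout status
// carrying a ResourceVersionTooLarge cause.
func isTooLargeResourceVersionError(err error) bool {
    if !apierrors.IsTimeout(err) {
        return false
    }
    _, found := apierrors.StatusCause(err, metav1.CauseTypeResourceVersionTooLarge)
    return found
}

func main() {
    // Simulate the error a lagging apiserver instance returns when a client
    // asks for a resource version it has not observed yet.
    timeoutErr := apierrors.NewTimeoutError("Too large resource version", 1)
    timeoutErr.ErrStatus.Details.Causes = []metav1.StatusCause{
        {Type: metav1.CauseTypeResourceVersionTooLarge, Message: "Too large resource version"},
    }

    if isTooLargeResourceVersionError(timeoutErr) {
        // A reflector with the fix forgets its last-synced resource version and
        // re-lists with resourceVersion="" (a quorum read served from etcd),
        // instead of retrying the stale resource version in a hot loop.
        fmt.Println(`recovering: relist with resourceVersion=""`)
    }
}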


It looks like the issue only affects 1.18. According to https://github.com/kubernetes/kubernetes/issues/91073#issuecomment-652251669, the issue was first introduced in that version by changes made to the reflector.
The fix is already present in 1.19.


Components/operators using the 1.18 version of client-go must update to a version that includes https://github.com/kubernetes/kubernetes/pull/92688.
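
For illustration only (the exact version each component vendors is tracked in its own dependent BZ), a go.mod fragment bumping to the 1.19-based client libraries, which already contain the fix, could look like:

// Illustrative go.mod fragment; any client-go release that includes the fix
// from kubernetes/kubernetes#92688 is sufficient.
require (
    k8s.io/api v0.19.0
    k8s.io/apimachinery v0.19.0
    k8s.io/client-go v0.19.0
)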


I'm creating this BZ as a central place that will allow us to track the progress.

Comment 3 Lukasz Szaszkiewicz 2020-10-02 09:28:10 UTC
This bug is actively being worked on.

Comment 4 Michal Fojtik 2020-10-17 14:12:07 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise, this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

Comment 5 Pablo Alonso Rodriguez 2020-10-19 09:59:13 UTC
This bug is just a common tracker for many other bugs, so I understand it must be kept open until all of the dependent bugs are closed.

Removing LifecycleStale. Please don't re-add it unless this bug is to be closed for a valid reason.

Comment 9 Mike Murphy 2020-11-24 22:35:24 UTC
Hi, wondering if we have an update on this. Thanks.

Comment 13 Lukasz Szaszkiewicz 2021-02-05 13:22:54 UTC
I'm closing this BZ since all dependent BZs have been closed.

Comment 14 Pablo Alonso Rodriguez 2021-02-05 13:24:24 UTC
Isn't https://bugzilla.redhat.com/show_bug.cgi?id=1880333 still pending? I see it in the VERIFIED state.

Comment 17 Ke Wang 2021-02-08 10:10:31 UTC
The last related bug, 1877346, was verified; all dependent BZs have now been verified, so this top-level bug can be verified.

Comment 19 errata-xmlrpc 2021-03-03 04:40:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.5.33 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0428