Bug 1950809

Summary: cluster-samples-operator restarts approximately two times per day and logs too many same messages
Product: OpenShift Container Platform Reporter: Gabe Montero <gmontero>
Component: Samples Assignee: Gabe Montero <gmontero>
Status: CLOSED ERRATA QA Contact: XiuJuan Wang <xiuwang>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.6 CC: adam.kaplan, gmontero, hfukumot, xiuwang
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Depending on the timing of the events it received, the samples operator could break the contract with k8s SharedInformers and mutate the controller cache for the objects (Samples Config, Templates, Imagestreams) that it watches. In addition, the frequency of concurrent updates to the samples config CR instance while tracking imagestream status increased the likelihood of hitting this timing window and incorrectly mutating the controller cache.
Consequence: In many cases the robustness of k8s kept things OK, but we have now seen cases where this violation produced a panic in k8s when the samples operator tried to update the objects it watches.
Fix: Stop mutating the cache through better use of k8s DeepCopy prior to updates. Also adjusted when we copy config information from spec to status in the samples config CR instances. Also, in 4.6.z a change has been added that reduces, though it cannot eliminate, concurrent attempts during imagestream event processing to update the samples operator config CR instance.
Result: The samples operator no longer mutates its SharedInformer cache and avoids panics in k8s when updating the objects it manages. The number of update conflicts that occur when concurrent imagestream events result in updating the samples operator CR instances has also been greatly reduced. They cannot be eliminated entirely, and a few of them still occurring is OK.
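The copy-before-update pattern described in the Fix can be sketched in plain Go. This is a minimal illustration only, not the operator's actual code: SamplesConfig, Cache, and SyncStatus are hypothetical stand-ins for the samples config CR, the informer's read-only store, and the spec-to-status copy step. The hand-written DeepCopy here plays the role of the DeepCopy methods that k8s API types normally provide.

```go
package main

import "fmt"

// SamplesConfig is a hypothetical stand-in for the samples config CR.
type SamplesConfig struct {
	Name           string
	SpecRegistry   string
	StatusRegistry string
	SkippedStreams []string
}

// DeepCopy returns a copy that shares no memory with the receiver,
// including the backing array of the slice field.
func (c *SamplesConfig) DeepCopy() *SamplesConfig {
	if c == nil {
		return nil
	}
	out := *c
	if c.SkippedStreams != nil {
		out.SkippedStreams = make([]string, len(c.SkippedStreams))
		copy(out.SkippedStreams, c.SkippedStreams)
	}
	return &out
}

// Cache is a hypothetical stand-in for the informer's store.
// Objects read from it must be treated as read-only.
type Cache map[string]*SamplesConfig

// SyncStatus copies spec to status on a private copy of the cached
// object and returns that copy, which is what would be sent in an update.
func SyncStatus(cache Cache, name string) (*SamplesConfig, bool) {
	cached, ok := cache[name]
	if !ok {
		return nil, false
	}
	cfg := cached.DeepCopy() // never write through the cached pointer
	cfg.StatusRegistry = cfg.SpecRegistry
	cfg.SkippedStreams = append(cfg.SkippedStreams, "example-stream")
	return cfg, true
}

func main() {
	cache := Cache{
		"cluster": &SamplesConfig{
			Name:           "cluster",
			SpecRegistry:   "registry.example.com",
			SkippedStreams: []string{"a"},
		},
	}
	updated, _ := SyncStatus(cache, "cluster")
	fmt.Println("updated status registry:", updated.StatusRegistry)
	fmt.Println("cached status registry empty:", cache["cluster"].StatusRegistry == "")
	fmt.Println("cached skipped streams:", len(cache["cluster"].SkippedStreams))
}
```

Because only the copy is modified, the object held in the cache stays exactly as the informer delivered it, which is the property the SharedInformer contract requires.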
Story Points: ---
Clone Of: 1950808 Environment:
Last Closed: 2021-05-12 12:18:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1950808    
Bug Blocks:    

Comment 6 errata-xmlrpc 2021-05-12 12:18:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.28 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1487