Bug 1949481
Summary: | cluster-samples-operator restarts approximately two times per day and logs too many same messages | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Hideshi Fukumoto <hfukumot> | |
Component: | Samples | Assignee: | Gabe Montero <gmontero> | |
Status: | CLOSED ERRATA | QA Contact: | XiuJuan Wang <xiuwang> | |
Severity: | medium | Docs Contact: | ||
Priority: | high | |||
Version: | 4.6 | CC: | adam.kaplan, gmontero | |
Target Milestone: | --- | |||
Target Release: | 4.8.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: Based on the timing of the events it receives, the samples operator could end up breaking the contract with k8s SharedInformers and mutate the controller cache for the objects (Samples Config, Templates, Imagestreams) that it watches.
Consequence: In many cases, robustness in k8s kept things OK, but we've now seen cases where this violation produced a panic in k8s when the samples operator tried to update the objects it watches.
Fix: Stop mutating the cache via better use of k8s DeepCopy prior to updates. Also adjusted when we copy config information from spec to status in the samples config CR instances.
Result: The samples operator no longer mutates its SharedInformer cache, and avoids panics in k8s when updating the objects it manages.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1950808 | Environment: | |
Last Closed: | 2021-07-27 23:00:33 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1950808 |
Description
Hideshi Fukumoto
2021-04-14 11:52:28 UTC
Root cause:
A race condition in the samples operator causes the kubernetes client to panic [1]. In OCP 4.6, the samples operator reports imagestream import failures by using multiple goroutines to update the same status object. During this process, the object retrieved from a shared cache is mutated. Copying the status object from the cache in each goroutine eliminates this panic.

In OCP 4.7 and later, the samples operator changed how it reports failures: failed imports are written to multiple ConfigMaps (one per failed imagestream). This change significantly lowers the probability that this bug occurs, but it does not eliminate the root cause.

Workarounds:
1. If the samples operator has skipped imagestreams and/or templates, clear these [2]. Skipped imagestreams and templates can make this bug more likely to occur.
2. If this does not address the issue, set the samples operator management state to "Removed" [2]. This has the downside of removing all imagestreams and templates installed by the Samples Operator. Setting the management state back to "Managed" will re-install the imagestreams and templates.

[1] https://github.com/kubernetes/kubernetes/issues/82497
[2] https://docs.openshift.com/container-platform/4.7/openshift_images/configuring-samples-operator.html#samples-operator-configuration_configuring-samples-operator

Quick follow-on to Adam's note:

Per https://access.redhat.com/documentation/en-us/red_hat_fuse/7.8/html/fuse_on_openshift_guide/get-started-admin#install-fuse-on-openshift4 the Fuse-related samples are being manually installed via oc apply after adding those entries to the skip list.

After setting the samples operator to Removed, you still manually install the Fuse samples via https://access.redhat.com/documentation/en-us/red_hat_fuse/7.8/html/fuse_on_openshift_guide/get-started-admin#install-fuse-on-openshift4. Setting the samples operator to Removed is a replacement for adding the Fuse samples to the skipped list. Customers then have to manually apply the samples they care about, similar to what is described in https://access.redhat.com/documentation/en-us/red_hat_fuse/7.8/html/fuse_on_openshift_guide/get-started-admin#install-fuse-on-openshift4.

Gabe, thanks for the info.

Following the steps the customer took, the samples operator no longer restarts, and the errors mentioned in the bug no longer appear.
Verified on a 4.8.0-0.nightly-2021-04-20-011453 cluster.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
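
Illustration of the fix pattern:
For readers unfamiliar with the "copy before update" rule mentioned in the root cause, the following is a minimal, self-contained Go sketch. It is not the samples operator's code. The types Status and Cache and the function recordFailure are hypothetical stand-ins for an API object, a SharedInformer store, and an update path. The sketch only shows the idea: each goroutine takes a deep copy of the cached object and changes the copy, so the shared cache is never written to.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical stand-in for the status object the operator updates.
type Status struct {
	Failures map[string]string // imagestream name -> import error message
}

// DeepCopy returns an independent copy, in the spirit of the generated
// DeepCopy methods on Kubernetes API types.
func (s *Status) DeepCopy() *Status {
	out := &Status{Failures: make(map[string]string, len(s.Failures))}
	for k, v := range s.Failures {
		out.Failures[k] = v
	}
	return out
}

// Cache is a hypothetical stand-in for a shared, read-only informer store.
type Cache struct {
	mu  sync.RWMutex
	obj *Status
}

// Get returns the cached pointer; callers must treat it as read-only.
func (c *Cache) Get() *Status {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.obj
}

// recordFailure copies the cached object before changing it.
// The cached object itself is never modified.
func recordFailure(c *Cache, name, msg string) *Status {
	updated := c.Get().DeepCopy()
	updated.Failures[name] = msg
	// In a real controller, this copy is what would be sent to the API server.
	return updated
}

func main() {
	cache := &Cache{obj: &Status{Failures: map[string]string{}}}

	var wg sync.WaitGroup
	results := make([]*Status, 8)
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = recordFailure(cache, fmt.Sprintf("stream-%d", i), "import failed")
		}(i)
	}
	wg.Wait()

	// The shared object is unchanged; each goroutine only touched its own copy.
	fmt.Println("entries in cache:", len(cache.Get().Failures))
	fmt.Println("entries in first copy:", len(results[0].Failures))
}
```

Writing to the pointer returned by Get directly from several goroutines would be the analogue of the bug: concurrent writes to a shared Go map are a data race and can panic at runtime. In this sketch each copy only carries its own change; how the real operator merges or serializes status updates is outside its scope.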