Bug 1949481 - cluster-samples-operator restarts approximately two times per day and logs too many same messages
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 1950808
Reported: 2021-04-14 11:52 UTC by Hideshi Fukumoto
Modified: 2021-07-27 23:00 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Based on the timing of the events it receives, the samples operator could end up breaking the contract with k8s SharedInformers and mutate the controller cache for the objects (Samples Config, Templates, Imagestreams) that it watches. Consequence: In many cases, robustness in k8s kept things OK, but we have now seen cases where this violation produced a panic in k8s when the samples operator tried to update the objects it watches. Fix: Stop mutating the cache via better use of k8s DeepCopy prior to updates. Also adjusted when we copy config information from spec to status in the Samples Config CR instances. Result: The samples operator no longer mutates its SharedInformer cache, and avoids panics in k8s when updating the objects it manages.
Clone Of:
Cloned To: 1950808
Environment:
Last Closed: 2021-07-27 23:00:33 UTC
Target Upstream Version:


Attachments: goroutine_panic.log (attachment 1771872)


Links
- GitHub openshift/cluster-samples-operator pull 369: "Bug 1949481: add DeepCopy to avoid SharedInformer cache mutation" (open; last updated 2021-04-18 22:32:08 UTC)
- Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:00:47 UTC)

Description Hideshi Fukumoto 2021-04-14 11:52:28 UTC
Created attachment 1771872 [details]
goroutine_panic.log

Description of problem:

My customer's OCP 4.6 cluster has three issues related to the cluster-samples-operator.

1. cluster-samples-operator restarts approximately twice per day,
   and we can find a panic message for cluster-samples-operator in the log.

   ~~~
   $ omg get pod -A
   NAMESPACE                                         NAME                                                             READY  STATUS     RESTARTS  AGE

   openshift-cluster-samples-operator                cluster-samples-operator-8c9f8b55f-8vrm9                         2/2    Running    16        9d
   ~~~
   
   From the attached file - goroutine_panic.log
   ~~~
   [/var/log/containers/cluster-samples-operator-8c9f8b55f-8vrm9_openshift-cluster-samples-operator_cluster-samples-operator-4b01ba1606694d7a696283fc9c7f40f4de4d862f0570854097745fe6d38cefa4.log]
   stderr F E0404 12:59:22.520853 227 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
   stderr F goroutine 182 [running]:
   stderr F k8s.io/apimachinery/pkg/util/runtime.logPanic(0x19d85a0, 0x2a5daf0)
   stderr F /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
   stderr F k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
   stderr F /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
   stderr F panic(0x19d85a0, 0x2a5daf0)
   stderr F /usr/lib/golang/src/runtime/panic.go:969 +0x1b9
   stderr F k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
   stderr F /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
   stderr F panic(0x19d85a0, 0x2a5daf0)
   stderr F /usr/lib/golang/src/runtime/panic.go:969 +0x1b9
   stderr F strings.Count(0x0, 0x20, 0x1c23bf6, 0x1, 0xc000bc2918)
   stderr F /usr/lib/golang/src/strings/strings.go:47 +0xf7
   stderr F k8s.io/apimachinery/pkg/runtime/schema.ParseGroupVersion(0x0, 0x20, 0x999576, 0x19bd820, 0xc00038d080, 0xc000bc29a0, 0x2aba160, 0x1af9700)
   stderr F /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/runtime/schema/group_version.go:218 +0x97
   stderr F k8s.io/apimachinery/pkg/runtime/schema.FromAPIVersionAndKind(0x0, 0x20, 0x184d242, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
   stderr F /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/runtime/schema/group_version.go:301 +0x5a
   stderr F k8s.io/apimachinery/pkg/apis/meta/v1.(*TypeMeta).GroupVersionKind(0xc00039b440, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
   stderr F /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/meta.go:128 +0x5e
   stderr F k8s.io/apimachinery/pkg/runtime.WithVersionEncoder.Encode(0x1e71bc0, 0xc000381920, 0x1e6c280, 0xc0000a0be0, 0x1e6c180, 0xc000728f50, 0x1e66900, 0xc00039b440, 0x1e57980, 0xc0009342d0, ...)
    ...
   stderr F panic: runtime error: invalid memory address or nil pointer dereference
   stderr F [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x407893]
   stderr F
   stderr F goroutine 182 [running]:
   stderr F k8s.io/apimachinery/pkg/util/runtime.HandleCrash
   ~~~

   From must-gather.local.7397819650786332896/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-06b29bfc5a096fc5685e9e4a6970a39b2f8f205e4c6e2b861463f60e6424ca14/namespaces/openshift-cluster-samples-operator/pods/cluster-samples-operator-8c9f8b55f-8vrm9/cluster-samples-operator/cluster-samples-operator/logs/previous.log
   ~~~
   2021-04-05T21:10:29.102906666Z time="2021-04-05T21:10:29Z" level=info msg="processing file jboss-datavirt64-driver-openshift-rhel7.json from dir /opt/openshift/operator/x86_64/datavirt/imagestreams"
   2021-04-05T21:10:29.103074323Z fatal error: concurrent map iteration and map write
   2021-04-05T21:10:29.104688429Z 
   2021-04-05T21:10:29.104688429Z goroutine 111 [running]:
   2021-04-05T21:10:29.104718122Z runtime.throw(0x1c5680e, 0x26)
   2021-04-05T21:10:29.104718122Z 	/usr/lib/golang/src/runtime/panic.go:1116 +0x72 fp=0xc000235300 sp=0xc0002352d0 pc=0x43d8d2
   2021-04-05T21:10:29.104753993Z runtime.mapiternext(0xc000235458)
   2021-04-05T21:10:29.104767819Z 	/usr/lib/golang/src/runtime/map.go:853 +0x554 fp=0xc000235380 sp=0xc000235300 pc=0x416e34
   2021-04-05T21:10:29.104781336Z github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).createSamples(0xc000628b40, 0xc0002f4480, 0x0, 0xc00024ecc0, 0xc00024ec90, 0x0, 0x0, 0x0)
   2021-04-05T21:10:29.104808964Z 	/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:1051 +0xc5 fp=0xc0002354c8 sp=0xc000235380 pc=0x182cd45
   ~~~

2. We can find a large number of identical messages like the one below (see the conflict-retry sketch after item 3 for background on this error):
   ~~~
      unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io "cluster":
      the object has been modified; please apply your changes to the latest version and try again, requeuing
   ~~~

   ~~~
   $ omg logs pod/cluster-samples-operator-8c9f8b55f-8vrm9 -c cluster-samples-operator -n openshift-cluster-samples-operator > cluster-samples-operator-8c9f8b55f-8vrm9.log
   $ grep "unable to sync" cluster-samples-operator-8c9f8b55f-8vrm9.log | wc
    111    3334   35780

   $ grep "unable to sync" cluster-samples-operator-8c9f8b55f-8vrm9.log | head -3
   2021-04-06T09:00:40.088528589Z time="2021-04-06T09:00:40Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
   2021-04-06T09:00:40.090388797Z time="2021-04-06T09:00:40Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
   2021-04-06T09:00:40.091979942Z time="2021-04-06T09:00:40Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"

   $ grep "unable to sync" cluster-samples-operator-8c9f8b55f-8vrm9.log | tail -3
   2021-04-06T09:21:04.473589730Z time="2021-04-06T09:21:04Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
   2021-04-06T09:21:04.518428689Z time="2021-04-06T09:21:04Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
   2021-04-06T09:21:07.861347232Z time="2021-04-06T09:21:07Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
   ~~~

3. We can find an "Object 'Kind' is missing" message like the one below:

  must-gather.local.7397819650786332896/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-06b29bfc5a096fc5685e9e4a6970a39b2f8f205e4c6e2b861463f60e6424ca14/namespaces/openshift-cluster-samples-operator/pods/cluster-samples-operator-8c9f8b55f-8vrm9/cluster-samples-operator/cluster-samples-operator/logs/current.log

   ~~~
   2021-04-06T09:00:49.358894519Z time="2021-04-06T09:00:49Z" level=error msg="
   unable to sync: Config in version \"v1\" cannot be handled as a Config:
    unmarshalerDecoder: Object 'Kind' is missing in 
   '{\"metadata\":
      {\"name\":\"cluster\",
       \"selfLink\":\"/apis/samples.operator.openshift.io/v1/configs/cluster\",
       \"uid\":\"c98d98dd-919c-47d9-a646-05aff07ec738\",
       \"resourceVersion\":\"25736305\",
       \"generation\":2,
       \"creationTimestamp\":\"2021-03-04T06:03:05Z\",
   ...
   ~~~

  I don't know if this is related to the other "unable to sync: Operation cannot be ..." messages;
  however, I think it means that the current configuration is not good - there is no "kind: Config" entry.
  So, I asked my customer to share the output of the following command:

   $ oc get configs.samples.operator.openshift.io -o yaml

  They ran it and shared the output as the "1527_1_configs.samples.operator.openshift.io.yaml" file.
  Unfortunately, I could not find any issue in the file.
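
  For context on the repeated "Operation cannot be fulfilled" message in item 2: that is the API server's standard optimistic-concurrency conflict, returned when an update carries a stale resourceVersion. Controllers normally recover by re-reading the object and retrying. Below is a minimal, self-contained Go sketch of that pattern using client-go's retry.RetryOnConflict; the updateConfig function is a hypothetical stand-in that simulates one stale write, not the operator's actual client code.

   ~~~
   package main

   import (
   	"fmt"

   	apierrors "k8s.io/apimachinery/pkg/api/errors"
   	"k8s.io/apimachinery/pkg/runtime/schema"
   	"k8s.io/client-go/util/retry"
   )

   var attempts int

   // updateConfig is a hypothetical stand-in for the operator's typed client
   // Update() call. It rejects the first attempt with the same Conflict error
   // quoted in the logs above, then succeeds.
   func updateConfig() error {
   	attempts++
   	if attempts == 1 {
   		return apierrors.NewConflict(
   			schema.GroupResource{Group: "samples.operator.openshift.io", Resource: "configs"},
   			"cluster",
   			fmt.Errorf("the object has been modified; please apply your changes to the latest version and try again"))
   	}
   	return nil
   }

   func main() {
   	// RetryOnConflict retries the closure on Conflict errors instead of
   	// logging the same "unable to sync" error and requeuing indefinitely.
   	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
   		// A real controller would re-GET the object here to pick up the
   		// latest resourceVersion, DeepCopy it, and re-apply its changes.
   		return updateConfig()
   	})
   	fmt.Println("attempts:", attempts, "err:", err) // attempts: 2 err: <nil>
   }
   ~~~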

[Request]

I think the following are the important requests from the customer:

1. Provide a fix for these issues.
2. Explain why these issues/messages happened:

 |   1. Why are the logs being output even though the openshift-samples cluster operator is not degraded?
 |   2. What is the impact on OCP cluster behavior if this message is output?
 |   3. How can the log output be prevented?
 |   4. Is the goroutine panic message related to the error message in the cluster-samples-operator?
 |   5. If not, what is the reason for the goroutine panic message?


[Business Impact]

The customer business impact is as follows:
---
They are running a system composed of 3scale and Fuse on OCP.
Therefore, following the Fuse on OpenShift guide (*1), they have added the Fuse imagestreams to the "Skipped Imagestreams" section and the Fuse and Spring Boot 2 templates to the "Skipped Templates" section.

*1:https://access.redhat.com/documentation/en-us/red_hat_fuse/7.8/html/fuse_on_openshift_guide/get-started-admin#install-fuse-on-openshift4
---

Comment 11 Adam Kaplan 2021-04-16 16:25:27 UTC
Root cause:

A race condition in the samples operator causes the kubernetes client to panic [1].
In OCP 4.6, the samples operator reports imagestream import failures by using multiple goroutines to update the same status object.
During this process, the object retrieved from a shared cache is mutated.
Copying the status object from the cache in each goroutine eliminates this panic.

In OCP 4.7 and later, the samples operator changed how it reported failures by writing failed imports to multiple ConfigMaps (one per failed imagestream).
This change significantly lowers the probability that this bug would occur, but it does not eliminate the root cause.
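
To illustrate the difference between the buggy and fixed update paths, here is a minimal, self-contained Go sketch. The Config type, its fields, and the function names are hypothetical stand-ins, not the operator's actual code; the real objects are OpenShift API types with generated DeepCopy methods, served out of a client-go SharedInformer cache.

~~~
package main

import "fmt"

// Config stands in for the samples operator's Config object.
type Config struct {
	SkippedImagestreams []string
	FailedImageStreams  map[string]bool
}

// DeepCopy mimics the generated deepcopy functions on OpenShift API types.
func (c *Config) DeepCopy() *Config {
	out := &Config{
		SkippedImagestreams: append([]string(nil), c.SkippedImagestreams...),
		FailedImageStreams:  make(map[string]bool, len(c.FailedImageStreams)),
	}
	for k, v := range c.FailedImageStreams {
		out.FailedImageStreams[k] = v
	}
	return out
}

// Buggy pattern: mutating the object the informer cache handed out.
// Concurrent goroutines doing this race on the map (the "concurrent map
// iteration and map write" fatal error above) and corrupt the shared cache.
func reportFailureUnsafe(cached *Config, stream string) *Config {
	cached.FailedImageStreams[stream] = true
	return cached
}

// Fixed pattern: copy first, mutate only the copy, and send the copy to
// the update call. The cached object is never touched, so the
// SharedInformer contract holds.
func reportFailureSafe(cached *Config, stream string) *Config {
	updated := cached.DeepCopy()
	updated.FailedImageStreams[stream] = true
	return updated // hand this to the typed client's Update(...) call
}

func main() {
	cached := &Config{FailedImageStreams: map[string]bool{}}
	updated := reportFailureSafe(cached, "jboss-datavirt64-driver-openshift-rhel7")
	fmt.Println(len(cached.FailedImageStreams), len(updated.FailedImageStreams)) // 0 1
}
~~~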

Workarounds:

If the samples operator has skipped imagestreams and/or templates, clear these [2]. Skipped imagestreams and templates can make this bug more likely to occur.

If this does not address the issue, set the samples operator management state to "Removed" [2].
This has the downside of removing all imagestreams and templates installed by the Samples Operator.
Setting the management state back to "Managed" will re-install the imagestreams and templates.
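
For reference, the management-state changes can be applied with patches like the following (illustrative commands; see [2] for the supported configuration procedure):

~~~
# Remove all imagestreams and templates managed by the samples operator:
$ oc patch configs.samples.operator.openshift.io cluster --type merge \
    -p '{"spec":{"managementState":"Removed"}}'

# Later, to re-install them:
$ oc patch configs.samples.operator.openshift.io cluster --type merge \
    -p '{"spec":{"managementState":"Managed"}}'
~~~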


[1] https://github.com/kubernetes/kubernetes/issues/82497
[2] https://docs.openshift.com/container-platform/4.7/openshift_images/configuring-samples-operator.html#samples-operator-configuration_configuring-samples-operator

Comment 12 Gabe Montero 2021-04-16 16:32:38 UTC
A quick follow-on to Adam's note:

Per https://access.redhat.com/documentation/en-us/red_hat_fuse/7.8/html/fuse_on_openshift_guide/get-started-admin#install-fuse-on-openshift4
the Fuse-related samples are being manually installed via oc apply after adding those entries to the skip list.

You still manually install the Fuse samples via https://access.redhat.com/documentation/en-us/red_hat_fuse/7.8/html/fuse_on_openshift_guide/get-started-admin#install-fuse-on-openshift4
after setting the samples operator to Removed.

Setting samples to Removed is a replacement for adding the Fuse samples to the skipped list.

In either case, customers have to manually apply the samples they care about, similar to what is articulated in
https://access.redhat.com/documentation/en-us/red_hat_fuse/7.8/html/fuse_on_openshift_guide/get-started-admin#install-fuse-on-openshift4.
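
For illustration, both the skip lists and the management state live in the same Config CR. A hypothetical sketch follows; the imagestream and template names below are placeholders, and the real ones come from the Fuse guide linked above:

~~~
apiVersion: samples.operator.openshift.io/v1
kind: Config
metadata:
  name: cluster
spec:
  managementState: Managed
  skippedImagestreams:
  - fuse7-java-openshift            # placeholder; use the names from the Fuse guide
  skippedTemplates:
  - s2i-fuse7-spring-boot-2-camel   # placeholder; use the names from the Fuse guide
~~~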

Comment 19 XiuJuan Wang 2021-04-20 08:55:06 UTC
Gabe, thanks for the info

Following the steps the customer did, the samples operator doesn't restart anymore, and none of the errors mentioned in the bug appear.

Verified on 4.8.0-0.nightly-2021-04-20-011453 cluster.

Comment 26 errata-xmlrpc 2021-07-27 23:00:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

