Created attachment 1771872 [details] goroutine_panic.log

Description of problem:

My customer's OCP 4.6 cluster has three issues related to the cluster-samples-operator.

1. cluster-samples-operator restarts approximately two times per day, and a panic message for cluster-samples-operator appears in its log.

~~~
$ omg get pod -A
NAMESPACE                            NAME                                       READY  STATUS   RESTARTS  AGE
openshift-cluster-samples-operator   cluster-samples-operator-8c9f8b55f-8vrm9   2/2    Running  16        9d
~~~

From the attached file - goroutine_panic.log
[/var/log/containers/cluster-samples-operator-8c9f8b55f-8vrm9_openshift-cluster-samples-operator_cluster-samples-operator-4b01ba1606694d7a696283fc9c7f40f4de4d862f0570854097745fe6d38cefa4.log]

~~~
stderr F E0404 12:59:22.520853     227 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
stderr F goroutine 182 [running]:
stderr F k8s.io/apimachinery/pkg/util/runtime.logPanic(0x19d85a0, 0x2a5daf0)
stderr F 	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
stderr F k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
stderr F 	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
stderr F panic(0x19d85a0, 0x2a5daf0)
stderr F 	/usr/lib/golang/src/runtime/panic.go:969 +0x1b9
stderr F k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
stderr F 	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
stderr F panic(0x19d85a0, 0x2a5daf0)
stderr F 	/usr/lib/golang/src/runtime/panic.go:969 +0x1b9
stderr F strings.Count(0x0, 0x20, 0x1c23bf6, 0x1, 0xc000bc2918)
stderr F 	/usr/lib/golang/src/strings/strings.go:47 +0xf7
stderr F k8s.io/apimachinery/pkg/runtime/schema.ParseGroupVersion(0x0, 0x20, 0x999576, 0x19bd820, 0xc00038d080, 0xc000bc29a0, 0x2aba160, 0x1af9700)
stderr F 	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/runtime/schema/group_version.go:218 +0x97
stderr F k8s.io/apimachinery/pkg/runtime/schema.FromAPIVersionAndKind(0x0, 0x20, 0x184d242, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
stderr F 	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/runtime/schema/group_version.go:301 +0x5a
stderr F k8s.io/apimachinery/pkg/apis/meta/v1.(*TypeMeta).GroupVersionKind(0xc00039b440, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
stderr F 	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/meta.go:128 +0x5e
stderr F k8s.io/apimachinery/pkg/runtime.WithVersionEncoder.Encode(0x1e71bc0, 0xc000381920, 0x1e6c280, 0xc0000a0be0, 0x1e6c180, 0xc000728f50, 0x1e66900, 0xc00039b440, 0x1e57980, 0xc0009342d0, ...)
...
stderr P panic: 
stderr P runtime error: invalid memory address or nil pointer dereference
stderr F 
stderr P [signal 
stderr P SIGSEGV: segmentation violation
stderr P  code=
stderr P 0x1
stderr P  addr=
stderr P 0x0
stderr P  pc=
stderr P 0x407893
stderr F ]
stderr F 
stderr P goroutine 
stderr P 182
stderr P  [
stderr P running
stderr F ]:
stderr P k8s.io/apimachinery/pkg/util/runtime.HandleCrash
~~~

From must-gather.local.7397819650786332896/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-06b29bfc5a096fc5685e9e4a6970a39b2f8f205e4c6e2b861463f60e6424ca14/namespaces/openshift-cluster-samples-operator/pods/cluster-samples-operator-8c9f8b55f-8vrm9/cluster-samples-operator/cluster-samples-operator/logs/previous.log

~~~
2021-04-05T21:10:29.102906666Z time="2021-04-05T21:10:29Z" level=info msg="processing file jboss-datavirt64-driver-openshift-rhel7.json from dir /opt/openshift/operator/x86_64/datavirt/imagestreams"
2021-04-05T21:10:29.103074323Z fatal error: concurrent map iteration and map write
2021-04-05T21:10:29.104688429Z 
2021-04-05T21:10:29.104688429Z goroutine 111 [running]:
2021-04-05T21:10:29.104718122Z runtime.throw(0x1c5680e, 0x26)
2021-04-05T21:10:29.104718122Z 	/usr/lib/golang/src/runtime/panic.go:1116 +0x72 fp=0xc000235300 sp=0xc0002352d0 pc=0x43d8d2
2021-04-05T21:10:29.104753993Z runtime.mapiternext(0xc000235458)
2021-04-05T21:10:29.104753993Z 	/usr/lib/golang/src/runtime/map.go:853 +0x554 fp=0xc000235380 sp=0xc000235300 pc=0x416e34
2021-04-05T21:10:29.104781336Z github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).createSamples(0xc000628b40, 0xc0002f4480, 0x0, 0xc00024ecc0, 0xc00024ec90, 0x0, 0x0, 0x0)
2021-04-05T21:10:29.104808964Z 	/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:1051 +0xc5 fp=0xc0002354c8 sp=0xc000235380 pc=0x182cd45
~~~
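For reference, the second trace above ("fatal error: concurrent map iteration and map write" out of Handler.createSamples) is the class of failure where one goroutine ranges over a map while another goroutine writes to it. The following is a minimal, self-contained Go illustration of that failure class only; it is not the operator's actual code, and the map and key names are made up for the example.

~~~
// Illustration of "fatal error: concurrent map iteration and map write".
// The Go runtime detects the race and aborts the process; this error is
// unrecoverable and cannot be caught with recover().
package main

import "time"

func main() {
	skipped := map[string]bool{"fuse7-java-openshift": true, "fuse7-karaf-openshift": true}

	// Writer goroutine: mutates the shared map without synchronization.
	go func() {
		for i := 0; ; i++ {
			skipped["jboss-datavirt64-driver-openshift-rhel7"] = i%2 == 0
		}
	}()

	// Reader (main goroutine): iterates the same map concurrently.
	for {
		for name := range skipped {
			_ = name
		}
		time.Sleep(time.Microsecond)
	}
}
~~~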
2. We can find many repetitions of the same message, like the one below:

~~~
unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again, requeuing
~~~

~~~
$ omg logs pod/cluster-samples-operator-8c9f8b55f-8vrm9 -c cluster-samples-operator -n openshift-cluster-samples-operator > cluster-samples-operator-8c9f8b55f-8vrm9.log

$ grep "unable to sync" cluster-samples-operator-8c9f8b55f-8vrm9.log | wc
    111    3334   35780

$ grep "unable to sync" cluster-samples-operator-8c9f8b55f-8vrm9.log | head -3
2021-04-06T09:00:40.088528589Z time="2021-04-06T09:00:40Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
2021-04-06T09:00:40.090388797Z time="2021-04-06T09:00:40Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
2021-04-06T09:00:40.091979942Z time="2021-04-06T09:00:40Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"

$ grep "unable to sync" cluster-samples-operator-8c9f8b55f-8vrm9.log | tail -3
2021-04-06T09:21:04.473589730Z time="2021-04-06T09:21:04Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
2021-04-06T09:21:04.518428689Z time="2021-04-06T09:21:04Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
2021-04-06T09:21:07.861347232Z time="2021-04-06T09:21:07Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
~~~

3. We can also find "Object 'Kind' is missing" messages, like the one below:

From must-gather.local.7397819650786332896/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-06b29bfc5a096fc5685e9e4a6970a39b2f8f205e4c6e2b861463f60e6424ca14/namespaces/openshift-cluster-samples-operator/pods/cluster-samples-operator-8c9f8b55f-8vrm9/cluster-samples-operator/cluster-samples-operator/logs/current.log

~~~
2021-04-06T09:00:49.358894519Z time="2021-04-06T09:00:49Z" level=error msg=" unable to sync: Config in version \"v1\" cannot be handled as a Config: unmarshalerDecoder: Object 'Kind' is missing in '{\"metadata\": {\"name\":\"cluster\", \"selfLink\":\"/apis/samples.operator.openshift.io/v1/configs/cluster\", \"uid\":\"c98d98dd-919c-47d9-a646-05aff07ec738\", \"resourceVersion\":\"25736305\", \"generation\":2, \"creationTimestamp\":\"2021-03-04T06:03:05Z\", ...
~~~

I don't know whether this is related to the other "unable to sync: Operation cannot be ..." messages, but it suggests the current configuration is not well formed: there is no "kind: Config" entry.
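As a side note on item 2, the "Operation cannot be fulfilled ... the object has been modified" error is the standard Kubernetes optimistic-concurrency conflict returned when an update carries a stale resourceVersion. A hedged sketch of the usual client-go pattern that absorbs such conflicts is below; this is not the samples operator's actual code, and getLatestConfig/updateConfig are hypothetical placeholders for the real typed client calls against configs.samples.operator.openshift.io.

~~~
package main

import (
	"fmt"

	"k8s.io/client-go/util/retry"
)

// Config stands in for the configs.samples.operator.openshift.io "cluster"
// object; the real type lives in the OpenShift API packages.
type Config struct {
	ResourceVersion string
	ManagementState string
}

// Hypothetical placeholders for the real Get/Update calls.
func getLatestConfig() (*Config, error) { return &Config{}, nil }
func updateConfig(c *Config) error      { return nil }

func main() {
	// RetryOnConflict re-runs the closure whenever the update returns a
	// Conflict (HTTP 409), so each attempt starts from the latest copy of
	// the object instead of a stale resourceVersion.
	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cfg, err := getLatestConfig()
		if err != nil {
			return err
		}
		cfg.ManagementState = "Managed" // the mutation being applied
		return updateConfig(cfg)
	})
	if err != nil {
		fmt.Println("unable to sync:", err)
	}
}
~~~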
So, I asked my customer to share the output of the following command:

$ oc get configs.samples.operator.openshift.io -o yaml

They ran it and shared the output as the "1527_1_configs.samples.operator.openshift.io.yaml" file. Unfortunately, I could not find any issue in that file.

[Request]
I think the following are the important requests from the customer side:
1. Provide a fix for these issues.
2. Explain why these issues/messages happened:
| 1. Why are the logs being output even though the openshift-samples cluster operator is not degraded?
| 2. What is the impact on OCP cluster behavior when this message is output?
| 3. How can the log output be prevented?
| 4. Is the goroutine panic related to the error messages in the cluster-samples-operator?
| 5. If not, what is the cause of the goroutine panic?

[Business Impact]
The customer business impact is as follows:
---
They are running a system composed of 3scale and Fuse on OCP. Therefore, they have added the Fuse imagestreams to the "Skipped Imagestreams" section and the Fuse and Spring Boot 2 templates to the "Skipped Templates" section (*1).

*1: https://access.redhat.com/documentation/en-us/red_hat_fuse/7.8/html/fuse_on_openshift_guide/get-started-admin#install-fuse-on-openshift4
---
Root cause:

A race condition in the samples operator causes the kubernetes client to panic [1]. In OCP 4.6, the samples operator reports imagestream import failures by using multiple goroutines to update the same status object. During this process, the object retrieved from a shared cache is mutated. Copying the status object from the cache in each goroutine eliminates this panic.

In OCP 4.7 and later, the samples operator changed how it reports failures by writing failed imports to multiple ConfigMaps (one per failed imagestream). This change significantly lowers the probability that this bug will occur, but it does not eliminate the root cause.

Workarounds:

If the samples operator has skipped imagestreams and/or templates, clear these [2]. Skipped imagestreams and templates can make this bug more likely to occur.

If this does not address the issue, set the samples operator management state to "Removed" [2]. This has the downside of removing all imagestreams and templates installed by the samples operator. Setting the management state back to "Managed" re-installs the imagestreams and templates.

[1] https://github.com/kubernetes/kubernetes/issues/82497
[2] https://docs.openshift.com/container-platform/4.7/openshift_images/configuring-samples-operator.html#samples-operator-configuration_configuring-samples-operator
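To illustrate the "copy the object from the cache before mutating it" point above, here is a minimal Go sketch of that pattern. It is not the operator's actual code: the Config type, fetchFromCache helper, and field names are hypothetical stand-ins for the real lister and API types, which provide a generated DeepCopy for exactly this purpose.

~~~
package main

import "sync"

// Hypothetical stand-in for the samples Config API type.
type Config struct {
	Status struct {
		FailedImageStreams map[string]string
	}
}

// DeepCopy mimics the generated deepcopy helpers on real API types.
func (c *Config) DeepCopy() *Config {
	out := &Config{}
	out.Status.FailedImageStreams = make(map[string]string, len(c.Status.FailedImageStreams))
	for k, v := range c.Status.FailedImageStreams {
		out.Status.FailedImageStreams[k] = v
	}
	return out
}

var cached = &Config{} // stands in for the object held by the shared informer cache

// Listers return a pointer to the shared cached object, not a private copy.
func fetchFromCache() *Config { return cached }

func main() {
	cached.Status.FailedImageStreams = map[string]string{}

	var wg sync.WaitGroup
	for _, is := range []string{"fuse7-java-openshift", "jboss-datavirt64-driver-openshift-rhel7"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			// Wrong: mutating fetchFromCache() directly races with the other
			// goroutines and with the cache itself, which is what produced the
			// panics reported in this bug.
			// Right: deep-copy first, then mutate only the private copy.
			cfg := fetchFromCache().DeepCopy()
			cfg.Status.FailedImageStreams[name] = "import failed"
			// ...the copy would then be sent to the API server for update.
		}(is)
	}
	wg.Wait()
}
~~~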
A quick follow-on to Adam's note.

Per https://access.redhat.com/documentation/en-us/red_hat_fuse/7.8/html/fuse_on_openshift_guide/get-started-admin#install-fuse-on-openshift4, the Fuse-related samples are manually installed via oc apply after adding those entries to the skip list.

After setting the samples operator to "Removed", you still install the Fuse samples manually following that same guide. In other words, setting samples to "Removed" replaces adding the Fuse samples to the skipped list, but customers still have to manually apply the samples they care about, as described in the guide above.
Gabe, thanks for the info.

Following the steps the customer did, the samples operator no longer restarts, and the errors mentioned in this bug no longer appear.

Verified on a 4.8.0-0.nightly-2021-04-20-011453 cluster.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438