4.9 clusters are reporting panics in the insights operator. It appears that shutdown of the insights operator (which *might* be graceful?) is reporting a full panic stack trace because of the way klog is configured in the component. This triggers panic detection in CI, and the resulting noise is complicating detection of other panics. In general, a normal shutdown should not dump stack traces like this.

The insights operator team needs to investigate:

1. Is this a true failure, or a "normal shutdown"?
   a. Fix the failure.
   b. Stop reporting a stack trace.
2. Ensure the insights operator exits with exit code 0, not 1, on normal shutdown (all components are required to shut down gracefully).

This is high severity because it's generating noise in CI and components are expected to shut down cleanly. Please prioritize. It happens in a large number of recent jobs, for example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-network-stress/1417154237854388224

I0719 16:28:52.506704       1 configobserver.go:102] Found cloud.openshift.com token
I0719 16:28:52.506785       1 configobserver.go:120] Refreshing configuration from cluster secret
I0719 16:28:52.588744       1 configobserver.go:124] Support secret does not exist
E0719 16:29:12.481702       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
E0719 16:29:32.485746       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
E0719 16:29:52.482602       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
E0719 16:30:12.485425       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
I0719 16:30:16.129319       1 observer_polling.go:111] Observed file "/var/run/secrets/serving-cert/tls.crt" has been created (hash="74a6512bc804b8f77a48b4012d558f1da9e0f286c358768e4a03f8b26654d564")
W0719 16:30:16.129691       1 builder.go:122] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created
I0719 16:30:16.129757       1 observer_polling.go:111] Observed file "/var/run/secrets/serving-cert/tls.key" has been created (hash="733dee124dc86ef5ef77104972165018e45b4ee156069fda982564a4eaed713a")
I0719 16:30:16.129838       1 observer_polling.go:111] Observed file "/var/run/configmaps/service-ca-bundle/service-ca.crt" has been created (hash="39270c5349550443d12ceb3f98a004a103aca1d4ca01a7547c13762f11206b44")
I0719 16:30:16.129886       1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0719 16:30:16.129900       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0719 16:30:16.129904       1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0719 16:30:16.129953       1 dynamic_serving_content.go:145] Shutting down serving-cert::/tmp/serving-cert-312932287/tls.crt::/tmp/serving-cert-312932287/tls.key
I0719 16:30:16.129968       1 tlsconfig.go:255] Shutting down DynamicServingCertificateController
I0719 16:30:16.129909       1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0719 16:30:16.130016       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0719 16:30:16.130060       1 secure_serving.go:241] Stopped listening on [::]:8443
I0719 16:30:16.130095       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172
I0719 16:30:16.131211       1 builder.go:271] server exited
I0719 16:30:21.129592       1 observer_polling.go:162] Shutting down file observer
I0719 16:30:32.481851       1 operator.go:137] Unable to check insights-operator pod status. Setting initial delay to 11m16.69318209s
F0719 16:30:32.482265       1 start.go:167] unable to set initial cluster status: context canceled
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc000919400, 0x63, 0x1e9)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x33464e0, 0xc000000003, 0x0, 0x0, 0xc000590070, 0x29861ee, 0x8, 0xa7, 0x414600)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x33464e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc000435300, 0x1, 0x1)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:733 +0x16f
k8s.io/klog/v2.(*loggingT).print(...)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:715
k8s.io/klog/v2.Fatal(...)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:1489
github.com/openshift/insights-operator/pkg/cmd/start.runOperator.func1(0xc00016a2c0, 0xc0001683e0, 0x0, 0x2)
    /go/src/github.com/openshift/insights-operator/pkg/cmd/start/start.go:167 +0xb25
github.com/spf13/cobra.(*Command).execute(0xc00016a2c0, 0xc0001683c0, 0x2, 0x2, 0xc00016a2c0, 0xc0001683c0)
    /go/src/github.com/openshift/insights-operator/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc00016a000, 0x21a0e44, 0x4, 0x0)
    /go/src/github.com/openshift/insights-operator/vendor/github.com/spf13/cobra/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/src/github.com/openshift/insights-operator/vendor/github.com/spf13/cobra/command.go:895
main.main()
    /go/src/github.com/openshift/insights-operator/cmd/insights-operator/main.go:29 +0x195
goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x33464e0)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
created by k8s.io/klog/v2.init.0
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:418 +0xdf
goroutine 23 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x22be0f8, 0x245dec0, 0xc0002a9a40, 0x1, 0xc0000a40c0)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x118
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x22be0f8, 0x12a05f200, 0x0, 0xc00044e001, 0xc0000a40c0)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/apimachinery/pkg/util/wait.Forever(0x22be0f8, 0x12a05f200)
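For anyone wondering where the goroutine dump comes from: that is klog's Fatal path, which writes the F-prefixed line, collects the stacks of all goroutines, and then exits with code 255. A minimal sketch that reproduces the same shape of output (the function name and error value are illustrative stand-ins, not the operator's actual code):

    package main

    import (
        "context"

        "k8s.io/klog/v2"
    )

    // setInitialClusterStatus is a hypothetical stand-in for the status
    // update that failed in the log above.
    func setInitialClusterStatus(ctx context.Context) error {
        return ctx.Err() // context.Canceled once cancel() has run
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        cancel() // simulate the shutdown signal winning the race with startup

        if err := setInitialClusterStatus(ctx); err != nil {
            // Writes an F-prefixed line, dumps every goroutine's stack,
            // and exits with code 255.
            klog.Fatalf("unable to set initial cluster status: %v", err)
        }
    }

Running this prints the F line followed by the goroutine stacks, which is exactly the pattern CI's panic detection keys on.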
Interesting. It looks like the IO calls klog.Fatal (the call has always been there, so it's curious that it's only showing up now :)). klog.Fatal logs the stack traces of all goroutines and then calls os.Exit(255), so I guess we should avoid calling klog.Fatal.
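A minimal sketch of that direction, assuming a runOperator-style entry point (all names here are hypothetical, not the operator's actual code): treat context cancellation as a normal shutdown, skip the stack dump, and reserve a non-zero exit for real failures.

    package main

    import (
        "context"
        "errors"
        "os"
        "os/signal"
        "syscall"

        "k8s.io/klog/v2"
    )

    // runOperator is a hypothetical stand-in for the operator's run loop;
    // here it just blocks until the context is canceled.
    func runOperator(ctx context.Context) error {
        <-ctx.Done()
        return ctx.Err()
    }

    func main() {
        // Cancel the context on SIGINT/SIGTERM, as in a normal pod shutdown.
        ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
        defer stop()

        err := runOperator(ctx)
        switch {
        case err == nil, errors.Is(err, context.Canceled):
            // Normal shutdown: no stack dump, exit code 0 as the report requires.
            klog.Info("graceful shutdown")
            klog.Flush()
        default:
            // Real failure: log it without klog.Fatal (which would dump every
            // goroutine's stack and exit 255) and exit non-zero.
            klog.Error(err)
            klog.Flush()
            os.Exit(1)
        }
    }

klog.Error reports the failure without the goroutine dump, and exiting 0 on context.Canceled satisfies requirement 2 from the report.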
I checked the history of the aws-network-stress job at https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-network-stress and didn't find any "Undiagnosed panic detected in pod" failures attributed to the insights operator.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759