Bug 1984047

Summary: insights-operator logs a panic on shutdown, triggering panic detection in CI jobs
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Insights Operator
Assignee: Tomas Remes <tremes>
Status: CLOSED ERRATA
QA Contact: Dmitry Misharov <dmisharo>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.9
CC: aos-bugs, inecas, mklika, tremes
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-18 17:40:26 UTC
Type: Bug

Description Clayton Coleman 2021-07-20 13:52:53 UTC
4.9 clusters are reporting panics in the insights operator.  It appears that the insights operator's shutdown (which *might* be graceful?) is being reported as a full panic stack trace, due to the way klog is configured in the component.

This triggers panic detection in CI, which means we're getting noise that complicates detection of other panics.  In general, a normal shutdown shouldn't dump stack traces like this.

The insights operator team needs to investigate:

1. Is this a true failure, or a "normal shutdown"?
   a. If it's a true failure, fix it.
   b. If it's a normal shutdown, stop reporting a stack trace.
2. Ensure the insights operator exits with code 0, not 1, on normal shutdown (all components are required to shut down gracefully); a minimal illustrative sketch follows this list.
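
A minimal sketch of the behaviour item 2 asks for, assuming nothing about the real start-up code: cancel the operator's root context on SIGTERM and return from main (exit code 0) once shutdown completes. runOperator is a hypothetical stand-in for the operator's main loop.

package main

import (
	"context"
	"os/signal"
	"syscall"

	"k8s.io/klog/v2"
)

// runOperator is a hypothetical stand-in for the operator's main loop;
// it just blocks until the context is cancelled.
func runOperator(ctx context.Context) {
	<-ctx.Done()
}

func main() {
	// Cancel the root context when the kubelet sends SIGTERM (or Ctrl-C locally).
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	runOperator(ctx)

	// Graceful shutdown: log at Info level, flush, and let main return,
	// which yields exit code 0 and no stack trace for CI to flag.
	klog.Info("insights operator shut down cleanly")
	klog.Flush()
}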

This is marked high because it's generating noise in CI and components are expected to shut down cleanly.  Please prioritize.

Happens in a large number of recent jobs.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-network-stress/1417154237854388224

I0719 16:28:52.506704       1 configobserver.go:102] Found cloud.openshift.com token
I0719 16:28:52.506785       1 configobserver.go:120] Refreshing configuration from cluster secret
I0719 16:28:52.588744       1 configobserver.go:124] Support secret does not exist
E0719 16:29:12.481702       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
E0719 16:29:32.485746       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
E0719 16:29:52.482602       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
E0719 16:30:12.485425       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
I0719 16:30:16.129319       1 observer_polling.go:111] Observed file "/var/run/secrets/serving-cert/tls.crt" has been created (hash="74a6512bc804b8f77a48b4012d558f1da9e0f286c358768e4a03f8b26654d564")
W0719 16:30:16.129691       1 builder.go:122] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created
I0719 16:30:16.129757       1 observer_polling.go:111] Observed file "/var/run/secrets/serving-cert/tls.key" has been created (hash="733dee124dc86ef5ef77104972165018e45b4ee156069fda982564a4eaed713a")
I0719 16:30:16.129838       1 observer_polling.go:111] Observed file "/var/run/configmaps/service-ca-bundle/service-ca.crt" has been created (hash="39270c5349550443d12ceb3f98a004a103aca1d4ca01a7547c13762f11206b44")
I0719 16:30:16.129886       1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0719 16:30:16.129900       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0719 16:30:16.129904       1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0719 16:30:16.129953       1 dynamic_serving_content.go:145] Shutting down serving-cert::/tmp/serving-cert-312932287/tls.crt::/tmp/serving-cert-312932287/tls.key
I0719 16:30:16.129968       1 tlsconfig.go:255] Shutting down DynamicServingCertificateController
I0719 16:30:16.129909       1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0719 16:30:16.130016       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0719 16:30:16.130060       1 secure_serving.go:241] Stopped listening on [::]:8443
I0719 16:30:16.130095       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172
I0719 16:30:16.131211       1 builder.go:271] server exited
I0719 16:30:21.129592       1 observer_polling.go:162] Shutting down file observer
I0719 16:30:32.481851       1 operator.go:137] Unable to check insights-operator pod status. Setting initial delay to 11m16.69318209s
F0719 16:30:32.482265       1 start.go:167] unable to set initial cluster status: context canceled
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc000919400, 0x63, 0x1e9)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x33464e0, 0xc000000003, 0x0, 0x0, 0xc000590070, 0x29861ee, 0x8, 0xa7, 0x414600)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x33464e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc000435300, 0x1, 0x1)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:733 +0x16f
k8s.io/klog/v2.(*loggingT).print(...)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:715
k8s.io/klog/v2.Fatal(...)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:1489
github.com/openshift/insights-operator/pkg/cmd/start.runOperator.func1(0xc00016a2c0, 0xc0001683e0, 0x0, 0x2)
	/go/src/github.com/openshift/insights-operator/pkg/cmd/start/start.go:167 +0xb25
github.com/spf13/cobra.(*Command).execute(0xc00016a2c0, 0xc0001683c0, 0x2, 0x2, 0xc00016a2c0, 0xc0001683c0)
	/go/src/github.com/openshift/insights-operator/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc00016a000, 0x21a0e44, 0x4, 0x0)
	/go/src/github.com/openshift/insights-operator/vendor/github.com/spf13/cobra/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/insights-operator/vendor/github.com/spf13/cobra/command.go:895
main.main()
	/go/src/github.com/openshift/insights-operator/cmd/insights-operator/main.go:29 +0x195

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x33464e0)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
created by k8s.io/klog/v2.init.0
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:418 +0xdf

goroutine 23 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x22be0f8, 0x245dec0, 0xc0002a9a40, 0x1, 0xc0000a40c0)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x118
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x22be0f8, 0x12a05f200, 0x0, 0xc00044e001, 0xc0000a40c0)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/apimachinery/pkg/util/wait.Forever(0x22be0f8, 0x12a05f200)

Comment 1 Tomas Remes 2021-07-20 15:27:42 UTC
Interesting. It looks like the IO calls klog.Fatal (it's always been there, so it's interesting that it's showing up now :)), which logs the goroutine stack traces and then calls os.Exit(255), so I guess we should avoid calling klog.Fatal.
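
A minimal sketch of that change (illustrative only, not the actual operator code): treat a cancelled context as a normal shutdown and replace klog.Fatal with a non-fatal log plus an explicit exit, so no goroutine dump is produced. setInitialClusterStatus is a hypothetical stand-in for the failing call at start.go:167.

package main

import (
	"context"
	"errors"
	"os"

	"k8s.io/klog/v2"
)

// setInitialClusterStatus stands in for the call that failed with
// "context canceled" in the log above.
func setInitialClusterStatus(ctx context.Context) error {
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a shutdown already in progress

	if err := setInitialClusterStatus(ctx); err != nil {
		if errors.Is(err, context.Canceled) {
			// Shutdown race, not a real failure: log it and exit cleanly.
			klog.Infof("skipping initial cluster status during shutdown: %v", err)
			klog.Flush()
			os.Exit(0)
		}
		// Previously something like:
		//   klog.Fatalf("unable to set initial cluster status: %v", err)
		// klog.Fatal logs at F level, dumps the stacks of all goroutines,
		// and calls os.Exit(255). An Error log plus an explicit exit keeps
		// the failure visible without the stack dump.
		klog.Errorf("unable to set initial cluster status: %v", err)
		klog.Flush()
		os.Exit(1)
	}
}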

Comment 3 Dmitry Misharov 2021-08-03 10:54:54 UTC
I checked the history of the aws-network-stress job at https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-network-stress and I didn't find any "Undiagnosed panic detected in pod" failures for the insights operator.

Comment 6 errata-xmlrpc 2021-10-18 17:40:26 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759