4.9 clusters are reporting panics in the insights operator. It appears that shutdown of the insights operator (which *might* be graceful?) is reporting a full panic stack trace because of the way klog is configured in the component. This triggers panic detection in CI, and the resulting noise is complicating detection of other panics. In general, a normal shutdown should not dump stack traces like this.

The insights operator team needs to investigate:

1. Is this a true failure, or a "normal shutdown"?
   a. Fix the failure.
   b. Stop reporting a stack trace.
2. Ensure the insights operator exits with exit code 0, not 1, on normal shutdown (all components are required to shut down gracefully).

This is high severity because it's generating noise in CI and components are expected to shut down cleanly. Please prioritize. It happens in a large number of recent jobs, for example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-network-stress/1417154237854388224

I0719 16:28:52.506704       1 configobserver.go:102] Found cloud.openshift.com token
I0719 16:28:52.506785       1 configobserver.go:120] Refreshing configuration from cluster secret
I0719 16:28:52.588744       1 configobserver.go:124] Support secret does not exist
E0719 16:29:12.481702       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
E0719 16:29:32.485746       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
E0719 16:29:52.482602       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
E0719 16:30:12.485425       1 operator.go:189] Couldn't get Insights Operator Pod to detect its status. Error: pods "insights-operator-7585bb899f-4cxzn" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "pods" in API group "" in the namespace "openshift-insights": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-reader" not found
I0719 16:30:16.129319       1 observer_polling.go:111] Observed file "/var/run/secrets/serving-cert/tls.crt" has been created (hash="74a6512bc804b8f77a48b4012d558f1da9e0f286c358768e4a03f8b26654d564")
W0719 16:30:16.129691       1 builder.go:122] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created
I0719 16:30:16.129757       1 observer_polling.go:111] Observed file "/var/run/secrets/serving-cert/tls.key" has been created (hash="733dee124dc86ef5ef77104972165018e45b4ee156069fda982564a4eaed713a")
I0719 16:30:16.129838       1 observer_polling.go:111] Observed file "/var/run/configmaps/service-ca-bundle/service-ca.crt" has been created (hash="39270c5349550443d12ceb3f98a004a103aca1d4ca01a7547c13762f11206b44")
I0719 16:30:16.129886       1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0719 16:30:16.129900       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0719 16:30:16.129904       1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0719 16:30:16.129953       1 dynamic_serving_content.go:145] Shutting down serving-cert::/tmp/serving-cert-312932287/tls.crt::/tmp/serving-cert-312932287/tls.key
I0719 16:30:16.129968       1 tlsconfig.go:255] Shutting down DynamicServingCertificateController
I0719 16:30:16.129909       1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0719 16:30:16.130016       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0719 16:30:16.130060       1 secure_serving.go:241] Stopped listening on [::]:8443
I0719 16:30:16.130095       1 reflector.go:225] Stopping reflector *v1.ConfigMap (12h0m0s) from k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172
I0719 16:30:16.131211       1 builder.go:271] server exited
I0719 16:30:21.129592       1 observer_polling.go:162] Shutting down file observer
I0719 16:30:32.481851       1 operator.go:137] Unable to check insights-operator pod status. Setting initial delay to 11m16.69318209s
F0719 16:30:32.482265       1 start.go:167] unable to set initial cluster status: context canceled
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc000919400, 0x63, 0x1e9)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x33464e0, 0xc000000003, 0x0, 0x0, 0xc000590070, 0x29861ee, 0x8, 0xa7, 0x414600)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printDepth(0x33464e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc000435300, 0x1, 0x1)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:733 +0x16f
k8s.io/klog/v2.(*loggingT).print(...)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:715
k8s.io/klog/v2.Fatal(...)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:1489
github.com/openshift/insights-operator/pkg/cmd/start.runOperator.func1(0xc00016a2c0, 0xc0001683e0, 0x0, 0x2)
    /go/src/github.com/openshift/insights-operator/pkg/cmd/start/start.go:167 +0xb25
github.com/spf13/cobra.(*Command).execute(0xc00016a2c0, 0xc0001683c0, 0x2, 0x2, 0xc00016a2c0, 0xc0001683c0)
    /go/src/github.com/openshift/insights-operator/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc00016a000, 0x21a0e44, 0x4, 0x0)
    /go/src/github.com/openshift/insights-operator/vendor/github.com/spf13/cobra/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/src/github.com/openshift/insights-operator/vendor/github.com/spf13/cobra/command.go:895
main.main()
    /go/src/github.com/openshift/insights-operator/cmd/insights-operator/main.go:29 +0x195
goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x33464e0)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
created by k8s.io/klog/v2.init.0
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/klog/v2/klog.go:418 +0xdf
goroutine 23 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x22be0f8, 0x245dec0, 0xc0002a9a40, 0x1, 0xc0000a40c0)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x118
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x22be0f8, 0x12a05f200, 0x0, 0xc00044e001, 0xc0000a40c0)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/apimachinery/pkg/util/wait.Forever(0x22be0f8, 0x12a05f200)
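For anyone wondering where the goroutine dump comes from: that is klog's Fatal path, which writes the F-prefixed line, collects the stacks of all goroutines, and then exits with code 255. A minimal sketch that reproduces the same shape of output (the function name and error value are illustrative stand-ins, not the operator's actual code):

    package main

    import (
        "context"

        "k8s.io/klog/v2"
    )

    // setInitialClusterStatus is a hypothetical stand-in for the status
    // update that failed in the log above.
    func setInitialClusterStatus(ctx context.Context) error {
        return ctx.Err() // context.Canceled once cancel() has run
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        cancel() // simulate the shutdown signal winning the race with startup

        if err := setInitialClusterStatus(ctx); err != nil {
            // Writes an F-prefixed line, dumps every goroutine's stack,
            // and exits with code 255.
            klog.Fatalf("unable to set initial cluster status: %v", err)
        }
    }

Running this prints the F line followed by the goroutine stacks, which is exactly the pattern CI's panic detection keys on.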
Interesting. It looks like the IO calls klog.Fatal (the call has always been there, so it's curious that it's only showing up now :)). klog.Fatal logs the stack traces of all goroutines and then calls os.Exit(255), so I guess we should avoid calling klog.Fatal.
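A minimal sketch of that direction, assuming a runOperator-style entry point (all names here are hypothetical, not the operator's actual code): treat context cancellation as a normal shutdown, skip the stack dump, and reserve a non-zero exit for real failures.

    package main

    import (
        "context"
        "errors"
        "os"
        "os/signal"
        "syscall"

        "k8s.io/klog/v2"
    )

    // runOperator is a hypothetical stand-in for the operator's run loop;
    // here it just blocks until the context is canceled.
    func runOperator(ctx context.Context) error {
        <-ctx.Done()
        return ctx.Err()
    }

    func main() {
        // Cancel the context on SIGINT/SIGTERM, as in a normal pod shutdown.
        ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
        defer stop()

        err := runOperator(ctx)
        switch {
        case err == nil, errors.Is(err, context.Canceled):
            // Normal shutdown: no stack dump, exit code 0 as the report requires.
            klog.Info("graceful shutdown")
            klog.Flush()
        default:
            // Real failure: log it without klog.Fatal (which would dump every
            // goroutine's stack and exit 255) and exit non-zero.
            klog.Error(err)
            klog.Flush()
            os.Exit(1)
        }
    }

klog.Error reports the failure without the goroutine dump, and exiting 0 on context.Canceled satisfies requirement 2 from the report.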
I checked the history of the aws-network-stress job at https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-network-stress and didn't find any "Undiagnosed panic detected in pod" failures attributed to the insights operator.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759