Bug 1891143

Summary:        CVO deadlocked while shutting down, shortly after fresh cluster install (metrics goroutine)
Product:        OpenShift Container Platform
Component:      Cluster Version Operator
Version:        4.6
Target Release: 4.7.0
Status:         CLOSED ERRATA
Severity:       high
Priority:       high
Reporter:       Michael Hrivnak <mhrivnak>
Assignee:       W. Trevor King <wking>
QA Contact:     Yang Yang <yanyang>
CC:             aos-bugs, bleanhar, itsoiref, jokerman, lmohanty, ohochman, rfreiman, sasha, wking, yanyang
Hardware:       Unspecified
OS:             Unspecified
Type:           Bug
Doc Type:       Bug Fix
Doc Text:
  Cause: Changes in 4.6 introduced a race where the cluster-version operator could hang on shutdown due to a metrics-serving goroutine.
  Consequence: CVO behavior like managed-object reconciliation and monitoring would no longer happen. Updates and installs could hang.
  Fix: The CVO now times out after a few minutes, abandoning any hung metrics goroutines.
  Result: The CVO no longer hangs this way during graceful shutdown attempts.
Clones:         1892267
Bug Blocks:     1892267, 1896584
Last Closed:    2021-02-24 15:27:53 UTC
Key goroutine from the stack trace:

goroutine 1 [select, 1462 minutes]:
github.com/openshift/cluster-version-operator/pkg/start.(*Options).run(0xc0000fc000, 0x1c1e720, 0xc000044078, 0xc0000ad9a0, 0xc00042b040)
	/go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:279 +0x66c
github.com/openshift/cluster-version-operator/pkg/start.(*Options).Run(0xc0000fc000, 0x1c1e720, 0xc000044078, 0x0, 0x1)
	/go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:145 +0x2f8
main.init.3.func1(0xc000016000, 0xc00069b800, 0x0, 0x6)
	/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:23 +0xed
github.com/spf13/cobra.(*Command).execute(0xc000016000, 0xc00069b7a0, 0x6, 0x6, 0xc000016000, 0xc00069b7a0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:846 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x269d1e0, 0x168221e, 0x269d1e0, 0xc000127f40)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:887
main.main()
	/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:26 +0x53

Checking the rc.4 commit:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.6.0-rc.4-x86_64 | grep cluster-version
cluster-version-operator https://github.com/openshift/cluster-version-operator c35a4e1e37fdf3559d208cae97cf5c9ded9064a7

Checking the code around start.go:279:

$ git cat-file -p c35a4e1e37fdf3:pkg/start/start.go | grep -n . | grep -B1 -A15 279:
278:	} else { // shutting down
279:		select {
280:		case <-shutdownTimer.C: // never triggers after the channel is stopped, although it would not matter much if it did because subsequent cancel calls do nothing.
281:			shutdownCancel()
282:			shutdownTimer.Stop()
283:		case result := <-resultChannel:
284:			resultChannelCount--
285:			if result.error == nil {
286:				klog.Infof("Collected %s goroutine.", result.name)
287:			} else {
288:				klog.Errorf("Collected %s goroutine: %v", result.name, result.error)
289:			}
290:			if result.name == "main operator" {
291:				postMainCancel()
292:			}
293:		}
294:	}

More from the logs:

...
I1022 20:55:27.334586 1 cvo.go:332] Shutting down ClusterVersionOperator
...
I1022 20:55:27.334610 1 start.go:286] Collected main operator goroutine.
I1022 20:55:27.334632 1 start.go:260] Waiting on 1 outstanding goroutines.
...

So we got the main operator, and are still waiting for somebody. Who? Also from the stacks:

goroutine 86 [chan receive, 1462 minutes]:
github.com/openshift/cluster-version-operator/pkg/cvo.RunMetrics(0x1c1e6e0, 0xc00178eb00, 0x1c1e6e0, 0xc00178eac0, 0x19e3ec3, 0xc, 0xc0006c4a80, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/metrics.go:154 +0x2ba
github.com/openshift/cluster-version-operator/pkg/start.(*Options).run.func3(0x1c1e6e0, 0xc00178eb00, 0x1c1e6e0, 0xc00178eac0, 0xc0000fc000, 0xc0006c4a80, 0xc0017d4000)
	/go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:210 +0xc8
created by github.com/openshift/cluster-version-operator/pkg/start.(*Options).run
	/go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:208 +0x3ad

Checking on that:

$ git cat-file -p c35a4e1e37fdf3:pkg/cvo/metrics.go | grep -n . | grep -B1 -A9 154
153:	if shutdown {
154:		err := <-errorChannel
155:		errorChannelCount--
156:		if err != nil && err != http.ErrServerClosed && err != cmux.ErrListenerClosed {
157:			if loopError == nil {
158:				loopError = err
159:			} else if err != nil { // log the error we are discarding
160:				klog.Errorf("Failed to gracefully shut down metrics server: %s", err)
161:			}
162:		}
163:	} else {

So the main metrics goroutine is waiting on some children. Back to the stacks:

goroutine 148 [chan receive, 1481 minutes]:
github.com/cockroachdb/cmux.muxListener.Accept(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/cockroachdb/cmux/cmux.go:184
crypto/tls.(*listener).Accept(0xc0002a8160, 0x18eb440, 0xc000c5a0f0, 0x17be300, 0x268fde0)
	/usr/lib/golang/src/crypto/tls/tls.go:67 +0x37
net/http.(*Server).Serve(0xc000a560e0, 0x1c09fe0, 0xc0002a8160, 0x0, 0x0)
	/usr/lib/golang/src/net/http/server.go:2937 +0x266
github.com/openshift/cluster-version-operator/pkg/cvo.RunMetrics.func2(0x1c0a0a0, 0xc0006a85c0, 0xc0006c4a80, 0x19e3ec3, 0xc, 0xc0017d4c00, 0xc000a560e0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/metrics.go:141 +0x1b4
created by github.com/openshift/cluster-version-operator/pkg/cvo.RunMetrics
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/metrics.go:138 +0x894

Checking on that:

$ git cat-file -p c35a4e1e37fdf3:pkg/cvo/metrics.go | grep -n . | grep -B4 -A1 141
137:	errorChannelCount++
138:	go func() {
139:		tlsListener := tls.NewListener(mux.Match(cmux.Any()), tlsConfig)
140:		klog.Infof("Metrics port listening for HTTPS on %v", listenAddress)
141:		errorChannel <- server.Serve(tlsListener)
142:	}()

Our 'server.Shutdown(shutdownContext)' call *should* be killing that one. But maybe there's some issue in that cancel... Also weird about this case is that I would expect the kubelet to come in and SIGQUIT us 130s after the SIGTERM:

$ git --no-pager grep terminationGracePeriodSeconds c35a4e1e37fdf3 -- bootstrap install
c35a4e1e37fdf3:bootstrap/bootstrap-pod.yaml:      terminationGracePeriodSeconds: 130
c35a4e1e37fdf3:install/0000_00_cluster-version-operator_03_deployment.yaml:      terminationGracePeriodSeconds: 130

Checking on the rc.3 -> rc.4 CI run to ensure that's getting set [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1316106094371672064/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator-")).spec.terminationGracePeriodSeconds'
130

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1316106094371672064

Looking at 1.15.3's Go implementation (I'm not sure exactly which Go version we used to build rc.4, but I expect this Go code is fairly stable by now):

* Server.Serve() calls Accept() with no Context [1].
* Server.Shutdown is [2].

Ah, I see we failed part of that and have:

E1022 20:57:27.334266 1 metrics.go:160] Failed to gracefully shut down metrics server: accept tcp [::]:9099: use of closed network connection

in the logs. Hmm...
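To make the shape of the hang easier to see, here is a minimal, self-contained sketch of the pattern being discussed: a child goroutine runs server.Serve and reports its result on an error channel, while the shutting-down parent calls server.Shutdown and then blocks receiving from that channel. This is illustrative only — runMetricsServer, errCh, and the plain net.Listen listener are stand-ins, not the CVO's real symbols, which wrap the listener with cmux and TLS. The point is that if Serve never returns, the receive that corresponds to metrics.go:154 waits forever.

// Minimal sketch of the serve/collect pattern discussed above. Names
// (runMetricsServer, errCh) are illustrative, not the CVO's real symbols.
package main

import (
	"context"
	"errors"
	"log"
	"net"
	"net/http"
	"time"
)

func runMetricsServer(ctx context.Context, addr string) error {
	server := &http.Server{Addr: addr, Handler: http.NewServeMux()}

	listener, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}

	errCh := make(chan error, 1)
	go func() {
		// Child goroutine: Serve blocks in Accept until the listener
		// is closed or fails. This is the analogue of metrics.go:141.
		errCh <- server.Serve(listener)
	}()

	<-ctx.Done() // shutting down

	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := server.Shutdown(shutdownCtx); err != nil {
		log.Printf("Failed to gracefully shut down metrics server: %v", err)
	}

	// Analogue of metrics.go:154: if Serve never returns (e.g. Accept on a
	// wrapped listener does not unblock), this receive hangs forever.
	err = <-errCh
	if errors.Is(err, http.ErrServerClosed) {
		return nil
	}
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := runMetricsServer(ctx, "127.0.0.1:9099"); err != nil {
		log.Fatal(err)
	}
}

Normally Shutdown closes the listener, Accept fails, Serve returns http.ErrServerClosed, and the receive completes promptly. The stacks above instead show the cmux-wrapped Accept still blocked long after Shutdown, which is what keeps goroutine 86 (and therefore goroutine 1) alive.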
[1]: https://github.com/golang/go/blob/go1.15.3/src/net/http/server.go#L2937 [2]: https://github.com/golang/go/blob/go1.15.3/src/net/http/server.go#L2679-L2722 A few more notes from the cluster where this happened: $ oc -n openshift-cluster-version get events LAST SEEN TYPE REASON OBJECT MESSAGE 65m Normal Created pod/cluster-version-operator-859f4cccc5-hxjn9 Created container cluster-version-operator 65m Normal Started pod/cluster-version-operator-859f4cccc5-hxjn9 Started container cluster-version-operator 65m Normal Pulled pod/cluster-version-operator-859f4cccc5-hxjn9 Container image "quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5" already present on machine 64m Normal LeaderElection configmap/version master-0-1_7a8dcc82-dddb-49a6-bd3d-4c1a4ea49a4d became leader 64m Normal RetrievePayload clusterversion/version retrieving payload version="4.6.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5" 64m Normal VerifyPayload clusterversion/version verifying payload version="4.6.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5" 64m Normal PayloadLoaded clusterversion/version payload loaded version="4.6.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5" $ oc -n openshift-cluster-version get -o wide pods NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES cluster-version-operator-859f4cccc5-hxjn9 1/1 Running 3 26h 192.168.123.100 master-0-1 <none> <none> $ oc -n openshift-cluster-version get -o json pod cluster-version-operator-859f4cccc5-hxjn9 | jq '.status.containerStatuses[].lastState.terminated' { "containerID": "cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9", "exitCode": 2, "finishedAt": "2020-10-23T21:18:35Z", "message": "1e6e0, 0xc00178ea80, 0x1c34000, 0xc0001112a0, 0xc0010cbdc0)\n\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:361 +0x85\ngithub.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2(0x1c1e6e0, 0xc00178ea80)\n\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x79\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0x37\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000445f50)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0010cbf50, 0x1bd9dc0, 0xc001018030, 0xc00178ea01, 0xc00147a1e0)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000445f50, 0x3b9aca00, 0x0, 0x1, 0xc00147a1e0)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00, 0x0, 0x1ab2d01)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0xa6\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 
0x3b9aca00)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99 +0x57\ncreated by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run\n\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x4c6\n\nrax 0xca\nrbx 0x26b17e0\nrcx 0x473dc3\nrdx 0x0\nrdi 0x26b1928\nrsi 0x80\nrbp 0x7ffe380a64d0\nrsp 0x7ffe380a6488\nr8 0x0\nr9 0x0\nr10 0x0\nr11 0x286\nr12 0x51945a144962\nr13 0x1\nr14 0x51945a144962\nr15 0xc00026f260\nrip 0x473dc1\nrflags 0x286\ncs 0x33\nfs 0x0\ngs 0x0\n", "reason": "Error", "startedAt": "2020-10-22T20:37:29Z" } The CVO got the TERM at 20:55:27. Give 27s grace and gather node logs: $ oc adm node-logs --since '2020-10-23 20:55:00' master-0-1 | gzip >node.logs.gz $ zcat node.logs.gz | head -n3 Data from the specified boot (-1) is not available: No such boot ID in journal -- Logs begin at Fri 2020-10-23 06:12:54 UTC, end at Fri 2020-10-23 22:44:39 UTC. -- Oct 23 20:55:00.001013 master-0-1 hyperkube[1974]: I1023 20:55:00.000912 1974 prober.go:159] Exec-Probe Pod: mdns-publisher-master-0-1, Container: mdns-publisher, Command: [pgrep mdns-publisher] $ zcat node.logs.gz | tail -n1 Oct 23 22:44:45.938151 master-0-1 hyperkube[1974]: ] Kubelet logs for the container get a bit odd around teardown: [root@titan91 ~]# zgrep '^Oct 23 21:1[789].*c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9' node.logs.gz Oct 23 21:17:56.661544 master-0-1 hyperkube[1974]: I1023 21:17:56.661437 1974 status_manager.go:429] Ignoring same status for pod "cluster-version-operator-859f4cccc5-hxjn9_openshift-cluster-version(0d6165c6-fd5b-4220-a786-1008b9c9a346)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-22 20:26:29 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-22 20:37:29 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-22 20:37:29 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-22 20:26:29 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.123.100 PodIP:192.168.123.100 PodIPs:[{IP:192.168.123.100}] StartTime:2020-10-22 20:26:29 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:cluster-version-operator State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-10-22 20:37:29 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-10-22 20:34:28 +0000 UTC,FinishedAt:2020-10-22 20:37:14 +0000 UTC,ContainerID:cri-o://329b20cba46290b9b260c649576036c4659f8da924cde36111f9c7e826d1507f,}} Ready:true RestartCount:2 Image:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ImageID:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9 Started:0xc0042cad10}] QOSClass:Burstable EphemeralContainerStatuses:[]} Oct 23 21:18:35.422468 master-0-1 systemd[1]: crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope: Consumed 1min 44.086s CPU time Oct 23 21:18:35.429264 master-0-1 systemd[1]: 
crio-conmon-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope: Consumed 1.215s CPU time Oct 23 21:18:35.429472 master-0-1 hyperkube[1974]: I1023 21:18:35.429361 1974 manager.go:1044] Destroyed container: "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-conmon-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope" (aliases: [], namespace: "") Oct 23 21:18:35.429552 master-0-1 hyperkube[1974]: I1023 21:18:35.429434 1974 handler.go:325] Added event &{/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-conmon-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope 2020-10-23 21:18:35.429409201 +0000 UTC m=+89713.490310038 containerDeletion {<nil>}} Oct 23 21:18:36.230341 master-0-1 hyperkube[1974]: I1023 21:18:36.230298 1974 generic.go:155] GenericPLEG: 0d6165c6-fd5b-4220-a786-1008b9c9a346/c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9: running -> exited Oct 23 21:18:36.233285 master-0-1 hyperkube[1974]: I1023 21:18:36.233238 1974 kubelet.go:1872] SyncLoop (PLEG): "cluster-version-operator-859f4cccc5-hxjn9_openshift-cluster-version(0d6165c6-fd5b-4220-a786-1008b9c9a346)", event: &pleg.PodLifecycleEvent{ID:"0d6165c6-fd5b-4220-a786-1008b9c9a346", Type:"ContainerDied", Data:"c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9"} Oct 23 21:18:36.233908 master-0-1 hyperkube[1974]: I1023 21:18:36.233907 1974 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9 Oct 23 21:18:36.257345 master-0-1 hyperkube[1974]: I1023 21:18:36.256949 1974 status_manager.go:564] Patch status for pod "cluster-version-operator-859f4cccc5-hxjn9_openshift-cluster-version(0d6165c6-fd5b-4220-a786-1008b9c9a346)" with "{\"metadata\":{\"uid\":\"0d6165c6-fd5b-4220-a786-1008b9c9a346\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2020-10-23T21:18:36Z\",\"message\":\"containers with unready status: [cluster-version-operator]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2020-10-23T21:18:36Z\",\"message\":\"containers with unready status: [cluster-version-operator]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9\",\"image\":\"quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5\",\"imageID\":\"quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5\",\"lastState\":{\"terminated\":{\"containerID\":\"cri-o://329b20cba46290b9b260c649576036c4659f8da924cde36111f9c7e826d1507f\",\"exitCode\":0,\"finishedAt\":\"2020-10-22T20:37:14Z\",\"reason\":\"Completed\",\"startedAt\":\"2020-10-22T20:34:28Z\"}},\"name\":\"cluster-version-operator\",\"ready\":false,\"restartCount\":2,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9\",\"exitCode\":2,\"finishedAt\":\"2020-10-23T21:18:35Z\",\"message\":\"1e6e0, 0xc00178ea80, 0x1c34000, 0xc0001112a0, 
0xc0010cbdc0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:361 +0x85\\ngithub.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2(0x1c1e6e0, 0xc00178ea80)\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x79\\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0x37\\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000445f50)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f\\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0010cbf50, 0x1bd9dc0, 0xc001018030, 0xc00178ea01, 0xc00147a1e0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad\\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000445f50, 0x3b9aca00, 0x0, 0x1, 0xc00147a1e0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98\\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00, 0x0, 0x1ab2d01)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0xa6\\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99 +0x57\\ncreated by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x4c6\\n\\nrax 0xca\\nrbx 0x26b17e0\\nrcx 0x473dc3\\nrdx 0x0\\nrdi 0x26b1928\\nrsi 0x80\\nrbp 0x7ffe380a64d0\\nrsp 0x7ffe380a6488\\nr8 0x0\\nr9 0x0\\nr10 0x0\\nr11 0x286\\nr12 0x51945a144962\\nr13 0x1\\nr14 0x51945a144962\\nr15 0xc00026f260\\nrip 0x473dc1\\nrflags 0x286\\ncs 0x33\\nfs 0x0\\ngs 0x0\\n\",\"reason\":\"Error\",\"startedAt\":\"2020-10-22T20:37:29Z\"}}}]}}" Oct 23 21:18:36.257666 master-0-1 hyperkube[1974]: ,StartedAt:2020-10-22 20:37:29 +0000 UTC,FinishedAt:2020-10-23 21:18:35 +0000 UTC,ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9,}} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-10-22 20:34:28 +0000 UTC,FinishedAt:2020-10-22 20:37:14 +0000 UTC,ContainerID:cri-o://329b20cba46290b9b260c649576036c4659f8da924cde36111f9c7e826d1507f,}} Ready:false RestartCount:2 Image:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ImageID:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9 Started:0xc0043ee91f}] QOSClass:Burstable EphemeralContainerStatuses:[]}) Oct 23 21:18:37.280077 master-0-1 hyperkube[1974]: I1023 21:18:37.279503 1974 status_manager.go:564] Patch status for pod "cluster-version-operator-859f4cccc5-hxjn9_openshift-cluster-version(0d6165c6-fd5b-4220-a786-1008b9c9a346)" with 
"{\"metadata\":{\"uid\":\"0d6165c6-fd5b-4220-a786-1008b9c9a346\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2020-10-23T21:18:37Z\",\"message\":null,\"reason\":null,\"status\":\"True\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2020-10-23T21:18:37Z\",\"message\":null,\"reason\":null,\"status\":\"True\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://48c8d8512c47b968c5c54a751b095a12542faedebd11b0981d3a980c01b89794\",\"image\":\"quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5\",\"imageID\":\"quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5\",\"lastState\":{\"terminated\":{\"containerID\":\"cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9\",\"exitCode\":2,\"finishedAt\":\"2020-10-23T21:18:35Z\",\"message\":\"1e6e0, 0xc00178ea80, 0x1c34000, 0xc0001112a0, 0xc0010cbdc0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:361 +0x85\\ngithub.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2(0x1c1e6e0, 0xc00178ea80)\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x79\\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0x37\\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000445f50)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f\\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0010cbf50, 0x1bd9dc0, 0xc001018030, 0xc00178ea01, 0xc00147a1e0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad\\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000445f50, 0x3b9aca00, 0x0, 0x1, 0xc00147a1e0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98\\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00, 0x0, 0x1ab2d01)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0xa6\\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99 +0x57\\ncreated by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x4c6\\n\\nrax 0xca\\nrbx 0x26b17e0\\nrcx 0x473dc3\\nrdx 0x0\\nrdi 0x26b1928\\nrsi 0x80\\nrbp 0x7ffe380a64d0\\nrsp 0x7ffe380a6488\\nr8 0x0\\nr9 0x0\\nr10 0x0\\nr11 0x286\\nr12 0x51945a144962\\nr13 0x1\\nr14 0x51945a144962\\nr15 0xc00026f260\\nrip 0x473dc1\\nrflags 0x286\\ncs 0x33\\nfs 0x0\\ngs 0x0\\n\",\"reason\":\"Error\",\"startedAt\":\"2020-10-22T20:37:29Z\"}},\"name\":\"cluster-version-operator\",\"ready\":true,\"restartCount\":3,\"started\":true,\"state\":{\"running\":{\"startedAt\":\"2020-10-23T21:18:36Z\"}}}]}}" Oct 23 21:18:37.280354 master-0-1 hyperkube[1974]: ,StartedAt:2020-10-22 20:37:29 +0000 UTC,FinishedAt:2020-10-23 21:18:35 +0000 
UTC,ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9,}} Ready:true RestartCount:3 Image:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ImageID:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ContainerID:cri-o://48c8d8512c47b968c5c54a751b095a12542faedebd11b0981d3a980c01b89794 Started:0xc005744329}] QOSClass:Burstable EphemeralContainerStatuses:[]}) Oct 23 21:18:48.612410 master-0-1 hyperkube[1974]: I1023 21:18:48.612390 1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory Oct 23 21:19:06.379613 master-0-1 hyperkube[1974]: I1023 21:19:06.372544 1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory Oct 23 21:19:27.101168 master-0-1 hyperkube[1974]: I1023 21:19:27.100940 1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory Oct 23 21:19:41.748571 master-0-1 hyperkube[1974]: I1023 21:19:41.748539 1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory Oct 23 21:19:48.661544 master-0-1 hyperkube[1974]: ,StartedAt:2020-10-22 20:37:29 +0000 UTC,FinishedAt:2020-10-23 21:18:35 +0000 UTC,ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9,}} Ready:true RestartCount:3 Image:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ImageID:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 
ContainerID:cri-o://48c8d8512c47b968c5c54a751b095a12542faedebd11b0981d3a980c01b89794 Started:0xc001acc4d9}] QOSClass:Burstable EphemeralContainerStatuses:[]}
Oct 23 21:19:53.178050 master-0-1 hyperkube[1974]: I1023 21:19:53.177959 1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory

$ zgrep c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9 node.logs.gz | tail -n1
Oct 23 22:44:29.413229 master-0-1 hyperkube[1974]: I1023 22:44:29.413184 1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory

That's either the kubelet or CRI-O banging away for at least another hour and change on 'Unable to get Process Stats...cgroup.procs: no such file or directory'. Bug 1879205 seems to be tracking the kubelet/CRI-O "Unable to get Process Stats" issue.

This is the first week of sprint, so adding UpcomingSprint.

I have no idea how to trigger the lock that the patch is guarding against. I guess run 'oc -n openshift-cluster-version logs -f cluster-version-operator-...' in one terminal, and then 'oc -n openshift-cluster-version delete pod cluster-version-operator-...' in another to trigger the TERM and graceful shutdown. Watch the logs to ensure we get messages about the metrics children being collected, and hope you get lucky and see the "Abandoning ... uncollected metrics goroutines" message that shows the safety valve blowing open.

I'm trying to reproduce it on 4.6, but no luck so far. Installing on bare metal succeeds, and following the approach described in comment#7 does not trigger the lock. It happened during assisted-installer installations on 4.6 rc.4.

FYI, we are continuing to hit this problem with bare metal installs through assisted-installer. It does not happen every time, but it continues to happen sometimes. Today we reproduced it on 4.6.1. I would consider this very important to release as a 4.6 bugfix, since users can now install 4.6 on bare metal with assisted-installer through cloud.redhat.com.

This bug targets 4.7. It is expected that we continue to have the issue on 4.6 until this bug is verified and we can backport the patch to 4.6.z. To get the bug verified in 4.7, please connect QE with your assisted-installer flow and demonstrate that recent 4.7 nightlies no longer hang this way.

Michael Hrivnak, I have no idea about assisted-installer. Could you please direct me to the document which describes the procedure to install with assisted-installer so that I can verify it? And could you please try with the latest 4.7 nightly build? Thanks.
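The "Abandoning ... uncollected metrics goroutines" safety valve mentioned in the reproduction suggestion above is, per this bug's Doc Text, a timeout that lets shutdown proceed even when a metrics child never reports back. Here is a minimal sketch of that pattern with invented names (collectOrAbandon, result); it illustrates the idea, not the code from the actual CVO patch.

// Illustrative sketch of a shutdown "safety valve": wait for child
// goroutines to report in, but give up after a deadline instead of
// hanging forever. Names here are invented for the example.
package main

import (
	"log"
	"time"
)

type result struct {
	name string
	err  error
}

// collectOrAbandon waits for `outstanding` results, but abandons any
// goroutines that have not reported back when the timer fires.
func collectOrAbandon(results <-chan result, outstanding int, grace time.Duration) {
	deadline := time.NewTimer(grace)
	defer deadline.Stop()

	for outstanding > 0 {
		select {
		case r := <-results:
			outstanding--
			if r.err != nil {
				log.Printf("Collected %s goroutine: %v", r.name, r.err)
			} else {
				log.Printf("Collected %s goroutine.", r.name)
			}
		case <-deadline.C:
			log.Printf("Abandoning %d uncollected goroutines and exiting anyway.", outstanding)
			return
		}
	}
}

func main() {
	results := make(chan result, 2)
	results <- result{name: "main operator"} // this child reports promptly
	// A second, hung child never sends, so the 3-second valve opens.
	collectOrAbandon(results, 2, 3*time.Second)
}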
There is some context about the assisted installer in [1], although I'm not sure if there is enough there for folks to know how to launch assisted installs.

[1]: https://github.com/openshift/enhancements/blob/master/enhancements/installer/connected-assisted-installer.md

We see that the CVO loses the leader lease due to etcd instability (another issue); when it loses the lease it stops all its workers and then gets stuck on its way to shutting down:

I1027 17:03:46.786426 1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version
....
E1027 17:04:08.520679 1 memcache.go:111] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1027 17:04:11.810461 1 leaderelection.go:321] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.ocp-cluster-titan91-0.qe.lab.redhat.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context deadline exceeded
I1027 17:04:11.810579 1 leaderelection.go:278] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition
I1027 17:04:11.810737 1 request.go:834] Error in request: resource name may not be empty
E1027 17:04:11.810782 1 leaderelection.go:297] Failed to release lock: resource name may not be empty
......
I1027 17:04:11.810804 1 start.go:250] Stopped leading; shutting down.
....
I1027 17:05:19.814616 1 sync_worker.go:274] Stopped worker
I1027 17:05:19.814643 1 sync_worker.go:361] Worker shut down
E1027 17:05:26.046541 1 event.go:273] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
E1027 17:05:36.047031 1 event.go:273] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
E1027 17:05:46.047377 1 event.go:273] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
I1027 17:05:47.472264 1 discovery.go:214] Invalidating discovery information
E1027 17:05:56.047658 1 event.go:273] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
E1027 17:05:56.047765 1 event.go:218] Unable to write event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".1641e833df74ab68", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"LeaderElection", Message:"master-0-1_9ee38044-6899-4502-a707-c2baa80bf7ed stopped leading", Source:v1.EventSource{Component:"openshift-cluster-version", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfde32a2f04ffd68, ext:226013344971, loc:(*time.Location)(0x26b0400)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfde32a2f04ffd68, ext:226013344971, loc:(*time.Location)(0x26b0400)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}},
Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}' (retry limit exceeded!)
I1027 17:06:11.811551 1 start.go:260] Waiting on 1 outstanding goroutines.
E1027 17:06:11.811621 1 metrics.go:160] Failed to gracefully shut down metrics server: accept tcp [::]:9099: use of closed network connection
I1027 17:06:47.472342 1 discovery.go:214] Invalidating discovery information
I1027 17:07:47.472132 1 discovery.go:214] Invalidating discovery information
I1027 17:08:47.472617 1 discovery.go:214] Invalidating discovery information
I1027 17:09:47.472244 1 discovery.go:214] Invalidating discovery information
I1027 17:10:47.472262 1 discovery.go:214] Invalidating discovery information

Maybe I'm missing something, but it looks like there is a missing return at this line: https://github.com/openshift/cluster-version-operator/blob/7ace0d3f0d83c87b003b9c9fac262e2888d90d92/pkg/cvo/metrics.go#L192 ?

(In reply to W. Trevor King from comment #14)
> This bug targets 4.7. It is expected that we continue to have the issue on
> 4.6 until this bug is verified and we can backport the patch to 4.6.z. To
> get the bug verified in 4.7, please connect QE with your assisted-installer
> flow and demonstrate that recent 4.7 nightlies no longer hang this way.

This bug happens with 4.6, and it should stay assigned until we get the fix available in 4.6.z builds; we can clone this bug to 4.7 and put the clone to ON_QA for QE verification against 4.7. At the moment assisted-installer is deploying 4.6.1 builds, so this bug is the default behavior, and it is a TestBlocker for QE: https://bugzilla.redhat.com/show_bug.cgi?id=1889970

I'll clone to 4.7 and move this one back to Assigned.

"Target Release" here is 4.7.0, and has been since I set it at 2020-10-23 23:43:37 UTC. The 4.7/master PR has landed already, linking *this bug* as the one it was addressing. This bug needs to remain the 4.7-targeting bug. Moving back to ON_QA. Once the 4.7-targeting bug is fixed, but not before, we can land a PR targeting 4.6. The cherry-pick tooling will clone a new bug for that. Please leave this bug in ON_QA going forward, so we can get it VERIFIED and unblock the 4.6 backport.

*** Bug 1892267 has been marked as a duplicate of this bug. ***

(In reply to Igal Tsoiref from comment #18)
> maybe i miss something but looks like there is missing return in this line
> https://github.com/openshift/cluster-version-operator/blob/
> 7ace0d3f0d83c87b003b9c9fac262e2888d90d92/pkg/cvo/metrics.go#L192 ?
If we fail to shutdown the server, I think we still want to wait for at most runContext to reap the child goroutines. But if folks show that a failed Shutdown(...) call means we _never_ collect any child goroutines, then yeah, a return there would make sense. It's not clear from the log excerpt in comment 19 what version of the CVO was generating the logs. Do you still have the first line of that log file with the CVO version?

I1027 17:00:25.844829 1 start.go:21] ClusterVersionOperator 4.6.0-202010100331.p0-c35a4e1

(In reply to Yang Yang from comment #30)
> I tried 3 times including 2 concurrent installations on baremetal using
> 4.7.0-0.ci-2020-10-27-124430 and the cluster is installed successfully for
> all of the 3 times.

The issue is that etcd is not stable; it is re-electing a leader while installation is in progress.

(In reply to Igal Tsoiref from comment #31)
> I1027 17:00:25.844829 1 start.go:21] ClusterVersionOperator 4.6.0-202010100331.p0-c35a4e1

Yup. Still no patches in that 4.6 branch until we either get this 4.7-targeting bug VERIFIED or decide to open the 4.6 backport early and use pre-merge verification [1] on the 4.6 backport.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1876486#c7

Summarizing the current situation, since I haven't been doing a very good job at clear messaging:

1. 4.6 and master/4.7 have a CVO bug where a stuck metrics-serving goroutine can hang the CVO in shutdown, breaking install and generally gumming things up whenever we trip that hang.
2. [1] has landed in master/4.7, in an attempt to prevent the hang.
3. Something (presumably unrelated) is keeping ART from building 4.6 and 4.7 nightlies, so we have no nightlies with which to verify [1].
4. The assisted-installer flow is the simplest approach to reproduce the condition that causes the hang.
5. There is some sort of issue that makes it hard to plug 4.7 releases into the assisted-installer flow.
6. We want to fix this on 4.6 too.
7. We don't want to land changes in 4.6 that don't actually fix the problem.
8. We usually prevent released-version churn around broken fixes by verifying fixes in master-ward branches before allowing the changes into released branches, e.g. verifying a 4.7 fix before allowing the 4.6 backport to land.
9. We have a new process around pre-merge verification [2], but I'm not particularly familiar with it.

So, possible paths forward:

a. i. Minimal sanity testing (comment 7, comment 30) of the 4.7 release. Mark this bug VERIFIED to address (8).
   ii. Create and land the 4.6 backport to address (6).
   iii. Assisted-install 4.6 verification to see if we actually fixed the issue (possibly constrained by (3); I dunno if assisted-install tooling is set up to handle CI builds).
   iv. If we fail verification, revert the 4.6 backport before the next 4.6 nightly is picked up for a 4.6.z, and back to the drawing board for 4.7/master fixes.

b. i. Figure out how to reliably trip the hang in 4.7.
   ii. Confirm that the landed 4.7/master change fixes the hang, or continue to iterate with 4.7/master PRs until we can confirm the hang is fixed.
   iii. Create and land the 4.6 backport to address (6).
   iv. Assisted-install 4.6 verification to confirm that the 4.6 backport also fixes the hang.

c. i. Create, but do not yet land, the 4.6 backport.
   ii. Pre-merge verification of the 4.6 PR, to confirm that the backport fixes the hang.
   iii. Mark this bug VERIFIED, assuming that if the backport fixes the 4.6 hang, the 4.7/master [1] with the same code fixes the same issue in 4.7.
   iv. The VERIFIED 4.7 bug unblocks the 4.6 backport PR landing, so land it.
   v. Assisted-install, post-merge 4.6 verification to sanity-check c.ii's pre-merge verification.

I'm not clear which of those we are pursuing. It's really up to QE to let us know which route forward seems best to them.

[1]: https://github.com/openshift/cluster-version-operator/pull/477
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1876486#c7

@wking referring to bullet 5: you can use /test e2e-metal-assisted to run assisted e2e coverage on 4.7. The problem is that it's not a very common failure. We do want this backport into 4.6 though - it seems it started happening more frequently once AI moved to installing rc/GA releases.
Only the cluster-etcd-operator and assisted-test-infra repositories:
release$ git grep -l 'trigger.*test.*e2e-metal-assisted' origin/master | sed 's|[^/]*$||' | uniq
origin/master:ci-operator/jobs/openshift-priv/assisted-test-infra/
origin/master:ci-operator/jobs/openshift-priv/cluster-etcd-operator/
origin/master:ci-operator/jobs/openshift/assisted-test-infra/
origin/master:ci-operator/jobs/openshift/cluster-etcd-operator/
But the 4.7 PR has landed. If folks want to bang away at the assisted job in a dummy PR against either of those repositories, and they feel that failing to reproduce the hang after $N attempts would be enough to mark this 4.7-targeting bug VERIFIED, that's fine with me. I'm pretty confident in my patch, but I'm not QE, and it's up to QE to decide what they're comfortable with as grounds for marking this bug VERIFIED.
I'm still waiting for a nightly build to verify with. If the assisted-installer folks can't wait for the 4.6 backport to land, I'm okay with the pre-merge flow. I did a minimal sanity test (comment#30) against a CI build and it works fine.

ART made an attempt to build a new 4.7 nightly [1], but it failed on:

error: unable to create a release: operator "cluster-storage-operator" contained an invalid image-references file: no input image tag named "openstack-cinder-csi-driver-operator"

Moving back to MODIFIED while we wait for them to complete a build.

[1]: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-10-31-214011

The new 4.7 nightly build [1] also fails to create:

error: unable to create a release: operator "cluster-storage-operator" contained an invalid image-references file: no input image tag

[1]: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-01-092917

I tried installation 4 times on bare metal using 4.7.0-0.nightly-2020-11-10-032055 and all attempts passed, so I'm moving this to the VERIFIED state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
Created attachment 1723852 [details]
CVO log

Description of problem:

While installing a bare metal cluster with assisted-installer, about 30 minutes after the first node became available, and well after the bootstrap node had been shut down, the CVO itself began to shut down and became deadlocked. I don't know what caused the CVO to shut down. In the attached log, you can see that starting at 20:55:27.

There are numerous errors logged similar to this:

"Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)"

The CVO stopped updating the ClusterVersion and just hung for 24 hours, logging once per minute "discovery.go:214] Invalidating discovery information".

Version-Release number of the following components:

OCP 4.6.0-rc4

Additional info:

The operator log is attached. At the end I sent it SIGQUIT to force a stacktrace.
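One practical note on the SIGQUIT trick used here: a Go program's default SIGQUIT handling prints every goroutine's stack and then exits, which is how the dump analyzed in the comments above was produced. If you want the same dump without killing the process, something like the following works; it is a generic sketch (the SIGUSR1 handler is hypothetical and not something the CVO itself provides):

// Hedged example: dump all goroutine stacks on SIGUSR1 without exiting.
// This is not part of the CVO; it just illustrates the same kind of
// stack dump that SIGQUIT produced in this bug's investigation.
package main

import (
	"os"
	"os/signal"
	"runtime"
	"syscall"
	"time"
)

func dumpStacksOnSignal() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			buf := make([]byte, 1<<20) // 1 MiB is usually enough for all stacks
			n := runtime.Stack(buf, true)
			os.Stderr.Write(buf[:n])
		}
	}()
}

func main() {
	dumpStacksOnSignal()
	// Simulate a long-running process; `kill -USR1 <pid>` prints the stacks.
	time.Sleep(10 * time.Minute)
}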