Bug 1891143

Summary: CVO deadlocked while shutting down, shortly after fresh cluster install (metrics goroutine)
Product: OpenShift Container Platform Reporter: Michael Hrivnak <mhrivnak>
Component: Cluster Version Operator    Assignee: W. Trevor King <wking>
Status: CLOSED ERRATA QA Contact: Yang Yang <yanyang>
Severity: high Docs Contact:
Priority: high    
Version: 4.6CC: aos-bugs, bleanhar, itsoiref, jokerman, lmohanty, ohochman, rfreiman, sasha, wking, yanyang
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Cause: Changes in 4.6 introduced a race where the cluster-version operator could hang on shutdown due to a metrics-serving goroutine.
Consequence: CVO behavior like managed-object reconciliation and monitoring would no longer happen. Updates and installs could hang.
Fix: The CVO now times out after a few minutes, abandoning any hung metrics goroutines.
Result: The CVO no longer hangs this way during graceful shutdown attempts.
Story Points: ---
Clone Of:
: 1892267    Environment:
Last Closed: 2021-02-24 15:27:53 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 1892267, 1896584    
Attachments:
CVO log (flags: none)

Description Michael Hrivnak 2020-10-23 21:41:50 UTC
Created attachment 1723852 [details]
CVO log

Description of problem:

While installing a bare metal cluster with the assisted-installer, about 30 minutes after the first node became available and well after the bootstrap node had been shut down, the CVO began to shut down and became deadlocked.

I don't know what caused the CVO itself to shut down. In the attached log, you can see it start at 20:55:27.

There are numerous errors logged similar to this:

"Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)"

The CVO stopped updating ClusterVersion and just hung for 24 hours, logging "discovery.go:214] Invalidating discovery information" once per minute.


Version-Release number of the following components:
OCP 4.6.0-rc4


Additional info:
The operator log is attached. At the end I sent it SIGQUIT to force a stack trace.

Comment 1 W. Trevor King 2020-10-23 22:02:00 UTC
Key goroutine from the stack trace:

goroutine 1 [select, 1462 minutes]:
github.com/openshift/cluster-version-operator/pkg/start.(*Options).run(0xc0000fc000, 0x1c1e720, 0xc000044078, 0xc0000ad9a0, 0xc00042b040)
	/go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:279 +0x66c
github.com/openshift/cluster-version-operator/pkg/start.(*Options).Run(0xc0000fc000, 0x1c1e720, 0xc000044078, 0x0, 0x1)
	/go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:145 +0x2f8
main.init.3.func1(0xc000016000, 0xc00069b800, 0x0, 0x6)
	/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:23 +0xed
github.com/spf13/cobra.(*Command).execute(0xc000016000, 0xc00069b7a0, 0x6, 0x6, 0xc000016000, 0xc00069b7a0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:846 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x269d1e0, 0x168221e, 0x269d1e0, 0xc000127f40)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:887
main.main()
	/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:26 +0x53

Checking the rc.4 commit:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.6.0-rc.4-x86_64 | grep cluster-version
    cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       c35a4e1e37fdf3559d208cae97cf5c9ded9064a7

Checking the code around start.go:279:

  $ git cat-file -p c35a4e1e37fdf3:pkg/start/start.go | grep -n . | grep -B1 -A15 279:
  278:    } else { // shutting down
  279:      select {
  280:      case <-shutdownTimer.C: // never triggers after the channel is stopped, although it would not matter much if it did because subsequent cancel calls do nothing.
  281:        shutdownCancel()
  282:        shutdownTimer.Stop()
  283:      case result := <-resultChannel:
  284:        resultChannelCount--
  285:        if result.error == nil {
  286:          klog.Infof("Collected %s goroutine.", result.name)
  287:        } else {
  288:          klog.Errorf("Collected %s goroutine: %v", result.name, result.error)
  289:        }
  290:        if result.name == "main operator" {
  291:          postMainCancel()
  292:        }
  293:      }
  294:    }

More from the logs:

  ...
  I1022 20:55:27.334586       1 cvo.go:332] Shutting down ClusterVersionOperator
  ...
  I1022 20:55:27.334610       1 start.go:286] Collected main operator goroutine.
  I1022 20:55:27.334632       1 start.go:260] Waiting on 1 outstanding goroutines.
  ...

So we got the main operator, and are still waiting for somebody.  Who?  Also from the stacks:

goroutine 86 [chan receive, 1462 minutes]:
github.com/openshift/cluster-version-operator/pkg/cvo.RunMetrics(0x1c1e6e0, 0xc00178eb00, 0x1c1e6e0, 0xc00178eac0, 0x19e3ec3, 0xc, 0xc0006c4a80, 0x0, 0x0)
        /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/metrics.go:154 +0x2ba
github.com/openshift/cluster-version-operator/pkg/start.(*Options).run.func3(0x1c1e6e0, 0xc00178eb00, 0x1c1e6e0, 0xc00178eac0, 0xc0000fc000, 0xc0006c4a80, 0xc0017d4000)
        /go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:210 +0xc8
created by github.com/openshift/cluster-version-operator/pkg/start.(*Options).run
        /go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:208 +0x3ad

Checking on that:

  $ git cat-file -p c35a4e1e37fdf3:pkg/cvo/metrics.go | grep -n . | grep -B1 -A9 154
  153:		if shutdown {
  154:			err := <-errorChannel
  155:			errorChannelCount--
  156:			if err != nil && err != http.ErrServerClosed && err != cmux.ErrListenerClosed {
  157:				if loopError == nil {
  158:					loopError = err
  159:				} else if err != nil { // log the error we are discarding
  160:					klog.Errorf("Failed to gracefully shut down metrics server: %s", err)
  161:				}
  162:			}
  163:		} else {

So the main metrics goroutine is waiting on some children.  Back to the stacks:

goroutine 148 [chan receive, 1481 minutes]:
github.com/cockroachdb/cmux.muxListener.Accept(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/cockroachdb/cmux/cmux.go:184
crypto/tls.(*listener).Accept(0xc0002a8160, 0x18eb440, 0xc000c5a0f0, 0x17be300, 0x268fde0)
        /usr/lib/golang/src/crypto/tls/tls.go:67 +0x37
net/http.(*Server).Serve(0xc000a560e0, 0x1c09fe0, 0xc0002a8160, 0x0, 0x0)
        /usr/lib/golang/src/net/http/server.go:2937 +0x266
github.com/openshift/cluster-version-operator/pkg/cvo.RunMetrics.func2(0x1c0a0a0, 0xc0006a85c0, 0xc0006c4a80, 0x19e3ec3, 0xc, 0xc0017d4c00, 0xc000a560e0)
        /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/metrics.go:141 +0x1b4
created by github.com/openshift/cluster-version-operator/pkg/cvo.RunMetrics
        /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/metrics.go:138 +0x894

Checking on that:

  $ git cat-file -p c35a4e1e37fdf3:pkg/cvo/metrics.go | grep -n . | grep -B4 -A1 141
  137:		errorChannelCount++
  138:		go func() {
  139:			tlsListener := tls.NewListener(mux.Match(cmux.Any()), tlsConfig)
  140:			klog.Infof("Metrics port listening for HTTPS on %v", listenAddress)
  141:			errorChannel <- server.Serve(tlsListener)
  142:		}()

Our 'server.Shutdown(shutdownContext)' call *should* be killing that one.  But maybe there's some issue in that cancel...
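
For reference, here is a minimal, self-contained sketch of the shape of this hang (not the actual CVO code; the server setup and channel name just mirror the excerpts above).  If the serving goroutine's Serve call never returns (for example because Accept stays stuck on a wrapped listener that Shutdown failed to unblock), the parent's receive blocks forever:

  package main

  import (
      "context"
      "log"
      "net"
      "net/http"
      "time"
  )

  func main() {
      ln, err := net.Listen("tcp", "127.0.0.1:0")
      if err != nil {
          log.Fatal(err)
      }
      server := &http.Server{}

      // Child goroutine reports Serve's result, mirroring
      // "errorChannel <- server.Serve(tlsListener)" above.
      errorChannel := make(chan error, 1)
      go func() {
          errorChannel <- server.Serve(ln)
      }()
      time.Sleep(100 * time.Millisecond) // let Serve start accepting

      // Parent attempts a graceful shutdown, then blocks until the child
      // reports back.  With a plain listener this completes; in the hang
      // above, goroutine 148 never returned from Serve, so the receive
      // goroutine 86 performs at metrics.go:154 never completed.
      shutdownContext, cancel := context.WithTimeout(context.Background(), 2*time.Second)
      defer cancel()
      if err := server.Shutdown(shutdownContext); err != nil {
          log.Printf("Failed to gracefully shut down metrics server: %v", err)
      }
      log.Printf("Collected metrics-serving goroutine: %v", <-errorChannel)
  }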

Comment 2 W. Trevor King 2020-10-23 22:07:56 UTC
Also weird about this case is that I would expect the kubelet to come in and SIGKILL us 130s after the SIGTERM:

  $ git --no-pager grep terminationGracePeriodSeconds  c35a4e1e37fdf3 -- bootstrap install
  c35a4e1e37fdf3:bootstrap/bootstrap-pod.yaml:  terminationGracePeriodSeconds: 130
  c35a4e1e37fdf3:install/0000_00_cluster-version-operator_03_deployment.yaml:      terminationGracePeriodSeconds: 130

Checking on the rc.3->rc.4 CI run to ensure that's getting set [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1316106094371672064/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator-")).spec.terminationGracePeriodSeconds'
  130

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1316106094371672064

Comment 3 W. Trevor King 2020-10-23 22:22:09 UTC
Looking at Go 1.15.3's implementation (I'm not sure exactly which Go version we used to build rc.4, but I expect this code is fairly stable by now):

* Server.Serve() calls Accept() with no Context [1].
* Server.Shutdown is [2].  Ah, I see we failed part of that and have:

    E1022 20:57:27.334266       1 metrics.go:160] Failed to gracefully shut down metrics server: accept tcp [::]:9099: use of closed network connection

  in the logs.

Hmm...

[1]: https://github.com/golang/go/blob/go1.15.3/src/net/http/server.go#L2937
[2]: https://github.com/golang/go/blob/go1.15.3/src/net/http/server.go#L2679-L2722

Comment 4 W. Trevor King 2020-10-23 23:00:40 UTC
A few more notes from the cluster where this happened:

$ oc -n openshift-cluster-version get events
LAST SEEN   TYPE     REASON            OBJECT                                          MESSAGE
65m         Normal   Created           pod/cluster-version-operator-859f4cccc5-hxjn9   Created container cluster-version-operator
65m         Normal   Started           pod/cluster-version-operator-859f4cccc5-hxjn9   Started container cluster-version-operator
65m         Normal   Pulled            pod/cluster-version-operator-859f4cccc5-hxjn9   Container image "quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5" already present on machine
64m         Normal   LeaderElection    configmap/version                               master-0-1_7a8dcc82-dddb-49a6-bd3d-4c1a4ea49a4d became leader
64m         Normal   RetrievePayload   clusterversion/version                          retrieving payload version="4.6.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5"
64m         Normal   VerifyPayload     clusterversion/version                          verifying payload version="4.6.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5"
64m         Normal   PayloadLoaded     clusterversion/version                          payload loaded version="4.6.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5"
$ oc -n openshift-cluster-version get -o wide pods
NAME                                        READY   STATUS    RESTARTS   AGE   IP                NODE         NOMINATED NODE   READINESS GATES
cluster-version-operator-859f4cccc5-hxjn9   1/1     Running   3          26h   192.168.123.100   master-0-1   <none>           <none>
$ oc -n openshift-cluster-version get -o json pod cluster-version-operator-859f4cccc5-hxjn9 | jq '.status.containerStatuses[].lastState.terminated'
{
  "containerID": "cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9",
  "exitCode": 2,
  "finishedAt": "2020-10-23T21:18:35Z",
  "message": "1e6e0, 0xc00178ea80, 0x1c34000, 0xc0001112a0, 0xc0010cbdc0)\n\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:361 +0x85\ngithub.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2(0x1c1e6e0, 0xc00178ea80)\n\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x79\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0x37\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000445f50)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0010cbf50, 0x1bd9dc0, 0xc001018030, 0xc00178ea01, 0xc00147a1e0)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000445f50, 0x3b9aca00, 0x0, 0x1, 0xc00147a1e0)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00, 0x0, 0x1ab2d01)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0xa6\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00)\n\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99 +0x57\ncreated by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run\n\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x4c6\n\nrax    0xca\nrbx    0x26b17e0\nrcx    0x473dc3\nrdx    0x0\nrdi    0x26b1928\nrsi    0x80\nrbp    0x7ffe380a64d0\nrsp    0x7ffe380a6488\nr8     0x0\nr9     0x0\nr10    0x0\nr11    0x286\nr12    0x51945a144962\nr13    0x1\nr14    0x51945a144962\nr15    0xc00026f260\nrip    0x473dc1\nrflags 0x286\ncs     0x33\nfs     0x0\ngs     0x0\n",
  "reason": "Error",
  "startedAt": "2020-10-22T20:37:29Z"
}

The CVO got the TERM at 20:55:27.  Give 27s grace and gather node logs:

$ oc adm node-logs --since '2020-10-23 20:55:00' master-0-1 | gzip >node.logs.gz
$ zcat node.logs.gz | head -n3
Data from the specified boot (-1) is not available: No such boot ID in journal
-- Logs begin at Fri 2020-10-23 06:12:54 UTC, end at Fri 2020-10-23 22:44:39 UTC. --
Oct 23 20:55:00.001013 master-0-1 hyperkube[1974]: I1023 20:55:00.000912    1974 prober.go:159] Exec-Probe Pod: mdns-publisher-master-0-1, Container: mdns-publisher, Command: [pgrep mdns-publisher]
$ zcat node.logs.gz | tail -n1
Oct 23 22:44:45.938151 master-0-1 hyperkube[1974]: ]

Kubelet logs for the container get a bit odd around teardown:

[root@titan91 ~]# zgrep '^Oct 23 21:1[789].*c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9' node.logs.gz
Oct 23 21:17:56.661544 master-0-1 hyperkube[1974]: I1023 21:17:56.661437    1974 status_manager.go:429] Ignoring same status for pod "cluster-version-operator-859f4cccc5-hxjn9_openshift-cluster-version(0d6165c6-fd5b-4220-a786-1008b9c9a346)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-22 20:26:29 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-22 20:37:29 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-22 20:37:29 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-22 20:26:29 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.123.100 PodIP:192.168.123.100 PodIPs:[{IP:192.168.123.100}] StartTime:2020-10-22 20:26:29 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:cluster-version-operator State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-10-22 20:37:29 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-10-22 20:34:28 +0000 UTC,FinishedAt:2020-10-22 20:37:14 +0000 UTC,ContainerID:cri-o://329b20cba46290b9b260c649576036c4659f8da924cde36111f9c7e826d1507f,}} Ready:true RestartCount:2 Image:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ImageID:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9 Started:0xc0042cad10}] QOSClass:Burstable EphemeralContainerStatuses:[]}
Oct 23 21:18:35.422468 master-0-1 systemd[1]: crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope: Consumed 1min 44.086s CPU time
Oct 23 21:18:35.429264 master-0-1 systemd[1]: crio-conmon-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope: Consumed 1.215s CPU time
Oct 23 21:18:35.429472 master-0-1 hyperkube[1974]: I1023 21:18:35.429361    1974 manager.go:1044] Destroyed container: "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-conmon-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope" (aliases: [], namespace: "")
Oct 23 21:18:35.429552 master-0-1 hyperkube[1974]: I1023 21:18:35.429434    1974 handler.go:325] Added event &{/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-conmon-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope 2020-10-23 21:18:35.429409201 +0000 UTC m=+89713.490310038 containerDeletion {<nil>}}
Oct 23 21:18:36.230341 master-0-1 hyperkube[1974]: I1023 21:18:36.230298    1974 generic.go:155] GenericPLEG: 0d6165c6-fd5b-4220-a786-1008b9c9a346/c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9: running -> exited
Oct 23 21:18:36.233285 master-0-1 hyperkube[1974]: I1023 21:18:36.233238    1974 kubelet.go:1872] SyncLoop (PLEG): "cluster-version-operator-859f4cccc5-hxjn9_openshift-cluster-version(0d6165c6-fd5b-4220-a786-1008b9c9a346)", event: &pleg.PodLifecycleEvent{ID:"0d6165c6-fd5b-4220-a786-1008b9c9a346", Type:"ContainerDied", Data:"c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9"}
Oct 23 21:18:36.233908 master-0-1 hyperkube[1974]: I1023 21:18:36.233907    1974 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9
Oct 23 21:18:36.257345 master-0-1 hyperkube[1974]: I1023 21:18:36.256949    1974 status_manager.go:564] Patch status for pod "cluster-version-operator-859f4cccc5-hxjn9_openshift-cluster-version(0d6165c6-fd5b-4220-a786-1008b9c9a346)" with "{\"metadata\":{\"uid\":\"0d6165c6-fd5b-4220-a786-1008b9c9a346\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2020-10-23T21:18:36Z\",\"message\":\"containers with unready status: [cluster-version-operator]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2020-10-23T21:18:36Z\",\"message\":\"containers with unready status: [cluster-version-operator]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9\",\"image\":\"quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5\",\"imageID\":\"quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5\",\"lastState\":{\"terminated\":{\"containerID\":\"cri-o://329b20cba46290b9b260c649576036c4659f8da924cde36111f9c7e826d1507f\",\"exitCode\":0,\"finishedAt\":\"2020-10-22T20:37:14Z\",\"reason\":\"Completed\",\"startedAt\":\"2020-10-22T20:34:28Z\"}},\"name\":\"cluster-version-operator\",\"ready\":false,\"restartCount\":2,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9\",\"exitCode\":2,\"finishedAt\":\"2020-10-23T21:18:35Z\",\"message\":\"1e6e0, 0xc00178ea80, 0x1c34000, 0xc0001112a0, 0xc0010cbdc0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:361 +0x85\\ngithub.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2(0x1c1e6e0, 0xc00178ea80)\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x79\\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0x37\\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000445f50)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f\\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0010cbf50, 0x1bd9dc0, 0xc001018030, 0xc00178ea01, 0xc00147a1e0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad\\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000445f50, 0x3b9aca00, 0x0, 0x1, 0xc00147a1e0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98\\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00, 0x0, 0x1ab2d01)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0xa6\\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99 +0x57\\ncreated by 
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x4c6\\n\\nrax    0xca\\nrbx    0x26b17e0\\nrcx    0x473dc3\\nrdx    0x0\\nrdi    0x26b1928\\nrsi    0x80\\nrbp    0x7ffe380a64d0\\nrsp    0x7ffe380a6488\\nr8     0x0\\nr9     0x0\\nr10    0x0\\nr11    0x286\\nr12    0x51945a144962\\nr13    0x1\\nr14    0x51945a144962\\nr15    0xc00026f260\\nrip    0x473dc1\\nrflags 0x286\\ncs     0x33\\nfs     0x0\\ngs     0x0\\n\",\"reason\":\"Error\",\"startedAt\":\"2020-10-22T20:37:29Z\"}}}]}}"
Oct 23 21:18:36.257666 master-0-1 hyperkube[1974]: ,StartedAt:2020-10-22 20:37:29 +0000 UTC,FinishedAt:2020-10-23 21:18:35 +0000 UTC,ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9,}} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2020-10-22 20:34:28 +0000 UTC,FinishedAt:2020-10-22 20:37:14 +0000 UTC,ContainerID:cri-o://329b20cba46290b9b260c649576036c4659f8da924cde36111f9c7e826d1507f,}} Ready:false RestartCount:2 Image:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ImageID:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9 Started:0xc0043ee91f}] QOSClass:Burstable EphemeralContainerStatuses:[]})
Oct 23 21:18:37.280077 master-0-1 hyperkube[1974]: I1023 21:18:37.279503    1974 status_manager.go:564] Patch status for pod "cluster-version-operator-859f4cccc5-hxjn9_openshift-cluster-version(0d6165c6-fd5b-4220-a786-1008b9c9a346)" with "{\"metadata\":{\"uid\":\"0d6165c6-fd5b-4220-a786-1008b9c9a346\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2020-10-23T21:18:37Z\",\"message\":null,\"reason\":null,\"status\":\"True\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2020-10-23T21:18:37Z\",\"message\":null,\"reason\":null,\"status\":\"True\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://48c8d8512c47b968c5c54a751b095a12542faedebd11b0981d3a980c01b89794\",\"image\":\"quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5\",\"imageID\":\"quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5\",\"lastState\":{\"terminated\":{\"containerID\":\"cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9\",\"exitCode\":2,\"finishedAt\":\"2020-10-23T21:18:35Z\",\"message\":\"1e6e0, 0xc00178ea80, 0x1c34000, 0xc0001112a0, 0xc0010cbdc0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:361 +0x85\\ngithub.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2(0x1c1e6e0, 0xc00178ea80)\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x79\\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0x37\\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000445f50)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f\\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0010cbf50, 0x1bd9dc0, 0xc001018030, 0xc00178ea01, 0xc00147a1e0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad\\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000445f50, 0x3b9aca00, 0x0, 0x1, 0xc00147a1e0)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98\\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00, 0x0, 0x1ab2d01)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0xa6\\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1c1e6e0, 0xc00178ea80, 0xc0009c0070, 0x3b9aca00)\\n\\t/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99 +0x57\\ncreated by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run\\n\\t/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:311 +0x4c6\\n\\nrax    0xca\\nrbx    0x26b17e0\\nrcx    0x473dc3\\nrdx    0x0\\nrdi    0x26b1928\\nrsi    0x80\\nrbp    0x7ffe380a64d0\\nrsp    0x7ffe380a6488\\nr8     0x0\\nr9     0x0\\nr10    0x0\\nr11    0x286\\nr12    0x51945a144962\\nr13    0x1\\nr14    0x51945a144962\\nr15    0xc00026f260\\nrip    0x473dc1\\nrflags 0x286\\ncs     0x33\\nfs     0x0\\ngs     
0x0\\n\",\"reason\":\"Error\",\"startedAt\":\"2020-10-22T20:37:29Z\"}},\"name\":\"cluster-version-operator\",\"ready\":true,\"restartCount\":3,\"started\":true,\"state\":{\"running\":{\"startedAt\":\"2020-10-23T21:18:36Z\"}}}]}}"
Oct 23 21:18:37.280354 master-0-1 hyperkube[1974]: ,StartedAt:2020-10-22 20:37:29 +0000 UTC,FinishedAt:2020-10-23 21:18:35 +0000 UTC,ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9,}} Ready:true RestartCount:3 Image:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ImageID:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ContainerID:cri-o://48c8d8512c47b968c5c54a751b095a12542faedebd11b0981d3a980c01b89794 Started:0xc005744329}] QOSClass:Burstable EphemeralContainerStatuses:[]})
Oct 23 21:18:48.612410 master-0-1 hyperkube[1974]: I1023 21:18:48.612390    1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory
Oct 23 21:19:06.379613 master-0-1 hyperkube[1974]: I1023 21:19:06.372544    1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory
Oct 23 21:19:27.101168 master-0-1 hyperkube[1974]: I1023 21:19:27.100940    1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory
Oct 23 21:19:41.748571 master-0-1 hyperkube[1974]: I1023 21:19:41.748539    1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory
Oct 23 21:19:48.661544 master-0-1 hyperkube[1974]: ,StartedAt:2020-10-22 20:37:29 +0000 UTC,FinishedAt:2020-10-23 21:18:35 +0000 UTC,ContainerID:cri-o://c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9,}} Ready:true RestartCount:3 Image:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ImageID:quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 ContainerID:cri-o://48c8d8512c47b968c5c54a751b095a12542faedebd11b0981d3a980c01b89794 Started:0xc001acc4d9}] QOSClass:Burstable EphemeralContainerStatuses:[]}
Oct 23 21:19:53.178050 master-0-1 hyperkube[1974]: I1023 21:19:53.177959    1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory
$ zgrep c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9 node.logs.gz | tail -n1
Oct 23 22:44:29.413229 master-0-1 hyperkube[1974]: I1023 22:44:29.413184    1974 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d6165c6_fd5b_4220_a786_1008b9c9a346.slice/crio-c26b186a81284c0f124ee0e590226750e33aef0d4770a264341c9e7c089665e9.scope/cgroup.procs: no such file or directory

That's either the kubelet or CRI-O banging away for at least another hour and change on 'Unable to get Process Stats...cgroup.procs: no such file or directory'.

Comment 5 W. Trevor King 2020-10-23 23:06:24 UTC
Bug 1879205 seems to be tracking the kubelet/CRI-O "Unable to get Process Stats" issue.

Comment 6 W. Trevor King 2020-10-26 14:23:15 UTC
This is the first week of sprint, so adding UpcomingSprint.

Comment 7 W. Trevor King 2020-10-27 03:00:51 UTC
I have no idea how to trigger the lock that the patch is guarding against.  I guess 'oc -n openshift-cluster-version logs -f cluster-version-operator-...'  in one terminal, and then 'oc -n openshift-cluster-version delete pod cluster-version-operator-...' in another to trigger the TERM and graceful shutdown.  Watch the logs to ensure we get messages about the metrics children being collected.  And hope you get lucky and see the "Abandoning ... uncollected metrics goroutines" message that shows the safety valve blowing open.
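
For context, the "safety valve" is essentially a bounded wait on the collection channel.  A rough sketch of that pattern (hypothetical names and signature, not the CVO's actual implementation, which is in the linked PR):

  package main

  import (
      "log"
      "time"
  )

  // collectOrAbandon waits up to grace for count goroutine results on
  // resultChannel, then gives up on whatever is still outstanding.  Sketch
  // only; the names here are illustrative.
  func collectOrAbandon(resultChannel <-chan error, count int, grace time.Duration) {
      timer := time.NewTimer(grace)
      defer timer.Stop()
      for count > 0 {
          select {
          case err := <-resultChannel:
              count--
              if err != nil {
                  log.Printf("Collected goroutine: %v", err)
              } else {
                  log.Print("Collected goroutine.")
              }
          case <-timer.C:
              log.Printf("Abandoning %d uncollected metrics goroutines", count)
              return
          }
      }
  }

  func main() {
      results := make(chan error, 2)
      results <- nil // one child exits cleanly; the second never reports back,
      collectOrAbandon(results, 2, 2*time.Second) // so it is abandoned after 2s
  }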

Comment 10 Yang Yang 2020-10-27 10:32:08 UTC
I'm trying to reproduce it on 4.6, but no luck so far. Installing on bare metal succeeds, and following the approach described in comment 7 does not trigger the hang.

Comment 11 Rom Freiman 2020-10-27 12:45:58 UTC
It happened during assisted-installer installations on 4.6 rc.4.

Comment 12 Michael Hrivnak 2020-10-27 19:04:27 UTC
FYI we are continuing to hit this problem with bare metal installs through assisted-installer. It does not happen every time, but it continues to happen sometimes. Today we reproduced it on 4.6.1. I would consider this very important to release as a 4.6 bugfix since users can now install 4.6 on bare metal with assisted-installer through cloud.redhat.com.

Comment 14 W. Trevor King 2020-10-27 20:00:04 UTC
This bug targets 4.7.  It is expected that we continue to have the issue on 4.6 until this bug is verified and we can backport the patch to 4.6.z.  To get the bug verified in 4.7, please connect QE with your assisted-installer flow and demonstrate that recent 4.7 nightlies no longer hang this way.

Comment 15 Yang Yang 2020-10-28 01:31:38 UTC
Michael Hrivnak, I'm not familiar with the assisted-installer. Could you please point me to the document that describes the procedure for installing with the assisted-installer so that I can verify this? And could you please try with the latest 4.7 nightly build? Thanks.

Comment 16 W. Trevor King 2020-10-28 04:02:47 UTC
There is some context about the assisted installer in [1], although I'm not sure if there is enough there for folks to know how to launch assisted installs.

[1]: https://github.com/openshift/enhancements/blob/master/enhancements/installer/connected-assisted-installer.md

Comment 17 Igal Tsoiref 2020-10-28 09:02:26 UTC
We see that the CVO loses its leader lease due to etcd instability (another issue). When it loses the lease it stops all its workers and then gets stuck on its way to shutting down:

I1027 17:03:46.786426       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version
....

E1027 17:04:08.520679       1 memcache.go:111] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E1027 17:04:11.810461       1 leaderelection.go:321] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.ocp-cluster-titan91-0.qe.lab.redhat.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context deadline exceeded
I1027 17:04:11.810579       1 leaderelection.go:278] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition
I1027 17:04:11.810737       1 request.go:834] Error in request: resource name may not be empty
E1027 17:04:11.810782       1 leaderelection.go:297] Failed to release lock: resource name may not be empty

......

I1027 17:04:11.810804       1 start.go:250] Stopped leading; shutting down.
....
I1027 17:05:19.814616       1 sync_worker.go:274] Stopped worker
I1027 17:05:19.814643       1 sync_worker.go:361] Worker shut down
E1027 17:05:26.046541       1 event.go:273] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
E1027 17:05:36.047031       1 event.go:273] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
E1027 17:05:46.047377       1 event.go:273] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
I1027 17:05:47.472264       1 discovery.go:214] Invalidating discovery information
E1027 17:05:56.047658       1 event.go:273] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
E1027 17:05:56.047765       1 event.go:218] Unable to write event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".1641e833df74ab68", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"LeaderElection", Message:"master-0-1_9ee38044-6899-4502-a707-c2baa80bf7ed stopped leading", Source:v1.EventSource{Component:"openshift-cluster-version", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfde32a2f04ffd68, ext:226013344971, loc:(*time.Location)(0x26b0400)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfde32a2f04ffd68, ext:226013344971, loc:(*time.Location)(0x26b0400)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}' (retry limit exceeded!)
I1027 17:06:11.811551       1 start.go:260] Waiting on 1 outstanding goroutines.
E1027 17:06:11.811621       1 metrics.go:160] Failed to gracefully shut down metrics server: accept tcp [::]:9099: use of closed network connection
I1027 17:06:47.472342       1 discovery.go:214] Invalidating discovery information
I1027 17:07:47.472132       1 discovery.go:214] Invalidating discovery information
I1027 17:08:47.472617       1 discovery.go:214] Invalidating discovery information
I1027 17:09:47.472244       1 discovery.go:214] Invalidating discovery information
I1027 17:10:47.472262       1 discovery.go:214] Invalidating discovery information

Comment 18 Igal Tsoiref 2020-10-28 09:23:06 UTC
Maybe I'm missing something, but it looks like there is a missing return on this line: https://github.com/openshift/cluster-version-operator/blob/7ace0d3f0d83c87b003b9c9fac262e2888d90d92/pkg/cvo/metrics.go#L192 ?

Comment 19 Omri Hochman 2020-10-28 10:41:31 UTC
(In reply to W. Trevor King from comment #14)
> This bug targets 4.7.  It is expected that we continue to have the issue on
> 4.6 until this bug is verified and we can backport the patch to 4.6.z.  To
> get the bug verified in 4.7, please connect QE with your assisted-installer
> flow and demonstrate that recent 4.7 nightlies no longer hang this way.

This bug happens with 4.6, and it should stay assigned until the fix is available in 4.6.z builds; we can clone this bug to 4.7 and put the clone into ON_QA for QE verification against 4.7.

At the moment the assisted-installer deploys 4.6.1 builds, so this bug is the default behavior, and it's a TestBlocker for QE:
https://bugzilla.redhat.com/show_bug.cgi?id=1889970 

I'll clone to 4.7 and move this one back to Assigned

Comment 20 W. Trevor King 2020-10-28 12:56:54 UTC
"Target Release" here is 4.7.0, and has been since I set it at 2020-10-23 23:43:37 UTC.  The 4.7/master PR has landed already, linking *this bug* as the one it was addressing.  This bug needs to remain the 4.7-targeting bug.  Moving back to ON_QA.

Once the 4.7-targeting bug is fixed, but not before, we can land a PR targeting 4.6.  The cherry-pick tooling will clone a new bug for that.  Please leave this bug in ON_QA going forward, so we can get it VERIFIED and unblock the 4.6 backport.

Comment 21 W. Trevor King 2020-10-28 12:59:20 UTC
*** Bug 1892267 has been marked as a duplicate of this bug. ***

Comment 22 W. Trevor King 2020-10-28 13:04:53 UTC
(In reply to Igal Tsoiref from comment #18)
> maybe i miss something but looks like there is missing return in this line
> https://github.com/openshift/cluster-version-operator/blob/
> 7ace0d3f0d83c87b003b9c9fac262e2888d90d92/pkg/cvo/metrics.go#L192 ?

If we fail to shut down the server, I think we still want to wait (at most until runContext expires) to reap the child goroutines.  But if folks show that a failed Shutdown(...) call means we _never_ collect any child goroutines, then yeah, a return there would make sense.  It's not clear from the log excerpt in comment 19 which version of the CVO was generating the logs.  Do you still have the first line of that log file with the CVO version?
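
To illustrate the trade-off (hypothetical names and signature; this is not the CVO's implementation): returning on a failed Shutdown abandons the children unconditionally, while continuing to drain the result channel, bounded by runContext, still gives them a chance to be reaped:

  package main

  import (
      "context"
      "errors"
      "log"
      "time"
  )

  // reapChildren keeps draining child results for as long as runCtx allows,
  // even after a failed Shutdown.  Uncommenting the early return would give
  // up on the children unconditionally instead.
  func reapChildren(runCtx context.Context, shutdownErr error, errorChannel <-chan error, outstanding int) {
      if shutdownErr != nil {
          log.Printf("Failed to gracefully shut down metrics server: %v", shutdownErr)
          // return // bail here to abandon the children immediately
      }
      for outstanding > 0 {
          select {
          case err := <-errorChannel:
              outstanding--
              log.Printf("Collected child goroutine: %v", err)
          case <-runCtx.Done():
              log.Printf("Giving up on %d child goroutines: %v", outstanding, runCtx.Err())
              return
          }
      }
  }

  func main() {
      runCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
      defer cancel()
      children := make(chan error, 1) // only one of two children ever reports
      children <- nil
      reapChildren(runCtx, errors.New("accept tcp [::]:9099: use of closed network connection"), children, 2)
  }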

Comment 30 Yang Yang 2020-10-29 11:03:49 UTC
I tried 3 installations, including 2 concurrent ones, on bare metal using 4.7.0-0.ci-2020-10-27-124430, and the cluster installed successfully all 3 times.

Comment 31 Igal Tsoiref 2020-10-29 15:51:45 UTC
(In reply to W. Trevor King from comment #22)
> (In reply to Igal Tsoiref from comment #18)
> > maybe i miss something but looks like there is missing return in this line
> > https://github.com/openshift/cluster-version-operator/blob/
> > 7ace0d3f0d83c87b003b9c9fac262e2888d90d92/pkg/cvo/metrics.go#L192 ?
> 
> If we fail to shut down the server, I think we still want to wait (at most
> until runContext expires) to reap the child goroutines.  But if folks show
> that a failed Shutdown(...) call means we _never_ collect any child
> goroutines, then yeah, a return there would make sense.  It's not clear from
> the log excerpt in comment 19 which version of the CVO was generating the
> logs.  Do you still have the first line of that log file with the CVO version?

I1027 17:00:25.844829       1 start.go:21] ClusterVersionOperator 4.6.0-202010100331.p0-c35a4e1

Comment 32 Igal Tsoiref 2020-10-29 15:53:21 UTC
(In reply to Yang Yang from comment #30)
> I tried 3 installations, including 2 concurrent ones, on bare metal using
> 4.7.0-0.ci-2020-10-27-124430, and the cluster installed successfully all 3
> times.

The issue is that etcd is not stable; it is re-electing while the installation is in progress.

Comment 33 W. Trevor King 2020-10-29 16:50:14 UTC
(In reply to Igal Tsoiref from comment #31)
> I1027 17:00:25.844829       1 start.go:21] ClusterVersionOperator 4.6.0-202010100331.p0-c35a4e1

Yup.  Still no patches in that 4.6 branch until we either get this 4.7-targeting bug VERIFIED or decide to open the 4.6 backport early and use pre-merge verification [1] on the 4.6 backport.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1876486#c7

Comment 34 W. Trevor King 2020-10-29 17:13:08 UTC
Summarizing the current situation, since I haven't been doing a very good job of clear messaging:

1. 4.6 and master/4.7 have a CVO bug where a stuck metrics-serving goroutine can hang the CVO in shutdown, breaking install and generally gumming things up whenever we trip that hang.
2. [1] has landed in master/4.7, in an attempt to prevent the hang.
3. Something (presumably unrelated) is keeping ART from building 4.6 and 4.7 nightlies, so we have no nightlies with which to verify [1].
4. The assisted-installer flow is the simplest approach to reproduce the condition that causes the hang.
5. There is some sort of issue that makes it hard to plug 4.7 releases into the assisted-installer flow.
6. We want to fix this on 4.6 too.
7. We don't want to land changes in 4.6 that don't actually fix the problem.
8. We usually prevent released-version churn around broken fixes by verifying fixes in master-ward branches before allowing the changes into released branches.  E.g. verifying a 4.7 fix before allowing the 4.6 backport to land.
9. We have a new process around pre-merge verification [2], but I'm not particularly familiar with it.

So, possible paths forward:

a. i. Minimal sanity testing (comment 7, comment 30) of the 4.7 release.  Mark this bug VERIFIED to address (8).
   ii. Create and land the 4.6 backport to address (6).
   iii. Assisted-install 4.6 verification to see if we actually fixed the issue (possibly constrained by (3), I dunno if assisted-install tooling is set up to handle CI builds).
   iv. If we fail verification, revert the 4.6 backport before the next 4.6 nightly is picked up for a 4.6.z, and back to the drawing board for 4.7/master fixes.

b. i. Figure out how to reliably trip the hang in 4.7.
   ii. Confirm that the landed 4.7/master change fixes the hang, or continue to iterate with 4.7/master PRs until we can confirm the hang is fixed.
   iii. Create and land the 4.6 backport to address (6).
   iv. Assisted-install 4.6 verification to confirm that the 4.6 backport also fixes the hang.

c. i. Create, but do not yet land the 4.6 backport.
   ii. Pre-merge verification of the 4.6 PR, to confirm that the backport fixes the hang.
   iii. Mark this bug VERIFIED, assuming that if the backport fixes the 4.6 hang, the 4.7/master [1] with the same code fixes the same issue in 4.7.
   iv. VERIFIED 4.7 bug unblocks the 4.6 backport PR landing, so land it.
   v. Assisted-install, post-merge 4.6 verification to sanity-check c.ii's pre-merge verification.

I'm not clear which of those we are pursuing.  It's really up to QE to let us know which route forward seems best to them.

[1]: https://github.com/openshift/cluster-version-operator/pull/477
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1876486#c7

Comment 35 Rom Freiman 2020-10-29 21:42:13 UTC
@wking referring to bullet 5: you can use /test e2e-metal-assisted to run assisted e2e coverage on 4.7.
The problem is that it's not a very common failure.

We do want this backport into 4.6 though - it seems to have started happening more frequently once AI moved to installing rc/GA builds.

Comment 36 W. Trevor King 2020-10-30 00:08:28 UTC
> @wking referring to bullet 5: you can use /test e2e-metal-assisted to run assisted e2e coverage on 4.7.

Only the cluster-etcd-operator and assisted-test-infra repositories:

  release$ git grep -l 'trigger.*test.*e2e-metal-assisted' origin/master | sed 's|[^/]*$||' | uniq
  origin/master:ci-operator/jobs/openshift-priv/assisted-test-infra/
  origin/master:ci-operator/jobs/openshift-priv/cluster-etcd-operator/
  origin/master:ci-operator/jobs/openshift/assisted-test-infra/
  origin/master:ci-operator/jobs/openshift/cluster-etcd-operator/

But the 4.7 PR has landed.  If folks want to bang away at the assisted job in a dummy PR against either of those repositories, and feel like failing to reproduce the hang after $N attempts would be enough to mark this 4.7-targeting bug VERIFIED, that's fine with me.  I'm pretty confident in my patch, but I'm not QE, and it's up to QE to decide what they're comfortable with as grounds for marking this bug VERIFIED.

Comment 37 Yang Yang 2020-10-30 01:47:24 UTC
I'm still waiting for a nightly build to verify with. If the assisted-installer folks can't wait for the 4.6 backport to land, I'm okay with the pre-merge flow. I did a minimal sanity test (comment 30) against a CI build and it works fine.

Comment 39 W. Trevor King 2020-10-31 22:23:50 UTC
ART made an attempt to build a new 4.7 nightly [1], but it failed on:

  error: unable to create a release: operator "cluster-storage-operator" contained an invalid image-references file: no input image tag named "openstack-cinder-csi-driver-operator"

Moving back to MODIFIED while we wait for them to complete a build.

[1]: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-10-31-214011

Comment 41 Yang Yang 2020-11-02 02:06:48 UTC
The new 4.7 nightly build [1] failed to create:

error: unable to create a release: operator "cluster-storage-operator" contained an invalid image-references file: no input image tag

[1]https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-01-092917

Comment 43 Yang Yang 2020-11-10 09:10:14 UTC
Tried the installation 4 times on bare metal using 4.7.0-0.nightly-2020-11-10-032055 and all 4 passed, so I'm moving this to the VERIFIED state.

Comment 46 errata-xmlrpc 2021-02-24 15:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633