Bug 1873900 - Cluster-version operator stops running sync cycles on desired version change
Summary: Cluster-version operator stops running sync cycles on desired version change
Keywords:
Status: ON_QA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: W. Trevor King
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-30 22:10 UTC by W. Trevor King
Modified: 2020-09-21 17:24 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A new-in-4.6 refactor caused the cluster-version operator to occasionally deadlock its manifest reconciliation worker when that worker was canceled. Consequence: Requesting an update to a new release cancels the manifest reconciliation worker, and that cancel sometimes deadlocked the worker. When that happened, the cluster-version operator neither applied the requested update nor continued reconciling the previous release. Fix: The cluster-version operator was updated to avoid the deadlock. Result: The cluster-version operator no longer deadlocks, and is back to the 4.5 behavior of promptly accepting and applying requested updates.
Clone Of:
Environment:
Last Closed:
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 448 None closed Bug 1873900: install/0000_00_cluster-version-operator_03_deployment: Bump to --v=5 2020-09-18 17:22:23 UTC
Github openshift cluster-version-operator pull 451 None closed Bug 1873900: pkg/cvo/sync_worker: Grow logging in SyncWorker.Update 2020-09-18 17:22:26 UTC
Github openshift cluster-version-operator pull 455 None closed Bug 1873900: pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue 2020-09-21 08:13:39 UTC

Description W. Trevor King 2020-08-30 22:10:01 UTC
With 4.6.0-0.nightly-2020-08-25-204643:

$ grep 'Loading payload\|Payload loaded\|Running sync.*in state\|Desired version from spec\|Result of work' /tmp/xx/namespaces/openshift-cluster-version/pods/cluster-version-operator-54589d4748-fj56v/cluster-version-operator/cluster-version-operator/logs/current.log
2020-08-28T03:36:00.384964545Z I0828 03:36:00.356808       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 0
2020-08-28T03:36:00.384964545Z I0828 03:36:00.356823       1 sync_worker.go:493] Loading payload
2020-08-28T03:36:01.854233287Z I0828 03:36:01.854208       1 sync_worker.go:593] Payload loaded from registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7 with hash VdMtCylIGgw=
2020-08-28T03:39:05.050142724Z I0828 03:39:05.050129       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 0
2020-08-28T03:39:05.050166757Z I0828 03:39:05.050159       1 sync_worker.go:493] Loading payload
2020-08-28T03:39:05.770317300Z I0828 03:39:05.770275       1 sync_worker.go:593] Payload loaded from registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7 with hash VdMtCylIGgw=
2020-08-28T03:42:05.761345833Z I0828 03:42:05.761295       1 task_graph.go:546] Result of work: []
2020-08-28T03:44:56.717918691Z I0828 03:44:56.717873       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 1
2020-08-28T03:45:40.127607986Z I0828 03:45:40.127597       1 task_graph.go:546] Result of work: []
...
2020-08-28T05:46:20.965430733Z I0828 05:46:20.965389       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 35
2020-08-28T05:47:00.525075212Z I0828 05:47:00.525051       1 task_graph.go:546] Result of work: []
2020-08-28T05:47:32.154197038Z I0828 05:47:32.154182       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 2 in state Updating at attempt 0
2020-08-28T05:47:32.285316684Z I0828 05:47:32.285292       1 task_graph.go:546] Result of work: []
2020-08-28T05:48:17.099951081Z I0828 05:48:17.096699       1 task_graph.go:546] Result of work: []
2020-08-28T05:49:38.130488133Z I0828 05:49:38.130460       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 3 in state Updating at attempt 0
2020-08-28T05:49:38.246716527Z I0828 05:49:38.246705       1 task_graph.go:546] Result of work: []
2020-08-28T05:50:02.959409172Z I0828 05:50:02.959384       1 task_graph.go:546] Result of work: [Could not update deployment "openshift-cloud-credential-operator/cloud-credential-operator" (193 of 598) Could not update role "openshift-machine-api/prometheus-k8s-cluster-autoscaler-operator" (222 of 598) Could not update clusterrole "system:openshift:aggregate-snapshots-to-view" (243 of 598) Could not update configmap "openshift-image-registry/trusted-ca" (258 of 598) Could not update service "openshift-ingress-operator/metrics" (277 of 598) Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (322 of 598) Could not update deployment "openshift-monitoring/cluster-monitoring-operator" (359 of 598) Could not update clusterrolebinding "console-extensions-reader" (375 of 598) Could not update rolebinding "openshift-config/insights-operator" (407 of 598) Could not update deployment "openshift-operator-lifecycle-manager/olm-operator" (433 of 598) Could not update servicemonitor "openshift-marketplace/marketplace-operator" (454 of 598) Could not update service "openshift-service-ca-operator/metrics" (460 of 598)]
2020-08-28T05:50:02.960435734Z I0828 05:50:02.960413       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 4 in state Updating at attempt 0
2020-08-28T05:50:04.939552794Z I0828 05:50:04.939539       1 task_graph.go:546] Result of work: []
2020-08-28T05:50:46.585629809Z I0828 05:50:46.585626       1 task_graph.go:546] Result of work: []
2020-08-28T05:51:33.800694010Z I0828 05:51:33.800672       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 5 in state Updating at attempt 0
2020-08-28T05:51:33.931867692Z I0828 05:51:33.931861       1 task_graph.go:546] Result of work: []
2020-08-28T05:56:53.162544970Z I0828 05:56:53.162531       1 task_graph.go:546] Result of work: [Cluster operator csi-snapshot-controller has not yet reported success]
2020-08-28T05:56:53.163581403Z I0828 05:56:53.163565       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 6 in state Updating at attempt 0
2020-08-28T05:56:53.285166167Z I0828 05:56:53.285162       1 task_graph.go:546] Result of work: []
2020-08-28T05:57:54.624727704Z I0828 05:57:54.624715       1 task_graph.go:546] Result of work: []
2020-08-28T05:58:25.498627106Z I0828 05:58:25.498596       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Updating at attempt 0
2020-08-28T05:58:25.616645308Z I0828 05:58:25.616634       1 task_graph.go:546] Result of work: []
2020-08-28T05:59:10.409241880Z I0828 05:59:10.409232       1 task_graph.go:546] Result of work: []
2020-08-28T06:02:01.365927807Z I0828 06:02:01.365884       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Reconciling at attempt 0
2020-08-28T06:02:44.773794796Z I0828 06:02:44.773786       1 task_graph.go:546] Result of work: []
2020-08-28T06:05:35.730466807Z I0828 06:05:35.730422       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Reconciling at attempt 0
2020-08-28T06:06:19.137044203Z I0828 06:06:19.137032       1 task_graph.go:546] Result of work: []
2020-08-28T06:09:10.093712198Z I0828 06:09:10.093665       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Reconciling at attempt 0
2020-08-28T10:30:29.184128618Z I0828 10:30:29.184105       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
2020-08-28T10:31:39.714275754Z I0828 10:31:39.714250       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
...
2020-08-28T13:44:13.345879177Z I0828 13:44:13.345865       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
2020-08-28T13:47:18.337033333Z I0828 13:47:18.336993       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-27-005538", Force:true}
...
2020-08-28T14:03:28.722402381Z I0828 14:03:28.722348       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-27-005538", Force:true}
2020-08-28T14:05:18.370948271Z I0828 14:05:18.370906       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
...
2020-08-28T23:29:02.546827101Z I0828 23:29:02.546802       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}

Once the desired version changed (and it changed a few times), there is no further sign from the sync worker that it is still alive.  It is not clear where it is getting stuck, or whether it is not stuck at all and is just iterating through some low-logging code path.

Comment 2 W. Trevor King 2020-08-31 21:46:17 UTC
Moving back to NEW.  The landed logging is not enough to actually fix anything; it will just help us understand the next time we catch this in a must-gather.

Comment 3 W. Trevor King 2020-09-11 05:11:11 UTC
Landing more logging.  Still not clear enough to fix the issue.

Comment 4 W. Trevor King 2020-09-14 20:04:18 UTC
Recent 4.6.0-0.ci-2020-09-14-081525 -> 4.6.0-0.ci-2020-09-14-103240 job [1] had still not acknowledged the update by gather-time:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1305495056425160704/artifacts/e2e-gcp-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
  2020-09-14T13:21:41Z 2020-09-14T13:56:47Z 4.6.0-0.ci-2020-09-14-081525 Completed false

Running CVO is from before install-complete:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1305495056425160704/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-55d9c8686d-smksh_cluster-version-operator.log | head -n1
  I0914 13:32:27.214436       1 start.go:21] ClusterVersionOperator v1.0.0-372-g6d56c655-dirty

From those same logs, here's the update to 4.6.0-0.ci-2020-09-14-103240 making it through:

  I0914 13:59:51.815658       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-09-14-103240", Force:true}
  I0914 13:59:51.815686       1 sync_worker.go:243] Cancel the sync worker's current loop
  I0914 13:59:51.815700       1 sync_worker.go:249] Notify the sync worker that new work is available
  I0914 13:59:51.815716       1 status.go:159] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:1, Step:"ApplyResources", Failure:error(nil), Fraction:0.78571427, Completed:0, Reconciling:true, Initial:false, VersionHash:"EeHTCcLjL-Y=", LastProgress:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, Actual:v1.Release{Version:"4.6.0-0.ci-2020-09-14-081525", Image:"registry.svc.ci.openshift.org/ocp/release@sha256:139d14b24df55f821d7027db5804d7261d8eec024a6e2a55d972f344869b0ce8", URL:"", Channels:[]string(nil)}, Verified:false}
  I0914 13:59:51.815781       1 status.go:79] merge into existing history completed=false desired=v1.Release{Version:"4.6.0-0.ci-2020-09-14-081525", Image:"registry.svc.ci.openshift.org/ocp/release@sha256:139d14b24df55f821d7027db5804d7261d8eec024a6e2a55d972f344869b0ce8", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Completed", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63735686501, loc:(*time.Location)(0x28cfd60)}}, CompletionTime:(*v1.Time)(0xc000855ca0), Version:"4.6.0-0.ci-2020-09-14-081525", Image:"registry.svc.ci.openshift.org/ocp/release@sha256:139d14b24df55f821d7027db5804d7261d8eec024a6e2a55d972f344869b0ce8", Verified:false}
  E0914 13:59:51.815800       1 task.go:81] error running apply for configmap "openshift-console-operator/console-operator-config" (390 of 602): context canceled
  I0914 13:59:51.815874       1 task_graph.go:483] Canceled worker 1 while waiting for work
  ...
  E0914 13:59:51.815971       1 task.go:81] error running apply for service "openshift-machine-config-operator/machine-config-daemon" (494 of 602): context canceled
  I0914 13:59:51.817013       1 task_graph.go:483] Canceled worker 0 while waiting for work
  ...
  I0914 14:00:02.786276       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-09-14-103240", Force:true}
  I0914 14:00:02.786311       1 sync_worker.go:222] Update work is equal to current target; no change required

and all subsequent sync-loop goroutine work hangs.  Maybe in RunGraph?  Maybe on consistentReporter.lock?  I think I need to catch this in a live cluster so I can SIGQUIT [2] and get a stack dump.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1305495056425160704
[2]: https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs

Comment 5 W. Trevor King 2020-09-14 23:49:27 UTC
I think I reproduced this by installing 4.6.0-fc.5 and updating back and forth between fc.4 and fc.5 until the sync worker hung.  Then in one terminal I streamed down the logs:

  $ oc -n openshift-cluster-version logs -f cluster-version-operator-659794d67-x6l84

while in another terminal I exec'ed in and SIGQUIT the CVO:

  $ oc -n openshift-cluster-version exec -it cluster-version-operator-659794d67-x6l84 -- bash
  [root@ci-ln-gjxvdvb-f76d1-jrtws-master-1 /]# ps aux
  USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  root           1  0.9  0.5 964340 89732 ?        Ssl  23:03   0:10 /usr/bin/cluster-version-operator start --release-image=registry.svc.ci.openshift.org/ocp/release@sha256:5883d0db15939484bd477147e6949c53fbc6f55
  root          16  0.2  0.0  11840  2952 pts/0    Ss   23:22   0:00 bash
  root          30  0.0  0.0  51768  3424 pts/0    R+   23:22   0:00 ps aux
  [root@ci-ln-gjxvdvb-f76d1-jrtws-master-1 /]# kill -s QUIT 1
  [root@ci-ln-gjxvdvb-f76d1-jrtws-master-1 /]# command terminated with exit code 137

to get a traceback.  The sync worker goroutine looked like:

  goroutine 113 [chan receive, 5 minutes]:
  github.com/openshift/cluster-version-operator/pkg/payload.RunGraph(0x1c31fc0, 0xc000c7e960, 0xc001c853c0, 0x2, 0xc0019c2360, 0x0, 0x0, 0x0)
    /go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:510 +0x450
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).apply(0xc00018a300, 0x1c31fc0, 0xc000c7e960, 0xc0007a8000, 0xc001c399e0, 0x2, 0x1beaa00, 0xc0009d8050, 0x18ee2a0, 0xc001b38600)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:685 +0x558
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).syncOnce(0xc00018a300, 0x1c31fc0, 0xc000c7e960, 0xc001c399e0, 0x10, 0x1beaa00, 0xc0009d8050, 0xc000aa2480, 0xc00109f6e4, 0xa)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:596 +0x34f
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1.1(0xc001c399e0, 0xc00018a300, 0x1c31f40, 0xc000b0e0c0, 0x1bf70c0, 0xc000115860, 0x1967d48, 0x7, 0x10, 0x0, ...)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:324 +0x264
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:325 +0x366
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000f3bf40)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001c85f40, 0x1beb200, 0xc000b64120, 0xc000b0e001, 0xc000038960)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
  k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000f3bf40, 0x989680, 0x0, 0x1, 0xc000038960)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
  k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc00018a300, 0x1c31f40, 0xc000b0e0c0, 0x10, 0x1967d48, 0x7, 0x1bf70c0, 0xc000115860)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:256 +0x14c
  created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:310 +0x43b

And since the hung CVO was from fc.5, we can resolve those line numbers in [1]:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.6.0-fc.5-x86_64 | grep cluster-version-operator
    cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       d6475eff00192a56cd6a833aa7f73d6f4f4a2cef

So it was hung in the RunGraph call that attempts release image reconciliation [2], waiting to read from resultCh [3].  I probably flubbed context-expiration handling in the new-in-4.6 refactor [4].

[1]: https://github.com/openshift/cluster-version-operator/tree/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef
[2]: https://github.com/openshift/cluster-version-operator/blob/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef/pkg/cvo/sync_worker.go#L685
[3]: https://github.com/openshift/cluster-version-operator/blob/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef/pkg/payload/task_graph.go#L510
[4]: https://github.com/openshift/cluster-version-operator/pull/264

Comment 6 W. Trevor King 2020-09-16 16:38:57 UTC
One workaround is to kill the cluster-version operator pod.  The deployment controller will create a replacement, and that replacement will begin syncing manifests again without getting stuck in the racy update-triggers-a-sync-worker-cancel hole.

Comment 7 W. Trevor King 2020-09-16 18:02:06 UTC
Took a stab at doc text, although I'm also fine with this introduced-in-4.6-and-fixed-before-GA bug not having doc text.

Comment 9 Johnny Liu 2020-09-21 08:20:05 UTC
> I think I reproduced this by installing 4.6.0-fc.5 and updating back and forth between fc.4 and fc.5 until the sync worker hung.

I tried this scenario for about 7 rounds (back to fc.4, then forth to fc.5 again); each upgrade completed successfully.

@wking, do you have any idea how to reproduce this issue?

Comment 10 W. Trevor King 2020-09-21 17:24:17 UTC
Per the description in [1], you want a big stack of parallel nodes, so that the workers cannot consume everything in the queue before your target-release bump cancels them.  We have a big stack of parallel nodes during the reconciliation phase, but in that phase the CVO is mostly doing quick "yeah, the in-cluster state is fine for this manifest" checks.  You might have more luck catching it if you re-target mid-update while the CVO is early in reconciling runlevel 50, after the kube-storage-version-migrator operator levels in runlevel 40.

But we also landed a unit test, so if we can't actually reproduce the breakage after a reasonable effort, I'm fine with just putting the new code through the usual regression testing.

[1]: https://github.com/openshift/cluster-version-operator/pull/455/commits/632e763fa209c3e1bd07098c2e33599f8c663e3e

