Bug 1873900 - Cluster-version operator stops running sync cycles on desired version change
Summary: Cluster-version operator stops running sync cycles on desired version change
Keywords:
Status: ON_QA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: W. Trevor King
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-30 22:10 UTC by W. Trevor King
Modified: 2020-09-21 17:24 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A new-in-4.6 refactor caused the cluster-version operator to occasionally deadlock its manifest reconciliation worker when that worker was canceled. Consequence: Requesting an update to a new release cancels the manifest reconciliation worker, and that cancel sometimes deadlocked the worker. When that happened, the cluster-version operator neither applied the requested update nor continued reconciling the previous release. Fix: The cluster-version operator was updated to avoid the deadlock. Result: The cluster-version operator no longer deadlocks, and is back to the 4.5 behavior of promptly accepting and applying requested updates.
Clone Of:
Environment:
Last Closed:
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 448 None closed Bug 1873900: install/0000_00_cluster-version-operator_03_deployment: Bump to --v=5 2020-09-18 17:22:23 UTC
Github openshift cluster-version-operator pull 451 None closed Bug 1873900: pkg/cvo/sync_worker: Grow logging in SyncWorker.Update 2020-09-18 17:22:26 UTC
Github openshift cluster-version-operator pull 455 None closed Bug 1873900: pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue 2020-09-21 08:13:39 UTC

Description W. Trevor King 2020-08-30 22:10:01 UTC
With 4.6.0-0.nightly-2020-08-25-204643:

$ grep 'Loading payload\|Payload loaded\|Running sync.*in state\|Desired version from spec\|Result of work' /tmp/xx/namespaces/openshift-cluster-version/pods/cluster-version-operator-54589d4748-fj56v/cluster-version-operator/cluster-version-operator/logs/current.log
2020-08-28T03:36:00.384964545Z I0828 03:36:00.356808       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 0
2020-08-28T03:36:00.384964545Z I0828 03:36:00.356823       1 sync_worker.go:493] Loading payload
2020-08-28T03:36:01.854233287Z I0828 03:36:01.854208       1 sync_worker.go:593] Payload loaded from registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7 with hash VdMtCylIGgw=
2020-08-28T03:39:05.050142724Z I0828 03:39:05.050129       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 0
2020-08-28T03:39:05.050166757Z I0828 03:39:05.050159       1 sync_worker.go:493] Loading payload
2020-08-28T03:39:05.770317300Z I0828 03:39:05.770275       1 sync_worker.go:593] Payload loaded from registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7 with hash VdMtCylIGgw=
2020-08-28T03:42:05.761345833Z I0828 03:42:05.761295       1 task_graph.go:546] Result of work: []
2020-08-28T03:44:56.717918691Z I0828 03:44:56.717873       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 1
2020-08-28T03:45:40.127607986Z I0828 03:45:40.127597       1 task_graph.go:546] Result of work: []
...
2020-08-28T05:46:20.965430733Z I0828 05:46:20.965389       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 35
2020-08-28T05:47:00.525075212Z I0828 05:47:00.525051       1 task_graph.go:546] Result of work: []
2020-08-28T05:47:32.154197038Z I0828 05:47:32.154182       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 2 in state Updating at attempt 0
2020-08-28T05:47:32.285316684Z I0828 05:47:32.285292       1 task_graph.go:546] Result of work: []
2020-08-28T05:48:17.099951081Z I0828 05:48:17.096699       1 task_graph.go:546] Result of work: []
2020-08-28T05:49:38.130488133Z I0828 05:49:38.130460       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 3 in state Updating at attempt 0
2020-08-28T05:49:38.246716527Z I0828 05:49:38.246705       1 task_graph.go:546] Result of work: []
2020-08-28T05:50:02.959409172Z I0828 05:50:02.959384       1 task_graph.go:546] Result of work: [Could not update deployment "openshift-cloud-credential-operator/cloud-credential-operator" (193 of 598) Could not update role "openshift-machine-api/prometheus-k8s-cluster-autoscaler-operator" (222 of 598) Could not update clusterrole "system:openshift:aggregate-snapshots-to-view" (243 of 598) Could not update configmap "openshift-image-registry/trusted-ca" (258 of 598) Could not update service "openshift-ingress-operator/metrics" (277 of 598) Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (322 of 598) Could not update deployment "openshift-monitoring/cluster-monitoring-operator" (359 of 598) Could not update clusterrolebinding "console-extensions-reader" (375 of 598) Could not update rolebinding "openshift-config/insights-operator" (407 of 598) Could not update deployment "openshift-operator-lifecycle-manager/olm-operator" (433 of 598) Could not update servicemonitor "openshift-marketplace/marketplace-operator" (454 of 598) Could not update service "openshift-service-ca-operator/metrics" (460 of 598)]
2020-08-28T05:50:02.960435734Z I0828 05:50:02.960413       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 4 in state Updating at attempt 0
2020-08-28T05:50:04.939552794Z I0828 05:50:04.939539       1 task_graph.go:546] Result of work: []
2020-08-28T05:50:46.585629809Z I0828 05:50:46.585626       1 task_graph.go:546] Result of work: []
2020-08-28T05:51:33.800694010Z I0828 05:51:33.800672       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 5 in state Updating at attempt 0
2020-08-28T05:51:33.931867692Z I0828 05:51:33.931861       1 task_graph.go:546] Result of work: []
2020-08-28T05:56:53.162544970Z I0828 05:56:53.162531       1 task_graph.go:546] Result of work: [Cluster operator csi-snapshot-controller has not yet reported success]
2020-08-28T05:56:53.163581403Z I0828 05:56:53.163565       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 6 in state Updating at attempt 0
2020-08-28T05:56:53.285166167Z I0828 05:56:53.285162       1 task_graph.go:546] Result of work: []
2020-08-28T05:57:54.624727704Z I0828 05:57:54.624715       1 task_graph.go:546] Result of work: []
2020-08-28T05:58:25.498627106Z I0828 05:58:25.498596       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Updating at attempt 0
2020-08-28T05:58:25.616645308Z I0828 05:58:25.616634       1 task_graph.go:546] Result of work: []
2020-08-28T05:59:10.409241880Z I0828 05:59:10.409232       1 task_graph.go:546] Result of work: []
2020-08-28T06:02:01.365927807Z I0828 06:02:01.365884       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Reconciling at attempt 0
2020-08-28T06:02:44.773794796Z I0828 06:02:44.773786       1 task_graph.go:546] Result of work: []
2020-08-28T06:05:35.730466807Z I0828 06:05:35.730422       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Reconciling at attempt 0
2020-08-28T06:06:19.137044203Z I0828 06:06:19.137032       1 task_graph.go:546] Result of work: []
2020-08-28T06:09:10.093712198Z I0828 06:09:10.093665       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Reconciling at attempt 0
2020-08-28T10:30:29.184128618Z I0828 10:30:29.184105       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
2020-08-28T10:31:39.714275754Z I0828 10:31:39.714250       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
...
2020-08-28T13:44:13.345879177Z I0828 13:44:13.345865       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
2020-08-28T13:47:18.337033333Z I0828 13:47:18.336993       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-27-005538", Force:true}
...
2020-08-28T14:03:28.722402381Z I0828 14:03:28.722348       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-27-005538", Force:true}
2020-08-28T14:05:18.370948271Z I0828 14:05:18.370906       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
...
2020-08-28T23:29:02.546827101Z I0828 23:29:02.546802       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}

Once the desired version changed (and it changed a few times), there is no further sign from the sync worker that it is still alive.  It is not clear where it is getting stuck, or whether it is not stuck at all and is just iterating through some low-logging code path.

Comment 2 W. Trevor King 2020-08-31 21:46:17 UTC
Moving back to NEW.  The landed logging is not enough to actually fix anything; it will just help us understand the next time we catch this in a must-gather.

Comment 3 W. Trevor King 2020-09-11 05:11:11 UTC
Landing more logging.  Still not clear enough to fix the issue.

Comment 4 W. Trevor King 2020-09-14 20:04:18 UTC
Recent 4.6.0-0.ci-2020-09-14-081525 -> 4.6.0-0.ci-2020-09-14-103240 job [1] had still not acknowledged the update by gather-time:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1305495056425160704/artifacts/e2e-gcp-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
  2020-09-14T13:21:41Z 2020-09-14T13:56:47Z 4.6.0-0.ci-2020-09-14-081525 Completed false

Running CVO is from before install-complete:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1305495056425160704/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-55d9c8686d-smksh_cluster-version-operator.log | head -n1
  I0914 13:32:27.214436       1 start.go:21] ClusterVersionOperator v1.0.0-372-g6d56c655-dirty

From those same logs, here's the update to 4.6.0-0.ci-2020-09-14-103240 making it through:

  I0914 13:59:51.815658       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-09-14-103240", Force:true}
  I0914 13:59:51.815686       1 sync_worker.go:243] Cancel the sync worker's current loop
  I0914 13:59:51.815700       1 sync_worker.go:249] Notify the sync worker that new work is available
  I0914 13:59:51.815716       1 status.go:159] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:1, Step:"ApplyResources", Failure:error(nil), Fraction:0.78571427, Completed:0, Reconciling:true, Initial:false, VersionHash:"EeHTCcLjL-Y=", LastProgress:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, Actual:v1.Release{Version:"4.6.0-0.ci-2020-09-14-081525", Image:"registry.svc.ci.openshift.org/ocp/release@sha256:139d14b24df55f821d7027db5804d7261d8eec024a6e2a55d972f344869b0ce8", URL:"", Channels:[]string(nil)}, Verified:false}
  I0914 13:59:51.815781       1 status.go:79] merge into existing history completed=false desired=v1.Release{Version:"4.6.0-0.ci-2020-09-14-081525", Image:"registry.svc.ci.openshift.org/ocp/release@sha256:139d14b24df55f821d7027db5804d7261d8eec024a6e2a55d972f344869b0ce8", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Completed", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63735686501, loc:(*time.Location)(0x28cfd60)}}, CompletionTime:(*v1.Time)(0xc000855ca0), Version:"4.6.0-0.ci-2020-09-14-081525", Image:"registry.svc.ci.openshift.org/ocp/release@sha256:139d14b24df55f821d7027db5804d7261d8eec024a6e2a55d972f344869b0ce8", Verified:false}
  E0914 13:59:51.815800       1 task.go:81] error running apply for configmap "openshift-console-operator/console-operator-config" (390 of 602): context canceled
  I0914 13:59:51.815874       1 task_graph.go:483] Canceled worker 1 while waiting for work
  ...
  E0914 13:59:51.815971       1 task.go:81] error running apply for service "openshift-machine-config-operator/machine-config-daemon" (494 of 602): context canceled
  I0914 13:59:51.817013       1 task_graph.go:483] Canceled worker 0 while waiting for work
  ...
  I0914 14:00:02.786276       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-09-14-103240", Force:true}
  I0914 14:00:02.786311       1 sync_worker.go:222] Update work is equal to current target; no change required

and all subsequent sync-loop goroutine work hangs.  Maybe in RunGraph?  Maybe on consistentReporter.lock?  I think I need to catch this in a live cluster so I can SIGQUIT [2] and get a stack dump.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1305495056425160704
[2]: https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs

Comment 5 W. Trevor King 2020-09-14 23:49:27 UTC
I think I reproduced this by installing 4.6.0-fc.5 and updating back and forth between fc.4 and fc.5 until the sync worker hung.  Then in one terminal I streamed down the logs:

  $ oc -n openshift-cluster-version logs -f cluster-version-operator-659794d67-x6l84

while in another terminal I exec'ed in and SIGQUIT the CVO:

  $ oc -n openshift-cluster-version exec -it cluster-version-operator-659794d67-x6l84 -- bash
  [root@ci-ln-gjxvdvb-f76d1-jrtws-master-1 /]# ps aux
  USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  root           1  0.9  0.5 964340 89732 ?        Ssl  23:03   0:10 /usr/bin/cluster-version-operator start --release-image=registry.svc.ci.openshift.org/ocp/release@sha256:5883d0db15939484bd477147e6949c53fbc6f55
  root          16  0.2  0.0  11840  2952 pts/0    Ss   23:22   0:00 bash
  root          30  0.0  0.0  51768  3424 pts/0    R+   23:22   0:00 ps aux
  [root@ci-ln-gjxvdvb-f76d1-jrtws-master-1 /]# kill -s QUIT 1
  [root@ci-ln-gjxvdvb-f76d1-jrtws-master-1 /]# command terminated with exit code 137

to get a traceback.  The sync worker goroutine looked like:

  goroutine 113 [chan receive, 5 minutes]:
  github.com/openshift/cluster-version-operator/pkg/payload.RunGraph(0x1c31fc0, 0xc000c7e960, 0xc001c853c0, 0x2, 0xc0019c2360, 0x0, 0x0, 0x0)
    /go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:510 +0x450
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).apply(0xc00018a300, 0x1c31fc0, 0xc000c7e960, 0xc0007a8000, 0xc001c399e0, 0x2, 0x1beaa00, 0xc0009d8050, 0x18ee2a0, 0xc001b38600)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:685 +0x558
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).syncOnce(0xc00018a300, 0x1c31fc0, 0xc000c7e960, 0xc001c399e0, 0x10, 0x1beaa00, 0xc0009d8050, 0xc000aa2480, 0xc00109f6e4, 0xa)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:596 +0x34f
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1.1(0xc001c399e0, 0xc00018a300, 0x1c31f40, 0xc000b0e0c0, 0x1bf70c0, 0xc000115860, 0x1967d48, 0x7, 0x10, 0x0, ...)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:324 +0x264
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:325 +0x366
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000f3bf40)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001c85f40, 0x1beb200, 0xc000b64120, 0xc000b0e001, 0xc000038960)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
  k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000f3bf40, 0x989680, 0x0, 0x1, 0xc000038960)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
  k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc00018a300, 0x1c31f40, 0xc000b0e0c0, 0x10, 0x1967d48, 0x7, 0x1bf70c0, 0xc000115860)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:256 +0x14c
  created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:310 +0x43b

And since the hung CVO was from fc.5, we can resolve those line numbers in [1]:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.6.0-fc.5-x86_64 | grep cluster-version-operator
    cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       d6475eff00192a56cd6a833aa7f73d6f4f4a2cef

So it was hung in the RunGraph call that attempts release image reconciliation [2], waiting to read from resultCh [3].  I probably flubbed context-expiration handling in the new-in-4.6 refactor [4].

[1]: https://github.com/openshift/cluster-version-operator/tree/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef
[2]: https://github.com/openshift/cluster-version-operator/blob/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef/pkg/cvo/sync_worker.go#L685
[3]: https://github.com/openshift/cluster-version-operator/blob/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef/pkg/payload/task_graph.go#L510
[4]: https://github.com/openshift/cluster-version-operator/pull/264

Comment 6 W. Trevor King 2020-09-16 16:38:57 UTC
One workaround is to kill the cluster-version operator pod.  The deployment controller will create a replacement, and that replacement will begin syncing manifests again without getting stuck in the racy update-triggers-a-sync-worker-cancel hole.

Comment 7 W. Trevor King 2020-09-16 18:02:06 UTC
Took a stab at doc text, although I'm also fine with this introduced-in-4.6-and-fixed-before-GA bug not having doc text.

Comment 9 Johnny Liu 2020-09-21 08:20:05 UTC
> I think I reproduced this by installing 4.6.0-fc.5 and updating back and forth between fc.4 and fc.5 until the sync worker hung.

I tried this scenario for about 7 rounds (back to fc.4, then forth to fc.5 again); each upgrade completed successfully.

@wking, do you have any idea how to reproduce this issue?

Comment 10 W. Trevor King 2020-09-21 17:24:17 UTC
Per the description in [1], you want a big stack of parallel nodes, so that the workers cannot consume everything in the queue before your target-release bump cancels them.  We have a big stack of parallel nodes during the reconciliation phase, but in that phase the CVO is mostly doing quick "yeah, the in-cluster state is fine for this manifest" checks.  You might have more luck catching it if you re-target mid-update while the CVO is early in reconciling runlevel 50, after the kube-storage-version-migrator operator levels in runlevel 40.

But we also landed a unit test, so if we can't actually reproduce the breakage after a reasonable effort, I'm fine with just putting the new code through the usual regression testing.

[1]: https://github.com/openshift/cluster-version-operator/pull/455/commits/632e763fa209c3e1bd07098c2e33599f8c663e3e

