Bug 1940652 - Bump logging to level 5
Summary: Bump logging to level 5
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.z
Assignee: Jack Ottofaro
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On: 1873900
Blocks:
Reported: 2021-03-18 19:23 UTC by Jack Ottofaro
Modified: 2021-04-28 14:24 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1873900
Environment:
Last Closed: 2021-04-28 14:23:53 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 533 0 None open Bug 1940652: Bump logging to --v=5 2021-03-18 19:26:24 UTC
Red Hat Product Errata RHBA-2021:1300 0 None None None 2021-04-28 14:24:00 UTC

Description Jack Ottofaro 2021-03-18 19:23:14 UTC
+++ This bug was initially created as a clone of Bug #1873900 +++

With 4.6.0-0.nightly-2020-08-25-204643:

$ grep 'Loading payload\|Payload loaded\|Running sync.*in state\|Desired version from spec\|Result of work' /tmp/xx/namespaces/openshift-cluster-version/pods/cluster-version-operator-54589d4748-fj56v/cluster-version-operator/cluster-version-operator/logs/current.log
2020-08-28T03:36:00.384964545Z I0828 03:36:00.356808       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 0
2020-08-28T03:36:00.384964545Z I0828 03:36:00.356823       1 sync_worker.go:493] Loading payload
2020-08-28T03:36:01.854233287Z I0828 03:36:01.854208       1 sync_worker.go:593] Payload loaded from registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7 with hash VdMtCylIGgw=
2020-08-28T03:39:05.050142724Z I0828 03:39:05.050129       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 0
2020-08-28T03:39:05.050166757Z I0828 03:39:05.050159       1 sync_worker.go:493] Loading payload
2020-08-28T03:39:05.770317300Z I0828 03:39:05.770275       1 sync_worker.go:593] Payload loaded from registry.svc.ci.openshift.org/ocp/release@sha256:56945dc7218d758e25ffe990374668890e8c77d72132c98e0cc8f6272c063cc7 with hash VdMtCylIGgw=
2020-08-28T03:42:05.761345833Z I0828 03:42:05.761295       1 task_graph.go:546] Result of work: []
2020-08-28T03:44:56.717918691Z I0828 03:44:56.717873       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 1
2020-08-28T03:45:40.127607986Z I0828 03:45:40.127597       1 task_graph.go:546] Result of work: []
...
2020-08-28T05:46:20.965430733Z I0828 05:46:20.965389       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 1 in state Reconciling at attempt 35
2020-08-28T05:47:00.525075212Z I0828 05:47:00.525051       1 task_graph.go:546] Result of work: []
2020-08-28T05:47:32.154197038Z I0828 05:47:32.154182       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 2 in state Updating at attempt 0
2020-08-28T05:47:32.285316684Z I0828 05:47:32.285292       1 task_graph.go:546] Result of work: []
2020-08-28T05:48:17.099951081Z I0828 05:48:17.096699       1 task_graph.go:546] Result of work: []
2020-08-28T05:49:38.130488133Z I0828 05:49:38.130460       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 3 in state Updating at attempt 0
2020-08-28T05:49:38.246716527Z I0828 05:49:38.246705       1 task_graph.go:546] Result of work: []
2020-08-28T05:50:02.959409172Z I0828 05:50:02.959384       1 task_graph.go:546] Result of work: [Could not update deployment "openshift-cloud-credential-operator/cloud-credential-operator" (193 of 598) Could not update role "openshift-machine-api/prometheus-k8s-cluster-autoscaler-operator" (222 of 598) Could not update clusterrole "system:openshift:aggregate-snapshots-to-view" (243 of 598) Could not update configmap "openshift-image-registry/trusted-ca" (258 of 598) Could not update service "openshift-ingress-operator/metrics" (277 of 598) Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (322 of 598) Could not update deployment "openshift-monitoring/cluster-monitoring-operator" (359 of 598) Could not update clusterrolebinding "console-extensions-reader" (375 of 598) Could not update rolebinding "openshift-config/insights-operator" (407 of 598) Could not update deployment "openshift-operator-lifecycle-manager/olm-operator" (433 of 598) Could not update servicemonitor "openshift-marketplace/marketplace-operator" (454 of 598) Could not update service "openshift-service-ca-operator/metrics" (460 of 598)]
2020-08-28T05:50:02.960435734Z I0828 05:50:02.960413       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 4 in state Updating at attempt 0
2020-08-28T05:50:04.939552794Z I0828 05:50:04.939539       1 task_graph.go:546] Result of work: []
2020-08-28T05:50:46.585629809Z I0828 05:50:46.585626       1 task_graph.go:546] Result of work: []
2020-08-28T05:51:33.800694010Z I0828 05:51:33.800672       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 5 in state Updating at attempt 0
2020-08-28T05:51:33.931867692Z I0828 05:51:33.931861       1 task_graph.go:546] Result of work: []
2020-08-28T05:56:53.162544970Z I0828 05:56:53.162531       1 task_graph.go:546] Result of work: [Cluster operator csi-snapshot-controller has not yet reported success]
2020-08-28T05:56:53.163581403Z I0828 05:56:53.163565       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 6 in state Updating at attempt 0
2020-08-28T05:56:53.285166167Z I0828 05:56:53.285162       1 task_graph.go:546] Result of work: []
2020-08-28T05:57:54.624727704Z I0828 05:57:54.624715       1 task_graph.go:546] Result of work: []
2020-08-28T05:58:25.498627106Z I0828 05:58:25.498596       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Updating at attempt 0
2020-08-28T05:58:25.616645308Z I0828 05:58:25.616634       1 task_graph.go:546] Result of work: []
2020-08-28T05:59:10.409241880Z I0828 05:59:10.409232       1 task_graph.go:546] Result of work: []
2020-08-28T06:02:01.365927807Z I0828 06:02:01.365884       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Reconciling at attempt 0
2020-08-28T06:02:44.773794796Z I0828 06:02:44.773786       1 task_graph.go:546] Result of work: []
2020-08-28T06:05:35.730466807Z I0828 06:05:35.730422       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Reconciling at attempt 0
2020-08-28T06:06:19.137044203Z I0828 06:06:19.137032       1 task_graph.go:546] Result of work: []
2020-08-28T06:09:10.093712198Z I0828 06:09:10.093665       1 sync_worker.go:485] Running sync 4.6.0-0.nightly-2020-08-25-204643 (force=false) on generation 7 in state Reconciling at attempt 0
2020-08-28T10:30:29.184128618Z I0828 10:30:29.184105       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
2020-08-28T10:31:39.714275754Z I0828 10:31:39.714250       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
...
2020-08-28T13:44:13.345879177Z I0828 13:44:13.345865       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
2020-08-28T13:47:18.337033333Z I0828 13:47:18.336993       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-27-005538", Force:true}
...
2020-08-28T14:03:28.722402381Z I0828 14:03:28.722348       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-27-005538", Force:true}
2020-08-28T14:05:18.370948271Z I0828 14:05:18.370906       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}
...
2020-08-28T23:29:02.546827101Z I0828 23:29:02.546802       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release-nightly@sha256:567be653c695fce831f848537f64ba7e06c86d47af8e8dd15718e8586a1f087d", Force:false}

Once the desired version changes (and it changes a few times), there is no further sign from the sync worker that it is still alive.  It is not clear where it is stuck, or whether it is not stuck at all and is just iterating through some low-logging code path.

--- Additional comment from W. Trevor King on 2020-08-30 22:10:22 UTC ---

http://virt-openshift-05.lab.eng.nay.redhat.com/wduan/logs/must-gather.local.8306168597559322770_0829.tar.gz

--- Additional comment from W. Trevor King on 2020-08-31 21:46:17 UTC ---

Moving back to NEW.  The landed logging is not enough to actually fix anything; it will just help us understand the next time we catch this in a must-gather.

--- Additional comment from W. Trevor King on 2020-09-11 05:11:11 UTC ---

Landing more logging.  Still not clear enough to fix the issue.

--- Additional comment from W. Trevor King on 2020-09-14 20:04:18 UTC ---

Recent 4.6.0-0.ci-2020-09-14-081525 -> 4.6.0-0.ci-2020-09-14-103240 job [1] had still not acknowledged the update by gather-time:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1305495056425160704/artifacts/e2e-gcp-upgrade/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
  2020-09-14T13:21:41Z 2020-09-14T13:56:47Z 4.6.0-0.ci-2020-09-14-081525 Completed false

Running CVO is from before install-complete:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1305495056425160704/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-55d9c8686d-smksh_cluster-version-operator.log | head -n1
  I0914 13:32:27.214436       1 start.go:21] ClusterVersionOperator v1.0.0-372-g6d56c655-dirty

From those same logs, here's the update to 4.6.0-0.ci-2020-09-14-103240 making it through:

  I0914 13:59:51.815658       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-09-14-103240", Force:true}
  I0914 13:59:51.815686       1 sync_worker.go:243] Cancel the sync worker's current loop
  I0914 13:59:51.815700       1 sync_worker.go:249] Notify the sync worker that new work is available
  I0914 13:59:51.815716       1 status.go:159] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:1, Step:"ApplyResources", Failure:error(nil), Fraction:0.78571427, Completed:0, Reconciling:true, Initial:false, VersionHash:"EeHTCcLjL-Y=", LastProgress:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, Actual:v1.Release{Version:"4.6.0-0.ci-2020-09-14-081525", Image:"registry.svc.ci.openshift.org/ocp/release@sha256:139d14b24df55f821d7027db5804d7261d8eec024a6e2a55d972f344869b0ce8", URL:"", Channels:[]string(nil)}, Verified:false}
  I0914 13:59:51.815781       1 status.go:79] merge into existing history completed=false desired=v1.Release{Version:"4.6.0-0.ci-2020-09-14-081525", Image:"registry.svc.ci.openshift.org/ocp/release@sha256:139d14b24df55f821d7027db5804d7261d8eec024a6e2a55d972f344869b0ce8", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Completed", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63735686501, loc:(*time.Location)(0x28cfd60)}}, CompletionTime:(*v1.Time)(0xc000855ca0), Version:"4.6.0-0.ci-2020-09-14-081525", Image:"registry.svc.ci.openshift.org/ocp/release@sha256:139d14b24df55f821d7027db5804d7261d8eec024a6e2a55d972f344869b0ce8", Verified:false}
  E0914 13:59:51.815800       1 task.go:81] error running apply for configmap "openshift-console-operator/console-operator-config" (390 of 602): context canceled
  I0914 13:59:51.815874       1 task_graph.go:483] Canceled worker 1 while waiting for work
  ...
  E0914 13:59:51.815971       1 task.go:81] error running apply for service "openshift-machine-config-operator/machine-config-daemon" (494 of 602): context canceled
  I0914 13:59:51.817013       1 task_graph.go:483] Canceled worker 0 while waiting for work
  ...
  I0914 14:00:02.786276       1 cvo.go:435] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.6.0-0.ci-2020-09-14-103240", Force:true}
  I0914 14:00:02.786311       1 sync_worker.go:222] Update work is equal to current target; no change required

and all subsequent sync-loop goroutine work hangs.  Maybe in RunGraph?  Maybe on consistentReporter.lock?  I think I need to catch this in a live cluster so I can SIGQUIT [2] and get a stack dump.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1305495056425160704
[2]: https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs

--- Additional comment from W. Trevor King on 2020-09-14 23:49:27 UTC ---

I think I reproduced this by installing 4.6.0-fc.5 and updating back and forth between fc.4 and fc.5 until the sync worker hung.  Then in one terminal I streamed down the logs:

  $ oc -n openshift-cluster-version logs -f cluster-version-operator-659794d67-x6l84

while in another terminal I exec'ed in and SIGQUIT the CVO:

  $ oc -n openshift-cluster-version exec -it cluster-version-operator-659794d67-x6l84 -- bash
  [root@ci-ln-gjxvdvb-f76d1-jrtws-master-1 /]# ps aux
  USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  root           1  0.9  0.5 964340 89732 ?        Ssl  23:03   0:10 /usr/bin/cluster-version-operator start --release-image=registry.svc.ci.openshift.org/ocp/release@sha256:5883d0db15939484bd477147e6949c53fbc6f55
  root          16  0.2  0.0  11840  2952 pts/0    Ss   23:22   0:00 bash
  root          30  0.0  0.0  51768  3424 pts/0    R+   23:22   0:00 ps aux
  [root@ci-ln-gjxvdvb-f76d1-jrtws-master-1 /]# kill -s QUIT 1
  [root@ci-ln-gjxvdvb-f76d1-jrtws-master-1 /]# command terminated with exit code 137

to get a traceback.  The sync worker goroutine looked like:

  goroutine 113 [chan receive, 5 minutes]:
  github.com/openshift/cluster-version-operator/pkg/payload.RunGraph(0x1c31fc0, 0xc000c7e960, 0xc001c853c0, 0x2, 0xc0019c2360, 0x0, 0x0, 0x0)
    /go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:510 +0x450
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).apply(0xc00018a300, 0x1c31fc0, 0xc000c7e960, 0xc0007a8000, 0xc001c399e0, 0x2, 0x1beaa00, 0xc0009d8050, 0x18ee2a0, 0xc001b38600)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:685 +0x558
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).syncOnce(0xc00018a300, 0x1c31fc0, 0xc000c7e960, 0xc001c399e0, 0x10, 0x1beaa00, 0xc0009d8050, 0xc000aa2480, 0xc00109f6e4, 0xa)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:596 +0x34f
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1.1(0xc001c399e0, 0xc00018a300, 0x1c31f40, 0xc000b0e0c0, 0x1bf70c0, 0xc000115860, 0x1967d48, 0x7, 0x10, 0x0, ...)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:324 +0x264
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:325 +0x366
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000f3bf40)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001c85f40, 0x1beb200, 0xc000b64120, 0xc000b0e001, 0xc000038960)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
  k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000f3bf40, 0x989680, 0x0, 0x1, 0xc000038960)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
  k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
  github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc00018a300, 0x1c31f40, 0xc000b0e0c0, 0x10, 0x1967d48, 0x7, 0x1bf70c0, 0xc000115860)
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:256 +0x14c
  created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run
    /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:310 +0x43b

And since the hung CVO was from fc.5, we can resolve those line numbers in [1]:

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.6.0-fc.5-x86_64 | grep cluster-version-operator
    cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       d6475eff00192a56cd6a833aa7f73d6f4f4a2cef

So it was hung in the RunGraph call that attempts release image reconciliation [2] while it's waiting to read from resultCh [3].  I probably flubbed context expiration handling in the new-in-4.6 changes [4].

[1]: https://github.com/openshift/cluster-version-operator/tree/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef
[2]: https://github.com/openshift/cluster-version-operator/blob/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef/pkg/cvo/sync_worker.go#L685
[3]: https://github.com/openshift/cluster-version-operator/blob/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef/pkg/payload/task_graph.go#L510
[4]: https://github.com/openshift/cluster-version-operator/pull/264

--- Additional comment from W. Trevor King on 2020-09-16 16:38:57 UTC ---

One workaround is to kill the cluster-version operator pod.  The deployment controller will create a replacement, and that replacement will begin syncing manifests again without getting stuck in the racy update-triggers-a-sync-worker-cancel hole.

--- Additional comment from W. Trevor King on 2020-09-16 18:02:06 UTC ---

Took a stab at doc text, although I'm also fine with this introduced-in-4.6-and-fixed-before-GA bug not having doc text.

--- Additional comment from OpenShift Automated Release Tooling on 2020-09-18 07:06:11 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.

--- Additional comment from Johnny Liu on 2020-09-21 08:20:05 UTC ---

> I think I reproduced this by installing 4.6.0-fc.5 and updating back and forth between fc.4 and fc.5 until the sync worker hung.

I tried this scenario for about 7 rounds (back to fc.4, then forth to fc.5 again); each upgrade completed successfully.

@wking, do you have any idea how to reproduce this issue?

--- Additional comment from W. Trevor King on 2020-09-21 17:24:17 UTC ---

Per the description in [1], you want to have a big stack of parallel nodes, so that the workers will not be able to consume all the nodes in the queue before your target-release bump gets them canceled.  We have a big stack of parallel nodes when we are in the reconciliation phase, but during that phase the CVO is mostly doing quick "yeah, the in-cluster state is fine for this manifest" checks.  You might have more luck trying to catch it if you re-target mid-update while the CVO is early in the process of reconciling runlevel 50, after the kube-storage-version-migrator operator levels in runlevel 40.

But we also landed a unit test, so if we can't actually reproduce the breakage after a reasonable effort, I'm fine with just putting the new code through the usual regression testing.

[1]: https://github.com/openshift/cluster-version-operator/pull/455/commits/632e763fa209c3e1bd07098c2e33599f8c663e3e

--- Additional comment from Johnny Liu on 2020-09-24 14:04:34 UTC ---

Thanks for your hint. 

I finally reproduced this issue against a cluster with 20 workers, updating back and forth between fc.4 and fc.5.  While trying to upgrade back to fc.4 in the 3rd round, the sync worker got hung.

[root@preserve-jialiu-ansible ~]# oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:960ec73733150827076cbb5fa2c1f5aaa9a94bfbce1b4897e46432a56ac976c1 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:960ec73733150827076cbb5fa2c1f5aaa9a94bfbce1b4897e46432a56ac976c1

[root@preserve-jialiu-ansible ~]# oc get clusterversion -o json| jq -r '.items[].spec'
{
  "channel": "stable-4.6",
  "clusterID": "ab92b0b2-0d37-4f5c-83f9-e467ec43af1a",
  "desiredUpdate": {
    "force": false,
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:960ec73733150827076cbb5fa2c1f5aaa9a94bfbce1b4897e46432a56ac976c1",
    "version": ""
  },
  "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph"
}

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.5   True        False         107m    Cluster version is 4.6.0-fc.5


On the same cluster, I ran updates back and forth between fc.5 and 4.6.0-0.nightly-2020-09-22-073212 for 4 rounds; all passed.

--- Additional comment from W. Trevor King on 2020-09-24 17:04:42 UTC ---

> On the same cluster, I ran updates back and forth between fc.5 and 4.6.0-0.nightly-2020-09-22-073212 for 4 rounds; all passed.

Only the outgoing version should matter, so fc.5 -> patched nightly would still be vulnerable to the race.  But still, VERIFIED makes sense to me, and we'll hear about it in CI if we missed anything else related to accepting update requests.

--- Additional comment from errata-xmlrpc on 2020-10-06 18:03:45 UTC ---

This bug has been added to advisory RHBA-2020:4196 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)

--- Additional comment from errata-xmlrpc on 2020-10-26 00:34:19 UTC ---

Bug report changed to RELEASE_PENDING status by Errata System.
Advisory RHBA-2020:4196-05 has been changed to PUSH_READY status.
https://errata.devel.redhat.com/advisory/54579

--- Additional comment from errata-xmlrpc on 2020-10-27 16:36:11 UTC ---

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 2 Yang Yang 2021-04-02 10:48:28 UTC
Hi Jack,

I'm going to verify it. Could you please give me a hint on how to trigger the trace-level logging?

Thanks.

Comment 3 W. Trevor King 2021-04-02 18:21:18 UTC
--v=5 should give you access to:

$ git --no-pager grep 'klog.V.5' origin/release-4.5 -- pkg/cvo 
origin/release-4.5:pkg/cvo/cvo_test.go:                 klog.V(5).Infof("updated object to %#v", obj)
origin/release-4.5:pkg/cvo/status.go:           klog.V(5).Infof("initialize new history completed=%t desired=%#v", completed, desired)
origin/release-4.5:pkg/cvo/status.go:           klog.V(5).Infof("merge into existing history completed=%t desired=%#v last=%#v", completed, desired, last)
origin/release-4.5:pkg/cvo/status.go:           klog.V(5).Infof("must add a new history entry completed=%t desired=%#v != last=%#v", completed, desired, last)
origin/release-4.5:pkg/cvo/status.go:   if klog.V(5) && len(config.Status.History) > 1 {
origin/release-4.5:pkg/cvo/status.go:   klog.V(5).Infof("Synchronizing errs=%#v status=%#v", validationErrs, status)
origin/release-4.5:pkg/cvo/sync_worker.go:      klog.V(5).Infof("Starting sync worker")
origin/release-4.5:pkg/cvo/sync_worker.go:                              klog.V(5).Infof("Stopped worker")
origin/release-4.5:pkg/cvo/sync_worker.go:                              klog.V(5).Infof("Wait finished")
origin/release-4.5:pkg/cvo/sync_worker.go:                              klog.V(5).Infof("Work updated")
origin/release-4.5:pkg/cvo/sync_worker.go:                              klog.V(5).Infof("No change, waiting")
origin/release-4.5:pkg/cvo/sync_worker.go:                              klog.V(5).Infof("No work, waiting")
origin/release-4.5:pkg/cvo/sync_worker.go:                              klog.V(5).Infof("Previous sync status: %#v", reporter.previousStatus)
origin/release-4.5:pkg/cvo/sync_worker.go:                      klog.V(5).Infof("Sync succeeded, reconciling")
origin/release-4.5:pkg/cvo/sync_worker.go:      klog.V(5).Infof("Worker shut down")
origin/release-4.5:pkg/cvo/sync_worker.go:                              klog.V(5).Infof("Dropping status report from earlier in sync loop")
origin/release-4.5:pkg/cvo/sync_worker.go:      klog.V(5).Infof("Status change %#v", update)
origin/release-4.5:pkg/cvo/sync_worker.go:              if klog.V(5) {
origin/release-4.5:pkg/cvo/sync_worker.go:                      klog.V(5).Infof("Manifest: %s", string(task.Manifest.Raw))
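
The klog calls listed above are all gated on verbosity, which is why they only appear once the CVO runs with --v=5.  A minimal stand-in sketch of that gating (not the real klog API surface, just the idea, with verbosity as an illustrative variable):

```go
package main

import "fmt"

// Verbose mimics the value returned by a klog-style V(level) call: its
// Infof only emits when the process-wide verbosity is high enough.
type Verbose struct{ enabled bool }

var verbosity = 2 // illustrative pre-fix verbosity

// V reports whether messages at the given level should be emitted.
func V(level int) Verbose { return Verbose{verbosity >= level} }

// Infof prints the message only when the level check passed.
func (v Verbose) Infof(format string, args ...interface{}) {
	if v.enabled {
		fmt.Printf(format+"\n", args...)
	}
}

func main() {
	V(2).Infof("visible at verbosity 2: Running sync in state Reconciling")
	V(5).Infof("hidden at verbosity 2: merge into existing history")

	verbosity = 5 // after the bump to --v=5
	V(5).Infof("visible at verbosity 5: merge into existing history completed=%t", false)
}
```

With verbosity at 2, only the first line prints; after the bump to 5, the V(5) diagnostics show up, which is exactly the behavior the greps below check for.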

For example, 4.5.37 CI doesn't have 'merge into existing history':

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1377215339498049536/artifacts/launch/pods/openshift-cluster-version_cluster-version-operator-8dc8b5fc4-fxhnt_cluster-version-operator.log | grep 'merge into existing history'

While recent 4.5 nightlies do:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.5-e2e-aws/1377872648524009472/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-744cf5c97f-fkd2r_cluster-version-operator.log | grep 'merge into existing history' | head -n1
I0402 07:15:39.344456       1 status.go:79] merge into existing history completed=false desired=v1.Update{Version:"4.5.0-0.nightly-2021-04-02-061143", Image:"registry.build02.ci.openshift.org/ci-op-jdcvygb3/release@sha256:573aec4324c3013f6305b807b3e383dfda96f503d70386221f58966483b6a00d", Force:false} last=&v1.UpdateHistory{State:"Partial", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63752942657, loc:(*time.Location)(0x271c5c0)}}, CompletionTime:(*v1.Time)(nil), Version:"4.5.0-0.nightly-2021-04-02-061143", Image:"registry.build02.ci.openshift.org/ci-op-jdcvygb3/release@sha256:573aec4324c3013f6305b807b3e383dfda96f503d70386221f58966483b6a00d", Verified:false}

Comment 4 Yang Yang 2021-04-04 09:13:01 UTC
Trevor, I appreciate your help.

For a freshly installed cluster with 4.5.0-0.nightly-2021-04-02-061143, we can see many v5 "merge into existing history" entries in the CVO pod log:

# cat 1.log | grep "merge into existing history"
I0404 08:40:17.040642       1 status.go:79] merge into existing history completed=true desired=v1.Update{Version:"4.5.0-0.nightly-2021-04-02-061143", Image:"registry.ci.openshift.org/ocp/release@sha256:573aec4324c3013f6305b807b3e383dfda96f503d70386221f58966483b6a00d", Force:false} last=&v1.UpdateHistory{State:"Completed", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63753119866, loc:(*time.Location)(0x271c5c0)}}, CompletionTime:(*v1.Time)(0xc000aa15e0), Version:"4.5.0-0.nightly-2021-04-02-061143", Image:"registry.ci.openshift.org/ocp/release@sha256:573aec4324c3013f6305b807b3e383dfda96f503d70386221f58966483b6a00d", Verified:false}

While a 4.5.36 cluster does not have that log:
# cat 2.log | grep "merge into existing history completed"

Moving it to verified.

Comment 7 errata-xmlrpc 2021-04-28 14:23:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1300
