Description of problem:
Upgraded a v4.2.4 cluster to a v4.3 nightly build. During the upgrade process, the clusterversion did not show the upgrade progress, but just gave the target version.

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.4     True        True          21s     Working towards 4.3.0-0.nightly-2019-11-12-000306: 1% complete

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.4     True        False         36s     Cluster version is 4.3.0-0.nightly-2019-11-12-000306

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.4     True        False         115s    Cluster version is 4.3.0-0.nightly-2019-11-12-000306
....
# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.4     True        False         5m14s   Cluster version is 4.3.0-0.nightly-2019-11-12-000306

# ./oc adm upgrade
Cluster version is 4.3.0-0.nightly-2019-11-12-000306

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: currently installed version 4.3.0-0.nightly-2019-11-12-000306 not found in the "stable-4.2" channel

Checked the history; it shows the upgrade is not completed yet.

# oc get clusterversion -o json|jq .items[0].status.history
[
  {
    "completionTime": null,
    "image": "registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306",
    "startedTime": "2019-11-12T03:55:56Z",
    "state": "Partial",
    "verified": false,
    "version": "4.3.0-0.nightly-2019-11-12-000306"
  },
  {
    "completionTime": "2019-11-12T03:51:56Z",
    "image": "registry.svc.ci.openshift.org/ocp/release@sha256:cebce35c054f1fb066a4dc0a518064945087ac1f3637fe23d2ee2b0c433d6ba8",
    "startedTime": "2019-11-12T03:34:41Z",
    "state": "Completed",
    "verified": false,
    "version": "4.2.4"
  }
]

Checked the co status; some of the cluster operators are still updating.

# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.4                               True        False         False      15m
cloud-credential                           4.2.4                               True        False         False      29m
cluster-autoscaler                         4.2.4                               True        False         False      25m
console                                    4.2.4                               True        False         False      18m
dns                                        4.2.4                               True        False         False      29m
image-registry                             4.2.4                               True        False         False      21m
ingress                                    4.2.4                               True        False         False      20m
insights                                   4.2.4                               True        False         False      30m
kube-apiserver                             4.3.0-0.nightly-2019-11-12-000306   True        False         False      28m
kube-controller-manager                    4.3.0-0.nightly-2019-11-12-000306   True        False         False      27m
kube-scheduler                             4.3.0-0.nightly-2019-11-12-000306   True        False         False      27m
machine-api                                4.3.0-0.nightly-2019-11-12-000306   True        False         False      30m
machine-config                             4.2.4                               True        False         False      29m
marketplace                                4.2.4                               True        False         False      25m
monitoring                                 4.2.4                               True        False         False      19m
network                                    4.2.4                               True        False         False      29m
node-tuning                                4.3.0-0.nightly-2019-11-12-000306   True        False         False      95s
openshift-apiserver                        4.3.0-0.nightly-2019-11-12-000306   True        False         False      26m
openshift-controller-manager               4.2.4                               True        False         False      27m
openshift-samples                          4.3.0-0.nightly-2019-11-12-000306   True        False         False      56s
operator-lifecycle-manager                 4.2.4                               True        False         False      28m
operator-lifecycle-manager-catalog         4.2.4                               True        False         False      28m
operator-lifecycle-manager-packageserver   4.2.4                               True        False         False      28m
service-ca                                 4.2.4                               True        False         False      29m
service-catalog-apiserver                  4.3.0-0.nightly-2019-11-12-000306   True        False         False      26m
service-catalog-controller-manager         4.2.4                               True        False         False      26m
storage                                    4.3.0-0.nightly-2019-11-12-000306   True        False         False      107s

But actually the upgrade is still ongoing, and the above cluster operators do eventually update.
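For reference, the Progressing condition on the ClusterVersion object can be inspected directly instead of relying on the table's STATUS column. This is a generic oc/jq query added here for illustration, not a command from the original report:

# ./oc get clusterversion version -o json | jq -r '.status.conditions[] | select(.type=="Progressing") | .status + "|" + .lastTransitionTime + "|" + (.message // "")'

While an upgrade is genuinely in progress this should print a True status with a "Working towards ..." message; in the state described above it already printed False with "Cluster version is ...".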
# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-12-000306   True        False         70m     Cluster version is 4.3.0-0.nightly-2019-11-12-000306

# ./oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-11-12T03:55:56Z|2019-11-12T04:33:54Z|Completed|4.3.0-0.nightly-2019-11-12-000306
2019-11-12T03:34:41Z|2019-11-12T03:51:56Z|Completed|4.2.4

# ./oc get co|grep "4.3.0-0.nightly-2019-11-12-000306"|wc -l
27
# ./oc get co|grep "4.2.4"|wc -l
0

Version-Release number of the following components:
4.3.0-0.nightly-2019-11-12-000306

# ./oc version
Client Version: openshift-clients-4.3.0-201910250623-48-g8c30708e
Server Version: 4.3.0-0.nightly-2019-11-12-000306
Kubernetes Version: v1.16.2

How reproducible:
Not 100% reproducible; hit once.

Steps to Reproduce:
1. Upgrade OCP from v4.2.4 to 4.3.0-0.nightly-2019-11-12-000306
# ./oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 --allow-explicit-upgrade --force
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306
2.
3.

Actual results:
The cluster version does not show the correct status during the upgrade.

Expected results:
The cluster version should show the correct status during the upgrade.

Additional info:
When the status was not correct, I checked that the CVO was working well.

# ./oc get all -n openshift-cluster-version
NAME                                            READY   STATUS      RESTARTS   AGE
pod/cluster-version-operator-7446bc8f98-kdwnr   1/1     Running     0          11m
pod/version--gvw9x-phzjb                        0/1     Completed   0          11m

NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cluster-version-operator   ClusterIP   172.30.240.16   <none>        9099/TCP   33m

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-version-operator   1/1     1            1           33m

NAME                                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-version-operator-7446bc8f98   1         1         1       11m
replicaset.apps/cluster-version-operator-77545448d6   0         0         0       33m
replicaset.apps/cluster-version-operator-7b786f87c6   0         0         0       33m

NAME                       COMPLETIONS   DURATION   AGE
job.batch/version--gvw9x   1/1           15s        11m
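As a side note, a jq sketch (not from the original report; the target version string is simply the one used above) can list which cluster operators have not yet reported the target version, which is easier to read than counting grep matches:

# ./oc get co -o json | jq -r --arg v "4.3.0-0.nightly-2019-11-12-000306" '.items[] | select(any(.status.versions[]?; .name=="operator" and .version != $v)) | .metadata.name'

An empty output here would correspond to the "27 at target / 0 at 4.2.4" counts shown above.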
Looking at the attached must-gather fragment:

$ sha1sum log.tgz
0aa3c5a2c7e8a1cae260ee8a9ce158f57d4088b3  log.tgz
$ tar xzf log.tgz
$ grep 'Running sync.*in state\|Result of work' log/namespaces/openshift-cluster-version/pods/cluster-version-operator-7446bc8f98-p4vjs/cluster-version-operator/cluster-version-operator/logs/current.log | head -n6
2019-11-12T04:22:23.005709831Z I1112 04:22:23.003748 1 sync_worker.go:464] Running sync registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 (force=true) on generation 2 in state Updating at attempt 0
2019-11-12T04:28:08.055820161Z I1112 04:28:08.055806 1 task_graph.go:611] Result of work: [Cluster operator monitoring is still updating]
2019-11-12T04:28:33.341970344Z I1112 04:28:33.341890 1 sync_worker.go:464] Running sync registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 (force=true) on generation 2 in state Updating at attempt 1
2019-11-12T04:33:49.389389153Z I1112 04:33:49.389357 1 task_graph.go:611] Result of work: []
2019-11-12T04:36:41.915469438Z I1112 04:36:41.915369 1 sync_worker.go:464] Running sync registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-12-000306 (force=true) on generation 2 in state Reconciling at attempt 0
2019-11-12T04:37:14.276571228Z I1112 04:37:14.276551 1 task_graph.go:611] Result of work: []

So that seems reasonable, and 04:33Z matches the upgrade-complete time from your comment 0 history. I'm looking to see if I can find anything from earlier...
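If the must-gather contains logs from more than one CVO pod (for example, an earlier pod that was replaced during the control-plane update), a recursive grep over the whole namespace directory should surface their sync transitions too. This is a generic GNU grep invocation assuming the same directory layout, not a command taken from the original comment:

$ grep -r 'Running sync.*in state\|Result of work' log/namespaces/openshift-cluster-version --include='*.log' | head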
What we might need is a dump of the CVO logs from when it is reporting the early 'Cluster version is ...', before that pod gets killed (probably due to the control-plane machine it was on getting updated) and rescheduled elsewhere, blowing away the old logs.
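One way to capture that (a generic streaming approach for illustration; the output file name is arbitrary) is to follow the CVO deployment's logs into a local file while the upgrade runs, re-running the command if the stream drops when the pod is rescheduled:

# ./oc -n openshift-cluster-version logs -f deployment/cluster-version-operator > cvo-during-upgrade.log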
OK, got it. I will capture more logs before the original CVO pod is killed when I hit this again.
Can we please confirm whether or not this still happens both in the latest 4.4 builds and the 4.3 release builds?
Monitored an upgrade from v4.3.0 to the latest 4.4.0-0.nightly-2020-02-04-002939 and did not hit the issue. Monitored another upgrade from v4.2.16 to the latest v4.3.0 and did not hit the issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581