Description of problem:
The ClusterVersion object in my cluster is having its previously completed upgrade history modified post-upgrade, so it no longer provides an accurate summary of upgrade history. The timestamps on upgrades are being aligned end to end (an upgrade's completionTime always equals the next upgrade's startedTime). From previously inspecting upgrades I know this isn't accurate, and I know the values were not always like this, hence the conclusion that the history is being modified post-upgrade.

Version-Release number of selected component (if applicable):
4.1.18

How reproducible:
Seems common. Rob Szumski also hit this.

Steps to Reproduce:
1. Run upgrades
2. Check the OpenShift UI
3. Wait
4. Do another upgrade
5. Verify that even though you waited, the first upgrade's completionTime equals the second upgrade's startedTime

Actual results:
Incorrect upgrade history times

Expected results:
Correct upgrade history times

Additional info:
ClusterID is af8bc55b-9ae3-4735-bf65-b6ef43aeced9

You can see a snippet of my previous upgrade history in the following ticket: https://bugzilla.redhat.com/show_bug.cgi?id=1753348

In particular, this was my latest upgrade information from the ClusterVersion history list at 4.1.16. Pay attention to completionTime.
  history:
  - completionTime: "2019-09-14T01:25:04Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
    startedTime: "2019-09-14T00:03:21Z"
    state: Completed
    verified: true
    version: 4.1.16

Now, look at the same information taken today from the same cluster at 4.1.18:

  history:
  - completionTime: "2019-09-25T17:22:57Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:420633acf3fc7572372fe2df758152f6ab1f53a21c79a6c4b741fa0394c7df3a
    startedTime: "2019-09-25T16:13:14Z"
    state: Completed
    verified: true
    version: 4.1.18
  - completionTime: "2019-09-25T16:13:14Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:747e0d41ee2f1af8b234e8c96c3291225a120fab3af53ae691afb4f51ce02b85
    startedTime: "2019-09-23T23:01:54Z"
    state: Completed
    verified: true
    version: 4.1.17
  - completionTime: "2019-09-23T23:01:54Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
    startedTime: "2019-09-14T00:03:21Z"
    state: Completed
    verified: true
    version: 4.1.16

Notice that the version 4.1.16 completionTime was:

  - completionTime: "2019-09-14T01:25:04Z"

and the current value of that same item's completionTime in the ClusterVersion history list for 4.1.16 is:

  - completionTime: "2019-09-23T23:01:54Z"

As shown, the completionTime of an upgrade completed two versions ago has changed.
Target 4.3, because we'll have to fix it there first (or confirm that it doesn't happen there). Once we get that settled, we can clone back to previous versions.
Reproduced in CI [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-09-27-182824-sha256-de04c6df8404216773aa0a33ab82750724d3a472528aaf799a9d8b85dbc9f6b5/cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | tail -n15
  history:
  - completionTime: 2019-09-27T22:29:06Z
    image: registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-09-27-182824
    startedTime: 2019-09-27T21:45:59Z
    state: Completed
    verified: false
    version: 4.3.0-0.ci-2019-09-27-182824
  - completionTime: 2019-09-27T21:45:59Z
    image: registry.svc.ci.openshift.org/ocp/release@sha256:d2462537f448853571cc914b334e7169c307767540aad25ebcf4106661953471
    startedTime: 2019-09-27T21:25:09Z
    state: Completed
    verified: false
    version: 4.3.0-0.ci-2019-09-27-181735
  observedGeneration: 2
  versionHash: 9zNtav-0cDg=

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/container-logs/test.log | grep 'nodes to be schedulable\|clusterversion/version\|Completed upgrade' | sort
Sep 27 21:42:28.080: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Sep 27 21:45:59.753 W clusterversion/version changed Progressing to True: Working towards 4.3.0-0.ci-2019-09-27-182824: 1% complete
Sep 27 21:46:11.007 - 2564s W clusterversion/version cluster is updating to 4.3.0-0.ci-2019-09-27-182824
Sep 27 21:52:06.823 E clusterversion/version changed Failing to True: MultipleErrors: Multiple errors are preventing progress:\n* Could not update deployment "openshift-authentication-operator/authentication-operator" (135 of 433)\n* Could not update deployment "openshift-cloud-credential-operator/cloud-credential-operator" (120 of 433)\n* Could not update deployment "openshift-cluster-node-tuning-operator/cluster-node-tuning-operator" (197 of 433)\n* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (221 of 433)\n* Could not update deployment "openshift-cluster-storage-operator/cluster-storage-operator" (235 of 433)\n* Could not update deployment "openshift-controller-manager-operator/openshift-controller-manager-operator" (208 of 433)\n* Could not update deployment "openshift-monitoring/cluster-monitoring-operator" (260 of 433)\n* Could not update deployment "openshift-operator-lifecycle-manager/olm-operator" (324 of 433)\n* Could not update deployment "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (245 of 433)\n* Could not update deployment "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (254 of 433)
Sep 27 21:52:36.822 W clusterversion/version changed Failing to False
Sep 27 22:18:15.746 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating
Sep 27 22:20:08.961 W clusterversion/version changed Failing to False
Sep 27 22:24:17.883 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating
Sep 27 22:24:20.369 W clusterversion/version changed Failing to False
Sep 27 22:29:06.062 W clusterversion/version changed Progressing to False: Cluster version is 4.3.0-0.ci-2019-09-27-182824
Sep 27 22:29:06.062 W clusterversion/version cluster reached 4.3.0-0.ci-2019-09-27-182824
Sep 27 22:29:12.396: INFO: Completed upgrade to registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-09-27-182824

So the initial install (which finished before the test container started waiting for nodes at 21:42:28Z) has had its completion time bumped to 21:45:59Z. The gathered logs from the CVO start at 22:24:19Z [2], around when we tripped into Failing for machine-config, so they won't be much help.
Checking node restart times:

$ curl -s 'https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/nodes.json' | jq -r '.items[] | .conditions = ([.status.conditions[] | {key: .type, value: .}] | from_entries) | .conditions.Ready.lastTransitionTime + " " + .metadata.annotations["machine.openshift.io/machine"]' | sort
2019-09-27T22:09:55Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-master-1
2019-09-27T22:19:42Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-worker-us-east-1a-rtjd6
2019-09-27T22:24:09Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-master-2
2019-09-27T22:28:26Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-master-0
2019-09-27T22:33:30Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-worker-us-east-1a-djqwv
2019-09-27T22:47:09Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-worker-us-east-1b-zvtqm

The code for this is in [3], which hasn't changed in a while. I don't see anything wrong there, but obviously I'm missing something...

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-09-27-182824-sha256-de04c6df8404216773aa0a33ab82750724d3a472528aaf799a9d8b85dbc9f6b5/namespaces/openshift-cluster-version/pods/cluster-version-operator-697f9f6cf-sfscz/cluster-version-operator/cluster-version-operator/logs/current.log
[3]: https://github.com/openshift/cluster-version-operator/blob/0725bd53c732c8e3a490d1080f03edbe7e579b3a/pkg/cvo/status.go#L71-L110
Line 87 should check the completion time for nil before setting it.
Example validation with a CI upgrade [1]. Still no 4.3 nightlies to validate in an ART build [2], so I dunno why this is ON_QA. Maybe that transition is not gated on whether the 4.3 nightly actually builds. [1]: https://github.com/openshift/cluster-version-operator/pull/247#issuecomment-539824540 [2]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly
Reproduced on a 4.1.18 to 4.1.20 upgrade.

Before the upgrade:

# oc get clusterversion -o json | jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-10-14T02:57:34Z|2019-10-14T03:13:16Z|Completed|4.1.18

After the upgrade operation:

# oc get clusterversion -o json | jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-10-14T03:43:16Z|2019-10-14T03:43:31Z|Completed|4.1.18
2019-10-14T03:42:31Z|2019-10-14T03:43:16Z|Partial|4.1.20
2019-10-14T02:57:34Z|2019-10-14T03:42:31Z|Completed|4.1.18

The completionTime of the last item in the history has been updated to the startedTime of the next item.
Verified on 4.3.0-0.nightly-2019-10-15-225018.

Before the upgrade:

# oc get clusterversion -o json | jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-10-16T04:22:53Z|2019-10-16T04:37:40Z|Completed|4.3.0-0.nightly-2019-10-15-180816

After the upgrade operation:

# oc get clusterversion -o json | jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-10-16T08:23:10Z||Partial|4.3.0-0.nightly-2019-10-15-225018
2019-10-16T08:22:08Z|2019-10-16T08:22:25Z|Completed|4.3.0-0.nightly-2019-10-15-180816
2019-10-16T08:21:25Z|2019-10-16T08:22:08Z|Partial|4.3.0-0.nightly-2019-10-15-225018
2019-10-16T04:22:53Z|2019-10-16T04:37:40Z|Completed|4.3.0-0.nightly-2019-10-15-180816

The originally recorded 2019-10-16T04:37:40Z completionTime for the oldest entry is preserved after the upgrade.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062