Bug 1756461
| Summary: | ClusterVersion upgrade history of previously completed upgrades completionTimestamp is being modified when it shouldnt | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Chance Zibolski <chancez> | |
| Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> | |
| Status: | CLOSED ERRATA | QA Contact: | liujia <jiajliu> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.1.z | CC: | aos-bugs, ccoleman, dmoessne, jokerman, rszumski, wking | |
| Target Milestone: | --- | |||
| Target Release: | 4.3.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1759710 1759712 (view as bug list) | Environment: | ||
| Last Closed: | 2020-05-13 21:25:49 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1759710 | |||
|
Description
Chance Zibolski
2019-09-27 17:13:57 UTC
Target 4.3, because we'll have to fix it there first (or confirm that it doesn't happen there). Once we get that settled, we can clone back to previous versions. Reproduced in CI [1]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-09-27-182824-sha256-de04c6df8404216773aa0a33ab82750724d3a472528aaf799a9d8b85dbc9f6b5/cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | tail -n15 history: - completionTime: 2019-09-27T22:29:06Z image: registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-09-27-182824 startedTime: 2019-09-27T21:45:59Z state: Completed verified: false version: 4.3.0-0.ci-2019-09-27-182824 - completionTime: 2019-09-27T21:45:59Z image: registry.svc.ci.openshift.org/ocp/release@sha256:d2462537f448853571cc914b334e7169c307767540aad25ebcf4106661953471 startedTime: 2019-09-27T21:25:09Z state: Completed verified: false version: 4.3.0-0.ci-2019-09-27-181735 observedGeneration: 2 versionHash: 9zNtav-0cDg= $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/container-logs/test.log | grep 'nodes to be schedulable\|clusterversion/version\|Completed upgrade' | sort Sep 27 21:42:28.080: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable Sep 27 21:45:59.753 W clusterversion/version changed Progressing to True: Working towards 4.3.0-0.ci-2019-09-27-182824: 1% complete Sep 27 21:46:11.007 - 2564s W clusterversion/version cluster is updating to 4.3.0-0.ci-2019-09-27-182824 Sep 27 21:52:06.823 E clusterversion/version changed Failing to True: MultipleErrors: Multiple errors are preventing progress:\n* Could not update deployment "openshift-authentication-operator/authentication-operator" (135 of 433)\n* Could not update deployment "openshift-cloud-credential-operator/cloud-credential-operator" (120 of 433)\n* Could not update deployment "openshift-cluster-node-tuning-operator/cluster-node-tuning-operator" (197 of 433)\n* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (221 of 433)\n* Could not update deployment "openshift-cluster-storage-operator/cluster-storage-operator" (235 of 433)\n* Could not update deployment "openshift-controller-manager-operator/openshift-controller-manager-operator" (208 of 433)\n* Could not update deployment "openshift-monitoring/cluster-monitoring-operator" (260 of 433)\n* Could not update deployment "openshift-operator-lifecycle-manager/olm-operator" (324 of 433)\n* Could not update deployment "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (245 of 433)\n* Could not update deployment "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (254 of 433) Sep 27 21:52:36.822 W clusterversion/version changed Failing to False Sep 27 22:18:15.746 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating Sep 27 22:20:08.961 W clusterversion/version changed Failing to False Sep 27 22:24:17.883 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating Sep 27 22:24:20.369 W clusterversion/version changed Failing to False Sep 27 22:29:06.062 W clusterversion/version changed Progressing to False: Cluster version is 4.3.0-0.ci-2019-09-27-182824 Sep 27 22:29:06.062 W clusterversion/version cluster reached 4.3.0-0.ci-2019-09-27-182824 Sep 27 22:29:12.396: INFO: Completed upgrade to registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-09-27-182824 So the initial install (which finishes before the test container starts waiting for nodes at 21:42:28Z) has had its completion time bumped to 21:45:59Z. Gathered logs from the CVO start at 22:24:19Z [2], around when we tripped into failing for the machine-config, so they won't be much help. Checking node restart times: $ curl -s 'https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/nodes.json' | jq -r '.items[] | .conditions = ([.status.conditions[] | {key: .type, value: .}] | from_entries) | .conditions.Ready.lastTransitionTime + " " + .metadata.annotations["machine.openshift.io/machine"]' | sort 2019-09-27T22:09:55Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-master-1 2019-09-27T22:19:42Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-worker-us-east-1a-rtjd6 2019-09-27T22:24:09Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-master-2 2019-09-27T22:28:26Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-master-0 2019-09-27T22:33:30Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-worker-us-east-1a-djqwv 2019-09-27T22:47:09Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-worker-us-east-1b-zvtqm Code for this is in [3], which hasn't changed in a while. I don't see anything wrong there, but obviously I'm missing something... [1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15 [2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-09-27-182824-sha256-de04c6df8404216773aa0a33ab82750724d3a472528aaf799a9d8b85dbc9f6b5/namespaces/openshift-cluster-version/pods/cluster-version-operator-697f9f6cf-sfscz/cluster-version-operator/cluster-version-operator/logs/current.log [3]: https://github.com/openshift/cluster-version-operator/blob/0725bd53c732c8e3a490d1080f03edbe7e579b3a/pkg/cvo/status.go#L71-L110 Line 87 should be checking completion time for nil before setting. Example validation with a CI upgrade [1]. Still no 4.3 nightlies to validate in an ART build [2], so I dunno why this is ON_QA. Maybe that transition is not gated on whether the 4.3 nightly actually builds. [1]: https://github.com/openshift/cluster-version-operator/pull/247#issuecomment-539824540 [2]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly Reproduced on 4.1.18-4.1.20 upgrade. Before upgrade # oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version' 2019-10-14T02:57:34Z|2019-10-14T03:13:16Z|Completed|4.1.18 After do upgrade operation # oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version' 2019-10-14T03:43:16Z|2019-10-14T03:43:31Z|Completed|4.1.18 2019-10-14T03:42:31Z|2019-10-14T03:43:16Z|Partial|4.1.20 2019-10-14T02:57:34Z|2019-10-14T03:42:31Z|Completed|4.1.18 The completionTime of the last item in history will be updated to the startedTime of next item. Verified on 4.3.0-0.nightly-2019-10-15-225018 Before upgrade # oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version' 2019-10-16T04:22:53Z|2019-10-16T04:37:40Z|Completed|4.3.0-0.nightly-2019-10-15-180816 After do upgrade operation # oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version' 2019-10-16T08:23:10Z||Partial|4.3.0-0.nightly-2019-10-15-225018 2019-10-16T08:22:08Z|2019-10-16T08:22:25Z|Completed|4.3.0-0.nightly-2019-10-15-180816 2019-10-16T08:21:25Z|2019-10-16T08:22:08Z|Partial|4.3.0-0.nightly-2019-10-15-225018 2019-10-16T04:22:53Z|2019-10-16T04:37:40Z|Completed|4.3.0-0.nightly-2019-10-15-180816 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062 |