Bug 1756461 - ClusterVersion upgrade history of previously completed upgrades completionTimestamp is being modified when it shouldnt
Summary: ClusterVersion upgrade history of previously completed upgrades completionTim...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: 4.3.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 1759710
TreeView+ depends on / blocked
 
Reported: 2019-09-27 17:13 UTC by Chance Zibolski
Modified: 2020-05-13 21:25 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1759710 1759712 (view as bug list)
Environment:
Last Closed: 2020-05-13 21:25:49 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 247 0 None closed Bug 1756461: pkg/cvo/status: Don't clobber last.CompletionTime when adding new entries 2020-03-18 05:01:16 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-05-13 21:25:52 UTC

Description Chance Zibolski 2019-09-27 17:13:57 UTC
Description of problem: The ClusterVersion object in my cluster is having previously completed upgrade history modified post-upgrade such that it no longer is providing an accurate summary of upgrade history. I am seeing the timestamps on upgrades getting aligned end to end (upgrade completed time is always equal to the next upgrade's startedTime) and I know from previously looking at upgrades that this isn't accurate, and I know the values were not always like this, hence the conclusion that the history is being modified post-upgrade.


Version-Release number of selected component (if applicable): 4.1.18


How reproducible: Seems common. Rob Szumski also hit this.


Steps to Reproduce:
1. Run upgrades
2. Check Openshift UI
3. Wait
4. Do another upgrade
5. Verify even though you waited, the first upgrade's completedTime will equal the 2nd upgrades startedTime.

Actual results: Incorrect upgrade history times


Expected results: Correct upgrade history times


Additional info:

ClusterID is af8bc55b-9ae3-4735-bf65-b6ef43aeced9

You can see a snippet of my previous upgrade history in the following ticket: https://bugzilla.redhat.com/show_bug.cgi?id=1753348 

In particular, this was my latest upgrades information from the ClusterVersion history list 4.1.16. Pay attention to completionTime.

    history:
    - completionTime: "2019-09-14T01:25:04Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
      startedTime: "2019-09-14T00:03:21Z"
      state: Completed
      verified: true
      version: 4.1.16


Now, look at the same information taken today from the same cluster at 4.1.18:

    history:
    - completionTime: "2019-09-25T17:22:57Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:420633acf3fc7572372fe2df758152f6ab1f53a21c79a6c4b741fa0394c7df3a
      startedTime: "2019-09-25T16:13:14Z"
      state: Completed
      verified: true
      version: 4.1.18
    - completionTime: "2019-09-25T16:13:14Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:747e0d41ee2f1af8b234e8c96c3291225a120fab3af53ae691afb4f51ce02b85
      startedTime: "2019-09-23T23:01:54Z"
      state: Completed
      verified: true
      version: 4.1.17
    - completionTime: "2019-09-23T23:01:54Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
      startedTime: "2019-09-14T00:03:21Z"
      state: Completed
      verified: true
      version: 4.1.16

Notice that the version 4.1.16 completionTime was:
    - completionTime: "2019-09-14T01:25:04Z"

and the current value of that same item in the CV history list's completionTime for 4.1.16 is:
    - completionTime: "2019-09-23T23:01:54Z"


As shown, the completionTime of a previously, 2 versions ago upgrade changed.

Comment 1 W. Trevor King 2019-09-27 17:19:20 UTC
Target 4.3, because we'll have to fix it there first (or confirm that it doesn't happen there).  Once we get that settled, we can clone back to previous versions.

Comment 2 W. Trevor King 2019-09-27 23:41:42 UTC
Reproduced in CI [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-09-27-182824-sha256-de04c6df8404216773aa0a33ab82750724d3a472528aaf799a9d8b85dbc9f6b5/cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | tail -n15
  history:
  - completionTime: 2019-09-27T22:29:06Z
    image: registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-09-27-182824
    startedTime: 2019-09-27T21:45:59Z
    state: Completed
    verified: false
    version: 4.3.0-0.ci-2019-09-27-182824
  - completionTime: 2019-09-27T21:45:59Z
    image: registry.svc.ci.openshift.org/ocp/release@sha256:d2462537f448853571cc914b334e7169c307767540aad25ebcf4106661953471
    startedTime: 2019-09-27T21:25:09Z
    state: Completed
    verified: false
    version: 4.3.0-0.ci-2019-09-27-181735
  observedGeneration: 2
  versionHash: 9zNtav-0cDg=
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/container-logs/test.log | grep 'nodes to be schedulable\|clusterversion/version\|Completed upgrade' | sort
Sep 27 21:42:28.080: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Sep 27 21:45:59.753 W clusterversion/version changed Progressing to True: Working towards 4.3.0-0.ci-2019-09-27-182824: 1% complete
Sep 27 21:46:11.007 - 2564s W clusterversion/version cluster is updating to 4.3.0-0.ci-2019-09-27-182824
Sep 27 21:52:06.823 E clusterversion/version changed Failing to True: MultipleErrors: Multiple errors are preventing progress:\n* Could not update deployment "openshift-authentication-operator/authentication-operator" (135 of 433)\n* Could not update deployment "openshift-cloud-credential-operator/cloud-credential-operator" (120 of 433)\n* Could not update deployment "openshift-cluster-node-tuning-operator/cluster-node-tuning-operator" (197 of 433)\n* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (221 of 433)\n* Could not update deployment "openshift-cluster-storage-operator/cluster-storage-operator" (235 of 433)\n* Could not update deployment "openshift-controller-manager-operator/openshift-controller-manager-operator" (208 of 433)\n* Could not update deployment "openshift-monitoring/cluster-monitoring-operator" (260 of 433)\n* Could not update deployment "openshift-operator-lifecycle-manager/olm-operator" (324 of 433)\n* Could not update deployment "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (245 of 433)\n* Could not update deployment "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (254 of 433)
Sep 27 21:52:36.822 W clusterversion/version changed Failing to False
Sep 27 22:18:15.746 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating
Sep 27 22:20:08.961 W clusterversion/version changed Failing to False
Sep 27 22:24:17.883 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating
Sep 27 22:24:20.369 W clusterversion/version changed Failing to False
Sep 27 22:29:06.062 W clusterversion/version changed Progressing to False: Cluster version is 4.3.0-0.ci-2019-09-27-182824
Sep 27 22:29:06.062 W clusterversion/version cluster reached 4.3.0-0.ci-2019-09-27-182824
Sep 27 22:29:12.396: INFO: Completed upgrade to registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-09-27-182824

So the initial install (which finishes before the test container starts waiting for nodes at 21:42:28Z) has had its completion time bumped to 21:45:59Z.  Gathered logs from the CVO start at 22:24:19Z [2], around when we tripped into failing for the machine-config, so they won't be much help.  Checking node restart times:

$ curl -s 'https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/nodes.json' | jq -r '.items[] | .conditions = ([.status.conditions[] | {key: .type, value: .}] | from_entries) | .conditions.Ready.lastTransitionTime + " " + .metadata.annotations["machine.openshift.io/machine"]' | sort
2019-09-27T22:09:55Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-master-1
2019-09-27T22:19:42Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-worker-us-east-1a-rtjd6
2019-09-27T22:24:09Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-master-2
2019-09-27T22:28:26Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-master-0
2019-09-27T22:33:30Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-worker-us-east-1a-djqwv
2019-09-27T22:47:09Z openshift-machine-api/ci-op-91m3mhcv-77a90-6dht8-worker-us-east-1b-zvtqm

Code for this is in [3], which hasn't changed in a while.  I don't see anything wrong there, but obviously I'm missing something...

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/15/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-09-27-182824-sha256-de04c6df8404216773aa0a33ab82750724d3a472528aaf799a9d8b85dbc9f6b5/namespaces/openshift-cluster-version/pods/cluster-version-operator-697f9f6cf-sfscz/cluster-version-operator/cluster-version-operator/logs/current.log
[3]: https://github.com/openshift/cluster-version-operator/blob/0725bd53c732c8e3a490d1080f03edbe7e579b3a/pkg/cvo/status.go#L71-L110

Comment 3 Clayton Coleman 2019-09-28 20:19:55 UTC
Line 87 should be checking completion time for nil before setting.

Comment 5 W. Trevor King 2019-10-09 04:38:48 UTC
Example validation with a CI upgrade [1].  Still no 4.3 nightlies to validate in an ART build [2], so I dunno why this is ON_QA.  Maybe that transition is not gated on whether the 4.3 nightly actually builds.

[1]: https://github.com/openshift/cluster-version-operator/pull/247#issuecomment-539824540
[2]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly

Comment 6 liujia 2019-10-14 03:51:43 UTC
Reproduced on 4.1.18-4.1.20 upgrade.

Before upgrade
# oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-10-14T02:57:34Z|2019-10-14T03:13:16Z|Completed|4.1.18

After do upgrade operation
# oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-10-14T03:43:16Z|2019-10-14T03:43:31Z|Completed|4.1.18
2019-10-14T03:42:31Z|2019-10-14T03:43:16Z|Partial|4.1.20
2019-10-14T02:57:34Z|2019-10-14T03:42:31Z|Completed|4.1.18

The completionTime of the last item in history will be updated to the startedTime of next item.

Comment 7 liujia 2019-10-16 08:54:28 UTC
Verified on 4.3.0-0.nightly-2019-10-15-225018

Before upgrade
# oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-10-16T04:22:53Z|2019-10-16T04:37:40Z|Completed|4.3.0-0.nightly-2019-10-15-180816

After do upgrade operation
# oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2019-10-16T08:23:10Z||Partial|4.3.0-0.nightly-2019-10-15-225018
2019-10-16T08:22:08Z|2019-10-16T08:22:25Z|Completed|4.3.0-0.nightly-2019-10-15-180816
2019-10-16T08:21:25Z|2019-10-16T08:22:08Z|Partial|4.3.0-0.nightly-2019-10-15-225018
2019-10-16T04:22:53Z|2019-10-16T04:37:40Z|Completed|4.3.0-0.nightly-2019-10-15-180816

Comment 9 errata-xmlrpc 2020-05-13 21:25:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062


Note You need to log in before you can comment on or make changes to this bug.