Bug 1703158

Summary: CVO takes more than 2 min to ack upgrade request
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Cluster Version Operator
Assignee: Abhinav Dahiya <adahiya>
Status: CLOSED ERRATA
QA Contact: Gaoyun Pei <gpei>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: aos-bugs, bleanhar, bpeterse, gklein, jialiu, jokerman, mmccomas, wking
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2019-04-25 16:32:01 UTC
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/314/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade/20

The CVO timed out after 2 minutes without updating observed generation, which implies that the setting of desiredUpdate didn't correctly propagate to the sync worker and then cancel the current rollout.  Changes to desired update should propagate immediately.

I temporarily bumped the timeout to 5 minutes in origin, but this is a serious bug that needs to be investigated and probably fixed before GA.
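For readers unfamiliar with the check that timed out, here is a minimal sketch of that kind of wait loop, assuming the test polls the ClusterVersion object until status.observedGeneration catches up with metadata.generation. The function name `wait_for_ack` and the `fetch` callable are hypothetical and are not taken from the origin e2e code; `fetch` stands in for whatever call returns the ClusterVersion as a dict.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_ack(
    fetch: Callable[[], Mapping[str, Any]],
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until observedGeneration >= generation, or until the timeout.

    `fetch` is a hypothetical callable returning the ClusterVersion as a
    dict (for example, parsed JSON). Returns True if the request was
    acknowledged in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        cv = fetch()
        generation = cv["metadata"]["generation"]
        # A missing field is treated as "nothing observed yet".
        observed = cv.get("status", {}).get("observedGeneration", 0)
        if observed >= generation:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With a 120 second timeout this mirrors the 2 minute window described above; raising `timeout_s` to 300 corresponds to the temporary mitigation, which hides the slow acknowledgement without fixing it.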

Comment 1 Abhinav Dahiya 2019-04-25 16:44:21 UTC
Looking at the logs
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/314/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade/20?log#log
```
Apr 25 15:37:14.264: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ci-op-c60fjs69/release@sha256:2a927947eac3e08e6b154d84b2b0f678b38087dbed6e8138cf7e492fbd8e9573
Apr 25 15:39:14.362: INFO: Current cluster version:
{
  "metadata": {
    "name": "version",
    "selfLink": "/apis/config.openshift.io/v1/clusterversions/version",
    "uid": "b6a1e5a3-676d-11e9-977b-12496c4a6d96",
    "resourceVersion": "17542",
    "generation": 2,
    "creationTimestamp": "2019-04-25T15:20:52Z"
  },
  "spec": {
    "clusterID": "34d74856-f667-4766-a862-1b6a0dc86d4e",
    "desiredUpdate": {
      "version": "",
      "image": "registry.svc.ci.openshift.org/ci-op-c60fjs69/release@sha256:2a927947eac3e08e6b154d84b2b0f678b38087dbed6e8138cf7e492fbd8e9573",
      "force": true
    },
    "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph",
    "channel": "stable-4.0"
  },
  "status": {
    "desired": {
      "version": "",
      "image": "registry.svc.ci.openshift.org/ci-op-c60fjs69/release@sha256:2a927947eac3e08e6b154d84b2b0f678b38087dbed6e8138cf7e492fbd8e9573",
      "force": false
    },
    "history": [
      {
        "state": "Partial",
        "startedTime": "2019-04-25T15:37:14Z",
        "completionTime": null,
        "version": "",
        "image": "registry.svc.ci.openshift.org/ci-op-c60fjs69/release@sha256:2a927947eac3e08e6b154d84b2b0f678b38087dbed6e8138cf7e492fbd8e9573",
        "verified": false
      },
      {
        "state": "Completed",
        "startedTime": "2019-04-25T15:21:08Z",
        "completionTime": "2019-04-25T15:37:14Z",
        "version": "0.0.1-2019-04-25-150512",
        "image": "registry.svc.ci.openshift.org/ci-op-c60fjs69/release@sha256:2ad201f2bdba0ca66750d43afe7bdbefa236ebf70fc0a659d7d9490be2ece946",
        "verified": false
      }
    ],
    "observedGeneration": 0,
    "versionHash": "3f2ucK9TMPg=",
    "conditions": [
      {
        "type": "Available",
        "status": "True",
        "lastTransitionTime": "2019-04-25T15:34:47Z",
        "message": "Done applying 0.0.1-2019-04-25-150512"
      },
      {
        "type": "Failing",
        "status": "False",
        "lastTransitionTime": "2019-04-25T15:25:40Z"
      },
      {
        "type": "Progressing",
        "status": "True",
        "lastTransitionTime": "2019-04-25T15:37:14Z",
        "reason": "DownloadingUpdate",
        "message": "Working towards registry.svc.ci.openshift.org/ci-op-c60fjs69/release@sha256:2a927947eac3e08e6b154d84b2b0f678b38087dbed6e8138cf7e492fbd8e9573: downloading update"
      },
      {
        "type": "RetrievedUpdates",
        "status": "False",
        "lastTransitionTime": "2019-04-25T15:21:08Z",
        "reason": "RemoteFailed",
        "message": "Unable to retrieve available updates: currently installed version 0.0.1-2019-04-25-150512 not found in the \"stable-4.0\" channel"
      }
    ],
    "availableUpdates": null
  }
}
```

The CVO did update .status.desired to the new image, but .status.observedGeneration is still 0 while .metadata.generation is 2.
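The mismatch in the dump above can be made explicit with a small check. This is only an illustrative sketch: the helper name `ack_state` is made up for this example, and the sample object is a trimmed stand-in with a placeholder image reference, not the real CI payload.

```python
from typing import Any, Mapping


def ack_state(cv: Mapping[str, Any]) -> dict:
    """Summarise whether a ClusterVersion dict shows the update as acknowledged."""
    spec_update = cv.get("spec", {}).get("desiredUpdate") or {}
    status = cv.get("status", {})
    spec_image = spec_update.get("image")
    status_image = (status.get("desired") or {}).get("image")
    return {
        # The status already points at the image requested in the spec.
        "image_picked_up": spec_image is not None and spec_image == status_image,
        # The operator has recorded that it saw the latest spec generation.
        "generation_acked": status.get("observedGeneration", 0)
        >= cv["metadata"]["generation"],
    }


# Trimmed stand-in for the dump above; the image reference is a placeholder.
sample = {
    "metadata": {"generation": 2},
    "spec": {"desiredUpdate": {"image": "example.invalid/release@sha256:placeholder"}},
    "status": {
        "desired": {"image": "example.invalid/release@sha256:placeholder"},
        "observedGeneration": 0,
    },
}

print(ack_state(sample))
```

For the sample, the image has been picked up but the generation has not been acknowledged, which is the state a test waiting on observedGeneration would time out on.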

Comment 2 Clayton Coleman 2019-04-25 16:46:06 UTC
This accounts for 20% of upgrade CI failures.

Comment 3 W. Trevor King 2019-04-25 17:32:27 UTC
Mitigated via [1] while we wait for a fix.

[1]: https://github.com/openshift/origin/pull/22670

Comment 4 Brenton Leanhardt 2019-04-25 17:34:09 UTC
*** Bug 1703140 has been marked as a duplicate of this bug. ***

Comment 5 Clayton Coleman 2019-04-29 01:04:08 UTC
Fixed in https://github.com/openshift/cluster-version-operator/pull/176

Comment 6 Gaoyun Pei 2019-05-08 11:38:56 UTC
I checked the error log at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/314/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade/20/build-log.txt
It seems the error was thrown by the e2e testing framework, so I checked some subsequent e2e-aws-upgrade CI runs and did not find the error again. For example:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/183/pull-ci-openshift-cluster-version-operator-master-e2e-aws-upgrade/53
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/182/pull-ci-openshift-cluster-version-operator-master-e2e-aws-upgrade/59

Moving this bug to VERIFIED since the proposed PR has already merged. Feel free to leave comments here if there is a better way for QE to verify this issue, thanks.

Comment 8 errata-xmlrpc 2019-06-04 10:48:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758