Bug 1779640

Summary: Cluster-autoscaler stuck on update, doesn't report status
Product: OpenShift Container Platform
Reporter: Vadim Rutkovsky <vrutkovs>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers
QA Contact: Jianwei Hou <jhou>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: brad.ison, vlaad, wking
Version: 4.4
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clones: 1779741 1779743 1779745
Last Closed: 2020-05-15 15:45:24 UTC
Type: Bug
Bug Blocks: 1779741

Description Vadim Rutkovsky 2019-12-04 12:27:06 UTC
Description of problem:

4.3 nightly -> 4.3 nightly update failed:
`failed to initialize the cluster: Cluster operator cluster-autoscaler is still updating`
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11999

The ClusterOperators list (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11999/artifacts/e2e-aws-upgrade/clusteroperators.json) shows that its entry is empty, with no status (?):

...
        {
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-12-04T00:20:33Z",
                "generation": 1,
                "name": "cluster-autoscaler",
                "resourceVersion": "11333",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/cluster-autoscaler",
                "uid": "ed891617-6cf2-4c78-9c0e-54d2e86af724"
            },
            "spec": {}
        },
...
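
The manual check above can be scripted. This is only a minimal sketch: it assumes the artifact is a standard Kubernetes list object with an `items` array (the excerpt above is elided, so that layout is an assumption), and the function name `operators_without_status` and the second sample entry are hypothetical.

```python
import json


def operators_without_status(doc):
    """Return names of ClusterOperators that report no status conditions.

    Assumes `doc` is a parsed Kubernetes list object whose `items` array
    holds ClusterOperator resources (an assumption about the artifact layout).
    """
    missing = []
    for item in doc.get("items", []):
        conditions = (item.get("status") or {}).get("conditions") or []
        if not conditions:
            missing.append(item["metadata"]["name"])
    return missing


# Sample trimmed to the fields used here; "example-operator" is hypothetical.
sample = json.loads("""
{
  "items": [
    {"kind": "ClusterOperator",
     "metadata": {"name": "cluster-autoscaler"},
     "spec": {}},
    {"kind": "ClusterOperator",
     "metadata": {"name": "example-operator"},
     "spec": {},
     "status": {"conditions": [{"type": "Available", "status": "True"}]}}
  ]
}
""")

print(operators_without_status(sample))
```

Run against the full clusteroperators.json, a function like this would list every operator in the same state as cluster-autoscaler here, which helps tell a single-operator bug from a cluster-wide API problem.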


Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-12-03-211441 -> 4.3.0-0.nightly-2019-12-03-234445

How reproducible:



Additional info:

Comment 1 Brad Ison 2019-12-04 15:49:10 UTC
The underlying issue here is that etcd was under load and taking multiple seconds to sync its log, which was causing leader elections and, I think, some API writes to fail.

In addition, the cluster-autoscaler-operator was not reporting failures to apply updates to its ClusterOperator resource and, worse, was not retrying when it failed to apply an "Available" status. As a result, the CVO was unaware that the operator had succeeded. The linked PR fixes that, and I'll make sure it's backported to previous releases.
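
To illustrate the behavior described above, here is a minimal sketch of retrying a status write and surfacing the failure instead of dropping it. It is written in Python rather than taken from the operator's code, it is not the change made in the linked PR, and `apply_status_with_retry`, `flaky_apply`, the error message, and the retry parameters are all hypothetical.

```python
import time


def apply_status_with_retry(apply_status, status, attempts=5, delay=0.0):
    """Call apply_status(status) until it succeeds or attempts run out.

    Returns the number of attempts used. Raises RuntimeError (chained to the
    last underlying error) so the caller can report the failure.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            apply_status(status)
            return attempt
        except Exception as err:  # a real client would catch narrower errors
            last_error = err
            time.sleep(delay * attempt)  # simple linear backoff
    raise RuntimeError(
        f"status not applied after {attempts} attempts"
    ) from last_error


# Hypothetical stand-in for an API write that fails twice (for example
# during an etcd leader election) and then succeeds.
calls = {"n": 0}
applied = []


def flaky_apply(status):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TimeoutError("request timed out")
    applied.append(status)


used = apply_status_with_retry(
    flaky_apply, {"type": "Available", "status": "True"}
)
print(used, applied)
```

The point of the sketch is the two properties the comment says were missing: a failed write is attempted again, and a write that never succeeds ends in an error the caller can see.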

Comment 2 W. Trevor King 2019-12-04 18:14:43 UTC
> The underlying issue here is that etcd was under load and taking multiple seconds to sync its log, which was causing leader elections and, I think, some API writes to fail.

General tracker for this portion is bug 1775878.

Comment 4 Jianwei Hou 2019-12-20 03:31:10 UTC
Verified in 4.4.0-0.nightly-2019-12-19-223334.

oc get co cluster-autoscaler -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-12-20T03:11:49Z"
  generation: 1
  name: cluster-autoscaler
  resourceVersion: "9771"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/cluster-autoscaler
  uid: 99dba483-4ca7-4f50-af40-6ceeddfd0143
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-20T03:11:49Z"
    message: at version 4.4.0-0.nightly-2019-12-19-223334
    status: "True"
    type: Available
  - lastTransitionTime: "2019-12-20T03:11:49Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-12-20T03:11:49Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-12-20T03:11:49Z"
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machineautoscalers
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: clusterautoscalers
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  versions:
  - name: operator
    version: 4.4.0-0.nightly-2019-12-19-223334
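
The conditions in the output above can also be checked programmatically. This is a minimal sketch, assuming that "settled" means Available is "True" while Progressing and Degraded are "False"; that criterion and the name `is_settled` are illustrative assumptions, not necessarily the exact rule the CVO applies.

```python
def is_settled(conditions):
    """True when Available is "True" and Progressing and Degraded are "False".

    `conditions` is the list found under status.conditions; a missing
    condition counts as not settled.
    """
    by_type = {c["type"]: c["status"] for c in conditions}
    return (
        by_type.get("Available") == "True"
        and by_type.get("Progressing") == "False"
        and by_type.get("Degraded") == "False"
    )


# Conditions copied from the verification output above (type and status only).
verified = [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "False"},
    {"type": "Degraded", "status": "False"},
    {"type": "Upgradeable", "status": "True"},
]

print(is_settled(verified))
# An empty list models the originally reported state, where no status
# was present at all.
print(is_settled([]))
```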