Bug 1779745

Summary:	Cluster-autoscaler stuck on update, doesn't report status
Product:	OpenShift Container Platform	Reporter:	Brad Ison <brad.ison>
Component:	Cloud Compute	Assignee:	Brad Ison <brad.ison>
Status:	CLOSED ERRATA	QA Contact:	Jianwei Hou <jhou>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.1.z	CC:	brad.ison, jhou, piqin, vrutkovs
Target Milestone:	---
Target Release:	4.1.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1779640	Environment:
Last Closed:	2020-01-09 09:16:20 UTC	Type:	---
Bug Depends On:	1779743
Bug Blocks:

Description Brad Ison 2019-12-04 15:54:40 UTC

+++ This bug was initially created as a clone of Bug #1779640 +++

Description of problem:

4.3 nightly -> 4.3 nightly update failed:
`failed to initialize the cluster: Cluster operator cluster-autoscaler is still updating`
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11999

The clusteroperators list (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11999/artifacts/e2e-aws-upgrade/clusteroperators.json) shows that its entry is empty, with no status (?):

...
        {
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-12-04T00:20:33Z",
                "generation": 1,
                "name": "cluster-autoscaler",
                "resourceVersion": "11333",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/cluster-autoscaler",
                "uid": "ed891617-6cf2-4c78-9c0e-54d2e86af724"
            },
            "spec": {}
        },
...
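
A quick way to spot entries like the one above in a clusteroperators.json artifact is to list every ClusterOperator that carries no status conditions. The following is only a rough sketch: it assumes the file is a list object with a top-level "items" array, the function name operators_without_status is made up for illustration, and the second entry ("example-operator") is an invented healthy operator added for contrast.

```python
import json


def operators_without_status(doc):
    """Return the names of ClusterOperator items that report no status conditions."""
    missing = []
    for item in doc.get("items", []):
        if item.get("kind") != "ClusterOperator":
            continue
        # A missing "status" key and an empty "conditions" list are treated alike.
        conditions = (item.get("status") or {}).get("conditions") or []
        if not conditions:
            missing.append(item["metadata"]["name"])
    return missing


# Assumed top-level shape: {"items": [...]}.
# The first entry is trimmed from the report; the second is hypothetical.
sample = json.loads("""
{
  "items": [
    {
      "apiVersion": "config.openshift.io/v1",
      "kind": "ClusterOperator",
      "metadata": {"name": "cluster-autoscaler"},
      "spec": {}
    },
    {
      "apiVersion": "config.openshift.io/v1",
      "kind": "ClusterOperator",
      "metadata": {"name": "example-operator"},
      "spec": {},
      "status": {"conditions": [{"type": "Available", "status": "True"}]}
    }
  ]
}
""")

print(operators_without_status(sample))
```

An operator that shows up in this list has never managed to write its status, which is what the CVO was waiting on here.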


Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-12-03-211441 -> 4.3.0-0.nightly-2019-12-03-234445

How reproducible:



Additional info:

--- Additional comment from Brad Ison on 2019-12-04 15:49:10 UTC ---

The underlying issue here is that etcd was under load and taking multiple seconds to sync its log, which was causing leader elections and, I think, causing some API writes to fail.

In addition, the cluster-autoscaler-operator was not reporting failures to apply updates to its ClusterOperator resource, and worse, was not retrying when it failed to apply an "Available" status. So the CVO was unaware of its success. The linked PR fixes that, and I'll make sure it's backported to previous releases.
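
To illustrate the general idea of "report the failure and retry" (not the actual change in the linked PR, and not the operator's real code, which is not shown in this report), here is a minimal sketch. The names apply_status_with_retry and flaky_apply are hypothetical, and the flaky function only simulates a write that fails twice before succeeding.

```python
import time


def apply_status_with_retry(apply_status, attempts=5, initial_delay=0.5, sleep=time.sleep):
    """Call apply_status until it succeeds, logging each failure.

    Returns the number of attempts used. Raises RuntimeError if every
    attempt fails, so the caller cannot mistake a failure for success.
    """
    delay = initial_delay
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            apply_status()
            return attempt
        except Exception as err:
            last_error = err
            # Report the failure instead of silently dropping it.
            print(f"status update failed (attempt {attempt}/{attempts}): {err}")
            if attempt < attempts:
                sleep(delay)
                delay *= 2  # simple exponential backoff
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error


# Hypothetical stand-in for an API write that fails while etcd is overloaded.
calls = {"n": 0}


def flaky_apply():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated API write timeout")


# Sleep is stubbed out so the example runs instantly.
attempts_used = apply_status_with_retry(flaky_apply, sleep=lambda seconds: None)
print(f"status applied after {attempts_used} attempts")
```

The point of the sketch is only that a failed "Available" write is both logged and retried, so a transient API error does not leave the ClusterOperator without a status.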

Comment 2 Qin Ping 2019-12-24 05:24:38 UTC

Verified in 4.1.0-0.nightly-2019-12-23-102617

$ oc get co cluster-autoscaler -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-12-23T07:50:08Z"
  generation: 1
  name: cluster-autoscaler
  resourceVersion: "378275"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/cluster-autoscaler
  uid: d6ed856a-2558-11ea-ac20-0aeeb9ddd54e
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-23T07:50:08Z"
    message: at version 4.1.0-0.nightly-2019-12-23-102617
    status: "True"
    type: Available
  - lastTransitionTime: "2019-12-24T03:25:03Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-12-23T07:50:08Z"
    status: "False"
    type: Degraded
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  versions:
  - name: operator
    version: 4.1.0-0.nightly-2019-12-23-102617
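
The check being made by eye in the output above (Available is "True", Progressing and Degraded are "False") can be written down as a small sketch. The helper name is_settled is made up for illustration, and the conditions list is copied from the output above with only the type and status fields kept.

```python
def is_settled(conditions):
    """Return True if the operator is Available and neither Progressing nor Degraded."""
    by_type = {c["type"]: c["status"] for c in conditions}
    return (
        by_type.get("Available") == "True"
        and by_type.get("Progressing") == "False"
        and by_type.get("Degraded") == "False"
    )


# Conditions from the verified cluster, reduced to type and status.
verified_conditions = [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "False"},
    {"type": "Degraded", "status": "False"},
]

print(is_settled(verified_conditions))
print(is_settled([]))  # an operator with no conditions, as in the original failure
```

Note that condition statuses are strings ("True"/"False"), not booleans, which is why the comparison is against quoted values.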

Comment 4 errata-xmlrpc 2020-01-09 09:16:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0010