1779741 – Cluster-autoscaler stuck on update, doesn't report status

Summary: Cluster-autoscaler stuck on update, doesn't report status

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.3.0
Assignee:	Brad Ison
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:	1779640
Blocks:	1779743

Reported:	2019-12-04 15:50 UTC by Brad Ison
Modified:	2020-01-23 11:18 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1779640
Environment:
Last Closed:	2020-01-23 11:17:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-autoscaler-operator pull 125	0	None	open	bug 1779741: Don't suppress errors when reporting operator status	2019-12-05 11:34:42 UTC
Red Hat Product Errata	RHBA-2020:0062	0	None	None	None	2020-01-23 11:18:04 UTC

Description Brad Ison 2019-12-04 15:50:16 UTC

+++ This bug was initially created as a clone of Bug #1779640 +++

Description of problem:

4.3 nightly -> 4.3 nightly update failed:
`failed to initialize the cluster: Cluster operator cluster-autoscaler is still updating`
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11999

Clusteroperators list (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11999/artifacts/e2e-aws-upgrade/clusteroperators.json) shows it's empty (?):

...
        {
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-12-04T00:20:33Z",
                "generation": 1,
                "name": "cluster-autoscaler",
                "resourceVersion": "11333",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/cluster-autoscaler",
                "uid": "ed891617-6cf2-4c78-9c0e-54d2e86af724"
            },
            "spec": {}
        },
...
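
One way to spot this condition in a clusteroperators.json artifact is to look for items that carry no status conditions at all. The following is a minimal sketch, assuming the artifact is a standard list object with an "items" array; the function name operators_without_status and the inline sample (which adds a made-up healthy operator for contrast) are hypothetical.

```python
import json


def operators_without_status(cluster_operators):
    """Return names of ClusterOperator items that report no status conditions.

    Assumes `cluster_operators` is a parsed list object with an "items" array.
    """
    missing = []
    for item in cluster_operators.get("items", []):
        # A missing "status" key and an empty conditions list are treated alike.
        conditions = (item.get("status") or {}).get("conditions") or []
        if not conditions:
            missing.append(item["metadata"]["name"])
    return missing


# Hypothetical sample: the cluster-autoscaler entry is trimmed from the excerpt
# above; "example-operator" is invented to show a healthy entry.
SAMPLE = json.loads("""
{
  "items": [
    {"kind": "ClusterOperator",
     "metadata": {"name": "cluster-autoscaler"},
     "spec": {}},
    {"kind": "ClusterOperator",
     "metadata": {"name": "example-operator"},
     "spec": {},
     "status": {"conditions": [{"type": "Available", "status": "True"}]}}
  ]
}
""")

if __name__ == "__main__":
    print(operators_without_status(SAMPLE))
```

An operator listed by this check has never successfully written its status, which is what the CVO reports as "still updating".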


Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-12-03-211441 -> 4.3.0-0.nightly-2019-12-03-234445

How reproducible:



Additional info:

--- Additional comment from Brad Ison on 2019-12-04 15:49:10 UTC ---

The underlying issue here is that etcd was under load and taking multiple seconds to sync its log, which was causing leader elections and, I think, causing some API writes to fail.

In addition, the cluster-autoscaler-operator was not reporting failures to apply updates to its ClusterOperator resource and, worse, was not retrying when it failed to apply an "Available" status. So the CVO was unaware of its success. The linked PR fixes that, and I'll make sure it's backported to previous releases.
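
The intended behaviour (surface the error, and retry the status write until it lands) can be sketched roughly as follows. This is not the operator's actual code, which is in the linked PR; apply_status, StatusWriteError, FlakyWriter, and the retry parameters are hypothetical names chosen for illustration.

```python
import time


class StatusWriteError(Exception):
    """Raised when a status update could not be applied after all retries."""


def apply_status(write, status, attempts=5, delay=0.0):
    """Call write(status) until it succeeds or `attempts` is exhausted.

    Returns the number of attempts used. On exhaustion the last error is
    re-raised (wrapped) rather than swallowed, so the caller can requeue.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            write(status)
            return attempt
        except Exception as err:  # sketch only: a real client would catch narrower errors
            last_error = err
            if attempt < attempts:
                time.sleep(delay)
    raise StatusWriteError(
        f"failed to apply {status!r} after {attempts} attempts"
    ) from last_error


class FlakyWriter:
    """Hypothetical stand-in for an API server that rejects the first N writes."""

    def __init__(self, failures):
        self.failures = failures
        self.applied = []

    def __call__(self, status):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("simulated API write timeout")
        self.applied.append(status)


if __name__ == "__main__":
    writer = FlakyWriter(failures=2)
    print(apply_status(writer, "Available"))
```

The key point is that a failed write either gets retried or ends in a raised error; it never returns silently as if the status had been applied.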

Comment 2 Jianwei Hou 2019-12-10 10:02:40 UTC

Verified in 4.3.0-0.nightly-2019-12-10-014919

Cluster-autoscaler operator now reports status

oc get co cluster-autoscaler -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-12-10T03:39:00Z"
  generation: 1
  name: cluster-autoscaler
  resourceVersion: "96746"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/cluster-autoscaler
  uid: 03097927-57d8-489e-b3db-21c3b2b138b1
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-10T03:39:00Z"
    message: at version 4.3.0-0.nightly-2019-12-10-014919
    status: "True"
    type: Available
  - lastTransitionTime: "2019-12-10T08:14:10Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-12-10T03:39:00Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-12-10T03:39:00Z"
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: machineautoscalers
  - group: machine.openshift.io
    name: ""
    namespace: openshift-machine-api
    resource: clusterautoscalers
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  versions:
  - name: operator
    version: 4.3.0-0.nightly-2019-12-10-014919

Comment 4 errata-xmlrpc 2020-01-23 11:17:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062
