Bug 1689146
| Summary: | [cloud] Never see Progressing=True in upgrade for clusteroperator cluster-autoscaler | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | sunzhaohua <zhsun> |
| Component: | Cloud Compute | Assignee: | Brad Ison <brad.ison> |
| Status: | CLOSED ERRATA | QA Contact: | sunzhaohua <zhsun> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | aos-cloud, brad.ison, decarr, jchaloup, mgugino, xtian |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:45:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
sunzhaohua
2019-03-15 09:45:42 UTC
Is the cluster autoscaler deployed in this environment? If not, you're never going to see it go Progressing, because there is no operand to wait on.

I deployed the autoscaler, then upgraded to 4.0.0-0.nightly-2019-03-14-135819. Maybe it happens so fast that I couldn't see it.

$ oc get clusterautoscaler
NAME      AGE
default   117s

$ oc get machineautoscaler
NAME          REF KIND     REF NAME                              MIN   MAX   AGE
autoscale-a   MachineSet   zhsun1-rxgd4-worker-ap-northeast-1a   1     3     18s
autoscale-c   MachineSet   zhsun1-rxgd4-worker-ap-northeast-1c   1     3     29s
autoscale-d   MachineSet   zhsun1-rxgd4-worker-ap-northeast-1d   1     3     49s

$ oc get pod
NAME                                              READY   STATUS    RESTARTS   AGE
cluster-autoscaler-default-774f5b4c7-plwdb        1/1     Running   0          2m6s
cluster-autoscaler-operator-df46df49b-slgmv       1/1     Running   1          23m
clusterapi-manager-controllers-7fb5fcdb87-2b2bs   4/4     Running   0          22m
machine-api-operator-6997c457b8-pw2sn             1/1     Running   0          22m

$ oc adm upgrade --to 4.0.0-0.nightly-2019-03-14-135819
Updating to 4.0.0-0.nightly-2019-03-14-135819

sunzhaohua, based on your comment (https://bugzilla.redhat.com/show_bug.cgi?id=1689146#c3), are you saying you are no longer able to reproduce the issue?

Jan, no, I can reproduce it every time. I mean that I still couldn't see Progressing=True during the upgrade after I deployed the autoscaler.

Do you have logs from the cluster-autoscaler-operator pod?
We have a log statement, ```glog.Infof("Syncing to version %v", r.releaseVersion)```, which is emitted when we set status=Progressing.
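For context, the shape of that code path might look roughly like the following minimal sketch. This is not the actual cluster-autoscaler-operator code; the reconciler type and its fields are illustrative only. The idea is that Progressing=True is only reported while an operand exists and its version lags the operator's release version.

```go
// Minimal sketch (not the actual operator code) of the idea behind the
// "Syncing to version %v" log line: report Progressing=True only while the
// operand is present and its version lags the operator's release version.
package main

import "log"

// condition mirrors the shape of a ClusterOperator status condition.
type condition struct {
	Type   string // e.g. "Available", "Progressing"
	Status string // "True" or "False"
}

// reconciler holds the inputs the status sync would need; all fields are
// hypothetical names for illustration.
type reconciler struct {
	releaseVersion string // version the operator wants to roll out
	operandVersion string // version the running cluster-autoscaler reports
	operandExists  bool   // whether a cluster-autoscaler deployment exists
}

// syncStatus decides which conditions to report for the clusteroperator.
func (r *reconciler) syncStatus() []condition {
	if r.operandExists && r.operandVersion != r.releaseVersion {
		// This is the point where a statement like the quoted glog.Infof
		// would fire during an upgrade.
		log.Printf("Syncing to version %v", r.releaseVersion)
		return []condition{
			{Type: "Available", Status: "True"},
			{Type: "Progressing", Status: "True"},
		}
	}
	return []condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "False"},
	}
}

func main() {
	r := &reconciler{
		releaseVersion: "4.0.0-0.nightly-2019-03-20-153904",
		operandVersion: "4.0.0-0.nightly-2019-03-19-004004",
		operandExists:  true,
	}
	log.Printf("conditions: %+v", r.syncStatus())
}
```

With that in mind, the logs below are what to look for: if the "Syncing to version" line never appears, the operator never entered the Progressing path.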
Before upgrade log:
$ oc logs -f cluster-autoscaler-operator-5c548c64b5-mfrvb
I0321 03:42:00.543450 1 main.go:14] Go Version: go1.10.8
I0321 03:42:00.544605 1 main.go:15] Go OS/Arch: linux/amd64
I0321 03:42:00.544626 1 main.go:16] Version: cluster-autoscaler-operator v4.0.22-201903161424-dirty
W0321 03:42:00.653609 1 machineautoscaler_controller.go:118] Removing support for unregistered target type: cluster.k8s.io/v1alpha1, Kind=MachineDeployment
W0321 03:42:00.654425 1 machineautoscaler_controller.go:118] Removing support for unregistered target type: cluster.k8s.io/v1alpha1, Kind=MachineSet
I0321 03:42:00.654809 1 main.go:30] Starting cluster-autoscaler-operator
I0321 03:42:00.654961 1 leaderelection.go:205] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader...
I0321 03:42:00.668584 1 status.go:136] Setting operator to available
I0321 03:42:00.668690 1 status.go:97] Setting operator version to: 4.0.0-0.nightly-2019-03-19-004004
I0321 03:42:00.679447 1 status.go:109] operator status not current; Updating operator
I0321 03:42:00.686451 1 leaderelection.go:214] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
...
I0321 05:35:52.138271 1 machineautoscaler_controller.go:153] Reconciling MachineAutoscaler openshift-machine-api/autoscale-a
I0321 05:35:52.163000 1 machineautoscaler_controller.go:153] Reconciling MachineAutoscaler openshift-machine-api/autoscale-a
I0321 05:35:52.167299 1 machineautoscaler_controller.go:153] Reconciling MachineAutoscaler openshift-machine-api/autoscale-a
I0321 05:36:08.554285 1 machineautoscaler_controller.go:153] Reconciling MachineAutoscaler openshift-machine-api/autoscale-b
I0321 05:36:08.569716 1 machineautoscaler_controller.go:153] Reconciling MachineAutoscaler openshift-machine-api/autoscale-b
I0321 05:36:42.031018 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 05:36:42.031056 1 clusterautoscaler_controller.go:216] Creating cluster-autoscaler deployment openshift-machine-api/default
I0321 05:36:42.052843 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 05:36:42.075085 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 05:36:42.086717 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 05:36:42.106773 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 05:37:06.969623 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
E0321 05:43:15.878000 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=13369, ErrCode=NO_ERROR, debug=""
E0321 05:43:15.878776 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=13369, ErrCode=NO_ERROR, debug=""
E0321 05:43:15.879257 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=13369, ErrCode=NO_ERROR, debug=""
E0321 05:43:15.879359 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=13369, ErrCode=NO_ERROR, debug=""
E0321 05:43:15.879261 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=13369, ErrCode=NO_ERROR, debug=""
W0321 05:43:16.163743 1 reflector.go:270] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: watch of *unstructured.Unstructured ended with: unexpected object: &{map[message:too old resource version: 15769 (77787) reason:Gone code:410 kind:Status apiVersion:v1 metadata:map[] status:Failure]}
W0321 05:43:16.364825 1 reflector.go:270] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: watch of *unstructured.Unstructured ended with: unexpected object: &{map[status:Failure message:too old resource version: 77224 (77787) reason:Gone code:410 kind:Status apiVersion:v1 metadata:map[]]}
W0321 05:43:16.498326 1 reflector.go:270] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: watch of *v1alpha1.ClusterAutoscaler ended with: too old resource version: 77534 (81743)
I0321 05:43:17.369594 1 machineautoscaler_controller.go:153] Reconciling MachineAutoscaler openshift-machine-api/autoscale-a
I0321 05:43:17.374819 1 machineautoscaler_controller.go:153] Reconciling MachineAutoscaler openshift-machine-api/autoscale-b
I0321 05:43:17.505730 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
error: unexpected EOF
After upgrade log:
$ oc logs -f cluster-autoscaler-operator-5d866c497-jhzrx
I0321 06:05:56.171001 1 main.go:14] Go Version: go1.10.8
I0321 06:05:56.171406 1 main.go:15] Go OS/Arch: linux/amd64
I0321 06:05:56.171421 1 main.go:16] Version: cluster-autoscaler-operator v4.0.22-201903161424-dirty
W0321 06:05:56.422828 1 machineautoscaler_controller.go:118] Removing support for unregistered target type: cluster.k8s.io/v1alpha1, Kind=MachineDeployment
W0321 06:05:56.423339 1 machineautoscaler_controller.go:118] Removing support for unregistered target type: cluster.k8s.io/v1alpha1, Kind=MachineSet
I0321 06:05:56.423705 1 main.go:30] Starting cluster-autoscaler-operator
I0321 06:05:56.423830 1 leaderelection.go:205] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader...
I0321 06:05:56.440400 1 status.go:136] Setting operator to available
I0321 06:05:56.440432 1 status.go:97] Setting operator version to: 4.0.0-0.nightly-2019-03-20-153904
I0321 06:05:56.445727 1 status.go:109] operator status not current; Updating operator
I0321 06:06:42.484309 1 leaderelection.go:214] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
I0321 06:06:42.685102 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 06:06:42.687792 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 06:06:42.685102 1 machineautoscaler_controller.go:153] Reconciling MachineAutoscaler openshift-machine-api/autoscale-a
I0321 06:06:42.695697 1 machineautoscaler_controller.go:153] Reconciling MachineAutoscaler openshift-machine-api/autoscale-b
I0321 06:06:42.712879 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 06:06:42.739058 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 06:06:42.750482 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 06:06:42.792135 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 06:07:08.852923 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
I0321 06:07:08.860783 1 clusterautoscaler_controller.go:122] Reconciling ClusterAutoscaler default
It doesn't appear a clusterautoscaler deployment was ever created. A ClusterAutoscaler CR is not created by default; it has to be created by the user (or some other automation), which triggers the cluster-autoscaler deployment (see the sketch at the end of this report). Without a deployment present, there is nothing for us to upgrade, so we don't report Progressing.

(In reply to Michael Gugino from comment #8) Disregard this ^^. It looks like your install/upgrade is using an old version of cluster-autoscaler-operator that does not have the latest code for enabling status=Progressing. I think this should be fixed by https://github.com/openshift/cluster-autoscaler-operator/pull/79

Verified. During the upgrade from 4.0.0-0.9 to 4.0.0-0.10 we can see:

$ oc get clusteroperator
NAME                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
authentication       4.0.0-0.10   True        False         False     3s
cloud-credential     4.0.0-0.10   True        False         False     3h56m
cluster-autoscaler   4.0.0-0.10   True        True          False     3h57m

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
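As a rough illustration of the point above about the ClusterAutoscaler CR, creating it programmatically with the dynamic client might look like the sketch below. This is not taken from the operator or its documentation: the autoscaling.openshift.io/v1 group/version and the empty spec are assumptions that may differ between releases, and in practice most users would simply `oc apply` an equivalent YAML manifest.

```go
// Sketch: create the cluster-scoped ClusterAutoscaler "default" resource,
// which causes the operator to create the cluster-autoscaler deployment.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the default kubeconfig location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed group/version/resource for the ClusterAutoscaler CRD;
	// verify against your cluster (e.g. `oc api-resources | grep autoscaling`).
	gvr := schema.GroupVersionResource{
		Group:    "autoscaling.openshift.io",
		Version:  "v1",
		Resource: "clusterautoscalers",
	}

	ca := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "autoscaling.openshift.io/v1",
		"kind":       "ClusterAutoscaler",
		"metadata":   map[string]interface{}{"name": "default"},
		"spec":       map[string]interface{}{}, // defaults only, for illustration
	}}

	// ClusterAutoscaler is cluster-scoped, so no namespace is set.
	if _, err := client.Resource(gvr).Create(context.TODO(), ca, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("created ClusterAutoscaler/default")
}
```

Once this resource exists, the operator creates the cluster-autoscaler deployment, and that operand's version bump during an upgrade is what drives Progressing=True on the cluster-autoscaler clusteroperator.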