Bug 1664942 - [cloud-CA] autoscaler couldn't scale up
Summary: [cloud-CA] autoscaler couldn't scale up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Jan Chaloupka
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-10 06:19 UTC by sunzhaohua
Modified: 2019-06-04 10:41 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:41:42 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:0758 (last updated 2019-06-04 10:41:48 UTC)

Description sunzhaohua 2019-01-10 06:19:05 UTC
Description of problem:
Autoscaler couldn't scale up

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.alpha-2019-01-10-030010   True        False         24m       Cluster version is 4.0.0-0.alpha-2019-01-10-030010

How reproducible:
Always

Steps to Reproduce:
1. Create clusterautoscaler resource
$ oc get clusterautoscaler -o yaml
apiVersion: "autoscaling.openshift.io/v1alpha1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s

2. Create machineautoscaler resource
$ oc get machineautoscaler worker-us-east-2a -o yaml
apiVersion: "autoscaling.openshift.io/v1alpha1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-us-east-2a"
  namespace: "openshift-cluster-api"
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: cluster.k8s.io/v1alpha1
    kind: MachineSet
    name: zhsun-worker-us-east-2a


3. Create pods to scale up the cluster
$ oc get deploy scale-up -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 20
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: busybox
        image: docker.io/library/busybox
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0
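
Each replica requests 2Gi of memory, so 20 replicas should exceed what the existing workers can hold and leave some pods Pending, which is what should trigger a scale-up. A quick sketch to confirm (the label selector comes from the deployment above):

$ oc get pods -l app=scale-up --field-selector=status.phase=Pending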

4. Check machines, nodes, and logs

Actual results:
The cluster couldn't scale up.


$ oc get machine
NAME                            AGE
zhsun-master-0                  39m
zhsun-master-1                  39m
zhsun-master-2                  39m
zhsun-worker-us-east-2a-8cj64   38m
zhsun-worker-us-east-2b-6mqq8   38m
zhsun-worker-us-east-2b-jxc2k   18m
zhsun-worker-us-east-2b-xf2qd   18m
zhsun-worker-us-east-2c-b7gxz   18m
zhsun-worker-us-east-2c-cdq4q   18m
zhsun-worker-us-east-2c-lv2f7   38m


$ oc get node
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-138-114.us-east-2.compute.internal   Ready     worker    37m       v1.11.0+f67f40dbad
ip-10-0-147-190.us-east-2.compute.internal   Ready     worker    37m       v1.11.0+f67f40dbad
ip-10-0-169-58.us-east-2.compute.internal    Ready     worker    37m       v1.11.0+f67f40dbad
ip-10-0-22-114.us-east-2.compute.internal    Ready     master    41m       v1.11.0+f67f40dbad
ip-10-0-37-96.us-east-2.compute.internal     Ready     master    41m       v1.11.0+f67f40dbad
ip-10-0-9-142.us-east-2.compute.internal     Ready     master    41m       v1.11.0+f67f40dbad


$ oc logs -f cluster-autoscaler-default-56c9cd4b6d-cvz84
I0110 04:44:48.271569       1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2b size to 3
I0110 04:44:58.422215       1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2c size to 3
W0110 04:59:50.716321       1 clusterstate.go:201] Scale-up timed out for node group openshift-cluster-api/zhsun-worker-us-east-2b after 15m2.424890837s
W0110 04:59:50.721855       1 clusterstate.go:223] Disabling scale-up for node group openshift-cluster-api/zhsun-worker-us-east-2b until 2019-01-10 05:04:50.715266058 +0000 UTC m=+1685.106703017
W0110 04:59:50.784942       1 scale_up.go:327] Node group openshift-cluster-api/zhsun-worker-us-east-2b is not ready for scaleup - backoff
W0110 05:00:00.803310       1 clusterstate.go:201] Scale-up timed out for node group openshift-cluster-api/zhsun-worker-us-east-2c after 15m2.365596705s
W0110 05:00:00.803370       1 clusterstate.go:223] Disabling scale-up for node group openshift-cluster-api/zhsun-worker-us-east-2c until 2019-01-10 05:05:00.802366978 +0000 UTC m=+1695.193804101
W0110 05:00:00.865659       1 scale_up.go:327] Node group openshift-cluster-api/zhsun-worker-us-east-2b is not ready for scaleup - backoff
W0110 05:00:00.865692       1 scale_up.go:327] Node group openshift-cluster-api/zhsun-worker-us-east-2c is not ready for scaleup - backoff


$ oc logs -f clusterapi-manager-controllers-6f9cf4dd7c-lsz8f -c machine-controller
I0110 04:56:23.731044       1 utils.go:151] Falling to providerConfig
E0110 04:56:23.731054       1 actuator.go:384] error decoding MachineProviderConfig: unable to find machine provider config: neither Spec.ProviderConfig.Value nor Spec.ProviderConfig.ValueFrom set
E0110 04:56:23.731063       1 actuator.go:351] error getting running instances: unable to find machine provider config: neither Spec.ProviderConfig.Value nor Spec.ProviderConfig.ValueFrom set
E0110 04:56:23.731072       1 controller.go:166] Error checking existence of machine instance for machine object zhsun-worker-us-east-2c-cdq4q; unable to find machine provider config: neither Spec.ProviderConfig.Value nor Spec.ProviderConfig.ValueFrom set
I0110 04:56:24.731448       1 actuator.go:347] checking if machine exists

Machineset "providerSpec" disappeared.
$ oc edit machineset zhsun-worker-us-east-2b
spec:
  replicas: 3
  selector:
    matchLabels:
      sigs.k8s.io/cluster-api-cluster: zhsun
      sigs.k8s.io/cluster-api-machineset: zhsun-worker-us-east-2b
  template:
    metadata:
      creationTimestamp: null
      labels:
        sigs.k8s.io/cluster-api-cluster: zhsun
        sigs.k8s.io/cluster-api-machine-role: worker
        sigs.k8s.io/cluster-api-machine-type: worker
        sigs.k8s.io/cluster-api-machineset: zhsun-worker-us-east-2b
    spec:
      metadata:
        creationTimestamp: null
      providerConfig: {}
      versions:
        kubelet: ""
status:
  availableReplicas: 1
  fullyLabeledReplicas: 3
  observedGeneration: 2
  readyReplicas: 1
  replicas: 3


Expected results:
The cluster can scale up normally.


Additional info:
As soon as we create a machineautoscaler, the machineset's providerSpec field disappears.
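
One way to watch the field vanish (a sketch; the jsonpath assumes the providerSpec name used in this report, while the oc edit output above shows providerConfig, so the exact path may vary with the cluster-api version in use):

$ oc get machineset zhsun-worker-us-east-2b -n openshift-cluster-api -o jsonpath='{.spec.template.spec.providerSpec}'

Empty output right after the machineautoscaler is created would match the empty providerConfig: {} in the oc edit snippet above.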

Comment 1 Jan Chaloupka 2019-01-10 08:10:34 UTC
Hi sunzhaohua,

Can you share the machineset CRD definition? `kubectl get crd machinesets.cluster.k8s.io -o yaml` will do, to confirm whether the providerSpec field is defined or missing.
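
A quick sketch, if you just want to check for the field without reading through the whole YAML:

$ kubectl get crd machinesets.cluster.k8s.io -o yaml | grep -n providerSpec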

Thanks

Comment 2 sunzhaohua 2019-01-15 06:21:24 UTC
Verified.

In the new version, I couldn't reproduce this issue; the cluster can scale up and down normally.
If it reproduces, I will reopen the bug and check the CRD definition.

$ oc get clusterversion
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.alpha-2019-01-15-001217   True        False         2h        Cluster version is 4.0.0-0.alpha-2019-01-15-001217


$ oc get machine
NAME                            INSTANCE              STATE     TYPE       REGION      ZONE         AGE
zhsun-master-0                  i-080a7bb622af5dbf7   running   m4.large   us-east-2   us-east-2a   29m
zhsun-master-1                  i-086bb1037011b5e66   running   m4.large   us-east-2   us-east-2b   29m
zhsun-master-2                  i-044ba37ef3c01df49   running   m4.large   us-east-2   us-east-2c   29m
zhsun-worker-us-east-2a-5s7wd   i-0e37fa6f833672972   running   m4.large   us-east-2   us-east-2a   28m
zhsun-worker-us-east-2a-8lszv   i-019e3f765a1149f66   running   m4.large   us-east-2   us-east-2a   5m
zhsun-worker-us-east-2a-dsqsj   i-062f9d90e4e545117   running   m4.large   us-east-2   us-east-2a   5m
zhsun-worker-us-east-2a-gmgx2   i-027057005cf3c4263   running   m4.large   us-east-2   us-east-2a   5m
zhsun-worker-us-east-2a-z5drr   i-0af94036444067f47   running   m4.large   us-east-2   us-east-2a   5m
zhsun-worker-us-east-2b-z8wkp   i-096d82f8a0ad0050a   running   m4.large   us-east-2   us-east-2b   28m
zhsun-worker-us-east-2c-kfns2   i-0c1dc48eb2b1d2346   running   m4.large   us-east-2   us-east-2c   28m
 
$ oc get node
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-129-168.us-east-2.compute.internal   Ready     worker    27m       v1.11.0+c69f926354
ip-10-0-134-248.us-east-2.compute.internal   Ready     worker    4m        v1.11.0+c69f926354
ip-10-0-134-252.us-east-2.compute.internal   Ready     worker    4m        v1.11.0+c69f926354
ip-10-0-134-67.us-east-2.compute.internal    Ready     worker    5m        v1.11.0+c69f926354
ip-10-0-139-238.us-east-2.compute.internal   Ready     worker    4m        v1.11.0+c69f926354
ip-10-0-15-49.us-east-2.compute.internal     Ready     master    37m       v1.11.0+c69f926354
ip-10-0-151-196.us-east-2.compute.internal   Ready     worker    27m       v1.11.0+c69f926354
ip-10-0-171-213.us-east-2.compute.internal   Ready     worker    27m       v1.11.0+c69f926354
ip-10-0-20-128.us-east-2.compute.internal    Ready     master    37m       v1.11.0+c69f926354
ip-10-0-36-74.us-east-2.compute.internal     Ready     master    37m       v1.11.0+c69f926354


[szh@localhost installer]$ oc logs -f cluster-autoscaler-default-56c9cd4b6d-vt7d8
I0115 03:50:15.993741       1 leaderelection.go:187] attempting to acquire leader lease  openshift-cluster-api/cluster-autoscaler...
I0115 03:50:16.056782       1 leaderelection.go:196] successfully acquired lease openshift-cluster-api/cluster-autoscaler

I0115 03:51:50.058660       1 scale_up.go:584] Scale-up: setting group openshift-cluster-api/zhsun-worker-us-east-2a size to 5
I0115 04:08:07.807776       1 scale_down.go:791] Scale-down: removing empty node ip-10-0-134-67.us-east-2.compute.internal
I0115 04:08:07.807922       1 scale_down.go:791] Scale-down: removing empty node ip-10-0-139-238.us-east-2.compute.internal
I0115 04:08:07.807998       1 scale_down.go:791] Scale-down: removing empty node ip-10-0-134-252.us-east-2.compute.internal
I0115 04:08:07.809036       1 scale_down.go:791] Scale-down: removing empty node ip-10-0-134-248.us-east-2.compute.internal

Comment 5 errata-xmlrpc 2019-06-04 10:41:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

