Bug 1891551 - Clusterautoscaler doesn't scale up as expected
Summary: Clusterautoscaler doesn't scale up as expected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-26 15:38 UTC by aaleman
Modified: 2021-02-24 15:29 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster autoscaler would use a template for node scaling decisions in certain circumstances. This template includes only a subset of the information available on an actual node. Consequence: In some scenarios, the autoscaler would claim that adding new nodes would not allow pending pods to be scheduled. Fix: Ensure the node template includes as many standard labels as possible to increase the likelihood that the affinity checks pass. Result: The autoscaler is less likely to be unable to scale up when a pending pod uses node affinity with a standard label.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:28:28 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes-autoscaler pull 178 0 None closed Bug 1891551: Ensure the node template include up to date and informative labels 2021-02-19 20:13:46 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:29:04 UTC

Description aaleman 2020-10-26 15:38:21 UTC
Description of problem:

We have a cluster that has:
* Three machinesets, each with a machineautoscaler, each in a distinct AZ
* A PDB that may block draining for up to ~4.5 hours, so that our batch workloads are not interrupted

During a node upgrade, a pod could not be scheduled yet did not cause the autoscaler to scale up. The pod had a volume in an AZ where, at the time, there was only one node, and that node was in the process of being drained (which took a couple of hours because of the aforementioned PDB).

Pod spec (after it got scheduled because that one node finished draining):
```
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: alertmanager
              operator: In
              values:
              - main
          namespaces:
          - openshift-monitoring
          topologyKey: kubernetes.io/hostname
        weight: 100
  containers:
  - args:
    - --config.file=/etc/alertmanager/config/alertmanager.yaml
    - --storage.path=/alertmanager
    - --data.retention=120h
    - --cluster.listen-address=[$(POD_IP)]:9094
    - --web.listen-address=127.0.0.1:9093
    - --web.external-url=https://alertmanager-main-openshift-monitoring.apps.build02.gcp.ci.openshift.org/
    - --web.route-prefix=/
    - --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
    - --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
    - --cluster.peer=alertmanager-main-2.alertmanager-operated:9094
    - --cluster.reconnect-timeout=5m
    env:
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5bcf6d786fd218e1ef188eb40c39c31a98d03121fba3b7a1f16e87e45a7478b
    imagePullPolicy: IfNotPresent
    name: alertmanager
    ports:
    - containerPort: 9094
      name: mesh-tcp
      protocol: TCP
    - containerPort: 9094
      name: mesh-udp
      protocol: UDP
    resources:
      requests:
        cpu: 4m
        memory: 200Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/alertmanager/config
      name: config-volume
    - mountPath: /alertmanager
      name: alertmanager-main-db
      subPath: alertmanager-db
    - mountPath: /etc/alertmanager/secrets/alertmanager-main-tls
      name: secret-alertmanager-main-tls
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-main-proxy
      name: secret-alertmanager-main-proxy
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy
      name: secret-alertmanager-kube-rbac-proxy
      readOnly: true
    - mountPath: /etc/pki/ca-trust/extracted/pem/
      name: alertmanager-trusted-ca-bundle
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  - args:
    - --listen-address=localhost:8080
    - --reload-url=http://localhost:9093/-/reload
    - --watched-dir=/etc/alertmanager/config
    - --watched-dir=/etc/alertmanager/secrets/alertmanager-main-tls
    - --watched-dir=/etc/alertmanager/secrets/alertmanager-main-proxy
    - --watched-dir=/etc/alertmanager/secrets/alertmanager-kube-rbac-proxy
    command:
    - /bin/prometheus-config-reloader
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c9e61400619c4613db5cc73097d287e3cd5d2125c85d1d84cc30cfdaa1093e7
    imagePullPolicy: IfNotPresent
    name: config-reloader
    resources:
      requests:
        cpu: 1m
        memory: 10Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/alertmanager/config
      name: config-volume
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-main-tls
      name: secret-alertmanager-main-tls
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-main-proxy
      name: secret-alertmanager-main-proxy
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy
      name: secret-alertmanager-kube-rbac-proxy
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  - args:
    - -provider=openshift
    - -https-address=:9095
    - -http-address=
    - -email-domain=*
    - -upstream=http://localhost:9093
    - '-openshift-sar={"resource": "namespaces", "verb": "get"}'
    - '-openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}'
    - -tls-cert=/etc/tls/private/tls.crt
    - -tls-key=/etc/tls/private/tls.key
    - -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token
    - -cookie-secret-file=/etc/proxy/secrets/session_secret
    - -openshift-service-account=alertmanager-main
    - -openshift-ca=/etc/pki/tls/cert.pem
    - -openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    - -skip-auth-regex=^/metrics
    env:
    - name: HTTP_PROXY
    - name: HTTPS_PROXY
    - name: NO_PROXY
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:12b11e2000b42ce1aaa228d9c1f4c9177395add2fa43835e667b7fc9007e40e6
    imagePullPolicy: IfNotPresent
    name: alertmanager-proxy
    ports:
    - containerPort: 9095
      name: web
      protocol: TCP
    resources:
      requests:
        cpu: 1m
        memory: 20Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/tls/private
      name: secret-alertmanager-main-tls
    - mountPath: /etc/proxy/secrets
      name: secret-alertmanager-main-proxy
    - mountPath: /etc/pki/ca-trust/extracted/pem/
      name: alertmanager-trusted-ca-bundle
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  - args:
    - --secure-listen-address=0.0.0.0:9092
    - --upstream=http://127.0.0.1:9096
    - --config-file=/etc/kube-rbac-proxy/config.yaml
    - --tls-cert-file=/etc/tls/private/tls.crt
    - --tls-private-key-file=/etc/tls/private/tls.key
    - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
    - --logtostderr=true
    - --v=10
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c75977f28becdf4f7065cfa37233464dd31208b1767e620c4f19658f53f8ff8c
    imagePullPolicy: IfNotPresent
    name: kube-rbac-proxy
    ports:
    - containerPort: 9092
      name: tenancy
      protocol: TCP
    resources:
      requests:
        cpu: 1m
        memory: 20Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/kube-rbac-proxy
      name: secret-alertmanager-kube-rbac-proxy
    - mountPath: /etc/tls/private
      name: secret-alertmanager-main-tls
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  - args:
    - --insecure-listen-address=127.0.0.1:9096
    - --upstream=http://127.0.0.1:9093
    - --label=namespace
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b635694dc0663a1404d43a4a9ac8513a3087c7cfead50f6ab413f3c217c40b2a
    imagePullPolicy: IfNotPresent
    name: prom-label-proxy
    resources:
      requests:
        cpu: 1m
        memory: 20Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: alertmanager-main-0
  imagePullSecrets:
  - name: alertmanager-main-dockercfg-ddtjb
  nodeName: build0-gstfj-w-b-jx6pf.c.openshift-ci-build-farm.internal
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000280000
    seLinuxOptions:
      level: s0:c17,c4
  serviceAccount: alertmanager-main
  serviceAccountName: alertmanager-main
  subdomain: alertmanager-operated
  terminationGracePeriodSeconds: 120
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: alertmanager-main-db
    persistentVolumeClaim:
      claimName: alertmanager-main-db-alertmanager-main-0
  - name: config-volume
    secret:
      defaultMode: 420
      secretName: alertmanager-main
  - name: secret-alertmanager-main-tls
    secret:
      defaultMode: 420
      secretName: alertmanager-main-tls
  - name: secret-alertmanager-main-proxy
    secret:
      defaultMode: 420
      secretName: alertmanager-main-proxy
  - name: secret-alertmanager-kube-rbac-proxy
    secret:
      defaultMode: 420
      secretName: alertmanager-kube-rbac-proxy
  - configMap:
      defaultMode: 420
      items:
      - key: ca-bundle.crt
        path: tls-ca-bundle.pem
      name: alertmanager-trusted-ca-bundle-d34s91lhv300e
      optional: true
    name: alertmanager-trusted-ca-bundle
  - name: alertmanager-main-token-pvpj8
    secret:
      defaultMode: 420
      secretName: alertmanager-main-token-pvpj8
```
PVC yaml:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
    volume.kubernetes.io/selected-node: build0-gstfj-w-b-pzxf6.c.openshift-ci-build-farm.internal
  creationTimestamp: "2020-05-26T19:28:36Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    alertmanager: main
    app: alertmanager
  name: alertmanager-main-db-alertmanager-main-0
  namespace: openshift-monitoring
  resourceVersion: "2164885"
  selfLink: /api/v1/namespaces/openshift-monitoring/persistentvolumeclaims/alertmanager-main-db-alertmanager-main-0
  uid: 843a253b-e243-4470-9073-2213153018d4
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
  volumeMode: Filesystem
  volumeName: pvc-843a253b-e243-4470-9073-2213153018d4
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  phase: Bound
```

PV yaml:
```
$ k get pv pvc-843a253b-e243-4470-9073-2213153018d4 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: gce-pd-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/gce-pd
  creationTimestamp: "2020-05-26T19:28:39Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: us-east1
    failure-domain.beta.kubernetes.io/zone: us-east1-b
  name: pvc-843a253b-e243-4470-9073-2213153018d4
  resourceVersion: "2164881"
  selfLink: /api/v1/persistentvolumes/pvc-843a253b-e243-4470-9073-2213153018d4
  uid: 49567541-2c10-4d77-b62d-8d08b9b2b1fe
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: alertmanager-main-db-alertmanager-main-0
    namespace: openshift-monitoring
    resourceVersion: "2164835"
    uid: 843a253b-e243-4470-9073-2213153018d4
  gcePersistentDisk:
    fsType: ext4
    pdName: build0-gstfj-dynamic-pvc-843a253b-e243-4470-9073-2213153018d4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-east1-b
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - us-east1
  persistentVolumeReclaimPolicy: Delete
  storageClassName: standard
  volumeMode: Filesystem
status:
  phase: Bound
```

Pod events when it failed to trigger scale up:
```
  Normal   NotTriggerScaleUp       38m (x50 over 147m)   cluster-autoscaler       pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had volume node affinity conflict, 1 node(s) didn't match node selector, 1 max node group size reached
```

The machineset at the time had one replica and the following autoscaler associated:

```
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"autoscaling.openshift.io/v1beta1","kind":"MachineAutoscaler","metadata":{"annotations":{},"name":"build0-gstfj-w-b","namespace":"openshift-machine-api"},"spec":{"maxReplicas":12,"minReplicas":1,"scaleTargetRef":{"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","name":"build0-gstfj-w-b"}}}
  creationTimestamp: "2020-05-26T17:43:14Z"
  finalizers:
  - machinetarget.autoscaling.openshift.io
  generation: 3
  name: build0-gstfj-w-b
  namespace: openshift-machine-api
  resourceVersion: "104296600"
  selfLink: /apis/autoscaling.openshift.io/v1beta1/namespaces/openshift-machine-api/machineautoscalers/build0-gstfj-w-b
  uid: 25774371-66f6-4f4e-abad-c052ec7b7b92
spec:
  maxReplicas: 12
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: build0-gstfj-w-b
status:
  lastTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: build0-gstfj-w-b
```


How reproducible:

Most likely by:
* Creating a machineset with a distinct label and an associated autoscaler
* Scaling it to one replica
* Manually setting that one node to unschedulable
* Creating a pod with a nodeSelector that only matches that one machineset's nodes
* Observing that the cluster autoscaler does not scale up
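The repro steps above can be illustrated with a minimal pod manifest; the label key, label value, and image here are hypothetical placeholders for whatever distinct label the machineset's nodes carry:

```
apiVersion: v1
kind: Pod
metadata:
  name: scale-up-repro
spec:
  # Hypothetical label present only on the nodes of the cordoned machineset
  nodeSelector:
    example.com/pool: dedicated-b
  containers:
  - name: sleep
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
```

With the only matching node cordoned, this pod stays Pending; the autoscaler should scale the machineset up to accommodate it, but does not.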


Comment 1 Joel Speed 2020-10-27 10:05:50 UTC
I believe this is closely related to https://bugzilla.redhat.com/show_bug.cgi?id=1880930

I briefly tried to reproduce this yesterday, but did not get to try with a persistent volume.

Without a persistent volume, I was unable to reproduce. My initial concern about node cordoning therefore doesn't seem to be related to this issue.

I will attempt to reproduce again using a PV to see whether that surfaces the issue.

Comment 2 Joel Speed 2020-10-27 12:45:59 UTC
I managed to reproduce the issue today.

Steps:
- Create GCP cluster using IPI installation
- Create `cluster-monitoring-config` as post install step [1]
- Create a `clusterautoscaler` and `machineautoscalers` for each machineset
- `kubectl drain` the node that an alertmanager pod is on

I then also updated the cluster autoscaler pod to have a higher verbosity on the logs and captured:

```
I1027 12:41:52.690872       1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-a-7824066832447018591": csinode.storage.k8s.io "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-a-7824066832447018591" not found
I1027 12:41:52.690935       1 scheduler_binder.go:786] PersistentVolume "pvc-6090a25f-dbb3-4ec4-b395-687554eda99d", Node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-a-7824066832447018591" mismatch for Pod "openshift-monitoring/alertmanager-main-0": No matching NodeSelectorTerms
I1027 12:41:52.690963       1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on openshift-machine-api/jspeed-test-gtz9k-worker-a, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1027 12:41:52.690987       1 scale_up.go:437] No pod can fit to openshift-machine-api/jspeed-test-gtz9k-worker-a
I1027 12:41:52.891335       1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on openshift-machine-api/jspeed-test-gtz9k-worker-b, predicate checking error: node(s) didn't match node selector; predicateName=NodeAffinity; reasons: node(s) didn't match node selector; debugInfo=
I1027 12:41:52.891370       1 scale_up.go:437] No pod can fit to openshift-machine-api/jspeed-test-gtz9k-worker-b
I1027 12:41:53.090741       1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-c-1511327361354317201": csinode.storage.k8s.io "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-c-1511327361354317201" not found
I1027 12:41:53.090803       1 scheduler_binder.go:786] PersistentVolume "pvc-6090a25f-dbb3-4ec4-b395-687554eda99d", Node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-c-1511327361354317201" mismatch for Pod "openshift-monitoring/alertmanager-main-0": No matching NodeSelectorTerms
I1027 12:41:53.090833       1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on openshift-machine-api/jspeed-test-gtz9k-worker-c, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1027 12:41:53.090857       1 scale_up.go:437] No pod can fit to openshift-machine-api/jspeed-test-gtz9k-worker-c
I1027 12:41:53.090878       1 scale_up.go:441] No expansion options
```

It looks like the node selector check is failing even though it shouldn't: the label required by the node selector is present on the jspeed-test-gtz9k-worker-b node in the cluster.

I think the next step is to produce a debug build with extra logging to better understand what the autoscaler, and in particular its scheduling logic, thinks it is seeing.

[1]:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 40Gi
```

Comment 3 Joel Speed 2020-10-28 18:14:29 UTC
I've spent time today working out exactly what was happening here.

Firstly, to reproduce this, you must ensure that the only pod that becomes unschedulable is the alertmanager pod; otherwise the autoscaler will scale up anyway and the problem is masked.

Secondly, ALL nodes in a particular nodegroup (machineset) must be cordoned or otherwise not considered healthy. When a nodegroup is considered to have no healthy nodes (which includes cordoned nodes), the autoscaler uses a "Template" node to make scaling decisions rather than the actual nodes from the cluster.

Thirdly, looking at the code where our provider constructs a "Template" node [1], we can see that it sets only a small number of legacy well-known labels, which do not include the labels in the nodeSelector on the alertmanager pod. Hence the nodeAffinity predicate fails, the autoscaler deems the pod unschedulable on that node, and it does not scale up that node group.

I have started working on a PR that updates the list to include the newer stable well-known labels and also falls back to existing node labels when present, to improve the matching. In my testing this does resolve the issue.

[1]: https://github.com/openshift/kubernetes-autoscaler/blob/698efa2f989b509c5c1a2549a531a08e7639bd9f/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go#L311
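To make the failure mode concrete, here is a simplified Python simulation of the nodeSelector check applied to a template node. This is not the actual autoscaler or scheduler code; the label sets are illustrative, based on the details in this bug (the pod's nodeSelector uses the stable `kubernetes.io/os` label, while the template carries only legacy labels):

```python
# Simplified simulation of the node-selector check the autoscaler applies to a
# "Template" node. NOT the real autoscaler/scheduler code; labels illustrative.

def matches_selector(node_labels, node_selector):
    """A nodeSelector is satisfied only if every key/value pair is present."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())

# The alertmanager pod's nodeSelector uses the stable well-known label.
pod_node_selector = {"kubernetes.io/os": "linux"}

# A template node populated only with legacy well-known labels fails the check,
# so the autoscaler concludes the pod cannot fit and refuses to scale up.
legacy_template = {
    "beta.kubernetes.io/os": "linux",
    "beta.kubernetes.io/arch": "amd64",
    "kubernetes.io/hostname": "template-node",
}
print(matches_selector(legacy_template, pod_node_selector))  # False

# Adding the stable labels (the gist of the fix) lets the check pass.
fixed_template = dict(legacy_template,
                      **{"kubernetes.io/os": "linux",
                         "kubernetes.io/arch": "amd64"})
print(matches_selector(fixed_template, pod_node_selector))  # True
```

The same shape of mismatch explains the "volume node affinity conflict" messages: the PV's required zone/region terms are checked against whatever labels the template node happens to carry.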

Comment 5 sunzhaohua 2020-12-04 13:39:53 UTC
Verified
clusterversion: 4.7.0-0.nightly-2020-12-03-103850

Steps:
- Create GCP cluster using IPI installation
- Create `cluster-monitoring-config` as post install step [1]
- Create a `clusterautoscaler` and `machineautoscalers` for each machineset
- Updated the cluster autoscaler pod to have a higher verbosity
  # oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
  # oc -n openshift-machine-api scale deploy cluster-autoscaler-operator --replicas=0
  $ oc edit deploy cluster-autoscaler-default
- Ensure that the only pod that becomes unschedulable is the alert manager pod
  # oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=0
  # oc -n openshift-monitoring scale deploy prometheus-operator --replicas=0
  # oc edit statefulset.apps/alertmanager-main
  alertmanager-proxy:
...
    Requests:
      cpu:     1m
      memory:  10Gi

- `kubectl drain` the node that an alertmanager pod is on; I drained all worker nodes
   $ oc adm drain zhsungcp4-1-8xxqx-worker-a-brflc.c.openshift-qe.internal --ignore-daemonsets --delete-local-data --force

- Check autoscaler log



[1]:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 40Gi
```

I1204 12:47:58.641819       1 klogx.go:86] Pod openshift-monitoring/alertmanager-main-2 is unschedulable
..

I1204 12:48:00.844055       1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b-4799660975768660701": csinode.storage.k8s.io "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b-4799660975768660701" not found
I1204 12:48:00.844109       1 scheduler_binder.go:786] PersistentVolume "pvc-4f05e351-265a-45fd-a9dd-edab77329956", Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b-4799660975768660701" mismatch for Pod "openshift-monitoring/alertmanager-main-2": No matching NodeSelectorTerms
I1204 12:48:00.844143       1 scale_up.go:288] Pod alertmanager-main-2 can't be scheduled on MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1204 12:48:00.844173       1 scale_up.go:437] No pod can fit to MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b
I1204 12:48:01.036773       1 request.go:581] Throttling request took 192.396933ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/zhsungcp4-1-8xxqx-worker-c/scale
I1204 12:48:01.043120       1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-919843791599379793": csinode.storage.k8s.io "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-919843791599379793" not found
I1204 12:48:01.043165       1 scheduler_binder.go:792] All bound volumes for Pod "openshift-monitoring/alertmanager-main-2" match with Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-919843791599379793"
I1204 12:48:01.043276       1 scale_up.go:456] Best option to resize: MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c
I1204 12:48:01.043295       1 scale_up.go:460] Estimated 1 nodes needed in MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c
I1204 12:48:01.236782       1 request.go:581] Throttling request took 193.194598ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/zhsungcp4-1-8xxqx-worker-c/scale
I1204 12:48:01.242673       1 scale_up.go:574] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c 1->2 (max: 3)}]
I1204 12:48:01.242736       1 scale_up.go:663] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c size to 2
I1204 12:48:01.243089       1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-machine-api", Name:"cluster-autoscaler-status", UID:"6e40d730-ab5d-4a68-8701-e9e73a39b014", APIVersion:"v1", ResourceVersion:"399976", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c size to 2
I1204 12:48:01.243475       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-machine-api", Name:"cluster-autoscaler-status", UID:"6e40d730-ab5d-4a68-8701-e9e73a39b014", APIVersion:"v1", ResourceVersion:"399976", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c size to 2



I1204 12:55:49.324003       1 klogx.go:86] Pod openshift-monitoring/alertmanager-main-0 is unschedulable


I1204 12:55:51.325780       1 scheduler_binder.go:792] All bound volumes for Pod "openshift-monitoring/alertmanager-main-0" match with Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a-7222222877500009439"
..
I1204 12:55:51.724225       1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-7055955377579800709": csinode.storage.k8s.io "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-7055955377579800709" not found
I1204 12:55:51.724299       1 scheduler_binder.go:786] PersistentVolume "pvc-eedf59c6-c189-4880-9b11-dd74487508e6", Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-7055955377579800709" mismatch for Pod "openshift-monitoring/alertmanager-main-0": No matching NodeSelectorTerms
I1204 12:55:51.724329       1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1204 12:55:51.724355       1 scale_up.go:437] No pod can fit to MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c
I1204 12:55:51.724379       1 scale_up.go:456] Best option to resize: MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a
I1204 12:55:51.724391       1 scale_up.go:460] Estimated 1 nodes needed in MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a
I1204 12:55:51.918132       1 request.go:581] Throttling request took 193.456156ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/zhsungcp4-1-8xxqx-worker-a/scale
I1204 12:55:51.923398       1 scale_up.go:574] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a 1->2 (max: 3)}]
I1204 12:55:51.923460       1 scale_up.go:663] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a size to 2
I1204 12:55:51.923658       1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-machine-api", Name:"cluster-autoscaler-status", UID:"6e40d730-ab5d-4a68-8701-e9e73a39b014", APIVersion:"v1", ResourceVersion:"403916", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a size to 2

$ oc get node
NAME                                                       STATUS   ROLES    AGE   VERSION
zhsungcp4-1-8xxqx-master-0.c.openshift-qe.internal         Ready    master   21h   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-master-1.c.openshift-qe.internal         Ready    master   21h   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-master-2.c.openshift-qe.internal         Ready    master   21h   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-worker-a-6cvgs.c.openshift-qe.internal   Ready    worker   38m   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-worker-b-nc5f4.c.openshift-qe.internal   Ready    worker   42m   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-worker-c-rbcsn.c.openshift-qe.internal   Ready    worker   45m   v1.19.2+ad738ba

Comment 8 errata-xmlrpc 2021-02-24 15:28:28 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

