Description of problem:

We have a cluster that has:
* Three machinesets, each with a machineautoscaler, each in a distinct AZ
* A PDB that may prevent draining for up to ~4.5 hours, to prevent our batch workloads from getting interrupted

During a node upgrade, there was a pod that could not get scheduled and did not cause the autoscaler to scale up. The pod has a volume in an AZ in which, at that time, there was only one node, and that node was in the process of being drained (which took a couple of hours because of the aforementioned PDB).

Pod spec (after it got scheduled because that one node finished draining):
```
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: alertmanager
              operator: In
              values:
              - main
          namespaces:
          - openshift-monitoring
          topologyKey: kubernetes.io/hostname
        weight: 100
  containers:
  - args:
    - --config.file=/etc/alertmanager/config/alertmanager.yaml
    - --storage.path=/alertmanager
    - --data.retention=120h
    - --cluster.listen-address=[$(POD_IP)]:9094
    - --web.listen-address=127.0.0.1:9093
    - --web.external-url=https://alertmanager-main-openshift-monitoring.apps.build02.gcp.ci.openshift.org/
    - --web.route-prefix=/
    - --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
    - --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
    - --cluster.peer=alertmanager-main-2.alertmanager-operated:9094
    - --cluster.reconnect-timeout=5m
    env:
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e5bcf6d786fd218e1ef188eb40c39c31a98d03121fba3b7a1f16e87e45a7478b
    imagePullPolicy: IfNotPresent
    name: alertmanager
    ports:
    - containerPort: 9094
      name: mesh-tcp
      protocol: TCP
    - containerPort: 9094
      name: mesh-udp
      protocol: UDP
    resources:
      requests:
        cpu: 4m
        memory: 200Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/alertmanager/config
      name: config-volume
    - mountPath: /alertmanager
      name: alertmanager-main-db
      subPath: alertmanager-db
    - mountPath: /etc/alertmanager/secrets/alertmanager-main-tls
      name: secret-alertmanager-main-tls
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-main-proxy
      name: secret-alertmanager-main-proxy
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy
      name: secret-alertmanager-kube-rbac-proxy
      readOnly: true
    - mountPath: /etc/pki/ca-trust/extracted/pem/
      name: alertmanager-trusted-ca-bundle
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  - args:
    - --listen-address=localhost:8080
    - --reload-url=http://localhost:9093/-/reload
    - --watched-dir=/etc/alertmanager/config
    - --watched-dir=/etc/alertmanager/secrets/alertmanager-main-tls
    - --watched-dir=/etc/alertmanager/secrets/alertmanager-main-proxy
    - --watched-dir=/etc/alertmanager/secrets/alertmanager-kube-rbac-proxy
    command:
    - /bin/prometheus-config-reloader
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c9e61400619c4613db5cc73097d287e3cd5d2125c85d1d84cc30cfdaa1093e7
    imagePullPolicy: IfNotPresent
    name: config-reloader
    resources:
      requests:
        cpu: 1m
        memory: 10Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/alertmanager/config
      name: config-volume
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-main-tls
      name: secret-alertmanager-main-tls
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-main-proxy
      name: secret-alertmanager-main-proxy
      readOnly: true
    - mountPath: /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy
      name: secret-alertmanager-kube-rbac-proxy
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  - args:
    - -provider=openshift
    - -https-address=:9095
    - -http-address=
    - -email-domain=*
    - -upstream=http://localhost:9093
    - '-openshift-sar={"resource": "namespaces", "verb": "get"}'
    - '-openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}'
    - -tls-cert=/etc/tls/private/tls.crt
    - -tls-key=/etc/tls/private/tls.key
    - -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token
    - -cookie-secret-file=/etc/proxy/secrets/session_secret
    - -openshift-service-account=alertmanager-main
    - -openshift-ca=/etc/pki/tls/cert.pem
    - -openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    - -skip-auth-regex=^/metrics
    env:
    - name: HTTP_PROXY
    - name: HTTPS_PROXY
    - name: NO_PROXY
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:12b11e2000b42ce1aaa228d9c1f4c9177395add2fa43835e667b7fc9007e40e6
    imagePullPolicy: IfNotPresent
    name: alertmanager-proxy
    ports:
    - containerPort: 9095
      name: web
      protocol: TCP
    resources:
      requests:
        cpu: 1m
        memory: 20Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/tls/private
      name: secret-alertmanager-main-tls
    - mountPath: /etc/proxy/secrets
      name: secret-alertmanager-main-proxy
    - mountPath: /etc/pki/ca-trust/extracted/pem/
      name: alertmanager-trusted-ca-bundle
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  - args:
    - --secure-listen-address=0.0.0.0:9092
    - --upstream=http://127.0.0.1:9096
    - --config-file=/etc/kube-rbac-proxy/config.yaml
    - --tls-cert-file=/etc/tls/private/tls.crt
    - --tls-private-key-file=/etc/tls/private/tls.key
    - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
    - --logtostderr=true
    - --v=10
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c75977f28becdf4f7065cfa37233464dd31208b1767e620c4f19658f53f8ff8c
    imagePullPolicy: IfNotPresent
    name: kube-rbac-proxy
    ports:
    - containerPort: 9092
      name: tenancy
      protocol: TCP
    resources:
      requests:
        cpu: 1m
        memory: 20Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/kube-rbac-proxy
      name: secret-alertmanager-kube-rbac-proxy
    - mountPath: /etc/tls/private
      name: secret-alertmanager-main-tls
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  - args:
    - --insecure-listen-address=127.0.0.1:9096
    - --upstream=http://127.0.0.1:9093
    - --label=namespace
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b635694dc0663a1404d43a4a9ac8513a3087c7cfead50f6ab413f3c217c40b2a
    imagePullPolicy: IfNotPresent
    name: prom-label-proxy
    resources:
      requests:
        cpu: 1m
        memory: 20Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000280000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: alertmanager-main-token-pvpj8
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: alertmanager-main-0
  imagePullSecrets:
  - name: alertmanager-main-dockercfg-ddtjb
  nodeName: build0-gstfj-w-b-jx6pf.c.openshift-ci-build-farm.internal
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000280000
    seLinuxOptions:
      level: s0:c17,c4
  serviceAccount: alertmanager-main
  serviceAccountName: alertmanager-main
  subdomain: alertmanager-operated
  terminationGracePeriodSeconds: 120
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: alertmanager-main-db
    persistentVolumeClaim:
      claimName: alertmanager-main-db-alertmanager-main-0
  - name: config-volume
    secret:
      defaultMode: 420
      secretName: alertmanager-main
  - name: secret-alertmanager-main-tls
    secret:
      defaultMode: 420
      secretName: alertmanager-main-tls
  - name: secret-alertmanager-main-proxy
    secret:
      defaultMode: 420
      secretName: alertmanager-main-proxy
  - name: secret-alertmanager-kube-rbac-proxy
    secret:
      defaultMode: 420
      secretName: alertmanager-kube-rbac-proxy
  - configMap:
      defaultMode: 420
      items:
      - key: ca-bundle.crt
        path: tls-ca-bundle.pem
      name: alertmanager-trusted-ca-bundle-d34s91lhv300e
      optional: true
    name: alertmanager-trusted-ca-bundle
  - name: alertmanager-main-token-pvpj8
    secret:
      defaultMode: 420
      secretName: alertmanager-main-token-pvpj8
```
PVC yaml:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
    volume.kubernetes.io/selected-node: build0-gstfj-w-b-pzxf6.c.openshift-ci-build-farm.internal
  creationTimestamp: "2020-05-26T19:28:36Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    alertmanager: main
    app: alertmanager
  name: alertmanager-main-db-alertmanager-main-0
  namespace: openshift-monitoring
  resourceVersion: "2164885"
  selfLink: /api/v1/namespaces/openshift-monitoring/persistentvolumeclaims/alertmanager-main-db-alertmanager-main-0
  uid: 843a253b-e243-4470-9073-2213153018d4
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
  volumeMode: Filesystem
  volumeName: pvc-843a253b-e243-4470-9073-2213153018d4
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  phase: Bound
```
PV yaml:
```
$ k get pv pvc-843a253b-e243-4470-9073-2213153018d4 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: gce-pd-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/gce-pd
  creationTimestamp: "2020-05-26T19:28:39Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: us-east1
    failure-domain.beta.kubernetes.io/zone: us-east1-b
  name: pvc-843a253b-e243-4470-9073-2213153018d4
  resourceVersion: "2164881"
  selfLink: /api/v1/persistentvolumes/pvc-843a253b-e243-4470-9073-2213153018d4
  uid: 49567541-2c10-4d77-b62d-8d08b9b2b1fe
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: alertmanager-main-db-alertmanager-main-0
    namespace: openshift-monitoring
    resourceVersion: "2164835"
    uid: 843a253b-e243-4470-9073-2213153018d4
  gcePersistentDisk:
    fsType: ext4
    pdName: build0-gstfj-dynamic-pvc-843a253b-e243-4470-9073-2213153018d4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-east1-b
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - us-east1
  persistentVolumeReclaimPolicy: Delete
  storageClassName: standard
  volumeMode: Filesystem
status:
  phase: Bound
```
Pod events when it failed to trigger scale up:
```
Normal NotTriggerScaleUp 38m (x50 over 147m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had volume node affinity conflict, 1 node(s) didn't match node selector, 1 max node group size reached
```
The machineset at the time had one replica and the following autoscaler associated:
```
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"autoscaling.openshift.io/v1beta1","kind":"MachineAutoscaler","metadata":{"annotations":{},"name":"build0-gstfj-w-b","namespace":"openshift-machine-api"},"spec":{"maxReplicas":12,"minReplicas":1,"scaleTargetRef":{"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","name":"build0-gstfj-w-b"}}}
  creationTimestamp: "2020-05-26T17:43:14Z"
  finalizers:
  - machinetarget.autoscaling.openshift.io
  generation: 3
  name: build0-gstfj-w-b
  namespace: openshift-machine-api
  resourceVersion: "104296600"
  selfLink: /apis/autoscaling.openshift.io/v1beta1/namespaces/openshift-machine-api/machineautoscalers/build0-gstfj-w-b
  uid: 25774371-66f6-4f4e-abad-c052ec7b7b92
spec:
  maxReplicas: 12
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: build0-gstfj-w-b
status:
  lastTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: build0-gstfj-w-b
```
Version-Release number of selected component (if applicable):

How reproducible:
Most likely by:
* Creating a machineset with a distinct label and an associated autoscaler
* Scaling it to one replica
* Manually setting that one node to unschedulable
* Creating a pod that has a nodeSelector that only matches that one machineset's nodes (see the illustrative manifest at the end of this description)
* Observing that the cluster autoscaler does not scale up

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
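For anyone following the outline above, this is a minimal sketch of the kind of pod used in the last two bullets. The nodeSelector label (`machineset-workload: batch-b`) and the pause image are assumptions for illustration only, not values from the affected cluster, where the alertmanager pod is instead tied to the zone by its PV's node affinity.
```
# Hypothetical repro pod: the nodeSelector label below is an assumption, standing in
# for whatever distinct label the test machineset applies to its nodes.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-repro-pod
spec:
  nodeSelector:
    machineset-workload: batch-b        # assumed machineset-specific label
  containers:
  - name: sleeper
    image: registry.k8s.io/pause:3.9    # any small placeholder image works
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
```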
I believe this is closely related to https://bugzilla.redhat.com/show_bug.cgi?id=1880930.

I tried to briefly reproduce this yesterday, but did not get to try with a persistent volume. Without a persistent volume I was unable to reproduce, so my initial concern about the node cordoning does not seem to be related to this issue on its own. I will attempt to reproduce again using a PV and see whether that surfaces the issue.
I managed to reproduce the issue today. Steps:

- Create a GCP cluster using IPI installation
- Create `cluster-monitoring-config` as a post-install step [1]
- Create a `clusterautoscaler` and `machineautoscalers` for each machineset (a rough sketch of these objects is included at the end of this comment)
- `kubectl drain` the node that an alertmanager pod is on

I then also updated the cluster autoscaler pod to log at higher verbosity and captured:
```
I1027 12:41:52.690872 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-a-7824066832447018591": csinode.storage.k8s.io "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-a-7824066832447018591" not found
I1027 12:41:52.690935 1 scheduler_binder.go:786] PersistentVolume "pvc-6090a25f-dbb3-4ec4-b395-687554eda99d", Node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-a-7824066832447018591" mismatch for Pod "openshift-monitoring/alertmanager-main-0": No matching NodeSelectorTerms
I1027 12:41:52.690963 1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on openshift-machine-api/jspeed-test-gtz9k-worker-a, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1027 12:41:52.690987 1 scale_up.go:437] No pod can fit to openshift-machine-api/jspeed-test-gtz9k-worker-a
I1027 12:41:52.891335 1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on openshift-machine-api/jspeed-test-gtz9k-worker-b, predicate checking error: node(s) didn't match node selector; predicateName=NodeAffinity; reasons: node(s) didn't match node selector; debugInfo=
I1027 12:41:52.891370 1 scale_up.go:437] No pod can fit to openshift-machine-api/jspeed-test-gtz9k-worker-b
I1027 12:41:53.090741 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-c-1511327361354317201": csinode.storage.k8s.io "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-c-1511327361354317201" not found
I1027 12:41:53.090803 1 scheduler_binder.go:786] PersistentVolume "pvc-6090a25f-dbb3-4ec4-b395-687554eda99d", Node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-c-1511327361354317201" mismatch for Pod "openshift-monitoring/alertmanager-main-0": No matching NodeSelectorTerms
I1027 12:41:53.090833 1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on openshift-machine-api/jspeed-test-gtz9k-worker-c, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1027 12:41:53.090857 1 scale_up.go:437] No pod can fit to openshift-machine-api/jspeed-test-gtz9k-worker-c
I1027 12:41:53.090878 1 scale_up.go:441] No expansion options
```
It looks like the node selector predicate is failing even though it should not: the label required by the node selector is present on the node for jspeed-test-gtz9k-worker-b in the cluster. I think the next step is to build a debug build with extra logging to understand more about what the autoscaler, and in particular its scheduling simulation, thinks it is seeing.

[1]:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 40Gi
```
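For completeness, the autoscaler objects created in the steps above look roughly like the following. This is a minimal sketch only: the ClusterAutoscaler is shown with an empty spec (relying on defaults), and the MachineAutoscaler name and replica bounds are placeholders modelled on the one quoted in the bug description, not the exact manifests used in this reproduction.
```
# Sketch only: one MachineAutoscaler is created per machineset; names and
# replica bounds below are placeholders.
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec: {}
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: jspeed-test-gtz9k-worker-a      # placeholder machineset name
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: jspeed-test-gtz9k-worker-a
```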
So I've spent some time today working out exactly what was happening here.

Firstly, to reproduce this you must ensure that the only pod that becomes unschedulable is the alertmanager pod; otherwise the autoscaler will scale up anyway and the problem is masked.

Secondly, ALL nodes in a particular node group (machineset) must be cordoned or otherwise not considered healthy. When a node group is considered to have no healthy nodes (which includes cordoned nodes), the autoscaler uses a "Template" node to make its scaling decisions rather than the actual nodes from the cluster.

Thirdly, looking at the code where our provider constructs a "Template" node [1], we can see that it sets only a small number of legacy well-known labels, which do not include the labels in the nodeSelector on the alertmanager pod. Hence the node affinity predicate fails, the autoscaler deems the pod unable to schedule on that template node, and it does not scale up that node group (see the label illustration at the end of this comment).

I have started working on a PR that updates this list to include the newer stable well-known labels and also adds a fallback to use existing node labels when they are present, to improve the matching. In my testing this change does resolve the issue.

[1]: https://github.com/openshift/kubernetes-autoscaler/blob/698efa2f989b509c5c1a2549a531a08e7639bd9f/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go#L311
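To make the mismatch concrete, these are the labels that the failing predicates in this bug actually look for, as they appear on a real worker node (values taken from the pod and PV specs earlier in this report; the topology.* pair is the newer stable equivalent mentioned above). The exact label set placed on the template node is determined by the linked code, so treat this as an illustration rather than its literal output.
```
# Labels on a real GCP worker node that the scheduling predicates depend on.
# A template node missing any of these fails the corresponding predicate in the
# autoscaler's simulation, so no scale-up is triggered for that node group.
metadata:
  labels:
    kubernetes.io/os: linux                              # required by the pod's nodeSelector
    failure-domain.beta.kubernetes.io/region: us-east1   # required by the PV's nodeAffinity (legacy label)
    failure-domain.beta.kubernetes.io/zone: us-east1-b   # required by the PV's nodeAffinity (legacy label)
    topology.kubernetes.io/region: us-east1              # stable replacement for the legacy region label
    topology.kubernetes.io/zone: us-east1-b              # stable replacement for the legacy zone label
```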
Verified clusterversion: 4.7.0-0.nightly-2020-12-03-103850

Steps:
- Create a GCP cluster using IPI installation
- Create `cluster-monitoring-config` as a post-install step [1]
- Create a `clusterautoscaler` and `machineautoscalers` for each machineset
- Update the cluster autoscaler to log at a higher verbosity:
  # oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
  # oc -n openshift-machine-api scale deploy cluster-autoscaler-operator --replicas=0
  $ oc edit deploy cluster-autoscaler-default
- Ensure that the only pod that becomes unschedulable is the alertmanager pod (a sketch of this edit is at the end of this comment):
  # oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=0
  # oc -n openshift-monitoring scale deploy prometheus-operator --replicas=0
  # oc edit statefulset.apps/alertmanager-main
  alertmanager-proxy:
  ...
    Requests:
      cpu: 1m
      memory: 10Gi
- `kubectl drain` the node that an alertmanager pod is on; I drained all worker nodes
  $ oc adm drain zhsungcp4-1-8xxqx-worker-a-brflc.c.openshift-qe.internal --ignore-daemonsets --delete-local-data --force
- Check the autoscaler log

[1]:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 40Gi
```
Autoscaler log:
```
I1204 12:47:58.641819 1 klogx.go:86] Pod openshift-monitoring/alertmanager-main-2 is unschedulable
..
I1204 12:48:00.844055 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b-4799660975768660701": csinode.storage.k8s.io "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b-4799660975768660701" not found
I1204 12:48:00.844109 1 scheduler_binder.go:786] PersistentVolume "pvc-4f05e351-265a-45fd-a9dd-edab77329956", Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b-4799660975768660701" mismatch for Pod "openshift-monitoring/alertmanager-main-2": No matching NodeSelectorTerms
I1204 12:48:00.844143 1 scale_up.go:288] Pod alertmanager-main-2 can't be scheduled on MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1204 12:48:00.844173 1 scale_up.go:437] No pod can fit to MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b
I1204 12:48:01.036773 1 request.go:581] Throttling request took 192.396933ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/zhsungcp4-1-8xxqx-worker-c/scale
I1204 12:48:01.043120 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-919843791599379793": csinode.storage.k8s.io "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-919843791599379793" not found
I1204 12:48:01.043165 1 scheduler_binder.go:792] All bound volumes for Pod "openshift-monitoring/alertmanager-main-2" match with Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-919843791599379793"
I1204 12:48:01.043276 1 scale_up.go:456] Best option to resize: MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c
I1204 12:48:01.043295 1 scale_up.go:460] Estimated 1 nodes needed in MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c
I1204 12:48:01.236782 1 request.go:581] Throttling request took 193.194598ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/zhsungcp4-1-8xxqx-worker-c/scale
I1204 12:48:01.242673 1 scale_up.go:574] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c 1->2 (max: 3)}]
I1204 12:48:01.242736 1 scale_up.go:663] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c size to 2
I1204 12:48:01.243089 1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-machine-api", Name:"cluster-autoscaler-status", UID:"6e40d730-ab5d-4a68-8701-e9e73a39b014", APIVersion:"v1", ResourceVersion:"399976", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c size to 2
I1204 12:48:01.243475 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-machine-api", Name:"cluster-autoscaler-status", UID:"6e40d730-ab5d-4a68-8701-e9e73a39b014", APIVersion:"v1", ResourceVersion:"399976", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c size to 2
I1204 12:55:49.324003 1 klogx.go:86] Pod openshift-monitoring/alertmanager-main-0 is unschedulable
I1204 12:55:51.325780 1 scheduler_binder.go:792] All bound volumes for Pod "openshift-monitoring/alertmanager-main-0" match with Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a-7222222877500009439"
..
I1204 12:55:51.724225 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-7055955377579800709": csinode.storage.k8s.io "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-7055955377579800709" not found
I1204 12:55:51.724299 1 scheduler_binder.go:786] PersistentVolume "pvc-eedf59c6-c189-4880-9b11-dd74487508e6", Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-7055955377579800709" mismatch for Pod "openshift-monitoring/alertmanager-main-0": No matching NodeSelectorTerms
I1204 12:55:51.724329 1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1204 12:55:51.724355 1 scale_up.go:437] No pod can fit to MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c
I1204 12:55:51.724379 1 scale_up.go:456] Best option to resize: MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a
I1204 12:55:51.724391 1 scale_up.go:460] Estimated 1 nodes needed in MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a
I1204 12:55:51.918132 1 request.go:581] Throttling request took 193.456156ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/zhsungcp4-1-8xxqx-worker-a/scale
I1204 12:55:51.923398 1 scale_up.go:574] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a 1->2 (max: 3)}]
I1204 12:55:51.923460 1 scale_up.go:663] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a size to 2
I1204 12:55:51.923658 1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-machine-api", Name:"cluster-autoscaler-status", UID:"6e40d730-ab5d-4a68-8701-e9e73a39b014", APIVersion:"v1", ResourceVersion:"403916", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a size to 2
..
```
```
$ oc get node
NAME                                                       STATUS   ROLES    AGE   VERSION
zhsungcp4-1-8xxqx-master-0.c.openshift-qe.internal         Ready    master   21h   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-master-1.c.openshift-qe.internal         Ready    master   21h   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-master-2.c.openshift-qe.internal         Ready    master   21h   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-worker-a-6cvgs.c.openshift-qe.internal   Ready    worker   38m   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-worker-b-nc5f4.c.openshift-qe.internal   Ready    worker   42m   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-worker-c-rbcsn.c.openshift-qe.internal   Ready    worker   45m   v1.19.2+ad738ba
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633