Bug 1891551
Summary: | Clusterautoscaler doesn't scale up as expected | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | aaleman
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed>
Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | medium | CC: | hongkliu, jspeed
Version: | 4.6 | |
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: The cluster autoscaler would use a template for node scaling decisions in certain circumstances. This template includes only a subset of the information available on an actual node.
Consequence: In some scenarios, the autoscaler would claim that adding new nodes would not allow pending pods to be scheduled.
Fix: Ensure the node template includes as many standard labels as possible to increase the likelihood that affinity checks pass.
Result: The autoscaler is less likely to be unable to scale up when a pending pod uses node affinity with a standard label.
|
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:28:28 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
aaleman 2020-10-26 15:38:21 UTC
I believe this is closely related to https://bugzilla.redhat.com/show_bug.cgi?id=1880930

I tried to briefly reproduce this yesterday, but did not get to try with a persistent volume. Without a persistent volume, I was unable to reproduce. My initial concern about the node cordoning therefore doesn't seem to be related to this issue. Will attempt to reproduce again using a PV and see if that presents the issue.

I managed to reproduce the issue today. Steps:

- Create a GCP cluster using IPI installation
- Create `cluster-monitoring-config` as a post-install step [1]
- Create a `clusterautoscaler` and `machineautoscalers` for each machineset (an illustrative sketch of these resources follows this comment)
- `kubectl drain` the node that an alertmanager pod is on

I then also updated the cluster autoscaler pod to have a higher verbosity on the logs and captured:

```
I1027 12:41:52.690872 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-a-7824066832447018591": csinode.storage.k8s.io "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-a-7824066832447018591" not found
I1027 12:41:52.690935 1 scheduler_binder.go:786] PersistentVolume "pvc-6090a25f-dbb3-4ec4-b395-687554eda99d", Node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-a-7824066832447018591" mismatch for Pod "openshift-monitoring/alertmanager-main-0": No matching NodeSelectorTerms
I1027 12:41:52.690963 1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on openshift-machine-api/jspeed-test-gtz9k-worker-a, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1027 12:41:52.690987 1 scale_up.go:437] No pod can fit to openshift-machine-api/jspeed-test-gtz9k-worker-a
I1027 12:41:52.891335 1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on openshift-machine-api/jspeed-test-gtz9k-worker-b, predicate checking error: node(s) didn't match node selector; predicateName=NodeAffinity; reasons: node(s) didn't match node selector; debugInfo=
I1027 12:41:52.891370 1 scale_up.go:437] No pod can fit to openshift-machine-api/jspeed-test-gtz9k-worker-b
I1027 12:41:53.090741 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-c-1511327361354317201": csinode.storage.k8s.io "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-c-1511327361354317201" not found
I1027 12:41:53.090803 1 scheduler_binder.go:786] PersistentVolume "pvc-6090a25f-dbb3-4ec4-b395-687554eda99d", Node "template-node-for-openshift-machine-api/jspeed-test-gtz9k-worker-c-1511327361354317201" mismatch for Pod "openshift-monitoring/alertmanager-main-0": No matching NodeSelectorTerms
I1027 12:41:53.090833 1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on openshift-machine-api/jspeed-test-gtz9k-worker-c, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1027 12:41:53.090857 1 scale_up.go:437] No pod can fit to openshift-machine-api/jspeed-test-gtz9k-worker-c
I1027 12:41:53.090878 1 scale_up.go:441] No expansion options
```

Looks like the node selector is failing for some reason even though it shouldn't be: the node selector's required label is present on the node for jspeed-test-gtz9k-worker-b in the cluster.

I think the next step is to try to build a debug build with extra logging to understand more about what the autoscaler, and in particular the scheduling part of it, thinks it is seeing.

[1]:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 40Gi
```
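For context on the "volume node affinity conflict" and "No matching NodeSelectorTerms" messages above: the PersistentVolume provisioned for the alertmanager claim on GCP is zonal, so it carries a node affinity that only nodes in the PV's zone can satisfy. The minimal sketch below shows roughly what such a PV looks like; the disk name, zone and region values, and the exact topology label keys are illustrative assumptions, not values dumped from this cluster.

```
# Illustrative sketch of a zonal PV with node affinity; values are assumptions.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-6090a25f-dbb3-4ec4-b395-687554eda99d   # name taken from the log above
spec:
  capacity:
    storage: 40Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  gcePersistentDisk:
    pdName: example-disk                            # hypothetical disk name
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone      # older clusters may use failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
                - us-central1-b                     # hypothetical zone
            - key: topology.kubernetes.io/region
              operator: In
              values:
                - us-central1                       # hypothetical region
```

Only a node, or node template, whose labels satisfy these terms passes the VolumeBinding check, which is why two of the three node groups are rejected outright and only the group in the PV's zone is evaluated against the pod's own node selector.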
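For the "create a `clusterautoscaler` and `machineautoscalers`" step referenced above, a minimal sketch of the two resources is shown below. The kinds and fields follow the standard OpenShift autoscaling API; the MachineSet name and the replica bounds are illustrative assumptions, not values recorded in this bug.

```
# Minimal sketch; MachineSet name and min/max replicas are assumptions.
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: jspeed-test-gtz9k-worker-a          # one MachineAutoscaler per MachineSet
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: jspeed-test-gtz9k-worker-a
```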
So I've spent time today working out exactly what is happening here.

Firstly, to reproduce this, you must ensure that the only pod that becomes unschedulable is the alertmanager pod; otherwise the autoscaler will scale up anyway and the problem is masked.

Secondly, ALL nodes in a particular nodegroup (machineset) must be cordoned or otherwise not considered healthy. When a nodegroup is considered to have no healthy nodes (which includes cordoned nodes), the autoscaler uses a "Template" node to make scaling decisions rather than the actual nodes from the cluster.

Thirdly, looking at the code where our provider constructs a "Template" node [1], we can see that it sets only a small number of legacy well-known labels, which do not include the labels used in the nodeSelector on the alertmanager pod. The nodeAffinity predicate therefore fails, the autoscaler deems the pod cannot be scheduled on that node, and it does not scale up that node group.

I have started working on a PR that updates the list to include the newer stable well-known labels and also adds a fallback to use existing node labels if present, to improve the matching algorithm. This should resolve the issue, and in my testing it does.

[1]: https://github.com/openshift/kubernetes-autoscaler/blob/698efa2f989b509c5c1a2549a531a08e7639bd9f/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go#L311
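To make the label mismatch concrete, the sketch below contrasts a pending pod that selects nodes via a stable well-known label with the kind of labels a pre-fix template node carried. The specific keys and values are illustrative assumptions based on the analysis above, not objects dumped from this cluster; the fix adds the stable equivalents (and, where available, labels copied from an existing node in the group) to the template.

```
# Illustrative only: a pod selecting on a stable well-known label.
apiVersion: v1
kind: Pod
metadata:
  name: example-pending-pod
spec:
  nodeSelector:
    kubernetes.io/os: linux                   # stable label, present on real nodes
  containers:
    - name: app
      image: registry.example.com/app:latest
---
# Labels of the kind the pre-fix template node carried (legacy keys only),
# so the nodeSelector above cannot match. After the fix, stable keys such as
# kubernetes.io/os and topology.kubernetes.io/zone are included as well.
apiVersion: v1
kind: Node
metadata:
  name: template-node-example                 # hypothetical name
  labels:
    beta.kubernetes.io/os: linux
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: n1-standard-4
    failure-domain.beta.kubernetes.io/region: us-central1
    failure-domain.beta.kubernetes.io/zone: us-central1-b
```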
Verified clusterversion: 4.7.0-0.nightly-2020-12-03-103850

Steps:

- Create a GCP cluster using IPI installation
- Create `cluster-monitoring-config` as a post-install step [1]
- Create a `clusterautoscaler` and `machineautoscalers` for each machineset
- Update the cluster autoscaler pod to have a higher verbosity:

  ```
  # oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
  # oc -n openshift-machine-api scale deploy cluster-autoscaler-operator --replicas=0
  $ oc edit deploy cluster-autoscaler-default
  ```

- Ensure that the only pod that becomes unschedulable is the alertmanager pod:

  ```
  # oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=0
  # oc -n openshift-monitoring scale deploy prometheus-operator --replicas=0
  # oc edit statefulset.apps/alertmanager-main
    alertmanager-proxy:
    ...
      Requests:
        cpu: 1m
        memory: 10Gi
  ```

- `kubectl drain` the node that an alertmanager pod is on (I drained all worker nodes):

  ```
  $ oc adm drain zhsungcp4-1-8xxqx-worker-a-brflc.c.openshift-qe.internal --ignore-daemonsets --delete-local-data --force
  ```

- Check the autoscaler log

[1]:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 40Gi
```

Autoscaler log:

```
I1204 12:47:58.641819 1 klogx.go:86] Pod openshift-monitoring/alertmanager-main-2 is unschedulable
..
I1204 12:48:00.844055 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b-4799660975768660701": csinode.storage.k8s.io "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b-4799660975768660701" not found
I1204 12:48:00.844109 1 scheduler_binder.go:786] PersistentVolume "pvc-4f05e351-265a-45fd-a9dd-edab77329956", Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b-4799660975768660701" mismatch for Pod "openshift-monitoring/alertmanager-main-2": No matching NodeSelectorTerms
I1204 12:48:00.844143 1 scale_up.go:288] Pod alertmanager-main-2 can't be scheduled on MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1204 12:48:00.844173 1 scale_up.go:437] No pod can fit to MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-b
I1204 12:48:01.036773 1 request.go:581] Throttling request took 192.396933ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/zhsungcp4-1-8xxqx-worker-c/scale
I1204 12:48:01.043120 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-919843791599379793": csinode.storage.k8s.io "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-919843791599379793" not found
I1204 12:48:01.043165 1 scheduler_binder.go:792] All bound volumes for Pod "openshift-monitoring/alertmanager-main-2" match with Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-919843791599379793"
I1204 12:48:01.043276 1 scale_up.go:456] Best option to resize: MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c
I1204 12:48:01.043295 1 scale_up.go:460] Estimated 1 nodes needed in MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c
I1204 12:48:01.236782 1 request.go:581] Throttling request took 193.194598ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/zhsungcp4-1-8xxqx-worker-c/scale
I1204 12:48:01.242673 1 scale_up.go:574] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c 1->2 (max: 3)}]
I1204 12:48:01.242736 1 scale_up.go:663] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c size to 2
I1204 12:48:01.243089 1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-machine-api", Name:"cluster-autoscaler-status", UID:"6e40d730-ab5d-4a68-8701-e9e73a39b014", APIVersion:"v1", ResourceVersion:"399976", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c size to 2
I1204 12:48:01.243475 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-machine-api", Name:"cluster-autoscaler-status", UID:"6e40d730-ab5d-4a68-8701-e9e73a39b014", APIVersion:"v1", ResourceVersion:"399976", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c size to 2
I1204 12:55:49.324003 1 klogx.go:86] Pod openshift-monitoring/alertmanager-main-0 is unschedulable
I1204 12:55:51.325780 1 scheduler_binder.go:792] All bound volumes for Pod "openshift-monitoring/alertmanager-main-0" match with Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a-7222222877500009439"
..
I1204 12:55:51.724225 1 scheduler_binder.go:769] Could not get a CSINode object for the node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-7055955377579800709": csinode.storage.k8s.io "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-7055955377579800709" not found
I1204 12:55:51.724299 1 scheduler_binder.go:786] PersistentVolume "pvc-eedf59c6-c189-4880-9b11-dd74487508e6", Node "template-node-for-MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c-7055955377579800709" mismatch for Pod "openshift-monitoring/alertmanager-main-0": No matching NodeSelectorTerms
I1204 12:55:51.724329 1 scale_up.go:288] Pod alertmanager-main-0 can't be scheduled on MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I1204 12:55:51.724355 1 scale_up.go:437] No pod can fit to MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-c
I1204 12:55:51.724379 1 scale_up.go:456] Best option to resize: MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a
I1204 12:55:51.724391 1 scale_up.go:460] Estimated 1 nodes needed in MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a
I1204 12:55:51.918132 1 request.go:581] Throttling request took 193.456156ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/zhsungcp4-1-8xxqx-worker-a/scale
I1204 12:55:51.923398 1 scale_up.go:574] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a 1->2 (max: 3)}]
I1204 12:55:51.923460 1 scale_up.go:663] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a size to 2
I1204 12:55:51.923658 1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-machine-api", Name:"cluster-autoscaler-status", UID:"6e40d730-ab5d-4a68-8701-e9e73a39b014", APIVersion:"v1", ResourceVersion:"403916", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp4-1-8xxqx-worker-a size to 2
```
```
$ oc get node
NAME                                                       STATUS   ROLES    AGE   VERSION
zhsungcp4-1-8xxqx-master-0.c.openshift-qe.internal         Ready    master   21h   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-master-1.c.openshift-qe.internal         Ready    master   21h   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-master-2.c.openshift-qe.internal         Ready    master   21h   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-worker-a-6cvgs.c.openshift-qe.internal   Ready    worker   38m   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-worker-b-nc5f4.c.openshift-qe.internal   Ready    worker   42m   v1.19.2+ad738ba
zhsungcp4-1-8xxqx-worker-c-rbcsn.c.openshift-qe.internal   Ready    worker   45m   v1.19.2+ad738ba
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633