Description of problem:
The balanceSimilarNodeGroups option of the ClusterAutoscaler doesn't work when the memory discrepancy between nodes is greater than 128KB.

Cluster version:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.8     True        False         6h6m    Cluster version is 4.3.8

Steps to Reproduce:
The customer captured the memory discrepancy as follows: they created 3 machinesets and repeated these steps.
Step 1: Scale out the machinesets' replicas to 1
Step 2: Check the nodes' memory capacity
Step 3: Scale in the machinesets' replicas to 0

References:
1. Bug 1733235 - Installed worker nodes/machines have different amounts of memory
2. Bug 1731011 - [CA] Sometimes "--balance-similar-node-groups" option doesn't work well

Actual results:
Memory discrepancy reported up to 172016KB

Additional info:
This is fixed upstream by https://github.com/kubernetes/autoscaler/pull/2462, which changes the maximum memory difference to 256KB. This change is already present in our 4.4.z branch (https://github.com/openshift/kubernetes-autoscaler/blob/release-4.4/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L36). We could also backport the limit into the 4.3 branch, though I'm not sure whether there are any implications of doing so; that will need further investigation.
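For context, here is a minimal sketch of the kind of absolute-delta check involved. The constant and function names are illustrative, not the exact upstream identifiers, and it assumes node memory quantities resolve to bytes (the unit mismatch confirmed later in this bug), which is what makes a fixed numeric cap so unforgiving:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// Illustrative cap: 256000 after PR 2462 (128000 before it). When this
// number is compared against byte values, it effectively allows only ~256KB.
const maxMemoryDifference = 256000

// similarMemory mimics an absolute-delta check: two nodes are "similar"
// only if their memory capacities differ by no more than the cap.
func similarMemory(a, b resource.Quantity) bool {
	diff := a.Value() - b.Value() // Quantity.Value() returns bytes for memory
	if diff < 0 {
		diff = -diff
	}
	return diff <= maxMemoryDifference
}

func main() {
	// Capacities taken from two of the worker nodes reported in this bug.
	nodeA := resource.MustParse("16116152Ki")
	nodeB := resource.MustParse("15944120Ki")
	// The gap is 172032Ki (~168MiB), vastly above a 256000-byte cap,
	// so the node groups are never considered similar.
	fmt.Println(similarMemory(nodeA, nodeB)) // prints "false"
}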
Failed QA
Test env: 4.3.0-0.nightly-2020-05-04-051714 on aws

1. Created 3 new machinesets with m5.xlarge; memory discrepancy reported up to 172032KB
$ oc get node -o yaml | grep "memory"
      memory: 7008428Ki
      memory: 8159404Ki
    message: kubelet has sufficient memory available
      memory: 15265964Ki
      memory: 16416940Ki
    message: kubelet has sufficient memory available
      memory: 14793144Ki
      memory: 15944120Ki
    message: kubelet has sufficient memory available
      memory: 15265964Ki
      memory: 16416940Ki
    message: kubelet has sufficient memory available
      memory: 14793144Ki
      memory: 15944120Ki
    message: kubelet has sufficient memory available
      memory: 7008436Ki
      memory: 8159412Ki
    message: kubelet has sufficient memory available
      memory: 14965176Ki
      memory: 16116152Ki
    message: kubelet has sufficient memory available
      memory: 7008428Ki
      memory: 8159404Ki
    message: kubelet has sufficient memory available
      memory: 15265964Ki
      memory: 16416940Ki
    message: kubelet has sufficient memory available

2. Create clusterautoscaler with "balanceSimilarNodeGroups: true"

3. Create 3 machineautoscalers using the newly created machinesets
$ oc get machineautoscaler
NAME       REF KIND     REF NAME                                 MIN   MAX   AGE
worker-a   MachineSet   zhsun-0506432-wt796-worker-us-east-2aa   1     10    65m
worker-b   MachineSet   zhsun-0506432-wt796-worker-us-east-2bb   1     10    64m
worker-c   MachineSet   zhsun-0506432-wt796-worker-us-east-2cc   1     10    42m

4. Create a workload to scale up the cluster.

5. Check machines, nodes, and logs; the scale-up was balanced across only 2 groups.
I0506 10:27:44.492413 1 scale_up.go:273] 10 other pods are also unschedulable
I0506 10:27:44.500886 1 scale_up.go:430] Best option to resize: openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc
I0506 10:27:44.500911 1 scale_up.go:434] Estimated 10 nodes needed in openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc
I0506 10:27:44.501043 1 scale_up.go:539] Final scale-up plan: [{openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc 1->10 (max: 10)}]
I0506 10:27:44.501080 1 scale_up.go:700] Scale-up: setting group openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc size to 10
I0506 10:27:54.530600 1 scale_up.go:270] Pod openshift-machine-api/scale-up-5d784b79fd-t5jm8 is unschedulable
I0506 10:27:54.530626 1 scale_up.go:270] Pod openshift-machine-api/scale-up-5d784b79fd-hcb8h is unschedulable
I0506 10:27:54.530633 1 scale_up.go:270] Pod openshift-machine-api/scale-up-5d784b79fd-9g7gz is unschedulable
I0506 10:27:54.532116 1 scale_up.go:430] Best option to resize: openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb
I0506 10:27:54.532141 1 scale_up.go:434] Estimated 1 nodes needed in openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb
I0506 10:27:54.532250 1 scale_up.go:531] Splitting scale-up between 2 similar node groups: {openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb, openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2aa}
I0506 10:27:54.532280 1 scale_up.go:539] Final scale-up plan: [{openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb 1->2 (max: 10)}]
I0506 10:27:54.532300 1 scale_up.go:700] Scale-up: setting group openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb size to 2
I0506 10:28:04.588905 1 static_autoscaler.go:334] No unschedulable pods

If the memory discrepancy is small, it will balance across all 3 groups:
      memory: 14793128Ki
      memory: 15944104Ki
      memory: 14793144Ki
      memory: 15944120Ki
      memory: 14793144Ki
      memory: 15944120Ki
I0506 10:19:23.150104 1 scale_up.go:430] Best option to resize: openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc
I0506 10:19:23.150127 1 scale_up.go:434] Estimated 10 nodes needed in openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc
I0506 10:19:23.150247 1 scale_up.go:531] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc, openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2aa, openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb}
I0506 10:19:23.150280 1 scale_up.go:539] Final scale-up plan: [{openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2cc 1->5 (max: 10)} {openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2aa 1->4 (max: 10)} {openshift-machine-api/zhsun-0506432-wt796-worker-us-east-2bb 1->4 (max: 10)}]
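To spell out the arithmetic behind the failure: two of the worker nodes report identical capacity (15944120Ki each), while the third reports 16116152Ki, a gap of 16116152Ki - 15944120Ki = 172032Ki (~168MiB). That gap is far beyond the effective 256KB cap, which is consistent with the logs above: 2cc scales alone, and splitting only ever happens between 2bb and 2aa.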
@sunzhaohua Hey, is there a Polarion test case linked to this that I can take a look at? I'd like to see how the test case was set up so I can investigate why this failed. Did you happen to collect a must-gather for the cluster when you tested? I'm finding it hard to work out from the memory lists posted above which nodes are in which groups to compare the actual differences; some of those machines have large differences between them, so it would be good to clarify.
@Joel Speed
Test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-20108
clusterversion: 4.5.0-0.nightly-2020-05-05-205255

Test steps:
1. Update the machinesets, setting "instanceType: m5.xlarge"
$ oc get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws506-4ghhm-worker-us-east-2a   1         1         1       1           31h
zhsunaws506-4ghhm-worker-us-east-2b   1         1         1       1           31h
zhsunaws506-4ghhm-worker-us-east-2c   1         1         1       1           31h
$ oc get machine
NAME                                        PHASE     TYPE        REGION      ZONE         AGE
zhsunaws506-4ghhm-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   31h
zhsunaws506-4ghhm-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   31h
zhsunaws506-4ghhm-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   31h
zhsunaws506-4ghhm-worker-us-east-2a-zzd7q   Running   m5.xlarge   us-east-2   us-east-2a   15m
zhsunaws506-4ghhm-worker-us-east-2b-zj974   Running   m5.xlarge   us-east-2   us-east-2b   15m
zhsunaws506-4ghhm-worker-us-east-2c-tsxxl   Running   m5.xlarge   us-east-2   us-east-2c   42m
$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-134-184.us-east-2.compute.internal   Ready    worker   11m   v1.18.0-rc.1
ip-10-0-140-2.us-east-2.compute.internal     Ready    master   31h   v1.18.0-rc.1
ip-10-0-157-233.us-east-2.compute.internal   Ready    worker   11m   v1.18.0-rc.1
ip-10-0-158-45.us-east-2.compute.internal    Ready    master   31h   v1.18.0-rc.1
ip-10-0-164-148.us-east-2.compute.internal   Ready    master   31h   v1.18.0-rc.1
ip-10-0-171-149.us-east-2.compute.internal   Ready    worker   38m   v1.18.0-rc.1
$ oc get node | grep worker
ip-10-0-134-184.us-east-2.compute.internal   Ready    worker   11m   v1.18.0-rc.1
ip-10-0-157-233.us-east-2.compute.internal   Ready    worker   11m   v1.18.0-rc.1
ip-10-0-171-149.us-east-2.compute.internal   Ready    worker   38m   v1.18.0-rc.1
$ oc get node ip-10-0-134-184.us-east-2.compute.internal ip-10-0-157-233.us-east-2.compute.internal ip-10-0-171-149.us-east-2.compute.internal -o yaml | grep "memory"
      memory: 14793144Ki
      memory: 15944120Ki
      memory: 14793128Ki
      memory: 15944104Ki
      memory: 14965176Ki
      memory: 16116152Ki
16116152Ki - 15944120Ki = 172032Ki

2. Create clusterautoscaler with "balanceSimilarNodeGroups: true"
---
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s

3. Create 3 machineautoscalers
---
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-c"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: zhsunaws506-4ghhm-worker-us-east-2c
$ oc get machineautoscalers
NAME       REF KIND     REF NAME                              MIN   MAX   AGE
worker-a   MachineSet   zhsunaws506-4ghhm-worker-us-east-2a   1     10    48m
worker-b   MachineSet   zhsunaws506-4ghhm-worker-us-east-2b   1     10    48m
worker-c   MachineSet   zhsunaws506-4ghhm-worker-us-east-2c   1     10    47m

4. Create workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 40
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: busybox
        image: docker.io/library/busybox
        resources:
          requests:
            memory: 4Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0

5. Check logs and machineset
I0507 09:38:53.409088 1 scale_up.go:324] 13 other pods are also unschedulable
I0507 09:38:55.819066 1 scale_up.go:452] Best option to resize: openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2a
I0507 09:38:55.819104 1 scale_up.go:456] Estimated 11 nodes needed in openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2a
I0507 09:38:56.406750 1 scale_up.go:562] Splitting scale-up between 2 similar node groups: {openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2a, openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2b}
I0507 09:38:56.807804 1 scale_up.go:570] Final scale-up plan: [{openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2a 1->7 (max: 10)} {openshift-machine-api/zhsunaws506-4ghhm-worker-us-east-2b 1->6 (max: 10)}]
$ oc get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsunaws506-4ghhm-worker-us-east-2a   7         7         7       7           31h
zhsunaws506-4ghhm-worker-us-east-2b   6         6         6       6           31h
zhsunaws506-4ghhm-worker-us-east-2c   1         1         1       1           31h

Will attach must-gather.
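As a sanity check on the node estimate: each pod requests 4Gi, and these m5.xlarge nodes report roughly 14.1GiB allocatable (14793144Ki), so about 3 such pods fit per node, ignoring other workloads and reserved overhead. 40 replicas at 3 per node need ceil(40/3) = 14 nodes total; with 3 workers already present, that leaves roughly 11 new nodes, matching the "Estimated 11 nodes needed" line in the log.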
I've spent some time looking at this again and have determined that this bug is in fact still present in 4.5. I've changed this BZ to point to 4.5 and will introduce the fix into that version and then backport. The units of the resource values coming from real nodes were not matching the units assumed in the difference check: instead of allowing a 256MB delta, it only allowed a 256KB delta, which is much smaller than the differences expected across real nodes on cloud providers.
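A minimal sketch of a ratio-based comparison that sidesteps the unit problem entirely; the 1.5% tolerance mirrors the upstream MaxCapacityMemoryDifferenceRatio constant, but treat the exact name and value here as illustrative of the approach rather than the final patch:

package main

import (
	"fmt"
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// Illustrative tolerance: capacities within 1.5% of each other count as similar.
const maxCapacityMemoryDifferenceRatio = 0.015

// similarMemoryByRatio compares the relative gap, so it is unit-agnostic:
// bytes vs kilobytes no longer matters because the ratio is dimensionless.
func similarMemoryByRatio(a, b resource.Quantity) bool {
	larger := math.Max(float64(a.Value()), float64(b.Value()))
	smaller := math.Min(float64(a.Value()), float64(b.Value()))
	return larger-smaller <= larger*maxCapacityMemoryDifferenceRatio
}

func main() {
	// The same two capacities that failed the absolute check sketched earlier.
	nodeA := resource.MustParse("16116152Ki")
	nodeB := resource.MustParse("15944120Ki")
	// The 172032Ki gap is ~1.07% of the larger capacity, inside the 1.5% tolerance.
	fmt.Println(similarMemoryByRatio(nodeA, nodeB)) // prints "true"
}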
Deferring to 4.6 while trying to agree on a strategy for fixing this upstream. Will backport to 4.5.z once we have agreed on an approach to fix this issue.
https://bugzilla.redhat.com/show_bug.cgi?id=1824215#c15 still applies. Tagging with upcomingSprint.
Verified
clusterversion: 4.6.0-0.nightly-2020-06-12-084204

Test steps:
1. Update the machinesets, setting "instanceType: m5.xlarge"
$ oc get node | grep worker
ip-10-0-131-245.us-east-2.compute.internal   Ready    worker   27m   v1.18.3+2164959
ip-10-0-190-142.us-east-2.compute.internal   Ready    worker   27m   v1.18.3+2164959
ip-10-0-222-150.us-east-2.compute.internal   Ready    worker   23m   v1.18.3+2164959
$ oc get node ip-10-0-131-245.us-east-2.compute.internal ip-10-0-190-142.us-east-2.compute.internal ip-10-0-222-150.us-east-2.compute.internal -o yaml | grep "memory"
      memory: 14784824Ki
      memory: 15935800Ki
      memory: 14956872Ki
      memory: 16107848Ki
      memory: 14784824Ki
      memory: 15935800Ki
16107848Ki - 15935800Ki = 172048Ki

2. Create clusterautoscaler with "balanceSimilarNodeGroups: true"
---
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  balanceSimilarNodeGroups: true
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 10s
    unneededTime: 10s

3. Create 3 machineautoscalers
$ oc get machineautoscaler
NAME       REF KIND     REF NAME                              MIN   MAX   AGE
worker-a   MachineSet   zhsun615aws-wrjpw-worker-us-east-2a   1     10    45s
worker-b   MachineSet   zhsun615aws-wrjpw-worker-us-east-2b   1     10    29s
worker-c   MachineSet   zhsun615aws-wrjpw-worker-us-east-2c   1     10    9s

4. Create workload

5. Check logs and machineset
I0615 02:09:08.056685 1 scale_up.go:324] 13 other pods are also unschedulable
I0615 02:09:10.431324 1 scale_up.go:452] Best option to resize: openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c
I0615 02:09:10.431372 1 scale_up.go:456] Estimated 11 nodes needed in openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c
I0615 02:09:11.018039 1 scale_up.go:562] Splitting scale-up between 3 similar node groups: {openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c, openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2a, openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2b}
I0615 02:09:11.617622 1 scale_up.go:570] Final scale-up plan: [{openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c 1->5 (max: 10)} {openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2a 1->5 (max: 10)} {openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2b 1->4 (max: 10)}]
I0615 02:09:11.617669 1 scale_up.go:659] Scale-up: setting group openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2c size to 5
I0615 02:09:12.228972 1 scale_up.go:659] Scale-up: setting group openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2a size to 5
I0615 02:09:12.823869 1 scale_up.go:659] Scale-up: setting group openshift-machine-api/zhsun615aws-wrjpw-worker-us-east-2b size to 4
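This lines up with the ratio-based comparison sketched above (assuming the shipped fix uses a ~1.5% capacity tolerance): 172048Ki / 16107848Ki ≈ 0.0107, i.e. about a 1.07% gap, so all three groups are now treated as similar even though the absolute discrepancy is the same ~168MiB that previously broke balancing.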
# I think this BZ has to be public since it links to upstream GitHub, so I copied this comment from my update in comment #24 for visibility, without the customer's name.

(In reply to Alberto from comment #23)
Hello Alberto Garcia Lamela, sorry for jumping in; I am the RH OCP TAM for my customer. As Red Hat already knows, this affects the customer's project, which uses v4.4+, so this is a big problem for my TAM customer. (This has a huge impact on my customer; we already have a Support Exception for it, based on a request arising from customer executive-level discussions with RH PMs.) Since we have no workaround or mitigation for this yet, please backport the fix as soon as possible. I am grateful for your help. Thank you, BR, Masaki
Hi Team, the customer would like to know the current status of the backports to 4.5 and 4.4. Please let me know if we have any information regarding this that we could pass on to the customer. Many thanks, Manisha
Since this issue is not considered a release blocker, we will have to wait for 4.5 to release before we can merge the fix into a 4.5.z stream. This also blocks the backport to 4.4, so the fix won't be available there until a few weeks after 4.5 is released.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days