Description of problem:
I deployed 300 pods (3 deployments * 100 pods each) and set the scheduler profile to each of the 3 available values, but the placement stayed almost the same. Please let me know if I misunderstood how the scheduler profile configuration works.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Deploy 300 pods

$ oc new-app httpd --name httpd
$ oc new-app httpd --name httpd1
$ oc new-app httpd --name httpd2
$ oc scale --replicas=100 deployment httpd
$ oc scale --replicas=100 deployment httpd1
$ oc scale --replicas=100 deployment httpd2
$ oc get pods -o wide | wc -l
301
$ oc get pods -o wide | grep australiaeast1 | wc -l
96
$ oc get pods -o wide | grep australiaeast2 | wc -l
100
$ oc get pods -o wide | grep australiaeast3 | wc -l
104

2. Alter the scheduler profile configuration

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 7
  name: cluster
  resourceVersion: "252871"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
status: {}

$ oc patch scheduler cluster --type='merge' -p '{"spec":{"profile":"HighNodeUtilization"}}'
scheduler.config.openshift.io/cluster patched

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 8
  name: cluster
  resourceVersion: "259718"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: HighNodeUtilization
status: {}

3. Recreate the pods and check the placement of the 300 pods

$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
$ oc get pods -o wide | grep australiaeast1 | wc -l
100
$ oc get pods -o wide | grep australiaeast2 | wc -l
100
$ oc get pods -o wide | grep australiaeast3 | wc -l
100

Actual results:
The placement is not affected by the scheduler profile configuration.

Expected results:
The placement is affected by the scheduler profile configuration.
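As a side note, instead of grepping for each node name separately, the per-node counts can be tallied in one command (a sketch; it assumes the default `oc get pods -o wide` layout, where NODE is the seventh column):

$ oc get pods -o wide --no-headers | awk '{count[$7]++} END {for (n in count) print n, count[n]}'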
It is incredibly difficult to see the "expected" results for this, especially in empty stock clusters. It's very likely that the different profiles are taking effect, but the differences between nodes are so small that you see (almost) no difference. Can you share any output from the kube-scheduler pods when you set the different profiles? We should see what config is being used by the scheduler that way (you may need to set `LogLevel: Debug`)
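For reference, one way to pull the rendered config out of a running scheduler pod is to grep its log for the startup line (a sketch; the pod name is cluster-specific, and it assumes the main container is named kube-scheduler):

$ oc -n openshift-kube-scheduler logs <openshift-kube-scheduler-pod> -c kube-scheduler | grep -m1 'Using component config'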
Hi Mike,

Thank you for your reply. I gathered the `oc adm inspect` information for the openshift-kube-scheduler namespace when I set the different profiles.

$ oc patch kubeschedulers.operator/cluster --type=json -p '[{"op": "replace", "path": "/spec/logLevel", "value": "Debug" }]'
kubescheduler.operator.openshift.io/cluster patched
$ oc get kubeschedulers.operator/cluster -o yaml | grep logLevel
  logLevel: Debug
$ oc -n testacore4 get pods -o wide | wc -l
301
$ oc -n testacore4 get pods -o wide | grep australiaeast1 | wc -l
100
$ oc -n testacore4 get pods -o wide | grep australiaeast2 | wc -l
100
$ oc -n testacore4 get pods -o wide | grep australiaeast3 | wc -l
100

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 9
  name: cluster
  resourceVersion: "300780"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
status: {}

$ oc patch scheduler cluster --type='merge' -p '{"spec":{"profile":"HighNodeUtilization"}}'
scheduler.config.openshift.io/cluster patched

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 10
  name: cluster
  resourceVersion: "305727"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: HighNodeUtilization
status: {}

$ oc -n openshift-kube-scheduler get pods | grep -v Completed
NAME                                                    READY   STATUS    RESTARTS   AGE
openshift-kube-scheduler-yhe-testathon-6kn9r-master-0   3/3     Running   0          78s
openshift-kube-scheduler-yhe-testathon-6kn9r-master-1   3/3     Running   0          3m20s
openshift-kube-scheduler-yhe-testathon-6kn9r-master-2   3/3     Running   0          2m19s

$ oc -n testacore4 rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
$ oc -n testacore4 get pods -o wide | grep australiaeast1 | wc -l
102
$ oc -n testacore4 get pods -o wide | grep australiaeast2 | wc -l
99
$ oc -n testacore4 get pods -o wide | grep australiaeast3 | wc -l
99

$ oc adm inspect ns/openshift-kube-scheduler
Gathering data for ns/openshift-kube-scheduler...
Wrote inspect data to inspect.local.5211842808652610673.
$ tar cvaf inspect.tar.gz inspect.local.5211842808652610673/

Please kindly check and let me know if anything else is needed.

Regards,
Yiyong
Thanks for providing that information. From the logs I see the scheduler is using the following config:

...
    score:
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
      - name: InterPodAffinity
        weight: 1
      - name: NodeAffinity
        weight: 1
      - name: NodePreferAvoidPods
        weight: 10000
      - name: PodTopologySpread
        weight: 2
      - name: TaintToleration
        weight: 1
      - name: NodeResourcesMostAllocated
        weight: 0
...

So while NodeResourcesMostAllocated is enabled, it has a weight of 0, which removes it from scoring entirely. NodeResourcesBalancedAllocation may also be interfering, with a weight of 1. I will definitely look more into our predefined profiles to see if there are adjustments we can make.
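To make the weight-0 point concrete: the scheduler's final score for each node is the weighted sum of the individual plugins' normalized (0-100) scores, so a plugin at weight 0 contributes nothing no matter how strongly it prefers a node. A toy illustration with made-up plugin scores (not real scheduler output):

# nodeA: MostAllocated=80, PodTopologySpread=40; nodeB: MostAllocated=20, PodTopologySpread=60
$ echo "nodeA=$(( 80 * 0 + 40 * 2 )) nodeB=$(( 20 * 0 + 60 * 2 ))"
nodeA=80 nodeB=120
# Spreading (weight 2) decides the outcome; bin-packing (weight 0) is ignored.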
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
This is still valid, removing lifecycleStale
I opened https://github.com/openshift/cluster-kube-scheduler-operator/pull/378 to address this. It will need to be backported to 4.8 and 4.7 as well.

First, I tried setting the MostAllocated plugin to a weight of 1, but I still didn't see much change in the placement. I think this is because other default spreading strategies, like topology spreading, were outweighing it. Increasing the weight to 5 gave much better results:

$ oc get pods -o wide | grep ci-ln-c4z2fst-72292-pt4p7-worker-a-ttbcz | wc -l
235
$ oc get pods -o wide | grep ci-ln-c4z2fst-72292-pt4p7-worker-c-j9k2z | wc -l
67
$ oc get pods -o wide | grep ci-ln-c4z2fst-72292-pt4p7-worker-b-9cqss | wc -l
4

I think this makes sense, because we really want to drive this strategy's intent, which is bin packing.
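Once the fix lands, one hedged way to confirm the new weight is actually rendered into the live config is to grep the scheduler log for the plugin name with a trailing context window (the pod name is cluster-specific; each occurrence of the name is printed with the characters that follow it):

$ oc -n openshift-kube-scheduler logs <openshift-kube-scheduler-pod> -c kube-scheduler | grep -o 'NodeResourcesMostAllocated.\{0,30\}'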
Hello Mike, I tried to verify the bug here with the nightly below and I do not see the expected results as mentioned in comment 7. I followed steps similar to those in the description to reproduce the issue, and I see that the placement does not get affected.

[knarra@knarra cucushift]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-29-191648   True        False         176m    Cluster version is 4.10.0-0.nightly-2021-11-29-191648

4.10 pod creation:
=================
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-156-216.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-174-189.us-east-2.compute.internal | wc -l
102
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-203-52.us-east-2.compute.internal | wc -l
99

[knarra@knarra cucushift]$ oc patch Scheduler cluster --type='json' -p='[{"op": "add", "path": "/spec/profile", "value":"HighNodeUtilization"}]'
scheduler.config.openshift.io/cluster patched

I1201 08:27:02.532460 1 configfile.go:96] "Using component config" config="apiVersion: kubescheduler.config.k8s.io/v1beta1\nclientConnection:\n acceptContentTypes: \"\"\n burst: 100\n contentType: application/vnd.kubernetes.protobuf\n kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig\n qps: 50\nenableContentionProfiling: true\nenableProfiling: true\nhealthzBindAddress: 0.0.0.0:10251\nkind: KubeSchedulerConfiguration\nleaderElection:\n leaderElect: true\n leaseDuration: 2m17s\n renewDeadline: 1m47s\n resourceLock: configmaps\n resourceName: kube-scheduler\n resourceNamespace: openshift-kube-scheduler\n retryPeriod: 26s\nmetricsBindAddress: 0.0.0.0:10251\nparallelism: 16\npercentageOfNodesToScore: 0\npodInitialBackoffSeconds: 1\npodMaxBackoffSeconds: 10\nprofiles:\n- pluginConfig:\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: DefaultPreemptionArgs\n minCandidateNodesAbsolute: 100\n minCandidateNodesPercentage: 10\n name: DefaultPreemption\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n hardPodAffinityWeight: 1\n kind: InterPodAffinityArgs\n name: InterPodAffinity\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeAffinityArgs\n name: NodeAffinity\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesFitArgs\n scoringStrategy:\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n type: LeastAllocated\n name: NodeResourcesFit\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesMostAllocatedArgs\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n name: NodeResourcesMostAllocated\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n defaultingType: System\n kind: PodTopologySpreadArgs\n name: PodTopologySpread\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n bindTimeoutSeconds: 600\n kind: VolumeBindingArgs\n name: VolumeBinding\n plugins:\n bind:\n enabled:\n - name: DefaultBinder\n weight: 0\n filter:\n enabled:\n - name: NodeUnschedulable\n weight: 0\n - name: NodeName\n weight: 0\n - name: TaintToleration\n weight: 0\n - name: NodeAffinity\n weight: 0\n - name: NodePorts\n weight: 0\n - name: NodeResourcesFit\n weight: 0\n - name: VolumeRestrictions\n weight: 0\n - name: EBSLimits\n weight: 0\n - name: GCEPDLimits\n weight: 0\n - name: NodeVolumeLimits\n weight: 0\n - name: AzureDiskLimits\n weight: 0\n - name: VolumeBinding\n weight: 0\n - name: VolumeZone\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: InterPodAffinity\n weight: 0\n permit: {}\n postBind: {}\n postFilter:\n enabled:\n - name: DefaultPreemption\n weight: 0\n preBind:\n enabled:\n - name: VolumeBinding\n weight: 0\n preFilter:\n enabled:\n - name: NodeResourcesFit\n weight: 0\n - name: NodePorts\n weight: 0\n - name: VolumeRestrictions\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: InterPodAffinity\n weight: 0\n - name: VolumeBinding\n weight: 0\n - name: NodeAffinity\n weight: 0\n preScore:\n enabled:\n - name: InterPodAffinity\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: TaintToleration\n weight: 0\n - name: NodeAffinity\n weight: 0\n queueSort:\n enabled:\n - name: PrioritySort\n weight: 0\n reserve:\n enabled:\n - name: VolumeBinding\n weight: 0\n score:\n enabled:\n - name: ImageLocality\n weight: 1\n - name: InterPodAffinity\n weight: 1\n - name: NodeAffinity\n weight: 1\n - name: NodePreferAvoidPods\n weight: 10000\n - name: PodTopologySpread\n weight: 2\n - name: TaintToleration\n weight: 1\n - name: NodeResourcesMostAllocated\n weight: 5\n schedulerName: default-scheduler\n"
I1201 08:27:02.532488 1 server.go:144] "Starting Kubernetes Scheduler version" version="v1.22.1+bac83a5"

[knarra@knarra cucushift]$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted

[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-156-216.us-east-2.compute.internal | wc -l
101
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-174-189.us-east-2.compute.internal | wc -l
100
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-203-52.us-east-2.compute.internal | wc -l
99

Based on the above, moving the bug to Assigned state. I have the cluster handy in case you would want to take a look, thanks!!
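Worth noting from the rendered config above: NodeResourcesMostAllocated is now at weight 5, yet the NodeResourcesFit plugin still scores with `type: LeastAllocated`, which pulls in the opposite (spreading) direction. Since the whole config is logged as one escaped line, a hedged way to spot the strategy type quickly (pod name is cluster-specific):

$ oc -n openshift-kube-scheduler logs <openshift-kube-scheduler-pod> -c kube-scheduler | grep -o 'type: [A-Za-z]*'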
Tried reproducing the bug on 4.9 and below is what I see:
===========================================================

4.9 pod creation:
===========================
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-142-212.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-170-232.us-east-2.compute.internal | wc -l
101
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-200-222.us-east-2.compute.internal | wc -l
100

I1201 08:27:38.689680 1 configfile.go:96] "Using component config" config="apiVersion: kubescheduler.config.k8s.io/v1beta1\nclientConnection:\n acceptContentTypes: \"\"\n burst: 100\n contentType: application/vnd.kubernetes.protobuf\n kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig\n qps: 50\nenableContentionProfiling: true\nenableProfiling: true\nhealthzBindAddress: 0.0.0.0:10251\nkind: KubeSchedulerConfiguration\nleaderElection:\n leaderElect: true\n leaseDuration: 2m17s\n renewDeadline: 1m47s\n resourceLock: configmaps\n resourceName: kube-scheduler\n resourceNamespace: openshift-kube-scheduler\n retryPeriod: 26s\nmetricsBindAddress: 0.0.0.0:10251\nparallelism: 16\npercentageOfNodesToScore: 0\npodInitialBackoffSeconds: 1\npodMaxBackoffSeconds: 10\nprofiles:\n- pluginConfig:\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: DefaultPreemptionArgs\n minCandidateNodesAbsolute: 100\n minCandidateNodesPercentage: 10\n name: DefaultPreemption\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n hardPodAffinityWeight: 1\n kind: InterPodAffinityArgs\n name: InterPodAffinity\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeAffinityArgs\n name: NodeAffinity\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesBalancedAllocationArgs\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n name: NodeResourcesBalancedAllocation\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesFitArgs\n scoringStrategy:\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n type: LeastAllocated\n name: NodeResourcesFit\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesMostAllocatedArgs\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n name: NodeResourcesMostAllocated\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n defaultingType: System\n kind: PodTopologySpreadArgs\n name: PodTopologySpread\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n bindTimeoutSeconds: 600\n kind: VolumeBindingArgs\n name: VolumeBinding\n plugins:\n bind:\n enabled:\n - name: DefaultBinder\n weight: 0\n filter:\n enabled:\n - name: NodeUnschedulable\n weight: 0\n - name: NodeName\n weight: 0\n - name: TaintToleration\n weight: 0\n - name: NodeAffinity\n weight: 0\n - name: NodePorts\n weight: 0\n - name: NodeResourcesFit\n weight: 0\n - name: VolumeRestrictions\n weight: 0\n - name: EBSLimits\n weight: 0\n - name: GCEPDLimits\n weight: 0\n - name: NodeVolumeLimits\n weight: 0\n - name: AzureDiskLimits\n weight: 0\n - name: VolumeBinding\n weight: 0\n - name: VolumeZone\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: InterPodAffinity\n weight: 0\n permit: {}\n postBind: {}\n postFilter:\n enabled:\n - name: DefaultPreemption\n weight: 0\n preBind:\n enabled:\n - name: VolumeBinding\n weight: 0\n preFilter:\n enabled:\n - name: NodeResourcesFit\n weight: 0\n - name: NodePorts\n weight: 0\n - name: VolumeRestrictions\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: InterPodAffinity\n weight: 0\n - name: VolumeBinding\n weight: 0\n - name: NodeAffinity\n weight: 0\n preScore:\n enabled:\n - name: InterPodAffinity\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: TaintToleration\n weight: 0\n - name: NodeAffinity\n weight: 0\n queueSort:\n enabled:\n - name: PrioritySort\n weight: 0\n reserve:\n enabled:\n - name: VolumeBinding\n weight: 0\n score:\n enabled:\n - name: NodeResourcesBalancedAllocation\n weight: 1\n - name: ImageLocality\n weight: 1\n - name: InterPodAffinity\n weight: 1\n - name: NodeAffinity\n weight: 1\n - name: NodePreferAvoidPods\n weight: 10000\n - name: PodTopologySpread\n weight: 2\n - name: TaintToleration\n weight: 1\n - name: NodeResourcesMostAllocated\n weight: 0\n schedulerName: default-scheduler\n"

[knarra@knarra cucushift]$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
[knarra@knarra cucushift]$ watch oc get pods

[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-142-212.us-east-2.compute.internal | wc -l
100
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-170-232.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-200-222.us-east-2.compute.internal | wc -l
101
## Checking the server version and setting the HighNodeUtilization profile

```
$ oc version
Client Version: 4.9.0-0.nightly-2021-08-26-040328
Server Version: 4.11.0-0.nightly-2022-01-28-102930
Kubernetes Version: v1.23.0+d30ebbc

$ oc get scheduler cluster -o json | jq ".spec.profile"
"HighNodeUtilization"
```

## Deploying 300 pods

```
$ oc new-app httpd --name httpd
...
$ oc new-app httpd --name httpd1
...
$ oc new-app httpd --name httpd2
...
$ oc scale --replicas=100 deployment httpd2
deployment.apps/httpd2 scaled
$ oc scale --replicas=100 deployment httpd1
deployment.apps/httpd1 scaled
$ oc scale --replicas=100 deployment httpd
deployment.apps/httpd scaled

$ oc get pods -o wide | grep ip-10-0-168-80.ec2.internal | wc -l
73
$ oc get pods -o wide | grep ip-10-0-155-82.ec2.internal | wc -l
227
$ oc get pods -o wide | grep ip-10-0-132-73.ec2.internal | wc -l
0
```
## Checking the server version and setting the HighNodeUtilization profile

```
$ oc version
Client Version: 4.9.0-0.nightly-2021-08-26-040328
Server Version: 4.10.0-0.nightly-2022-02-02-000921
Kubernetes Version: v1.23.3+b63be7f

$ oc get scheduler cluster -o json | jq ".spec.profile"
"HighNodeUtilization"
```

## Deploying 300 pods

```
$ oc new-app httpd --name httpd
...
$ oc new-app httpd --name httpd1
...
$ oc new-app httpd --name httpd2
...
$ oc scale --replicas=100 deployment httpd2
deployment.apps/httpd2 scaled
$ oc scale --replicas=100 deployment httpd1
deployment.apps/httpd1 scaled
$ oc scale --replicas=100 deployment httpd
deployment.apps/httpd scaled

$ oc get pods -o wide | grep ip-10-0-173-104.us-west-2.compute.internal | wc -l
0
$ oc get pods -o wide | grep ip-10-0-191-32.us-west-2.compute.internal | wc -l
75
$ oc get pods -o wide | grep ip-10-0-230-199.us-west-2.compute.internal | wc -l
225
```
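Since both runs above repeat the same per-node grep, a hedged loop variant that tallies pods on every worker in one go (it assumes the standard `node-role.kubernetes.io/worker` label):

```
$ for n in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    echo "${n#node/}: $(oc get pods -o wide --no-headers | grep -c "${n#node/}")"
  done
```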
I intentionally skipped the step of restarting the deployment pods:

```
$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
```

since this temporarily doubles the number of pods in the cluster, which can force an even distribution when the worker nodes do not have sufficient resources to run the doubled workload.
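If a reschedule is needed without doubling the pod count, scaling the deployments to zero and back is a hedged alternative (unlike a rollout restart, this incurs downtime):

```
$ oc scale --replicas=0 deployment httpd httpd1 httpd2
# wait for all pods to terminate, then:
$ oc scale --replicas=100 deployment httpd httpd1 httpd2
```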
Verified the bug with the build below and it works fine.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         106m    Cluster version is 4.10.0-rc.0

Below are the steps I followed to verify the bug:
==================================================
1) Install latest 4.10 cluster
2) Create a new project
3) Create 300 pods

# oc new-app httpd --name httpd
# oc new-app httpd --name httpd1
# oc new-app httpd --name httpd2
# oc scale --replicas=100 deployment httpd
# oc scale --replicas=100 deployment httpd1
# oc scale --replicas=100 deployment httpd2

Distribution of pods is as below:
=====================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
102

Here is the config:
=====================
-------------------------Configuration File Contents Start Here----------------------
apiVersion: kubescheduler.config.k8s.io/v1beta3
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig
  qps: 50
enableContentionProfiling: true
enableProfiling: true
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  leaseDuration: 2m17s
  renewDeadline: 1m47s
  resourceLock: configmaps
  resourceName: kube-scheduler
  resourceNamespace: openshift-kube-scheduler
  retryPeriod: 26s
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: DefaultPreemptionArgs
      minCandidateNodesAbsolute: 100
      minCandidateNodesPercentage: 10
    name: DefaultPreemption
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      hardPodAffinityWeight: 1
      kind: InterPodAffinityArgs
    name: InterPodAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeAffinityArgs
    name: NodeAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeResourcesBalancedAllocationArgs
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
    name: NodeResourcesBalancedAllocation
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeResourcesFitArgs
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
        type: LeastAllocated
    name: NodeResourcesFit
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      defaultingType: System
      kind: PodTopologySpreadArgs
    name: PodTopologySpread
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      bindTimeoutSeconds: 600
      kind: VolumeBindingArgs
    name: VolumeBinding
  plugins:
    bind: {}
    filter: {}
    multiPoint:
      enabled:
      - name: PrioritySort
        weight: 0
      - name: NodeUnschedulable
        weight: 0
      - name: NodeName
        weight: 0
      - name: TaintToleration
        weight: 3
      - name: NodeAffinity
        weight: 2
      - name: NodePorts
        weight: 0
      - name: NodeResourcesFit
        weight: 1
      - name: VolumeRestrictions
        weight: 0
      - name: EBSLimits
        weight: 0
      - name: GCEPDLimits
        weight: 0
      - name: NodeVolumeLimits
        weight: 0
      - name: AzureDiskLimits
        weight: 0
      - name: VolumeBinding
        weight: 0
      - name: VolumeZone
        weight: 0
      - name: PodTopologySpread
        weight: 2
      - name: InterPodAffinity
        weight: 2
      - name: DefaultPreemption
        weight: 0
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
      - name: DefaultBinder
        weight: 0
    permit: {}
    postBind: {}
    postFilter: {}
    preBind: {}
    preFilter: {}
    preScore: {}
    queueSort: {}
    reserve: {}
    score: {}
  schedulerName: default-scheduler
------------------------------------Configuration File Contents End Here---------------------------------

4) Alter the profile to HighNodeUtilization using the command below

# oc patch Scheduler cluster --type='json' -p='[{"op": "add", "path": "/spec/profile", "value":"HighNodeUtilization"}]'

5) Deploy three hundred pods again using the commands below

# oc new-app httpd --name httpd
# oc new-app httpd --name httpd1
# oc new-app httpd --name httpd2
# oc scale --replicas=100 deployment httpd
# oc scale --replicas=100 deployment httpd1
# oc scale --replicas=100 deployment httpd2

Distribution of pods after altering profile to HighNodeUtilization:
===================================================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
139
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
135
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
26

-------------------------Configuration File Contents Start Here----------------------
apiVersion: kubescheduler.config.k8s.io/v1beta3
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig
  qps: 50
enableContentionProfiling: true
enableProfiling: true
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  leaseDuration: 2m17s
  renewDeadline: 1m47s
  resourceLock: configmaps
  resourceName: kube-scheduler
  resourceNamespace: openshift-kube-scheduler
  retryPeriod: 26s
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: DefaultPreemptionArgs
      minCandidateNodesAbsolute: 100
      minCandidateNodesPercentage: 10
    name: DefaultPreemption
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      hardPodAffinityWeight: 1
      kind: InterPodAffinityArgs
    name: InterPodAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeAffinityArgs
    name: NodeAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeResourcesBalancedAllocationArgs
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
    name: NodeResourcesBalancedAllocation
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeResourcesFitArgs
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
        type: MostAllocated
    name: NodeResourcesFit
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      defaultingType: System
      kind: PodTopologySpreadArgs
    name: PodTopologySpread
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      bindTimeoutSeconds: 600
      kind: VolumeBindingArgs
    name: VolumeBinding
  plugins:
    bind: {}
    filter: {}
    multiPoint:
      enabled:
      - name: PrioritySort
        weight: 0
      - name: NodeUnschedulable
        weight: 0
      - name: NodeName
        weight: 0
      - name: TaintToleration
        weight: 3
      - name: NodeAffinity
        weight: 2
      - name: NodePorts
        weight: 0
      - name: NodeResourcesFit
        weight: 1
      - name: VolumeRestrictions
        weight: 0
      - name: EBSLimits
        weight: 0
      - name: GCEPDLimits
        weight: 0
      - name: NodeVolumeLimits
        weight: 0
      - name: AzureDiskLimits
        weight: 0
      - name: VolumeBinding
        weight: 0
      - name: VolumeZone
        weight: 0
      - name: PodTopologySpread
        weight: 2
      - name: InterPodAffinity
        weight: 2
      - name: DefaultPreemption
        weight: 0
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
      - name: DefaultBinder
        weight: 0
    permit: {}
    postBind: {}
    postFilter: {}
    preBind: {}
    preFilter: {}
    preScore: {}
    queueSort: {}
    reserve: {}
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
        weight: 0
      enabled:
      - name: NodeResourcesFit
        weight: 5
  schedulerName: default-scheduler
------------------------------------Configuration File Contents End Here---------------------------------

Tried the same procedure as described in the bug description, and below are the values I see before and after altering the profile.

Before altering the profile:
==============================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
102

After altering the profile:
================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
66
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
117
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
117

Checked with Jan and the suggestion was to repeat the same with more worker nodes, i.e. 5 or 7, to see the actual difference; will try the same and then move the bug to verified state.
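For completeness, a hedged way to read the rendered scoring strategy directly instead of eyeballing the full config dump (this assumes the operator publishes the scheduler config in a configmap named config, with a config.yaml data key, in the openshift-kube-scheduler namespace; adjust if your cluster stores it elsewhere):

# oc -n openshift-kube-scheduler get configmap config -o jsonpath='{.data.config\.yaml}' | grep -B2 'type:'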
Verified the bug with the build below; here is the procedure I followed.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         20h     Cluster version is 4.10.0-rc.0

Procedure followed to verify the bug:
====================================
1) Install cluster with latest 4.10 bits
2) Deploy 300 pods using the commands below
   # oc new-app httpd --name=httpd
   # oc new-app httpd --name httpd1
   # oc new-app httpd --name httpd2
3) Scale the app pods using the commands below
   # oc scale --replicas=100 deployment httpd
   # oc scale --replicas=100 deployment httpd1
   # oc scale --replicas=100 deployment httpd2
4) Retrieve the number of pods scheduled to each node
   # oc get pods -o wide | grep <node_name> | wc -l
5) Alter the scheduler profile to HighNodeUtilization using the command below
   # oc patch Scheduler cluster --type='json' -p='[{"op": "add", "path": "/spec/profile", "value":"HighNodeUtilization"}]'
6) Wait for the kube-scheduler pods to restart
7) Roll out a restart of the pods
   # oc rollout restart deployment httpd httpd1 httpd2
8) Wait for the pods to terminate and start fresh
9) Repeat steps 1 through 8 for clusters with 3, 5 & 7 worker nodes

Observations made during the test:
============================================

7 worker node cluster:
=========================
Before altering the profile:
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-129-190.us-east-2.compute.internal | wc -l
36
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-145-88.us-east-2.compute.internal | wc -l
37
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-150-102.us-east-2.compute.internal | wc -l
38
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-180-1.us-east-2.compute.internal | wc -l
48
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-182-90.us-east-2.compute.internal | wc -l
47
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-207-182.us-east-2.compute.internal | wc -l
48
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-104.us-east-2.compute.internal | wc -l
46

After profile (1st rollout):
============================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-129-190.us-east-2.compute.internal | wc -l
101
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-145-88.us-east-2.compute.internal | wc -l
6
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-150-102.us-east-2.compute.internal | wc -l
3
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-180-1.us-east-2.compute.internal | wc -l
39
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-182-90.us-east-2.compute.internal | wc -l
31
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-207-182.us-east-2.compute.internal | wc -l
26
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-104.us-east-2.compute.internal | wc -l
94

After profile (2nd rollout):
===================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-129-190.us-east-2.compute.internal | wc -l
150
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-145-88.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-150-102.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-180-1.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-182-90.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-207-182.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-104.us-east-2.compute.internal | wc -l
150

5 worker node cluster:
=========================
Before altering the profile:
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-133-12.us-east-2.compute.internal | wc -l
55
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-159-8.us-east-2.compute.internal | wc -l
57
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-178-40.us-east-2.compute.internal | wc -l
56
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-191-234.us-east-2.compute.internal | wc -l
55
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-94.us-east-2.compute.internal | wc -l
77

After profile (1st rollout):
============================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-133-12.us-east-2.compute.internal | wc -l
121
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-159-8.us-east-2.compute.internal | wc -l
2
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-178-40.us-east-2.compute.internal | wc -l
20
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-191-234.us-east-2.compute.internal | wc -l
87
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-94.us-east-2.compute.internal | wc -l
70

After profile (2nd rollout):
=================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-133-12.us-east-2.compute.internal | wc -l
101
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-159-8.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-178-40.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-191-234.us-east-2.compute.internal | wc -l
102
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-94.us-east-2.compute.internal | wc -l
97

3 worker node cluster:
==========================
Before altering the profile:
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
102

After profile (1st rollout):
==========================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
108
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
111
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
81

Based on the above, there is definitely a difference after changing the scheduler profile to HighNodeUtilization, so I am moving the bug to verified state.
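A hedged way to script step 6 instead of watching the pods manually is to wait for the kube-scheduler cluster operator to finish progressing after the profile patch (assuming the standard clusteroperator conditions):

# oc wait --for=condition=Progressing=False clusteroperator/kube-scheduler --timeout=10m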
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056