Bug 2002300
| Summary: | Altering the Schedule Profile configurations doesn't affect the placement of the pods | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | yhe | |
| Component: | kube-scheduler | Assignee: | Jan Chaloupka <jchaloup> | |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | low | |||
| Version: | 4.10 | CC: | aos-bugs, jchaloup, mfojtik | |
| Target Milestone: | --- | |||
| Target Release: | 4.10.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2026109 (view as bug list) | Environment: | ||
| Last Closed: | 2022-03-10 16:08:32 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2026109, 2026111 | |||
This is a case where "expected" results are incredibly difficult to observe, especially in empty stock clusters. It is very likely that the different profiles are taking effect, but the differences between nodes are so small that you see (almost) no difference. Can you share any output from the kube-scheduler pods when you set the different profiles? That way we can see what config the scheduler is actually using (you may need to set `LogLevel: Debug`).

Hi Mike,
Thank you for your reply.
I gathered the `oc adm inspect` data for the openshift-kube-scheduler namespace while setting the different profiles.
$ oc patch kubeschedulers.operator/cluster --type=json -p '[{"op": "replace", "path": "/spec/logLevel", "value": "Debug" }]'
kubescheduler.operator.openshift.io/cluster patched
$ oc get kubeschedulers.operator/cluster -o yaml | grep logLevel
logLevel: Debug
$ oc -n testacore4 get pods -o wide | wc -l
301
$ oc -n testacore4 get pods -o wide | grep australiaeast1 | wc -l
100
$ oc -n testacore4 get pods -o wide | grep australiaeast2 | wc -l
100
$ oc -n testacore4 get pods -o wide | grep australiaeast3 | wc -l
100
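The repeated `grep | wc -l` pipelines above can be collapsed into a single pass over the `oc get pods -o wide` output; a minimal sketch assuming the default column layout, where the node name is the 7th whitespace-separated field (the sample pod lines are hypothetical):

```python
from collections import Counter

# Hypothetical sample in the default `oc get pods -o wide` layout; the node
# name is the 7th whitespace-separated column
# (NAME READY STATUS RESTARTS AGE IP NODE NOMINATED-NODE READINESS-GATES).
sample = """\
httpd-7f9d8-abcde    1/1  Running  0  5m  10.128.2.10  australiaeast1  <none>  <none>
httpd1-5c6b7-fghij   1/1  Running  0  5m  10.129.2.11  australiaeast2  <none>  <none>
httpd2-8d4e5-klmno   1/1  Running  0  5m  10.128.2.12  australiaeast1  <none>  <none>
"""

def pods_per_node(text):
    """Count pods per node from `oc get pods -o wide` output lines."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 7:
            counts[fields[6]] += 1
    return counts

print(pods_per_node(sample))
```

Feeding it the full output (e.g. `pods_per_node(subprocess.check_output(["oc", "get", "pods", "-o", "wide"], text=True))`) reports all nodes in one call instead of one grep per node.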
$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
creationTimestamp: "2021-09-07T08:25:45Z"
generation: 9
name: cluster
resourceVersion: "300780"
uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
mastersSchedulable: false
policy:
name: ""
status: {}
$ oc patch scheduler cluster --type='merge' -p '{"spec":{"profile":"HighNodeUtilization"}}'
scheduler.config.openshift.io/cluster patched
$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
creationTimestamp: "2021-09-07T08:25:45Z"
generation: 10
name: cluster
resourceVersion: "305727"
uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
mastersSchedulable: false
policy:
name: ""
profile: HighNodeUtilization
status: {}
$ oc -n openshift-kube-scheduler get pods | grep -v Completed
NAME READY STATUS RESTARTS AGE
openshift-kube-scheduler-yhe-testathon-6kn9r-master-0 3/3 Running 0 78s
openshift-kube-scheduler-yhe-testathon-6kn9r-master-1 3/3 Running 0 3m20s
openshift-kube-scheduler-yhe-testathon-6kn9r-master-2 3/3 Running 0 2m19s
$ oc -n testacore4 rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
$ oc -n testacore4 get pods -o wide | grep australiaeast1 | wc -l
102
$ oc -n testacore4 get pods -o wide | grep australiaeast2 | wc -l
99
$ oc -n testacore4 get pods -o wide | grep australiaeast3 | wc -l
99
$ oc adm inspect ns/openshift-kube-scheduler
Gathering data for ns/openshift-kube-scheduler...
Wrote inspect data to inspect.local.5211842808652610673.
$ tar cvaf inspect.tar.gz inspect.local.5211842808652610673/
Please check and let me know if anything else is needed.
Regards,
Yiyong
Thanks for providing that information. From the logs, I see the scheduler is using the following config:
...
score:
enabled:
- name: NodeResourcesBalancedAllocation
weight: 1
- name: ImageLocality
weight: 1
- name: InterPodAffinity
weight: 1
- name: NodeAffinity
weight: 1
- name: NodePreferAvoidPods
weight: 10000
- name: PodTopologySpread
weight: 2
- name: TaintToleration
weight: 1
- name: NodeResourcesMostAllocated
weight: 0
...
So while NodeResourcesMostAllocated is enabled, it has a weight of 0. NodeResourcesBalancedAllocation, with a weight of 1, may also be interfering. I will look more closely into our predefined profiles to see whether there are adjustments we can make.
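A node's final score is the weighted sum of the enabled score plugins, which is why a weight of 0 zeroes a plugin out entirely. A minimal sketch with hypothetical per-plugin scores (the plugin names and weights mirror the config above; the score values are made up):

```python
# Hypothetical per-plugin scores for one candidate node (score plugins
# normalize to a 0-100 range); names and weights mirror the config above.
plugin_scores = {
    "NodeResourcesBalancedAllocation": 80,
    "ImageLocality": 50,
    "InterPodAffinity": 0,
    "NodeAffinity": 0,
    "NodePreferAvoidPods": 100,
    "PodTopologySpread": 60,
    "TaintToleration": 100,
    "NodeResourcesMostAllocated": 90,
}
weights = {
    "NodeResourcesBalancedAllocation": 1,
    "ImageLocality": 1,
    "InterPodAffinity": 1,
    "NodeAffinity": 1,
    "NodePreferAvoidPods": 10000,
    "PodTopologySpread": 2,
    "TaintToleration": 1,
    "NodeResourcesMostAllocated": 0,
}

def node_score(scores, weights):
    # Final node score = sum over enabled score plugins of weight * score.
    return sum(weights[name] * s for name, s in scores.items())

# With weight 0, NodeResourcesMostAllocated contributes nothing, no matter
# how strongly it prefers the packed node:
packed = node_score(plugin_scores, weights)
ignored = node_score({**plugin_scores, "NodeResourcesMostAllocated": 0}, weights)
assert packed == ignored
```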
This bug hasn't had any activity in the last 30 days. Maybe the problem was resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it is still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to the Whiteboard if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

This is still valid, removing LifecycleStale.

I opened https://github.com/openshift/cluster-kube-scheduler-operator/pull/378 to address this. It will need to be backported to 4.8 and 4.7 as well.

First, I tried setting the MostAllocated plugin with a weight of 1, but I still didn't see much change in the output. I think this is because other default spreading strategies, like topology spreading, are still in play. Increasing the weight to 5 gave much better results:

$ oc get pods -o wide | grep ci-ln-c4z2fst-72292-pt4p7-worker-a-ttbcz | wc -l
235
$ oc get pods -o wide | grep ci-ln-c4z2fst-72292-pt4p7-worker-c-j9k2z | wc -l
67
$ oc get pods -o wide | grep ci-ln-c4z2fst-72292-pt4p7-worker-b-9cqss | wc -l
4

I think this makes sense, because we want to really drive this strategy's intent: bin packing.

Hello Mike, I tried to verify the bug with the nightly below and I do not see the expected results as mentioned in comment 7. I followed steps similar to those in the description to reproduce the issue, and I see that the placement does not get affected.
[knarra@knarra cucushift]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2021-11-29-191648 True False 176m Cluster version is 4.10.0-0.nightly-2021-11-29-191648

4.10 pod creation:
=================
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-156-216.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-174-189.us-east-2.compute.internal | wc -l
102
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-203-52.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc patch Scheduler cluster --type='json' -p='[{"op": "add", "path": "/spec/profile", "value":"HighNodeUtilization"}]'
scheduler.config.openshift.io/cluster patched
I1201 08:27:02.532460 1 configfile.go:96] "Using component config" config="apiVersion: kubescheduler.config.k8s.io/v1beta1\nclientConnection:\n acceptContentTypes: \"\"\n burst: 100\n contentType: application/vnd.kubernetes.protobuf\n kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig\n qps: 50\nenableContentionProfiling: true\nenableProfiling: true\nhealthzBindAddress: 0.0.0.0:10251\nkind: KubeSchedulerConfiguration\nleaderElection:\n leaderElect: true\n leaseDuration: 2m17s\n renewDeadline: 1m47s\n resourceLock: configmaps\n resourceName: kube-scheduler\n resourceNamespace: openshift-kube-scheduler\n retryPeriod: 26s\nmetricsBindAddress: 0.0.0.0:10251\nparallelism: 16\npercentageOfNodesToScore: 0\npodInitialBackoffSeconds: 1\npodMaxBackoffSeconds: 10\nprofiles:\n- pluginConfig:\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: DefaultPreemptionArgs\n minCandidateNodesAbsolute: 100\n minCandidateNodesPercentage: 10\n name: DefaultPreemption\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n hardPodAffinityWeight: 1\n kind: InterPodAffinityArgs\n name: InterPodAffinity\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: 
NodeAffinityArgs\n name: NodeAffinity\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesFitArgs\n scoringStrategy:\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n type: LeastAllocated\n name: NodeResourcesFit\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesMostAllocatedArgs\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n name: NodeResourcesMostAllocated\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n defaultingType: System\n kind: PodTopologySpreadArgs\n name: PodTopologySpread\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n bindTimeoutSeconds: 600\n kind: VolumeBindingArgs\n name: VolumeBinding\n plugins:\n bind:\n enabled:\n - name: DefaultBinder\n weight: 0\n filter:\n enabled:\n - name: NodeUnschedulable\n weight: 0\n - name: NodeName\n weight: 0\n - name: TaintToleration\n weight: 0\n - name: NodeAffinity\n weight: 0\n - name: NodePorts\n weight: 0\n - name: NodeResourcesFit\n weight: 0\n - name: VolumeRestrictions\n weight: 0\n - name: EBSLimits\n weight: 0\n - name: GCEPDLimits\n weight: 0\n - name: NodeVolumeLimits\n weight: 0\n - name: AzureDiskLimits\n weight: 0\n - name: VolumeBinding\n weight: 0\n - name: VolumeZone\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: InterPodAffinity\n weight: 0\n permit: {}\n postBind: {}\n postFilter:\n enabled:\n - name: DefaultPreemption\n weight: 0\n preBind:\n enabled:\n - name: VolumeBinding\n weight: 0\n preFilter:\n enabled:\n - name: NodeResourcesFit\n weight: 0\n - name: NodePorts\n weight: 0\n - name: VolumeRestrictions\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: InterPodAffinity\n weight: 0\n - name: VolumeBinding\n weight: 0\n - name: NodeAffinity\n weight: 0\n preScore:\n enabled:\n - name: InterPodAffinity\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: TaintToleration\n weight: 0\n - name: NodeAffinity\n weight: 0\n queueSort:\n 
enabled:\n - name: PrioritySort\n weight: 0\n reserve:\n enabled:\n - name: VolumeBinding\n weight: 0\n score:\n enabled:\n - name: ImageLocality\n weight: 1\n - name: InterPodAffinity\n weight: 1\n - name: NodeAffinity\n weight: 1\n - name: NodePreferAvoidPods\n weight: 10000\n - name: PodTopologySpread\n weight: 2\n - name: TaintToleration\n weight: 1\n - name: NodeResourcesMostAllocated\n weight: 5\n schedulerName: default-scheduler\n"
I1201 08:27:02.532488 1 server.go:144] "Starting Kubernetes Scheduler version" version="v1.22.1+bac83a5"
oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-156-216.us-east-2.compute.internal | wc -l
101
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-174-189.us-east-2.compute.internal | wc -l
100
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-203-52.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$

Based on the above, moving the bug to Assigned state. I have the cluster handy in case you would like to take a look, thanks!

Tried reproducing the bug on 4.9 and below is what I see:
===========================================================
4.9 pod creation:
===========================
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-142-212.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-170-232.us-east-2.compute.internal | wc -l
101
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-200-222.us-east-2.compute.internal | wc -l
100
I1201 08:27:38.689680 1 configfile.go:96] "Using component config" config="apiVersion: kubescheduler.config.k8s.io/v1beta1\nclientConnection:\n acceptContentTypes: \"\"\n burst: 100\n contentType: application/vnd.kubernetes.protobuf\n kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig\n qps: 50\nenableContentionProfiling: true\nenableProfiling: true\nhealthzBindAddress: 0.0.0.0:10251\nkind: KubeSchedulerConfiguration\nleaderElection:\n leaderElect: true\n leaseDuration: 2m17s\n renewDeadline: 1m47s\n resourceLock: configmaps\n resourceName: kube-scheduler\n resourceNamespace: openshift-kube-scheduler\n retryPeriod: 26s\nmetricsBindAddress: 0.0.0.0:10251\nparallelism: 16\npercentageOfNodesToScore: 0\npodInitialBackoffSeconds: 1\npodMaxBackoffSeconds: 10\nprofiles:\n- pluginConfig:\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: DefaultPreemptionArgs\n minCandidateNodesAbsolute: 100\n minCandidateNodesPercentage: 10\n name: DefaultPreemption\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n hardPodAffinityWeight: 1\n kind: InterPodAffinityArgs\n name: InterPodAffinity\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeAffinityArgs\n name: NodeAffinity\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesBalancedAllocationArgs\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n name: NodeResourcesBalancedAllocation\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesFitArgs\n scoringStrategy:\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n type: LeastAllocated\n name: NodeResourcesFit\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n kind: NodeResourcesMostAllocatedArgs\n resources:\n - name: cpu\n weight: 1\n - name: memory\n weight: 1\n name: NodeResourcesMostAllocated\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n defaultingType: System\n kind: 
PodTopologySpreadArgs\n name: PodTopologySpread\n - args:\n apiVersion: kubescheduler.config.k8s.io/v1beta1\n bindTimeoutSeconds: 600\n kind: VolumeBindingArgs\n name: VolumeBinding\n plugins:\n bind:\n enabled:\n - name: DefaultBinder\n weight: 0\n filter:\n enabled:\n - name: NodeUnschedulable\n weight: 0\n - name: NodeName\n weight: 0\n - name: TaintToleration\n weight: 0\n - name: NodeAffinity\n weight: 0\n - name: NodePorts\n weight: 0\n - name: NodeResourcesFit\n weight: 0\n - name: VolumeRestrictions\n weight: 0\n - name: EBSLimits\n weight: 0\n - name: GCEPDLimits\n weight: 0\n - name: NodeVolumeLimits\n weight: 0\n - name: AzureDiskLimits\n weight: 0\n - name: VolumeBinding\n weight: 0\n - name: VolumeZone\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: InterPodAffinity\n weight: 0\n permit: {}\n postBind: {}\n postFilter:\n enabled:\n - name: DefaultPreemption\n weight: 0\n preBind:\n enabled:\n - name: VolumeBinding\n weight: 0\n preFilter:\n enabled:\n - name: NodeResourcesFit\n weight: 0\n - name: NodePorts\n weight: 0\n - name: VolumeRestrictions\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: InterPodAffinity\n weight: 0\n - name: VolumeBinding\n weight: 0\n - name: NodeAffinity\n weight: 0\n preScore:\n enabled:\n - name: InterPodAffinity\n weight: 0\n - name: PodTopologySpread\n weight: 0\n - name: TaintToleration\n weight: 0\n - name: NodeAffinity\n weight: 0\n queueSort:\n enabled:\n - name: PrioritySort\n weight: 0\n reserve:\n enabled:\n - name: VolumeBinding\n weight: 0\n score:\n enabled:\n - name: NodeResourcesBalancedAllocation\n weight: 1\n - name: ImageLocality\n weight: 1\n - name: InterPodAffinity\n weight: 1\n - name: NodeAffinity\n weight: 1\n - name: NodePreferAvoidPods\n weight: 10000\n - name: PodTopologySpread\n weight: 2\n - name: TaintToleration\n weight: 1\n - name: NodeResourcesMostAllocated\n weight: 0\n schedulerName: default-scheduler\n"
[knarra@knarra cucushift]$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
[knarra@knarra cucushift]$ watch oc get pods
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-142-212.us-east-2.compute.internal | wc -l
100
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-170-232.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-200-222.us-east-2.compute.internal | wc -l
101
## Checking the server version and setting the HighNodeUtilization profile

```
$ oc version
Client Version: 4.9.0-0.nightly-2021-08-26-040328
Server Version: 4.11.0-0.nightly-2022-01-28-102930
Kubernetes Version: v1.23.0+d30ebbc
$ oc get scheduler cluster -o json | jq ".spec.profile"
"HighNodeUtilization"
```

## Deploying 300 pods

```
$ oc new-app httpd --name httpd
...
$ oc new-app httpd --name httpd1
...
$ oc new-app httpd --name httpd2
...
$ oc scale --replicas=100 deployment httpd2
deployment.apps/httpd2 scaled
$ oc scale --replicas=100 deployment httpd1
deployment.apps/httpd1 scaled
$ oc scale --replicas=100 deployment httpd
deployment.apps/httpd scaled
$ oc get pods -o wide | grep ip-10-0-168-80.ec2.internal | wc -l
73
$ oc get pods -o wide | grep ip-10-0-155-82.ec2.internal | wc -l
227
$ oc get pods -o wide | grep ip-10-0-132-73.ec2.internal | wc -l
0
```

## Checking the server version and setting the HighNodeUtilization profile

```
$ oc version
Client Version: 4.9.0-0.nightly-2021-08-26-040328
Server Version: 4.10.0-0.nightly-2022-02-02-000921
Kubernetes Version: v1.23.3+b63be7f
$ oc get scheduler cluster -o json | jq ".spec.profile"
"HighNodeUtilization"
```

## Deploying 300 pods

```
$ oc new-app httpd --name httpd
...
$ oc new-app httpd --name httpd1
...
$ oc new-app httpd --name httpd2
...
$ oc scale --replicas=100 deployment httpd2
deployment.apps/httpd2 scaled
$ oc scale --replicas=100 deployment httpd1
deployment.apps/httpd1 scaled
$ oc scale --replicas=100 deployment httpd
deployment.apps/httpd scaled
$ oc get pods -o wide | grep ip-10-0-173-104.us-west-2.compute.internal | wc -l
0
$ oc get pods -o wide | grep ip-10-0-191-32.us-west-2.compute.internal | wc -l
75
$ oc get pods -o wide | grep ip-10-0-230-199.us-west-2.compute.internal | wc -l
225
```

I intentionally skipped the step of restarting the deployment pods:

```
$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
```

since it temporarily doubles the number of pods in the cluster, which may cause an even distribution of pods when the worker nodes do not have sufficient resources to run the doubled workload.

Verified the bug with the build below and I see that it works fine.
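How much the pod count inflates during `oc rollout restart` depends on each deployment's `maxSurge`; a minimal sketch of the per-deployment upper bound, assuming RollingUpdate semantics and ignoring `maxUnavailable` (the 25% Kubernetes default gives 125 pods per 100-replica deployment, while a 100% surge would fully double it):

```python
import math

def peak_pods(replicas, max_surge_fraction):
    """Upper bound on simultaneously running pods for one deployment during
    a rolling restart (RollingUpdate semantics, ignoring maxUnavailable)."""
    return replicas + math.ceil(replicas * max_surge_fraction)

# Kubernetes' default maxSurge is 25%; a 100% surge would fully double.
assert peak_pods(100, 0.25) == 125
assert peak_pods(100, 1.0) == 200
print(3 * peak_pods(100, 0.25))  # three 100-replica deployments -> up to 375 pods
```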
[knarra@knarra ~]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-rc.0 True False 106m Cluster version is 4.10.0-rc.0
Below are the steps i followed to verify the bug:
==================================================
1) Install latest 4.10 cluster
2) create a new project
3) create 300 pods
# oc new-app httpd --name httpd
# oc new-app httpd --name httpd1
# oc new-app httpd --name httpd2
# oc scale --replicas=100 deployment httpd
# oc scale --replicas=100 deployment httpd1
# oc scale --replicas=100 deployment httpd2
Distribution of pods are as below:
=====================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
102
Here is the config:
=====================
-------------------------Configuration File Contents Start Here----------------------
apiVersion: kubescheduler.config.k8s.io/v1beta3
clientConnection:
acceptContentTypes: ""
burst: 100
contentType: application/vnd.kubernetes.protobuf
kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig
qps: 50
enableContentionProfiling: true
enableProfiling: true
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
leaseDuration: 2m17s
renewDeadline: 1m47s
resourceLock: configmaps
resourceName: kube-scheduler
resourceNamespace: openshift-kube-scheduler
retryPeriod: 26s
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
- pluginConfig:
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: DefaultPreemptionArgs
minCandidateNodesAbsolute: 100
minCandidateNodesPercentage: 10
name: DefaultPreemption
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
hardPodAffinityWeight: 1
kind: InterPodAffinityArgs
name: InterPodAffinity
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: NodeAffinityArgs
name: NodeAffinity
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: NodeResourcesBalancedAllocationArgs
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
name: NodeResourcesBalancedAllocation
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: NodeResourcesFitArgs
scoringStrategy:
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
type: LeastAllocated
name: NodeResourcesFit
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
defaultingType: System
kind: PodTopologySpreadArgs
name: PodTopologySpread
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
bindTimeoutSeconds: 600
kind: VolumeBindingArgs
name: VolumeBinding
plugins:
bind: {}
filter: {}
multiPoint:
enabled:
- name: PrioritySort
weight: 0
- name: NodeUnschedulable
weight: 0
- name: NodeName
weight: 0
- name: TaintToleration
weight: 3
- name: NodeAffinity
weight: 2
- name: NodePorts
weight: 0
- name: NodeResourcesFit
weight: 1
- name: VolumeRestrictions
weight: 0
- name: EBSLimits
weight: 0
- name: GCEPDLimits
weight: 0
- name: NodeVolumeLimits
weight: 0
- name: AzureDiskLimits
weight: 0
- name: VolumeBinding
weight: 0
- name: VolumeZone
weight: 0
- name: PodTopologySpread
weight: 2
- name: InterPodAffinity
weight: 2
- name: DefaultPreemption
weight: 0
- name: NodeResourcesBalancedAllocation
weight: 1
- name: ImageLocality
weight: 1
- name: DefaultBinder
weight: 0
permit: {}
postBind: {}
postFilter: {}
preBind: {}
preFilter: {}
preScore: {}
queueSort: {}
reserve: {}
score: {}
schedulerName: default-scheduler
------------------------------------Configuration File Contents End Here---------------------------------
4) Alter the profile to HighNodeUtilization using the command below
# oc patch Scheduler cluster --type='json' -p='[{"op": "add", "path": "/spec/profile", "value":"HighNodeUtilization"}]'
5) Deploy three hundred pods again using the commands below
# oc new-app httpd --name httpd
# oc new-app httpd --name httpd1
# oc new-app httpd --name httpd2
# oc scale --replicas=100 deployment httpd
# oc scale --replicas=100 deployment httpd1
# oc scale --replicas=100 deployment httpd2
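The `oc patch --type='json'` call in step 4 applies an RFC 6902 "add" operation; what it does to the Scheduler spec can be sketched with a minimal, hypothetical applier (object members only; the real client also handles arrays, `~`-escapes, and validation):

```python
def apply_add(doc, path, value):
    """Minimal RFC 6902 "add" op for nested object members."""
    parts = [p for p in path.split("/") if p]
    target = doc
    for p in parts[:-1]:
        target = target.setdefault(p, {})
    target[parts[-1]] = value
    return doc

# Spec shape taken from the `oc get scheduler cluster -o yaml` output above.
scheduler = {"spec": {"mastersSchedulable": False, "policy": {"name": ""}}}
apply_add(scheduler, "/spec/profile", "HighNodeUtilization")
assert scheduler["spec"]["profile"] == "HighNodeUtilization"
# The rest of the spec is untouched:
assert scheduler["spec"]["mastersSchedulable"] is False
```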
Distribution of pods after altering profile to HighNodeUtilization:
===================================================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
139
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
135
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
26
-------------------------Configuration File Contents Start Here----------------------
apiVersion: kubescheduler.config.k8s.io/v1beta3
clientConnection:
acceptContentTypes: ""
burst: 100
contentType: application/vnd.kubernetes.protobuf
kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig
qps: 50
enableContentionProfiling: true
enableProfiling: true
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
leaseDuration: 2m17s
renewDeadline: 1m47s
resourceLock: configmaps
resourceName: kube-scheduler
resourceNamespace: openshift-kube-scheduler
retryPeriod: 26s
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
- pluginConfig:
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: DefaultPreemptionArgs
minCandidateNodesAbsolute: 100
minCandidateNodesPercentage: 10
name: DefaultPreemption
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
hardPodAffinityWeight: 1
kind: InterPodAffinityArgs
name: InterPodAffinity
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: NodeAffinityArgs
name: NodeAffinity
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: NodeResourcesBalancedAllocationArgs
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
name: NodeResourcesBalancedAllocation
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: NodeResourcesFitArgs
scoringStrategy:
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
type: MostAllocated
name: NodeResourcesFit
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
defaultingType: System
kind: PodTopologySpreadArgs
name: PodTopologySpread
- args:
apiVersion: kubescheduler.config.k8s.io/v1beta3
bindTimeoutSeconds: 600
kind: VolumeBindingArgs
name: VolumeBinding
plugins:
bind: {}
filter: {}
multiPoint:
enabled:
- name: PrioritySort
weight: 0
- name: NodeUnschedulable
weight: 0
- name: NodeName
weight: 0
- name: TaintToleration
weight: 3
- name: NodeAffinity
weight: 2
- name: NodePorts
weight: 0
- name: NodeResourcesFit
weight: 1
- name: VolumeRestrictions
weight: 0
- name: EBSLimits
weight: 0
- name: GCEPDLimits
weight: 0
- name: NodeVolumeLimits
weight: 0
- name: AzureDiskLimits
weight: 0
- name: VolumeBinding
weight: 0
- name: VolumeZone
weight: 0
- name: PodTopologySpread
weight: 2
- name: InterPodAffinity
weight: 2
- name: DefaultPreemption
weight: 0
- name: NodeResourcesBalancedAllocation
weight: 1
- name: ImageLocality
weight: 1
- name: DefaultBinder
weight: 0
permit: {}
postBind: {}
postFilter: {}
preBind: {}
preFilter: {}
preScore: {}
queueSort: {}
reserve: {}
score:
disabled:
- name: NodeResourcesBalancedAllocation
weight: 0
enabled:
- name: NodeResourcesFit
weight: 5
schedulerName: default-scheduler
------------------------------------Configuration File Contents End Here---------------------------------
Tried the same procedure as described in the bug description and below are the values i see before and after altering the profile.
Before altering the profile:
==============================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
102
After altering the profile:
================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
66
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
117
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
117
Checked with Jan, and the suggestion was to repeat the same test with more worker nodes (i.e., 5 or 7) to see the actual difference; will try that and then move the bug to verified state.
Verified bug with the below build and below is the procedure i have followed to verify the bug.
[knarra@knarra ~]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-rc.0 True False 20h Cluster version is 4.10.0-rc.0
Procedure followed to verify the bug:
====================================
1) Install cluster with latest 4.10 bits
2) Deploy 300 pods using the commands below
# oc new-app httpd --name=httpd
# oc new-app httpd --name httpd1
# oc new-app httpd --name httpd2
3) Scale the app pods using the command below
# oc scale --replicas=100 deployment httpd
# oc scale --replicas=100 deployment httpd1
# oc scale --replicas=100 deployment httpd2
4) Now retrieve the number of pods scheduled to each node
# oc get pods -o wide | grep <node_name> | wc -l
5) Alter the scheduler profile to HighNodeUtilization using the command below
# oc patch Scheduler cluster --type='json' -p='[{"op": "add", "path": "/spec/profile", "value":"HighNodeUtilization"}]'
6) wait for kube-scheduler pods to restart
7) Now Run the commands below to rollout the restart of the pods
# oc rollout restart deployment httpd httpd1 httpd2
8) wait for the pods to terminate and start fresh.
9) Repeat steps 1 through 8 for cluster with 3, 5 & 7 worker nodes.
Observations that were made during the test:
============================================
7 worker node cluster:
=========================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-129-190.us-east-2.compute.internal | wc -l
36
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-145-88.us-east-2.compute.internal | wc -l
37
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-150-102.us-east-2.compute.internal | wc -l
38
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-180-1.us-east-2.compute.internal | wc -l
48
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-182-90.us-east-2.compute.internal | wc -l
47
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-207-182.us-east-2.compute.internal | wc -l
48
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-104.us-east-2.compute.internal | wc -l
46
After profile (1st rollout):
============================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-129-190.us-east-2.compute.internal | wc -l
101
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-145-88.us-east-2.compute.internal | wc -l
6
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-150-102.us-east-2.compute.internal | wc -l
3
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-180-1.us-east-2.compute.internal | wc -l
39
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-182-90.us-east-2.compute.internal | wc -l
31
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-207-182.us-east-2.compute.internal | wc -l
26
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-104.us-east-2.compute.internal | wc -l
94
After profile (2nd rollout):
===================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-129-190.us-east-2.compute.internal | wc -l
150
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-145-88.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-150-102.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-180-1.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-182-90.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-207-182.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-104.us-east-2.compute.internal | wc -l
150
5 node worker cluster:
=========================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-133-12.us-east-2.compute.internal | wc -l
55
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-159-8.us-east-2.compute.internal | wc -l
57
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-178-40.us-east-2.compute.internal | wc -l
56
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-191-234.us-east-2.compute.internal | wc -l
55
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-94.us-east-2.compute.internal | wc -l
77
After profile (1st rollout):
============================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-133-12.us-east-2.compute.internal | wc -l
121
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-159-8.us-east-2.compute.internal | wc -l
2
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-178-40.us-east-2.compute.internal | wc -l
20
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-191-234.us-east-2.compute.internal | wc -l
87
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-94.us-east-2.compute.internal | wc -l
70
After profile (2nd rollout):
=================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-133-12.us-east-2.compute.internal | wc -l
101
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-159-8.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-178-40.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-191-234.us-east-2.compute.internal | wc -l
102
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-94.us-east-2.compute.internal | wc -l
97
3 node worker cluster:
==========================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
102
After profile (1st rollout):
============================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
108
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
111
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
81
Based on the above it looks to me that there is definitely a difference after changing the scheduler profile to HighNodeUtilization: pods are packed onto fewer nodes after the change. Moving the bug to verified state.
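The packing effect can be quantified from the counts quoted above. A minimal sketch computing the population standard deviation of pods-per-node with awk, using the 5-node cluster numbers (before the profile change vs. after the 2nd rollout):

```shell
#!/bin/sh
# Std-dev of pods-per-node: a higher value means pods are packed
# onto fewer nodes (HighNodeUtilization) rather than spread evenly.
stddev() {
  printf '%s\n' "$@" | awk '{s+=$1; ss+=$1*$1; n++} END {m=s/n; printf "%.1f\n", sqrt(ss/n - m*m)}'
}
echo "before (default profile):     $(stddev 55 57 56 55 77)"
echo "after (HighNodeUtilization):  $(stddev 101 0 0 102 97)"
```

With these inputs the spread jumps from about 8.5 to about 49.0, and two nodes end up with zero pods — consistent with the NodeResourcesFit scoring flipping from spreading to bin-packing.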
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
Description of problem:
I deployed 300 pods (3 deployments * 100 pods per deployment) and changed the Scheduler Profile configuration through all 3 available values, but the placement looks almost identical each time. Please let me know if I got something wrong with the scheduler profile configuration.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Deploy 300 pods

$ oc new-app httpd --name httpd
$ oc new-app httpd --name httpd1
$ oc new-app httpd --name httpd2
$ oc scale --replicas=100 deployment httpd
$ oc scale --replicas=100 deployment httpd1
$ oc scale --replicas=100 deployment httpd2
$ oc get pods -o wide | wc -l
301
$ oc get pods -o wide | grep australiaeast1 | wc -l
96
$ oc get pods -o wide | grep australiaeast2 | wc -l
100
$ oc get pods -o wide | grep australiaeast3 | wc -l
104

2. Alter the Scheduler Profile configuration

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 7
  name: cluster
  resourceVersion: "252871"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
status: {}

$ oc patch scheduler cluster --type='merge' -p '{"spec":{"profile":"HighNodeUtilization"}}'
scheduler.config.openshift.io/cluster patched

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 8
  name: cluster
  resourceVersion: "259718"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: HighNodeUtilization
status: {}

3. Recreate the pods and check the placement of the 300 pods

$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
$ oc get pods -o wide | grep australiaeast1 | wc -l
100
$ oc get pods -o wide | grep australiaeast2 | wc -l
100
$ oc get pods -o wide | grep australiaeast3 | wc -l
100

Actual results:
The placement is not affected by the scheduler profile configuration.

Expected results:
The placement is affected by the scheduler profile configuration.
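After patching, it is worth confirming the profile actually landed in the Scheduler CR before drawing conclusions from pod placement. A minimal sketch — the YAML below is an abbreviated copy of the patched CR from the report, inlined so the check runs standalone; on a live cluster you would use `oc get scheduler cluster -o jsonpath='{.spec.profile}'` instead:

```shell
#!/bin/sh
# Extract the spec.profile value from a saved Scheduler CR dump.
# The awk match on the 'profile:' key is a rough stand-in for a
# proper YAML parser, good enough for this flat two-level spec.
cat <<'EOF' | awk '$1 == "profile:" {print $2}'
apiVersion: config.openshift.io/v1
kind: Scheduler
spec:
  mastersSchedulable: false
  profile: HighNodeUtilization
EOF
```

If this prints nothing, the merge patch did not apply (for example because of the `"prof ile"` key typo that a line wrap can introduce when pasting the patch command).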