Bug 2002300 - Altering the Schedule Profile configurations doesn't affect the placement of the pods
Summary: Altering the Schedule Profile configurations doesn't affect the placement of the pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Importance: low medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks: 2026109 2026111
 
Reported: 2021-09-08 13:44 UTC by yhe
Modified: 2022-03-10 16:08 UTC (History)
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2026109 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:08:32 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-scheduler-operator pull 378 0 None open Bug 2002300: Disable balancedAllocation and add weight for HighNodeUtilization profile 2021-11-23 19:10:51 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:08:52 UTC

Description yhe 2021-09-08 13:44:25 UTC
Description of problem:
I deployed 300 pods (3 deployments with 100 pods each) and changed the scheduler profile configuration to each of the 3 available values, but the pod placement was almost identical in every case.
Please let me know if I misconfigured something in the scheduler profile feature.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Deploy 300 pods

$ oc new-app httpd --name httpd
$ oc new-app httpd --name httpd1
$ oc new-app httpd --name httpd2

$ oc scale --replicas=100 deployment httpd
$ oc scale --replicas=100 deployment httpd1
$ oc scale --replicas=100 deployment httpd2

$ oc get pods -o wide | wc -l
301

$ oc get pods -o wide | grep australiaeast1 | wc -l
96
$ oc get pods -o wide | grep australiaeast2 | wc -l
100
$ oc get pods -o wide | grep australiaeast3 | wc -l
104

2. Alter the scheduler profile configuration (see the note after step 3 about waiting for the new scheduler pods to roll out)

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 7
  name: cluster
  resourceVersion: "252871"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
status: {}

$ oc patch scheduler cluster --type='merge' -p '{"spec":{"profile":"HighNodeUtilization"}}'
scheduler.config.openshift.io/cluster patched

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 8
  name: cluster
  resourceVersion: "259718"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: HighNodeUtilization
status: {}

3. Recreate the pods and check the placement of the 300 pods

$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted

$ oc get pods -o wide | grep australiaeast1 | wc -l
100
$ oc get pods -o wide | grep australiaeast2 | wc -l
100
$ oc get pods -o wide | grep australiaeast3 | wc -l
100
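Note on step 2: the profile change is applied by the kube-scheduler operator, which rolls out new static scheduler pods, so placement only reflects the new profile once those pods are running with the updated config (comment 17 includes this wait as an explicit step). A minimal sketch of how to watch for the rollout:

$ oc get co kube-scheduler
$ oc -n openshift-kube-scheduler get pods -w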

Actual results:
Pod placement is not affected by the scheduler profile configuration.

Expected results:
Pod placement is affected by the scheduler profile configuration.

Comment 1 Mike Dame 2021-09-08 15:12:18 UTC
This is a case where it is very difficult to observe the "expected" results, especially in an empty stock cluster. It's very likely that the different profiles are taking effect, but the resource differences between the nodes are so small that you see (almost) no difference in placement.

Can you share any output from the kube-scheduler pods when you set the different profiles? That way we can see what config the scheduler is actually using (you may need to set `LogLevel: Debug`).
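A minimal sketch of how to do that (the patch mirrors the one used in the next comment; this assumes the scheduler's main container is named kube-scheduler, and greps for the "Using component config" line quoted later in this bug):

$ oc patch kubeschedulers.operator/cluster --type=json -p '[{"op": "replace", "path": "/spec/logLevel", "value": "Debug"}]'
$ oc -n openshift-kube-scheduler logs openshift-kube-scheduler-<master-node> -c kube-scheduler | grep "Using component config"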

Comment 2 yhe 2021-09-09 03:16:32 UTC
Hi Mike,

Thank you for your reply.

I collected `oc adm inspect` data for the openshift-kube-scheduler namespace while setting the different profiles.

$ oc patch kubeschedulers.operator/cluster --type=json -p '[{"op": "replace", "path": "/spec/logLevel", "value": "Debug" }]'
kubescheduler.operator.openshift.io/cluster patched

$ oc get kubeschedulers.operator/cluster -o yaml | grep logLevel
  logLevel: Debug

$ oc -n testacore4 get pods -o wide | wc -l
301
$ oc -n testacore4 get pods -o wide | grep australiaeast1 | wc -l
100
$ oc -n testacore4 get pods -o wide | grep australiaeast2 | wc -l
100
$ oc -n testacore4 get pods -o wide | grep australiaeast3 | wc -l
100

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 9
  name: cluster
  resourceVersion: "300780"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
status: {}

$ oc patch scheduler cluster --type='merge' -p '{"spec":{"profile":"HighNodeUtilization"}}'
scheduler.config.openshift.io/cluster patched

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2021-09-07T08:25:45Z"
  generation: 10
  name: cluster
  resourceVersion: "305727"
  uid: 5065d4a1-8b0b-4e64-860a-4866ef636c9c
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: HighNodeUtilization
status: {}

$ oc -n openshift-kube-scheduler get pods | grep -v Completed
NAME                                                    READY   STATUS      RESTARTS   AGE
openshift-kube-scheduler-yhe-testathon-6kn9r-master-0   3/3     Running     0          78s
openshift-kube-scheduler-yhe-testathon-6kn9r-master-1   3/3     Running     0          3m20s
openshift-kube-scheduler-yhe-testathon-6kn9r-master-2   3/3     Running     0          2m19s

$ oc -n testacore4 rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted

$ oc -n testacore4 get pods -o wide | grep australiaeast1 | wc -l
102
$ oc -n testacore4 get pods -o wide | grep australiaeast2 | wc -l
99
$ oc -n testacore4 get pods -o wide | grep australiaeast3 | wc -l
99

$ oc adm inspect ns/openshift-kube-scheduler
Gathering data for ns/openshift-kube-scheduler...
Wrote inspect data to inspect.local.5211842808652610673.

$ tar cvaf inspect.tar.gz inspect.local.5211842808652610673/

Please check the data and let me know if anything else is needed.

Regards,
Yiyong

Comment 4 Mike Dame 2021-09-09 13:57:04 UTC
Thanks for providing that information. From the logs, I can see the scheduler is using the following config:

...
    score:
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
      - name: InterPodAffinity
        weight: 1
      - name: NodeAffinity
        weight: 1
      - name: NodePreferAvoidPods
        weight: 10000
      - name: PodTopologySpread
        weight: 2
      - name: TaintToleration
        weight: 1
      - name: NodeResourcesMostAllocated
        weight: 0
...

So while NodeResourcesMostAllocated is enabled, it has a weight of 0, and NodeResourcesBalancedAllocation is enabled with a weight of 1, which may be interfering. I will definitely look more into our predefined profiles to see if there are adjustments we can make.
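A sketch of the kind of score-section adjustment that would drive bin packing harder (illustrative only; the change that eventually landed is the PR linked above, and the operator-rendered configs appear in comments 10 and 16, where the newer API expresses the same intent as NodeResourcesFit with a MostAllocated scoring strategy):

    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
      enabled:
      - name: NodeResourcesMostAllocated
        weight: 5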

Comment 5 Michal Fojtik 2021-10-23 13:01:58 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 6 Mike Dame 2021-10-26 14:14:50 UTC
This is still valid; removing LifecycleStale.

Comment 7 Mike Dame 2021-11-23 19:09:48 UTC
I opened https://github.com/openshift/cluster-kube-scheduler-operator/pull/378 to address this. It will need to be backported to 4.8 and 4.7 as well.

First, I tried making this change with the MostAllocated plugin at a weight of 1, but I still didn't see much change in the placement. I think this is because other default spreading strategies, like topology spreading, are still in effect.

Increasing the weight to 5 gave much better results:

$ oc get pods -o wide | grep ci-ln-c4z2fst-72292-pt4p7-worker-a-ttbcz | wc -l
     235
$ oc get pods -o wide | grep ci-ln-c4z2fst-72292-pt4p7-worker-c-j9k2z | wc -l
      67
$ oc get pods -o wide | grep ci-ln-c4z2fst-72292-pt4p7-worker-b-9cqss | wc -l
       4

I think this makes sense, because we want this profile to strongly express its bin-packing intent.
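For intuition (illustrative numbers only, not measurements from this cluster): the scheduler normalizes each enabled score plugin to a 0-100 range per node, multiplies by the plugin weight, and sums the results, so the MostAllocated signal has to outweigh the still-enabled spreading plugins such as PodTopologySpread (weight 2):

# Illustrative only: one busy node vs. one nearly empty node
#   busy : MostAllocated ~90, PodTopologySpread ~10
#   empty: MostAllocated ~10, PodTopologySpread ~100
# weight 1: busy = 1*90 + 2*10 = 110 < empty = 1*10 + 2*100 = 210  -> spreading wins
# weight 5: busy = 5*90 + 2*10 = 470 > empty = 5*10 + 2*100 = 250  -> bin packing wins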

Comment 10 RamaKasturi 2021-12-01 09:41:40 UTC
Hello Mike,

   I tried to verify the bug with the nightly below and I do not see the expected results mentioned in comment 7. I followed steps similar to those in the description to reproduce the issue, and I see that the placement does not get affected.

[knarra@knarra cucushift]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-29-191648   True        False         176m    Cluster version is 4.10.0-0.nightly-2021-11-29-191648

4.10 pod creation:
=================
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-156-216.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-174-189.us-east-2.compute.internal | wc -l
102
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-203-52.us-east-2.compute.internal | wc -l
99

[knarra@knarra cucushift]$ oc patch Scheduler cluster --type='json' -p='[{"op": "add", "path": "/spec/profile", "value":"HighNodeUtilization"}]'
scheduler.config.openshift.io/cluster patched

I1201 08:27:02.532460       1 configfile.go:96] "Using component config" config="apiVersion: kubescheduler.config.k8s.io/v1beta1\nclientConnection:\n  acceptContentTypes: \"\"\n  burst: 100\n  contentType: application/vnd.kubernetes.protobuf\n  kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig\n  qps: 50\nenableContentionProfiling: true\nenableProfiling: true\nhealthzBindAddress: 0.0.0.0:10251\nkind: KubeSchedulerConfiguration\nleaderElection:\n  leaderElect: true\n  leaseDuration: 2m17s\n  renewDeadline: 1m47s\n  resourceLock: configmaps\n  resourceName: kube-scheduler\n  resourceNamespace: openshift-kube-scheduler\n  retryPeriod: 26s\nmetricsBindAddress: 0.0.0.0:10251\nparallelism: 16\npercentageOfNodesToScore: 0\npodInitialBackoffSeconds: 1\npodMaxBackoffSeconds: 10\nprofiles:\n- pluginConfig:\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      kind: DefaultPreemptionArgs\n      minCandidateNodesAbsolute: 100\n      minCandidateNodesPercentage: 10\n    name: DefaultPreemption\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      hardPodAffinityWeight: 1\n      kind: InterPodAffinityArgs\n    name: InterPodAffinity\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      kind: NodeAffinityArgs\n    name: NodeAffinity\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      kind: NodeResourcesFitArgs\n      scoringStrategy:\n        resources:\n        - name: cpu\n          weight: 1\n        - name: memory\n          weight: 1\n        type: LeastAllocated\n    name: NodeResourcesFit\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      kind: NodeResourcesMostAllocatedArgs\n      resources:\n      - name: cpu\n        weight: 1\n      - name: memory\n        weight: 1\n    name: NodeResourcesMostAllocated\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      defaultingType: System\n      kind: PodTopologySpreadArgs\n    name: PodTopologySpread\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      bindTimeoutSeconds: 600\n      kind: VolumeBindingArgs\n    name: VolumeBinding\n  plugins:\n    bind:\n      enabled:\n      - name: DefaultBinder\n        weight: 0\n    filter:\n      enabled:\n      - name: NodeUnschedulable\n        weight: 0\n      - name: NodeName\n        weight: 0\n      - name: TaintToleration\n        weight: 0\n      - name: NodeAffinity\n        weight: 0\n      - name: NodePorts\n        weight: 0\n      - name: NodeResourcesFit\n        weight: 0\n      - name: VolumeRestrictions\n        weight: 0\n      - name: EBSLimits\n        weight: 0\n      - name: GCEPDLimits\n        weight: 0\n      - name: NodeVolumeLimits\n        weight: 0\n      - name: AzureDiskLimits\n        weight: 0\n      - name: VolumeBinding\n        weight: 0\n      - name: VolumeZone\n        weight: 0\n      - name: PodTopologySpread\n        weight: 0\n      - name: InterPodAffinity\n        weight: 0\n    permit: {}\n    postBind: {}\n    postFilter:\n      enabled:\n      - name: DefaultPreemption\n        weight: 0\n    preBind:\n      enabled:\n      - name: VolumeBinding\n        weight: 0\n    preFilter:\n      enabled:\n      - name: NodeResourcesFit\n        weight: 0\n      - name: NodePorts\n        weight: 0\n      - name: VolumeRestrictions\n        weight: 0\n      - name: PodTopologySpread\n        weight: 0\n      - name: InterPodAffinity\n        weight: 0\n      - name: VolumeBinding\n        weight: 
0\n      - name: NodeAffinity\n        weight: 0\n    preScore:\n      enabled:\n      - name: InterPodAffinity\n        weight: 0\n      - name: PodTopologySpread\n        weight: 0\n      - name: TaintToleration\n        weight: 0\n      - name: NodeAffinity\n        weight: 0\n    queueSort:\n      enabled:\n      - name: PrioritySort\n        weight: 0\n    reserve:\n      enabled:\n      - name: VolumeBinding\n        weight: 0\n    score:\n      enabled:\n      - name: ImageLocality\n        weight: 1\n      - name: InterPodAffinity\n        weight: 1\n      - name: NodeAffinity\n        weight: 1\n      - name: NodePreferAvoidPods\n        weight: 10000\n      - name: PodTopologySpread\n        weight: 2\n      - name: TaintToleration\n        weight: 1\n      - name: NodeResourcesMostAllocated\n        weight: 5\n  schedulerName: default-scheduler\n"
I1201 08:27:02.532488       1 server.go:144] "Starting Kubernetes Scheduler version" version="v1.22.1+bac83a5"

oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted


[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-156-216.us-east-2.compute.internal | wc -l
101
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-174-189.us-east-2.compute.internal | wc -l
100
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-203-52.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ 

Based on the above, I am moving the bug back to ASSIGNED. I have the cluster handy in case you want to take a look, thanks!
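For anyone reproducing this, a quick way (a sketch; it assumes the standard static-pod operator status fields) to confirm that all three scheduler pods have rolled to the revision that carries the new weight before restarting the workload:

$ oc get kubescheduler cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" "}{.currentRevision}{"\n"}{end}'
$ oc -n openshift-kube-scheduler get pods -o wide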

Comment 11 RamaKasturi 2021-12-01 09:44:45 UTC
I tried reproducing the bug on 4.9, and below is what I see:
===========================================================
4.9 pod creation:
===========================
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-142-212.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-170-232.us-east-2.compute.internal | wc -l
101
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-200-222.us-east-2.compute.internal | wc -l
100

I1201 08:27:38.689680       1 configfile.go:96] "Using component config" config="apiVersion: kubescheduler.config.k8s.io/v1beta1\nclientConnection:\n  acceptContentTypes: \"\"\n  burst: 100\n  contentType: application/vnd.kubernetes.protobuf\n  kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig\n  qps: 50\nenableContentionProfiling: true\nenableProfiling: true\nhealthzBindAddress: 0.0.0.0:10251\nkind: KubeSchedulerConfiguration\nleaderElection:\n  leaderElect: true\n  leaseDuration: 2m17s\n  renewDeadline: 1m47s\n  resourceLock: configmaps\n  resourceName: kube-scheduler\n  resourceNamespace: openshift-kube-scheduler\n  retryPeriod: 26s\nmetricsBindAddress: 0.0.0.0:10251\nparallelism: 16\npercentageOfNodesToScore: 0\npodInitialBackoffSeconds: 1\npodMaxBackoffSeconds: 10\nprofiles:\n- pluginConfig:\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      kind: DefaultPreemptionArgs\n      minCandidateNodesAbsolute: 100\n      minCandidateNodesPercentage: 10\n    name: DefaultPreemption\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      hardPodAffinityWeight: 1\n      kind: InterPodAffinityArgs\n    name: InterPodAffinity\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      kind: NodeAffinityArgs\n    name: NodeAffinity\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      kind: NodeResourcesBalancedAllocationArgs\n      resources:\n      - name: cpu\n        weight: 1\n      - name: memory\n        weight: 1\n    name: NodeResourcesBalancedAllocation\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      kind: NodeResourcesFitArgs\n      scoringStrategy:\n        resources:\n        - name: cpu\n          weight: 1\n        - name: memory\n          weight: 1\n        type: LeastAllocated\n    name: NodeResourcesFit\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      kind: NodeResourcesMostAllocatedArgs\n      resources:\n      - name: cpu\n        weight: 1\n      - name: memory\n        weight: 1\n    name: NodeResourcesMostAllocated\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      defaultingType: System\n      kind: PodTopologySpreadArgs\n    name: PodTopologySpread\n  - args:\n      apiVersion: kubescheduler.config.k8s.io/v1beta1\n      bindTimeoutSeconds: 600\n      kind: VolumeBindingArgs\n    name: VolumeBinding\n  plugins:\n    bind:\n      enabled:\n      - name: DefaultBinder\n        weight: 0\n    filter:\n      enabled:\n      - name: NodeUnschedulable\n        weight: 0\n      - name: NodeName\n        weight: 0\n      - name: TaintToleration\n        weight: 0\n      - name: NodeAffinity\n        weight: 0\n      - name: NodePorts\n        weight: 0\n      - name: NodeResourcesFit\n        weight: 0\n      - name: VolumeRestrictions\n        weight: 0\n      - name: EBSLimits\n        weight: 0\n      - name: GCEPDLimits\n        weight: 0\n      - name: NodeVolumeLimits\n        weight: 0\n      - name: AzureDiskLimits\n        weight: 0\n      - name: VolumeBinding\n        weight: 0\n      - name: VolumeZone\n        weight: 0\n      - name: PodTopologySpread\n        weight: 0\n      - name: InterPodAffinity\n        weight: 0\n    permit: {}\n    postBind: {}\n    postFilter:\n      enabled:\n      - name: DefaultPreemption\n        weight: 0\n    preBind:\n      enabled:\n      - name: VolumeBinding\n        weight: 0\n    preFilter:\n      enabled:\n      - name: NodeResourcesFit\n        
weight: 0\n      - name: NodePorts\n        weight: 0\n      - name: VolumeRestrictions\n        weight: 0\n      - name: PodTopologySpread\n        weight: 0\n      - name: InterPodAffinity\n        weight: 0\n      - name: VolumeBinding\n        weight: 0\n      - name: NodeAffinity\n        weight: 0\n    preScore:\n      enabled:\n      - name: InterPodAffinity\n        weight: 0\n      - name: PodTopologySpread\n        weight: 0\n      - name: TaintToleration\n        weight: 0\n      - name: NodeAffinity\n        weight: 0\n    queueSort:\n      enabled:\n      - name: PrioritySort\n        weight: 0\n    reserve:\n      enabled:\n      - name: VolumeBinding\n        weight: 0\n    score:\n      enabled:\n      - name: NodeResourcesBalancedAllocation\n        weight: 1\n      - name: ImageLocality\n        weight: 1\n      - name: InterPodAffinity\n        weight: 1\n      - name: NodeAffinity\n        weight: 1\n      - name: NodePreferAvoidPods\n        weight: 10000\n      - name: PodTopologySpread\n        weight: 2\n      - name: TaintToleration\n        weight: 1\n      - name: NodeResourcesMostAllocated\n        weight: 0\n  schedulerName: default-scheduler\n"


[knarra@knarra cucushift]$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted


[knarra@knarra cucushift]$ watch oc get pods
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-142-212.us-east-2.compute.internal | wc -l
100
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-170-232.us-east-2.compute.internal | wc -l
99
[knarra@knarra cucushift]$ oc get pods -o wide | grep ip-10-0-200-222.us-east-2.compute.internal | wc -l
101

Comment 13 Jan Chaloupka 2022-02-02 12:04:20 UTC
## Checking the server version and setting the HighNodeUtilization profile

```
$ oc version
Client Version: 4.9.0-0.nightly-2021-08-26-040328
Server Version: 4.11.0-0.nightly-2022-01-28-102930
Kubernetes Version: v1.23.0+d30ebbc
$ oc get scheduler cluster -o json | jq ".spec.profile"
"HighNodeUtilization"
```

## Deploying 300 pods

$ oc new-app httpd --name httpd
...
$ oc new-app httpd --name httpd1
...
$ oc new-app httpd --name httpd2
...
$ oc scale --replicas=100 deployment httpd2
deployment.apps/httpd2 scaled
$ oc scale --replicas=100 deployment httpd1
deployment.apps/httpd1 scaled
$ oc scale --replicas=100 deployment httpd
deployment.apps/httpd scaled
$ oc get pods -o wide | grep ip-10-0-168-80.ec2.internal | wc -l
73
$ oc get pods -o wide | grep ip-10-0-155-82.ec2.internal | wc -l
227
$ oc get pods -o wide | grep ip-10-0-132-73.ec2.internal | wc -l
0

Comment 14 Jan Chaloupka 2022-02-02 13:04:33 UTC
## Checking the server version and setting the HighNodeUtilization profile

$ oc version
Client Version: 4.9.0-0.nightly-2021-08-26-040328
Server Version: 4.10.0-0.nightly-2022-02-02-000921
Kubernetes Version: v1.23.3+b63be7f
$ oc get scheduler cluster -o json | jq ".spec.profile"
"HighNodeUtilization"

## Deploying 300 pods

$ oc new-app httpd --name httpd
...
$ oc new-app httpd --name httpd1
...
$ oc new-app httpd --name httpd2
...
$ oc scale --replicas=100 deployment httpd2
deployment.apps/httpd2 scaled
$ oc scale --replicas=100 deployment httpd1
deployment.apps/httpd1 scaled
$ oc scale --replicas=100 deployment httpd
deployment.apps/httpd scaled
$ oc get pods -o wide | grep ip-10-0-173-104.us-west-2.compute.internal | wc -l
0
$ oc get pods -o wide | grep ip-10-0-191-32.us-west-2.compute.internal | wc -l
75
$ oc get pods -o wide | grep ip-10-0-230-199.us-west-2.compute.internal | wc -l
225

Comment 15 Jan Chaloupka 2022-02-02 13:08:02 UTC
I intentionally skipped the step of restarting the deployment pods:

```
$ oc rollout restart deployment httpd httpd1 httpd2
deployment.apps/httpd restarted
deployment.apps/httpd1 restarted
deployment.apps/httpd2 restarted
```

since this temporarily doubles the number of pods in the cluster, which can force an even distribution when the worker nodes do not have enough resources to run the doubled workload.
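An alternative that recreates all of the pods without the temporary doubling is to scale the deployments to zero, wait for the old pods to terminate, and then scale back up (a sketch; `oc scale` accepts multiple deployment names):

```
$ oc scale --replicas=0 deployment httpd httpd1 httpd2
$ oc get pods -w    # wait until the old pods are gone
$ oc scale --replicas=100 deployment httpd httpd1 httpd2
```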

Comment 16 RamaKasturi 2022-02-03 11:53:19 UTC
I verified the bug with the build below and it works fine.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         106m    Cluster version is 4.10.0-rc.0

Below are the steps I followed to verify the bug:
==================================================
1) Install latest 4.10 cluster
2) create a new project
3) create 300 pods
 # oc new-app httpd --name httpd
 # oc new-app httpd --name httpd1
 # oc new-app httpd --name httpd2
 # oc scale --replicas=100 deployment httpd
 # oc scale --replicas=100 deployment httpd1
 # oc scale --replicas=100 deployment httpd2

The distribution of pods is as below:
=====================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
102

Here is the config:
=====================
-------------------------Configuration File Contents Start Here---------------------- 
apiVersion: kubescheduler.config.k8s.io/v1beta3
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig
  qps: 50
enableContentionProfiling: true
enableProfiling: true
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  leaseDuration: 2m17s
  renewDeadline: 1m47s
  resourceLock: configmaps
  resourceName: kube-scheduler
  resourceNamespace: openshift-kube-scheduler
  retryPeriod: 26s
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: DefaultPreemptionArgs
      minCandidateNodesAbsolute: 100
      minCandidateNodesPercentage: 10
    name: DefaultPreemption
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      hardPodAffinityWeight: 1
      kind: InterPodAffinityArgs
    name: InterPodAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeAffinityArgs
    name: NodeAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeResourcesBalancedAllocationArgs
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
    name: NodeResourcesBalancedAllocation
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeResourcesFitArgs
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
        type: LeastAllocated
    name: NodeResourcesFit
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      defaultingType: System
      kind: PodTopologySpreadArgs
    name: PodTopologySpread
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      bindTimeoutSeconds: 600
      kind: VolumeBindingArgs
    name: VolumeBinding
  plugins:
    bind: {}
    filter: {}
    multiPoint:
      enabled:
      - name: PrioritySort
        weight: 0
      - name: NodeUnschedulable
        weight: 0
      - name: NodeName
        weight: 0
      - name: TaintToleration
        weight: 3
      - name: NodeAffinity
        weight: 2
      - name: NodePorts
        weight: 0
      - name: NodeResourcesFit
        weight: 1
      - name: VolumeRestrictions
        weight: 0
      - name: EBSLimits
        weight: 0
      - name: GCEPDLimits
        weight: 0
      - name: NodeVolumeLimits
        weight: 0
      - name: AzureDiskLimits
        weight: 0
      - name: VolumeBinding
        weight: 0
      - name: VolumeZone
        weight: 0
      - name: PodTopologySpread
        weight: 2
      - name: InterPodAffinity
        weight: 2
      - name: DefaultPreemption
        weight: 0
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
      - name: DefaultBinder
        weight: 0
    permit: {}
    postBind: {}
    postFilter: {}
    preBind: {}
    preFilter: {}
    preScore: {}
    queueSort: {}
    reserve: {}
    score: {}
  schedulerName: default-scheduler

------------------------------------Configuration File Contents End Here---------------------------------

4) Alter the profile to HighNodeUtilization using the command below
# oc patch Scheduler cluster --type='json' -p='[{"op": "add", "path": "/spec/profile", "value":"HighNodeUtilization"}]'

5) Deploy 300 pods again using the commands below
 # oc new-app httpd --name httpd
 # oc new-app httpd --name httpd1
 # oc new-app httpd --name httpd2
 # oc scale --replicas=100 deployment httpd
 # oc scale --replicas=100 deployment httpd1
 # oc scale --replicas=100 deployment httpd2

Distribution of pods after altering profile to HighNodeUtilization:
===================================================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
139
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
135
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
26

-------------------------Configuration File Contents Start Here---------------------- 
apiVersion: kubescheduler.config.k8s.io/v1beta3
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig
  qps: 50
enableContentionProfiling: true
enableProfiling: true
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  leaseDuration: 2m17s
  renewDeadline: 1m47s
  resourceLock: configmaps
  resourceName: kube-scheduler
  resourceNamespace: openshift-kube-scheduler
  retryPeriod: 26s
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: DefaultPreemptionArgs
      minCandidateNodesAbsolute: 100
      minCandidateNodesPercentage: 10
    name: DefaultPreemption
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      hardPodAffinityWeight: 1
      kind: InterPodAffinityArgs
    name: InterPodAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeAffinityArgs
    name: NodeAffinity
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeResourcesBalancedAllocationArgs
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
    name: NodeResourcesBalancedAllocation
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: NodeResourcesFitArgs
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
        type: MostAllocated
    name: NodeResourcesFit
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      defaultingType: System
      kind: PodTopologySpreadArgs
    name: PodTopologySpread
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      bindTimeoutSeconds: 600
      kind: VolumeBindingArgs
    name: VolumeBinding
  plugins:
    bind: {}
    filter: {}
    multiPoint:
      enabled:
      - name: PrioritySort
        weight: 0
      - name: NodeUnschedulable
        weight: 0
      - name: NodeName
        weight: 0
      - name: TaintToleration
        weight: 3
      - name: NodeAffinity
        weight: 2
      - name: NodePorts
        weight: 0
      - name: NodeResourcesFit
        weight: 1
      - name: VolumeRestrictions
        weight: 0
      - name: EBSLimits
        weight: 0
      - name: GCEPDLimits
        weight: 0
      - name: NodeVolumeLimits
        weight: 0
      - name: AzureDiskLimits
        weight: 0
      - name: VolumeBinding
        weight: 0
      - name: VolumeZone
        weight: 0
      - name: PodTopologySpread
        weight: 2
      - name: InterPodAffinity
        weight: 2
      - name: DefaultPreemption
        weight: 0
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
      - name: DefaultBinder
        weight: 0
    permit: {}
    postBind: {}
    postFilter: {}
    preBind: {}
    preFilter: {}
    preScore: {}
    queueSort: {}
    reserve: {}
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
        weight: 0
      enabled:
      - name: NodeResourcesFit
        weight: 5
  schedulerName: default-scheduler

------------------------------------Configuration File Contents End Here---------------------------------

I tried the same procedure as described in the bug description, and below are the values I see before and after altering the profile.

Before altering the profile:
==============================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
102

After altering the profile:
================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
66
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
117
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
117

I checked with Jan, and the suggestion was to repeat the test with more worker nodes (i.e. 5 or 7) to see a clearer difference. I will try that and then move the bug to the verified state.

Comment 17 RamaKasturi 2022-02-04 09:13:49 UTC
I verified the bug with the build below; here is the procedure I followed to verify it.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         20h     Cluster version is 4.10.0-rc.0

Procedure followed to verify the bug:
====================================
1) Install cluster with latest 4.10 bits
2) Deploy 300 pods using the commands below
  # oc new-app httpd --name=httpd
  # oc new-app httpd --name httpd1
  # oc new-app httpd --name httpd2
3) Scale the app pods using the command below
  # oc scale --replicas=100 deployment httpd
  # oc scale --replicas=100 deployment httpd1
  # oc scale --replicas=100 deployment httpd2
4) Now retrieve the number of pods scheduled to each node (a one-pass alternative is sketched after this list)
  # oc get pods -o wide | grep <node_name> | wc -l
5) Alter the scheduler profile to HighNodeUtilization using the command below
  # oc patch Scheduler cluster --type='json' -p='[{"op": "add", "path": "/spec/profile", "value":"HighNodeUtilization"}]'
6) wait for kube-scheduler pods to restart
7) Now Run the commands below to rollout the restart of the pods
  # oc rollout restart deployment httpd httpd1 httpd2
8) wait for the pods to terminate and start fresh.
9) Repeat steps 1 through 8 for clusters with 3, 5, and 7 worker nodes.
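As an aside, the per-node counts below can also be produced in one pass instead of one grep per node (a sketch; it assumes NODE is the 7th whitespace-separated column of `oc get pods -o wide` output):

$ oc get pods -o wide --no-headers | awk '{count[$7]++} END {for (n in count) print n, count[n]}'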

Observations that were made during the test:
============================================
7 worker node cluster:
=========================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-129-190.us-east-2.compute.internal | wc -l
36
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-145-88.us-east-2.compute.internal | wc -l
37
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-150-102.us-east-2.compute.internal | wc -l
38
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-180-1.us-east-2.compute.internal | wc -l
48
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-182-90.us-east-2.compute.internal | wc -l
47
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-207-182.us-east-2.compute.internal | wc -l
48
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-104.us-east-2.compute.internal | wc -l
46

After profile (1st rollout):
============================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-129-190.us-east-2.compute.internal | wc -l
101
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-145-88.us-east-2.compute.internal | wc -l
6
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-150-102.us-east-2.compute.internal | wc -l
3
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-180-1.us-east-2.compute.internal | wc -l
39
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-182-90.us-east-2.compute.internal | wc -l
31
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-207-182.us-east-2.compute.internal | wc -l
26
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-104.us-east-2.compute.internal | wc -l
94

After profile (2nd rollout):
===================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-129-190.us-east-2.compute.internal | wc -l
150
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-145-88.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-150-102.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-180-1.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-182-90.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-207-182.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-104.us-east-2.compute.internal | wc -l
150




5 node worker cluster:
=========================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-133-12.us-east-2.compute.internal | wc -l
55
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-159-8.us-east-2.compute.internal | wc -l
57
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-178-40.us-east-2.compute.internal | wc -l
56
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-191-234.us-east-2.compute.internal | wc -l
55
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-94.us-east-2.compute.internal | wc -l
77

After profile (1st rollout):
============================

[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-133-12.us-east-2.compute.internal | wc -l
121
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-159-8.us-east-2.compute.internal | wc -l
2
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-178-40.us-east-2.compute.internal | wc -l
20
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-191-234.us-east-2.compute.internal | wc -l
87
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-94.us-east-2.compute.internal | wc -l
70

After profile (2nd rollout):
=================================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-133-12.us-east-2.compute.internal | wc -l
101
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-159-8.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-178-40.us-east-2.compute.internal | wc -l
0
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-191-234.us-east-2.compute.internal | wc -l
102
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-219-94.us-east-2.compute.internal | wc -l
97



3 node worker cluster:
==========================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
99
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
102

After profile(1st rollout):
==========================
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-149-144.us-east-2.compute.internal | wc -l
108
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-164-20.us-east-2.compute.internal | wc -l
111
[knarra@knarra ~]$ oc get pods -o wide | grep ip-10-0-197-46.us-east-2.compute.internal | wc -l
81

Based on the above, there is clearly a difference in placement after changing the scheduler profile to HighNodeUtilization, so I am moving the bug to the verified state.

Comment 20 errata-xmlrpc 2022-03-10 16:08:32 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

