This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2071886 - [RFE] Option to allow scheduler to rearrange the cluster to allow certain actions to succeed
Summary: [RFE] Option to allow scheduler to rearrange the cluster to allow certain act...
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Infrastructure
Version: 4.10.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: future
Assignee: Javier Cano Cano
QA Contact: Roni Kishner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-04-05 04:17 UTC by Germano Veit Michel
Modified: 2023-12-14 16:05 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-14 16:05:52 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker CNV-17404 (last updated 2023-12-14 16:05:51 UTC)
Red Hat Knowledge Base (Solution) 6980556 (last updated 2022-10-17 14:20:10 UTC)

Description Germano Veit Michel 2022-04-05 04:17:26 UTC
Consider this scenario:

* 3 worker nodes with 30G of memory each
* 6 VMs, with 8G RAM each, 2 running on each of the 3 workers.

As below:

NAME                             AGE   PHASE        IP            NODENAME                        READY
centos-stream8-empty-whitefish   11m   Running      10.128.2.19   worker-green.shift.toca.local   True
centos7-loud-bear                13m   Running      10.128.2.18   worker-green.shift.toca.local   True
fedora-advanced-goose            12m   Running      10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           11m   Running      10.131.0.24   worker-blue.shift.toca.local    True
fedora-frozen-stingray           11m   Running      10.129.2.42   worker-red.shift.toca.local     True
centos-stream9-useless-raven     13m   Running      10.129.2.41   worker-red.shift.toca.local     True

All the VMs are as follows:

spec:
  domain:
    resources:
      requests:
        memory: 8Gi

So each of the 3 worker nodes has 16G requested by VMs, plus some more for the other cluster pods, looking like this:

  Resource                       Requests     Limits
  --------                       --------     ------
  memory                         19858Mi (64%)  0 (0%)
  memory                         18024Mi (58%)  0 (0%)
  memory                         20166Mi (64%)  0 (0%)

Each of these worker nodes has enough capacity left to run another 8G VM, but not a 16G one.

The admin now wants to start a 16G VM. Combined, the cluster has the capacity, but no single worker does.

The admin requests the 16G VM to start. The result is:

NAME                             AGE   PHASE        IP            NODENAME                        READY
centos-stream8-empty-whitefish   11m   Running      10.128.2.19   worker-green.shift.toca.local   True
centos7-loud-bear                13m   Running      10.128.2.18   worker-green.shift.toca.local   True
fedora-advanced-goose            12m   Running      10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           11m   Running      10.131.0.24   worker-blue.shift.toca.local    True
fedora-frozen-stingray           11m   Running      10.129.2.42   worker-red.shift.toca.local     True
centos-stream9-useless-raven     13m   Running      10.129.2.41   worker-red.shift.toca.local     True
rhel7-red-llama                  14m   Scheduling                                                 False

Because:
0/7 nodes are available: 1 node(s) were unschedulable, 3 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

However, the admin can work around this: a simple live migration of any VM to another node brings one node down to 1 VM (8G) and another up to 3 VMs (24G), freeing enough room on one node for the 16G VM to start.
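In KubeVirt, such a manual migration can be triggered with virtctl migrate <vmi-name>, or equivalently by creating a VirtualMachineInstanceMigration object. A minimal sketch, using one of the VMI names from the example above (the metadata name is illustrative):

```yaml
# Ask KubeVirt to live migrate this VMI; the target node is chosen
# by the scheduler, not specified in this object.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-centos7-loud-bear   # illustrative name
spec:
  vmiName: centos7-loud-bear
```

Either way, the admin has to notice the stuck VM and pick a VM to move themselves; nothing does this automatically.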

For example, live migrate one VM manually and the state quickly goes through these steps without further admin interference:

# oc get vmi
NAME                             AGE   PHASE       IP            NODENAME                        READY
centos-stream8-empty-whitefish   29m   Running     10.128.2.19   worker-green.shift.toca.local   True
centos-stream9-useless-raven     30m   Running     10.129.2.41   worker-red.shift.toca.local     True
fedora-frozen-stingray           28m   Running     10.129.2.42   worker-red.shift.toca.local     True
centos7-loud-bear                30m   Running     10.131.0.25   worker-blue.shift.toca.local    True    <--- migrated from Green
fedora-advanced-goose            29m   Running     10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           28m   Running     10.131.0.24   worker-blue.shift.toca.local    True
rhel7-red-llama                  26s   Scheduled                 worker-green.shift.toca.local   False

Just a few moments later they are all running:
# oc get vmi
NAME                             AGE   PHASE     IP            NODENAME                        READY
centos-stream8-empty-whitefish   29m   Running   10.128.2.19   worker-green.shift.toca.local   True
centos-stream9-useless-raven     30m   Running   10.129.2.41   worker-red.shift.toca.local     True
centos7-loud-bear                30m   Running   10.131.0.25   worker-blue.shift.toca.local    True
fedora-advanced-goose            29m   Running   10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           28m   Running   10.131.0.24   worker-blue.shift.toca.local    True
fedora-frozen-stingray           28m   Running   10.129.2.42   worker-red.shift.toca.local     True
rhel7-red-llama                  34s   Running   10.128.2.22   worker-green.shift.toca.local   True


Request: the admin would like to have an option for this to work automatically, without admin interference to solve this problem.

I did some research on PriorityClasses, but it seems they are wiped from the VM spec. And even if they worked, it would create an ever-increasing priority problem, so I cannot find a solution for this with the currently available mechanisms.
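For reference, this is the kind of setup that was explored (the class name and value are illustrative): a preempting PriorityClass, which would in principle let the scheduler evict lower-priority pods to make room:

```yaml
# Illustrative sketch only: a standard Kubernetes PriorityClass that
# allows preemption of lower-priority pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: large-vm            # illustrative name
value: 1000000              # illustrative value
preemptionPolicy: PreemptLowerPriority
description: "Would allow preempting lower-priority pods to place a large VM"
```

Even if the class were preserved on the virt-launcher pod, preemption only helps if the new VM outranks the running ones, which leads to the ever-increasing priority problem described above.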

The customer is requesting this for VMs, not Pods, even though a generic solution could work for both.

Finally, the example above is about (re)starting a VM (i.e. a highly available one), but the same logic is requested for other tasks, such as draining nodes.

Comment 2 Ronen 2022-04-07 08:10:50 UTC
Virtual machines in OpenShift Virtualization are just pods and are placed by the OCP scheduler.
While the request is specific to VMs, the solution must be generic for OCP and should work for all pods. Please open the RFE against OCP.

