Bug 2071886
| Summary: | [RFE] Option to allow scheduler to rearrange the cluster to allow certain actions to succeed | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | Infrastructure | Assignee: | Javier Cano Cano <jcanocan> |
| Status: | CLOSED MIGRATED | QA Contact: | Roni Kishner <rkishner> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10.6 | CC: | cnagarka, danken, dholler, rkishner, rsdeor, usurse, ycui |
| Target Milestone: | --- | | |
| Target Release: | future | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-12-14 16:05:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Virtual machines in OpenShift Virtualization are just pods and comply with the OCP scheduler. While the request is specific to VMs, the solution must be generic for OCP and should work for all pods. Please open the RFE for OCP.
Consider this scenario:

* 3 worker nodes with 30G of memory each
* 6 VMs with 8G RAM each, 2 running on each of the 3 workers, as below:

```
NAME                             AGE   PHASE     IP            NODENAME                        READY
centos-stream8-empty-whitefish   11m   Running   10.128.2.19   worker-green.shift.toca.local   True
centos7-loud-bear                13m   Running   10.128.2.18   worker-green.shift.toca.local   True
fedora-advanced-goose            12m   Running   10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           11m   Running   10.131.0.24   worker-blue.shift.toca.local    True
fedora-frozen-stingray           11m   Running   10.129.2.42   worker-red.shift.toca.local     True
centos-stream9-useless-raven     13m   Running   10.129.2.41   worker-red.shift.toca.local     True
```

All the VMs are defined as follows:

```
spec:
  domain:
    resources:
      requests:
        memory: 8Gi
```

So each of the 3 worker nodes has 16G requested by VMs, plus some more for the other cluster pods, looking like this (one line per worker):

```
Resource  Requests       Limits
--------  --------       ------
memory    19858Mi (64%)  0 (0%)
memory    18024Mi (58%)  0 (0%)
memory    20166Mi (64%)  0 (0%)
```

Each of these worker nodes has at least enough capacity to run another 8G VM, but not a 16G one. The admin now wants to start a 16G VM. Combined, the cluster has the capacity, but no single worker does. The admin requests the 16G VM to start, and the result is:

```
NAME                             AGE   PHASE        IP            NODENAME                        READY
centos-stream8-empty-whitefish   11m   Running      10.128.2.19   worker-green.shift.toca.local   True
centos7-loud-bear                13m   Running      10.128.2.18   worker-green.shift.toca.local   True
fedora-advanced-goose            12m   Running      10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           11m   Running      10.131.0.24   worker-blue.shift.toca.local    True
fedora-frozen-stingray           11m   Running      10.129.2.42   worker-red.shift.toca.local     True
centos-stream9-useless-raven     13m   Running      10.129.2.41   worker-red.shift.toca.local     True
rhel7-red-llama                  14m   Scheduling                                                 False
```

Because:

```
0/7 nodes are available: 1 node(s) were unschedulable, 3 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
```

However, the admin can resolve this: the solution is a single live migration of any VM to another node, so that one node's usage drops to 1 VM (8G), another rises to 3 VMs (24G), and the 16G VM can start.
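For reference, a minimal sketch of what the 16G VM could look like, assuming the kubevirt.io/v1 VirtualMachine layout and the same pattern as the 8Gi VMs above (the name is taken from the rhel7-red-llama entry in the output; only the memory request differs):

```yaml
# Hypothetical manifest for the 16G VM being started; mirrors the 8Gi VMs
# shown earlier with only the memory request raised to 16Gi.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rhel7-red-llama
spec:
  running: true
  template:
    spec:
      domain:
        devices: {}
        resources:
          requests:
            memory: 16Gi   # does not fit on any single worker in the state above
```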
For example, live migrate one VM manually and the state quickly goes through these steps without further admin interference:

```
# oc get vmi
NAME                             AGE   PHASE       IP            NODENAME                        READY
centos-stream8-empty-whitefish   29m   Running     10.128.2.19   worker-green.shift.toca.local   True
centos-stream9-useless-raven     30m   Running     10.129.2.41   worker-red.shift.toca.local     True
fedora-frozen-stingray           28m   Running     10.129.2.42   worker-red.shift.toca.local     True
centos7-loud-bear                30m   Running     10.131.0.25   worker-blue.shift.toca.local    True    <--- migrated from Green
fedora-advanced-goose            29m   Running     10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           28m   Running     10.131.0.24   worker-blue.shift.toca.local    True
rhel7-red-llama                  26s   Scheduled                 worker-green.shift.toca.local   False
```

Just a few moments later they are all running:

```
# oc get vmi
NAME                             AGE   PHASE     IP            NODENAME                        READY
centos-stream8-empty-whitefish   29m   Running   10.128.2.19   worker-green.shift.toca.local   True
centos-stream9-useless-raven     30m   Running   10.129.2.41   worker-red.shift.toca.local     True
centos7-loud-bear                30m   Running   10.131.0.25   worker-blue.shift.toca.local    True
fedora-advanced-goose            29m   Running   10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           28m   Running   10.131.0.24   worker-blue.shift.toca.local    True
fedora-frozen-stingray           28m   Running   10.129.2.42   worker-red.shift.toca.local     True
rhel7-red-llama                  34s   Running   10.128.2.22   worker-green.shift.toca.local   True
```

Request: the admin would like an option for this to happen automatically, without admin interference.

I did some research on PriorityClasses, but it seems they are wiped from the VM spec. And even if they worked, they would create an ever-increasing priority problem, so I cannot find a solution for this with the currently available mechanisms.

The customer is requesting this for VMs, not Pods, even though a generic solution could work for both.

Finally, the example above is for (re)starting a VM (i.e. a highly available one), but the same logic is requested for other tasks, such as draining nodes.
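For context, the manual workaround above boils down to asking KubeVirt to live migrate one VMI, e.g. via `virtctl migrate` or by creating a VirtualMachineInstanceMigration object. A minimal sketch of the latter, with a hypothetical migration name:

```yaml
# Sketch of the manual step the RFE asks to have automated: a live-migration
# request for one running VMI so that memory frees up on its current node.
# Roughly equivalent to "virtctl migrate centos7-loud-bear"; the metadata
# name is hypothetical.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-centos7-loud-bear
spec:
  vmiName: centos7-loud-bear
```

The request is for this kind of rebalancing migration to be issued automatically whenever a pending VM would fit after it, instead of requiring the admin to pick and migrate a VM by hand.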