Bug 2071886
| Summary: | [RFE] Option to allow scheduler to rearrange the cluster to allow certain actions to succeed | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | Infrastructure | Assignee: | Javier Cano Cano <jcanocan> |
| Status: | CLOSED MIGRATED | QA Contact: | Roni Kishner <rkishner> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10.6 | CC: | cnagarka, danken, dholler, rkishner, rsdeor, usurse, ycui |
| Target Milestone: | --- | | |
| Target Release: | future | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-12-14 16:05:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Virtual machines in OpenShift Virtualization are just pods and comply with the OCP scheduler. While the request is specific to VMs, the solution must be generic for OCP and should work for all pods. Please open the RFE for OCP.
Consider this scenario:

* 3 worker nodes with 30G of memory each
* 6 VMs with 8G RAM each, 2 running on each of the 3 workers, as below:

```
NAME                             AGE   PHASE     IP            NODENAME                        READY
centos-stream8-empty-whitefish   11m   Running   10.128.2.19   worker-green.shift.toca.local   True
centos7-loud-bear                13m   Running   10.128.2.18   worker-green.shift.toca.local   True
fedora-advanced-goose            12m   Running   10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           11m   Running   10.131.0.24   worker-blue.shift.toca.local    True
fedora-frozen-stingray           11m   Running   10.129.2.42   worker-red.shift.toca.local     True
centos-stream9-useless-raven     13m   Running   10.129.2.41   worker-red.shift.toca.local     True
```

All the VMs are defined as follows:

```
spec:
  domain:
    resources:
      requests:
        memory: 8Gi
```

So each of the 3 worker nodes has 16G requested by VMs, plus some more for the other cluster pods, looking like this (one line per worker):

```
Resource  Requests       Limits
--------  --------       ------
memory    19858Mi (64%)  0 (0%)
memory    18024Mi (58%)  0 (0%)
memory    20166Mi (64%)  0 (0%)
```

Each of these worker nodes has at least enough capacity to run another 8G VM, but not a 16G one. The admin now wants to start a 16G VM. Combined, the cluster has the capacity, but no single worker does. The admin requests the 16G VM to start, and the result is:

```
NAME                             AGE   PHASE        IP            NODENAME                        READY
centos-stream8-empty-whitefish   11m   Running      10.128.2.19   worker-green.shift.toca.local   True
centos7-loud-bear                13m   Running      10.128.2.18   worker-green.shift.toca.local   True
fedora-advanced-goose            12m   Running      10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           11m   Running      10.131.0.24   worker-blue.shift.toca.local    True
fedora-frozen-stingray           11m   Running      10.129.2.42   worker-red.shift.toca.local     True
centos-stream9-useless-raven     13m   Running      10.129.2.41   worker-red.shift.toca.local     True
rhel7-red-llama                  14m   Scheduling                                                 False
```

Because:

```
0/7 nodes are available: 1 node(s) were unschedulable, 3 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
```

However, the admin can resolve this: the solution is a single live migration of any VM to another node, so that one node's usage drops to 1 VM (8G), another rises to 3 VMs (24G), and the 16G VM can start.
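For reference, a minimal sketch of what the 16G VM could look like, assuming the kubevirt.io/v1 VirtualMachine layout and the same pattern as the 8Gi VMs above (the name is taken from the rhel7-red-llama entry in the output; only the memory request differs):

```yaml
# Hypothetical manifest for the 16G VM being started; mirrors the 8Gi VMs
# shown earlier with only the memory request raised to 16Gi.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rhel7-red-llama
spec:
  running: true
  template:
    spec:
      domain:
        devices: {}
        resources:
          requests:
            memory: 16Gi   # does not fit on any single worker in the state above
```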
For example, live migrate one VM manually and the state quickly goes through these steps without further admin interference:

```
# oc get vmi
NAME                             AGE   PHASE       IP            NODENAME                        READY
centos-stream8-empty-whitefish   29m   Running     10.128.2.19   worker-green.shift.toca.local   True
centos-stream9-useless-raven     30m   Running     10.129.2.41   worker-red.shift.toca.local     True
fedora-frozen-stingray           28m   Running     10.129.2.42   worker-red.shift.toca.local     True
centos7-loud-bear                30m   Running     10.131.0.25   worker-blue.shift.toca.local    True    <--- migrated from Green
fedora-advanced-goose            29m   Running     10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           28m   Running     10.131.0.24   worker-blue.shift.toca.local    True
rhel7-red-llama                  26s   Scheduled                 worker-green.shift.toca.local   False
```

Just a few moments later they are all running:

```
# oc get vmi
NAME                             AGE   PHASE     IP            NODENAME                        READY
centos-stream8-empty-whitefish   29m   Running   10.128.2.19   worker-green.shift.toca.local   True
centos-stream9-useless-raven     30m   Running   10.129.2.41   worker-red.shift.toca.local     True
centos7-loud-bear                30m   Running   10.131.0.25   worker-blue.shift.toca.local    True
fedora-advanced-goose            29m   Running   10.131.0.23   worker-blue.shift.toca.local    True
fedora-civil-spoonbill           28m   Running   10.131.0.24   worker-blue.shift.toca.local    True
fedora-frozen-stingray           28m   Running   10.129.2.42   worker-red.shift.toca.local     True
rhel7-red-llama                  34s   Running   10.128.2.22   worker-green.shift.toca.local   True
```

Request: the admin would like an option for this to happen automatically, without admin interference.

I did some research on PriorityClasses, but it seems they are wiped from the VM spec. And even if they worked, they would create an ever-increasing priority problem, so I cannot find a solution for this with the currently available mechanisms.

The customer is requesting this for VMs, not Pods, even though a generic solution could work for both.

Finally, the example above is for (re)starting a VM (i.e. a highly available one), but the same logic is requested for other tasks, such as draining nodes.
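For context, the manual workaround above boils down to asking KubeVirt to live migrate one VMI, e.g. via `virtctl migrate` or by creating a VirtualMachineInstanceMigration object. A minimal sketch of the latter, with a hypothetical migration name:

```yaml
# Sketch of the manual step the RFE asks to have automated: a live-migration
# request for one running VMI so that memory frees up on its current node.
# Roughly equivalent to "virtctl migrate centos7-loud-bear"; the metadata
# name is hypothetical.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-centos7-loud-bear
spec:
  vmiName: centos7-loud-bear
```

The request is for this kind of rebalancing migration to be issued automatically whenever a pending VM would fit after it, instead of requiring the admin to pick and migrate a VM by hand.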