Created attachment 1559375 [details]
vm yaml

Description of problem:
When the target node does not have enough disk space for creating the new pod during migration, it can forcibly evict other pods:

1. I have 2 VMs with containerDisk running on Node-1:

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-ft2kx   2/2     Running   0          100s    10.130.0.142   working-wxjj9-worker-0-2qvfn   <none>
virt-launcher-vm-noevict2-7rqrq   2/2     Running   0          97s     10.130.0.143   working-wxjj9-worker-0-2qvfn   <none>

2. The first VM is migrated to Node-2:

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-87rrj   2/2     Running   0          2m31s   10.129.0.224   working-wxjj9-worker-0-8wljt   <none>
virt-launcher-vm-noevict2-7rqrq   2/2     Running   0          9m56s   10.130.0.143   working-wxjj9-worker-0-2qvfn   <none>

Free disk space on Node-2 is very limited:

[root@working-wxjj9-worker-0-8wljt ~]# df -h | grep sysr
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/vda3        16G   13G   2.6G   84%  /sysroot

3. When I run migration of the second VM, it does not verify free space on the target node and initiates the migration. The disk on Node-2 fills up to 100%, and after that the first VM is evicted from Node-2:

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-87rrj   0/2     Evicted   0          3m20s   <none>         working-wxjj9-worker-0-8wljt   <none>
virt-launcher-vm-noevict2-7rqrq   2/2     Running   0          10m     10.130.0.143   working-wxjj9-worker-0-2qvfn   <none>
virt-launcher-vm-noevict2-kpzh9   2/2     Running   0          17s     10.129.0.225   working-wxjj9-worker-0-8wljt   <none>

4. VM-2 migrated to Node-2 simply by taking the place of VM-1:

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-h8pz4   2/2     Running   0          12m   10.130.0.146   working-wxjj9-worker-0-2qvfn   <none>
virt-launcher-vm-noevict2-kpzh9   2/2     Running   0          12m   10.129.0.225   working-wxjj9-worker-0-8wljt   <none>

VM-1 was terminated and recreated on Node-1.
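(For reference, the migrations in steps 2 and 3 are of the kind typically triggered with a VirtualMachineInstanceMigration object or "virtctl migrate". The following is only an illustrative sketch, not the exact manifest used in this reproduction; the object name is a placeholder and the VMI name is taken from the pod names above:)

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-vm-noevict2        # placeholder name
spec:
  vmiName: vm-noevict2             # VMI to move to another schedulable node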
IIUIC the eviction of the pods on the second node is probably happening because the node is coming under disk pressure, which can lead to pod eviction. Another thought is whether we have a specific priority class on the pod which could cause the eviction. Denys, in what state is the kubelet/node when the pod is getting evicted?
Fabian, the target node shows that it has pressure and tries to evict the pod:

Events:
  Type     Reason                 Age                From                                   Message
  ----     ------                 ----               ----                                   -------
  Warning  EvictionThresholdMet   43m (x8 over 3d)   kubelet, working-wxjj9-worker-0-8wljt  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure    42m (x7 over 3d)   kubelet, working-wxjj9-worker-0-8wljt  Node working-wxjj9-worker-0-8wljt status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure  37m (x11 over 3d)  kubelet, working-wxjj9-worker-0-8wljt  Node working-wxjj9-worker-0-8wljt status is now: NodeHasNoDiskPressure

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-92mmf   0/2     Evicted   0          46m     <none>         working-wxjj9-worker-0-8wljt   <none>
virt-launcher-vm-noevict1-h8pz4   2/2     Running   0          2d19h   10.130.0.146   working-wxjj9-worker-0-2qvfn   <none>
If a node gets under disk pressure, some pods will be removed from the node. The order is: lowest QoS class first; within the same QoS class, the Pods/VMs with the biggest emptyDir/emptyDisk/image-layer writes are evicted first. This is an emergency procedure which has higher priority than application needs, therefore it ignores PodDisruptionBudgets and simply deletes Pods. That also means that VMIs will not be live-migrated (which is the equivalent of ignoring application needs). First, we can't influence that; second, I don't think we should influence that. What do others think?
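(For reference, the QoS class the kubelet uses when ranking pods for eviction is visible on the pod itself. An illustrative way to check it, using a pod name from this report:)

# oc get pod virt-launcher-vm-noevict1-87rrj -o jsonpath='{.status.qosClass}'
(prints BestEffort, Burstable or Guaranteed, depending on the VMI's resource requests/limits)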
Roman, can we somehow prevent scheduling a VM (pod) on a node when we know that it will get into a pressure situation?
I think the current behaviour is fine from the KubeVirt perspective. In general the scheduler does not schedule to nodes with NodePressure (but it can take some time until the node-pressure condition is announced and visible to the scheduler). If VMs should not be evicted, you will need the QoS class Guaranteed.
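(To illustrate: a VMI ends up in the Guaranteed QoS class when its CPU and memory requests equal its limits. A minimal sketch, not taken from the attached vm yaml; names and sizes are placeholders:)

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vmi-guaranteed              # placeholder name
spec:
  domain:
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:                       # equal to the requests -> Guaranteed QoS
        cpu: "1"
        memory: 1Gi
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
  volumes:
  - name: containerdisk
    containerDisk:
      image: kubevirt/cirros-container-disk-demo   # demo image, placeholder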
Arik, I think the challenge here is that we don't know why this is allowed. Should the system be able to protect us from over-allocating disk space? Is there something we're doing wrong that allows this? Is there something we could do to actively prevent this?
As far as I can tell, the scheduler just does not take the expected storage consumption of the pod into account. I see in [1] that it is planned to have storage consumption represented in terms of resource requests/limits so the k8s scheduler would consider them, but it is not yet implemented. And I think that even if we add the storage consumption to the resources section and extend the scheduler [2] to filter the nodes accordingly, we have no information about the actual (which makes sense to map to a resource request) and virtual (which makes sense to map to a resource limit) sizes of a container-disk [3]. Does it make sense?

[1] https://kubernetes.io/docs/concepts/storage/#resources
[2] https://kubernetes.io/docs/concepts/extend-kubernetes/extend-cluster/#scheduler-extensions
[3] https://kubevirt.io/user-guide/docs/latest/creating-virtual-machines/disks-and-volumes.html#containerdisk
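(To illustrate the idea: Kubernetes already models local ephemeral storage as a schedulable resource, so one could imagine the virt-launcher pod declaring something like the following, where the request maps to the actual size and the limit to the virtual size of the container-disk. This is a hypothetical sketch only; the values are made up and KubeVirt does not set these fields today:)

resources:
  requests:
    ephemeral-storage: 310Mi    # hypothetical: actual (on-disk) size of the container-disk
  limits:
    ephemeral-storage: 10Gi     # hypothetical: virtual size of the container-disk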
Fabian, can you please explain why this is a high-priority and high-severity bug? It is not interesting for deployments that don't involve migrations, and in those that do involve migrations we can assume shared storage, no?
Sure, Arik. The problem is largely around testing, as testing uses containerDisks. Also: even in an environment with shared storage, where live migration is used, it can well be that a user is using a containerDisk in addition to a shared disk, and such a user would be affected by this bug as well, no? But maybe it is not as urgent - pushing it out to 2.2. Please ask to move it back if this is happening often.
This is solved in 2.1. There the containerDisk does not require extra space (except for what is written to the disk since it started, which should be close to 0 in our tests). Before that, the whole disk was copied out of the container.
Here is the PR: https://github.com/kubevirt/kubevirt/pull/2395
Just tested; as mentioned earlier, we no longer copy the whole disk. The disk consumption does not increase when we start and stop the VM. Used 2 separate images, as I could not quickly find a larger image to test with. Each image was around 308 MB. The storage consumption on the node did not increase on start/stop of the VM. VERIFIED with container-native-virtualization/hyperconverged-cluster-operator:v2.2.0-9 and container-native-virtualization/virt-operator:v2.2.0-10
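(For reference, this kind of verification boils down to comparing node disk usage before and after starting the VM. An illustrative sequence, with the node and VM names taken from this report as placeholders:)

[root@working-wxjj9-worker-0-8wljt ~]# df -h | grep sysr     # record Used/Avail before
# virtctl start vm-noevict1                                  # start the containerDisk VM
[root@working-wxjj9-worker-0-8wljt ~]# df -h | grep sysr     # Used should not grow noticeably
# virtctl stop vm-noevict1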
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:0307