Bug 1703603 - Migration of VM with containerDisk can forcibly evict other pods when there is not enough space on the node
Summary: Migration of VM with containerDisk can forcibly evict other pods when there is not enough space on the node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.2.0
Assignee: Arik
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-26 21:21 UTC by Denys Shchedrivyi
Modified: 2020-01-30 16:27 UTC
CC List: 8 users

Fixed In Version: hyperconverged-cluster-operator-container-v2.2.0-3 virt-operator-container-v2.2.0-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-30 16:27:11 UTC
Target Upstream Version:
Embargoed:


Attachments
vm yaml (997 bytes, text/plain)
2019-04-26 21:21 UTC, Denys Shchedrivyi


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2020:0307 0 None None None 2020-01-30 16:27:21 UTC

Description Denys Shchedrivyi 2019-04-26 21:21:56 UTC
Created attachment 1559375 [details]
vm yaml

Description of problem:

When the target node does not have enough disk space to create the new pod during migration, it can forcibly evict other pods:

1. I have 2 VMs with containerDisk running on Node-1:
# oc get pod -o wide
NAME                              READY     STATUS    RESTARTS   AGE       IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-ft2kx   2/2       Running   0          100s      10.130.0.142   working-wxjj9-worker-0-2qvfn   <none>
virt-launcher-vm-noevict2-7rqrq   2/2       Running   0          97s       10.130.0.143   working-wxjj9-worker-0-2qvfn   <none>



2. The first VM is migrated to Node-2:
# oc get pod -o wide
NAME                              READY     STATUS    RESTARTS   AGE       IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-87rrj   2/2       Running   0          2m31s     10.129.0.224   working-wxjj9-worker-0-8wljt   <none>
virt-launcher-vm-noevict2-7rqrq   2/2       Running   0          9m56s     10.130.0.143   working-wxjj9-worker-0-2qvfn   <none>

Free disk space on Node-2 is very limited:

[root@working-wxjj9-worker-0-8wljt ~]# df -h | grep sysr
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda3        16G   13G  2.6G  84% /sysroot



3. When I run the migration of the second VM, it does not verify free space on the target node and initiates the migration. The disk on Node-2 fills up to 100%, and after that the first VM is evicted from Node-2:
# oc get pod -o wide
NAME                              READY     STATUS    RESTARTS   AGE       IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-87rrj   0/2       Evicted   0          3m20s     <none>         working-wxjj9-worker-0-8wljt   <none>
virt-launcher-vm-noevict2-7rqrq   2/2       Running   0          10m       10.130.0.143   working-wxjj9-worker-0-2qvfn   <none>
virt-launcher-vm-noevict2-kpzh9   2/2       Running   0          17s       10.129.0.225   working-wxjj9-worker-0-8wljt   <none>


4. VM-2 migrated to Node-2, effectively taking the place of VM-1:
# oc get pod -o wide
NAME                              READY     STATUS    RESTARTS   AGE       IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-h8pz4   2/2       Running   0          12m       10.130.0.146   working-wxjj9-worker-0-2qvfn   <none>
virt-launcher-vm-noevict2-kpzh9   2/2       Running   0          12m       10.129.0.225   working-wxjj9-worker-0-8wljt   <none>


 VM-1 was terminated and recreated on Node-1
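
For reference (the attached vm yaml is not reproduced here), the objects involved in such a reproduction are roughly of the following shape; the names, image, and API version are illustrative placeholders, not the exact content of the attachment:

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: vm-noevict2                          # placeholder, matching the pods above
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
        resources:
          requests:
            memory: 64Mi
      volumes:
      - name: containerdisk
        containerDisk:
          image: kubevirt/cirros-container-disk-demo   # placeholder image
---
# Triggering the migration (step 3) would typically be done with an object like
# this, or with "virtctl migrate vm-noevict2".
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-vm-noevict2
spec:
  vmiName: vm-noevict2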

Comment 1 Fabian Deutsch 2019-04-29 12:06:02 UTC
IIUIC the eviction of the pods on the second node is probably happening because the node is coming under disk pressure, which can lead to pod eviction.
Another thought: do we have a specific priority class on the pod which could cause the eviction?

Denys, in what state is the kubelet/node when the pod is getting evicted?

Comment 2 Denys Shchedrivyi 2019-04-29 16:12:35 UTC
Fabian, the target node shows that it is under disk pressure and tries to evict the pod:


Events:
  Type     Reason                 Age                From                                   Message
  ----     ------                 ----               ----                                   -------
  Warning  EvictionThresholdMet   43m (x8 over 3d)   kubelet, working-wxjj9-worker-0-8wljt  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure    42m (x7 over 3d)   kubelet, working-wxjj9-worker-0-8wljt  Node working-wxjj9-worker-0-8wljt status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure  37m (x11 over 3d)  kubelet, working-wxjj9-worker-0-8wljt  Node working-wxjj9-worker-0-8wljt status is now: NodeHasNoDiskPressure


# oc get pod -o wide
NAME                              READY     STATUS    RESTARTS   AGE       IP             NODE                           NOMINATED NODE
virt-launcher-vm-noevict1-92mmf   0/2       Evicted   0          46m       <none>         working-wxjj9-worker-0-8wljt   <none>
virt-launcher-vm-noevict1-h8pz4   2/2       Running   0          2d19h     10.130.0.146   working-wxjj9-worker-0-2qvfn   <none>

Comment 5 Roman Mohr 2019-05-14 07:59:38 UTC
If a node gets under disk pressure, some pods will be removed from the node. The order is: lowest QoS class first; within the same QoS class, the Pods/VMs with the biggest emptyDir/emptyDisk/image-layer writes are evicted first. This is an emergency procedure which has higher priority than application needs, therefore it ignores PodDisruptionBudgets and simply deletes Pods. That also means that VMIs will not be live-migrated (which is the equivalent of ignoring application needs).

First we can't influence that, second I don't think we should influence that. What do others think?
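
For context, a sketch of the kind of PodDisruptionBudget that KubeVirt maintains for migratable VMIs (illustrative only; the real object is created by virt-controller and its name and selector may differ). Node-pressure eviction is performed by the kubelet directly and therefore bypasses such a budget:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kubevirt-disruption-budget-example   # hypothetical name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      kubevirt.io/created-by: <vmi-uid>      # label carried by the virt-launcher pod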

Comment 7 Fabian Deutsch 2019-06-20 12:05:30 UTC
Roman, can we somehow prevent scheduling a VM (pod) on a node when we know that it will get into a pressure situation?

Comment 9 Roman Mohr 2019-07-09 13:29:33 UTC
I think the current behaviour is fine from the KubeVirt perspective. In general the scheduler does not schedule to nodes with NodePressure (but it can take some time until the node-pressure condition is announced and visible to the scheduler). If VMs should not be evicted, they will need the QoS class Guaranteed.
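
As a sketch, the "Guaranteed" QoS class mentioned above is obtained by setting requests equal to limits on the VMI; the name, image, and values below are placeholders:

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vm-noevict1                          # placeholder
spec:
  domain:
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"                             # requests == limits -> the virt-launcher
        memory: 1Gi                          # pod gets the Guaranteed QoS class
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
  volumes:
  - name: containerdisk
    containerDisk:
      image: kubevirt/cirros-container-disk-demo   # placeholder image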

Comment 10 sgott 2019-08-12 20:37:01 UTC
Arik,

I think the challenge here is that we don't know why this is allowed. Should the system be able to protect us from over-allocating disk space? Is there something we're doing wrong that allows this? Is there something we could do to actively prevent this?

Comment 11 Arik 2019-08-13 08:43:39 UTC
As far as I can tell, the scheduler just does not take the expected storage consumption of the pod into account.

I see in [1] that it is planned to have the storage consumption represented in terms of resource requests/limits so the k8s scheduler would consider them, but this is not yet implemented.

And I think that even if we add the storage consumption to the resources section and extend the scheduler [2] to filter the nodes accordingly, we have no information about the actual size (which makes sense to map to the resource request) and the virtual size (which makes sense to map to the resource limit) of a container disk [3].

Does that make sense?

[1] https://kubernetes.io/docs/concepts/storage/#resources
[2] https://kubernetes.io/docs/concepts/extend-kubernetes/extend-cluster/#scheduler-extensions
[3] https://kubevirt.io/user-guide/docs/latest/creating-virtual-machines/disks-and-volumes.html#containerdisk
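
To illustrate the mapping suggested above (purely hypothetical; KubeVirt did not set these fields at the time), the resources section of the virt-launcher compute container could carry the disk footprint as ephemeral-storage requests/limits so the scheduler would account for it:

resources:
  requests:
    ephemeral-storage: 1Gi     # actual (used) size of the container disk -> request
  limits:
    ephemeral-storage: 10Gi    # virtual (maximum) size of the disk -> limit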

Comment 12 Arik 2019-08-19 11:47:24 UTC
Fabian, can you please explain why this is a high-priority and high-severity bug?
It is not interesting for deployments that don't involve migrations, and in those that do involve migrations we can assume shared storage, no?

Comment 13 Fabian Deutsch 2019-08-20 12:28:03 UTC
Sure, Arik.

So, the problem is largely around testing, as testing uses containerDisks.

Also: even in an env with shared storage, where live migration is used, it can well be that a user is using a containerDisk in addition to a shared disk, and such a user would be affected by this bug as well, no?

But maybe it is not as urgent - pushing it out to 2.2.

Please ask to move it back if this is happening often.

Comment 16 Roman Mohr 2019-10-23 14:20:34 UTC
This is solved in 2.2. There the containerDisk does not require extra space (except for what is written to the disk since it started, which should be close to 0 in our tests). Before that the whole disk was copied out of the container.

Comment 17 Roman Mohr 2019-10-23 14:21:17 UTC
> This is solved in 2.2. There the containerDisk does not require extra space (except for what is written to the disk since it started, which should be close to 0 in our tests). Before that the whole disk was copied out of the container.

I meant solved in 2.1.

Comment 18 Roman Mohr 2019-10-23 14:22:28 UTC
Here the PR: https://github.com/kubevirt/kubevirt/pull/2395

Comment 20 Kedar Bidarkar 2020-01-08 13:55:16 UTC
Just tested; as mentioned earlier, we do not copy the whole disk.

The disk consumption does not increase when we start and stop the VM.


Used 2 separate images as could quickly find a larger image to test.

Each image was around 308 MB. 

The storage consumption on the node did not increase on the start/stop of the VM.


VERIFIED with container-native-virtualization/hyperconverged-cluster-operator:v2.2.0-9 and container-native-virtualization/virt-operator:v2.2.0-10

Comment 21 Kedar Bidarkar 2020-01-08 13:56:13 UTC
Correction: Used 2 separate images as could not* quickly find a larger image to test.

Comment 23 errata-xmlrpc 2020-01-30 16:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0307

