Bug 1731819
| Field | Value |
|---|---|
| Summary: | Migration pod becomes "Evicted" because the node runs low on ephemeral-storage after running quite a few migrations |
| Product: | Container Native Virtualization (CNV) |
| Component: | Virtualization |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Version: | 2.0 |
| Target Release: | 4.8.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Yan Du <yadu> |
| Assignee: | Itamar Holder <iholder> |
| QA Contact: | Israel Pinto <ipinto> |
| CC: | cnv-qe-bugs, dvossel, fdeutsch, iholder, ipinto, kbidarka, ncredi, rmohr, roman, sgordon, sgott, vromanso, zpeng |
| Fixed In Version: | hco-bundle-registry-container-v4.8.0-347 virt-operator-container-v4.8.0-58 |
| Type: | Bug |
| Last Closed: | 2021-07-27 14:20:49 UTC |
Description

Yan Du, 2019-07-22 06:49:12 UTC

Created attachment 1592523 [details]: describe node

---

Moving back to virtualization. This is happening during live migration. From a cursory look it seems that Kubernetes is responding in a somewhat surprising way to the namespace being deleted during the migration. Yan, could you please provide the VM definition used for this?

---

Created attachment 1601356 [details]: fedora vm
Hi Stuart, the VM yaml file is in the attachment.

---

Created attachment 1601367 [details]: vm spec

Stuart, I actually hit the issue while debugging the network migration automation script, and the VM is configured with a multus network based on the fedora template in comment #4. I attached it as well.

---

Vladik, could you investigate this?

---

Kedar, is this still an issue?

---

"The node was low on resource: ephemeral-storage. Container volumecontainerdisk was using 4Ki, which exceeds its request of 0. Container compute was using 212Ki, which exceeds its request of 0."

This means that pods got evicted because they were using ephemeral storage that they did not request. KubeVirt should reflect that it actually is consuming ephemeral resources in these two containers by requesting an educated-guess amount of ephemeral storage, let's say 1Mi. Is this something that we've addressed in code already?

---

(In reply to Fabian Deutsch from comment #12)
> "The node was low on resource: ephemeral-storage. Container
> volumecontainerdisk was using 4Ki, which exceeds its request of 0. Container
> compute was using 212Ki, which exceeds its request of 0."
>
> This means that pods got evicted because they were using ephemeral storage
> that they did not request.
> KubeVirt should reflect that it actually is consuming ephemeral resources in
> these two containers by requesting an educated-guess amount of ephemeral
> storage, let's say: 1Mi.

I like Fabian's suggestion. Sounds like that is the proper thing to do.

---

Set evictionStrategy: LiveMigrate on the VMI spec and the migration pod likely won't get evicted.

---

> I like Fabian's suggestion. Sounds like that is the proper thing to do.

@rmohr, you saw that the VMI yaml didn't have evictionStrategy set, right?
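For context, a minimal sketch of where this field sits on a VMI, assuming the current `kubevirt.io/v1` API (older releases used `v1alpha3`; the VMI name here is illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vm-fedora                   # illustrative name
spec:
  evictionStrategy: LiveMigrate     # evictions trigger a live migration instead of killing the pod
  domain:
    resources:
      requests:
        memory: 1Gi
```

With this set, an eviction request against the virt-launcher pod is intercepted and turned into a live migration rather than an outright delete.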
This was reported against 2.0... did we even have an evictionStrategy back then?

We also copied the container disk then; not sure if this is relevant today.

I was debating whether we should still request a minor amount of ephemeral-storage; we do have temporary files, IIRC. It's probably not enough to cause an eviction.

---

(In reply to David Vossel from comment #17)
> > I like Fabian's suggestion. Sounds like that is the proper thing to do.
>
> @rmohr, you saw that the VMI yaml didn't have evictionStrategy set, right?

Yes, the attached yaml does not have it set. I think this eviction is coming from the kubelet and will happen independent of PDBs. I think that if you are in the Burstable QoS class, you can get deleted if you are above your requests, independent of PDBs. I am almost sure, but would have to try it again. Independent of that, I think that our workloads should by default never get into the situation of being evicted (whether via /evict or a delete from the kubelet) due to infra overhead.

---

(In reply to Vladik Romanovsky from comment #18)
> This was reported against 2.0... did we even have an evictionStrategy back
> then?
>
> We also copied the container disk then; not sure if this is relevant today.
>
> I was debating whether we should still request a minor amount of
> ephemeral-storage; we do have temporary files, IIRC. It's probably not
> enough to cause an eviction.

I think that every Ki above the request will make the pod a candidate for eviction. We have the issue that some of our disks, `ephemeral`, `containerDisk` and `emptyDisk` for instance, just write into emptyDirs. In that case I agree that users have to add the request to the VMI to be guarded against this type of eviction. However, I would still want to add a request of e.g. 5Mi in general to cover our overhead, since I am pretty sure that every Ki above the limit is too much.

---

> I think this eviction is coming from the kubelet and will happen independent of PDBs.
> I think that if you are in the Burstable QoS class, you can get deleted if you are above your requests, independent of PDBs. I am almost sure, but would have to try it again.

I understand now and agree that adding a storage request is a good first step.
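The kind of request discussed above would land on the containers of the generated virt-launcher pod. A sketch of what that fragment could look like (container names are taken from the eviction message; the 5Mi figure is the educated guess from this thread, not a measured value):

```yaml
# Fragment of a virt-launcher pod spec (generated by KubeVirt, not written by users).
# Requesting a small amount of ephemeral-storage keeps actual usage below the
# request, so the kubelet does not single the pod out under disk pressure.
spec:
  containers:
  - name: compute
    resources:
      requests:
        ephemeral-storage: 5Mi   # assumed overhead figure from the discussion above
  - name: volumecontainerdisk
    resources:
      requests:
        ephemeral-storage: 5Mi
```

As long as actual usage stays below the request, the pod is no longer a preferred eviction candidate when the node comes under ephemeral-storage pressure.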
Do you have a link to the PR that fixes this bug?

---

(In reply to Kedar Bidarkar from comment #22)
> Do you have a link to the PR that fixes this bug?

Yes I do: https://github.com/kubevirt/kubevirt/pull/5013

---

Verified with build hco-bundle-registry-container-v4.8.0-367 virt-operator-container-v4.8.0-58.

Steps:
1. Create a namespace
2. Create a fedora vm
3. Do migrations
4. Delete the migration/vm/namespace
5. Repeat steps 1-4 quite a few times

Ran steps 1-4 more than 10 times; no Evicted pod found:

    NAME                            READY   STATUS      RESTARTS   AGE
    virt-launcher-vm-fedora-k55nl   1/1     Running     0          7m43s
    virt-launcher-vm-fedora-lvkmj   0/1     Completed   0          19m
    virt-launcher-vm-fedora-rxjgb   0/1     Completed   0          18m

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920
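For reference, the migration in step 3 can be triggered declaratively with a VirtualMachineInstanceMigration object, roughly like this (the object name, namespace, and VMI name are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job        # illustrative
  namespace: test-ns         # illustrative
spec:
  vmiName: vm-fedora         # must match a running VMI in the same namespace
```

Applying this object makes virt-controller schedule a target virt-launcher pod and live-migrate the VMI to it.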