Bug 2099479
| Summary: | importing multiple DVs at once - import pod OOMKilled | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Roni Kishner <rkishner> |
| Component: | Storage | Assignee: | Arnon Gilboa <agilboa> |
| Status: | CLOSED WONTFIX | QA Contact: | Yan Du <yadu> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.11.0 | CC: | agilboa, akalenyu, alitke, dshchedr, llong, stefanha, yadu |
| Target Milestone: | --- | | |
| Target Release: | 4.12.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-02-01 14:04:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Roni Kishner
2022-06-21 05:40:15 UTC
Wanted to add: on the PSI cluster I see the same picture (importer pod restarts) even when importing only one DV. Based on Prometheus metrics, the importer pod consumes only ~340MB and is OOMKilled without exceeding the memory limit. I've attached the screenshot.

I am battling a similar issue that manifests itself as a flake in kubevirt upstream CI.
I have reproduced it on CDI main a few times, definitely not 100% reproducible.
Note this DV will require scratch space (a Filesystem PVC) while the target is a Block PVC;
just one potential suspect (an I/O spike going from fs to block?).
```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: memorypressure-dv
spec:
  source:
    registry:
      url: docker://quay.io/kubevirt/fedora-with-test-tooling-container-disk:v0.53.2
  pvc:
    storageClassName: rook-ceph-block
    volumeMode: Block
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 6Gi
```
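The "importing multiple DVs at once" reproduction amounts to applying many copies of the spec above with unique names. A minimal sketch of stamping them out; the `make_dv_manifests` helper is hypothetical, and only `metadata.name` varies between copies:

```python
# Hypothetical helper: stamp out N copies of the DataVolume spec above,
# varying only metadata.name, to drive the parallel-import reproduction.
DV_TEMPLATE = """\
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: memorypressure-dv-{index}
spec:
  source:
    registry:
      url: docker://quay.io/kubevirt/fedora-with-test-tooling-container-disk:v0.53.2
  pvc:
    storageClassName: rook-ceph-block
    volumeMode: Block
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 6Gi
"""

def make_dv_manifests(count: int) -> str:
    """Return a multi-document YAML string with `count` uniquely named DVs."""
    return "---\n".join(DV_TEMPLATE.format(index=i) for i in range(count))

if __name__ == "__main__":
    # e.g. pipe this to `kubectl apply -f -` to start 10 imports at once
    print(make_dv_manifests(10))
```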
So far I had no luck reproducing it, even with 6 parallel imports based on Alex's spec. Roni, Denys - please help with reproduction.

Arnon, did you manage to reproduce it? If not, it makes sense that it is related to a slow environment only, since the DataSources take a really long time to download the DVs, and when all the DVs are being downloaded there is not enough memory.

Roni, I easily reproduced it using Alex Kalenyuk's spec, which is just a single import and no DataImportCrons/DataSources at all. Please try it on your cluster as well.

Managed to reproduce it when importing 10 DVs at once. Do you know if this can be avoided, or is this the expected result? Maybe we really do run out of memory when that many importer pods are running, so this is not truly a bug.

Thoughts on the qemu-img OOM:

1. Using "qemu-img -t none -T none" bypasses the page cache. qemu-img uses a buffer size of 2-32MB, depending on the attributes of the underlying device. The idea is that this buffer size should be sufficient for efficient bulk I/O.

The default read iodepth is 8 and the write iodepth is 1 (to ensure sequential writes). The sequential-write limitation can be relaxed with the "-W" option, and then it defaults to 8. The iodepth can be adjusted with "-m <num-coroutines>".

Bypassing the page cache should avoid the OOM issue, but performance may be worse than with cache=writeback because there is no read-ahead/write-behind. If you experience performance issues, let's discuss it upstream on <qemu-devel>.

2. The "-r <bytes-per-second>" rate-limit option can be used to limit throughput. The rate limit is probably not an effective way of working around the OOM, but it is a useful tool for capping qemu-img's I/O bandwidth consumption so it doesn't interfere with other VMs.

3. Finally, if you really want to use cache=writeback in this environment, where the OOM killer activates instead of throttling the process, then it should be possible to extend qemu-img to invoke fsync periodically and wait.
That way the amount of dirty memory can be bounded. This can be discussed upstream on <qemu-devel>. This adds another tunable though, and I would go with cache=none instead.

(In reply to Stefan Hajnoczi from comment #12)

Stefan, do you think this is a problem that could occur in real-world setups?
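The qemu-img knobs from comment #12 can be collected into a small command-line builder. A sketch only: the flag spellings (`-t`/`-T`, `-W`, `-m`, `-r`) are the ones quoted above, while the helper itself, its defaults, and the `-O raw` output format are assumptions for illustration:

```python
from typing import Optional

def qemu_img_convert_argv(src: str, dst: str,
                          cache_none: bool = True,
                          unordered_writes: bool = False,
                          coroutines: Optional[int] = None,
                          rate_limit: Optional[int] = None) -> list:
    """Build a qemu-img convert command line from the options in comment #12."""
    argv = ["qemu-img", "convert", "-O", "raw"]  # output format assumed here
    if cache_none:
        argv += ["-t", "none", "-T", "none"]  # bypass page cache on dst (-t) and src (-T)
    if unordered_writes:
        argv.append("-W")                     # relax the sequential-write limitation
    if coroutines is not None:
        argv += ["-m", str(coroutines)]       # number of in-flight coroutines
    if rate_limit is not None:
        argv += ["-r", str(rate_limit)]       # cap throughput in bytes per second
    return argv + [src, dst]
```

For example, `qemu_img_convert_argv("disk.qcow2", "/dev/vg/lv", unordered_writes=True, coroutines=8)` yields a cache-bypassing convert with 8 in-flight requests, per item 1 above.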
> I think the process might have been writing dirty file data faster than the kernel can flush it out to the filesystem storage

We think that our local development setups (backed by non-real disks), or the ones in QE, are simply not up to par and thus present this situation. Basically, we suspect this might not be a problem, and we don't want to start introducing and fixing a bunch of things just to work well with crappy storage. WDYT? @stefanha

(In reply to Alex Kalenyuk from comment #13)
Waiman Long mentioned that cgroup v2 can throttle processes instead of killing them. There is information about cgroup v2 in Kubernetes here: https://kubernetes.io/blog/2021/11/26/qos-memory-resources/ It mentions that with cgroup v2, processes that reach the memory limit are throttled instead of killed. Based on this, I think it's okay to leave things as they are. In newer Kubernetes deployments with cgroup v2 this issue will not occur.

I'm going to close this as WONTFIX for the following reasons:

- It is only reproducible in constrained testing environments
- It eventually succeeds
- With cgroups-v2 this should no longer be an issue

We don't want to harm performance of typical environments to optimize for the test environment.
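Since the WONTFIX rationale rests on the node running cgroup v2, a quick way to check which hierarchy a node uses is to look for `cgroup.controllers` at the cgroup mount root, which only the unified (v2) hierarchy exposes. A minimal sketch; the `cgroup_version` helper is hypothetical:

```python
from pathlib import Path

def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if `root` is a unified (cgroup v2) hierarchy, else 1.

    On cgroup v2 the mount root exposes a `cgroup.controllers` file;
    a v1 mount holds per-controller subdirectories instead.
    """
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1

if __name__ == "__main__":
    print(f"detected cgroup v{cgroup_version()}")
```

On a cgroup v1 node, the memory cgroup kills on limit breach (the OOMKilled behavior seen here); on v2, reclaim pressure throttles the writer first, which is why the issue should not occur there.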