Bug 2099479

Summary: importing multiple DVs at once - import pod OOMKilled
Product: Container Native Virtualization (CNV) Reporter: Roni Kishner <rkishner>
Component: Storage Assignee: Arnon Gilboa <agilboa>
Status: CLOSED WONTFIX QA Contact: Yan Du <yadu>
Severity: medium Docs Contact:
Priority: high    
Version: 4.11.0 CC: agilboa, akalenyu, alitke, dshchedr, llong, stefanha, yadu
Target Milestone: ---   
Target Release: 4.12.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-02-01 14:04:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Roni Kishner 2022-06-21 05:40:15 UTC
Description of problem:
When importing multiple new DVs using dataSources all at once, some of the importer pods crash with an OOMKilled error, causing the import process to restart. This slowly stabilizes as fewer DVs remain to be imported.


Version-Release number of selected component (if applicable):
all 4.11 versions

How reproducible:
100%

Steps to Reproduce:
1. Start 6 CronJobs of importing DVs at once

Actual results:
The imports take more than 45 minutes because of repeated restarts

Expected results:
The imports should complete without the importer pod crashing

Additional info:
The importer pod is set with a memory limit of 600M; we should check whether increasing the importer pods' memory limit stabilizes this process.
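One hedged way to test the suggestion above, assuming the cluster's CDI custom resource is named "cdi" and that its podResourceRequirements config applies to importer pods, is to raise the limit cluster-wide. This is a sketch only; the values are illustrative, not recommendations:

```yaml
# Sketch: bump the CDI worker pod (importer/clone/upload) resources via the
# CDI CR. The resource name ("cdi") and all values are assumptions.
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  name: cdi
spec:
  config:
    podResourceRequirements:
      requests:
        cpu: 100m
        memory: 60M
      limits:
        cpu: 750m
        memory: 1200M
```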

Comment 1 Denys Shchedrivyi 2022-10-14 19:10:44 UTC
Wanted to add - on the PSI cluster I see the same picture (importer pod restarts) even when importing only one DV. Based on Prometheus metrics, the importer pod consumes only ~340MB and is OOMKilled without exceeding the memory limit. I've attached the screenshot

Comment 3 Alex Kalenyuk 2022-11-06 10:47:38 UTC
I am battling a similar issue that manifests itself as a flake in kubevirt upstream CI.
I have reproduced it on CDI main a few times, definitely not 100% reproducible.
Note this DV will require scratch space (Filesystem PVC) but the target is a Block PVC,
just one potential suspect. (i/o from fs to block spike?)

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: memorypressure-dv
spec:
  source:
    registry:
      url: docker://quay.io/kubevirt/fedora-with-test-tooling-container-disk:v0.53.2
  pvc:
    storageClassName: rook-ceph-block
    volumeMode: Block
    accessModes:
    - ReadWriteMany
    resources:
      requests:
        storage: 6Gi

Comment 4 Arnon Gilboa 2022-11-06 16:59:53 UTC
So far I've had no luck reproducing it, even with 6 parallel imports based on Alex's spec.

Roni, Denys - please help with reproduction.

Comment 7 Roni Kishner 2022-11-09 09:21:49 UTC
Arnon did you manage to reproduce it?
If not, it makes sense that it is related only to a slow environment, since the dataSources take a really long time to download the DVs, and when all the DVs are being downloaded there is not enough memory.

Comment 8 Arnon Gilboa 2022-11-09 09:39:18 UTC
Roni, I easily reproduced it using Alex Kalenyuk's spec, which is just a single import with no DataImportCrons/DataSources at all.
Please try it on your cluster as well.

Comment 9 Roni Kishner 2022-11-09 17:51:25 UTC
Managed to reproduce it when importing 10 DVs at once.

Do you know if this can be avoided, or is this the expected result? Maybe we really do run out of memory when that many importer pods are running, so this is not truly a bug.

Comment 12 Stefan Hajnoczi 2022-11-14 21:40:26 UTC
Thoughts on the qemu-img OOM:

1.
Using "qemu-img -t none -T none" bypasses the page cache. qemu-img uses a buffer size of 2-32MB, depending on the attributes of the underlying device. The idea is that this buffer size should be sufficient for efficient bulk I/O.

The default read iodepth is 8 and the write iodepth is 1 (to ensure sequential writes). The sequential write limitation can be relaxed with the "-W" option and then it defaults to 8. The iodepth can be adjusted with "-m <num-coroutines>".

Bypassing the page cache should avoid the OOM issue, but performance may be worse than with cache=writeback because there is no read-ahead/write-behind. If you experience performance issues, let's discuss it upstream on <qemu-devel>.

2.
The "-r <bytes-per-second>" rate limit option can be used to limit throughput. The rate limit is probably not an effective way of working around the OOM, but it is a useful tool for capping qemu-img's I/O bandwidth consumption so it doesn't interfere with other VMs.

3.
Finally, if you really want to use cache=writeback in this environment where the OOM killer activates instead of throttling the process, then it should be possible to extend qemu-img to invoke fsync periodically and wait. That way the amount of dirty memory can be bounded. This can be discussed upstream on <qemu-devel>. This adds another tunable though and I would go with cache=none instead.
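For reference, the options in points 1 and 2 above map to qemu-img convert flags roughly as follows. This is a sketch, not the exact CDI invocation; the source/destination paths and the rate value are placeholders:

```shell
# (1) Bypass the page cache on both source (-T) and destination (-t);
#     optionally relax sequential-write ordering (-W) and set the
#     iodepth with -m <num-coroutines>:
qemu-img convert -t none -T none -W -m 8 -O raw src.qcow2 dst.raw

# (2) Cap throughput with the rate limit option (bytes per second;
#     104857600 = ~100 MB/s, an example value):
qemu-img convert -r 104857600 -O raw src.qcow2 dst.raw
```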

Comment 13 Alex Kalenyuk 2022-11-16 15:11:56 UTC
(In reply to Stefan Hajnoczi from comment #12)
> Thoughts on the qemu-img OOM:
> 
> 1.
> Using "qemu-img -t none -T none" bypasses the page cache. qemu-img uses a
> buffer size of 2-32MB, depending on the attributes of the underlying device.
> The idea is that this buffer size should be sufficient for efficient bulk
> I/O.
> 
> The default read iodepth is 8 and the write iodepth is 1 (to ensure
> sequential writes). The sequential write limitation can be relaxed with the
> "-W" option and then it defaults to 8. The iodepth can be adjusted with "-m
> <num-coroutines>".
> 
> Bypassing the page cache should avoid the OOM issue, but performance may be
> worse than with cache=writeback because there is no read-ahead/write-behind.
> If you experience performance issues, let's discuss it upstream on
> <qemu-devel>.
> 
> 2.
> The "-r <bytes-per-second>" rate limit option can be used to limit
> throughput. The rate limit is probably not an effective way of working
> around the OOM, but it is a useful tool for capping qemu-img's I/O bandwidth
> consumption so it doesn't interfere with other VMs.
> 
> 3.
> Finally, if you really want to use cache=writeback in this environment where
> the OOM killer activates instead of throttling the process, then it should
> be possible to extend qemu-img to invoke fsync periodically and wait. That
> way the amount of dirty memory can be bounded. This can be discussed
> upstream on <qemu-devel>. This adds another tunable though and I
> would go with cache=none instead.

Stefan, do you think this is a problem that could occur in real-world setups?
> I think the process might have been writing dirty file data faster than the kernel can flush it out to the filesystem storage

We think that our local development setups (backed by non-real disks), and the ones in QE, are simply not up to par and thus present this situation.
Basically, we suspect this might not be a real problem, and we don't want to start introducing and fixing a bunch of things just to work well with crappy storage. WDYT?
@stefanha

Comment 14 Stefan Hajnoczi 2022-12-08 19:59:49 UTC
(In reply to Alex Kalenyuk from comment #13)
> (In reply to Stefan Hajnoczi from comment #12)
> > [Thoughts on the qemu-img OOM, quoted in full in comment 13 above]
> 
> Stefan, do you think this is a problem that could occur in real-world setups?
> > I think the process might have been writing dirty file data faster than the kernel can flush it out to the filesystem storage
> 
> We think that our local development setups (backed by non real disks) or the
> ones in QE are simply
> not up to par and thus present this situation.
> Basically, we suspect this might not be a problem and don't want to start
> introducing and fixing a bunch of things
> just to work well with crappy storage. WDYT?
> @stefanha

Waiman Long mentioned that cgroup v2 can throttle processes instead of killing them.

There is information about cgroup v2 in Kubernetes here:
https://kubernetes.io/blog/2021/11/26/qos-memory-resources/

It mentions that with cgroup v2, processes that reach the limit are throttled instead of killed.

Based on this, I think it's okay to leave things as they are. In newer Kubernetes deployments with cgroup v2 this issue will not occur.
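The cgroup v2 behavior mentioned above is opted into on the kubelet side. A sketch, assuming the MemoryQoS feature gate described in that blog post; the throttling factor value is an example:

```yaml
# Sketch: on cgroup v2 nodes, the MemoryQoS feature gate makes the kubelet set
# memory.high (throttling boundary) below memory.max (the OOM-kill boundary),
# so a pod nearing its limit is slowed down rather than killed.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
# memory.high is derived as memoryThrottlingFactor * memory limit:
memoryThrottlingFactor: 0.8
```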

Comment 16 Adam Litke 2023-02-01 14:04:05 UTC
I'm going to close this as WONTFIX for the following reasons:
- It is only reproducible in constrained testing environments
- It eventually succeeds
- With cgroups-v2 this should no longer be an issue

We don't want to harm performance of typical environments to optimize for the test environment.