This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2196072 - virt-cdi-import often killed by oom-killer during VM provisioning
Summary: virt-cdi-import often killed by oom-killer during VM provisioning
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.12.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.13.7
Assignee: Alex Kalenyuk
QA Contact: Jenia Peimer
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-08 01:31 UTC by Germano Veit Michel
Modified: 2023-12-14 16:14 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-14 16:14:57 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Issue Tracker CNV-28628 (last updated 2023-12-14 16:14:56 UTC)

Description Germano Veit Michel 2023-05-08 01:31:50 UTC
Description of problem:

The cluster often has to attempt to provision a VM multiple times before it actually works, sometimes failing 2-3 times in a row.
It always works after some retries, but quite often fails on the 1st and 2nd attempts.

I've seen this rarely before; now, after upgrading the cluster to 4.13-rc7 (CNV still on 4.12.2), it seems to hit more frequently.

I noticed it's writing with "-t writeback", and page cache accounts for most of the consumed cgroup memory limit (537M out of 585M).

This does not seem to be a good option for converting images under a cgroup limit.

107       127318  127223 54 01:15 ?        00:00:10 qemu-img convert -S 0 -t writeback -p -O raw nbd+unix:///?socket=/tmp/nbdkit.sock /dev/cdi-block-volume
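For reference, the page-cache share of the pod's memory cgroup can be read directly from its memory controller. This is a diagnostic sketch only (the slice path is taken from the oom-kill dump below; this node is on cgroups v1):

```
# Pod slice as reported in the oom-kill output below (cgroup v1 memory controller)
POD_SLICE=/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb18ee349_565f_4cb6_9f2b_f4e67a7a7927.slice

# "cache" is the page cache charged to the pod, "rss" its anonymous memory
grep -E '^(cache|rss) ' "$POD_SLICE/memory.stat"

# Hard limit enforced by the oom-killer (585936kB in the dump below)
cat "$POD_SLICE/memory.limit_in_bytes"
```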

May 08 01:05:17 yellow.toca.local kernel: qemu-img invoked oom-killer: gfp_mask=0x101c00(GFP_NOIO|__GFP_HARDWALL|__GFP_WRITE), order=0, oom_score_adj=999
May 08 01:05:17 yellow.toca.local kernel: CPU: 2 PID: 115058 Comm: qemu-img Kdump: loaded Not tainted 5.14.0-284.13.1.el9_2.x86_64 #1

May 08 01:05:17 yellow.toca.local kernel: Call Trace:
May 08 01:05:17 yellow.toca.local kernel:  <TASK>
May 08 01:05:17 yellow.toca.local kernel:  dump_stack_lvl+0x34/0x48
May 08 01:05:17 yellow.toca.local kernel:  dump_header+0x4a/0x201
May 08 01:05:17 yellow.toca.local kernel:  oom_kill_process.cold+0xb/0x10
May 08 01:05:17 yellow.toca.local kernel:  out_of_memory+0xed/0x2e0
May 08 01:05:17 yellow.toca.local kernel:  ? __wake_up_common+0x7d/0x190
May 08 01:05:17 yellow.toca.local kernel:  mem_cgroup_out_of_memory+0x13a/0x150
May 08 01:05:17 yellow.toca.local kernel:  try_charge_memcg+0x6df/0x7a0
May 08 01:05:17 yellow.toca.local kernel:  charge_memcg+0x9f/0x130
May 08 01:05:17 yellow.toca.local kernel:  __mem_cgroup_charge+0x29/0x80
May 08 01:05:17 yellow.toca.local kernel:  __filemap_add_folio+0x224/0x5a0
May 08 01:05:17 yellow.toca.local kernel:  ? scan_shadow_nodes+0x30/0x30
May 08 01:05:17 yellow.toca.local kernel:  filemap_add_folio+0x37/0xa0
May 08 01:05:17 yellow.toca.local kernel:  __filemap_get_folio+0x1e6/0x330
May 08 01:05:17 yellow.toca.local kernel:  ? blkdev_write_begin+0x20/0x20
May 08 01:05:17 yellow.toca.local kernel:  pagecache_get_page+0x15/0x90
May 08 01:05:17 yellow.toca.local kernel:  block_write_begin+0x24/0xf0
May 08 01:05:17 yellow.toca.local kernel:  generic_perform_write+0xbe/0x200
May 08 01:05:17 yellow.toca.local kernel:  __generic_file_write_iter+0xe5/0x1a0
May 08 01:05:17 yellow.toca.local kernel:  blkdev_write_iter+0xe4/0x160
May 08 01:05:17 yellow.toca.local kernel:  new_sync_write+0xfc/0x190
May 08 01:05:17 yellow.toca.local kernel:  vfs_write+0x1ef/0x280
May 08 01:05:17 yellow.toca.local kernel:  __x64_sys_pwrite64+0x90/0xc0
May 08 01:05:17 yellow.toca.local kernel:  do_syscall_64+0x59/0x90
May 08 01:05:17 yellow.toca.local kernel:  ? do_futex+0x15b/0x200
May 08 01:05:17 yellow.toca.local kernel:  ? __x64_sys_futex+0x73/0x1d0
May 08 01:05:17 yellow.toca.local kernel:  ? switch_fpu_return+0x49/0xd0
May 08 01:05:17 yellow.toca.local kernel:  ? exit_to_user_mode_prepare+0xec/0x100
May 08 01:05:17 yellow.toca.local kernel:  ? syscall_exit_to_user_mode+0x12/0x30
May 08 01:05:17 yellow.toca.local kernel:  ? do_syscall_64+0x69/0x90
May 08 01:05:17 yellow.toca.local kernel:  ? do_syscall_64+0x69/0x90
May 08 01:05:17 yellow.toca.local kernel:  ? sysvec_apic_timer_interrupt+0x3c/0x90
May 08 01:05:17 yellow.toca.local kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd


May 08 01:05:17 yellow.toca.local kernel: memory: usage 585936kB, limit 585936kB, failcnt 131
May 08 01:05:17 yellow.toca.local kernel: memory+swap: usage 585936kB, limit 9007199254740988kB, failcnt 0
May 08 01:05:17 yellow.toca.local kernel: kmem: usage 17716kB, limit 9007199254740988kB, failcnt 0
May 08 01:05:17 yellow.toca.local kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb18ee349_565f_4cb6_9f2b_f4e67a7a7927.slice:
May 08 01:05:17 yellow.toca.local kernel: anon 44654592
                                          file 537202688
                                          kernel 18141184
                                          kernel_stack 262144
                                          pagetables 1024000
                                          percpu 20856
                                          sock 0
                                          vmalloc 28672
                                          shmem 53248
                                          zswap 0
                                          zswapped 0
                                          file_mapped 0
                                          file_dirty 0
                                          file_writeback 537149440
                                          swapcached 0
                                          anon_thp 14680064
                                          file_thp 0
                                          shmem_thp 0
                                          inactive_anon 44646400
                                          active_anon 61440
                                          inactive_file 268509184
                                          active_file 268574720
                                          unevictable 0
                                          slab_reclaimable 16203816
                                          slab_unreclaimable 497488
                                          slab 16701304
                                          workingset_refault_anon 0
                                          workingset_refault_file 0
                                          workingset_activate_anon 0
                                          workingset_activate_file 0
                                          workingset_restore_anon 0
                                          workingset_restore_file 0
                                          workingset_nodereclaim 0
                                          pgscan 4950026
                                          pgsteal 4
                                          pgscan_kswapd 0
                                          pgscan_direct 4950026
                                          pgsteal_kswapd 0
                                          pgsteal_direct 4
                                          pgfault 13780
                                          pgmajfault 0
                                          pgrefill 4884435
                                          pgactivate 4950019
                                          pgdeactivate 4884435
                                          pglazyfree 0
                                          pglazyfreed 0
                                          zswpin 0
                                          zswpout 0
                                          thp_fault_alloc 13
                                          thp_collapse_alloc 0
May 08 01:05:17 yellow.toca.local kernel: Tasks state (memory values in pages):
May 08 01:05:17 yellow.toca.local kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
May 08 01:05:17 yellow.toca.local kernel: [ 114988]     0 114988     2076      485    49152        0         -1000 conmon
May 08 01:05:17 yellow.toca.local kernel: [ 115000]   107 115000   335956    13160   401408        0           999 virt-cdi-import
May 08 01:05:17 yellow.toca.local kernel: [ 115032]   107 115032    95298     2604   262144        0           999 nbdkit
May 08 01:05:17 yellow.toca.local kernel: [ 115053]   107 115053   110130     6764   360448        0           999 qemu-img
May 08 01:05:17 yellow.toca.local kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-3ec0af77f3bab257e52e5a33cfde0e88861bae125913f97b356105b5efd7cc18.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice>
May 08 01:05:17 yellow.toca.local kernel: Memory cgroup out of memory: Killed process 115000 (virt-cdi-import) total-vm:1343824kB, anon-rss:19060kB, file-rss:33580kB, shmem-rss:0kB, UID:107 pgtables:392kB oom_score_adj:999

The image is a qcow2, sourced from an httpd server and written to a topolvm volume.

spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      name: rhvh-1-rootdisk
      namespace: homelab
    spec:
      preallocation: false
      source:
        http:
          url: http://pi.toca.local:8080/images/rhel-8.6.qcow2
      storage:
        resources:
          requests:
            storage: 100Gi
        storageClassName: lvms-ssd-lvmcluster

Version-Release number of selected component (if applicable):
4.12.2

How reproducible:
Always

Steps to Reproduce:
1. Provision a VM
2. Watch the logs on the node and it may fail 1-2 times before a retry succeeds

Comment 1 Yan Du 2023-05-10 12:11:44 UTC
Alex, are we sure we want to use writeback cache mode?

Comment 2 Alex Kalenyuk 2023-05-11 12:43:38 UTC
(In reply to Yan Du from comment #1)
> Alex, are we sure we want to use writeback cache mode?

As far as I understand we are not in the wrong to want to utilize the page cache instead of completely bypassing it ('writeback');
This has come up before and was extensively discussed. The summary is in https://bugzilla.redhat.com/show_bug.cgi?id=2099479#c14,
TLDR being that:
- with cgroupsv2 we should not be seeing the oom kill (I think cgroupsv2 got moved out of 4.13 - https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-1448150138)
- should be careful not to start introducing and fixing a bunch of things
just to work well with storage that may or may not be up to par

@alitke Should we reconsider our previous stances?

Comment 3 Germano Veit Michel 2023-05-11 21:35:26 UTC
(In reply to Alex Kalenyuk from comment #2)
> (In reply to Yan Du from comment #1)
> > Alex, are we sure we want to use writeback cache mode?
> 
> As far as I understand we are not in the wrong to want to utilize the page
> cache instead of completely bypassing it ('writeback');
> This has come up before and was extensively discussed. The summary is in
> https://bugzilla.redhat.com/show_bug.cgi?id=2099479#c14,
> TLDR being that:
> - with cgroupsv2 we should not be seeing the oom kill (I think cgroupsv2 got
> moved out of 4.13 -
> https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-
> 1448150138)

Yeah, I'm on 4.13 (now rc8), it's still on old cgroups v1 :(

> - should be careful not to start introducing and fixing a bunch of things
> just to work well with storage that may or may not be up to par

Isn't this actually the opposite? Using writeback compensates for slow storage that cannot handle O_DIRECT, unaligned writes, or something else that's wrong.

But I don't think it's storage related; the storage can easily handle the writes (in fact it's almost not writing at all, the whole thing is being cached).
The problem is a large amount of free memory on the node (it's idle) combined with the default dirty_ratio/bytes settings.
I can make the problem go away with dirty_bytes set to 1/2 of the pod limit: https://access.redhat.com/articles/45002, so it forces flushing well before we hit the pod limit that triggers the oom-kill.
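For completeness, this is roughly what I'm applying on the node as a workaround (illustrative values only, chosen to stay below half of the ~600M pod limit; note these are node-wide tunables, not per-pod):

```
# Force writeback long before the CDI pod's cgroup limit is reached
# (illustrative: 256MiB hard dirty limit, 128MiB background threshold)
sysctl -w vm.dirty_bytes=268435456
sysctl -w vm.dirty_background_bytes=134217728

# To persist across reboots on a plain RHEL host (on OCP this would have to
# go through a MachineConfig instead):
cat <<'EOF' > /etc/sysctl.d/99-dirty-bytes.conf
vm.dirty_bytes = 268435456
vm.dirty_background_bytes = 134217728
EOF
```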

To reproduce:
- Node idle with heaps of free memory
- Download the image from a machine on the local network, so download is quick
- Image bigger than the pod limit

The whole image ends up in the cache and it kills the pod.

I was trying to find a way to play with vm.dirty_* per pod, but there doesn't seem to be a way.

IMHO, something should be done here if cgroups v2 is not coming to 4.13; I think customers will see it, but it's up to you.

> @alitke Should we reconsider our previous stances?

Comment 4 Germano Veit Michel 2023-07-12 11:47:21 UTC
Same problem hitting hypershift on CNV (4.13.4).

It fails to provision the workers (from the nodepool) and retries many times before succeeding, delaying hosted cluster creation.

Jul 12 11:38:20 indigo.shift.home.arpa kernel: memory: usage 585936kB, limit 585936kB, failcnt 3575147
Jul 12 11:38:20 indigo.shift.home.arpa kernel: memory+swap: usage 585936kB, limit 9007199254740988kB, failcnt 0
Jul 12 11:38:20 indigo.shift.home.arpa kernel: kmem: usage 26400kB, limit 9007199254740988kB, failcnt 0
Jul 12 11:38:20 indigo.shift.home.arpa kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod074083a6_5919_4d0a_b7c3_efd462c4bcfe.slice:
Jul 12 11:38:20 indigo.shift.home.arpa kernel: anon 102940672
                                               file 470024192
                                               kernel 27033600
                                               kernel_stack 311296
                                               pagetables 983040
                                               percpu 31176
                                               sock 0
                                               vmalloc 28672
                                               shmem 53248
                                               zswap 0
                                               zswapped 0
                                               file_mapped 0
                                               file_dirty 0
                                               file_writeback 469970944
                                               swapcached 0
                                               anon_thp 41943040
                                               file_thp 0
                                               shmem_thp 0
                                               inactive_anon 102932480
                                               active_anon 61440
                                               inactive_file 234987520
                                               active_file 234983424
                                               unevictable 0
                                               slab_reclaimable 25100472
                                               slab_unreclaimable 502720
                                               slab 25603192
                                               workingset_refault_anon 0
                                               workingset_refault_file 403091
                                               workingset_activate_anon 0
                                               workingset_activate_file 85534
                                               workingset_restore_anon 0
                                               workingset_restore_file 3950
                                               workingset_nodereclaim 39697
                                               pgscan 155410421
                                               pgsteal 4112520
                                               pgscan_kswapd 0
                                               pgscan_direct 155410421
                                               pgsteal_kswapd 0
                                               pgsteal_direct 4112520
                                               pgfault 60886
                                               pgmajfault 0
                                               pgrefill 149534074
                                               pgactivate 150867185
                                               pgdeactivate 149534073
                                               pglazyfree 0
                                               pglazyfreed 0
                                               zswpin 0
                                               zswpout 0
                                               thp_fault_alloc 112
                                               thp_collapse_alloc 0
Jul 12 11:38:20 indigo.shift.home.arpa kernel: Tasks state (memory values in pages):
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ 107047]     0 107047     2076      517    49152        0         -1000 conmon
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ 107059]   107 107059   325091    12759   315392        0           999 virt-cdi-import
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ 111096]   107 111096   246656    22575   647168        0           999 qemu-img

Comment 5 Adam Litke 2023-08-09 17:50:29 UTC
The problem is that if we switch to cache=none then all imports will slow down and this will impact a different set of customers.  Alex, could we experiment with adding a thread to the importer that repeatedly calls sync to flush I/O to disk?  We're still going to have a race condition here (especially with slower storage) but maybe that can improve things?
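Purely to illustrate the idea (this is not the importer code, and $SOURCE/$DEST are placeholders), the effect would be similar to running a background flusher next to the conversion:

```
# Hypothetical sketch: keep flushing dirty pages while qemu-img runs so the
# page cache charged to the pod's cgroup never grows close to the limit.
qemu-img convert -t writeback -p -O raw "$SOURCE" "$DEST" &
CONVERT_PID=$!
while kill -0 "$CONVERT_PID" 2>/dev/null; do
    sync
    sleep 5
done
wait "$CONVERT_PID"
```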

Comment 6 Germano Veit Michel 2023-08-09 21:45:40 UTC
(In reply to Adam Litke from comment #5)
> The problem is that if we switch to cache=none then all imports will slow
> down and this will impact a different set of customers.

cache=none should not slow down anything with fast storage. In fact, the VM runs with cache=none and has no performance issues.

      <driver name='qemu' type='raw' cache='none' error_policy='stop' discard='unmap'/>

This is happening on high-end (but consumer) NVMe (5GB/s writes), and the import doesn't get anywhere close to that rate due to the internet download speed of the images.

I suspect there is some inefficiency problem with this conversion somewhere else, and it's being masked by using the destination cache in the convert.

Regardless of whether the storage is fast or not, what ultimately determines how much gets cached (and then goes over the cgroup v1 limit) are the kernel tunables around dirty pages: the thresholds, the amount of memory, etc.

Comment 7 Alex Kalenyuk 2023-08-31 10:47:58 UTC
(In reply to Adam Litke from comment #5)
> The problem is that if we switch to cache=none then all imports will slow
> down and this will impact a different set of customers.  Alex, could we
> experiment with adding a thread to the importer that repeatedly calls sync
> to flush I/O to disk?  We're still going to have a race condition here
> (especially with slower storage) but maybe that can improve things?

So there are a few points captured in this thread https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg05720.html
following the discussion in the sig-storage meeting.

Specifically echoing berrange
```
> 3. Using buffered I/O because O_DIRECT is not universally supported?
>
> If you can't use O_DIRECT, then qemu-img could be extended to manage its
> dirty page cache set carefully. This consists of picking a budget and
> writing back to disk when the budget is exhausted.

IOW, re-implementing what the kernel should already be doing for us :-(

This feels like the least desirable thing for QEMU to take on, especially
since cgroups v1 is an evolutionary dead-end, with v2 increasingly taking
over the world.
```
I think that neither we nor qemu need to be in this domain, managing dirty pages pretty much directly.

However, taking into account our pod's constrained memory environment, I think we are indeed in the wrong
using writeback. This is also addressed in berrange's comment:
```
writeback caching helps if you have lots of free memory, but on
virtualization hosts memory is usually the biggest VM density
constraint, so apps shouldn't generally expect there to be lots
of free host memory to burn as I/O cache.
```
Note we ship CDI with the pod limit set to 600M.
So with lots of available memory, under the default dirty page ratio
settings, it should be pretty easy to OOM.

I think the action for us to take is to revert to cache=none if
O_DIRECT is supported. Otherwise, keep writeback as a fallback.
This also seems to be the approach taken in other projects:
https://opendev.org/openstack/nova/src/branch/master/nova/privsep/qemu.py
https://github.com/lxc/lxd/blob/399e6794b2fdbd0821f64910202478f2e25a023f/lxd/storage/utils.go#L597
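A minimal sketch of that probe-then-fallback idea (illustrative only, not the actual CDI change; $SOURCE_IMAGE is a placeholder and the destination path is the one from the description above):

```
# Probe whether the destination supports O_DIRECT with a small direct-I/O read;
# if it does, convert with cache=none, otherwise fall back to writeback.
if dd if=/dev/cdi-block-volume of=/dev/null bs=64k count=1 iflag=direct 2>/dev/null; then
    CACHE_MODE=none
else
    CACHE_MODE=writeback
fi
qemu-img convert -S 0 -t "$CACHE_MODE" -p -O raw "$SOURCE_IMAGE" /dev/cdi-block-volume
```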

Comment 10 Adam Litke 2023-10-24 12:42:19 UTC
It seems like this issue will be mitigated by the move to cgroupsv2 in 4.14+ so I wouldn't recommend changing our cache mode for importing (since it offers greater compatibility and better I/O performance).  Instead, let's publish a KCS article that includes a MachineConfig that can be applied to worker nodes to adjust the kernel dirty pages tunables to force more aggressive writeback.

Comment 11 Germano Veit Michel 2023-10-24 21:32:37 UTC
(In reply to Adam Litke from comment #10)
> It seems like this issue will be mitigated by the move to cgroupsv2 in 4.14+
> so I wouldn't recommend changing our cache mode for importing (since it
> offers greater compatibility and better I/O performance).  Instead, let's
> publish a KCS article that includes a MachineConfig that can be applied to
> worker nodes to adjust the kernel dirty pages tunables to force more
> aggressive writeback.

KCS articles like this already exist for RHEL; one can just dump that into a MachineConfig.
And one would have to set it so aggressively that the page cache never grows big enough
to exceed what the CDI pod is caching under its cgroup limit.

I'd not recommend this to customers, as it changes the entire system and
not just the problematic CDI pod. Changing those tunables will affect both
write-back and write-through, as both hit the page cache; it can affect other
pods using slow storage that actually need this.

I still see this on my cluster all the time. I think people may not notice it
because it just retries until it succeeds. But it's a waste of resources.

With proper storage, cache=none should be very quick at writing these [small]
images; my bet is this is compensating for some inefficiency somewhere else.

Comment 12 Arvin Amirian 2023-10-24 21:48:42 UTC
In my case the CDI pod never succeeds. I had to downgrade OCP to 4.12 to avoid this issue.

Comment 13 Dominik Holler 2023-10-31 07:29:24 UTC
Neither fix attached nor verified, postponing to next z-stream.

Comment 14 Alex Kalenyuk 2023-10-31 16:26:53 UTC
(In reply to Germano Veit Michel from comment #11)
> (In reply to Adam Litke from comment #10)
> > It seems like this issue will be mitigated by the move to cgroupsv2 in 4.14+
> > so I wouldn't recommend changing our cache mode for importing (since it
> > offers greater compatibility and better I/O performance).  Instead, let's
> > publish a KCS article that includes a MachineConfig that can be applied to
> > worker nodes to adjust the kernel dirty pages tunables to force more
> > aggressive writeback.
> 
> KCS articles like this already exist for RHEL; one can just dump that into a
> MachineConfig.
> And one would have to set it so aggressively that the page cache never grows
> big enough
> to exceed what the CDI pod is caching under its cgroup limit.
> 
> I'd not recommend this to customers, as it changes the entire system and
> not just the problematic CDI pod. Changing those tunables will affect
> both
> write-back and write-through, as both hit the page cache; it can affect other
> pods using slow storage that actually need this.
> 
> I still see this on my cluster all the time. I think people may not notice it
> because it just retries until it succeeds. But it's a waste of resources.
> 
> With proper storage, cache=none should be very quick at writing these [small]
> images; my bet is this is compensating for some inefficiency somewhere else.

Generally, I agree with the statement about good storage, although,
Note that with cgroupsv2 writeback [0] things look better with cache=writeback;
It takes the dirty_rate on the host into account for an individual container,
So for 10% for example, it'll start writeback at 0.1*600MB in the importer
(It also throttles instead of OOM)
So we may want to keep writeback in main/4.15/4.14?

BTW, regarding needing aggressive values, I have seen some distros specifically recommend these:
echo 629145600 > /proc/sys/vm/dirty_bytes
echo 314572800 > /proc/sys/vm/dirty_background_bytes
Which would also work for CDI in this case. I still think it's a tricky area to be in, and would rather not ask this.
The closest Red Hat reference I could find for recommendation is https://access.redhat.com/articles/2252871

[0] https://docs.kernel.org/admin-guide/cgroup-v2.html#writeback

(In reply to Arvin Amirian from comment #12)
> In my case the CDI pod never succeeds. I had to downgrade OCP to 4.12 to
> avoid this issue.

This is surprising, because 4.12 is also using cache=writeback

Comment 15 Germano Veit Michel 2023-10-31 21:47:36 UTC
(In reply to Alex Kalenyuk from comment #14)
> Generally, I agree with the statement about good storage, although,
> Note that with cgroupsv2 writeback [0] things look better with
> cache=writeback;
> It takes the dirty_rate on the host into account for an individual container,
> So for 10% for example, it'll start writeback at 0.1*600MB in the importer
> (It also throttles instead of OOM)
> So we may want to keep writeback in main/4.15/4.14?

Well, my understanding is that it looks better because once it hits
the cgroup limit it essentially behaves like cache=none, as further
writes block until there is space in the page cache, which is forced
to commit to disk sooner.

And here we are probably writing once to each file offset, sequentially.
Assuming it needs to flush at the end, I don't really know how it's going
to speed things up, especially without HDDs.

> 
> BTW, regarding needing aggressive values, I have seen some distros
> specifically recommend these:
> echo 629145600 > /proc/sys/vm/dirty_bytes
> echo 314572800 > /proc/sys/vm/dirty_background_bytes
> Which would also work for CDI in this case. I still think it's a tricky area
> to be in, and would rather not ask this.
> The closest Red Hat reference I could find for recommendation is
> https://access.redhat.com/articles/2252871

Indeed, I'm not recommending these to any customers if we get a support
case about this; we can't change the entire system because of a single pod.
And I think the values would need to be so aggressive to not hit the pod
cgroup limit that any apps actually making proper use of caching would
almost run like cache=none too.

> 
> [0] https://docs.kernel.org/admin-guide/cgroup-v2.html#writeback
> 
> (In reply to Arvin Amirian from comment #12)
> > In my case the CDI pod never succeeds. I had to downgrade OCP to 4.12 to
> > avoid this issue.
> 
> This is surprising, because 4.12 is also using cache=writeback

Right, and the tunables are roughly the same. Except it's a much newer kernel.

[root@rhel-86 ~]# for f in `ls /proc/sys/vm/dirty*` ;  do echo $f ; cat $f ; done;
/proc/sys/vm/dirty_background_bytes
0
/proc/sys/vm/dirty_background_ratio
10
/proc/sys/vm/dirty_bytes
0
/proc/sys/vm/dirty_expire_centisecs
3000
/proc/sys/vm/dirty_ratio
30
/proc/sys/vm/dirtytime_expire_seconds
43200
/proc/sys/vm/dirty_writeback_centisecs
500

for f in `ls /proc/sys/vm/dirty*` ;  do echo $f ; cat $f ; done;
/proc/sys/vm/dirty_background_bytes
0
/proc/sys/vm/dirty_background_ratio
10
/proc/sys/vm/dirty_bytes
0
/proc/sys/vm/dirty_expire_centisecs
3000
/proc/sys/vm/dirty_ratio
30
/proc/sys/vm/dirtytime_expire_seconds
43200
/proc/sys/vm/dirty_writeback_centisecs
500

So I also don't know why it's worse in 4.13, which goes back to my assumption/guess
that there is some inefficiency somewhere else, probably inside CDI, contributing
to this.

In my setup these OOMs were already there in 4.12, but got a lot worse in 4.13.

It appears 4.14 is now GA. Once the upgrade graph from 4.13 becomes available I'll update here, expecting the problem to go away due to the cgroups change.

But it still doesn't make much sense to me, and 4.12/4.13 are still supported.
Given the bug is 6 months old, though, there may be little value in changing this now, assuming 4.14 works.

If the problem is gone in 4.14, I won't object if engineering decides to CLOSE WONTFIX this.

Comment 17 Alex Kalenyuk 2023-11-29 17:52:17 UTC
@germano Hey!
Do you still see this in 4.14?
I have been looking at some OOMs lately, and I think we have had a wrong assumption about k8s capitalizing on cgroupsv2 memory throttling.
Basically, IIUC from https://kubernetes.io/blog/2023/05/05/qos-memory-resources/, the feature that would cause memory.high to be set is still in alpha
and thus memory.high is not being set in k8s (keeping the oom killer behavior)
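For anyone who wants to double-check on a cgroups v2 node (example path for a burstable pod; <pod-uid> is a placeholder):

```
# If the kubelet had set a throttling threshold this would show a byte value;
# "max" means memory.high is unset and the cgroup OOMs at memory.max instead.
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod-uid>.slice/memory.high
```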

From a quick check in my dev environment, I can trigger OOMs by pulling from local registry + setting dirty_ratio to a ridiculously high value,
but I wanted to double-check with you.

Comment 18 Alex Kalenyuk 2023-11-29 19:44:24 UTC
@gveitmic wrong needinfo

Comment 19 Germano Veit Michel 2023-11-29 22:10:19 UTC
Hey Alex,

It doesn't seem any better than 4.13.

Still happening all the time here on 4.14.3 and CNV 4.14.0

Fired up an import, and it failed twice before succeeding on the 3rd attempt.

Nov 29 22:05:47 indigo.shift.home.arpa kernel: qemu-img invoked oom-killer: gfp_mask=0x101c00(GFP_NOIO|__GFP_HARDWALL|__GFP_WRITE), order=0, oom_score_adj=999
Nov 29 22:05:47 indigo.shift.home.arpa kernel: CPU: 1 PID: 1568108 Comm: qemu-img Tainted: G        W        --------  ---  5.14.0-284.41.1.el9_2.x86_64 #1
Nov 29 22:05:47 indigo.shift.home.arpa kernel: Hardware name: Intel(R) Client Systems NUC11PAHi3/NUC11PABi3, BIOS PATGL357.0050.2022.1228.1726 12/28/2022
Nov 29 22:05:47 indigo.shift.home.arpa kernel: Call Trace:
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  <TASK>
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  dump_stack_lvl+0x34/0x48
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  dump_header+0x4a/0x201
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  oom_kill_process.cold+0xb/0x10
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  out_of_memory+0xed/0x2e0
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  mem_cgroup_out_of_memory+0x13a/0x150
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  try_charge_memcg+0x6df/0x7a0
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  charge_memcg+0x7a/0xf0
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  __mem_cgroup_charge+0x29/0x80
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  __filemap_add_folio+0x224/0x5a0
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? scan_shadow_nodes+0x30/0x30
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  filemap_add_folio+0x37/0xa0
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  __filemap_get_folio+0x1e6/0x330
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? blkdev_write_begin+0x20/0x20
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  pagecache_get_page+0x15/0x90
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  block_write_begin+0x24/0xf0
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  generic_perform_write+0xbe/0x200
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  __generic_file_write_iter+0xe5/0x1a0
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  blkdev_write_iter+0xe4/0x160
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  new_sync_write+0xfc/0x190
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  vfs_write+0x1ef/0x280
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  __x64_sys_pwrite64+0x90/0xc0
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  do_syscall_64+0x59/0x90
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? switch_fpu_return+0x49/0xd0
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? exit_to_user_mode_prepare+0xec/0x100
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? syscall_exit_to_user_mode+0x12/0x30
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? do_syscall_64+0x69/0x90
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? syscall_exit_to_user_mode+0x12/0x30
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? do_syscall_64+0x69/0x90
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? do_syscall_64+0x69/0x90
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  ? exit_to_user_mode_prepare+0xec/0x100
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
Nov 29 22:05:47 indigo.shift.home.arpa kernel: RIP: 0033:0x7f73d6d8698f
Nov 29 22:05:47 indigo.shift.home.arpa kernel: Code: 08 89 3c 24 48 89 4c 24 18 e8 7d f2 f5 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 cd f2 f5 ff 48 8b
Nov 29 22:05:47 indigo.shift.home.arpa kernel: RSP: 002b:00007f73a77fd8d0 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
Nov 29 22:05:47 indigo.shift.home.arpa kernel: RAX: ffffffffffffffda RBX: 00007f73cf4fbb98 RCX: 00007f73d6d8698f
Nov 29 22:05:47 indigo.shift.home.arpa kernel: RDX: 0000000001000000 RSI: 00007f73927fe000 RDI: 000000000000000a
Nov 29 22:05:47 indigo.shift.home.arpa kernel: RBP: 00005648f5d23fa0 R08: 0000000000000000 R09: 00005648f5b241b4
Nov 29 22:05:47 indigo.shift.home.arpa kernel: R10: 0000000200000000 R11: 0000000000000293 R12: 00007f73927fe000
Nov 29 22:05:47 indigo.shift.home.arpa kernel: R13: 00005648f3ea5ef8 R14: 0000000000000000 R15: 00007f73927fe000
Nov 29 22:05:47 indigo.shift.home.arpa kernel:  </TASK>
Nov 29 22:05:47 indigo.shift.home.arpa kernel: memory: usage 585936kB, limit 585936kB, failcnt 2181321
Nov 29 22:05:47 indigo.shift.home.arpa kernel: memory+swap: usage 585936kB, limit 9007199254740988kB, failcnt 0
Nov 29 22:05:47 indigo.shift.home.arpa kernel: kmem: usage 27728kB, limit 9007199254740988kB, failcnt 0
Nov 29 22:05:47 indigo.shift.home.arpa kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb807fa67_7d8d_4879_9909_d6addf99d34c.slice:
Nov 29 22:05:47 indigo.shift.home.arpa kernel: anon 54149120
                                               file 517455872
                                               kernel 28393472
                                               kernel_stack 360448
                                               pagetables 929792
                                               percpu 31176
                                               sock 0
                                               vmalloc 28672
                                               shmem 53248
                                               zswap 0
                                               zswapped 0
                                               file_mapped 0
                                               file_dirty 0
                                               file_writeback 517402624
                                               swapcached 0
                                               anon_thp 25165824
                                               file_thp 0
                                               shmem_thp 0
                                               inactive_anon 54149120
                                               active_anon 53248
                                               inactive_file 258637824
                                               active_file 258641920
                                               unevictable 0
                                               slab_reclaimable 26438480
                                               slab_unreclaimable 527704
                                               slab 26966184
                                               workingset_refault_anon 0
                                               workingset_refault_file 143023
                                               workingset_activate_anon 0
                                               workingset_activate_file 12607
                                               workingset_restore_anon 0
                                               workingset_restore_file 12185
                                               workingset_nodereclaim 17648
                                               pgscan 102281033
                                               pgsteal 2438322
                                               pgscan_kswapd 0
                                               pgscan_direct 102281033
                                               pgsteal_kswapd 0
                                               pgsteal_direct 2438322
                                               pgfault 40210
                                               pgmajfault 0
                                               pgrefill 98761290
                                               pgactivate 99823370
                                               pgdeactivate 98761290
                                               pglazyfree 0
                                               pglazyfreed 0
                                               zswpin 0
                                               zswpout 0
                                               thp_fault_alloc 110
                                               thp_collapse_alloc 0
Nov 29 22:05:47 indigo.shift.home.arpa kernel: Tasks state (memory values in pages):
Nov 29 22:05:47 indigo.shift.home.arpa kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Nov 29 22:05:47 indigo.shift.home.arpa kernel: [1567623]     0 1567623     2076      485    57344        0         -1000 conmon
Nov 29 22:05:47 indigo.shift.home.arpa kernel: [1567635]   107 1567635   327313    14337   335872        0           999 virt-cdi-import
Nov 29 22:05:47 indigo.shift.home.arpa kernel: [1568099]   107 1568099   293507    10132   569344        0           999 qemu-img
Nov 29 22:05:47 indigo.shift.home.arpa kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-ac64a01f39e18026b39feeb4c63534fdfb956a3403db72b259bf100b4ec5b530.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb807fa67_7d8d_4879_9909_d6addf99d34c.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb807fa67_7d8d_4879_9909_d6addf99d34c.slice/crio-ac64a01f39e18026b39feeb4c63534fdfb956a3403db72b259bf100b4ec5b530.scope,task=virt-cdi-import,pid=1567635,uid=107
Nov 29 22:05:47 indigo.shift.home.arpa kernel: Memory cgroup out of memory: Killed process 1567635 (virt-cdi-import) total-vm:1309252kB, anon-rss:16856kB, file-rss:40492kB, shmem-rss:0kB, UID:107 pgtables:328kB oom_score_adj:999


Nov 29 22:07:11 indigo.shift.home.arpa kernel: qemu-img invoked oom-killer: gfp_mask=0x101c00(GFP_NOIO|__GFP_HARDWALL|__GFP_WRITE), order=0, oom_score_adj=999
Nov 29 22:07:11 indigo.shift.home.arpa kernel: CPU: 2 PID: 1569974 Comm: qemu-img Tainted: G        W        --------  ---  5.14.0-284.41.1.el9_2.x86_64 #1
Nov 29 22:07:11 indigo.shift.home.arpa kernel: Hardware name: Intel(R) Client Systems NUC11PAHi3/NUC11PABi3, BIOS PATGL357.0050.2022.1228.1726 12/28/2022
Nov 29 22:07:11 indigo.shift.home.arpa kernel: Call Trace:
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  <TASK>
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  dump_stack_lvl+0x34/0x48
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  dump_header+0x4a/0x201
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  oom_kill_process.cold+0xb/0x10
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  out_of_memory+0xed/0x2e0
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  mem_cgroup_out_of_memory+0x13a/0x150
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  try_charge_memcg+0x6df/0x7a0
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  charge_memcg+0x7a/0xf0
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  __mem_cgroup_charge+0x29/0x80
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  __filemap_add_folio+0x224/0x5a0
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? scan_shadow_nodes+0x30/0x30
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  filemap_add_folio+0x37/0xa0
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  __filemap_get_folio+0x1e6/0x330
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? blkdev_write_begin+0x20/0x20
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  pagecache_get_page+0x15/0x90
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  block_write_begin+0x24/0xf0
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  generic_perform_write+0xbe/0x200
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  __generic_file_write_iter+0xe5/0x1a0
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  blkdev_write_iter+0xe4/0x160
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  new_sync_write+0xfc/0x190
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  vfs_write+0x1ef/0x280
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  __x64_sys_pwrite64+0x90/0xc0
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  do_syscall_64+0x59/0x90
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? do_futex+0xf7/0x200
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? __x64_sys_futex+0x73/0x1d0
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? syscall_exit_work+0x11a/0x150
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? syscall_exit_to_user_mode+0x12/0x30
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? do_syscall_64+0x69/0x90
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? exit_to_user_mode_prepare+0xec/0x100
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? syscall_exit_to_user_mode+0x12/0x30
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? do_syscall_64+0x69/0x90
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? syscall_exit_to_user_mode+0x12/0x30
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  ? do_syscall_64+0x69/0x90
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
Nov 29 22:07:11 indigo.shift.home.arpa kernel: RIP: 0033:0x7f3fc545298f
Nov 29 22:07:11 indigo.shift.home.arpa kernel: Code: 08 89 3c 24 48 89 4c 24 18 e8 7d f2 f5 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 cd f2 f5 ff 48 8b
Nov 29 22:07:11 indigo.shift.home.arpa kernel: RSP: 002b:00007f3fa97f98d0 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
Nov 29 22:07:11 indigo.shift.home.arpa kernel: RAX: ffffffffffffffda RBX: 00007f3fc1e1fb98 RCX: 00007f3fc545298f
Nov 29 22:07:11 indigo.shift.home.arpa kernel: RDX: 0000000001000000 RSI: 00007f3f8afff000 RDI: 000000000000000a
Nov 29 22:07:11 indigo.shift.home.arpa kernel: RBP: 0000561c7420e5f0 R08: 0000000000000000 R09: 0000561c741de1b4
Nov 29 22:07:11 indigo.shift.home.arpa kernel: R10: 000000023c000000 R11: 0000000000000293 R12: 00007f3f8afff000
Nov 29 22:07:11 indigo.shift.home.arpa kernel: R13: 0000561c737fbef8 R14: 0000000000000000 R15: 00007f3f8afff000
Nov 29 22:07:11 indigo.shift.home.arpa kernel:  </TASK>
Nov 29 22:07:11 indigo.shift.home.arpa kernel: memory: usage 585936kB, limit 585936kB, failcnt 4889989
Nov 29 22:07:11 indigo.shift.home.arpa kernel: memory+swap: usage 585936kB, limit 9007199254740988kB, failcnt 0
Nov 29 22:07:11 indigo.shift.home.arpa kernel: kmem: usage 27692kB, limit 9007199254740988kB, failcnt 0
Nov 29 22:07:11 indigo.shift.home.arpa kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb807fa67_7d8d_4879_9909_d6addf99d34c.slice:
Nov 29 22:07:11 indigo.shift.home.arpa kernel: anon 53297152
                                               file 518344704
                                               kernel 28356608
                                               kernel_stack 327680
                                               pagetables 909312
                                               percpu 31176
                                               sock 0
                                               vmalloc 28672
                                               shmem 106496
                                               zswap 0
                                               zswapped 0
                                               file_mapped 0
                                               file_dirty 102400
                                               file_writeback 518135808
                                               swapcached 0
                                               anon_thp 27262976
                                               file_thp 0
                                               shmem_thp 0
                                               inactive_anon 53301248
                                               active_anon 102400
                                               inactive_file 259100672
                                               active_file 259096576
                                               unevictable 0
                                               slab_reclaimable 26468176
                                               slab_unreclaimable 512840
                                               slab 26981016
                                               workingset_refault_anon 0
                                               workingset_refault_file 294606
                                               workingset_activate_anon 0
                                               workingset_activate_file 12623
                                               workingset_restore_anon 0
                                               workingset_restore_file 12201
                                               workingset_nodereclaim 59824
                                               pgscan 230151528
                                               pgsteal 5377545
                                               pgscan_kswapd 0
                                               pgscan_direct 230151528
                                               pgsteal_kswapd 0
                                               pgsteal_direct 5377545
                                               pgfault 80008
                                               pgmajfault 0
                                               pgrefill 222422482
                                               pgactivate 224764584
                                               pgdeactivate 222422481
                                               pglazyfree 0
                                               pglazyfreed 0
                                               zswpin 0
                                               zswpout 0
                                               thp_fault_alloc 230
                                               thp_collapse_alloc 0
Nov 29 22:07:11 indigo.shift.home.arpa kernel: Tasks state (memory values in pages):
Nov 29 22:07:11 indigo.shift.home.arpa kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Nov 29 22:07:11 indigo.shift.home.arpa kernel: [1569617]     0 1569617     2076      484    57344        0         -1000 conmon
Nov 29 22:07:11 indigo.shift.home.arpa kernel: [1569629]   107 1569629   327313    14627   335872        0           999 virt-cdi-import
Nov 29 22:07:11 indigo.shift.home.arpa kernel: [1569968]   107 1569968   252051     9633   548864        0           999 qemu-img
Nov 29 22:07:11 indigo.shift.home.arpa kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-7522c30bc48aab7fdb59bca6ab6161d4a8a842b96f6b71323713112989e8367f.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb807fa67_7d8d_4879_9909_d6addf99d34c.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb807fa67_7d8d_4879_9909_d6addf99d34c.slice/crio-7522c30bc48aab7fdb59bca6ab6161d4a8a842b96f6b71323713112989e8367f.scope,task=virt-cdi-import,pid=1569629,uid=107
Nov 29 22:07:11 indigo.shift.home.arpa kernel: Memory cgroup out of memory: Killed process 1569629 (virt-cdi-import) total-vm:1309252kB, anon-rss:18484kB, file-rss:40024kB, shmem-rss:0kB, UID:107 pgtables:328kB oom_score_adj:999

Comment 20 Germano Veit Michel 2023-11-29 22:13:02 UTC
Btw, just a random and maybe incorrect observation, but writing to lvms seems the easiest way to reproduce. Writing to NFS or hostpath (filesystem) seems better than before.

Comment 21 Alex Kalenyuk 2023-11-30 09:03:34 UTC
(In reply to Germano Veit Michel from comment #20)
> Btw, just a random and maybe incorrect observation, but writing to lvms
> seems the easiest way to reproduce. Writing to NFS or hostpath (filesystem)
> seems better than before.

Just one last thing before I start writing the PR. Is this 4.14 freshly installed?
I am asking because cgroupsv2 is only the default for new installs,
so if you upgraded from 4.13 IIUC you will keep cgroupsv1
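For reference, a quick way to tell directly on the node which cgroup version is in use:

```
# "cgroup2fs" means the node runs cgroups v2, "tmpfs" means the legacy v1 hierarchy
stat -fc %T /sys/fs/cgroup/
```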

Comment 22 Germano Veit Michel 2023-11-30 21:03:53 UTC
Sorry, I should have checked that. It's still on v1, upgraded from 4.13.
I tried to switch to v2 by editing nodes.config/cluster but it reconciled back to v1.
So please give me a couple of days and I'll do a re-install with 4.14 to test it.

However, does this mean we will have customers stuck on v1 and hitting this problem over and over again unless they do a fresh install?
If so, I think we need another solution too, not just rely on v2.

Comment 23 Alex Kalenyuk 2023-11-30 21:45:33 UTC
(In reply to Germano Veit Michel from comment #22)
> Sorry, I should have checked that. It's still on v1, upgraded from 4.13.
> I tried to switch to v2 by editing nodes.config/cluster but it reconciled
> back to v1.
> So please give me a couple of days and I'll do a re-install with 4.14 to
> test it.
> 
> However, does this mean we will have customers stuck on v1 and hitting this
> problem over and over again unless they do a fresh install?
> If so, I think we need another solution too, not just rely on v2.

Yes, I think that is correct. I just wanted to know if your exact setup triggers the issue with cgroupsv2 as well.
> If so, I think we need another solution too, not just rely on v2.
Yep agree

PR posted at https://github.com/kubevirt/containerized-data-importer/pull/3017,
feel free to chime in

Comment 24 Germano Veit Michel 2023-12-01 03:01:04 UTC
Redeployed fresh with 4.14.3... but my cgroups are still v1 :(

$ oc get nodes.config/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Node
...
spec:
  cgroupMode: v1 


gveitmic@pi:~ $ oc get clusterversion -o yaml
...
    history:
    - completionTime: "2023-12-01T02:24:42Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:e6e1d90b492d50438034e6edb46bdafa6c86ae3b80ef3328685912d89681fdee
      startedTime: "2023-12-01T00:43:02Z"
      state: Completed
      verified: true
      version: 4.14.4
    - completionTime: "2023-12-01T00:28:35Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:e73ab4b33a9c3ff00c9f800a38d69853ca0c4dfa5a88e3df331f66df8f18ec55
      startedTime: "2023-11-30T21:49:12Z"
      state: Completed
      verified: false
      version: 4.14.3
    observedGeneration: 4
    versionHash: MxPRa6e5igo=
kind: List
metadata:
  resourceVersion: ""

Wondering why it's v1 and not v2, so I can test this.

Comment 25 Germano Veit Michel 2023-12-01 03:54:04 UTC
OK, I found that if I have a PerformanceProfile it will force the cluster back to cgroup v1.

Removed that, and I can get the cluster to cgroupv2.

[core@black ~]$ cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-7bc66a29a38b700292a67d5c856a2a30015dfe4fd8ae1cf9fb055459e0acc990/vmlinuz-5.14.0-284.41.1.el9_2.x86_64 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/7bc66a29a38b700292a67d5c856a2a30015dfe4fd8ae1cf9fb055459e0acc990/0 root=UUID=48d42d68-25f1-4aba-b183-71a64e7dbb51 rw rootflags=prjquota boot=UUID=94e1f33a-ada6-424f-8390-f441e9d27d94 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=1 intel_iommu=on pci=realloc pci=assign-busses pcie_acs_override=downstream nvme_core.default_ps_max_latency_us=0

Normally, each VM import would take 2 to 4 retries to complete. I just provisioned 6 VMs and all of them succeeded on the first try. It's hard to say if it will never happen again, but at a minimum this is better with v2 enabled.
I'll keep an eye on it over the next few days; since I re-installed the cluster I'll be re-deploying many VMs, and I'll let you know if it stays like that.

Thanks!

