Bug 2196072 - virt-cdi-import often killed by oom-killer during VM provisioning [NEEDINFO]
Summary: virt-cdi-import often killed by oom-killer during VM provisioning
Keywords:
Status: ASSIGNED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.12.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.13.4
Assignee: Alex Kalenyuk
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-08 01:31 UTC by Germano Veit Michel
Modified: 2023-08-14 10:47 UTC (History)
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:
alitke: needinfo? (akalenyu)




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CNV-28628 0 None None None 2023-05-08 01:32:04 UTC

Description Germano Veit Michel 2023-05-08 01:31:50 UTC
Description of problem:

The cluster often has to attempt provisioning a VM multiple times before it actually works; it sometimes fails 2-3 times in a row.
Provisioning always succeeds after some retries, but it quite often fails on the 1st and 2nd attempts.

I had seen this only rarely before, but after upgrading the cluster to 4.13-rc7 (CNV still on 4.12.2) it seems to hit more frequently.

I noticed it's writing with "-t writeback", and the page cache accounts for most of the consumed cgroup memory limit (537M out of 585M).

This does not seem to be a good option for converting images under a cgroup limit.

107       127318  127223 54 01:15 ?        00:00:10 qemu-img convert -S 0 -t writeback -p -O raw nbd+unix:///?socket=/tmp/nbdkit.sock /dev/cdi-block-volume
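
For reference, qemu-img convert selects the destination cache mode with -t (and the source cache mode with -T). The variant below is only an illustration of the bypass-the-page-cache alternative, not what CDI currently runs:

  # -t writeback (what the importer uses today) goes through the host page cache;
  # -t none opens the destination with O_DIRECT and bypasses the page cache entirely.
  qemu-img convert -S 0 -t none -p -O raw 'nbd+unix:///?socket=/tmp/nbdkit.sock' /dev/cdi-block-volume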

May 08 01:05:17 yellow.toca.local kernel: qemu-img invoked oom-killer: gfp_mask=0x101c00(GFP_NOIO|__GFP_HARDWALL|__GFP_WRITE), order=0, oom_score_adj=999
May 08 01:05:17 yellow.toca.local kernel: CPU: 2 PID: 115058 Comm: qemu-img Kdump: loaded Not tainted 5.14.0-284.13.1.el9_2.x86_64 #1

May 08 01:05:17 yellow.toca.local kernel: Call Trace:
May 08 01:05:17 yellow.toca.local kernel:  <TASK>
May 08 01:05:17 yellow.toca.local kernel:  dump_stack_lvl+0x34/0x48
May 08 01:05:17 yellow.toca.local kernel:  dump_header+0x4a/0x201
May 08 01:05:17 yellow.toca.local kernel:  oom_kill_process.cold+0xb/0x10
May 08 01:05:17 yellow.toca.local kernel:  out_of_memory+0xed/0x2e0
May 08 01:05:17 yellow.toca.local kernel:  ? __wake_up_common+0x7d/0x190
May 08 01:05:17 yellow.toca.local kernel:  mem_cgroup_out_of_memory+0x13a/0x150
May 08 01:05:17 yellow.toca.local kernel:  try_charge_memcg+0x6df/0x7a0
May 08 01:05:17 yellow.toca.local kernel:  charge_memcg+0x9f/0x130
May 08 01:05:17 yellow.toca.local kernel:  __mem_cgroup_charge+0x29/0x80
May 08 01:05:17 yellow.toca.local kernel:  __filemap_add_folio+0x224/0x5a0
May 08 01:05:17 yellow.toca.local kernel:  ? scan_shadow_nodes+0x30/0x30
May 08 01:05:17 yellow.toca.local kernel:  filemap_add_folio+0x37/0xa0
May 08 01:05:17 yellow.toca.local kernel:  __filemap_get_folio+0x1e6/0x330
May 08 01:05:17 yellow.toca.local kernel:  ? blkdev_write_begin+0x20/0x20
May 08 01:05:17 yellow.toca.local kernel:  pagecache_get_page+0x15/0x90
May 08 01:05:17 yellow.toca.local kernel:  block_write_begin+0x24/0xf0
May 08 01:05:17 yellow.toca.local kernel:  generic_perform_write+0xbe/0x200
May 08 01:05:17 yellow.toca.local kernel:  __generic_file_write_iter+0xe5/0x1a0
May 08 01:05:17 yellow.toca.local kernel:  blkdev_write_iter+0xe4/0x160
May 08 01:05:17 yellow.toca.local kernel:  new_sync_write+0xfc/0x190
May 08 01:05:17 yellow.toca.local kernel:  vfs_write+0x1ef/0x280
May 08 01:05:17 yellow.toca.local kernel:  __x64_sys_pwrite64+0x90/0xc0
May 08 01:05:17 yellow.toca.local kernel:  do_syscall_64+0x59/0x90
May 08 01:05:17 yellow.toca.local kernel:  ? do_futex+0x15b/0x200
May 08 01:05:17 yellow.toca.local kernel:  ? __x64_sys_futex+0x73/0x1d0
May 08 01:05:17 yellow.toca.local kernel:  ? switch_fpu_return+0x49/0xd0
May 08 01:05:17 yellow.toca.local kernel:  ? exit_to_user_mode_prepare+0xec/0x100
May 08 01:05:17 yellow.toca.local kernel:  ? syscall_exit_to_user_mode+0x12/0x30
May 08 01:05:17 yellow.toca.local kernel:  ? do_syscall_64+0x69/0x90
May 08 01:05:17 yellow.toca.local kernel:  ? do_syscall_64+0x69/0x90
May 08 01:05:17 yellow.toca.local kernel:  ? sysvec_apic_timer_interrupt+0x3c/0x90
May 08 01:05:17 yellow.toca.local kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd


May 08 01:05:17 yellow.toca.local kernel: memory: usage 585936kB, limit 585936kB, failcnt 131
May 08 01:05:17 yellow.toca.local kernel: memory+swap: usage 585936kB, limit 9007199254740988kB, failcnt 0
May 08 01:05:17 yellow.toca.local kernel: kmem: usage 17716kB, limit 9007199254740988kB, failcnt 0
May 08 01:05:17 yellow.toca.local kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb18ee349_565f_4cb6_9f2b_f4e67a7a7927.slice:
May 08 01:05:17 yellow.toca.local kernel: anon 44654592
                                          file 537202688
                                          kernel 18141184
                                          kernel_stack 262144
                                          pagetables 1024000
                                          percpu 20856
                                          sock 0
                                          vmalloc 28672
                                          shmem 53248
                                          zswap 0
                                          zswapped 0
                                          file_mapped 0
                                          file_dirty 0
                                          file_writeback 537149440
                                          swapcached 0
                                          anon_thp 14680064
                                          file_thp 0
                                          shmem_thp 0
                                          inactive_anon 44646400
                                          active_anon 61440
                                          inactive_file 268509184
                                          active_file 268574720
                                          unevictable 0
                                          slab_reclaimable 16203816
                                          slab_unreclaimable 497488
                                          slab 16701304
                                          workingset_refault_anon 0
                                          workingset_refault_file 0
                                          workingset_activate_anon 0
                                          workingset_activate_file 0
                                          workingset_restore_anon 0
                                          workingset_restore_file 0
                                          workingset_nodereclaim 0
                                          pgscan 4950026
                                          pgsteal 4
                                          pgscan_kswapd 0
                                          pgscan_direct 4950026
                                          pgsteal_kswapd 0
                                          pgsteal_direct 4
                                          pgfault 13780
                                          pgmajfault 0
                                          pgrefill 4884435
                                          pgactivate 4950019
                                          pgdeactivate 4884435
                                          pglazyfree 0
                                          pglazyfreed 0
                                          zswpin 0
                                          zswpout 0
                                          thp_fault_alloc 13
                                          thp_collapse_alloc 0
May 08 01:05:17 yellow.toca.local kernel: Tasks state (memory values in pages):
May 08 01:05:17 yellow.toca.local kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
May 08 01:05:17 yellow.toca.local kernel: [ 114988]     0 114988     2076      485    49152        0         -1000 conmon
May 08 01:05:17 yellow.toca.local kernel: [ 115000]   107 115000   335956    13160   401408        0           999 virt-cdi-import
May 08 01:05:17 yellow.toca.local kernel: [ 115032]   107 115032    95298     2604   262144        0           999 nbdkit
May 08 01:05:17 yellow.toca.local kernel: [ 115053]   107 115053   110130     6764   360448        0           999 qemu-img
May 08 01:05:17 yellow.toca.local kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-3ec0af77f3bab257e52e5a33cfde0e88861bae125913f97b356105b5efd7cc18.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice>
May 08 01:05:17 yellow.toca.local kernel: Memory cgroup out of memory: Killed process 115000 (virt-cdi-import) total-vm:1343824kB, anon-rss:19060kB, file-rss:33580kB, shmem-rss:0kB, UID:107 pgtables:392kB oom_score_adj:999

The image is a qcow2 sourced from an httpd server and written to a TopoLVM volume.

spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      name: rhvh-1-rootdisk
      namespace: homelab
    spec:
      preallocation: false
      source:
        http:
          url: http://pi.toca.local:8080/images/rhel-8.6.qcow2
      storage:
        resources:
          requests:
            storage: 100Gi
        storageClassName: lvms-ssd-lvmcluster

Version-Release number of selected component (if applicable):
4.12.2

How reproducible:
Always

Steps to Reproduce:
1. Provision a VM
2. Watch the logs on the node; the import may fail 1-2 times before a retry succeeds

Comment 1 Yan Du 2023-05-10 12:11:44 UTC
Alex, are we sure we want to use writeback cache mode?

Comment 2 Alex Kalenyuk 2023-05-11 12:43:38 UTC
(In reply to Yan Du from comment #1)
> Alex, are we sure we want to use writeback cache mode?

As far as I understand, we are not wrong to want to utilize the page cache instead of completely bypassing it ('writeback');
This has come up before and was extensively discussed. The summary is in https://bugzilla.redhat.com/show_bug.cgi?id=2099479#c14,
the TL;DR being that:
- with cgroups v2 we should not be seeing the OOM kill (I think cgroups v2 got moved out of 4.13 - https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-1448150138)
- we should be careful not to start introducing and fixing a bunch of things just to work well with storage that may or may not be up to par

@alitke Should we reconsider our previous stances?

Comment 3 Germano Veit Michel 2023-05-11 21:35:26 UTC
(In reply to Alex Kalenyuk from comment #2)
> (In reply to Yan Du from comment #1)
> > Alex, are we sure we want to use writeback cache mode?
> 
> As far as I understand we are not in the wrong to want to utilize the page
> cache instead of completely bypassing it ('writeback');
> This has come up before and was extensively discussed. The summary is in
> https://bugzilla.redhat.com/show_bug.cgi?id=2099479#c14,
> TLDR being that:
> - with cgroupsv2 we should not be seeing the oom kill (I think cgroupsv2 got
> moved out of 4.13 -
> https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-
> 1448150138)

Yeah, I'm on 4.13 (now rc8), and it's still on the old cgroups v1 :(
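
For anyone checking their own node, a quick way to confirm which cgroup version it runs (a generic command, assumed here rather than captured from this cluster):

  # "cgroup2fs" means the unified cgroups v2 hierarchy; "tmpfs" means legacy cgroups v1.
  stat -fc %T /sys/fs/cgroup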

> - should be careful not to start introducing and fixing a bunch of things
> just to work well with storage that may or may not be up to par

Isn't this actually the opposite? Using writeback compensates for slow storage that cannot handle O_DIRECT, or for unaligned writes, or for something else that is wrong.

But I don't think it's storage related; the storage can easily handle the writes (in fact it's almost not writing at all, the whole thing is being cached).
The problem is a large amount of free memory on the node (it's idle) combined with the default dirty_ratio/dirty_bytes settings.
I can make the problem go away by setting dirty_bytes to 1/2 of the pod limit (https://access.redhat.com/articles/45002), which forces flushing well before we hit the pod limit that triggers the OOM kill.
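
For illustration only, this is roughly the node-level workaround; the exact values are assumptions tuned to the ~585 MiB pod limit, not a recommended default:

  # Start flushing dirty pages well before the importer's memcg limit is reached.
  sysctl -w vm.dirty_bytes=268435456              # ~256 MiB, about half the pod limit
  sysctl -w vm.dirty_background_bytes=134217728   # begin background writeback at ~128 MiB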

To reproduce:
- Node idle with heaps of free memory
- Download the image from a machine on the local network, so download is quick
- Image bigger than the pod limit

The whole image ends up in the cache and it kills the pod.

I was trying to find a way to play with vm.dirty_* per pod, but there doesn't seem to be a way.

IMHO, something should be done here if cgroups v2 is not coming to 4.13; I think customers will see it, but it's up to you.

> @alitke Should we reconsider our previous stances?

Comment 4 Germano Veit Michel 2023-07-12 11:47:21 UTC
The same problem is hitting HyperShift on CNV (4.13.4).

It fails to provision the workers (from the NodePool) and repeats many times before succeeding, delaying hosted cluster creation.

Jul 12 11:38:20 indigo.shift.home.arpa kernel: memory: usage 585936kB, limit 585936kB, failcnt 3575147
Jul 12 11:38:20 indigo.shift.home.arpa kernel: memory+swap: usage 585936kB, limit 9007199254740988kB, failcnt 0
Jul 12 11:38:20 indigo.shift.home.arpa kernel: kmem: usage 26400kB, limit 9007199254740988kB, failcnt 0
Jul 12 11:38:20 indigo.shift.home.arpa kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod074083a6_5919_4d0a_b7c3_efd462c4bcfe.slice:
Jul 12 11:38:20 indigo.shift.home.arpa kernel: anon 102940672
                                               file 470024192
                                               kernel 27033600
                                               kernel_stack 311296
                                               pagetables 983040
                                               percpu 31176
                                               sock 0
                                               vmalloc 28672
                                               shmem 53248
                                               zswap 0
                                               zswapped 0
                                               file_mapped 0
                                               file_dirty 0
                                               file_writeback 469970944
                                               swapcached 0
                                               anon_thp 41943040
                                               file_thp 0
                                               shmem_thp 0
                                               inactive_anon 102932480
                                               active_anon 61440
                                               inactive_file 234987520
                                               active_file 234983424
                                               unevictable 0
                                               slab_reclaimable 25100472
                                               slab_unreclaimable 502720
                                               slab 25603192
                                               workingset_refault_anon 0
                                               workingset_refault_file 403091
                                               workingset_activate_anon 0
                                               workingset_activate_file 85534
                                               workingset_restore_anon 0
                                               workingset_restore_file 3950
                                               workingset_nodereclaim 39697
                                               pgscan 155410421
                                               pgsteal 4112520
                                               pgscan_kswapd 0
                                               pgscan_direct 155410421
                                               pgsteal_kswapd 0
                                               pgsteal_direct 4112520
                                               pgfault 60886
                                               pgmajfault 0
                                               pgrefill 149534074
                                               pgactivate 150867185
                                               pgdeactivate 149534073
                                               pglazyfree 0
                                               pglazyfreed 0
                                               zswpin 0
                                               zswpout 0
                                               thp_fault_alloc 112
                                               thp_collapse_alloc 0
Jul 12 11:38:20 indigo.shift.home.arpa kernel: Tasks state (memory values in pages):
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ 107047]     0 107047     2076      517    49152        0         -1000 conmon
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ 107059]   107 107059   325091    12759   315392        0           999 virt-cdi-import
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ 111096]   107 111096   246656    22575   647168        0           999 qemu-img

Comment 5 Adam Litke 2023-08-09 17:50:29 UTC
The problem is that if we switch to cache=none then all imports will slow down, and this will impact a different set of customers. Alex, could we experiment with adding a thread to the importer that repeatedly calls sync to flush I/O to disk? We're still going to have a race condition here (especially with slower storage), but maybe that can improve things?
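
A minimal sketch of the idea (standalone and hypothetical, not CDI's implementation): periodically forcing writeback alongside the convert keeps dirty pages from piling up against the memcg limit.

  # Hypothetical illustration only.
  while sleep 5; do sync; done &   # flush dirty pages every few seconds
  SYNC_PID=$!
  qemu-img convert -S 0 -t writeback -p -O raw 'nbd+unix:///?socket=/tmp/nbdkit.sock' /dev/cdi-block-volume
  kill "$SYNC_PID"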

Comment 6 Germano Veit Michel 2023-08-09 21:45:40 UTC
(In reply to Adam Litke from comment #5)
> The problem is that if we switch to cache=none then all imports will slow
> down and this will impact a different set of customers.

cache=none should not slow down anything with fast storage. In fact, the VM runs with cache=none and has no performance issues.

      <driver name='qemu' type='raw' cache='none' error_policy='stop' discard='unmap'/>

This is happening on high-end (but consumer) NVMe (~5 GB/s writes), and the import does not get anywhere near that throughput due to the internet download speed of the images.

I suspect there is some inefficiency in this conversion somewhere else, and it's being masked by using the destination cache in the convert.

Regardless of whether the storage is fast or not, what ultimately determines how much gets cached (and then exceeds the cgroup v1 limit) are the kernel tunables around dirty pages: the thresholds, the amount of memory, etc.
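
For reference, these are the generic knobs involved; checking them on the affected node would confirm it (assumed commands, not output from this cluster):

  # With the usual dirty_ratio default of 20% of RAM, an idle node with lots of free
  # memory will cache the whole image before writeback ever starts.
  sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes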

