Bug 2196072
| Summary: | virt-cdi-import often killed by oom-killer during VM provisioning | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | Storage | Assignee: | Alex Kalenyuk <akalenyu> |
| Status: | ASSIGNED --- | QA Contact: | Natalie Gavrielov <ngavrilo> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.12.2 | CC: | akalenyu, alitke, yadu |
| Target Milestone: | --- | Flags: | alitke: needinfo? (akalenyu) |
| Target Release: | 4.13.4 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Alex, are we sure we want to use writeback cache mode?

(In reply to Yan Du from comment #1)
> Alex, are we sure we want to use writeback cache mode?

As far as I understand, we are not in the wrong to want to utilize the page cache instead of completely bypassing it ('writeback'); this has come up before and was extensively discussed. The summary is in https://bugzilla.redhat.com/show_bug.cgi?id=2099479#c14, the TLDR being that:
- with cgroups v2 we should not be seeing the OOM kill (I think cgroups v2 got moved out of 4.13 - https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-1448150138)
- we should be careful not to start introducing and fixing a bunch of things just to work well with storage that may or may not be up to par

@alitke Should we reconsider our previous stances?

(In reply to Alex Kalenyuk from comment #2)
> (In reply to Yan Du from comment #1)
> > Alex, are we sure we want to use writeback cache mode?
>
> As far as I understand we are not in the wrong to want to utilize the page
> cache instead of completely bypassing it ('writeback');
> This has come up before and was extensively discussed. The summary is in
> https://bugzilla.redhat.com/show_bug.cgi?id=2099479#c14,
> TLDR being that:
> - with cgroupsv2 we should not be seeing the oom kill (I think cgroupsv2 got
> moved out of 4.13 -
> https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-1448150138)

Yeah, I'm on 4.13 (now rc8), and it's still on the old cgroups v1 :(

> - should be careful not to start introducing and fixing a bunch of things
> just to work well with storage that may or may not be up to par

Isn't this actually the opposite? Using writeback to compensate for slow storage that cannot handle O_DIRECT, or unaligned writes, or something else wrong? But I don't think it's storage related; the storage can easily handle the writes (in fact it's almost not writing at all, the whole thing is being cached). The problem is a large amount of free memory on the node (it's idle) combined with the default dirty_ratio/dirty_bytes settings.

I can make the problem go away with dirty_bytes set to 1/2 of the pod limit: https://access.redhat.com/articles/45002. That forces flushing well before we hit the pod limits that trigger the OOM kill.

To reproduce:
- Node idle with heaps of free memory
- Download the image from a machine on the local network, so the download is quick
- Image bigger than the pod limit

The whole image ends up in the cache and it kills the pod. I was trying to find a way to tune vm.dirty_* per pod, but there doesn't seem to be a way. IMHO, something should be done here if cgroups v2 is not coming to 4.13; I think customers will see it, but up to you.

> @alitke Should we reconsider our previous stances?
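A minimal sketch of the dirty_bytes workaround described in the comment above, assuming the 585936 kB pod limit seen in the OOM logs below. This is a node-wide sysctl, which is exactly the limitation noted (there is no per-pod vm.dirty_* knob), and it is a diagnostic workaround rather than a proposed fix:

```sh
# Node-wide sketch (not a per-pod setting): cap dirty page cache at roughly half
# of the importer pod's memory limit (585936 kB in the logs below), so writeback
# is forced long before the memcg limit is reached. Setting vm.dirty_bytes makes
# the kernel ignore vm.dirty_ratio.
sysctl -w vm.dirty_bytes=$((585936 * 1024 / 2))   # ~286 MiB
```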
Same problem hitting HyperShift on CNV (4.13.4).
It fails to provision the workers (from the nodepool) and repeats many times before succeeding, delaying hosted cluster creation.
Jul 12 11:38:20 indigo.shift.home.arpa kernel: memory: usage 585936kB, limit 585936kB, failcnt 3575147
Jul 12 11:38:20 indigo.shift.home.arpa kernel: memory+swap: usage 585936kB, limit 9007199254740988kB, failcnt 0
Jul 12 11:38:20 indigo.shift.home.arpa kernel: kmem: usage 26400kB, limit 9007199254740988kB, failcnt 0
Jul 12 11:38:20 indigo.shift.home.arpa kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod074083a6_5919_4d0a_b7c3_efd462c4bcfe.slice:
Jul 12 11:38:20 indigo.shift.home.arpa kernel: anon 102940672
file 470024192
kernel 27033600
kernel_stack 311296
pagetables 983040
percpu 31176
sock 0
vmalloc 28672
shmem 53248
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 469970944
swapcached 0
anon_thp 41943040
file_thp 0
shmem_thp 0
inactive_anon 102932480
active_anon 61440
inactive_file 234987520
active_file 234983424
unevictable 0
slab_reclaimable 25100472
slab_unreclaimable 502720
slab 25603192
workingset_refault_anon 0
workingset_refault_file 403091
workingset_activate_anon 0
workingset_activate_file 85534
workingset_restore_anon 0
workingset_restore_file 3950
workingset_nodereclaim 39697
pgscan 155410421
pgsteal 4112520
pgscan_kswapd 0
pgscan_direct 155410421
pgsteal_kswapd 0
pgsteal_direct 4112520
pgfault 60886
pgmajfault 0
pgrefill 149534074
pgactivate 150867185
pgdeactivate 149534073
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 112
thp_collapse_alloc 0
Jul 12 11:38:20 indigo.shift.home.arpa kernel: Tasks state (memory values in pages):
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ 107047] 0 107047 2076 517 49152 0 -1000 conmon
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ 107059] 107 107059 325091 12759 315392 0 999 virt-cdi-import
Jul 12 11:38:20 indigo.shift.home.arpa kernel: [ 111096] 107 111096 246656 22575 647168 0 999 qemu-img
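As a quick sanity check on the numbers in the log above (a derived calculation, not part of the original report), almost the entire memcg is dirty destination page cache rather than process memory:

```
file_writeback = 469970944 B ≈ 458956 kB
memcg limit    =              585936 kB
458956 / 585936 ≈ 0.78  ->  ~78% of the pod's limit is page cache under writeback
anon           = 102940672 B ≈ 98 MiB   (the importer processes themselves are small)
```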
The problem is that if we switch to cache=none then all imports will slow down, and this will impact a different set of customers.

Alex, could we experiment with adding a thread to the importer that repeatedly calls sync to flush I/O to disk? We're still going to have a race condition here (especially with slower storage), but maybe that can improve things?

(In reply to Adam Litke from comment #5)
> The problem is that if we switch to cache=none then all imports will slow
> down and this will impact a different set of customers.

cache=none should not slow down anything with fast storage. In fact, the VM runs with cache=none and has no performance issues:

<driver name='qemu' type='raw' cache='none' error_policy='stop' discard='unmap'/>

This is happening on high-end (but consumer) NVMe - 5 GB/s writes - and it does not get even close to that due to the internet download speed of the images. I suspect there is some inefficiency in this conversion somewhere else, and it's being masked by using the destination cache in the convert.

Regardless of whether the storage is fast or not, what ultimately determines how much will be cached (and then go over the cgroup v1 limit) are the kernel tunables around dirty pages, the thresholds, the amount of memory, etc.
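Adam's periodic-sync suggestion above targets the importer process itself (an in-process thread). Purely as an illustration of the intended effect, here is a shell approximation run alongside the convert command captured in the description; the 5-second interval and the standalone-shell form are assumptions, not CDI code:

```sh
# Illustrative approximation only (not CDI's implementation): keep flushing the
# destination device so dirty pages are written back and become reclaimable
# before the pod's memcg limit is reached.
qemu-img convert -S 0 -t writeback -p -O raw \
    'nbd+unix:///?socket=/tmp/nbdkit.sock' /dev/cdi-block-volume &
convert_pid=$!

while kill -0 "$convert_pid" 2>/dev/null; do
    sync /dev/cdi-block-volume   # coreutils sync with a file argument fsyncs just that target
    sleep 5
done
wait "$convert_pid"
```

The race Adam mentions remains: on slower storage the cache can still fill faster than the periodic flush drains it.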
Description of problem:
The cluster often has to attempt to provision a VM multiple times before it actually works. It sometimes fails 2-3 times in a row; it always succeeds after some retries, but quite often fails on the 1st and 2nd attempts. I had seen this rarely before; now, after upgrading the cluster to 4.13-rc7 (CNV still on 4.12.2), it seems to hit more frequently.

I noticed it's writing with "-t writeback", and the page cache accounts for most of the consumed cgroup memory limit (537M out of 585M). This does not seem to be a good option for converting images under a cgroup limit.

107 127318 127223 54 01:15 ? 00:00:10 qemu-img convert -S 0 -t writeback -p -O raw nbd+unix:///?socket=/tmp/nbdkit.sock /dev/cdi-block-volume

May 08 01:05:17 yellow.toca.local kernel: qemu-img invoked oom-killer: gfp_mask=0x101c00(GFP_NOIO|__GFP_HARDWALL|__GFP_WRITE), order=0, oom_score_adj=999
May 08 01:05:17 yellow.toca.local kernel: CPU: 2 PID: 115058 Comm: qemu-img Kdump: loaded Not tainted 5.14.0-284.13.1.el9_2.x86_64 #1
May 08 01:05:17 yellow.toca.local kernel: Call Trace:
May 08 01:05:17 yellow.toca.local kernel: <TASK>
May 08 01:05:17 yellow.toca.local kernel: dump_stack_lvl+0x34/0x48
May 08 01:05:17 yellow.toca.local kernel: dump_header+0x4a/0x201
May 08 01:05:17 yellow.toca.local kernel: oom_kill_process.cold+0xb/0x10
May 08 01:05:17 yellow.toca.local kernel: out_of_memory+0xed/0x2e0
May 08 01:05:17 yellow.toca.local kernel: ? __wake_up_common+0x7d/0x190
May 08 01:05:17 yellow.toca.local kernel: mem_cgroup_out_of_memory+0x13a/0x150
May 08 01:05:17 yellow.toca.local kernel: try_charge_memcg+0x6df/0x7a0
May 08 01:05:17 yellow.toca.local kernel: charge_memcg+0x9f/0x130
May 08 01:05:17 yellow.toca.local kernel: __mem_cgroup_charge+0x29/0x80
May 08 01:05:17 yellow.toca.local kernel: __filemap_add_folio+0x224/0x5a0
May 08 01:05:17 yellow.toca.local kernel: ? scan_shadow_nodes+0x30/0x30
May 08 01:05:17 yellow.toca.local kernel: filemap_add_folio+0x37/0xa0
May 08 01:05:17 yellow.toca.local kernel: __filemap_get_folio+0x1e6/0x330
May 08 01:05:17 yellow.toca.local kernel: ? blkdev_write_begin+0x20/0x20
May 08 01:05:17 yellow.toca.local kernel: pagecache_get_page+0x15/0x90
May 08 01:05:17 yellow.toca.local kernel: block_write_begin+0x24/0xf0
May 08 01:05:17 yellow.toca.local kernel: generic_perform_write+0xbe/0x200
May 08 01:05:17 yellow.toca.local kernel: __generic_file_write_iter+0xe5/0x1a0
May 08 01:05:17 yellow.toca.local kernel: blkdev_write_iter+0xe4/0x160
May 08 01:05:17 yellow.toca.local kernel: new_sync_write+0xfc/0x190
May 08 01:05:17 yellow.toca.local kernel: vfs_write+0x1ef/0x280
May 08 01:05:17 yellow.toca.local kernel: __x64_sys_pwrite64+0x90/0xc0
May 08 01:05:17 yellow.toca.local kernel: do_syscall_64+0x59/0x90
May 08 01:05:17 yellow.toca.local kernel: ? do_futex+0x15b/0x200
May 08 01:05:17 yellow.toca.local kernel: ? __x64_sys_futex+0x73/0x1d0
May 08 01:05:17 yellow.toca.local kernel: ? switch_fpu_return+0x49/0xd0
May 08 01:05:17 yellow.toca.local kernel: ? exit_to_user_mode_prepare+0xec/0x100
May 08 01:05:17 yellow.toca.local kernel: ? syscall_exit_to_user_mode+0x12/0x30
May 08 01:05:17 yellow.toca.local kernel: ? do_syscall_64+0x69/0x90
May 08 01:05:17 yellow.toca.local kernel: ? do_syscall_64+0x69/0x90
May 08 01:05:17 yellow.toca.local kernel: ? sysvec_apic_timer_interrupt+0x3c/0x90
May 08 01:05:17 yellow.toca.local kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
May 08 01:05:17 yellow.toca.local kernel: memory: usage 585936kB, limit 585936kB, failcnt 131
May 08 01:05:17 yellow.toca.local kernel: memory+swap: usage 585936kB, limit 9007199254740988kB, failcnt 0
May 08 01:05:17 yellow.toca.local kernel: kmem: usage 17716kB, limit 9007199254740988kB, failcnt 0
May 08 01:05:17 yellow.toca.local kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb18ee349_565f_4cb6_9f2b_f4e67a7a7927.slice:
May 08 01:05:17 yellow.toca.local kernel: anon 44654592
file 537202688
kernel 18141184
kernel_stack 262144
pagetables 1024000
percpu 20856
sock 0
vmalloc 28672
shmem 53248
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 537149440
swapcached 0
anon_thp 14680064
file_thp 0
shmem_thp 0
inactive_anon 44646400
active_anon 61440
inactive_file 268509184
active_file 268574720
unevictable 0
slab_reclaimable 16203816
slab_unreclaimable 497488
slab 16701304
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgscan 4950026
pgsteal 4
pgscan_kswapd 0
pgscan_direct 4950026
pgsteal_kswapd 0
pgsteal_direct 4
pgfault 13780
pgmajfault 0
pgrefill 4884435
pgactivate 4950019
pgdeactivate 4884435
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 13
thp_collapse_alloc 0
May 08 01:05:17 yellow.toca.local kernel: Tasks state (memory values in pages):
May 08 01:05:17 yellow.toca.local kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
May 08 01:05:17 yellow.toca.local kernel: [ 114988] 0 114988 2076 485 49152 0 -1000 conmon
May 08 01:05:17 yellow.toca.local kernel: [ 115000] 107 115000 335956 13160 401408 0 999 virt-cdi-import
May 08 01:05:17 yellow.toca.local kernel: [ 115032] 107 115032 95298 2604 262144 0 999 nbdkit
May 08 01:05:17 yellow.toca.local kernel: [ 115053] 107 115053 110130 6764 360448 0 999 qemu-img
May 08 01:05:17 yellow.toca.local kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-3ec0af77f3bab257e52e5a33cfde0e88861bae125913f97b356105b5efd7cc18.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice>
May 08 01:05:17 yellow.toca.local kernel: Memory cgroup out of memory: Killed process 115000 (virt-cdi-import) total-vm:1343824kB, anon-rss:19060kB, file-rss:33580kB, shmem-rss:0kB, UID:107 pgtables:392kB oom_score_adj:999

The image is sourced from an httpd server; it is a qcow2, and it is written to a topolvm volume.

spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      name: rhvh-1-rootdisk
      namespace: homelab
    spec:
      preallocation: false
      source:
        http:
          url: http://pi.toca.local:8080/images/rhel-8.6.qcow2
      storage:
        resources:
          requests:
            storage: 100Gi
        storageClassName: lvms-ssd-lvmcluster

Version-Release number of selected component (if applicable):
4.12.2

How reproducible:
Always

Steps to Reproduce:
1. Provision a VM
2. Watch the logs on the node; it may fail 1-2 times before a retry succeeds
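For reference, the cache-mode alternative debated in the comments, shown next to the convert invocation captured above. Whether CDI should switch is exactly what is under discussion, and '-t none' requires the destination to support O_DIRECT (aligned I/O):

```sh
# Current invocation: destination writes go through the page cache and are
# charged to the importer pod's memcg.
qemu-img convert -S 0 -t writeback -p -O raw 'nbd+unix:///?socket=/tmp/nbdkit.sock' /dev/cdi-block-volume

# Alternative discussed in the comments: '-t none' opens the destination with
# O_DIRECT, keeping the pod's page-cache usage flat at the cost of losing
# cache-assisted writes on slower or unaligned storage.
qemu-img convert -S 0 -t none -p -O raw 'nbd+unix:///?socket=/tmp/nbdkit.sock' /dev/cdi-block-volume
```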