Bug 1507149
| Field | Value |
|---|---|
| Summary | [LLNL 7.5 Bug] slab leak causing a crash when using kmem control group |
| Product | Red Hat Enterprise Linux 7 |
| Component | kernel |
| Kernel sub component | Control Groups |
| Version | 7.4 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | urgent |
| Keywords | ZStream |
| Reporter | Ben Woodard <woodard> |
| Assignee | Aristeu Rozanski <arozansk> |
| QA Contact | Chao Ye <cye> |
| Docs Contact | |
| CC | aaron_wilk, afox, aleksander.lukasz, amdas, anand, aquini, arozansk, arunmk, atomlin, atragler, bjarolim, cfernandez, chaithco, charles.wright, chenxu198511, codyja, cye, daniel.kuffner, dhoward, dwd, foraker1, fwissing, hannsj_uhl, ikulkarn, jennyw, jentrena, jijia, joaquin.lopez, jpirko, jpmenil, julio.valcarcel, jwinn, kolyshkin, kyoshida, lining916740672, liwan, mikelley, mmilgram, mschibli, mystery.wd, nmurray, pasik, perobins, plazonic, qguo, quackmaster, rbobek, rcheerla, redhat-bugs, redtex, R.Eggermont, ribarry, riehecky, rmonther, sakulkar, shuwang, sman, smeisner, sroza, subhat, sukulkar, svarada, tgraf, tgummels, uobergfe, vagrawal, vamsee, woodard, wredtex, xyhao06, yoliynyk, zhenghc00 |
| Flags | jpmenil: needinfo-, jpmenil: needinfo- |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | kernel-3.10.0-1075.el7 |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Clones | 1748234, 1748236, 1748237, 1752421 |
| Environment | |
| Last Closed | 2020-03-31 19:12:49 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1392283, 1420851, 1549423, 1599298, 1649189, 1689150, 1713152, 1748234, 1748236, 1748237, 1752421, 1754591 |
| Attachments | |
Description (Ben Woodard, 2017-10-27 20:06:02 UTC)
We found that when we set ConstrainKmemSpace=no in the Slurm configuration there were no crashes over the weekend, so the kmem control group is strongly implicated as the source of the problem. I asked LLNL's kernel engineer to apply a44cb9449182fd7b25bf5f1cc38b7f19e0b96f6d, which could deal with at least part of the problem. ref: https://bugzilla.redhat.com/show_bug.cgi?id=1442618

From the reporter: I did some testing on a 7.4 node today, and I was able to trivially recreate the behavior where setting a kmem limit in a cgroup causes a number of slab caches to get created that don't go away when the cgroup is removed. The steps were:

1. `mkdir /sys/fs/cgroup/memory/kmem_test`
2. `echo 137438953472 > /sys/fs/cgroup/memory/kmem_test/memory.kmem.max_usage_in_bytes` (128GB, the total RAM in the node)
3. Spawn a shell and echo its pid into /sys/fs/cgroup/memory/kmem_test/tasks to put the shell into the cgroup.
4. Do a little work in the shell. Pinging the management node seemed to be enough.
5. `ls /sys/kernel/slab | grep kmem_test` will show a bunch of new slab caches.
6. Exit the shell. When I tried this, it took the shell 30-60 seconds to exit. Not sure what was up there.
7. `rmdir /sys/fs/cgroup/memory/kmem_test`
8. `ls /sys/kernel/slab | grep kmem_test` shows the slabs are still there. No panics/messages to syslog though, and no broken symlinks in /sys/kernel/slab.

I'm not rightly certain whether all of the child slab caches are supposed to go away when the cgroup is removed or not, but letting them build up into the hundreds of thousands probably isn't good behavior.

----------

Is that the expected behavior, or do we have our reproducer here? What is supposed to reap these now-abandoned slab caches? It seems like the build-up of these slab caches is what ultimately crashes the node, so the reproducer that ultimately leads to the crash is probably something like what is described above, iterated many times.

Good morning, I hope you don't mind if I butt in. We have a cluster of 80 Intel OPA/PSM2 nodes running 7.4, Slurm 17.04 with memory enforcement, and with hfi1 drivers updated to Intel's latest 10.6.x.x that is experiencing similar issues. For example, after a while we see lots of:

```
[684578.150960] cache_from_obj: Wrong slab cache. kmem_cache_node but object is from kmalloc-64(6281:step_0)
[684578.162326] cache_from_obj: Wrong slab cache. kmem_cache_node but object is from kmalloc-64(6281:step_0)
[684578.173546] cache_from_obj: Wrong slab cache. kmem_cache but object is from kmalloc-256(6281:step_0)
[684578.186536] cache_from_obj: Wrong slab cache. kmem_cache_node but object is from kmalloc-64(6281:step_0)
[684578.197869] cache_from_obj: Wrong slab cache. kmem_cache_node but object is from kmalloc-64(6281:step_0)
[684578.210447] cache_from_obj: Wrong slab cache. kmem_cache but object is from kmalloc-256(6281:step_0)
```

in the logs, and we see huge /sys/kernel/slab directories (e.g. 120k entries). No node crashes, but we do get weird program crashes that we cannot easily explain any other way. Anyway, I just wanted to report that we tried applying a44cb9449182fd7b25bf5f1cc38b7f19e0b96f6d to 3.10.0-693.5.2.el7 and, while it does not fix the problem fully, it does help. We repeatedly ran a simple MPI program (report rank), and on the node without the patch `ls /sys/kernel/slab | grep step | wc -l` grew at twice the rate of the node with the patch (after an hour, 2043 vs 965). We'd be happy to try more things if that might help.

Thanks, Josko
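The crash reproducer hinted at above (the manual steps iterated many times) can be sketched as a small loop. This is an illustration only, not a script from the original reports: the cgroup path and the 128 GB value come from the manual steps, the iteration count and the `cat` used as "a little work" are arbitrary, and it writes memory.kmem.limit_in_bytes (the knob used by the scripts later in this bug) rather than the max_usage_in_bytes file quoted above.

```bash
#!/bin/bash
# Sketch of the iterated reproducer: create a kmem-limited memory cgroup,
# run a short-lived process inside it, remove the cgroup, and count the
# per-cgroup slab caches left behind in /sys/kernel/slab.
# Assumes the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory.
CG=/sys/fs/cgroup/memory/kmem_test

for i in $(seq 1 1000); do
    mkdir "$CG"
    echo 137438953472 > "$CG/memory.kmem.limit_in_bytes"   # 128 GB, as in the manual steps
    # Short-lived shell that joins the cgroup and does a little work.
    sh -c "echo \$\$ > $CG/tasks; cat /proc/self/status > /dev/null"
    rmdir "$CG"
    echo "iteration $i: $(ls /sys/kernel/slab | grep -c kmem_test) leaked kmem_test caches"
done
```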
Hi, can you both try this patch and report back its effect on the issue:

```
commit f57947d27711451a7739a25bba6cddc8a385e438
Author: Li Zefan <lizefan>
Date:   Tue Jun 18 18:41:53 2013 +0800

    cgroup: fix memory leak in cgroup_rm_cftypes()

    The memory allocated in cgroup_add_cftypes() should be freed. The effect
    of this bug is we leak a bit memory everytime we unload cfq-iosched
    module if blkio cgroup is enabled.
```

I was advised the above might help. Travis

Travis, before we ask LLNL to test random patches, can we do some testing to make sure that we don't continue to see evidence of the problem using the reproducer provided?

Good afternoon, I just finished testing with the 2nd patch as well, and there isn't a huge difference. After 60 MPI runs I get 3175 items in /sys/kernel/slab unpatched, 1285 with the first patch, and 1290 with both. I also kept a listing of the /sys/kernel/slab directory after every test, in case you want to compare what's happening, and can put it up somewhere. Thanks for the suggestion; if it can help, keep them coming.

Encountered this issue with nvidia-docker (cgroup open) and 3.10.0-514.el7.x86_64 on CentOS:

```
[6728212.703168] [<ffffffff811c35c4>] __kmem_cache_shutdown+0x14/0x80
[6728212.703196] [<ffffffff8118ceaf>] kmem_cache_destroy+0x3f/0xe0
[6728212.703222] [<ffffffff811d4949>] kmem_cache_destroy_memcg_children+0x89/0xb0
[6728212.703252] [<ffffffff8118ce84>] kmem_cache_destroy+0x14/0xe0
[6728212.703284] [<ffffffffa167a507>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[6728212.703321] [<ffffffffa167c11e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[6728212.703355] [<ffffffffa1650b10>] remove_gpu+0x220/0x300 [nvidia_uvm]
[6728212.703387] [<ffffffffa1650e31>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[6728212.703423] [<ffffffffa1654588>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[6728212.703458] [<ffffffffa164a401>] uvm_release+0x11/0x20 [nvidia_uvm]
[6728212.703486] [<ffffffff811e0329>] __fput+0xe9/0x270
[6728212.703509] [<ffffffff811e05ee>] ____fput+0xe/0x10
[6728212.703532] [<ffffffff810a22f4>] task_work_run+0xc4/0xe0
[6728212.703557] [<ffffffff810815eb>] do_exit+0x2cb/0xa60
[6728212.703580] [<ffffffff810e1f22>] ? __unqueue_futex+0x32/0x70
[6728212.703605] [<ffffffff810e2f7d>] ? futex_wait+0x11d/0x280
[6728212.704619] [<ffffffff81081dff>] do_group_exit+0x3f/0xa0
[6728212.705598] [<ffffffff81092c10>] get_signal_to_deliver+0x1d0/0x6d0
[6728212.706559] [<ffffffff81014417>] do_signal+0x57/0x6c0
[6728212.707495] [<ffffffff8110b796>] ? __audit_syscall_exit+0x1e6/0x280
[6728212.708407] [<ffffffff81014adf>] do_notify_resume+0x5f/0xb0
```

Ben, can you please try the attached patches (non obsolete)? The following commits from Linus' tree are included:

- 363a044 memcg, slab: kmem_cache_create_memcg(): fix memleak on fail path
- 1aa1325 memcg, slab: clean up memcg cache initialization/destruction
- 2edefe1 memcg, slab: fix races in per-memcg cache creation/destruction
- b852990 memcg, slab: do not destroy children caches if parent has aliases

Hullo, I am not Ben but I gave it a try nevertheless. The patches apply cleanly to the 693.17.1 kernel but don't seem to help at all. The number of entries in /sys/kernel/slab increases at almost exactly the same rate as for a clean 693.17.1 kernel. Josko

There are a few reports of the same thing happening when running Docker here: https://github.com/moby/moby/issues/35446 The kernel versions reported there are all different, but they look recent.
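For anyone comparing patched and unpatched nodes the way the MPI tests above were compared, a simple growth tracker can be left running on each node. This is an informal sketch, not part of the original testing; the `step` pattern matches the Slurm job-step cgroup names seen in this bug, so substitute whatever naming your workload uses.

```bash
#!/bin/bash
# Log a timestamp and the number of per-job slab cache entries once a minute,
# mirroring the "ls /sys/kernel/slab | grep step | wc -l" comparison above.
while true; do
    printf '%s %s\n' "$(date +%s)" "$(ls /sys/kernel/slab | grep -c step)"
    sleep 60
done
```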
```bash
#!/bin/bash
mkdir -p pages
for x in `seq 1280000`; do
    [ $((x % 1000)) -eq 0 ] && echo $x
    mkdir /sys/fs/cgroup/memory/foo
    # echo 1M > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
    echo 100M > /sys/fs/cgroup/memory/foo/memory.kmem.limit_in_bytes
    echo $$ >/sys/fs/cgroup/memory/foo/cgroup.procs
    memhog 4K &>/dev/null
    echo trex>pages/$x
    echo $$ >/sys/fs/cgroup/memory/cgroup.procs
    rmdir /sys/fs/cgroup/memory/foo
done
```

```
[root@localhost ~]# cat /proc/mounts
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=4077588k,nr_inodes=1019397,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
/dev/mapper/centos-root / xfs rw,relatime,attr2,inode64,noquota 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=31,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
/dev/vda1 /boot xfs rw,relatime,attr2,inode64,noquota 0 0
```

I encountered the same issue using the script above. I got this script from https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546 and made some changes to reproduce "Docker leaking cgroups causing no space left on device? #29638" (https://github.com/moby/moby/issues/29638), but unfortunately ran into this bug. On longterm stable 4.4.121 I did not see the same behavior. FYI.

We are encountering this same issue with CentOS 7, kernel 3.10.0-693.21.1.el7.x86_64, and docker-ce-17.12.1.ce-1.el7.centos.x86_64. The behavior we see is that after a period of heavy container usage on a host, the /sys/kernel/slab directory will have over 1 million files/dirs and new containers won't start up anymore. Also, the following is logged heavily:

```
May 14 19:50:06 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(5071:9934c4c3821bcb474f5a113d5a1822c276fcf196d6c2e31b1b8dd969df9d50d2)
May 14 19:50:06 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(5071:9934c4c3821bcb474f5a113d5a1822c276fcf196d6c2e31b1b8dd969df9d50d2)
May 14 19:50:06 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(5071:9934c4c3821bcb474f5a113d5a1822c276fcf196d6c2e31b1b8dd969df9d50d2)
May 14 19:50:06 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(5071:9934c4c3821bcb474f5a113d5a1822c276fcf196d6c2e31b1b8dd969df9d50d2)
```

It looks like Kubernetes users are also experiencing this issue, where the recent change of enabling the kmem limit in Kubernetes causes cgroups to leak. For details please refer to https://github.com/kubernetes/kubernetes/issues/61937.

Is this just pending someone testing the patches? 1442618, referenced up near the top, is a private bug -- can the relevant parts be copied here (or has that already been done), or is there another bug that has them?

This is biting us and at least one of our customers via the Kubernetes issue linked above. The following script reproduces it reliably on a Kubernetes node running kernel 3.10.0-862.9.1.el7.x86_64:

```bash
#!/bin/bash
run_some_pods () {
    for j in {1..20}
    do
        kubectl run -i -t busybox-${j} --image=busybox --restart=Never -- echo "hi" &> /dev/null &
    done
    sleep 1
    wait
    for j in {1..20}
    do
        kubectl delete pods busybox-${j} --grace-period=0 --force &> /dev/null || true
    done
}

for i in {1..3279}
do
    echo "Running pod batch $i"
    run_some_pods
done
```

We'd be happy to test any patches.

We were able to work around the issue in the context of Kubernetes 1.8.12 by doing the following on startup: before running the v1.8 kubelet, we run a v1.6 kubelet with suitable arguments and verify that it completes enough of its startup process to create /sys/fs/cgroup/memory/kubepods. Then we start up the Kubernetes 1.8.12 cluster as normal.

We tried creating those cgroup folders manually as a simpler version of the above workaround, using the following script:

```bash
#!/bin/bash
for thing in memory blkio "cpu,cpuacct" cpuset devices freezer hugetlb memory net_cls perf_event systemd
do
    mkdir -p /sys/fs/cgroup/${thing}/kubepods
    mkdir -p /sys/fs/cgroup/${thing}/kubepods/burstable
    mkdir -p /sys/fs/cgroup/${thing}/kubepods/besteffort
done
```

However, this simpler attempt did not prevent the repro script above from showing the issue. What would we need to add to this simpler workaround to prevent the issue the way starting a v1.6 kubelet does?
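To tell whether a node is already affected, or whether one of the workarounds above is holding, the leak indicators discussed in this bug can be checked with a short script. This is an informal sketch, not something from the original reports; the thresholds are whatever you consider alarming for your fleet.

```bash
#!/bin/bash
# Rough health check for the leak described in this bug:
# count slab cache entries and live memory cgroups.
slab_dirs=$(ls /sys/kernel/slab | wc -l)
memcg_count=$(awk '$1 == "memory" {print $3}' /proc/cgroups)   # num_cgroups column

echo "slab cache entries under /sys/kernel/slab: ${slab_dirs}"
echo "memory cgroups reported by /proc/cgroups:  ${memcg_count}"

# Hundreds of thousands of slab entries, or a memory cgroup count creeping
# toward the 65535 ID limit discussed later in this bug, point to the leak.
```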
Posting stats from one of the machines that shows this problem: Problem: > [root@vd0916 ~]# mkdir /sys/fs/cgroup/memory/b6c438c7ba469d5c3bad89c4a81d732a68a375b096d3791c4847cd517f8eee7c > mkdir: cannot create directory ‘/sys/fs/cgroup/memory/b6c438c7ba469d5c3bad89c4a81d732a68a375b096d3791c4847cd517f8eee7c’: No space left on device Some metrics: > [root@vd0916 ~]# ll /sys/kernel/slab | wc -l > 2587635 > [root@vd0916 ~]# slabtop -s -c Active / Total Objects (% used) : 87188577 / 222858888 (39.1%) Active / Total Slabs (% used) : 4716565 / 4716565 (100.0%) Active / Total Caches (% used) : 77 / 118 (65.3%) Active / Total Size (% used) : 17769913.34K / 63916938.41K (27.8%) Minimum / Average / Maximum Object : 0.01K / 0.29K / 8.00K OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 29129856 15635925 53% 0.19K 693568 42 5548544K dentry 23371439 5516568 23% 0.57K 417348 56 13355136K radix_tree_node 22468896 22386411 99% 0.11K 624136 36 2496544K sysfs_dir_cache 19747455 13294931 67% 0.10K 506345 39 2025380K buffer_head 17578139 5788599 32% 0.15K 331663 53 2653304K xfs_ili 15482112 88288 0% 0.25K 241908 64 3870528K kmalloc-256 11236544 44493 0% 0.50K 175571 64 5618272K kmalloc-512 11229792 1416428 12% 0.38K 267376 42 4278016K blkdev_requests 11110932 5855107 52% 0.09K 264546 42 1058184K kmalloc-96 9687120 1467586 15% 1.06K 322904 30 10332928K xfs_inode 9019200 3112167 34% 0.12K 140925 64 1127400K kmalloc-128 8103040 894521 11% 0.06K 126610 64 506440K kmalloc-64 7442190 2117485 28% 0.19K 177195 42 1417560K kmalloc-192 7131390 32130 0% 0.23K 101877 70 1630032K cfq_queue 4258608 753366 17% 0.66K 88721 48 2839072K shmem_inode_cache 3663990 2814407 76% 0.58K 66618 55 2131776K inode_cache 3575296 382122 10% 0.01K 6983 512 27932K kmalloc-8 3378048 2483023 73% 0.03K 26391 128 105564K kmalloc-32 1172800 1168793 99% 0.06K 18325 64 73300K kmem_cache_node 660992 152980 23% 0.02K 2582 256 10328K kmalloc-16 588736 587894 99% 0.25K 9199 64 147184K kmem_cache 570556 336689 59% 0.64K 11644 49 372608K proc_inode_cache 527904 28209 5% 1.00K 16497 32 527904K kmalloc-1024 308160 6622 2% 1.56K 15408 20 493056K mm_struct 225215 4247 1% 1.03K 7265 31 232480K nfs_inode_cache 201952 133963 66% 4.00K 25244 8 807808K kmalloc-4096 132600 3240 2% 0.39K 3315 40 53040K xfs_efd_item 123368 123368 100% 0.07K 2203 56 8812K Acpi-ParseExt 110048 37821 34% 2.00K 6878 16 220096K kmalloc-2048 71055 3090 4% 2.06K 4737 15 151584K idr_layer_cache 51684 50881 98% 0.05K 708 73 2832K fanotify_event_info 51000 50933 99% 0.12K 750 68 6000K dnotify_mark 37504 36803 98% 0.06K 586 64 2344K anon_vma 33303 33303 100% 0.08K 653 51 2612K selinux_inode_security 28083 27187 96% 0.21K 759 37 6072K vm_area_struct 27534 18839 68% 0.81K 706 39 22592K task_xstate 26214 21624 82% 0.04K 257 102 1028K Acpi-Namespace 26163 25143 96% 0.31K 513 51 8208K nf_conntrack_ffffffff81a26d80 25959 25808 99% 0.62K 509 51 16288K sock_inode_cache 25530 25312 99% 0.09K 555 46 2220K configfs_dir_cache 24565 24225 98% 0.05K 289 85 1156K shared_policy_node 23205 22976 99% 0.10K 595 39 2380K blkdev_ioc 20416 20197 98% 0.18K 464 44 3712K xfs_log_ticket > cat /proc/slabinfo <<<< See file attached >>>> Created attachment 1479631 [details]
Output of /proc/slabinfo on the machine showing the issue
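As a side note, when a machine is in this state it can help to see which parent caches the leaked per-cgroup entries belong to. The one-liner below is a sketch, not an official tool; the naming pattern it relies on (entries of the form `<cache>(<id>:<cgroup>)`) is taken from the cache_from_obj messages and /sys/kernel/slab listings quoted in this bug.

```bash
# Group per-cgroup slab cache entries by their parent cache name and show
# the most common ones, e.g. "kmalloc-256(5071:<container id>)" -> "kmalloc-256".
ls /sys/kernel/slab | sed -n 's/([^)]*)$//p' | sort | uniq -c | sort -rn | head -20
```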
Kernel version of the host on which I was able to reproduce the issue:

```
[root@vd0916 ~]# uname -r
3.10.0-327.36.3.el7.x86_64
```

Reproducibility steps: I was running a script similar to what @Anand has posted in https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c44, starting a lot of Kubernetes pods and killing them in a tight loop.

The source of the strange workaround in comment 45 was this article: http://www.linuxfly.org/kubernetes-19-conflict-with-centos7/? which one of our colleagues translated for us. His summary was: the reason that upgrading k8s from 1.6 to 1.9 alone does not reproduce the cgroup memory leak issue is that, without restarting the node, kubelet will re-use the /sys/fs/cgroup/memory/kubepods directory that was created by the k8s 1.6 kubelet, which has the cgroup kernel memory limit disabled by default. Restarting the nodes will result in the /sys/fs/cgroup/memory/kubepods directory being re-created by the k8s 1.9 kubelet, and hence it will have the cgroup kernel memory limit enabled by default. Tomorrow I'll start a test to double-check the workaround on a second node.

I tried the script in Comment 32 and was able to hit the same issue (where a docker run fails due to ENOSPC). The script is copied below for ease of reading:

```bash
#!/bin/bash
mkdir -p pages
for x in `seq 1280000`; do
    [ $((x % 1000)) -eq 0 ] && echo $x
    mkdir /sys/fs/cgroup/memory/foo
    # echo 1M > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
    echo 100M > /sys/fs/cgroup/memory/foo/memory.kmem.limit_in_bytes
    echo $$ >/sys/fs/cgroup/memory/foo/cgroup.procs
    memhog 4K &>/dev/null
    echo trex>pages/$x
    echo $$ >/sys/fs/cgroup/memory/cgroup.procs
    rmdir /sys/fs/cgroup/memory/foo
done
```

I suspected that this could be due to short-lived processes (our use case potentially has many short-lived containers in an error-handling path).
I modified the script to the following, with the sleeper being a compiled C program that prints its pid (getpid()) and then sleeps for a second:

```bash
#!/bin/bash
mkdir -p pages
for x in `seq 1280000`; do
    [ $((x % 1000)) -eq 0 ] && echo $x
    mkdir /sys/fs/cgroup/memory/test2
    # echo 1M > /sys/fs/cgroup/memory/test2/memory.limit_in_bytes
    echo 100M > /sys/fs/cgroup/memory/test2/memory.kmem.limit_in_bytes
    ./sleeper >/sys/fs/cgroup/memory/test2/cgroup.procs
    memhog 4K &>/dev/null
    echo trex>pages/$x
    echo $$ >/sys/fs/cgroup/memory/cgroup.procs
    rmdir /sys/fs/cgroup/memory/test2
done
```

On a machine without the ENOSPC:

```
ls -lrt /sys/kernel/slab/ | wc -l
251

slabtop -s l -o
 Active / Total Objects (% used)    : 533404 / 535929 (99.5%)
 Active / Total Slabs (% used)      : 18399 / 18399 (100.0%)
 Active / Total Caches (% used)     : 65 / 95 (68.4%)
 Active / Total Size (% used)       : 182723.21K / 183691.12K (99.5%)
 Minimum / Average / Maximum Object : 0.01K / 0.34K / 8.00K
   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 124131  123889  99%    0.19K   5911       21     23644K dentry
  89908   89908 100%    0.15K   3458       26     13832K xfs_ili
  95280   95280 100%    1.06K   3176       30    101632K xfs_inode
  89427   89427 100%    0.10K   2293       39      9172K buffer_head
  16335   16147  98%    0.58K    605       27      9680K inode_cache
  18072   18072 100%    0.11K    502       36      2008K sysfs_dir_cache
```

On a machine with the ENOSPC:

```
ls -lrt /sys/kernel/slab/ | wc -l
2491573

sudo slabtop -s l -o
 Active / Total Objects (% used)    : 99818530 / 209669755 (47.6%)
 Active / Total Slabs (% used)      : 4532638 / 4532638 (100.0%)
 Active / Total Caches (% used)     : 125 / 161 (77.6%)
 Active / Total Size (% used)       : 21303475.11K / 59862027.02K (35.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.29K / 8.00K
     OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 30771153 17915952  58%    0.19K 732647       42   5861176K dentry
 22472028 22454051  99%    0.11K 624223       36   2496892K sysfs_dir_cache
 22912383 19079451  83%    0.10K 587497       39   2349988K buffer_head
 18623542  3876899  20%    0.57K 333069       56  10658208K radix_tree_node
  8240850   837650  10%    1.06K 274695       30   8790240K xfs_inode
 10468080  6422369  61%    0.09K 249240       42    996960K kmalloc-96
 10072062   677935   6%    0.38K 239811       42   3836976K blkdev_requests
 13732880   415473   3%    0.25K 214577       64   3433232K kmalloc-256
```

I can see that with a slightly long-lived process (as shown below), the issue has not been hit even though we have run through 90k iterations. The number of active slabs stays at 18k for the 1-second processes, while for the short-lived process the active slab count was at 400k at a similar point.

```
time ./sleeper
4427

real    0m1.001s
user    0m0.000s
sys     0m0.001s

time echo $$
29978

real    0m0.000s
user    0m0.000s
sys     0m0.000s
```

Based on a superficial understanding of the comments in the bug and the test results, it seems that the 'leak' is exercised in cases where there are many short-lived processes/containers. Is this assumption right? We can implement this as a workaround if it reduces the frequency of the leak.

In continuation to Comment 50, I just tried to run the script with the short-lived process on the side. The 'Active / Total Slabs (% used)' jumped from 18k to 27k in about 1k iterations of the script and does not show any signs of going down. So I strongly suspect that a quickly exiting process could somehow lead to this state. Could someone please corroborate this based on a better understanding of the issue and the patch? We could implement a workaround based on this.
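The `sleeper` binary above is described but not attached to this bug. For anyone repeating the comparison, a shell stand-in with the same observable behaviour (print its own PID so the caller can redirect it into cgroup.procs, then stay alive for about a second) is sketched below; this is a reconstruction under that assumption, not the original C program.

```bash
#!/bin/bash
# Stand-in for the compiled "sleeper" used in the test above: write this
# process's PID to stdout (the caller redirects it into cgroup.procs) and
# then stay alive for roughly one second so the process is not short-lived.
echo $$
sleep 1
```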
I cannot view External Bug ID: Red Hat Knowledge Base (Solution) 3562061.

I cannot see that KB article either. Can you please update this?

(In reply to Vamsee Yarlagadda from comment #48)
> Kernel version of the host that I was able to reproduce the issue.
> > [root@vd0916 ~]# uname -r
> > 3.10.0-327.36.3.el7.x86_64

As per Bug 1502818, can you attempt to reproduce the reported issue under kernel-3.10.0-862.el7?

We seem to be experiencing this issue with short-lived containers on kernel 3.10.0-862.9.1.el7.x86_64 and Docker version 18.03.1-ce, build 9ee9f40.

So, as far as I know the kernel memory controller in the RHEL7 kernel is very old, useless*, and buggy, and it either needs to be disabled entirely or fixed by backporting changes from 4.x kernels (where it was basically rewritten from scratch). A workaround is NOT to enable/use kmem; I am working on that in https://github.com/opencontainers/runc/pull/1921 (or https://github.com/opencontainers/runc/pull/1920). (A sketch of the boot-parameter form of this workaround appears after the attachment notes below.)

* It is useless because there is no reclamation of kernel memory, i.e. dcache and icache entries are not reclaimed once we run out of kernel memory in this cgroup.

The RHEL 7.6 kernel should have version number 3.10.0-957 -- can anyone check if the bug is fixed in there?

The issue is a bit different, but the symptoms are the same. Upstream, the slab caches aren't freed immediately, only when memory pressure occurs. Because of changes present upstream, it's possible to free the memcg id as soon as it goes offline, so the slab caches and the corresponding memcg keep existing until everything else is freed. Two things are happening in RHEL7: kmem caches aren't being emptied even during memory pressure, and the memcg id isn't being freed, so the -ENOMEM and the failure to create new memcgs come from the fact that there are 65535 memory cgroups still holding an id while waiting for their slab caches to be freed. This has not been fixed in 7.6 and I'm working on the fix. It's possible that just solving the first problem will address both, but I'm not sure yet.

Test kernel: http://people.redhat.com/~arozansk/a18a3f62f17ae367d23647cad92c05f0b5ca39dc/

Known issues: dentry isn't being accounted; moving a process while it's allocating kmem with existing or new slab caches will leak both the cache and the old memory cgroup. In practical terms, this test kernel should significantly reduce the kmem cache and memory cgroup leaks in short-lived containers where the processes running in the container finish before the cgroup is removed. I'll keep working on the remaining issues.

Aristeu, is there any decent way to see which backports you have applied (other than downloading the src.rpm and trying to figure it out)? In general, all related upstream fixes should be from Vladimir Davydov (vdavydov and vdavydov.dev).

Created attachment 1578367 [details]
Test patches.
Attached.
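As a practical illustration of the "do not enable/use kmem" workaround mentioned above: on RHEL 7 the kernel-memory controller can be switched off at boot with the `cgroup.memory=nokmem` parameter, the same switch that is suggested and confirmed to help later in this bug. The grubby invocation below is only a sketch of how that parameter is typically added; verify it against your own boot configuration before relying on it.

```bash
# Sketch: disable kernel-memory accounting at boot so container runtimes
# cannot enable per-cgroup kmem limits (the trigger for this leak).
grubby --update-kernel=ALL --args="cgroup.memory=nokmem"
reboot

# After the reboot, the parameter should show up on the kernel command line:
grep -o 'cgroup.memory=nokmem' /proc/cmdline
```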
Folks, can I get some feedback on the test kernel please?

Created attachment 1591753 [details]
Slab active/Total Graph
Created attachment 1591754 [details]
Slab active/Total Graph data
Patch(es) committed on kernel-3.10.0-1075.el7

Quick question about the kernel the update is committed on... I see that 7.7 was released with kernel 3.10.0-1062.el7 earlier this month, and in https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c101 Jan Stancek mentioned the patch was included in kernel 3.10.0-1075.el7. RHEL 8.0 starts with the new 4.18.0 kernel. Typically the kernel updates are 1062.x.x releases, i.e. the latest kernel for 7.6 is 3.10.0-957.27.4.el7. My question is, which release of RHEL is 3.10.0-1075.el7 slated to be included in? 7.8?

(In reply to Aaron from comment #105)
> Quick question about the kernel the update is committed on... I see that 7.7
> was released with kernel 3.10.0-1062.el7 earlier this month and in
> https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c101 Jan Stancek
> mentioned the patch was included in kernel 3.10.0-1075.el7. RHEL 8.0 starts
> with the new 4.18.0 kernel. Typically the kernel updates are 1062.x.x
> releases, i.e. the latest kernel for 7.6 is 3.10.0-957.27.4.el7.
>
> My question is, which release of RHEL is 3.10.0-1075.el7 slated to be
> included in? 7.8?

7.8. These fixes are likely to be present in a future zstream release.

*** Bug 1739145 has been marked as a duplicate of this bug. ***

https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c106

Thanks @Aristeu, I see https://bugzilla.redhat.com/show_bug.cgi?id=1748237 was created for a z-stream update to 7.5 with status of "ON_QA". Since 7.8 has not yet been released, is there a z-stream 3.10.0-1062.x release in the works as well?

Hi Aaron,

(In reply to Aaron from comment #118)
> https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c106
>
> Thanks @Aristeu, I see https://bugzilla.redhat.com/show_bug.cgi?id=1748237
> was created for a z-stream update to 7.5 with status of "ON_QA". Since 7.8
> has not yet been released, is there a z-stream 3.10.0-1062.x release in the
> works as well?

Yes, there's a BZ to include it in the 7.7 kernel as well (1752421).

Verified with kernel-3.10.0-1098.el7:

```
============================================
slableakv1 - ppc64le:
......
Wait for SUnreclaim stabilized
SUnreclaim stabilized now
One exec per cpu on 8 - Iterations: 4000 - Before: SUnreclaim: 164672 kB - After: SUnreclaim: 190400 kB
Total Leak: 25728 kB - Leak per execute: .8040 kB - Total time: 5863s
Kernel: 3.10.0-1098.el7.ppc64le - podman: podman version 1.4.4

slableakv2 - ppc64le:
......
Wait for SUnreclaim stabilized
SUnreclaim stabilized now
One exec per cpu on 8 - Iterations: 4000 - Before: SUnreclaim: 168384 kB - After: SUnreclaim: 199488 kB
Total Leak: 31104 kB - Leak per execute: .9720 kB - Total time: 5172s
Kernel: 3.10.0-1098.el7.ppc64le - podman: podman version 1.4.4

slableakv1 - x86_64:
......
Wait for SUnreclaim stabilized
SUnreclaim stabilized now
One exec per cpu on 12 - Iterations: 6000 - Before: SUnreclaim: 96400 kB - After: SUnreclaim: 158772 kB
Total Leak: 62372 kB - Leak per execute: .8662 kB - Total time: 10442s
Kernel: 3.10.0-1098.el7.x86_64 - podman: podman version 1.4.4

slableakv2 - x86_64:
......
Wait for SUnreclaim stabilized
SUnreclaim stabilized now
One exec per cpu on 12 - Iterations: 6000 - Before: SUnreclaim: 82584 kB - After: SUnreclaim: 115444 kB
Total Leak: 32860 kB - Leak per execute: .4563 kB - Total time: 5274s
Kernel: 3.10.0-1098.el7.x86_64 - podman: podman version 1.4.4
```

Both slableakv1 (podman exec) and slableakv2 (short-lived containers) tests show good results. Move to VERIFIED.
Beaker Jobs: https://beaker.engineering.redhat.com/jobs/3807009 *** Bug 1392283 has been marked as a duplicate of this bug. *** ------- Comment From mjwolf.com 2017-05-17 17:25 EDT------- This should not be with Gustavo on the IBM side. Re-assigning the bug ------- Comment From gmgaydos.com 2017-05-26 14:06 EDT------- Red Hat: This bug has been reassigned on May 25 in order to improve response time. ------- Comment From chavez.com 2017-05-30 18:21 EDT------- Hello, How much RAM should we have installed in the system? Also, does this problem happen with PowerVM, KVM and/or bare metal? Any information you can give us on the config will help us set up a similar environment for a recreate. Thanks. ------- Comment From chavez.com 2017-06-01 18:37 EDT------- (In reply to comment #10) > Hello, > How much RAM should we have installed in the system? Also, does this problem > happen with PowerVM, KVM and/or bare metal? Any information you can give us > on the config will help us set up a similar environment for a recreate. > Thanks. Just to clarify, this question was addressed to the Red Hat reporter, Wang Shu. The needinfo flag in the Red Hat bug was already set accordingly. ------- Comment From diegodo.com 2017-06-13 16:10 EDT------- Mauricio did a check on the LPAR used to reproduce the problem. This system has 16 CPUs & 2677 MB RAM. I was able to reproduce this problem using my local system (KVM guest): Red Hat Enterprise Linux Server 7.4 Beta (Maipo) [root@rhel74snapshot2 ~]# lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 8 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Model: 2.0 (pvr 004d 0200) Model name: POWER8 (raw), altivec supported Hypervisor vendor: KVM Virtualization type: para L1d cache: 64K L1i cache: 32K NUMA node0 CPU(s): 0-31 [root@rhel74snapshot2 ~]# free -m total used free shared buff/cache available Mem: 498 193 25 4 278 20 Swap: 8192 136 8056 I wasn't able to reproduce with less cpus. The steps were the same as documented in the first comment: [root@rhel74snapshot2 ~]# systemctl start cgconfig [root@rhel74snapshot2 ~]# for i in $(seq 1 1000); do mkdir /sys/fs/cgroup/cpuacct/cgroup_$i; done ------- Comment From diegodo.com 2017-06-23 11:41 EDT------- UPDATE I couldn't reproduce this problem on kernel >= 4.1. I was able to increase the "for" execution in 10x and nothing happened. Now I'm trying to find the problem and creating a backport to kernel 3.10. ------- Comment From diegodo.com 2017-06-28 14:09 EDT------- UPDATE: I did some research using a basic "free -m" command, and there is a huge difference between kernel 3.10 and kernel >= 4.1 - the creation of 1000 dirs in 3.10 consumes about 100MB, while in 4.1 consumes 20MB. Because there are a lot of changes between these kernels in this area of cgroup, I'm trying to figure out what exactly did the difference in this size and trying to create a backport to this improvement. Please feel free to suggest/ask something. 
------- Comment From diegodo.com 2017-07-11 12:59 EDT------- UPDATE I did some tests on x86 arch and I figured out that the creation of 1000 cgroups have almost the same memory consumption (100MB) as we can see below: ------------------------------------------------------------------------------- Before: total used free shared buff/cache available Mem: 570 126 290 4 154 293 Swap: 0 0 0 After: total used free shared buff/cache available Mem: 570 125 185 4 259 193 Swap: 0 0 0 [root@localhost ~]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 32 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 94 Model name: Intel Core Processor (Skylake) Stepping: 3 CPU MHz: 2496.000 BogoMIPS: 4992.00 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache: 4096K NUMA node0 CPU(s): 0-31 [root@localhost ~]# ------------------------------------------------------------------------------- Before: total used free shared buff/cache available Mem: 14560 313 13774 12 472 13947 Swap: 8192 0 8192 After: total used free shared buff/cache available Mem: 14560 329 13659 12 571 13836 Swap: 8192 0 8192 Architecture: ppc64le Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 8 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Model: 2.0 (pvr 004d 0200) Model name: POWER8 (raw), altivec supported Hypervisor vendor: KVM Virtualization type: para L1d cache: 64K L1i cache: 32K NUMA node0 CPU(s): 0-31 --------------------------------------------- So, it seems that memory consumption are almost the same for both architectures. ------- Comment From diegodo.com 2017-11-22 14:30 EDT------- (In reply to comment #21) > It seems that the bug can be reproduced only on systems where the first NUMA > node has no CPUs assigned: > $ uname -r > 3.10.0-693.el7.ppc64le > > $ cat test.sh > echo 3 > /proc/sys/vm/drop_caches > grep ^MemAvail /proc/meminfo > for i in $(seq 10); do > for j in $(seq 100); do > mkdir /sys/fs/cgroup/cpuacct/cg_${i}_${j} > done > grep ^MemAvail /proc/meminfo > done > > === First system === > > $ lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 16 > On-line CPU(s) list: 0-15 > Thread(s) per core: 8 > Core(s) per socket: 1 > Socket(s): 2 > NUMA node(s): 1 > Model: 2.1 (pvr 004b 0201) > Model name: POWER8 (architected), altivec supported > Hypervisor vendor: (null) > Virtualization type: full > L1d cache: 64K > L1i cache: 32K > L2 cache: 512K > L3 cache: 8192K > NUMA node0 CPU(s): 0-15 > > $ ./test.sh > MemAvailable: 16756096 kB > MemAvailable: 16715200 kB > MemAvailable: 16668544 kB > MemAvailable: 16627712 kB > MemAvailable: 16580992 kB > MemAvailable: 16528448 kB > MemAvailable: 16475840 kB > MemAvailable: 16423360 kB > MemAvailable: 16379648 kB > MemAvailable: 16327040 kB > MemAvailable: 16280384 kB > > === Second system === > > $ lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 16 > On-line CPU(s) list: 0-15 > Thread(s) per core: 8 > Core(s) per socket: 1 > Socket(s): 2 > NUMA node(s): 2 > Model: 2.1 (pvr 004b 0201) > Model name: POWER8 (architected), altivec supported > Hypervisor vendor: (null) > Virtualization type: full > L1d cache: 64K > L1i cache: 32K > NUMA node0 CPU(s): 0-15 > NUMA node3 CPU(s): > > $ ./test.sh > MemAvailable: 16766720 kB > MemAvailable: 16725888 kB > MemAvailable: 16679168 kB > MemAvailable: 16618304 kB > MemAvailable: 16571712 kB > MemAvailable: 
16519296 kB > MemAvailable: 16466816 kB > MemAvailable: 16420160 kB > MemAvailable: 16367680 kB > MemAvailable: 16318144 kB > MemAvailable: 16265664 kB > > === Third system === > > $ lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 16 > On-line CPU(s) list: 0-15 > Thread(s) per core: 8 > Core(s) per socket: 1 > Socket(s): 2 > NUMA node(s): 2 > Model: 2.1 (pvr 004b 0201) > Model name: POWER8 (architected), altivec supported > Hypervisor vendor: (null) > Virtualization type: full > L1d cache: 64K > L1i cache: 32K > L2 cache: 512K > L3 cache: 8192K > NUMA node0 CPU(s): > NUMA node2 CPU(s): 0-15 > > $ uname -r > 3.10.0-693.el7.ppc64le > > $ ./test.sh > MemAvailable: 22156736 kB > MemAvailable: 16632896 kB > MemAvailable: 11104064 kB > MemAvailable: 5566976 kB > MemAvailable: 34112 kB > !!! OOM !!! I was not able to reproduce the result with Numa Node0 with no Cpu using qemu: [root@localhost ~]# lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 16 NUMA node(s): 2 Model: 2.0 (pvr 004d 0200) Model name: POWER8 (raw), altivec supported Hypervisor vendor: KVM Virtualization type: para L1d cache: 64K L1i cache: 32K NUMA node0 CPU(s): NUMA node1 CPU(s): 0-15 [root@localhost ~]# ./test.sh MemAvailable: 86848 kB MemAvailable: 31168 kB MemAvailable: 56704 kB MemAvailable: 51328 kB MemAvailable: 44608 kB MemAvailable: 40576 kB MemAvailable: 37632 kB MemAvailable: 27328 kB MemAvailable: 29440 kB MemAvailable: 30720 kB MemAvailable: 32064 kB ------- Comment From diegodo.com 2018-02-28 06:56 EDT------- Hi Li Wang, I need help to reproduce this problem. I wasn't able to reproduce this problem how you can see in Comment #24. Can anyone help me to reproduce this? Thank you Guys, any news when it will be backported to 7.7? Hi Jean-Philippe, (In reply to Jean-Philippe Menil from comment #124) > Guys, > any news when it will be backported to 7.7? It was backported to RHEL-7.7 in BZ 1752421. 
Upgrade to kernel-3.10.0-1062.4.1.el7 from RHSA-2019:3055 https://access.redhat.com/errata/RHSA-2019:3055 Yeah i guess it was fixed: https://github.com/docker/for-linux/issues/841 cat /proc/cmdline BOOT_IMAGE=/vmlinuz-3.10.0-1062.7.1.el7.x86_64 root=/dev/mapper/rootvg-lvroot ro crashkernel=auto rd.lvm.lv=rootvg/lvroot rd.lvm.lv=rootvg/lvswap rhgb quiet LANG=en_US.UTF-8 namespace.unpriv_enable=1 user_namespace.enable=1 uname -a Linux hsv5234s 3.10.0-1062.7.1.el7.x86_64 #1 SMP Wed Nov 13 08:44:42 EST 2019 x86_64 x86_64 x86_64 GNU/Linux Dec 18 08:20:41 hsv5234s kernel: [1080660.051671] runc:[1:CHILD]: page allocation failure: order:7, mode:0xc0d0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051679] CPU: 17 PID: 16064 Comm: runc:[1:CHILD] Kdump: loaded Tainted: G ------------ T 3.10.0-1062.7.1.el7.x86_64 #1 Dec 18 08:20:41 hsv5234s kernel: [1080660.051682] Hardware name: LENOVO System x3650 M5 -[8871AC1]-/01KN179, BIOS -[TCE140D-2.90]- 02/13/2019 Dec 18 08:20:41 hsv5234s kernel: [1080660.051684] Call Trace: Dec 18 08:20:41 hsv5234s kernel: [1080660.051696] [<ffffffffbcb7ac23>] dump_stack+0x19/0x1b Dec 18 08:20:41 hsv5234s kernel: [1080660.051704] [<ffffffffbc5c3e50>] warn_alloc_failed+0x110/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.051708] [<ffffffffbc5c8a5f>] __alloc_pages_nodemask+0x9df/0xbe0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051714] [<ffffffffbc616c08>] alloc_pages_current+0x98/0x110 Dec 18 08:20:41 hsv5234s kernel: [1080660.051719] [<ffffffffbc5e3c08>] kmalloc_order+0x18/0x40 Dec 18 08:20:41 hsv5234s kernel: [1080660.051723] [<ffffffffbc622136>] kmalloc_order_trace+0x26/0xa0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051727] [<ffffffffbc6266f1>] __kmalloc+0x211/0x230 Dec 18 08:20:41 hsv5234s kernel: [1080660.051732] [<ffffffffbc63ee41>] memcg_alloc_cache_params+0x81/0xb0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051735] [<ffffffffbc5e38b4>] do_kmem_cache_create+0x74/0xf0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051738] [<ffffffffbc5e3a32>] kmem_cache_create+0x102/0x1b0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051753] [<ffffffffc06b4dc1>] nf_conntrack_init_net+0xf1/0x260 [nf_conntrack] Dec 18 08:20:41 hsv5234s kernel: [1080660.051759] [<ffffffffc06b56c4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack] Dec 18 08:20:41 hsv5234s kernel: [1080660.051764] [<ffffffffbca44054>] ops_init+0x44/0x150 Dec 18 08:20:41 hsv5234s kernel: [1080660.051767] [<ffffffffbca44203>] setup_net+0xa3/0x160 Dec 18 08:20:41 hsv5234s kernel: [1080660.051770] [<ffffffffbca449a5>] copy_net_ns+0xb5/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.051774] [<ffffffffbc4cb469>] create_new_namespaces+0xf9/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.051777] [<ffffffffbc4cb6aa>] unshare_nsproxy_namespaces+0x5a/0xc0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051781] [<ffffffffbc49ae8b>] SyS_unshare+0x1cb/0x340 Dec 18 08:20:41 hsv5234s kernel: [1080660.051785] [<ffffffffbcb8dede>] system_call_fastpath+0x25/0x2a Dec 18 08:20:41 hsv5234s kernel: [1080660.051787] Mem-Info: Dec 18 08:20:41 hsv5234s kernel: [1080660.051800] active_anon:603062 inactive_anon:658360 isolated_anon:0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051800] active_file:2900635 inactive_file:2830989 isolated_file:4 Dec 18 08:20:41 hsv5234s kernel: [1080660.051800] unevictable:0 dirty:1534 writeback:0 unstable:0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051800] slab_reclaimable:341029 slab_unreclaimable:390332 Dec 18 08:20:41 hsv5234s kernel: [1080660.051800] mapped:56516 shmem:134670 pagetables:12006 bounce:0 Dec 18 08:20:41 hsv5234s kernel: 
[1080660.051800] free:46786 free_pcp:2175 free_cma:0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051807] Node 0 DMA free:15880kB min:44kB low:52kB high:64kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Dec 18 08:20:41 hsv5234s kernel: [1080660.051815] lowmem_reserve[]: 0 971 15057 15057 Dec 18 08:20:41 hsv5234s kernel: [1080660.051820] Node 0 DMA32 free:60060kB min:2808kB low:3508kB high:4212kB active_anon:94876kB inactive_anon:149864kB active_file:233272kB inactive_file:218016kB unevictable:0kB isolated(anon):0kB isolated(file):16kB present:1226660kB managed:995204kB mlocked:0kB dirty:112kB writeback:0kB mapped:14792kB shmem:16016kB slab_reclaimable:112532kB slab_unreclaimable:89812kB kernel_stack:2080kB pagetables:2928kB unstable:0kB bounce:0kB free_pcp:1384kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:23 all_unreclaimable? no Dec 18 08:20:41 hsv5234s kernel: [1080660.051828] lowmem_reserve[]: 0 0 14085 14085 Dec 18 08:20:41 hsv5234s kernel: [1080660.051831] Node 0 Normal free:51656kB min:40708kB low:50884kB high:61060kB active_anon:1375020kB inactive_anon:1401180kB active_file:4987756kB inactive_file:4918788kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14423212kB mlocked:0kB dirty:3076kB writeback:0kB mapped:94496kB shmem:141648kB slab_reclaimable:612400kB slab_unreclaimable:685176kB kernel_stack:10704kB pagetables:32204kB unstable:0kB bounce:0kB free_pcp:4540kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Dec 18 08:20:41 hsv5234s kernel: [1080660.051839] lowmem_reserve[]: 0 0 0 0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051843] Node 1 Normal free:59548kB min:46548kB low:58184kB high:69820kB active_anon:942352kB inactive_anon:1082396kB active_file:6381128kB inactive_file:6187152kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16495992kB mlocked:0kB dirty:2948kB writeback:0kB mapped:116776kB shmem:381016kB slab_reclaimable:639184kB slab_unreclaimable:786324kB kernel_stack:9344kB pagetables:12892kB unstable:0kB bounce:0kB free_pcp:2744kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:64 all_unreclaimable? 
no Dec 18 08:20:41 hsv5234s kernel: [1080660.051851] lowmem_reserve[]: 0 0 0 0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051854] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051866] Node 0 DMA32: 667*4kB (UEM) 594*8kB (UEM) 263*16kB (UEM) 222*32kB (UEM) 258*64kB (UEM) 132*128kB (UEM) 31*256kB (UEM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 60076kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051878] Node 0 Normal: 8586*4kB (UEM) 2023*8kB (UM) 6*16kB (M) 6*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 50816kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051888] Node 1 Normal: 3685*4kB (UEM) 5529*8kB (UEM) 24*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 59356kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051899] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051901] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051903] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051905] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051906] 5869920 total pagecache pages Dec 18 08:20:41 hsv5234s kernel: [1080660.051910] 3487 pages in swap cache Dec 18 08:20:41 hsv5234s kernel: [1080660.051913] Swap cache stats: add 38962826, delete 38956860, find 14019018/18441754 Dec 18 08:20:41 hsv5234s kernel: [1080660.051914] Free swap = 8253408kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051915] Total swap = 8388604kB Dec 18 08:20:41 hsv5234s kernel: [1080660.051917] 8174980 pages RAM Dec 18 08:20:41 hsv5234s kernel: [1080660.051918] 0 pages HighMem/MovableOnly Dec 18 08:20:41 hsv5234s kernel: [1080660.051920] 192404 pages reserved Dec 18 08:20:41 hsv5234s kernel: [1080660.051923] kmem_cache_create(nf_conntrack_ffff98b468a20000) failed with error -12 Dec 18 08:20:41 hsv5234s kernel: [1080660.051926] CPU: 17 PID: 16064 Comm: runc:[1:CHILD] Kdump: loaded Tainted: G ------------ T 3.10.0-1062.7.1.el7.x86_64 #1 Dec 18 08:20:41 hsv5234s kernel: [1080660.051928] Hardware name: LENOVO System x3650 M5 -[8871AC1]-/01KN179, BIOS -[TCE140D-2.90]- 02/13/2019 Dec 18 08:20:41 hsv5234s kernel: [1080660.051930] Call Trace: Dec 18 08:20:41 hsv5234s kernel: [1080660.051933] [<ffffffffbcb7ac23>] dump_stack+0x19/0x1b Dec 18 08:20:41 hsv5234s kernel: [1080660.051936] [<ffffffffbc5e3ab7>] kmem_cache_create+0x187/0x1b0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051942] [<ffffffffc06b4dc1>] nf_conntrack_init_net+0xf1/0x260 [nf_conntrack] Dec 18 08:20:41 hsv5234s kernel: [1080660.051947] [<ffffffffc06b56c4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack] Dec 18 08:20:41 hsv5234s kernel: [1080660.051950] [<ffffffffbca44054>] ops_init+0x44/0x150 Dec 18 08:20:41 hsv5234s kernel: [1080660.051953] [<ffffffffbca44203>] setup_net+0xa3/0x160 Dec 18 08:20:41 hsv5234s kernel: [1080660.051956] [<ffffffffbca449a5>] copy_net_ns+0xb5/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.051959] [<ffffffffbc4cb469>] create_new_namespaces+0xf9/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.051961] [<ffffffffbc4cb6aa>] unshare_nsproxy_namespaces+0x5a/0xc0 Dec 18 08:20:41 hsv5234s kernel: [1080660.051965] [<ffffffffbc49ae8b>] SyS_unshare+0x1cb/0x340 Dec 18 08:20:41 hsv5234s kernel: 
[1080660.051967] [<ffffffffbcb8dede>] system_call_fastpath+0x25/0x2a Dec 18 08:20:41 hsv5234s kernel: [1080660.051970] Unable to create nf_conn slab cache Dec 18 08:20:41 hsv5234s kernel: [1080660.119124] runc:[1:CHILD]: page allocation failure: order:7, mode:0xc0d0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119135] CPU: 4 PID: 16094 Comm: runc:[1:CHILD] Kdump: loaded Tainted: G ------------ T 3.10.0-1062.7.1.el7.x86_64 #1 Dec 18 08:20:41 hsv5234s kernel: [1080660.119139] Hardware name: LENOVO System x3650 M5 -[8871AC1]-/01KN179, BIOS -[TCE140D-2.90]- 02/13/2019 Dec 18 08:20:41 hsv5234s kernel: [1080660.119142] Call Trace: Dec 18 08:20:41 hsv5234s kernel: [1080660.119156] [<ffffffffbcb7ac23>] dump_stack+0x19/0x1b Dec 18 08:20:41 hsv5234s kernel: [1080660.119167] [<ffffffffbc5c3e50>] warn_alloc_failed+0x110/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.119174] [<ffffffffbc5c8a5f>] __alloc_pages_nodemask+0x9df/0xbe0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119182] [<ffffffffbc616c08>] alloc_pages_current+0x98/0x110 Dec 18 08:20:41 hsv5234s kernel: [1080660.119190] [<ffffffffbc5e3c08>] kmalloc_order+0x18/0x40 Dec 18 08:20:41 hsv5234s kernel: [1080660.119196] [<ffffffffbc622136>] kmalloc_order_trace+0x26/0xa0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119202] [<ffffffffbc6266f1>] __kmalloc+0x211/0x230 Dec 18 08:20:41 hsv5234s kernel: [1080660.119209] [<ffffffffbc63ee41>] memcg_alloc_cache_params+0x81/0xb0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119215] [<ffffffffbc5e38b4>] do_kmem_cache_create+0x74/0xf0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119221] [<ffffffffbc5e3a32>] kmem_cache_create+0x102/0x1b0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119238] [<ffffffffc06b4dc1>] nf_conntrack_init_net+0xf1/0x260 [nf_conntrack] Dec 18 08:20:41 hsv5234s kernel: [1080660.119249] [<ffffffffc06b56c4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack] Dec 18 08:20:41 hsv5234s kernel: [1080660.119256] [<ffffffffbca44054>] ops_init+0x44/0x150 Dec 18 08:20:41 hsv5234s kernel: [1080660.119260] [<ffffffffbca44203>] setup_net+0xa3/0x160 Dec 18 08:20:41 hsv5234s kernel: [1080660.119265] [<ffffffffbca449a5>] copy_net_ns+0xb5/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.119271] [<ffffffffbc4cb469>] create_new_namespaces+0xf9/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.119276] [<ffffffffbc4cb6aa>] unshare_nsproxy_namespaces+0x5a/0xc0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119286] [<ffffffffbc49ae8b>] SyS_unshare+0x1cb/0x340 Dec 18 08:20:41 hsv5234s kernel: [1080660.119293] [<ffffffffbcb8dede>] system_call_fastpath+0x25/0x2a Dec 18 08:20:41 hsv5234s kernel: [1080660.119296] Mem-Info: Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] active_anon:606064 inactive_anon:659529 isolated_anon:0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] active_file:2897344 inactive_file:2822214 isolated_file:4 Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] unevictable:0 dirty:2854 writeback:0 unstable:0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] slab_reclaimable:340962 slab_unreclaimable:390372 Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] mapped:56516 shmem:134670 pagetables:12006 bounce:0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] free:55605 free_pcp:512 free_cma:0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119318] Node 0 DMA free:15880kB min:44kB low:52kB high:64kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Dec 18 08:20:41 hsv5234s kernel: [1080660.119331] lowmem_reserve[]: 0 971 15057 15057 Dec 18 08:20:41 hsv5234s kernel: [1080660.119338] Node 0 DMA32 free:62628kB min:2808kB low:3508kB high:4212kB active_anon:94796kB inactive_anon:149968kB active_file:233240kB inactive_file:216416kB unevictable:0kB isolated(anon):0kB isolated(file):16kB present:1226660kB managed:995204kB mlocked:0kB dirty:112kB writeback:0kB mapped:14792kB shmem:16016kB slab_reclaimable:112628kB slab_unreclaimable:89972kB kernel_stack:2080kB pagetables:2928kB unstable:0kB bounce:0kB free_pcp:316kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:66 all_unreclaimable? no Dec 18 08:20:41 hsv5234s kernel: [1080660.119351] lowmem_reserve[]: 0 0 14085 14085 Dec 18 08:20:41 hsv5234s kernel: [1080660.119357] Node 0 Normal free:61096kB min:40708kB low:50884kB high:61060kB active_anon:1381536kB inactive_anon:1404520kB active_file:4987048kB inactive_file:4901144kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14423212kB mlocked:0kB dirty:3780kB writeback:0kB mapped:94496kB shmem:141648kB slab_reclaimable:612400kB slab_unreclaimable:685176kB kernel_stack:10704kB pagetables:32204kB unstable:0kB bounce:0kB free_pcp:1144kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:76 all_unreclaimable? no Dec 18 08:20:41 hsv5234s kernel: [1080660.119369] lowmem_reserve[]: 0 0 0 0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119375] Node 1 Normal free:82816kB min:46548kB low:58184kB high:69820kB active_anon:947596kB inactive_anon:1084024kB active_file:6369088kB inactive_file:6170932kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16495992kB mlocked:0kB dirty:7524kB writeback:0kB mapped:116776kB shmem:381016kB slab_reclaimable:638820kB slab_unreclaimable:786324kB kernel_stack:9344kB pagetables:12892kB unstable:0kB bounce:0kB free_pcp:584kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no Dec 18 08:20:41 hsv5234s kernel: [1080660.119386] lowmem_reserve[]: 0 0 0 0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119391] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119412] Node 0 DMA32: 1166*4kB (UEM) 677*8kB (UEM) 266*16kB (UEM) 225*32kB (UEM) 258*64kB (UEM) 131*128kB (UEM) 31*256kB (UEM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62752kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119432] Node 0 Normal: 9639*4kB (UEM) 2269*8kB (UEM) 251*16kB (UEM) 83*32kB (UEM) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 63572kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119450] Node 1 Normal: 5940*4kB (UEM) 7125*8kB (UEM) 221*16kB (UEM) 30*32kB (UEM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 85320kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119470] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119474] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119477] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119481] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119483] 5858268 total pagecache pages Dec 18 08:20:41 hsv5234s kernel: [1080660.119488] 3481 pages in swap cache Dec 18 08:20:41 hsv5234s kernel: [1080660.119492] Swap cache stats: add 38962827, delete 38956867, find 14019018/18441754 Dec 18 08:20:41 hsv5234s kernel: [1080660.119494] Free swap = 8253408kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119496] Total swap = 8388604kB Dec 18 08:20:41 hsv5234s kernel: [1080660.119499] 8174980 pages RAM Dec 18 08:20:41 hsv5234s kernel: [1080660.119516] 0 pages HighMem/MovableOnly Dec 18 08:20:41 hsv5234s kernel: [1080660.119518] 192404 pages reserved Dec 18 08:20:41 hsv5234s kernel: [1080660.119523] kmem_cache_create(nf_conntrack_ffff98b287b81480) failed with error -12 Dec 18 08:20:41 hsv5234s kernel: [1080660.119535] CPU: 4 PID: 16094 Comm: runc:[1:CHILD] Kdump: loaded Tainted: G ------------ T 3.10.0-1062.7.1.el7.x86_64 #1 Dec 18 08:20:41 hsv5234s kernel: [1080660.119552] Hardware name: LENOVO System x3650 M5 -[8871AC1]-/01KN179, BIOS -[TCE140D-2.90]- 02/13/2019 Dec 18 08:20:41 hsv5234s kernel: [1080660.119567] Call Trace: Dec 18 08:20:41 hsv5234s kernel: [1080660.119590] [<ffffffffbcb7ac23>] dump_stack+0x19/0x1b Dec 18 08:20:41 hsv5234s kernel: [1080660.119611] [<ffffffffbc5e3ab7>] kmem_cache_create+0x187/0x1b0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119630] [<ffffffffc06b4dc1>] nf_conntrack_init_net+0xf1/0x260 [nf_conntrack] Dec 18 08:20:41 hsv5234s kernel: [1080660.119663] [<ffffffffc06b56c4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack] Dec 18 08:20:41 hsv5234s kernel: [1080660.119680] [<ffffffffbca44054>] ops_init+0x44/0x150 Dec 18 08:20:41 hsv5234s kernel: [1080660.119706] [<ffffffffbca44203>] setup_net+0xa3/0x160 Dec 18 08:20:41 hsv5234s kernel: [1080660.119721] [<ffffffffbca449a5>] copy_net_ns+0xb5/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.119735] [<ffffffffbc4cb469>] create_new_namespaces+0xf9/0x180 Dec 18 08:20:41 hsv5234s kernel: [1080660.119752] [<ffffffffbc4cb6aa>] unshare_nsproxy_namespaces+0x5a/0xc0 Dec 18 08:20:41 hsv5234s kernel: [1080660.119772] [<ffffffffbc49ae8b>] SyS_unshare+0x1cb/0x340 Dec 18 
Dec 18 08:20:41 hsv5234s kernel: [1080660.119790] [<ffffffffbcb8dede>] system_call_fastpath+0x25/0x2a
Dec 18 08:20:41 hsv5234s kernel: [1080660.119807] Unable to create nf_conn slab cache

Do you want me to open a new bug for the same issue?

Hello Jean-Philippe Menil, are you using ip_vs in your environment? There is a known issue with ip_vs + user ns + net ns, BZ#1540585. Thanks, Chao

nope:

lsmod | grep ip_vs

returns nothing :)

By the way, BZ#1540585 is restricted ...

(In reply to Jean-Philippe Menil from comment #128)
> nope:
>
> lsmod | grep ip_vs
>
> return nothing :)
>
> By the way, BZ#1540585 is restricted ...

Thank you, Jean-Philippe Menil. Could you try booting with 'cgroup.memory=nokmem' and see if this issue still exists? Also, can you try to run the container without nf_conntrack loaded? I'll check with our network QE and see if we can reproduce this in house. Many thanks! Chao

Chao, I will try the cgroup.memory=nokmem parameter; it could indeed be a good candidate. Just to let you know, the issue is intermittent and it took us around a week to trigger it. There is absolutely no way that we can run anything without nf_conntrack ...

Per the log snippets pasted in comment #126, this doesn't look like the same issue that was dealt with in this ticket, but a severe memory fragmentation case instead. There is no panic, just page allocation failure warnings for high-order blocks:

...
Dec 18 08:20:41 hsv5234s kernel: [1080660.051671] runc:[1:CHILD]: page allocation failure: order:7, mode:0xc0d0
...

The reason is that there are no order-7 (128kB) blocks available in the zones the request could grab memory from (per the mode flags), and no blocks larger than order-7 are available for buddy splits, as we can observe in the provided log splats:

...
Dec 18 08:20:41 hsv5234s kernel: [1080660.119432] Node 0 Normal: 9639*4kB (UEM) 2269*8kB (UEM) 251*16kB (UEM) 83*32kB (UEM) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 63572kB
Dec 18 08:20:41 hsv5234s kernel: [1080660.119450] Node 1 Normal: 5940*4kB (UEM) 7125*8kB (UEM) 221*16kB (UEM) 30*32kB (UEM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 85320kB
...

Also, the meminfo data doesn't suggest bloated slab usage, but bloated page-cache usage instead:

...
Dec 18 08:20:41 hsv5234s kernel: [1080660.119296] Mem-Info:
Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] active_anon:606064 inactive_anon:659529 isolated_anon:0
Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] active_file:2897344 inactive_file:2822214 isolated_file:4
Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] unevictable:0 dirty:2854 writeback:0 unstable:0
Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] slab_reclaimable:340962 slab_unreclaimable:390372
Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] mapped:56516 shmem:134670 pagetables:12006 bounce:0
Dec 18 08:20:41 hsv5234s kernel: [1080660.119309] free:55605 free_pcp:512 free_cma:0
...

-- Rafael

So, what you suggest is just to increase min_free_kbytes?

*** Bug 1489799 has been marked as a duplicate of this bug. ***

Chao, the cgroup.memory=nokmem parameter seems to do the trick. Thanks again and best regards, Jean-Philippe

Hi Chao, I have a customer [case: 02605103] facing the same problem.
~~~~~
Mar 10 10:44:45 prd-k8s-worker101 kubelet: E0310 10:44:45.452884 9223 pod_workers.go:190] Error syncing pod 101ce83b-60eb-4212-9ab2-08bf2d933054 ("kafka-connector-healer-1583854800-2gb8s_tenet(101ce83b-60eb-4212-9ab2-08bf2d933054)"), skipping: failed to ensure that the pod: 101ce83b-60eb-4212-9ab2-08bf2d933054 cgroups exist and are correctly applied: failed to create container for [kubepods burstable pod101ce83b-60eb-4212-9ab2-08bf2d933054] : mkdir /sys/fs/cgroup/memory/kubepods/burstable/pod101ce83b-60eb-4212-9ab2-08bf2d933054: cannot allocate memory
~~~~~

Messages like the snippet above are flooding the logs. He is already running 3.10.0-1062.12.1.el7.x86_64. I grabbed an strace while the problem was present.

[root@prd-k8s-worker101 psjoberg]# strace -f -o /root/strace_cgroup2.out cgcreate -g memory:kubepods/burstable/IODINE
cgcreate: can't create cgroup kubepods/burstable/IODINE: Cgroup, operation not allowed

[root@prd-k8s-worker101 psjoberg]# cat /root/strace_cgroup2.out
132066 execve("/bin/cgcreate", ["cgcreate", "-g", "memory:kubepods/burstable/IODINE"], [/* 22 vars */]) = 0
132066 brk(NULL) = 0x555645f29000
132066 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f51d43b8000
132066 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
132066 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
132066 fstat(3, {st_mode=S_IFREG|0644, st_size=26713, ...}) = 0
132066 mmap(NULL, 26713, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f51d43b1000
132066 close(3) = 0
132066 open("/lib64/libcgroup.so.1", O_RDONLY|O_CLOEXEC) = 3
132066 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240D\0\0\0\0\0\0"..., 832) = 832
132066 fstat(3, {st_mode=S_IFREG|0755, st_size=104224, ...}) = 0
132066 mmap(NULL, 4665984, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f51d3d24000
132066 mprotect(0x7f51d3d3b000, 2097152, PROT_NONE) = 0
132066 mmap(0x7f51d3f3b000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f51d3f3b000
132066 mmap(0x7f51d3f3d000, 2466432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f51d3f3d000
132066 close(3) = 0
132066 open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
132066 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200m\0\0\0\0\0\0"..., 832) = 832
132066 fstat(3, {st_mode=S_IFREG|0755, st_size=142144, ...}) = 0
132066 mmap(NULL, 2208904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f51d3b08000
132066 mprotect(0x7f51d3b1f000, 2093056, PROT_NONE) = 0
132066 mmap(0x7f51d3d1e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7f51d3d1e000
132066 mmap(0x7f51d3d20000, 13448, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f51d3d20000
132066 close(3) = 0
132066 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
132066 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P&\2\0\0\0\0\0"..., 832) = 832
132066 fstat(3, {st_mode=S_IFREG|0755, st_size=2156072, ...}) = 0
132066 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f51d43b0000
132066 mmap(NULL, 3985888, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f51d373a000
132066 mprotect(0x7f51d38fd000, 2097152, PROT_NONE) = 0
132066 mmap(0x7f51d3afd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c3000) = 0x7f51d3afd000
132066 mmap(0x7f51d3b03000, 16864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f51d3b03000
132066 close(3) = 0
132066 mmap(NULL, 4096,
PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f51d43af000 132066 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f51d43ad000 132066 arch_prctl(ARCH_SET_FS, 0x7f51d43adb80) = 0 132066 mprotect(0x7f51d3afd000, 16384, PROT_READ) = 0 132066 mprotect(0x7f51d3d1e000, 4096, PROT_READ) = 0 132066 mprotect(0x7f51d3f3b000, 4096, PROT_READ) = 0 132066 mprotect(0x555645ac1000, 4096, PROT_READ) = 0 132066 mprotect(0x7f51d43b9000, 4096, PROT_READ) = 0 132066 munmap(0x7f51d43b1000, 26713) = 0 132066 set_tid_address(0x7f51d43ade50) = 132066 132066 set_robust_list(0x7f51d43ade60, 24) = 0 132066 rt_sigaction(SIGRTMIN, {0x7f51d3b0e860, [], SA_RESTORER|SA_SIGINFO, 0x7f51d3b17630}, NULL, 8) = 0 132066 rt_sigaction(SIGRT_1, {0x7f51d3b0e8f0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7f51d3b17630}, NULL, 8) = 0 132066 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0 132066 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0 132066 brk(NULL) = 0x555645f29000 132066 brk(0x555645f4a000) = 0x555645f4a000 132066 brk(NULL) = 0x555645f4a000 132066 open("/proc/cgroups", O_RDONLY|O_CLOEXEC) = 3 132066 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 132066 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f51d43b7000 132066 read(3, "#subsys_name\thierarchy\tnum_cgrou"..., 1024) = 230 132066 read(3, "", 1024) = 0 132066 open("/proc/mounts", O_RDONLY|O_CLOEXEC) = 4 132066 fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 132066 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f51d43b6000 132066 read(4, "rootfs / rootfs rw 0 0\nsysfs /sy"..., 1024) = 1024 132066 read(4, "l,nosuid,nodev,noexec,relatime,p"..., 1024) = 1024 132066 read(4, "ebugfs rw,relatime 0 0\nhugetlbfs"..., 1024) = 1024 132066 read(4, "proc rw,nosuid,nodev,noexec,rela"..., 1024) = 1024 132066 read(4, "25fee5d1d08edf87aefb18dd7a5c144f"..., 1024) = 1024 132066 read(4, "abel,relatime 0 0\ntmpfs /var/lib"..., 1024) = 1024 132066 read(4, "7f3cd7f4557a51bff97846504d968e27"..., 1024) = 1024 132066 read(4, ".io~secret/ssot-processor-token-"..., 1024) = 1024 132066 read(4, "ir=/var/lib/docker/overlay/4022a"..., 1024) = 1024 132066 read(4, "bef704f791045ca20906025d914/moun"..., 1024) = 1024 132066 read(4, "mpfs rw,seclabel,relatime 0 0\ntm"..., 1024) = 1024 132066 read(4, "e78bce0b/merged overlay rw,secla"..., 1024) = 1024 132066 read(4, "roc rw,nosuid,nodev,noexec,relat"..., 1024) = 1024 132066 read(4, "cker/containers/3594e77e3c3b353e"..., 1024) = 1024 132066 read(4, ".io~secret/ssot-processor-token-"..., 1024) = 1024 132066 read(4, "latime,lowerdir=/var/lib/docker/"..., 1024) = 1024 132066 read(4, "147daf/volumes/kubernetes.io~sec"..., 1024) = 1024 132066 read(4, "1d84afbaef3dd73e500fd212cfefb4e5"..., 1024) = 1024 132066 read(4, "bel,nosuid,nodev,noexec,relatime"..., 1024) = 1024 132066 read(4, "kubelet/pods/5229f250-0bb5-4615-"..., 1024) = 1024 132066 read(4, "rkdir=/var/lib/docker/overlay/ec"..., 1024) = 1024 132066 read(4, "owerdir=/var/lib/docker/overlay/"..., 1024) = 1024 132066 read(4, "mounts/shm tmpfs rw,seclabel,nos"..., 1024) = 1024 132066 read(4, "per,workdir=/var/lib/docker/over"..., 1024) = 1024 132066 read(4, "nodev,noexec,relatime 0 0\ntmpfs "..., 1024) = 1024 132066 read(4, "var/lib/docker/overlay/ac0ad3d37"..., 1024) = 1024 132066 read(4, "-4bc6-9c5b-1a0c2ef092dd/volumes/"..., 1024) = 1024 132066 read(4, "55e99237c7d32a8e85b4f2f47c5b2d4d"..., 1024) = 1024 132066 read(4, " /var/lib/docker/containers/3e46"..., 
1024) = 1024 132066 read(4, "ar/lib/docker/overlay/708310bef3"..., 1024) = 1024 132066 read(4, "/var/lib/docker/overlay/3a7bc822"..., 1024) = 1024 132066 read(4, "10abe71d4cfa4506fffa59471a5a54e4"..., 1024) = 1024 132066 read(4, "100183f1c4cda2efc169622d625662/w"..., 1024) = 1024 132066 read(4, "b76b/merged overlay rw,seclabel,"..., 1024) = 1024 132066 read(4, "e13b53/root,upperdir=/var/lib/do"..., 1024) = 1024 132066 read(4, "/docker/overlay/3004b765e83e7c5b"..., 1024) = 1024 132066 read(4, "73da8084d3646be98da1ebf7bf32e039"..., 1024) = 1024 132066 read(4, "7071f5387761c2d09186dcb877d50ac8"..., 1024) = 1024 132066 read(4, "dbdc16d7dd5add9575579534180/uppe"..., 1024) = 1024 132066 read(4, "ork 0 0\noverlay /var/lib/docker/"..., 1024) = 1024 132066 read(4, "relatime,lowerdir=/var/lib/docke"..., 1024) = 1024 132066 read(4, "cker/overlay/3f7aee3450f473d4c94"..., 1024) = 1024 132066 read(4, "3d14643f5fb12fe46d1243ab246298dc"..., 1024) = 1024 132066 read(4, "13e047b643fc0f48a46dbb/merged ov"..., 1024) = 1024 132066 read(4, "eb6bf8d8b956019dbf8587ab/root,up"..., 1024) = 1024 132066 read(4, "r,workdir=/var/lib/docker/overla"..., 1024) = 1024 132066 read(4, "overlay/b5c2a757796e0a00e5a6b2b1"..., 1024) = 1024 132066 read(4, "r/overlay/3a7bc82268e1e432147f4a"..., 1024) = 1024 132066 read(4, "149df10503bf62d621f098e7a765e36a"..., 1024) = 1024 132066 read(4, "2f029e2616d6d67d/work 0 0\noverla"..., 1024) = 1024 132066 read(4, "erlay rw,seclabel,relatime,lower"..., 1024) = 1024 132066 read(4, "perdir=/var/lib/docker/overlay/2"..., 1024) = 1024 132066 read(4, "y/5501fd0719bf03746d78240bedd47a"..., 1024) = 1024 132066 read(4, "81a143658ac3da9afed334682ccf18a6"..., 1024) = 1024 132066 read(4, "144c26b4d72d1a0023b6def921366f54"..., 1024) = 1024 132066 read(4, "ea91e64f67495/upper,workdir=/var"..., 1024) = 1024 132066 read(4, "y /var/lib/docker/overlay/3ab88f"..., 1024) = 1024 132066 read(4, "l,relatime 0 0\ntmpfs /var/lib/ku"..., 1024) = 1024 132066 read(4, "overlay/d02e11e096cca9176e98b00d"..., 1024) = 1024 132066 read(4, "b/docker/overlay/86ea03d933cbb06"..., 1024) = 1024 132066 read(4, "f91d54c proc rw,nosuid,nodev,noe"..., 1024) = 1024 132066 read(4, "rdir=/var/lib/docker/overlay/1c5"..., 1024) = 1024 132066 read(4, "da131c59a5af78d91c476cf64c99ced1"..., 1024) = 1024 132066 read(4, "e18708ae244f60e55945e97842f1504d"..., 1024) = 1024 132066 read(4, "44a2b1e09f182b20a7502af60a08571a"..., 1024) = 1024 132066 read(4, "/overlay/732d75ed02b7f3cd7f4557a"..., 1024) = 1024 132066 read(4, " rw,seclabel,relatime 0 0\noverla"..., 1024) = 1024 132066 read(4, "18c91aa4a50241d06a2bf72820ee/upp"..., 1024) = 1024 132066 read(4, "69868c0615d4a4cd3482c36a0e40/upp"..., 1024) = 1024 132066 read(4, "ns/817eb67a936c proc rw,nosuid,n"..., 1024) = 1024 132066 read(4, "686b97c79d3b919bef603/merged ove"..., 1024) = 1024 132066 read(4, "fc5ce34c296f7f590a3598b/root,upp"..., 1024) = 1024 132066 read(4, ",workdir=/var/lib/docker/overlay"..., 1024) = 1024 132066 read(4, "6d7cd22/upper,workdir=/var/lib/d"..., 1024) = 1024 132066 read(4, "8d8b956019dbf8587ab/root,upperdi"..., 1024) = 1024 132066 read(4, "kdir=/var/lib/docker/overlay/48e"..., 1024) = 1024 132066 read(4, "/netns/5ea3bbc2414d proc rw,nosu"..., 1024) = 1024 132066 read(4, "5d79e30ba4afd6c68dc0c8752/merged"..., 1024) = 1024 132066 read(4, "023b6def921366f540a9a99897d/root"..., 1024) = 1024 132066 read(4, "/docker/netns/d243aeae8026 proc "..., 1024) = 1024 132066 read(4, "c40a0c8c1bf834bf515fe56086877a3d"..., 1024) = 1024 132066 read(4, "04cbfe0388fa7c1ba942aec189949896"..., 
1024) = 1024
132066 read(4, "ffd3959467c515ec5fb3d857a8e09dda"..., 1024) = 1024
132066 read(4, "b968701cd3a58b18d790c5601/merged"..., 1024) = 1024
132066 read(4, "60a0a54b1608f500d0bd66e4ab28bd6d"..., 1024) = 1024
132066 read(4, "7d9d2d37/merged overlay rw,secla"..., 1024) = 1024
132066 read(4, "b8faf9667fc2a0bf755eb57bf4315bd/"..., 1024) = 1024
132066 read(4, "0272c09b34b6a59894bec081972aff77"..., 1024) = 1024
132066 read(4, "latime,lowerdir=/var/lib/docker/"..., 1024) = 1024
132066 read(4, "tmpfs rw,seclabel,relatime 0 0\no"..., 1024) = 1024
132066 read(4, "6ae83c0f426419532642913841d367d7"..., 1024) = 1024
132066 read(4, "61b4/work 0 0\noverlay /var/lib/d"..., 1024) = 1024
132066 read(4, "4d5e5678e4ebb09334056cf4d4bbc37e"..., 1024) = 1024
132066 read(4, "d30890fe8d9d7733/upper,workdir=/"..., 1024) = 1024
132066 read(4, "cker/overlay/07ef1c31ef2e2a4bdfa"..., 1024) = 1024
132066 read(4, "069c86989d3ed346e2bcfa3e59305ecd"..., 1024) = 1024
132066 read(4, " 0\nshm /var/lib/docker/container"..., 1024) = 1024
132066 read(4, "29db4a56add2f6f4a31cbe6302a8a96a"..., 1024) = 1024
132066 read(4, "y/4502a0aaf969ad02344062e2c30a94"..., 1024) = 1024
132066 read(4, "869ff7cfce56845562753ed8b206cef3"..., 1024) = 1024
132066 read(4, "71aca264fe7dbedf11c74df4a9c19f73"..., 1024) = 1024
132066 read(4, "cd41cdc5e1467/upper,workdir=/var"..., 1024) = 1024
132066 read(4, "y /var/lib/docker/overlay/171cc7"..., 1024) = 1024
132066 read(4, "dir=/var/lib/docker/overlay/d75f"..., 1024) = 1024
132066 read(4, "4b99d3bd80356131daca1e52fc6ae34d"..., 1024) = 1024
132066 read(4, "34ce2c5b3f0c50884d0c235b513211d5"..., 1024) = 1024
132066 read(4, "620515ad/merged overlay rw,secla"..., 1024) = 1024
132066 read(4, "b8b984f9a6735c10f02ca98161f6dfaf"..., 1024) = 1024
132066 read(4, "verlay/8eac9bf965261c6606bace18a"..., 1024) = 1024
132066 read(4, "0c6c127fbd685955f6ea262d21a89dfb"..., 1024) = 1024
132066 read(4, "1991d3d8176ad2cd/merged overlay "..., 1024) = 1024
132066 read(4, "f2d7-6bea-4300-b34b-658f0ff15f82"..., 1024) = 507
132066 read(4, "", 1024) = 0
132066 close(3) = 0
132066 munmap(0x7f51d43b7000, 4096) = 0
132066 close(4) = 0
132066 munmap(0x7f51d43b6000, 4096) = 0
132066 open("/proc/mounts", O_RDONLY|O_CLOEXEC) = 3
132066 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
132066 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f51d43b7000
132066 read(3, "rootfs / rootfs rw 0 0\nsysfs /sy"..., 1024) = 1024
132066 close(3) = 0
132066 munmap(0x7f51d43b7000, 4096) = 0
132066 mkdir("/sys", 0775) = -1 EEXIST (File exists)
132066 mkdir("/sys/fs", 0775) = -1 EEXIST (File exists)
132066 mkdir("/sys/fs/cgroup", 0775) = -1 EEXIST (File exists)
132066 mkdir("/sys/fs/cgroup/memory", 0775) = -1 EEXIST (File exists)
132066 mkdir("/sys/fs/cgroup/memory/kubepods", 0775) = -1 EEXIST (File exists)
132066 mkdir("/sys/fs/cgroup/memory/kubepods/burstable", 0775) = -1 EEXIST (File exists)
132066 mkdir("/sys/fs/cgroup/memory/kubepods/burstable/IODINE", 0775) = -1 ENOMEM (Cannot allocate memory) <<=----------------
132066 stat("/sys/fs/cgroup/memory/kubepods/burstable/IODINE", 0x7ffea363eac0) = -1 ENOENT (No such file or directory)
132066 write(2, "cgcreate: can't create cgroup ku"..., 87) = 87
132066 exit_group(50007) = ?
132066 +++ exited with 87 +++

I do see that 'ip_vs' is loaded, but I think the problem here is different from bug 1540585. Could you confirm and give suggestions?
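A quick way to check for this state on a node, independent of kubelet and runc, is to attempt the same operation the strace shows cgcreate making: a plain mkdir under the memory controller. The following is only an illustrative sketch, assuming the cgroup v1 memory hierarchy is mounted at the usual path; the group name kmem_probe is arbitrary and not part of the case above.

~~~~~
#!/bin/bash
# Minimal probe: try to create (and immediately remove) a memory cgroup.
# Run as root on the node under test; "kmem_probe" is an arbitrary test name.
d=/sys/fs/cgroup/memory/kmem_probe
if mkdir "$d"; then
    echo "memory cgroup created OK"
    rmdir "$d"
else
    echo "mkdir failed -- on affected nodes this is ENOMEM (cannot allocate memory)"
fi
~~~~~

On a node in the bad state, the mkdir fails with "cannot allocate memory" even though plenty of RAM is free, matching the trace above.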
I am asking the customer to test 1062.18.1.el7 because of bug 1768386 [Sanitize MM backported code for RHEL7 [rhel-7.7.z]]. I will share the result if he is able to test. He does not have a reproducer, but it happens almost twice a week on a production Kubernetes server.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1016

Hello, I am still hitting this bug on 3.10.0-1062.18.1.el7.x86_64.

Nov 5 09:19:21 server-with-a-problem dockerd: time="2020-11-05T09:19:21.562680365-05:00" level=error msg="Handler for POST /containers/4e00783f74ad9033b146fb5ef7a2c2499ffb74e25cf7800fe6e479a1385180c5/start returned error: OCI runtime create failed: container_linux.go:349: starting container process caused \"process_linux.go:297: applying cgroup configuration for process caused \\\"mkdir /sys/fs/cgroup/memory/kubepods/burstable/pod5e1a5f41-c6ce-47db-b640-e9db42f79b37/4e00783f74ad9033b146fb5ef7a2c2499ffb74e25cf7800fe6e479a1385180c5: cannot allocate memory\\\"\": unknown"

I am ***not using cgroup.memory=nokmem*** as I thought the bug was fixed in 1062.18.1. Has anyone tested 1062.18.1? I have obviously not experienced the bug with cgroup.memory=nokmem.

Thanks, Doug
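For completeness, a minimal sketch of how one might verify whether the cgroup.memory=nokmem workaround discussed earlier is actually in effect on a node, and how it is typically added on RHEL 7 with grubby. The exact boot-loader setup may differ per site, so treat this as an illustration rather than the supported procedure.

~~~~~
# Is the workaround present on the running kernel's command line?
grep -o 'cgroup\.memory=[^ ]*' /proc/cmdline || echo "cgroup.memory is not set"

# Add it to all installed kernel entries; it only takes effect after a reboot.
grubby --update-kernel=ALL --args='cgroup.memory=nokmem'
~~~~~

If /proc/cmdline does not show cgroup.memory=nokmem, kernel memory accounting is still active and the slab/cgroup leak described in this bug can still be triggered on unfixed kernels.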