Bug 1507149 - [LLNL 7.5 Bug] slab leak causing a crash when using kmem control group
Summary: [LLNL 7.5 Bug] slab leak causing a crash when using kmem control group
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: rc
Assignee: Aristeu Rozanski
QA Contact: Chao Ye
URL:
Whiteboard:
: 1739145 (view as bug list)
Depends On:
Blocks: 1392283 1599298 1649189 1713152 1748234 1748236 1748237 1752421 1549423
 
Reported: 2017-10-27 20:06 UTC by Ben Woodard
Modified: 2019-09-16 13:46 UTC
CC List: 67 users

Fixed In Version: kernel-3.10.0-1075.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1748234 1748236 1748237 1752421 (view as bug list)
Environment:
Last Closed:


Attachments (Terms of Use)
0003-mm-memcg-slab-fix-races-in-per-memcg-cache-creation-destruction.patch (2.98 KB, patch)
2018-02-19 10:14 UTC, Aaron Tomlin
no flags Details | Diff
0004-mm-memcg-slab-do-not-destroy-children-caches-if-parent-has-aliases.patch (3.78 KB, patch)
2018-02-19 10:15 UTC, Aaron Tomlin
no flags Details | Diff
0001-memcg-slab-kmem-cache-create-memcg-fix-memleak-on-fail-path.patch (3.03 KB, patch)
2018-02-19 10:18 UTC, Aaron Tomlin
no flags Details | Diff
0002-memcg-slab-clean-up-memcg-cache-initialization-destruction.patch (4.32 KB, patch)
2018-02-19 10:19 UTC, Aaron Tomlin
no flags Details | Diff
Output of /proc/slabinfo on the machine showing the issue (12.67 KB, text/plain)
2018-08-30 00:38 UTC, Vamsee Yarlagadda
no flags Details
Test patches. (16.46 KB, application/x-bzip)
2019-06-07 14:05 UTC, Aristeu Rozanski
no flags Details
Slab active/Total Graph (30.66 KB, image/png)
2019-07-18 12:04 UTC, Markus Schibli
no flags Details
Slab active/Total Graph data (786.97 KB, text/plain)
2019-07-18 12:04 UTC, Markus Schibli
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3291481 None None None 2018-10-08 18:50:24 UTC

Internal Links: 1508164

Description Ben Woodard 2017-10-27 20:06:02 UTC
Description of problem:
We upgraded SLURM from 2.3.3 to 17.02.7 on some of our HPC clusters, and machines which use PSM2, i.e. machines with hfi1/OPA hardware, now crash at a rate of about one per node-week of uptime. Because of the size of the clusters, this works out to several crashes per day. The new Slurm, 17.02.7, is more aggressive in how it uses cgroups.

Control groups are implicated, and memcg in particular is circumstantially implicated, but it may also have to do with the layout of the control groups or the way they are used. This is not containers, nor the way that systemd creates slices, scopes, and services. These are HPC compute nodes, and the layout of the control groups is along the lines of:

/sys/fs/cgroup/memory/slurm/uid_#/job_#/step_# 

They also use CPU and freezer cgroups on the cluster where the problem first appeared.

The four new cgroup parameters that the newer Slurm on rzgenie now sets:
>         slurm_cgroup_conf->constrain_kmem_space = false ;
>         slurm_cgroup_conf->allowed_kmem_space = -1;
>         slurm_cgroup_conf->max_kmem_percent = 100;
>         slurm_cgroup_conf->min_kmem_space = XCGROUP_DEFAULT_MIN_RAM;

What we see is that memory.kmem.limit_in_bytes and memory.limit_in_bytes are being set in /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>, whereas the older Slurm set memory.memsw.limit_in_bytes and memory.limit_in_bytes.

We're working on figuring out a minimal reproducer but we haven't pinned it down fully. 

Interestingly, the slab cache leak also occurs on PSM machines, which have qib cards, but it seems to be more benign there; I don't think we've seen a single crash on them yet. It is possible that the leak is just slower on PSM machines.

One thing is somewhat unusual about PSM and PSM2 compared with other users of slab caches: they create a new slab cache for each and every job and then supposedly clean it up at the end of the job. I don't personally know the exact association, but from what I gather so far, PSM uses this in some way for its intra-node MPI traffic.
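Those per-job caches are visible by name under /sys/kernel/slab, so their buildup can be watched directly. A small illustrative helper (the function name is mine, and the 'sdma-pkts' pattern is taken from the qib cache names in the log messages below; it may not match every driver's cache names):

```shell
#!/bin/sh
# list_sdma_caches DIR: list the per-job SDMA packet caches created by the
# qib/hfi1 drivers; each job step that is not cleaned up leaves one behind.
list_sdma_caches() {
    ls "$1" 2>/dev/null | grep 'sdma-pkts'
}

# On a live node, report how many per-job SDMA caches are present.
if [ -d /sys/kernel/slab ]; then
    echo "per-job SDMA caches: $(list_sdma_caches /sys/kernel/slab | wc -l | tr -d ' ')"
fi
```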


As evidence that we are leaking slabs, here is a 7.3.z cluster with the old Slurm:
quartz187{foraker1}24: uptime; ls /sys/kernel/slab | wc -l; symlinks /sys/kernel/slab/ | grep dangling | wc -l
  14:17:57 up 85 days, 22:47,  8 users,  load average: 0.85, 0.55, 0.49
 305
 0

On a cluster with the same kernel and the new Slurm, this is the state of a node before it starts complaining in the log files:
 rzgenie46{foraker1}22: uptime; ls /sys/kernel/slab | wc -l; symlinks /sys/kernel/slab/ | grep dangling | wc -l
  14:24:05 up 8 days,  3:09,  1 user,  load average: 0.00, 0.01, 0.05
 106884
 44496

With the same kernel and the new Slurm, this is a node after errors start appearing in the logs but before it crashes:
 rzgenie11{foraker1}22: uptime; ls /sys/kernel/slab | wc -l; symlinks /sys/kernel/slab/ | grep dangling | wc -l
  14:25:02 up 2 days, 7 min,  2 users,  load average: 0.08, 0.03, 0.92
 480947
 288058

On a test machine with 7.4.z and the new Slurm:
 [root@opal108:~]# uptime; ls /sys/kernel/slab | wc -l; symlinks /sys/kernel/slab/ | grep dangling | wc -l
  14:17:44 up 23 days,  6:10,  1 user,  load average: 0.00, 0.01, 1.35
 55687
 41502
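The ad hoc checks above (total entries plus dangling symlinks under /sys/kernel/slab) can be wrapped into one small script. This is just a convenience sketch of the same two one-liners, not an official tool:

```shell
#!/bin/sh
# count_slabs DIR: print "total=<entries> dangling=<broken symlinks>".
# On a healthy node the totals stay in the low hundreds; on a leaking
# node they climb into the tens or hundreds of thousands (see above).
count_slabs() {
    dir=$1
    total=$(ls "$dir" | wc -l | tr -d ' ')
    dangling=0
    for entry in "$dir"/*; do
        # A dangling symlink is a link whose target no longer resolves.
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            dangling=$((dangling + 1))
        fi
    done
    echo "total=$total dangling=$dangling"
}

if [ -d /sys/kernel/slab ]; then
    count_slabs /sys/kernel/slab
fi
```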

In the log on PSM machines we see:

2017-10-25 12:01:52 [12719.510450] kmem_cache_destroy qib-user-sdma-pkts-0-02.00(161:step_172): Slab cache still has objects
2017-10-25 13:41:45 [18712.867408] kmem_cache_destroy qib-user-sdma-pkts-0-02.00(335:step_337): Slab cache still has objects
2017-10-25 13:55:54 [19561.888042] kmem_cache_destroy qib-user-sdma-pkts-0-02.00(349:step_382): Slab cache still has objects
2017-10-25 14:00:52 [19860.274998] kmem_cache_destroy qib-user-sdma-pkts-0-02.00(353:step_397): Slab cache still has objects

but on the PSM2 machines we see a different error:

[169008.932378] cache_from_obj: Wrong slab cache. kmalloc-64(8628:step_2444) but object is from kmem_cache_node
[169008.943989] cache_from_obj: Wrong slab cache. kmalloc-256(8628:step_2444) but object is from kmem_cache
[169018.286339] cache_from_obj: Wrong slab cache. kmalloc-64(8685:step_2503) but object is from kmem_cache_node
[169018.298377] cache_from_obj: Wrong slab cache. kmalloc-64(8685:step_2503) but object is from kmem_cache_node

Eventually, when there are excessive numbers of slab caches, the node crashes with something like:

2017-10-23 04:29:29 [234888.228548] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
2017-10-23 04:29:29 [234888.238069] IP: [<ffffffff811fa883>] mem_cgroup_css_offline+0x133/0x170
2017-10-23 04:29:29 [234888.246169] PGD 1cf4281067 PUD 11e728f067 PMD 0
2017-10-23 04:29:29 [234888.252066] Oops: 0000 1 SMP
2017-10-23 04:29:29 [234888.256337] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) kpatch_stack_clash_514_26_1_1chaos_kpatch(OE) kpatch(OE) xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables nfsv3 nf_log_ipv4 nf_log_common xt_LOG xt_multiport intel_powerclamp coretemp intel_rapl iosf_mbi kvm iTCO_wdt irqbypass ipmi_devintf iTCO_vendor_support hfi1 sb_edac rdmavt pcspkr edac_core lpc_ich shpchp i2c_i801 ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc ib_ipoib rdma_ucm ib_ucm msr_safe(OE) ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache mgag200 crct10dif_pclmul 8021q crct10dif_common crc32_pclmul garp drm_kms_helper stp syscopyarea llc sysfillrect crc32c_intel mrp ghash_clmulni_intel sysimgblt aesni_intel igb lrw scsi_transport_iscsi gf128mul fb_sys_fops dca glue_helper ttm ahci mxm_wmi ablk_helper ptp cryptd libahci drm pps_core i2c_algo_bit libata i2c_core fjes wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
2017-10-23 04:29:29 [234888.382291] CPU: 12 PID: 131244 Comm: slurmstepd Tainted: G W OE K------------ 3.10.0-514.26.1.1chaos.ch6_1.x86_64 #1
2017-10-23 04:29:29 [234888.396817] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0020.122820161512 12/28/2016

[169020.702294] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
[169020.712241] IP: [<ffffffff811fa883>] mem_cgroup_css_offline+0x133/0x170
[169020.720775] PGD 7ca706067 PUD f2a6c2067 PMD 0 
[169020.726856] Oops: 0000 [#1] SMP 
[169020.731585] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) kpatch_stack_clash_514_26_1_1chaos_kpatch(OE) kpatch(OE) xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables nfsv3 nf_log_ipv4 nf_log_common xt_LOG xt_multiport ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_powerclamp coretemp intel_rapl iosf_mbi hfi1 kvm rdmavt iTCO_wdt irqbypass ipmi_devintf iTCO_vendor_support ib_core sb_edac lpc_ich edac_core pcspkr i2c_i801 shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc msr_safe(OE) nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache 8021q garp stp
[169020.818684]  llc crct10dif_pclmul crct10dif_common mrp crc32_pclmul crc32c_intel mgag200 ghash_clmulni_intel drm_kms_helper scsi_transport_iscsi igb syscopyarea aesni_intel sysfillrect sysimgblt lrw gf128mul fb_sys_fops dca glue_helper ttm ahci ablk_helper ptp mxm_wmi cryptd libahci drm pps_core i2c_algo_bit libata i2c_core fjes wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
[169020.863456] CPU: 65 PID: 180664 Comm: slurmstepd Tainted: G        W  OE K------------   3.10.0-514.26.1.1chaos.ch6_1.x86_64 #1
[169020.879535] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0020.122820161512 12/28/2016
[169020.893659] task: ffff8817102f0fb0 ti: ffff88168e680000 task.ti: ffff88168e680000
[169020.903834] RIP: 0010:[<ffffffff811fa883>]  [<ffffffff811fa883>] mem_cgroup_css_offline+0x133/0x170
[169020.915852] RSP: 0018:ffff88168e683dd0  EFLAGS: 00010286
[169020.923654] RAX: 0000000000000000 RBX: ffff881e1c209480 RCX: ffff8815c02c0000
[169020.933528] RDX: ffff88097b0b7c00 RSI: ffff88017fc08e00 RDI: 0000000000001400
[169020.943379] RBP: ffff88168e683df8 R08: 0000000000000000 R09: 0000000000000000
[169020.953217] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88097b0b7f18
[169020.963069] R13: ffff88097b0b7f08 R14: ffff880fcdb26800 R15: ffff88168e683fd8
[169020.972935] FS:  00002aaaaab12d00(0000) GS:ffff88203e540000(0000) knlGS:0000000000000000
[169020.983899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[169020.992242] CR2: 00000000000000b8 CR3: 0000000457d91000 CR4: 00000000003407e0
[169021.002158] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[169021.012072] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[169021.021976] Stack:
[169021.026133]  ffff880fcdb26800 ffffffff81b05660 ffff8817cad27680 0000000000000000
[169021.036389]  ffff882032667a00 ffff88168e683e40 ffffffff81112397 00000000fffffffe
[169021.046663]  ffff88168e683e58 ffff8817cad27680 ffff8808f36ee120 0000000000000000
[169021.056929] Call Trace:
[169021.061630]  [<ffffffff81112397>] cgroup_destroy_locked+0xe7/0x360
[169021.070506]  [<ffffffff81112632>] cgroup_rmdir+0x22/0x40
[169021.078427]  [<ffffffff812147a8>] vfs_rmdir+0xa8/0x100
[169021.086141]  [<ffffffff81218ca5>] do_rmdir+0x1a5/0x200
[169021.093838]  [<ffffffff81219e56>] SyS_rmdir+0x16/0x20
[169021.101459]  [<ffffffff816ae549>] system_call_fastpath+0x16/0x1b
[169021.110131] Code: 48 85 d2 48 8b 88 b8 00 00 00 48 c7 c0 ff ff ff ff 74 07 48 63 82 40 03 00 00 48 8b 44 c1 08 48 8b 35 3a 40 90 00 bf 00 14 00 00 <48> 8b 90 b8 00 00 00 c6 42 28 01 48 8b 90 b8 00 00 00 48 83 c2 
[169021.135947] RIP  [<ffffffff811fa883>] mem_cgroup_css_offline+0x133/0x170
[169021.145477]  RSP <ffff88168e683dd0>
[169021.151407] CR2: 00000000000000b8

The dump_stack() from cache_from_obj() prints:
[ 6613.519407] cache_from_obj: Wrong slab cache. kmalloc-64(73:step_0) but object is from kmem_cache_node
[ 6613.530358] ------------[ cut here ]------------
[ 6613.535972] WARNING: at mm/slab.h:249 kmem_cache_free+0x1d8/0x230()
[ 6613.543400] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) kpatch_stack_clash_514_26_1_1chaos_kpatch(OE) kpatch(OE) xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables nfsv3 nf_log_ipv4 nf_log_common xt_LOG xt_multiport ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_powerclamp coretemp intel_rapl iosf_mbi hfi1 kvm rdmavt iTCO_wdt irqbypass ipmi_devintf iTCO_vendor_support ib_core sb_edac lpc_ich edac_core pcspkr i2c_i801 shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc msr_safe(OE) nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache 8021q garp stp
[ 6613.625935]  llc crct10dif_pclmul crct10dif_common mrp crc32_pclmul crc32c_intel mgag200 ghash_clmulni_intel drm_kms_helper scsi_transport_iscsi igb syscopyarea aesni_intel sysfillrect sysimgblt lrw gf128mul fb_sys_fops dca glue_helper ttm ahci ablk_helper ptp mxm_wmi cryptd libahci drm pps_core i2c_algo_bit libata i2c_core fjes wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
[ 6613.667238] CPU: 8 PID: 115530 Comm: ares.weekly Tainted: G           OE K------------   3.10.0-514.26.1.1chaos.ch6_1.x86_64 #1
[ 6613.681346] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0020.122820161512 12/28/2016
[ 6613.694394]  0000000000000000 0000000073dc6f76 ffff881034ea7c10 ffffffff8169d5bc
[ 6613.703419]  ffff881034ea7c48 ffffffff81087280 ffff88103d848e40 ffff88017fc07f00
[ 6613.712461]  0000000000000400 ffff880ff8ab4500 ffff881d30b6bff0 ffff881034ea7c58
[ 6613.721511] Call Trace:
[ 6613.724988]  [<ffffffff8169d5bc>] dump_stack+0x19/0x1b
[ 6613.731490]  [<ffffffff81087280>] warn_slowpath_common+0x70/0xb0
[ 6613.738978]  [<ffffffff810873ca>] warn_slowpath_null+0x1a/0x20
[ 6613.746263]  [<ffffffff811e58c8>] kmem_cache_free+0x1d8/0x230
[ 6613.753459]  [<ffffffff811e5980>] free_kmem_cache_nodes+0x60/0xa0
[ 6613.761050]  [<ffffffff811e7fbc>] kmem_cache_close+0x27c/0x330
[ 6613.768356]  [<ffffffff811e8794>] __kmem_cache_shutdown+0x14/0x80
[ 6613.775961]  [<ffffffff811ad1d4>] kmem_cache_destroy+0x44/0xf0
[ 6613.783299]  [<ffffffffa38b55f2>] hfi1_user_sdma_free_queues+0x1b2/0x210 [hfi1]
[ 6613.792294]  [<ffffffffa387fb7a>] hfi1_file_close+0x7a/0x350 [hfi1]
[ 6613.800123]  [<ffffffff8120971c>] __fput+0xfc/0x270
[ 6613.806410]  [<ffffffff812099de>] ____fput+0xe/0x10
[ 6613.812713]  [<ffffffff810b00b4>] task_work_run+0xb4/0xe0
[ 6613.819592]  [<ffffffff8108db0b>] do_exit+0x2eb/0xa80
[ 6613.826093]  [<ffffffff816a9984>] ? __do_page_fault+0x184/0x4a0
[ 6613.833578]  [<ffffffff8108e31f>] do_group_exit+0x3f/0xa0
[ 6613.840480]  [<ffffffff8108e394>] SyS_exit_group+0x14/0x20
[ 6613.847456]  [<ffffffff816ae549>] system_call_fastpath+0x16/0x1b
[ 6613.855017] ---[ end trace a39e704f62b38bb2 ]---

The SLURM maintainers have already looked at the problem and bounced it off the OPA maintainers at Intel, who said that it was a core kernel problem. See: https://bugs.schedmd.com/show_bug.cgi?id=3694

crash> dis -s mem_cgroup_css_offline+0x133
FILE: mm/memcontrol.c
LINE: 3503
 
  3498                  return;
  3499  
  3500          mutex_lock(&memcg->slab_caches_mutex);
  3501          list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
  3502                  cachep = memcg_params_to_cache(params);
* 3503                  cachep->memcg_params->dead = true;
  3504                  schedule_work(&cachep->memcg_params->destroy);
  3505          }
  3506          mutex_unlock(&memcg->slab_caches_mutex);
  3507  }
 
crash> dis mem_cgroup_css_offline+0x133 2
0xffffffff811fa883 <mem_cgroup_css_offline+307>:        mov    0xb8(%rax),%rdx
0xffffffff811fa88a <mem_cgroup_css_offline+314>:        movb   $0x1,0x28(%rdx)
crash> *memcg_cache_params.dead -ox
struct memcg_cache_params {
  [0x28]         bool dead;
}
crash> *kmem_cache.memcg_params -ox
struct kmem_cache {
    [0xb8] struct memcg_cache_params *memcg_params;
}

So the faulting load (mov 0xb8(%rax),%rdx, with CR2 == 0xb8) means %rax, i.e. cachep, was NULL: memcg_params_to_cache() returned a NULL cache pointer for an entry on the memcg_slab_caches list while it was being walked at line 3503.

Version-Release number of selected component (if applicable):
We see the crashes with the latest 7.3.z in production, as well as the symptoms leading up to the crashes on our test clusters, which run the latest 7.4.z. That we have yet to see crashes on the test clusters is easily attributed to the smaller number of jobs we run on them.

We initially thought that it might be a problem in the 7.3.z kernels and that https://bugzilla.redhat.com/show_bug.cgi?id=1470325 might be the culprit, but we are also seeing the kmem cache leak on Opal, a test cluster running an up-to-date 7.4 kernel. However, because of the lower number of jobs and the smaller size of the test clusters, we have yet to see a crash there.

How reproducible:
It takes a while to reproduce and seems to be tied to how many PSM2-using jobs have been run.


Steps to Reproduce:
Still working on a minimal reproducer. Right now it still seems to be:
Install and configure the latest SLURM on an OPA cluster and then run MPI jobs which use PSM2 for intra-node MPI traffic.

Additional info:
We looked at https://bugzilla.redhat.com/show_bug.cgi?id=1502818 and have some hope that this is a duplicate of that bug, but the backport of the patch wasn't trivial and we haven't seen a RHEL7 backport yet.

Comment 6 Ben Woodard 2017-10-30 17:43:28 UTC
We found that when we set ConstrainKmemSpace=no in the Slurm configuration, there were no crashes over the weekend. Therefore it appears that the kmem cgroup is strongly implicated as the source of the problem. I asked LLNL's kernel engineer to apply a44cb9449182fd7b25bf5f1cc38b7f19e0b96f6d, which could deal with at least part of the problem. ref: https://bugzilla.redhat.com/show_bug.cgi?id=1442618
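For reference, that workaround corresponds to a cgroup.conf along these lines (a sketch only; the file path and option names follow Slurm's cgroup.conf conventions and should be verified against the installed Slurm version):

```
# /etc/slurm/cgroup.conf (fragment, assumed path)
CgroupAutomount=yes
ConstrainRAMSpace=yes
ConstrainKmemSpace=no    # leave kmem limits unset to avoid the slab cache leak
```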

Comment 7 Ben Woodard 2017-10-31 19:38:05 UTC
From the reporter:

I did some testing on a 7.4 node today, and I was able to trivially recreate the behavior where setting a kmem limit in a cgroup causes a number of slab caches to get created that don't go away when the cgroup is removed. The steps were:
 1. mkdir /sys/fs/cgroup/memory/kmem_test
 2. echo 137438953472 > /sys/fs/cgroup/memory/kmem_test/memory.kmem.max_usage_in_bytes  (128GB, total RAM in the node)
 3. Spawn a shell and echo its pid into /sys/fs/cgroup/memory/kmem_test/tasks to put the shell into the cgroup.
 4. Do a little work in the shell. Pinging the management node seemed to be enough.
 5. 'ls /sys/kernel/slab | grep kmem_test' will show a bunch of new slab caches.
 6. Exit the shell. When I tried this, it took the shell 30-60 seconds to exit. Not sure what was up there.
 7. rmdir /sys/fs/cgroup/memory/kmem_test
 8. 'ls /sys/kernel/slab | grep kmem_test' shows the slabs are still there.

No panics/messages to syslog though, and no broken symlinks in /sys/kernel/slab.  I'm not rightly certain whether all of the child slab caches are supposed to go away when the cgroup is removed or not, but letting them build up into the hundreds of thousands probably isn't good behavior.

----------

Is that the expected behavior, or do we have our reproducer here? What is supposed to reap these now-abandoned slab caches? The buildup of these slab caches seems to be what ultimately crashes the node, so the reproducer that leads to the crash is probably something like the steps above, iterated many times.
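Those steps, iterated, can be sketched as a script. This is a sketch under assumptions: a cgroup v1 memory controller mounted at CG_ROOT, run as root; the kmem_cycle helper name and the cleanup details are mine, not from any tool mentioned here:

```shell
#!/bin/sh
# kmem_cycle CG_ROOT N: repeat the steps above N times: create a memory
# cgroup, set a kmem limit (which switches on kmem accounting for it),
# run a short-lived shell inside it, then remove the cgroup. After many
# cycles, count the leftovers with: ls /sys/kernel/slab | wc -l
kmem_cycle() {
    cg_root=$1 n=$2 i=0
    while [ "$i" -lt "$n" ]; do
        cg=$cg_root/kmem_test_$i
        mkdir "$cg" || return 1
        echo 137438953472 > "$cg/memory.kmem.limit_in_bytes"  # 128 GB
        # Move a throwaway shell into the cgroup and do a little work.
        sh -c "echo \$\$ > '$cg/tasks'; :" 2>/dev/null
        rm -f "$cg"/* 2>/dev/null  # no-op on real cgroupfs control files
        rmdir "$cg" || return 1
        i=$((i + 1))
    done
    echo "completed $n cycles"
}

# Example: kmem_cycle /sys/fs/cgroup/memory 1000
```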

Comment 8 Josko Plazonic 2017-11-09 16:30:31 UTC
Good morning,

I hope you don't mind if I butt in. We have a cluster of 80 Intel OPA/PSM2 nodes running 7.4 and Slurm 17.04 with memory enforcement, with the hfi1 drivers updated to Intel's latest 10.6.x.x, that is experiencing similar issues, e.g. after a while lots of:

[684578.150960] cache_from_obj: Wrong slab cache. kmem_cache_node but object is from kmalloc-64(6281:step_0)
[684578.162326] cache_from_obj: Wrong slab cache. kmem_cache_node but object is from kmalloc-64(6281:step_0)
[684578.173546] cache_from_obj: Wrong slab cache. kmem_cache but object is from kmalloc-256(6281:step_0)
[684578.186536] cache_from_obj: Wrong slab cache. kmem_cache_node but object is from kmalloc-64(6281:step_0)
[684578.197869] cache_from_obj: Wrong slab cache. kmem_cache_node but object is from kmalloc-64(6281:step_0)
[684578.210447] cache_from_obj: Wrong slab cache. kmem_cache but object is from kmalloc-256(6281:step_0)

in the logs, and we see huge /sys/kernel/slab directories (e.g. 120k entries). No node crashes, but we do get weird program crashes that we cannot easily explain any other way.

Anyway, I just wanted to report that we tried applying a44cb9449182fd7b25bf5f1cc38b7f19e0b96f6d to 3.10.0-693.5.2.el7, and while it does not fully fix the problem, it does help. We repeatedly ran a simple MPI program (report rank), and on the node without the patch
ls /sys/kernel/slab | grep step | wc -l
grew at twice the rate of the node with the patch (after an hour, 2043 vs 965). We'd be happy to try more things if that might help.

Thanks,

Josko

Comment 9 Travis Gummels 2017-11-09 17:05:51 UTC
Hi,

Can you both try this patch and report back its effect on the issue:

commit f57947d27711451a7739a25bba6cddc8a385e438
Author: Li Zefan <lizefan@huawei.com>
Date:   Tue Jun 18 18:41:53 2013 +0800

    cgroup: fix memory leak in cgroup_rm_cftypes()
    
    The memory allocated in cgroup_add_cftypes() should be freed. The
    effect of this bug is we leak a bit memory everytime we unload
    cfq-iosched module if blkio cgroup is enabled.

I was advised the above might help.

Travis

Comment 10 Ben Woodard 2017-11-09 20:39:19 UTC
Travis, before we ask LLNL to test random patches, can we do some testing to make sure that we don't continue to see evidence of the problem using the reproducer provided?

Comment 11 Josko Plazonic 2017-11-09 20:51:30 UTC
Good afternoon,

just finished testing with the 2nd patch as well, and there isn't a huge difference. After 60 MPI runs I get 3175 items in /sys/kernel/slab unpatched, 1285 with the first patch, and 1290 with both. I also kept a listing of the /sys/kernel/slab dir after every test, in case you want to compare what's happening, and can put it up somewhere.

Thanks for the suggestion - if it can help keep'em coming.

Comment 13 chenxu 2017-12-14 02:04:56 UTC
Encountered this issue with nvidia-docker (cgroups enabled) and kernel 3.10.0-514.el7.x86_64 on CentOS:

[6728212.703168]  [<ffffffff811c35c4>] __kmem_cache_shutdown+0x14/0x80
[6728212.703196]  [<ffffffff8118ceaf>] kmem_cache_destroy+0x3f/0xe0
[6728212.703222]  [<ffffffff811d4949>] kmem_cache_destroy_memcg_children+0x89/0xb0
[6728212.703252]  [<ffffffff8118ce84>] kmem_cache_destroy+0x14/0xe0
[6728212.703284]  [<ffffffffa167a507>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[6728212.703321]  [<ffffffffa167c11e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[6728212.703355]  [<ffffffffa1650b10>] remove_gpu+0x220/0x300 [nvidia_uvm]
[6728212.703387]  [<ffffffffa1650e31>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[6728212.703423]  [<ffffffffa1654588>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[6728212.703458]  [<ffffffffa164a401>] uvm_release+0x11/0x20 [nvidia_uvm]
[6728212.703486]  [<ffffffff811e0329>] __fput+0xe9/0x270
[6728212.703509]  [<ffffffff811e05ee>] ____fput+0xe/0x10
[6728212.703532]  [<ffffffff810a22f4>] task_work_run+0xc4/0xe0
[6728212.703557]  [<ffffffff810815eb>] do_exit+0x2cb/0xa60
[6728212.703580]  [<ffffffff810e1f22>] ? __unqueue_futex+0x32/0x70
[6728212.703605]  [<ffffffff810e2f7d>] ? futex_wait+0x11d/0x280
[6728212.704619]  [<ffffffff81081dff>] do_group_exit+0x3f/0xa0
[6728212.705598]  [<ffffffff81092c10>] get_signal_to_deliver+0x1d0/0x6d0
[6728212.706559]  [<ffffffff81014417>] do_signal+0x57/0x6c0
[6728212.707495]  [<ffffffff8110b796>] ? __audit_syscall_exit+0x1e6/0x280
[6728212.708407]  [<ffffffff81014adf>] do_notify_resume+0x5f/0xb0

Comment 29 Aaron Tomlin 2018-02-19 10:32:18 UTC
Ben,

Can you please try the attached patches (the non-obsolete ones)?
The following commits from Linus' tree are included:

 363a044 memcg, slab: kmem_cache_create_memcg(): fix memleak on fail path
 1aa1325 memcg, slab: clean up memcg cache initialization/destruction
 2edefe1 memcg, slab: fix races in per-memcg cache creation/destruction
 b852990 memcg, slab: do not destroy children caches if parent has aliases

Comment 30 Josko Plazonic 2018-02-20 22:06:16 UTC
Hullo,

I am not Ben, but I gave it a try nevertheless. The patches apply cleanly to the 693.17.1 kernel but don't seem to help at all. The number of entries in /sys/kernel/slab increases at almost exactly the same rate as with a clean 693.17.1 kernel.

Josko

Comment 31 Kir Kolyshkin 2018-03-08 19:29:12 UTC
There are a few reports of the same thing happening when running Docker here: https://github.com/moby/moby/issues/35446

Kernel versions reported there are all different, but they look recent.

Comment 32 xyh 2018-03-22 09:33:02 UTC
#!/bin/bash 
mkdir -p pages
for x in `seq 1280000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /sys/fs/cgroup/memory/foo
        # echo 1M > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
        echo 100M > /sys/fs/cgroup/memory/foo/memory.kmem.limit_in_bytes
        echo $$ >/sys/fs/cgroup/memory/foo/cgroup.procs
        memhog 4K &>/dev/null
        echo trex>pages/$x
        echo $$ >/sys/fs/cgroup/memory/cgroup.procs 
        rmdir /sys/fs/cgroup/memory/foo 
done




[root@localhost ~]# cat /proc/mounts 
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=4077588k,nr_inodes=1019397,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
/dev/mapper/centos-root / xfs rw,relatime,attr2,inode64,noquota 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=31,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
/dev/vda1 /boot xfs rw,relatime,attr2,inode64,noquota 0 0


I encountered the same issue using the script above.

I got the script from https://github.com/torvalds/linux/commit/73f576c04b9410ed19660f74f97521bee6e1c546 and made some changes to reproduce it, prompted by "Docker leaking cgroups causing no space left on device? #29638" (https://github.com/moby/moby/issues/29638).

Unfortunately, I ran into this bug.

On long-term stable 4.4.121, I did not see the same behavior.

FYI.

Comment 34 Cody Jarrett 2018-05-15 15:54:35 UTC
Encountering this same issue with CentOS 7, kernel 3.10.0-693.21.1.el7.x86_64, and docker-ce-17.12.1.ce-1.el7.centos.x86_64. The behavior we see is that, after a period of heavy container usage, the /sys/kernel/slab dir on a host has over a million files/dirs, and new containers no longer start up.

Also the following is logged heavily:
May 14 19:50:06 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(5071:9934c4c3821bcb474f5a113d5a1822c276fcf196d6c2e31b1b8dd969df9d50d2)
May 14 19:50:06 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(5071:9934c4c3821bcb474f5a113d5a1822c276fcf196d6c2e31b1b8dd969df9d50d2)
May 14 19:50:06 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(5071:9934c4c3821bcb474f5a113d5a1822c276fcf196d6c2e31b1b8dd969df9d50d2)
May 14 19:50:06 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(5071:9934c4c3821bcb474f5a113d5a1822c276fcf196d6c2e31b1b8dd969df9d50d2)

Comment 36 Di Weng 2018-07-10 09:25:33 UTC
It looks like Kubernetes users are also experiencing this issue: the recent change enabling the kmem limit in Kubernetes causes cgroups to leak. For details, see https://github.com/kubernetes/kubernetes/issues/61937.

Comment 40 Cody Jarrett 2018-08-13 18:57:29 UTC
Is this just pending someone testing the patches?

Comment 41 Joe Thompson 2018-08-15 02:11:27 UTC
1442618 referenced up near the top is a private bug -- can the relevant parts be copied here (or has that already been done), or is there another bug that has them?

Comment 44 Anand Patil 2018-08-23 20:31:17 UTC
This is biting us, and at least one of our customers, via the Kubernetes issue linked above. The following script reproduces it reliably on a Kubernetes node running kernel 3.10.0-862.9.1.el7.x86_64:

#!/bin/bash
run_some_pods () {
  for j in {1..20}
  do
    kubectl run -i -t busybox-${j} --image=busybox --restart=Never -- echo "hi" &> /dev/null &
  done
  sleep 1
  wait
  for j in {1..20}
  do
    kubectl delete pods busybox-${j} --grace-period=0 --force &> /dev/null || true
  done
}
for i in {1..3279}
do
  echo "Running pod batch $i"
  run_some_pods
done

We'd be happy to test any patches.

Comment 45 Anand Patil 2018-08-29 16:16:54 UTC
We were able to work around the issue in the context of Kubernetes 1.8.12 by doing the following on startup: before running the v1.8 kubelet, we run a v1.6 kubelet with suitable arguments and verify that it completes enough of its startup process to create /sys/fs/cgroup/memory/kubepods. Then we start up the Kubernetes 1.8.12 cluster as normal.


We tried creating those cgroup folders manually as a simpler version of the above workaround using the following script:

#!/bin/bash

for thing in memory blkio "cpu,cpuacct" cpuset devices freezer hugetlb net_cls perf_event systemd
do
   mkdir -p /sys/fs/cgroup/${thing}/kubepods
   mkdir -p /sys/fs/cgroup/${thing}/kubepods/burstable
   mkdir -p /sys/fs/cgroup/${thing}/kubepods/besteffort
done

However, this simpler attempt did not prevent the repro script above from showing the issue.


What would we need to add to this simpler workaround to prevent the issue as starting a v1.6 kubelet does?

Comment 46 Vamsee Yarlagadda 2018-08-30 00:36:11 UTC
Posting stats from one of the machines that shows this problem:

Problem:
> [root@vd0916 ~]# mkdir /sys/fs/cgroup/memory/b6c438c7ba469d5c3bad89c4a81d732a68a375b096d3791c4847cd517f8eee7c
> mkdir: cannot create directory ‘/sys/fs/cgroup/memory/b6c438c7ba469d5c3bad89c4a81d732a68a375b096d3791c4847cd517f8eee7c’: No space left on device

Some metrics:
> [root@vd0916 ~]# ll /sys/kernel/slab | wc -l 
> 2587635

> [root@vd0916 ~]# slabtop -s -c

 Active / Total Objects (% used)    : 87188577 / 222858888 (39.1%)
 Active / Total Slabs (% used)      : 4716565 / 4716565 (100.0%)
 Active / Total Caches (% used)     : 77 / 118 (65.3%)
 Active / Total Size (% used)       : 17769913.34K / 63916938.41K (27.8%)
 Minimum / Average / Maximum Object : 0.01K / 0.29K / 8.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
29129856 15635925  53%    0.19K 693568       42   5548544K dentry
23371439 5516568  23%    0.57K 417348       56  13355136K radix_tree_node
22468896 22386411  99%    0.11K 624136       36   2496544K sysfs_dir_cache
19747455 13294931  67%    0.10K 506345       39   2025380K buffer_head
17578139 5788599  32%    0.15K 331663       53   2653304K xfs_ili
15482112  88288   0%    0.25K 241908	   64   3870528K kmalloc-256
11236544  44493   0%    0.50K 175571	   64   5618272K kmalloc-512
11229792 1416428  12%    0.38K 267376       42   4278016K blkdev_requests
11110932 5855107  52%    0.09K 264546       42   1058184K kmalloc-96
9687120 1467586  15%    1.06K 322904	   30  10332928K xfs_inode
9019200 3112167  34%    0.12K 140925	   64   1127400K kmalloc-128
8103040 894521  11%    0.06K 126610	  64    506440K kmalloc-64
7442190 2117485  28%    0.19K 177195	   42   1417560K kmalloc-192
7131390  32130   0%    0.23K 101877	  70   1630032K cfq_queue
4258608 753366  17%    0.66K  88721	  48   2839072K shmem_inode_cache
3663990 2814407  76%    0.58K  66618	   55   2131776K inode_cache
3575296 382122  10%    0.01K   6983	 512     27932K kmalloc-8
3378048 2483023  73%    0.03K  26391	  128    105564K kmalloc-32
1172800 1168793  99%    0.06K  18325	   64     73300K kmem_cache_node
660992 152980  23%    0.02K   2582	256     10328K kmalloc-16
588736 587894  99%    0.25K   9199	 64    147184K kmem_cache
570556 336689  59%    0.64K  11644	 49    372608K proc_inode_cache
527904  28209   5%    1.00K  16497	 32    527904K kmalloc-1024
308160   6622   2%    1.56K  15408	 20    493056K mm_struct
225215   4247   1%    1.03K   7265	 31    232480K nfs_inode_cache
201952 133963  66%    4.00K  25244        8    807808K kmalloc-4096
132600   3240   2%    0.39K   3315	 40     53040K xfs_efd_item
123368 123368 100%    0.07K   2203	 56	 8812K Acpi-ParseExt
110048  37821  34%    2.00K   6878	 16    220096K kmalloc-2048
 71055   3090   4%    2.06K   4737	 15    151584K idr_layer_cache
 51684  50881  98%    0.05K    708	 73	 2832K fanotify_event_info
 51000  50933  99%    0.12K    750	 68	 6000K dnotify_mark
 37504  36803  98%    0.06K    586	 64	 2344K anon_vma
 33303  33303 100%    0.08K    653	 51	 2612K selinux_inode_security
 28083  27187  96%    0.21K    759	 37	 6072K vm_area_struct
 27534  18839  68%    0.81K    706	 39     22592K task_xstate
 26214  21624  82%    0.04K    257	102	 1028K Acpi-Namespace
 26163  25143  96%    0.31K    513	 51	 8208K nf_conntrack_ffffffff81a26d80
 25959  25808  99%    0.62K    509	 51     16288K sock_inode_cache
 25530  25312  99%    0.09K    555	 46	 2220K configfs_dir_cache
 24565  24225  98%    0.05K    289	 85	 1156K shared_policy_node
 23205  22976  99%    0.10K    595	 39	 2380K blkdev_ioc
 20416  20197  98%    0.18K    464	 44	 3712K xfs_log_ticket


> cat /proc/slabinfo

   <<<< See file attached >>>>

Comment 47 Vamsee Yarlagadda 2018-08-30 00:38:31 UTC
Created attachment 1479631 [details]
Output of /proc/slabinfo on the machine showing the issue

Comment 48 Vamsee Yarlagadda 2018-08-30 01:51:23 UTC
Kernel version of the host on which I was able to reproduce the issue:

> [root@vd0916 ~]# uname -r
> 3.10.0-327.36.3.el7.x86_64

Reproduction steps:

> I ran a script similar to the one @Anand posted in https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c44,
> starting a lot of Kubernetes pods and killing them in a tight loop.

Comment 49 Anand Patil 2018-08-30 01:53:46 UTC
The source of the strange workaround in comment 45 was this article: http://www.linuxfly.org/kubernetes-19-conflict-with-centos7/?, which one of our colleagues translated for us. His summary was:

Upgrading k8s from 1.6 to 1.9 alone does not reproduce the cgroup memory leak because, without a node restart, the kubelet reuses the

/sys/fs/cgroup/memory/kubepods

directory created by the k8s 1.6 kubelet, which has the cgroup kernel memory limit disabled by default. Restarting the node causes

/sys/fs/cgroup/memory/kubepods

to be re-created by the k8s 1.9 kubelet, which enables the cgroup kernel memory limit by default.

Tomorrow I'll start a test to double-check the workaround on a second node.
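
In case it helps others verify which mode their node ended up in, here is a minimal sketch of my own (not an official check): it reads memory.kmem.limit_in_bytes and treats the x86_64 "unlimited" sentinel 9223372036854771712 as kmem accounting being off. The default path and the sentinel value are assumptions; verify them on your kernel. The demo runs against a fake directory so the script works on any machine.

```shell
#!/bin/sh
# Sketch: report whether a memory cgroup has a kernel memory limit set.
# UNLIMITED is what memory.kmem.limit_in_bytes reads when no limit is
# set (PAGE_COUNTER_MAX on x86_64) -- an assumption, verify on your kernel.
UNLIMITED=9223372036854771712

kmem_status() {
    dir="$1"
    if [ ! -r "$dir/memory.kmem.limit_in_bytes" ]; then
        echo "no kmem limit file under $dir"
        return 1
    fi
    limit=$(cat "$dir/memory.kmem.limit_in_bytes")
    if [ "$limit" -ge "$UNLIMITED" ]; then
        echo "kmem accounting disabled"
    else
        echo "kmem accounting enabled (limit=$limit)"
    fi
}

# Demo against a fake cgroup directory; on a real node you would pass
# e.g. /sys/fs/cgroup/memory/kubepods instead.
fake=$(mktemp -d)
echo "$UNLIMITED" > "$fake/memory.kmem.limit_in_bytes"
kmem_status "$fake"    # kmem accounting disabled
echo 104857600 > "$fake/memory.kmem.limit_in_bytes"
kmem_status "$fake"    # kmem accounting enabled (limit=104857600)
rm -rf "$fake"
```

If the v1.6-kubelet workaround does what the translated article suggests, kubepods should report "disabled" after it runs.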

Comment 50 Arun M. Krishnakumar 2018-08-30 02:19:10 UTC
I tried the script in Comment 32 and was able to reproduce the same issue (a docker run fails with ENOSPC). The script is copied below for ease of reading:

#!/bin/bash
mkdir -p pages
for x in `seq 1280000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /sys/fs/cgroup/memory/foo
        # echo 1M > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
        echo 100M > /sys/fs/cgroup/memory/foo/memory.kmem.limit_in_bytes
        echo $$ > /sys/fs/cgroup/memory/foo/cgroup.procs
        memhog 4K &>/dev/null
        echo trex > pages/$x
        echo $$ > /sys/fs/cgroup/memory/cgroup.procs
        rmdir /sys/fs/cgroup/memory/foo
done

I suspected that this could be due to short-lived processes (our use case potentially has many short-lived containers in an error-handling path). I modified the script to the following, with sleeper being a compiled C program that prints its pid (getpid()) and then sleeps for one second:

#!/bin/bash
mkdir -p pages
for x in `seq 1280000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /sys/fs/cgroup/memory/test2
        # echo 1M > /sys/fs/cgroup/memory/test2/memory.limit_in_bytes
        echo 100M > /sys/fs/cgroup/memory/test2/memory.kmem.limit_in_bytes
        ./sleeper >/sys/fs/cgroup/memory/test2/cgroup.procs
        memhog 4K &>/dev/null
        echo trex>pages/$x
        echo $$ >/sys/fs/cgroup/memory/cgroup.procs
        rmdir /sys/fs/cgroup/memory/test2
done
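
For anyone who wants to try this variant without compiling C, a shell stand-in for the sleeper should behave the same way (an assumption on my part that print-pid-then-sleep is all that matters here):

```shell
#!/bin/sh
# Stand-in for the compiled "sleeper": print own pid, then sleep 1s.
# The pid goes to stdout, so redirecting into cgroup.procs moves the
# still-running process into the cgroup -- the same trick ./sleeper uses.
sleeper() {
    sh -c 'echo $$; sleep 1'
}

# In the loop above it would be used as:
#   sleeper > /sys/fs/cgroup/memory/test2/cgroup.procs
sleeper
```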

---------------------------------------
On a machine without the ENOSPC:

ls -lrt /sys/kernel/slab/ | wc -l
251

slabtop -s l -o
 Active / Total Objects (% used)    : 533404 / 535929 (99.5%)
 Active / Total Slabs (% used)      : 18399 / 18399 (100.0%)
 Active / Total Caches (% used)     : 65 / 95 (68.4%)
 Active / Total Size (% used)       : 182723.21K / 183691.12K (99.5%)
 Minimum / Average / Maximum Object : 0.01K / 0.34K / 8.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
124131 123889  99%    0.19K   5911       21     23644K dentry
 89908  89908 100%    0.15K   3458       26     13832K xfs_ili
 95280  95280 100%    1.06K   3176       30    101632K xfs_inode
 89427  89427 100%    0.10K   2293       39      9172K buffer_head
 16335  16147  98%    0.58K    605       27      9680K inode_cache
 18072  18072 100%    0.11K    502       36      2008K sysfs_dir_cache


---------------------------------------
On a machine with the ENOSPC:

ls -lrt /sys/kernel/slab/ | wc -l
2491573

sudo slabtop -s l -o
 Active / Total Objects (% used)    : 99818530 / 209669755 (47.6%)
 Active / Total Slabs (% used)      : 4532638 / 4532638 (100.0%)
 Active / Total Caches (% used)     : 125 / 161 (77.6%)
 Active / Total Size (% used)       : 21303475.11K / 59862027.02K (35.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.29K / 8.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
30771153 17915952  58%    0.19K 732647       42   5861176K dentry
22472028 22454051  99%    0.11K 624223       36   2496892K sysfs_dir_cache
22912383 19079451  83%    0.10K 587497       39   2349988K buffer_head
18623542 3876899  20%    0.57K 333069       56  10658208K radix_tree_node
8240850 837650  10%    1.06K 274695       30   8790240K xfs_inode
10468080 6422369  61%    0.09K 249240       42    996960K kmalloc-96
10072062 677935   6%    0.38K 239811       42   3836976K blkdev_requests
13732880 415473   3%    0.25K 214577       64   3433232K kmalloc-256

---------------------------------------

I can see that with a slightly longer-lived process (as shown below), the issue has not been hit even after running through 90k iterations. The active slab count stays around 18k with the 1s-long processes, while with the short-lived process it was at 400k at a similar point.

---------------------------------------
time ./sleeper
4427

real	0m1.001s
user	0m0.000s
sys	0m0.001s
---------------------------------------
time echo $$
29978

real	0m0.000s
user	0m0.000s
sys	0m0.000s
---------------------------------------

Based on a superficial understanding of the comments in the bug and the test results, it seems that the 'leak' is exercised in cases where there are many short-lived processes / containers. Is this assumption right? We can implement this as a workaround if this reduces the frequency of the leak.

Comment 51 Arun M. Krishnakumar 2018-08-30 03:02:44 UTC
Continuing from Comment 50: I just tried running the script with the short-lived process on the side. The 'Active / Total Slabs (% used)' count jumped from 18k to 27k in about 1k iterations of the script and shows no signs of going down. So I strongly suspect that a quickly exiting process can somehow lead to this state.

Could someone with a better understanding of the issue and the patch please corroborate this? We could implement a workaround based on it.
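
To make the same correlation easy for others to check, here is a minimal monitoring sketch of my own (nothing official): it just counts /sys/kernel/slab entries before and after a workload, which is the same signal used in the earlier comments. The demo runs against a scratch directory so it works on any machine.

```shell
#!/bin/sh
# Sketch: count slab-cache sysfs entries and report the delta across a
# workload, so leak growth can be tied to a specific reproducer.
count_slab_caches() {
    dir="${1:-/sys/kernel/slab}"
    ls "$dir" 2>/dev/null | wc -l
}

slab_delta() {
    dir="$1"; shift
    before=$(count_slab_caches "$dir")
    "$@"    # the workload, e.g. one batch of the repro script
    after=$(count_slab_caches "$dir")
    echo "slab entries: $before -> $after (delta $((after - before)))"
}

# Demo on a scratch directory; on a real node you would run e.g.:
#   slab_delta /sys/kernel/slab ./repro.sh
scratch=$(mktemp -d)
slab_delta "$scratch" touch "$scratch/fake_cache"
rm -rf "$scratch"
```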

Comment 53 Steven Fishback 2018-09-20 13:56:51 UTC
I cannot view External Bug ID: Red Hat Knowledge Base (Solution) 3562061

Comment 54 Cody Jarrett 2018-09-21 15:15:06 UTC
I cannot see that KB article either. Can you please update it?

Comment 58 Aaron Tomlin 2018-10-15 12:53:58 UTC
(In reply to Vamsee Yarlagadda from comment #48)
> Kernel version of the host that I was able to reproduce the issue.
> > [root@vd0916 ~]# uname -r
> > 3.10.0-327.36.3.el7.x86_64

As per Bug 1502818, can you attempt to reproduce the reported issue under
kernel-3.10.0-862.el7?

Comment 60 Erik Lattimore 2018-10-17 15:37:49 UTC
We seem to be experiencing this issue with short lived containers with kernel 3.10.0-862.9.1.el7.x86_64 and Docker version 18.03.1-ce, build 9ee9f40

Comment 62 Kir Kolyshkin 2018-11-01 21:46:06 UTC
So, as far as I know, the kernel memory controller in the RHEL7 kernel is very old, useless*, and buggy; it either needs to be disabled entirely or fixed by backporting changes from 4.x kernels (where it was basically rewritten from scratch). A workaround is NOT to enable/use kmem; I am working on that in https://github.com/opencontainers/runc/pull/1921 (or https://github.com/opencontainers/runc/pull/1920).

* It is useless because there is no reclamation of kernel memory, i.e. dcache and icache entries are not reclaimed once the cgroup runs out of kernel memory.

Comment 63 Kir Kolyshkin 2018-11-01 22:01:10 UTC
The RHEL 7.6 kernel should have version number 3.10.0-957 -- can anyone check whether the bug is fixed there?

Comment 64 Aristeu Rozanski 2018-11-01 23:16:55 UTC
The issue is a bit different, but the symptoms are the same. Upstream, the slab
caches aren't freed immediately but only when memory pressure occurs. Because of
changes present upstream, it's possible to free the memcg id as soon as the
cgroup goes offline, so the slab caches and the corresponding memcg keep
existing until everything else is freed.

Two things are happening in RHEL7: kmem caches aren't being emptied even under
memory pressure, and the memcg id isn't being freed, so the -ENOSPC and the
failure to create new memcgs come from the fact that there are 65535 memory
cgroups still holding an id while waiting for their slab caches to be freed.

This has not been fixed in 7.6 and I'm working on the fix. It's possible that
just solving the first problem will address both, but I'm not sure yet.
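
A rough way to watch this from userspace: on my understanding, per-memcg slab caches show up in /sys/kernel/slab with names like "dentry(123:<cgroup>)" (the exact naming pattern is an assumption about this kernel; verify on your system). This sketch counts entries matching that suffix; the demo uses a fake directory so it runs anywhere.

```shell
#!/bin/sh
# Sketch: count per-memcg slab caches by their "(id:name)" suffix. A
# large, growing count while new cgroup creation fails with ENOSPC is
# consistent with memcg ids pinned by caches waiting to be freed.
count_memcg_caches() {
    dir="${1:-/sys/kernel/slab}"
    ls "$dir" 2>/dev/null | grep -c '(.*:.*)'
}

# Demo on a fake directory; on a real node just run: count_memcg_caches
fake=$(mktemp -d)
touch "$fake/dentry" "$fake/dentry(123:podabc)" "$fake/kmalloc-64(124:podxyz)"
count_memcg_caches "$fake"    # prints 2
rm -rf "$fake"
```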

Comment 71 Aristeu Rozanski 2019-05-08 19:59:06 UTC
Test kernel: http://people.redhat.com/~arozansk/a18a3f62f17ae367d23647cad92c05f0b5ca39dc/

Known issues: dentry isn't being accounted; moving a process while it's allocating kmem with
existing or new slab caches will leak both the cache and the old memory cgroup.

In practical terms, this test kernel should significantly reduce the kmem cache and memory
cgroup leaks for short-lived containers, where the processes running in the container finish
before the cgroup is removed.

I'll keep working on the remaining issues.

Comment 74 Kir Kolyshkin 2019-06-07 01:11:10 UTC
Aristeu, is there any decent way to see which backports you have applied (other than downloading the src.rpm and trying to figure it out)?

In general, all related upstream fixes should be from Vladimir Davydov (vdavydov@virtuozzo.com and vdavydov.dev@gmail.com).

Comment 75 Aristeu Rozanski 2019-06-07 14:05:37 UTC
Created attachment 1578367 [details]
Test patches.

Attached.

Comment 76 Aristeu Rozanski 2019-06-14 13:24:09 UTC
Folks, can I get some feedback on the test kernel please?

Comment 87 Markus Schibli 2019-07-18 12:04:18 UTC
Created attachment 1591753 [details]
Slab active/Total Graph

Comment 88 Markus Schibli 2019-07-18 12:04:48 UTC
Created attachment 1591754 [details]
Slab active/Total Graph data

Comment 101 Jan Stancek 2019-08-16 07:19:39 UTC
Patch(es) committed on kernel-3.10.0-1075.el7

Comment 105 Aaron 2019-08-19 12:26:17 UTC
Quick question about the kernel the update is committed on: I see that 7.7 was released with kernel 3.10.0-1062.el7 earlier this month, and in https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c101 Jan Stancek mentioned the patch was included in kernel 3.10.0-1075.el7. RHEL 8.0 starts with the new 4.18.0 kernel. Typically the kernel updates are 1062.x.x releases, e.g. the latest kernel for 7.6 is 3.10.0-957.27.4.el7.

My question is, which release of RHEL is 3.10.0-1075.el7 slated to be included in? 7.8?

Comment 106 Aristeu Rozanski 2019-08-19 14:08:24 UTC
(In reply to Aaron from comment #105)
> Quick question about the kernel the update is committed on... I see that 7.7
> was released with kernel 3.10.0-1062.el7 earlier this month and in
> https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c101 Jan Stancek
> mentioned the patch was included in kernel 3.10.0-1075.el7. RHEL 8.0 starts
> with the new 4.18.0 kernel. Typically the kernel updates are 1062.x.x
> releases, i.e. the latest kernel for 7.6 is 3.10.0-957.27.4.el7.
> 
> My question is, which release of RHEL is 3.10.0-1075.el7 slated to be
> included in? 7.8?

7.8. These fixes are likely to be present in a future zstream release.

Comment 108 Rafael Aquini 2019-08-21 19:27:20 UTC
*** Bug 1739145 has been marked as a duplicate of this bug. ***

Comment 118 Aaron 2019-09-13 17:08:45 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c106

Thanks @Aristeu, I see https://bugzilla.redhat.com/show_bug.cgi?id=1748237 was created for a z-stream update to 7.5 with status of "ON_QA". Since 7.8 has not yet been released, is there a z-stream 3.10.0-1062.x release in the works as well?

Comment 120 Aristeu Rozanski 2019-09-16 13:46:21 UTC
Hi Aaron,

(In reply to Aaron from comment #118)
> https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c106
> 
> Thanks @Aristeu, I see https://bugzilla.redhat.com/show_bug.cgi?id=1748237
> was created for a z-stream update to 7.5 with status of "ON_QA". Since 7.8
> has not yet been released, is there a z-stream 3.10.0-1062.x release in the
> works as well?

Yes, there's a BZ to include it in the 7.7 kernel as well (1752421).


Note You need to log in before you can comment on or make changes to this bug.