Bug 870011
| Summary | Process constrained by CGroup memory.limit_in_bytes is killed instead of being swapped | | |
|---|---|---|---|
| Product | Red Hat Enterprise Linux 6 | Reporter | Martin Bukatovic <mbukatov> |
| Component | kernel | Assignee | Johannes Weiner <jweiner> |
| Status | CLOSED ERRATA | QA Contact | Li Wang <liwan> |
| Severity | high | Priority | high |
| Version | 6.3 | CC | ccui, lwang, lwoodman, matt, mbukatov, sgraf, tlavigne |
| Target Milestone | rc | Target Release | --- |
| Hardware | All | OS | Linux |
| Fixed In Version | kernel-2.6.32-464.el6 | Doc Type | Bug Fix |
| Last Closed | 2014-10-14 05:11:00 UTC | Type | Bug |
| Bug Blocks | 525237, 876610 | | |

Description
Martin Bukatovic
2012-10-25 11:09:46 UTC
I just tested this on my RHEL6 system and it works correctly.

[root@dhcp47-183 ~]# uname -a
Linux dhcp47-183.lab.bos.redhat.com 2.6.32.339 #9 SMP Thu Nov 8 16:58:12 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@dhcp47-183 ~]# cd /cgroup/memory/
[root@dhcp47-183 memory]# mkdir test
[root@dhcp47-183 memory]# cd test/
[root@dhcp47-183 test]# cat memory.limit_in_bytes
9223372036854775807
[root@dhcp47-183 test]# echo 1G > memory.limit_in_bytes
[root@dhcp47-183 test]# echo $$ > tasks
[root@dhcp47-183 test]# /common/lwoodman/code/memory 2G &
[1] 2330
[root@dhcp47-183 test]# size = 2147483648
mmaping 2147483648 anonymous bytes 7f56a4b8b000
touching 524288 pages

[root@dhcp47-183 test]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff  cache   si    so    bi    bo    in   cs us sy id wa st
 0  1 215912 6378476  30552 160880    0   194   181   201  1123  203  0  1 94  5  0
 0  1 249704 6380524  30552 160528    0 33792     0 33792 10134 4375  0  1 87 11  0
 0  1 283496 6379096  30552 160748    0 33792     0 33792 10062 4199  0  1 87 11  0
 0  1 317288 6379096  30552 160852    0 33792     0 33792 10119 4321  0  1 87 11  0
 0  1 367464 6380432  30552 160672    0 50176     0 50176 11019 6386  0  2 88 11  0
 0  1 401256 6378384  30552 160748    0 33792     0 33792 10066 4198  0  1 87 11  0
...
Success: faulting took 29.768751s

Going one step further, memory.memsw.limit_in_bytes limits the sum of swap & RAM:

[root@dhcp47-183 test]# echo 2G > memory.memsw.limit_in_bytes
[root@dhcp47-183 test]# /common/lwoodman/code/memory 2
size = 2147483648
mmaping 2147483648 anonymous bytes 7fba4f845000
touching 524288 pages
Killed
[root@dhcp47-183 test]# echo 2100M > memory.memsw.limit_in_bytes
[root@dhcp47-183 test]# /common/lwoodman/code/memory 2
size = 2147483648
mmaping 2147483648 anonymous bytes 7f68db5b4000
touching 524288 pages
Success: faulting took 29.237335s

I have updated the kernel to 2.6.32-345.el6.x86_64:

~~~
[root@rhel-6-x86_64 ~]# uname -a
Linux rhel-6-x86_64.virtualdomain 2.6.32-345.el6.x86_64 #1 SMP Wed Nov 28 21:10:19 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
~~~

But when I run my memory-bomb.py script, the results are the same as before: the process is killed (although it seems that some swapping is actually done):

~~~
[root@rhel-6-x86_64 ~]# ./memory-bomb.py > out & vmstat 1
[1] 1399
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si    so    bi    bo   in   cs us sy  id wa st
 1  0  25704 258124  12072 108676    0    64    77    73   52   27  2  1  97  1  0
 0  0  25704 254032  12072 108680    0     0     0     0   83   22  1  1  99  0  0
 0  0  25704 251552  12072 108680    0     0     0     0   61   22  1  0 100  0  0
 0  0  25704 247460  12080 108672    0     0     0    60   65   31  0  0  96  4  0
 0  0  25704 239276  12080 108672    0     0     0     0   70   27  0  1 100  0  0
 0  0   2024 246592  12080 108724    0     0     0     0   81   25  1  1  99  0  0
 0  0  25704 257628  12080 108636    0 25704     0 25736  181   57  1  4  94  0  2
 0  0  25704 257636  12080 108684    0     0     0     0   47   20  0  0 100  0  0
 0  0  25704 257760  12088 108676    0     0     0    76   60   27  0  0  96  5  0
^C
[1]+ Killed ./memory-bomb.py > out
[root@rhel-6-x86_64 ~]#
~~~

~~~
[root@rhel-6-x86_64 ]# cat out
pid: 1399
allocating 2 MB cgroup: ['total_rss 28422144', 'total_swap 0'], proc: ['VmRSS:\t 6088 kB', 'VmSwap:\t 0 kB']
allocating 4 MB cgroup: ['total_rss 30523392', 'total_swap 0'], proc: ['VmRSS:\t 8148 kB', 'VmSwap:\t 0 kB']
allocating 8 MB cgroup: ['total_rss 34717696', 'total_swap 0'], proc: ['VmRSS:\t 12244 kB', 'VmSwap:\t 0 kB']
allocating 16 MB cgroup: ['total_rss 43106304', 'total_swap 0'], proc: ['VmRSS:\t 20436 kB', 'VmSwap:\t 0 kB']
allocating 32 MB cgroup: ['total_rss 35635200', 'total_swap 0'], proc: ['VmRSS:\t 36820 kB', 'VmSwap:\t 0 kB']
allocating 64 MB[root@rhel-6-x86_64 ~]#
~~~

Could you try to run the script (it's included in the bz) so we can compare the results? Since both cases do almost the same thing, I don't understand why the results differ.

Martin, can you post the dmesg output so I can see the show_mem() output when the OOM kill occurred?

Thanks, Larry

Created attachment 658706 [details]
dmesg output
Here you can find the full dmesg output (I rebooted the machine and ran the memory bomb script).
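
The attached memory-bomb.py is not reproduced in this report. For anyone wanting to compare results without the attachment, a minimal sketch of a script that produces output in a similar shape might look like the following. It is inferred only from the `out` listing above and is not the reporter's script: the cgroup path, the page size, the doubling schedule, and every function name are assumptions.

```python
import os
import sys

# Assumed location: this report mounts the memory controller at /cgroup/memory
# and the earlier example uses a group named "test".
CGROUP_STAT = "/cgroup/memory/test/memory.stat"
PAGE_SIZE = 4096  # assumed page size


def grep_fields(text, prefixes):
    """Return the lines of `text` that start with one of `prefixes`."""
    return [line for line in text.splitlines()
            if line.startswith(tuple(prefixes))]


def read_text(path):
    with open(path) as f:
        return f.read()


def report(cgroup_stat=CGROUP_STAT):
    """Format cgroup-wide and per-process RSS/swap counters on one line."""
    cgroup = grep_fields(read_text(cgroup_stat), ["total_rss", "total_swap"])
    proc = grep_fields(read_text("/proc/self/status"), ["VmRSS", "VmSwap"])
    return "cgroup: %s, proc: %s" % (cgroup, proc)


def main(max_mb=1024):
    print("pid: %d" % os.getpid())
    size_mb = 2
    while size_mb <= max_mb:
        # Flush before allocating so this line survives an OOM kill.
        sys.stdout.write("allocating %d MB " % size_mb)
        sys.stdout.flush()
        buf = bytearray(size_mb * 1024 * 1024)
        for offset in range(0, len(buf), PAGE_SIZE):
            buf[offset] = 1  # write to every page so it is really backed
        print(report())
        size_mb *= 2  # the previous buffer is released when `buf` is rebound
```

To use it, add the shell to the cgroup's `tasks` file as in the earlier example and then call `main()`. Holding a single buffer of the current size, rather than accumulating buffers, is a guess based on VmRSS in the listing growing by roughly the size of each step. Note that the prefix match would also pick up any other counters whose names begin with `total_rss` or `total_swap`.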
The issue here is the memory.limit_in_bytes. If it's too small the process will be OOM-killed; if it's larger it will not be OOM-killed. I'll get to the bottom of why this is.

Larry

That is interesting. On my virtual machine, when memory.limit_in_bytes is larger than 160 MB the process won't be killed. Do you think that this limit is always the same, or does it depend on something?

I'm guessing it depends on there being enough pages of memory allowed in the cgroup so the page reclaim code can succeed in swapping at least something out before it gives up. BTW, this change in mem_cgroup_reclaim() fixes the problem, because for some reason I don't understand it bails out after only two iterations around the loop:

         /*
          * If nothing was reclaimed after two attempts, there
          * may be no reclaimable pages in this hierarchy.
          */
-        if (loop && !total)
-            break;
+//      if (loop && !total)
+//          break;
     }
     return total;
 }

This little piece of code is the same in RHEL6 as upstream; however, much of the memory cgroup code has been rewritten upstream and I don't like the idea of backporting all that code since it's so fragile to begin with. Also, BTW, does this work OK on the upstream kernel?

Larry

Martin, I am working on fixing this now. Does this work OK upstream or is this a RHEL6-only issue? Like I said earlier, the upstream kernel has a totally different memory cgroups reclaim codebase that I am concerned about backporting into RHEL6.

Also, I did zero in on the exact cause of the problem: mem_cgroup_reclaim() breaks out of the loop after 2 iterations if nothing is reclaimed. This happens when the cgroup is so small that every page is in writeback state and we enter this loop before the swap device gets a chance to complete any of the previous swapouts. If I either remove the test I commented out in comment #10 or change it to break out after several iterations (5 or 10) with no progress, we don't see the failure.

I need to verify whether it happens the same way upstream and go from there. If it does, I'll propose a fix upstream. If it doesn't, I'll analyze what in the upstream code prevents the failure and see if that's backportable.

Larry

Larry, by upstream do you mean the vanilla kernel from kernel.org? So far I have tested only the RHEL6 kernel (the problem occurs) and the Fedora 17 one (works OK), so I would guess that it's a RHEL6-only issue (nevertheless, the Fedora kernel has lots of distro-specific patches in it).

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

Larry, being curious, I have checked that, as expected, the feature works OK on the latest mainline kernel 3.7.0 (using the kernel from https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories on my machine running Fedora 17).

(In reply to comment #15)
> Larry, being curious I have checked that, as expected, the feature works ok
> on latest mainline kernel 3.7.0 (using kernel from
> https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories on my machine
> running fedora 17).

The upstream reclaim changes that Larry talked about were backported into RHEL6. Could you retry with a (.350 or) later kernel, please?

(In reply to comment #17)
> The upstream reclaim changes that Larry talked about were backported into
> RHEL6. Could you retry with a (.350 or) later kernel, please?

I have retried with 2.6.32-358.6.1.el6 on both i686 and x86_64 (using the python script from the description), and the problem is still here.

Patch(es) available on kernel-2.6.32-464.el6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1392.html
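
The two-attempt bail-out in mem_cgroup_reclaim() that this thread identifies as the cause can be illustrated with a toy model. The following is a minimal Python sketch, not kernel code: `make_swap_device`, `reclaim`, the `patience` and `max_loops` parameters, and the numbers used are all hypothetical, and stopping at the first progress simplifies the real exit conditions. It only shows why giving up after two fruitless passes fails when the first swap-outs need a little longer to complete.

```python
def make_swap_device(delay, pages):
    """Return a stand-in for one reclaim pass over the cgroup.

    The first `delay` passes free nothing because every page is still
    under writeback; each later pass frees `pages` pages.
    (Hypothetical helper, not a kernel interface.)
    """
    state = {"calls": 0}

    def try_to_free():
        state["calls"] += 1
        return pages if state["calls"] > delay else 0

    return try_to_free


def reclaim(try_to_free, patience=2, max_loops=100):
    """Toy version of the reclaim loop.

    Gives up once the first `patience` passes have freed nothing.
    patience=2 mirrors the `if (loop && !total) break;` test quoted in
    the thread; larger values model the "5 or 10 iterations" variant.
    max_loops is an arbitrary upper bound chosen for this sketch.
    """
    total = 0
    for loop in range(max_loops):
        total += try_to_free()
        if total:
            break  # made progress; the caller can retry the charge
        if loop + 1 >= patience:
            break  # no progress after `patience` passes: give up
    return total


slow = 3  # passes needed before the first swap-out completes (assumed)
print(reclaim(make_swap_device(slow, 32), patience=2))  # 0: nothing reclaimed
print(reclaim(make_swap_device(slow, 32), patience=5))  # 32: reclaim succeeds
```

In this model a return value of 0 corresponds to the case where the caller sees no reclaimable memory and falls through to the OOM killer, which matches the behaviour reported for small values of memory.limit_in_bytes.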