Bug 870011
| Summary: | Process constrained by CGroup memory.limit_in_bytes is killed instead of being swapped | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Martin Bukatovic <mbukatov> |
| Component: | kernel | Assignee: | Johannes Weiner <jweiner> |
| Status: | CLOSED ERRATA | QA Contact: | Li Wang <liwan> |
| Severity: | high | Priority: | high |
| Version: | 6.3 | CC: | ccui, lwang, lwoodman, matt, mbukatov, sgraf, tlavigne |
| Target Milestone: | rc | Hardware: | All |
| OS: | Linux | Type: | Bug |
| Fixed In Version: | kernel-2.6.32-464.el6 | Doc Type: | Bug Fix |
| Last Closed: | 2014-10-14 05:11:00 UTC | Bug Blocks: | 525237, 876610 |
I just tested this on my RHEL6 system and it works correctly.

~~~
[root@dhcp47-183 ~]# uname -a
Linux dhcp47-183.lab.bos.redhat.com 2.6.32.339 #9 SMP Thu Nov 8 16:58:12 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@dhcp47-183 ~]# cd /cgroup/memory/
[root@dhcp47-183 memory]# mkdir test
[root@dhcp47-183 memory]# cd test/
[root@dhcp47-183 test]# cat memory.limit_in_bytes
9223372036854775807
[root@dhcp47-183 test]# echo 1G > memory.limit_in_bytes
[root@dhcp47-183 test]# echo $$ > tasks
[root@dhcp47-183 test]# /common/lwoodman/code/memory 2G &
[1] 2330
[root@dhcp47-183 test]# size = 2147483648
mmaping 2147483648 anonymous bytes
7f56a4b8b000
touching 524288 pages
[root@dhcp47-183 test]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff  cache   si    so    bi    bo    in    cs us sy id wa st
 0  1 215912 6378476  30552 160880    0   194   181   201  1123   203  0  1 94  5  0
 0  1 249704 6380524  30552 160528    0 33792     0 33792 10134  4375  0  1 87 11  0
 0  1 283496 6379096  30552 160748    0 33792     0 33792 10062  4199  0  1 87 11  0
 0  1 317288 6379096  30552 160852    0 33792     0 33792 10119  4321  0  1 87 11  0
 0  1 367464 6380432  30552 160672    0 50176     0 50176 11019  6386  0  2 88 11  0
 0  1 401256 6378384  30552 160748    0 33792     0 33792 10066  4198  0  1 87 11  0
...
Success: faulting took 29.768751s
~~~

Going one step further, memory.memsw.limit_in_bytes limits the sum of swap & RAM:

~~~
[root@dhcp47-183 test]# echo 2G > memory.memsw.limit_in_bytes
[root@dhcp47-183 test]# /common/lwoodman/code/memory 2
size = 2147483648
mmaping 2147483648 anonymous bytes
7fba4f845000
touching 524288 pages
Killed
[root@dhcp47-183 test]# echo 2100M > memory.memsw.limit_in_bytes
[root@dhcp47-183 test]# /common/lwoodman/code/memory 2
size = 2147483648
mmaping 2147483648 anonymous bytes
7f68db5b4000
touching 524288 pages
Success: faulting took 29.237335s
~~~

I have updated the kernel to 2.6.32-345.el6.x86_64:

~~~
[root@rhel-6-x86_64 ~]# uname -a
Linux rhel-6-x86_64.virtualdomain 2.6.32-345.el6.x86_64 #1 SMP Wed Nov 28 21:10:19 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
~~~

But when I run my memory-bomb.py script, the results are the same as before: the process is killed (nevertheless it seems that some swapping is actually done):

~~~
[root@rhel-6-x86_64 ~]# ./memory-bomb.py > out & vmstat 1
[1] 1399
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b  swpd   free   buff  cache   si    so    bi    bo   in   cs us sy id wa st
 1  0 25704 258124  12072 108676    0    64    77    73   52   27  2  1 97  1  0
 0  0 25704 254032  12072 108680    0     0     0     0   83   22  1  1 99  0  0
 0  0 25704 251552  12072 108680    0     0     0     0   61   22  1  0 100 0  0
 0  0 25704 247460  12080 108672    0     0     0    60   65   31  0  0 96  4  0
 0  0 25704 239276  12080 108672    0     0     0     0   70   27  0  1 100 0  0
 0  0  2024 246592  12080 108724    0     0     0     0   81   25  1  1 99  0  0
 0  0 25704 257628  12080 108636    0 25704     0 25736  181   57  1  4 94  0  2
 0  0 25704 257636  12080 108684    0     0     0     0   47   20  0  0 100 0  0
 0  0 25704 257760  12088 108676    0     0     0    76   60   27  0  0 96  5  0
^C
[1]+  Killed                  ./memory-bomb.py > out
[root@rhel-6-x86_64 ~]#
~~~

~~~
[root@rhel-6-x86_64 ]# cat out
pid: 1399
allocating    2 MB cgroup: ['total_rss 28422144', 'total_swap 0'], proc: ['VmRSS:\t    6088 kB', 'VmSwap:\t       0 kB']
allocating    4 MB cgroup: ['total_rss 30523392', 'total_swap 0'], proc: ['VmRSS:\t    8148 kB', 'VmSwap:\t       0 kB']
allocating    8 MB cgroup: ['total_rss 34717696', 'total_swap 0'], proc: ['VmRSS:\t   12244 kB', 'VmSwap:\t       0 kB']
allocating   16 MB cgroup: ['total_rss 43106304', 'total_swap 0'], proc: ['VmRSS:\t   20436 kB', 'VmSwap:\t       0 kB']
allocating   32 MB cgroup: ['total_rss 35635200', 'total_swap 0'], proc: ['VmRSS:\t   36820 kB', 'VmSwap:\t       0 kB']
allocating   64 MB
[root@rhel-6-x86_64 ~]#
~~~

Could you try to run the script (it's included in the bz) so we can compare the results? Since both cases do almost the same thing, I don't understand why the results differ.

Martin, can you post the dmesg output so I can see the show_mem() output when the OOM kill occurred?

Thanks, Larry

Created attachment 658706 [details]
dmesg output
Here you can find the full dmesg output (I rebooted the machine and ran the memory bomb script).
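As a rough model of the two knobs exercised in the tests above: memory.limit_in_bytes caps a cgroup's RAM (RSS) charge, while memory.memsw.limit_in_bytes caps RAM plus swap together, so the room left for swapping charges out is their difference. A minimal sketch of that arithmetic (the helper name is illustrative, not a kernel or libcgroup API):

```python
def swap_headroom(mem_limit, memsw_limit):
    """Bytes of swap a memory cgroup may use once RSS hits mem_limit.

    memory.limit_in_bytes caps RSS; memory.memsw.limit_in_bytes caps
    RSS + swap, so their difference is the room left for swapped pages.
    """
    if memsw_limit < mem_limit:
        raise ValueError("memsw limit must be >= memory limit")
    return memsw_limit - mem_limit

GB = 2 ** 30
MB = 2 ** 20

# Larry's second test above: a 1G RAM limit with a 2G RAM+swap limit
# leaves exactly 1G of swap headroom for the 2G working set, and the
# run at that boundary was OOM-killed; raising memsw to 2100M succeeded.
print(swap_headroom(1 * GB, 2 * GB))      # headroom with the 2G memsw limit
print(swap_headroom(1 * GB, 2100 * MB))   # headroom with the 2100M memsw limit
```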
The issue here is memory.limit_in_bytes. If it's too small the process will be OOM-killed; if it's larger it will not be. I'll get to the bottom of why this is.

Larry

That is interesting. On my virtual machine, when memory.limit_in_bytes is larger than 160 MB the process won't be killed. Do you think this limit is always the same, or does it depend on something?
I'm guessing it depends on there being enough pages of memory allowed in the cgroup so the page reclaim code can succeed in swapping at least something out before it gives up. BTW, this change in mem_cgroup_reclaim() fixes the problem; for some reason I don't understand, it bails out after only two iterations around the loop:
~~~
		/*
		 * If nothing was reclaimed after two attempts, there
		 * may be no reclaimable pages in this hierarchy.
		 */
-		if (loop && !total)
-			break;
+//		if (loop && !total)
+//			break;
	}
	return total;
}
~~~
This little piece of code is the same in RHEL6 as upstream; however, much of the memory cgroup has been rewritten upstream, and I don't like the idea of backporting all that code since it's so fragile to begin with.
Also, BTW, does this work OK on the upstream kernel?
Larry
Martin, I am working on fixing this now. Does this work OK upstream, or is this a RHEL6-only issue? Like I said earlier, the upstream kernel has a totally different memory cgroups reclaim codebase that I am concerned about backporting into RHEL6.

Also, I did zero in on the exact cause of the problem: mem_cgroup_reclaim() breaks out of the loop after 2 iterations if nothing is reclaimed. This happens when the cgroup is so small that every page is in writeback state and we enter this loop before the swap device gets a chance to complete any of the previous swapouts. If I either remove the test I commented out in comment #10, or change it to break out after several iterations (5 or 10) with no progress, we don't see the failure.

I need to verify whether the same thing happens upstream and go from there. If it does, I'll propose a fix upstream. If it doesn't, I'll analyze what in the upstream code prevents the failure and see if that's backportable.

Larry

Larry, by upstream do you mean the vanilla kernel from kernel.org? So far I have tested only the RHEL6 kernel (where the problem occurs) and the Fedora 17 one (which works OK), so I would guess that it's a RHEL6-only issue (nevertheless, the Fedora kernel carries lots of distro-specific patches).

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

Larry, being curious, I have checked that, as expected, the feature works OK on the latest mainline kernel 3.7.0 (using a kernel from https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories on my machine running Fedora 17).
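Larry's diagnosis above — mem_cgroup_reclaim() giving up after two fruitless passes while every page in the small cgroup is still under writeback — can be modeled outside the kernel. The following is an illustrative Python sketch of the loop policy only, not kernel code: the per-pass reclaim counts are made up, and `patience` stands in for the hard-coded two-attempt cutoff versus the proposed 5-10 attempt tolerance.

```python
def total_reclaimed(passes, max_loops, patience):
    """Model of the mem_cgroup_reclaim() retry loop.

    passes    -- pages each reclaim attempt would free (0 while the
                 cgroup's pages are all stuck in writeback)
    max_loops -- upper bound on loop iterations
    patience  -- consecutive no-progress loops tolerated before
                 concluding there is nothing reclaimable
    """
    total = 0
    fruitless = 0
    for loop in range(max_loops):
        freed = passes[loop] if loop < len(passes) else 0
        total += freed
        if freed == 0 and total == 0:
            fruitless += 1
            if fruitless >= patience:
                # Give up: the caller falls through to the OOM killer.
                break
    return total

# The first two attempts find every page under writeback (0 reclaimed);
# by the third, the swap device has completed some I/O and reclaim works.
passes = [0, 0, 32, 32, 32]

print(total_reclaimed(passes, 100, patience=2))   # early cutoff: 0 reclaimed -> OOM kill
print(total_reclaimed(passes, 100, patience=10))  # more patience: reclaim succeeds
```

With the original two-attempt cutoff the model reclaims nothing, matching the observed OOM kill; tolerating a few more fruitless passes lets the in-flight swapouts complete, matching Larry's patched behavior.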
(In reply to comment #15)
> Larry, being curious I have checked that, as expected, the feature works ok
> on latest mainline kernel 3.7.0 (using kernel from
> https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories on my machine
> running fedora 17).

The upstream reclaim changes that Larry talked about were backported into RHEL6. Could you retry with a (.350 or) later kernel, please?

(In reply to comment #17)
> The upstream reclaim changes that Larry talked about were backported into
> RHEL6. Could you retry with a (.350 or) later kernel, please?

I have retried with 2.6.32-358.6.1.el6 on both i686 and x86_64 (using the python script from the description), and the problem is still there.

Patch(es) available on kernel-2.6.32-464.el6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1392.html
Description of problem:

When the RSS of a process is limited using the control file 'memory.limit_in_bytes' of the cgroup memory controller, the constrained process is killed when the limit is breached instead of being swapped.

Version-Release number of selected component (if applicable):

~~~
# rpm -q kernel libcgroup
kernel-2.6.32-279.11.1.el6.i686
libcgroup-0.37-4.el6.i686
~~~

How reproducible:

Always (tested on both i386 and x86_64 RHEL6 virtual machines as well as on a physical machine with RHEL6 x86_64).

Steps to Reproduce:
1. Run the following python script as root:

~~~
#!/usr/bin/env python2
# -*- coding: utf8 -*-
import time
import sys
import os
import re

if os.path.exists("/cgroup/"):
    CGROUP = "/cgroup/memory/memory-bomb"  # rhel
else:
    CGROUP = "/sys/fs/cgroup/memory/memory-bomb"  # fedora

MB = 2**20
RE_CGROUP = re.compile("^total_(swap|rss)")
RE_STATUS = re.compile("^Vm(Swap|RSS)")

def setup_cgroup():
    if not os.path.isdir(CGROUP):
        print "initializing cgroup %s" % CGROUP
        os.makedirs(CGROUP)
        limit = open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w")
        limit.write(str(50*MB))

def join_cgroup():
    tasks = open(os.path.join(CGROUP, "tasks"), "w")
    tasks.write(str(os.getpid()))

def check_cgroup():
    stat = open(os.path.join(CGROUP, "memory.stat"))
    list = filter(lambda x: RE_CGROUP.match(x), stat.readlines())
    return [x.rstrip() for x in list]

def check_proc():
    status = open("/proc/self/status")
    list = filter(lambda x: RE_STATUS.match(x), status.readlines())
    return [x.rstrip() for x in list]

def main(argv=None):
    if len(argv) > 1:
        sleep_time = float(argv[1])
    else:
        sleep_time = 1
    print "pid: %d" % os.getpid()
    try:
        setup_cgroup()
        join_cgroup()
    except:
        print "can't join cgroup %s" % CGROUP
    for i in xrange(1, 10):
        space = 2**i
        print "allocating %4d MB" % space,
        sys.stdout.flush()
        alloc = "x" * (space*MB)
        print "cgroup: %s,\tproc: %s" % (check_cgroup(), check_proc())
        time.sleep(sleep_time)

if __name__ == '__main__':
    sys.exit(main(sys.argv))
~~~

Actual results:

The process is killed immediately when
the memory limit is breached:

~~~
pid: 1422
allocating    2 MB cgroup: ['total_rss 28405760', 'total_swap 0'], proc: ['VmRSS:\t    5232 kB', 'VmSwap:\t       0 kB']
allocating    4 MB cgroup: ['total_rss 30507008', 'total_swap 0'], proc: ['VmRSS:\t    7280 kB', 'VmSwap:\t       0 kB']
allocating    8 MB cgroup: ['total_rss 34701312', 'total_swap 0'], proc: ['VmRSS:\t   11376 kB', 'VmSwap:\t       0 kB']
allocating   16 MB cgroup: ['total_rss 43089920', 'total_swap 0'], proc: ['VmRSS:\t   19568 kB', 'VmSwap:\t       0 kB']
allocating   32 MB cgroup: ['total_rss 35618816', 'total_swap 0'], proc: ['VmRSS:\t   35952 kB', 'VmSwap:\t       0 kB']
allocating   64 MBKilled
~~~

Expected results:

The process is not killed, but forced to use swap instead. This happens when I run the mentioned script on Fedora 17:

~~~
# ./memory-bomb.py
pid: 28513
allocating    2 MB cgroup: ['total_rss 2101248'], proc: ['VmRSS:\t   10080 kB', 'VmSwap:\t       0 kB']
allocating    4 MB cgroup: ['total_rss 4198400'], proc: ['VmRSS:\t   12132 kB', 'VmSwap:\t       0 kB']
allocating    8 MB cgroup: ['total_rss 8392704'], proc: ['VmRSS:\t   16228 kB', 'VmSwap:\t       0 kB']
allocating   16 MB cgroup: ['total_rss 16781312'], proc: ['VmRSS:\t   24420 kB', 'VmSwap:\t       0 kB']
allocating   32 MB cgroup: ['total_rss 33558528'], proc: ['VmRSS:\t   40804 kB', 'VmSwap:\t       0 kB']
allocating   64 MB cgroup: ['total_rss 48164864'], proc: ['VmRSS:\t   44828 kB', 'VmSwap:\t   28744 kB']
allocating  128 MB cgroup: ['total_rss 52326400'], proc: ['VmRSS:\t   47096 kB', 'VmSwap:\t   92012 kB']
allocating  256 MB cgroup: ['total_rss 52342784'], proc: ['VmRSS:\t   44528 kB', 'VmSwap:\t  225652 kB']
allocating  512 MB cgroup: ['total_rss 52412416'], proc: ['VmRSS:\t   44648 kB', 'VmSwap:\t  487676 kB']
~~~

Additional info:

When the OOM killer is disabled for the cgroup, the process is not killed:

~~~
# cd /cgroup/memory/memory-bomb
# echo 1 > memory.oom_control
~~~

The OOM killer is not triggered by insufficient memory:

~~~
# free -m
             total       used       free     shared    buffers     cached
Mem:           499         81        417          0          2         11
-/+ buffers/cache:          68        430
Swap:         1023         36        987
~~~

Also, other memory limits have not been changed:

~~~
# cd /cgroup/memory/memory-bomb
# cat memory.memsw.limit_in_bytes
9223372036854775807
~~~

which means there should be no problem with forcing the process to use swap.
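The expected versus actual behavior above can be distinguished mechanically from the VmRSS/VmSwap lines the reproducer prints: on Fedora, swap grows while RSS stays pinned near the 50 MB cgroup limit; on RHEL6, VmSwap never leaves zero before the kill. A small sketch for parsing those lines (the helper names here are ours, not part of the reproducer script):

```python
import re

# Matches /proc/<pid>/status lines such as "VmSwap:\t   28744 kB".
VM_RE = re.compile(r"^Vm(RSS|Swap):\s+(\d+)\s+kB")

def parse_vm_kb(status_lines):
    """Map VmRSS/VmSwap status lines to integer kB values."""
    out = {}
    for line in status_lines:
        m = VM_RE.match(line)
        if m:
            out["Vm" + m.group(1)] = int(m.group(2))
    return out

def swapped_instead_of_killed(status_lines):
    """True when pages actually went to swap, the behavior this bug expects."""
    vm = parse_vm_kb(status_lines)
    return vm.get("VmSwap", 0) > 0

# Samples taken from the two runs above.
fedora = ["VmRSS:\t   44828 kB", "VmSwap:\t   28744 kB"]
rhel6  = ["VmRSS:\t   35952 kB", "VmSwap:\t       0 kB"]
print(swapped_instead_of_killed(fedora))  # True
print(swapped_instead_of_killed(rhel6))   # False
```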