Bug 870011

Summary: Process constrained by CGroup memory.limit_in_bytes is killed instead of being swapped
Product: Red Hat Enterprise Linux 6
Reporter: Martin Bukatovic <mbukatov>
Component: kernel
Assignee: Johannes Weiner <jweiner>
Status: CLOSED ERRATA
QA Contact: Li Wang <liwan>
Severity: high
Priority: high
Version: 6.3
CC: ccui, lwang, lwoodman, matt, mbukatov, sgraf, tlavigne
Target Milestone: rc
Hardware: All
OS: Linux
Fixed In Version: kernel-2.6.32-464.el6
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-10-14 05:11:00 UTC
Bug Blocks: 525237, 876610
Attachments: dmesg output (no flags)

Description Martin Bukatovic 2012-10-25 11:09:46 UTC
Description of problem:

When the RSS of a process is limited via the 'memory.limit_in_bytes' control file of the cgroup memory controller, the constrained process is killed when the limit is breached instead of being forced to swap.

Version-Release number of selected component (if applicable):

# rpm -q kernel libcgroup
kernel-2.6.32-279.11.1.el6.i686
libcgroup-0.37-4.el6.i686

How reproducible:

always (tested on i386 and x86_64 RHEL 6 virtual machines as well as on a physical RHEL 6 x86_64 machine)

Steps to Reproduce:
1. Run the following Python script as root:

~~~
#!/usr/bin/env python2
# -*- coding: utf8 -*-

import time
import sys
import os
import re

if os.path.exists("/cgroup/"):
    CGROUP = "/cgroup/memory/memory-bomb" # rhel
else:
    CGROUP = "/sys/fs/cgroup/memory/memory-bomb" # fedora

MB = 2**20
RE_CGROUP = re.compile("^total_(swap|rss)")
RE_STATUS = re.compile("^Vm(Swap|RSS)")

def setup_cgroup():
    if not os.path.isdir(CGROUP):
        print "initializing cgroup %s" % CGROUP
        os.makedirs(CGROUP)
        limit = open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w")
        limit.write(str(50*MB))  # 50 MB hard limit on the group's memory
        limit.close()

def join_cgroup():
    # move this process into the cgroup by writing its pid to 'tasks'
    tasks = open(os.path.join(CGROUP, "tasks"), "w")
    tasks.write(str(os.getpid()))
    tasks.close()

def check_cgroup():
    # report the cgroup's total_rss and total_swap counters
    stat = open(os.path.join(CGROUP, "memory.stat"))
    lines = filter(lambda x: RE_CGROUP.match(x), stat.readlines())
    return [x.rstrip() for x in lines]

def check_proc():
    # report this process's VmRSS and VmSwap from /proc
    status = open("/proc/self/status")
    lines = filter(lambda x: RE_STATUS.match(x), status.readlines())
    return [x.rstrip() for x in lines]

def main(argv=None):
    if argv is None:
        argv = sys.argv
    if len(argv) > 1:
        sleep_time = float(argv[1])
    else:
        sleep_time = 1
    print "pid: %d" % os.getpid()
    try:
        setup_cgroup()
        join_cgroup()
    except (OSError, IOError):
        print "can't join cgroup %s" % CGROUP
    for i in xrange(1, 10):
        space = 2**i
        print "allocating %4d MB" % space,
        sys.stdout.flush()
        alloc = "x" * (space*MB)  # previous string is freed on rebind
        print "cgroup: %s,\tproc: %s" % (check_cgroup(), check_proc())
        time.sleep(sleep_time)

if __name__ == '__main__':
    sys.exit(main(sys.argv))
~~~
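
For reference, the manual equivalent of the cgroup setup the script performs would look like this (a minimal sketch, assuming the memory controller is mounted at /cgroup/memory as on RHEL 6):

~~~
# mkdir /cgroup/memory/memory-bomb
# echo 50M > /cgroup/memory/memory-bomb/memory.limit_in_bytes
# echo $$ > /cgroup/memory/memory-bomb/tasks   # current shell joins the group
~~~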

Actual results:

The process is killed immediately once the memory limit is breached:

~~~
pid: 1422
allocating    2 MB cgroup: ['total_rss 28405760', 'total_swap 0'],      proc: ['VmRSS:\t    5232 kB', 'VmSwap:\t       0 kB']
allocating    4 MB cgroup: ['total_rss 30507008', 'total_swap 0'],      proc: ['VmRSS:\t    7280 kB', 'VmSwap:\t       0 kB']
allocating    8 MB cgroup: ['total_rss 34701312', 'total_swap 0'],      proc: ['VmRSS:\t   11376 kB', 'VmSwap:\t       0 kB']
allocating   16 MB cgroup: ['total_rss 43089920', 'total_swap 0'],      proc: ['VmRSS:\t   19568 kB', 'VmSwap:\t       0 kB']
allocating   32 MB cgroup: ['total_rss 35618816', 'total_swap 0'],      proc: ['VmRSS:\t   35952 kB', 'VmSwap:\t       0 kB']
allocating   64 MBKilled
~~~

Expected results:

The process is not killed, but forced to use swap instead. This is what happens
when I run the same script on Fedora 17:

~~~
# ./memory-bomb.py 
pid: 28513
allocating    2 MB cgroup: ['total_rss 2101248'],       proc: ['VmRSS:\t   10080 kB', 'VmSwap:\t       0 kB']
allocating    4 MB cgroup: ['total_rss 4198400'],       proc: ['VmRSS:\t   12132 kB', 'VmSwap:\t       0 kB']
allocating    8 MB cgroup: ['total_rss 8392704'],       proc: ['VmRSS:\t   16228 kB', 'VmSwap:\t       0 kB']
allocating   16 MB cgroup: ['total_rss 16781312'],      proc: ['VmRSS:\t   24420 kB', 'VmSwap:\t       0 kB']
allocating   32 MB cgroup: ['total_rss 33558528'],      proc: ['VmRSS:\t   40804 kB', 'VmSwap:\t       0 kB']
allocating   64 MB cgroup: ['total_rss 48164864'],      proc: ['VmRSS:\t   44828 kB', 'VmSwap:\t   28744 kB']
allocating  128 MB cgroup: ['total_rss 52326400'],      proc: ['VmRSS:\t   47096 kB', 'VmSwap:\t   92012 kB']
allocating  256 MB cgroup: ['total_rss 52342784'],      proc: ['VmRSS:\t   44528 kB', 'VmSwap:\t  225652 kB']
allocating  512 MB cgroup: ['total_rss 52412416'],      proc: ['VmRSS:\t   44648 kB', 'VmSwap:\t  487676 kB']
~~~

Additional info:

When the OOM killer is disabled for the cgroup, the process is not killed:

~~~
# cd /cgroup/memory/memory-bomb
# echo 1 > memory.oom_control
~~~
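
The setting can be read back from the same file (a quick check, assuming the RHEL 6 cgroup v1 layout):

~~~
# cat memory.oom_control
oom_kill_disable 1
under_oom 0
~~~

With oom_kill_disable set, tasks that breach the limit are paused until memory is freed rather than killed, which matches the behavior above.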

The OOM killer is not triggered by a system-wide memory shortage; there is plenty of free memory:

~~~
# free -m
             total       used       free     shared    buffers     cached
Mem:           499         81        417          0          2         11
-/+ buffers/cache:         68        430
Swap:         1023         36        987
~~~

Also, the other memory limits have not been changed:

~~~
# cd /cgroup/memory/memory-bomb
# cat memory.memsw.limit_in_bytes
9223372036854775807
~~~

which means there should be no problem with forcing the process to use swap.
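
For completeness, the swap headroom itself could also be capped explicitly (a sketch; the 1G value is an arbitrary example, and memory.memsw.limit_in_bytes must stay greater than or equal to memory.limit_in_bytes):

~~~
# cd /cgroup/memory/memory-bomb
# echo 1G > memory.memsw.limit_in_bytes   # cap on RAM + swap combined
~~~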

Comment 3 Larry Woodman 2012-12-03 16:02:44 UTC
I just tested this on my RHEL6 system and it works correctly.

[root@dhcp47-183 ~]# uname -a
Linux dhcp47-183.lab.bos.redhat.com 2.6.32.339 #9 SMP Thu Nov 8 16:58:12 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@dhcp47-183 ~]# cd /cgroup/memory/
[root@dhcp47-183 memory]# mkdir test
[root@dhcp47-183 memory]# cd test/
[root@dhcp47-183 test]# cat memory.limit_in_bytes 
9223372036854775807
[root@dhcp47-183 test]# echo 1G > memory.limit_in_bytes
[root@dhcp47-183 test]# echo $$ > tasks 
[root@dhcp47-183 test]# /common/lwoodman/code/memory 2G &
[1] 2330
[root@dhcp47-183 test]# size = 2147483648
mmaping 2147483648 anonymous bytes
7f56a4b8b000
touching 524288 pages

[root@dhcp47-183 test]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  1 215912 6378476  30552 160880    0  194   181   201 1123  203  0  1 94  5  0	
 0  1 249704 6380524  30552 160528    0 33792     0 33792 10134 4375  0  1 87 11  0	
 0  1 283496 6379096  30552 160748    0 33792     0 33792 10062 4199  0  1 87 11  0	
 0  1 317288 6379096  30552 160852    0 33792     0 33792 10119 4321  0  1 87 11  0	
 0  1 367464 6380432  30552 160672    0 50176     0 50176 11019 6386  0  2 88 11  0	
 0  1 401256 6378384  30552 160748    0 33792     0 33792 10066 4198  0  1 87 11  0	
...

Success: faulting took 29.768751s

Comment 4 Larry Woodman 2012-12-03 16:05:12 UTC
Going one step further, memory.memsw.limit_in_bytes limits the sum of swap and RAM. With the combined cap at exactly 2G, the 2G allocation plus the process's other memory cannot fit, so it is OOM-killed; a slightly larger cap (2100M) lets it complete:

[root@dhcp47-183 test]# echo 2G > memory.memsw.limit_in_bytes
[root@dhcp47-183 test]# /common/lwoodman/code/memory 2
size = 2147483648
mmaping 2147483648 anonymous bytes
7fba4f845000
touching 524288 pages

Killed


[root@dhcp47-183 test]# echo 2100M > memory.memsw.limit_in_bytes
[root@dhcp47-183 test]# /common/lwoodman/code/memory 2
size = 2147483648
mmaping 2147483648 anonymous bytes
7f68db5b4000
touching 524288 pages


Success: faulting took 29.237335s

Comment 5 Martin Bukatovic 2012-12-06 12:28:03 UTC
I have updated kernel to 2.6.32-345.el6.x86_64:

~~~
[root@rhel-6-x86_64 ~]# uname -a
Linux rhel-6-x86_64.virtualdomain 2.6.32-345.el6.x86_64 #1 SMP Wed Nov 28 21:10:19 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
~~~

But when I run my memory-bomb.py script, the results are the same as before:
the process is killed (although it seems that some swapping actually happens):

~~~
[root@rhel-6-x86_64 ~]# ./memory-bomb.py > out & vmstat 1
[1] 1399
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  25704 258124  12072 108676    0   64    77    73   52   27  2  1 97  1  0
 0  0  25704 254032  12072 108680    0    0     0     0   83   22  1  1 99  0  0
 0  0  25704 251552  12072 108680    0    0     0     0   61   22  1  0 100  0  0
 0  0  25704 247460  12080 108672    0    0     0    60   65   31  0  0 96  4  0
 0  0  25704 239276  12080 108672    0    0     0     0   70   27  0  1 100  0  0
 0  0   2024 246592  12080 108724    0    0     0     0   81   25  1  1 99  0  0
 0  0  25704 257628  12080 108636    0 25704     0 25736  181   57  1  4 94  0  2
 0  0  25704 257636  12080 108684    0    0     0     0   47   20  0  0 100  0  0
 0  0  25704 257760  12088 108676    0    0     0    76   60   27  0  0 96  5  0
^C
[1]+  Killed                  ./memory-bomb.py > out
[root@rhel-6-x86_64 ~]#
~~~

~~~
[root@rhel-6-x86_64 ]# cat out
pid: 1399
allocating    2 MB cgroup: ['total_rss 28422144', 'total_swap 0'],      proc: ['VmRSS:\t    6088 kB', 'VmSwap:\t       0 kB']
allocating    4 MB cgroup: ['total_rss 30523392', 'total_swap 0'],      proc: ['VmRSS:\t    8148 kB', 'VmSwap:\t       0 kB']
allocating    8 MB cgroup: ['total_rss 34717696', 'total_swap 0'],      proc: ['VmRSS:\t   12244 kB', 'VmSwap:\t       0 kB']
allocating   16 MB cgroup: ['total_rss 43106304', 'total_swap 0'],      proc: ['VmRSS:\t   20436 kB', 'VmSwap:\t       0 kB']
allocating   32 MB cgroup: ['total_rss 35635200', 'total_swap 0'],      proc: ['VmRSS:\t   36820 kB', 'VmSwap:\t       0 kB']
allocating   64 MB[root@rhel-6-x86_64 ~]#
~~~

Could you try to run the script (it's included in the bz) so we can compare the
results? Since both cases do almost the same thing, I don't understand why
the results differ.

Comment 6 Larry Woodman 2012-12-06 12:30:57 UTC
Martin, can you post the dmesg output so I can see the show_mem() output from when the OOM kill occurred?

Thanks, Larry

Comment 7 Martin Bukatovic 2012-12-06 12:45:22 UTC
Created attachment 658706 [details]
dmesg output

Here you can find the full dmesg output (I rebooted the machine and ran the memory-bomb script).

Comment 8 Larry Woodman 2012-12-06 16:11:25 UTC
The issue here is memory.limit_in_bytes. If it's too small the process will be OOM-killed; if it's larger it will not be. I'll get to the bottom of why this is.

Larry

Comment 9 Martin Bukatovic 2012-12-07 11:22:02 UTC
That is interesting. On my virtual machine, when memory.limit_in_bytes is larger than 160 MB the process won't be killed. Do you think that this threshold is always the same or depends on something?
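
If it would help, the threshold could be narrowed down by sweeping the limit (a rough sketch, assuming the memory-bomb cgroup from the reproducer already exists):

~~~
# for lim in 64M 96M 128M 160M 192M 256M; do
>   echo $lim > /cgroup/memory/memory-bomb/memory.limit_in_bytes
>   ./memory-bomb.py > /dev/null 2>&1 && echo "$lim: completed" || echo "$lim: killed"
> done
~~~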

Comment 10 Larry Woodman 2012-12-07 11:53:25 UTC
I'm guessing it depends on there being enough pages of memory allowed in the cgroup so that the page reclaim code can succeed in swapping at least something out before it gives up.  BTW, this change in mem_cgroup_reclaim() fixes the problem, because for some reason I don't understand it bails out after only two iterations around the loop:

~~~
                /*
                 * If nothing was reclaimed after two attempts, there
                 * may be no reclaimable pages in this hierarchy.
                 */
-               if (loop && !total)
-                       break;
+//             if (loop && !total)
+//                     break;
        }
        return total;
 }
~~~

This little piece of code is the same in RHEL6 as upstream; however, much of the memory cgroup code has been rewritten upstream, and I don't like the idea of backporting all of that since it's so fragile to begin with.

Also, BTW, does this work OK on the upstream kernel?

Larry

Comment 12 Larry Woodman 2012-12-13 15:55:01 UTC
Martin, I am working on fixing this now.  Does this work OK upstream, or is this a RHEL6-only issue?  Like I said earlier, the upstream kernel has a totally different memory cgroup reclaim codebase that I am concerned about backporting into RHEL6.  Also, I did zero in on the exact cause of the problem: mem_cgroup_reclaim() breaks out of the loop after 2 iterations if nothing is reclaimed.  This happens when the cgroup is so small that every page is in writeback state and we enter this loop before the swap device gets a chance to complete any of the previous swapouts.  If I either remove the test I commented out in comment #10, or change it to break out only after several iterations (5 or 10) with no progress, we don't see the failure.
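
(Roughly along these lines -- a sketch of the second option against the snippet quoted in comment #10, not the actual patch:)

~~~
-               if (loop && !total)
+               if (loop > 5 && !total)
                        break;
~~~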

I need to verify whether the same thing happens upstream and go from there.  If it does, I'll propose a fix upstream.  If it doesn't, I'll analyze what in the upstream code prevents the failure and see if that's backportable.

Larry

Comment 13 Martin Bukatovic 2012-12-13 16:33:48 UTC
Larry, by upstream do you mean the vanilla kernel from kernel.org? So far I have tested only the RHEL6 kernel (where the problem occurs) and the Fedora 17 one (which works OK), so I would guess that it's a RHEL6-only issue (although the Fedora kernel carries lots of distro-specific patches).

Comment 14 RHEL Program Management 2012-12-17 06:47:59 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 15 Martin Bukatovic 2012-12-18 14:12:31 UTC
Larry, being curious, I have checked that, as expected, the feature works OK on the latest mainline kernel, 3.7.0 (using a kernel from https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories on my machine running Fedora 17).

Comment 16 RHEL Program Management 2012-12-22 06:47:37 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 17 Johannes Weiner 2013-05-13 19:35:24 UTC
(In reply to comment #15)
> Larry, being curious I have checked that, as expected, the feature works ok
> on latest mainline kernel 3.7.0 (using kernel from
> https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories on my machine
> running fedora 17).

The upstream reclaim changes that Larry talked about were backported into RHEL6.  Could you retry with a (.350 or) later kernel, please?

Comment 18 Martin Bukatovic 2013-05-14 15:45:27 UTC
(In reply to comment #17)
> (In reply to comment #15)
> > Larry, being curious I have checked that, as expected, the feature works ok
> > on latest mainline kernel 3.7.0 (using kernel from
> > https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories on my machine
> > running fedora 17).
> 
> The upstream reclaim changes that Larry talked about were backported into
> RHEL6.  Could you retry with a (.350 or) later kernel, please?

I have retried with 2.6.32-358.6.1.el6 on both i686 and x86_64 (using the Python script from the description), and the problem is still there.

Comment 27 Rafael Aquini 2014-05-13 14:24:16 UTC
Patch(es) available on kernel-2.6.32-464.el6

Comment 33 errata-xmlrpc 2014-10-14 05:11:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1392.html