Bug 816365 - memcg controller can cause kernel oops when migration and swap tracking is enabled
Status: CLOSED DUPLICATE of bug 800328
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: x86_64 Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 435010 846704
 
Reported: 2012-04-25 17:40 EDT by Brian Bockelman
Modified: 2012-11-13 08:45 EST
CC List: 2 users

Doc Type: Bug Fix
Last Closed: 2012-11-13 08:45:30 EST
Type: Bug


Attachments
An (untested) proposed approach to addressing the symptom without addressing the cause. (970 bytes, patch)
2012-04-25 17:42 EDT, Brian Bockelman

Description Brian Bockelman 2012-04-25 17:40:53 EDT
Description of problem:

Under heavy load, when migrating a task from one cgroup to another in the memory controller, a kernel oops may occur when looking up the swap pages.

Version-Release number of selected component (if applicable):

kernel-2.6.32-220.7.1.el6.x86_64

How reproducible:

Very difficult.  On an HPC cluster of 200 heavily loaded nodes under continuous load, we see this about twice a week.  I have a few crash dumps available on request.

We also believe we saw this by running a memory-heavy workload with the memory controller and migration enabled, and then starting the cgred daemon, which moves tasks to different cgroups.

Steps to Reproduce:
1. Run memory-intensive workload (for us, this is scientific software).
2. Periodically move tasks between cgroups with memcg, swap accounting, and migration on.
3. Repeat step 2 until a kernel oops occurs.
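
A minimal sketch of step 2 in C, assuming a cgroup v1 memory hierarchy in which each group directory contains a "tasks" file. Writing a PID to that file is the sys_write -> cgroup_tasks_write path seen in the backtraces in this report. The helper names (move_task, bounce_task) and any paths passed to them are hypothetical, and the sketch assumes charge migration has already been enabled on the destination groups (on cgroup v1 this is normally the memory.move_charge_at_immigrate setting).

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one PID to a cgroup v1 "tasks" file (path supplied by the caller).
 * Returns 0 on success, -1 on any error. */
static int move_task(const char *tasks_path, pid_t pid)
{
    char buf[32];
    int len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
    if (len < 0 || (size_t)len >= sizeof(buf))
        return -1;

    int fd = open(tasks_path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* A single write() per PID, as the cgroup "tasks" interface expects. */
    ssize_t n = write(fd, buf, (size_t)len);
    int rc = (n == (ssize_t)len) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Move a task back and forth between two groups `rounds` times.
 * Both paths are hypothetical examples chosen by the caller. */
static int bounce_task(const char *tasks_a, const char *tasks_b,
                       pid_t pid, int rounds)
{
    for (int i = 0; i < rounds; i++) {
        if (move_task(tasks_a, pid) != 0 || move_task(tasks_b, pid) != 0)
            return -1;
    }
    return 0;
}
```

Calling bounce_task() in a loop over the PIDs of the memory-intensive workload approximates what cgclear followed by cgred does in bulk.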
  
Actual results:

Kernel oops

Expected results:

No kernel oops

Additional info:

From kernel log:

swap_free: Bad swap file entry 2000000000000000
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8116bdf7>] lookup_swap_cgroup+0x37/0x70

Relevant part of the backtrace from the crash dump:

    [exception RIP: lookup_swap_cgroup+55]
    RIP: ffffffff8116bdf7  RSP: ffff88060d4cfb58  RFLAGS: 00010246
    RAX: 0000160000000000  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000046  RDI: 0000000000000000
    RBP: ffff88060d4cfb58   R8: 0000000000000001   R9: ffffffff8163a920
    R10: 0000000000000001  R11: 0000000000000000  R12: 2000000000000000
    R13: 000000000000000b  R14: ffff8806157ed2d8  R15: ffffea0000516010
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88060d4cfb60] is_target_pte_for_mc at ffffffff81167683
#10 [ffff88060d4cfbb0] mem_cgroup_count_precharge_pte_range at ffffffff81167925
#11 [ffff88060d4cfc00] walk_page_range at ffffffff8114a919
#12 [ffff88060d4cfc90] mem_cgroup_can_attach at ffffffff8116abfb
#13 [ffff88060d4cfd20] cgroup_attach_task at ffffffff810bfaf6
#14 [ffff88060d4cfe10] cgroup_tasks_write at ffffffff810c00ec
#15 [ffff88060d4cfe40] cgroup_file_write at ffffffff810c1c3a
#16 [ffff88060d4cfef0] vfs_write at ffffffff81176588
#17 [ffff88060d4cff30] sys_write at ffffffff81176f91
#18 [ffff88060d4cff80] system_call_fastpath at ffffffff8100b0f2

Disassembly of the function in question:

crash> dis -rl lookup_swap_cgroup+55
/usr/src/debug/kernel-2.6.32-220.7.1.el6/linux-2.6.32-220.7.1.el6.x86_64/mm/page_cgroup.c: 477
0xffffffff8116bdc0 <lookup_swap_cgroup>:        push   %rbp
0xffffffff8116bdc1 <lookup_swap_cgroup+1>:      mov    %rsp,%rbp
0xffffffff8116bdc4 <lookup_swap_cgroup+4>:      nopl   0x0(%rax,%rax,1)
/usr/src/debug/kernel-2.6.32-220.7.1.el6/linux-2.6.32-220.7.1.el6.x86_64/mm/page_cgroup.c: 489
0xffffffff8116bdc9 <lookup_swap_cgroup+9>:      mov    %rdi,%rax
/usr/src/debug/kernel-2.6.32-220.7.1.el6/linux-2.6.32-220.7.1.el6.x86_64/include/linux/swapops.h: 42
0xffffffff8116bdcc <lookup_swap_cgroup+12>:     mov    %rdi,%rdx
/usr/src/debug/kernel-2.6.32-220.7.1.el6/linux-2.6.32-220.7.1.el6.x86_64/mm/page_cgroup.c: 489
0xffffffff8116bdcf <lookup_swap_cgroup+15>:     and    $0x7ff,%edi
0xffffffff8116bdd5 <lookup_swap_cgroup+21>:     shr    $0x3b,%rax
/usr/src/debug/kernel-2.6.32-220.7.1.el6/linux-2.6.32-220.7.1.el6.x86_64/include/linux/swapops.h: 42
0xffffffff8116bdd9 <lookup_swap_cgroup+25>:     shl    $0x5,%rdx
/usr/src/debug/kernel-2.6.32-220.7.1.el6/linux-2.6.32-220.7.1.el6.x86_64/mm/page_cgroup.c: 489
0xffffffff8116bddd <lookup_swap_cgroup+29>:     lea    (%rax,%rax,2),%rax
0xffffffff8116bde1 <lookup_swap_cgroup+33>:     shr    $0x10,%rdx
0xffffffff8116bde5 <lookup_swap_cgroup+37>:     mov    -0x7e0418e0(,%rax,8),%rcx
0xffffffff8116bded <lookup_swap_cgroup+45>:     mov    $0x160000000000,%rax
0xffffffff8116bdf7 <lookup_swap_cgroup+55>:     add    (%rcx,%rdx,8),%rax

Looking at the source code, the swap entry appears to be 0x2000000000000000. The swap type is obtained by shifting the entry right by 59 bits, so this entry asks for swap type 4. These systems have only one swap device, so there is no map for type 4, and looking up the offset in it dereferences a null pointer and causes the kernel oops.
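
The arithmetic can be checked with a small user-space sketch. The constants below are read off the disassembly above (shr $0x3b for the type; shl $0x5 followed by shr $0x10, and the 0x7ff mask, for the position inside the map) rather than taken from kernel headers, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumptions inferred from the disassembly in this report. */
#define TYPE_SHIFT   59u    /* shr $0x3b,%rax: the type is the top 5 bits */
#define IDS_PER_PAGE 2048u  /* and $0x7ff,%edi: 2048 ids per map page */

/* Swap type: which swap device the entry refers to. */
static unsigned entry_type(uint64_t ent)
{
    return (unsigned)(ent >> TYPE_SHIFT);
}

/* Swap offset: the entry with the type bits masked off. */
static uint64_t entry_offset(uint64_t ent)
{
    return ent & ((UINT64_C(1) << TYPE_SHIFT) - 1);
}

/* Index of the map page holding this offset
 * (shl $0x5 then shr $0x10 is a net shift right by 11 of the offset). */
static uint64_t map_page_index(uint64_t ent)
{
    return entry_offset(ent) / IDS_PER_PAGE;
}

/* Slot within that map page. */
static unsigned slot_in_page(uint64_t ent)
{
    return (unsigned)(entry_offset(ent) % IDS_PER_PAGE);
}
```

For 0x2000000000000000 this gives type 4, map page index 0 and slot 0, which matches RDX and RDI both being zero in the register dump above. For the entry in R12 of the backtrace in comment 3 (0x8805510551055105) it gives type 17, map page index 0xaa20aa20aa and slot 0x105, matching RDX and RDI there.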

I attach an (untested) patch that demonstrates a possible approach: track the largest valid swap type and return early instead of dereferencing the null pointer.  swap_free already validates the entry in a similar way, which is why it only logged "Bad swap file entry" instead of causing a kernel oops.
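
The general idea can be modelled in user space as follows. This is not the attached patch (its contents are not reproduced in this report) and it is not kernel code: the struct, the globals and model_lookup() are hypothetical stand-ins, and the shift and page-size constants are the ones inferred from the disassembly above.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_TYPE_SHIFT   59u
#define MODEL_IDS_PER_PAGE 2048u
#define MODEL_MAX_TYPES    32u  /* a 5-bit type field can name 32 devices */

/* Hypothetical per-device control block: an array of id pages. */
struct model_swap_ctrl {
    unsigned short **map;  /* one id array per map page; NULL when unused */
    size_t nr_pages;       /* number of entries in map */
};

static struct model_swap_ctrl model_ctrl[MODEL_MAX_TYPES];
static unsigned model_nr_types;  /* types 0 .. model_nr_types-1 are valid */

/* Return the id recorded for a swap entry, or 0 if the entry does not
 * refer to a configured device or lies outside its map. */
static unsigned short model_lookup(uint64_t ent)
{
    unsigned type = (unsigned)(ent >> MODEL_TYPE_SHIFT);
    uint64_t offset = ent & ((UINT64_C(1) << MODEL_TYPE_SHIFT) - 1);
    uint64_t page = offset / MODEL_IDS_PER_PAGE;

    /* The guard: reject types beyond the largest valid one. */
    if (type >= model_nr_types)
        return 0;

    const struct model_swap_ctrl *ctrl = &model_ctrl[type];
    if (ctrl->map == NULL || page >= ctrl->nr_pages || ctrl->map[page] == NULL)
        return 0;

    return ctrl->map[page][offset % MODEL_IDS_PER_PAGE];
}
```

With a single configured device, both corrupted entries seen in this report (types 4 and 17) fall through the first check and return 0 instead of faulting. As noted, this only hides the symptom; the corrupted entry itself is still unexplained.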

Since the swap entry has only a single bit set, I suspect that this cgroups code is missing a necessary lock, or that something is zapping the PTE out from underneath it.

Unfortunately, everything I have checked looks correct.  I'm hoping someone more experienced with this code might have some ideas.  Regardless, the added protection seems worth having.
Comment 1 Brian Bockelman 2012-04-25 17:42:22 EDT
Created attachment 580291 [details]
An (untested) proposed approach to addressing the symptom without addressing the cause.
Comment 3 Brian Bockelman 2012-05-02 15:49:34 EDT
Found an easier way to trigger this - start and stop cgred (in libcgroup) on a node with lots of processes running.

Use cgclear to move every process to the "/" cgroup, then cgred to move everything back again.  This causes lots of swap accounting migration to be done.  Do it often enough, and you'll catch a kernel oops.

Backtrace from the crash dump is below:


PID: 30125  TASK: ffff88033678c100  CPU: 15  COMMAND: "cgclear"
 #0 [ffff880113b51720] machine_kexec at ffffffff810321cb
 #1 [ffff880113b51780] crash_kexec at ffffffff810b8f22
 #2 [ffff880113b51850] oops_end at ffffffff814f0560
 #3 [ffff880113b51880] no_context at ffffffff8104234b
 #4 [ffff880113b518d0] __bad_area_nosemaphore at ffffffff810425d5
 #5 [ffff880113b51920] bad_area at ffffffff810426fe
 #6 [ffff880113b51950] __do_page_fault at ffffffff81042e03
 #7 [ffff880113b51a70] do_page_fault at ffffffff814f253e
 #8 [ffff880113b51aa0] page_fault at ffffffff814ef8f5
    [exception RIP: lookup_swap_cgroup+55]
    RIP: ffffffff8116bdf7  RSP: ffff880113b51b58  RFLAGS: 00010206
    RAX: 0000160000000000  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 000000aa20aa20aa  RSI: 0000000000000046  RDI: 0000000000000105
    RBP: ffff880113b51b58   R8: 0000000000000001   R9: ffffffff8163a920
    R10: 0000000000000001  R11: 0000000000000000  R12: 8805510551055105
    R13: 0000000000000006  R14: ffff880336a3bae8  R15: ffffea0003aa7010
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880113b51b60] is_target_pte_for_mc at ffffffff81167683
#10 [ffff880113b51bb0] mem_cgroup_count_precharge_pte_range at ffffffff81167925
#11 [ffff880113b51c00] walk_page_range at ffffffff8114a919
#12 [ffff880113b51c90] mem_cgroup_can_attach at ffffffff8116abfb
#13 [ffff880113b51d20] cgroup_attach_task at ffffffff810bfaf6
#14 [ffff880113b51e10] cgroup_tasks_write at ffffffff810c00ec
#15 [ffff880113b51e40] cgroup_file_write at ffffffff810c1c3a
#16 [ffff880113b51ef0] vfs_write at ffffffff81176588
#17 [ffff880113b51f30] sys_write at ffffffff81176f91
#18 [ffff880113b51f80] system_call_fastpath at ffffffff8100b0f2
Comment 4 RHEL Product and Program Management 2012-05-06 00:04:56 EDT
Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.
Comment 5 Brian Bockelman 2012-05-12 14:23:52 EDT
I think this might be the underlying cause:

https://bugzilla.redhat.com/show_bug.cgi?id=816365

We tried turning off swap accounting (which avoids the code paths mentioned above), and still got the following in the kernel log:

BUG: Bad page map in process condor_procd  pte:464542474154425f pmd:800000000fe000e7
addr:000000000d63e000 vm_flags:00100073 anon_vma:ffff880819996b50 mapping:(null) index:d63e
Pid: 22191, comm: condor_procd Tainted: G    B      ----------------   2.6.32-220.4.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81136d88>] ? print_bad_pte+0x1d8/0x290
 [<ffffffff81136eab>] ? vm_normal_page+0x6b/0x70
 [<ffffffff8116759d>] ? is_target_pte_for_mc+0x17d/0x340
 [<ffffffff81167815>] ? mem_cgroup_count_precharge_pte_range+0xb5/0xf0
 [<ffffffff8114a809>] ? walk_page_range+0x379/0x4e0
 [<ffffffff8116aaeb>] ? mem_cgroup_can_attach+0x13b/0x180
 [<ffffffff8114579d>] ? page_add_new_anon_rmap+0x9d/0xf0
 [<ffffffff81167760>] ? mem_cgroup_count_precharge_pte_range+0x0/0xf0
 [<ffffffff810bf9e6>] ? cgroup_attach_task+0x86/0x620
 [<ffffffff8113c354>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff814edeae>] ? mutex_lock+0x1e/0x50
 [<ffffffff810bffdc>] ? cgroup_tasks_write+0x5c/0xf0
 [<ffffffff810c1b2a>] ? cgroup_file_write+0x2ba/0x320
 [<ffffffff81218d2b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff81176478>] ? vfs_write+0xb8/0x1a0
 [<ffffffff810d4582>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176e81>] ? sys_write+0x51/0x90
 [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b

and, shortly thereafter, a crash:

    [exception RIP: is_target_pte_for_mc+393]
    RIP: ffffffff811675a9  RSP: ffff88101a341b68  RFLAGS: 00010286
    RAX: ffffea0003010cc0  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: ffffea0000000000  RSI: 000000000241a000  RDI: 00000000dbba883f
    RBP: ffff88101a341ba8   R8: ffff88015218bed0   R9: 0000000000000001
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff8800516000d0
    R13: ffffea0003010cc0  R14: ffff88015218bed0  R15: ffffea00011cd010
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88101a341bb0] mem_cgroup_count_precharge_pte_range at ffffffff81167815
#10 [ffff88101a341c00] walk_page_range at ffffffff8114a809
#11 [ffff88101a341c90] mem_cgroup_can_attach at ffffffff8116aaeb
#12 [ffff88101a341d20] cgroup_attach_task at ffffffff810bf9e6
#13 [ffff88101a341e10] cgroup_tasks_write at ffffffff810bffdc
#14 [ffff88101a341e40] cgroup_file_write at ffffffff810c1b2a
#15 [ffff88101a341ef0] vfs_write at ffffffff81176478
#16 [ffff88101a341f30] sys_write at ffffffff81176e81
#17 [ffff88101a341f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 0000003ba42d8a10  RSP: 00007fffe82c7210  RFLAGS: 00010202
    RAX: 0000000000000001  RBX: ffffffff8100b0f2  RCX: 00007fe913071005
    RDX: 0000000000000005  RSI: 00007fe913071000  RDI: 0000000000000009
    RBP: 00007fe913071000   R8: 00000000ffffffff   R9: 0000000000000000
    R10: 00000000ffffffff  R11: 0000000000000246  R12: 0000000000000005
    R13: 000000000143f390  R14: 0000000000000005  R15: 000000000143f390
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
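
One observation about the "Bad page map" line above, offered as a sketch rather than a diagnosis. On x86-64, bit 7 of a PMD entry is the page-size (PSE) bit, and the reported pmd value has it set; the reported pte value consists entirely of printable ASCII bytes. Both would be consistent with the walker reading the contents of a huge page as if they were PTEs, although nothing in this report confirms that. The helper names below are hypothetical.

```c
#include <stdint.h>

/* x86-64 page-table entry bit 7: page size (PSE) when set in a PMD. */
#define X86_PAGE_PSE (UINT64_C(1) << 7)

/* Non-zero if the PMD value has the PSE (large page) bit set. */
static int pmd_has_pse(uint64_t pmd)
{
    return (pmd & X86_PAGE_PSE) != 0;
}

/* Unpack a 64-bit value into its 8 bytes in little-endian (memory) order.
 * Non-printable bytes are shown as '.'. Returns 1 if every byte is
 * printable ASCII, 0 otherwise. `out` must have room for 9 chars. */
static int bytes_printable(uint64_t value, char out[9])
{
    int all_printable = 1;
    for (int i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)(value >> (8 * i));
        if (c >= 0x20 && c < 0x7f) {
            out[i] = (char)c;
        } else {
            out[i] = '.';
            all_printable = 0;
        }
    }
    out[8] = '\0';
    return all_printable;
}
```

Applied to the values logged above, pmd_has_pse(0x800000000fe000e7) is true, and the bytes of pte 0x464542474154425f in memory order spell "_BTAGBEF".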
Comment 6 Brian Bockelman 2012-11-13 08:19:43 EST
Hi,

After a bit more digging, this turns out to be the same issue as CVE-2012-1179 and is resolved by the fix for it.  It no longer occurs on RHEL 6.3.

Brian
Comment 7 Matthew Farrellee 2012-11-13 08:45:30 EST

*** This bug has been marked as a duplicate of bug 800328 ***
