Description of problem:
Kernel panic when using cgroups on an EC2 F18 machine
Version-Release number of selected component (if applicable):
Fedora 18 AMI from fedoraproject page
How reproducible:
3 out of 4 VMs crash
Steps to Reproduce:
1. Start F18 AMI
2. Use cgroups (cgset/cgget) and perform computation
Actual results:
[63447736.613490] ------------[ cut here ]------------
[63447736.613711] kernel BUG at arch/x86/mm/fault.c:396!
[63447736.613839] invalid opcode: 0000 [#1] SMP
[63447736.614003] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack coretemp crc32c_intel microcode xen_netfront xen_blkfront [last unloaded: ip6_tables]
[63447736.614123] CPU 1
[63447736.614129] Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1
[63447736.614136] RIP: e030:[<ffffffff816271bf>] [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208
[63447736.614151] RSP: e02b:ffff8801a55879c8 EFLAGS: 00010046
[63447736.614157] RAX: ffff8801a6003ff8 RBX: ffffe8ffffd00058 RCX: 0000000000000000
[63447736.614162] RDX: 00003ffffffff000 RSI: ffff880000000ff8 RDI: 0000000000000000
[63447736.614167] RBP: ffff8801a55879e8 R08: ffff8801be6c8840 R09: 0000000000000000
[63447736.614173] R10: 0000000000007ff0 R11: 0000000000000001 R12: ffff8801d2f8ee88
[63447736.614178] R13: ffff8801a6003ff8 R14: ffff880000000ff8 R15: 0000000000000002
[63447736.614187] FS: 00007ff2b1d95840(0000) GS:ffff8801dfd00000(0000) knlGS:0000000000000000
[63447736.614193] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[63447736.614197] CR2: ffffe8ffffd00058 CR3: 00000001d2f8e000 CR4: 0000000000002660
[63447736.614204] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[63447736.614209] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[63447736.614214] Process httpd (pid: 10866, threadinfo ffff8801a5586000, task ffff8801a57a0000)
[63447736.614222] ffffe8ffffd00058 0000000000000029 ffff8801a5587b08 0000000000000000
[63447736.614230] ffff8801a5587af8 ffffffff81627759 ffff8801dffedb00 0000000000000000
[63447736.614237] ffff8801a57a0000 0000000000000060 0000000000000041 ffff8801dffedb08
[63447736.614244] Call Trace:
[63447736.614251] [<ffffffff81627759>] do_page_fault+0x399/0x4b0
[63447736.614260] [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110
[63447736.614266] [<ffffffff81624065>] page_fault+0x25/0x30
[63447736.614276] [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50
[63447736.614283] [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350
[63447736.813234] [<ffffffff8118aac7>] mem_cgroup_uncharge_page+0x57/0x60
[63447736.813245] [<ffffffff8115fbc0>] page_remove_rmap+0xe0/0x150
[63447736.813252] [<ffffffff8115311a>] ? vm_normal_page+0x1a/0x80
[63447736.813257] [<ffffffff81153e61>] unmap_single_vma+0x531/0x870
[63447736.813263] [<ffffffff81154962>] unmap_vmas+0x52/0xa0
[63447736.813270] [<ffffffff81007442>] ? pte_mfn_to_pfn+0x72/0x100
[63447736.813276] [<ffffffff8115c8f8>] exit_mmap+0x98/0x170
[63447736.813281] [<ffffffff810050d9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[63447736.813290] [<ffffffff81059ce3>] mmput+0x83/0xf0
[63447736.813297] [<ffffffff810624c4>] exit_mm+0x104/0x130
[63447736.813302] [<ffffffff8106264a>] do_exit+0x15a/0x8c0
[63447736.813308] [<ffffffff810630ff>] do_group_exit+0x3f/0xa0
[63447736.813313] [<ffffffff81063177>] sys_exit_group+0x17/0x20
[63447736.813321] [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b
[63447736.813326] Code: 4c 89 e7 e8 12 1a ff ff 4c 89 ef 48 89 de 49 89 c6 e8 04 1a ff ff 48 83 38 00 49 89 c5 0f 84 e5 00 00 00 49 8b 3e 48 85 ff 75 02 <0f> 0b ff 14 25 60 0d c2 81 48 89 c2 49 8b 7d 00 ff 14 25 60 0d
[63447736.813372] RIP [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208
[63447736.813380] RSP <ffff8801a55879c8>
[63447736.813388] ---[ end trace 6a18a32ed5ee7093 ]---
[63447736.813394] Fixing recursive fault but reboot is needed!
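For anyone comparing this oops against the other reports, here is a minimal Python sketch that reduces a trace like the one above to its function names and pulls out the trapping bytes from the Code: line. The helper names (`parse_frames`, `trapping_opcode`) are made up for illustration, and the sample is an abbreviated copy of the trace above. Frames printed with a leading "?" are ones the kernel does not consider reliable, so the sketch keeps that flag.

```python
import re

# Matches call-trace entries such as
#   [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350
# An optional "? " after the address marks a frame the kernel flags as unreliable.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

# Abbreviated copy of the oops above (Code: line shortened for readability).
SAMPLE = """\
[63447736.614244] Call Trace:
[63447736.614251] [<ffffffff81627759>] do_page_fault+0x399/0x4b0
[63447736.614260] [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110
[63447736.614266] [<ffffffff81624065>] page_fault+0x25/0x30
[63447736.614276] [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50
[63447736.614283] [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350
[63447736.813326] Code: 49 8b 3e 48 85 ff 75 02 <0f> 0b ff 14 25 60 0d c2 81
[63447736.813372] RIP [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208
"""


def parse_frames(oops_text):
    """Return (symbol, reliable) pairs for frames between 'Call Trace:' and 'Code:'."""
    frames = []
    in_trace = False
    for line in oops_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        if "Code:" in line:
            break
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("symbol"), match.group("unreliable") is None))
    return frames


def trapping_opcode(oops_text):
    """Return the byte marked <..> in the Code: line plus the byte after it."""
    match = re.search(r"Code:.*<([0-9a-f]{2})>\s+([0-9a-f]{2})", oops_text)
    return f"{match.group(1)} {match.group(2)}" if match else None


if __name__ == "__main__":
    for symbol, reliable in parse_frames(SAMPLE):
        print(("  " if reliable else "? ") + symbol)
    print("trapping bytes:", trapping_opcode(SAMPLE))
```

The marked bytes here are 0f 0b, which is the x86 ud2 instruction. That is consistent with the "invalid opcode" and "kernel BUG at arch/x86/mm/fault.c:396!" lines at the top of the oops.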
I am working with OpenShift on F18. If you need any additional info or a login to a VM that has crashed, please contact me.
I have noticed similar crashes with F17 as well
Does this reproduce with the current 3.7 kernels in F17 and F18? There have been a number of fixes since 3.6.10.
Konrad, have you seen anything like this oops before?
Hm. Looks like someone hit this in Ubuntu with 3.5:
and someone reported on the systemd-devel list (??) that it showed up in 3.3 with the memcg rewrite:
Same guy, reported to the cgroups list:
The thread went nowhere.
Oh, hey. Spiffy. Konrad _has_ seen this:
I am running 3.6.10-4.fc18.x86_64 on a m1.large machine which is under some pretty heavy load. I do not see kernel 3.7 in updates or updates-testing repo.
Will be glad to try 3.7 if you think it will help.
(In reply to comment #7)
> I am running 3.6.10-4.fc18.x86_64 on a m1.large machine which is under some
> pretty heavy load. I do not see kernel 3.7 in updates or updates-testing
That's pretty baffling. What happens when you run 'yum update' on your machine? Are the update repos configured and enabled? We've had the 3.7 kernel in F18 updates since the initial GA. The latest one in F18 stable is 3.7.9.
> Will be glad to try 3.7 if you think it will help.
It probably won't immediately, but when this gets fixed it will go out in a normal kernel update for Fedora, which will be 3.7 (or 3.8) based. So figuring out why you can't see updates now would be good.
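One way to answer the question about whether the update repos are enabled is to read the .repo files directly. The sketch below is only an illustration: `enabled_repos` is a made-up helper, the sample repo text is hypothetical, and it assumes that a section with no `enabled=` line counts as enabled. `yum repolist enabled` on the machine remains the authoritative check.

```python
import configparser
import glob

# Hypothetical .repo content, for illustration only.
SAMPLE_REPO = """\
[updates]
name=Fedora $releasever - $basearch - Updates
enabled=0

[updates-testing]
name=Fedora $releasever - $basearch - Test Updates
enabled=0

[fedora]
name=Fedora $releasever - $basearch
enabled=1
"""


def enabled_repos(repo_file_text):
    """Return the repo ids whose 'enabled' value is 1 (assumed default when absent)."""
    # Interpolation is disabled so '%' or '$' in URLs and names is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_file_text)
    return [
        section
        for section in parser.sections()
        if parser.get(section, "enabled", fallback="1").strip() == "1"
    ]


if __name__ == "__main__":
    print("sample:", enabled_repos(SAMPLE_REPO))
    # On a real machine, walk the standard yum repo directory.
    for path in sorted(glob.glob("/etc/yum.repos.d/*.repo")):
        with open(path) as handle:
            print(path, enabled_repos(handle.read()))
```

If `updates` is missing from the list for every file, that would explain why no 3.7 kernel shows up in `yum update`.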
Krishna, please test this scratch build when it finishes building. It contains the patch I've linked to.
I tried this against my code today and did not run into any kernel panics, so it looks like this kernel is a lot more stable. Is this change going to make it into Fedora updates?
(In reply to comment #9)
> Krishna, please test this scratch build when it finishes building. It
> contains the patch I've linked to.
(In reply to comment #10)
> I tried this against my code today and did not run into any kernel panics so
> looks like this kernel is a lot more stable. Is this change going to make it
> into fedora updates?
Yes. I'll get it in today.
Committed to all active Fedora branches. Bodhi will leave the usual comments when it shows up in the updates repositories.
Is it OK to put Tested-by: Krishna Raman <email@example.com> on the patch? (I need to resubmit it to hpa or Ingo.)
(In reply to comment #13)
> Is it OK to put Tested-by: Krishna Raman <firstname.lastname@example.org> on the patch
> (need to resubmit to hpa or ingo).
Yep, should be fine. I've tested a bunch of OpenShift builds on it now and have not seen any more failures.
kernel-3.7.10-101.fc17 has been submitted as an update for Fedora 17.
Krishna, the upstream kernel developers came up with an additional patch in this area that patches out the function calls when running on bare metal. I've grabbed that and included it in the scratch build below:
It would be much appreciated if you could test that in your environment to make sure it's working as expected. I'll test on bare metal here.
Thanks. Will test it out over the weekend
Package kernel-3.7.10-101.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.7.10-101.fc17'
as soon as you are able to, then reboot.
Please go to the following url:
then log in and leave karma (feedback).
kernel-3.8.2-105.fc17 has been submitted as an update for Fedora 17.
(In reply to comment #17)
> Thanks. Will test it out over the weekend
Did you ever get a chance to test the kernel build out? Upstream is looking for feedback.
(In reply to comment #20)
> (In reply to comment #17)
> > Thanks. Will test it out over the weekend
> Did you ever get a chance to test the kernel build out? Upstream is looking
> for feedback.
Didn't get a chance that weekend. When I went back to look later, I could not find the compiled kernel RPM. The link you sent me only included the SRPM.
(In reply to comment #21)
> (In reply to comment #20)
> > (In reply to comment #17)
> > > Thanks. Will test it out over the weekend
> > Did you ever get a chance to test the kernel build out? Upstream is looking
> > for feedback.
> Didn't get a chance that weekend. But when I went back to look later, I
> could not find the compiled kernel rpm. the link you send me just included
> the srpm
That's because you waited a week and koji pruned the scratch build. It only keeps them for a week.
I've submitted another for you to test here when it finishes building:
kernel-3.8.2-105.fc17 has been submitted as an update for Fedora 17.
kernel-3.8.3-101.fc17 has been submitted as an update for Fedora 17.
kernel-3.8.3-103.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.
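After updating and rebooting, a quick check that the running kernel is at least the build named above can look like the sketch below. It is a simplification under stated assumptions: `kernel_key` and `is_at_least` are hypothetical helpers, the comparison only looks at the numeric version and the leading release number, it is not RPM's real version comparison, and release numbers are only comparable within the same Fedora release (the F18 builds carry different numbers).

```python
import platform
import re


def kernel_key(release):
    """Turn '3.8.3-103.fc17.x86_64' into ((3, 8, 3), 103)."""
    match = re.match(r"(\d+(?:\.\d+)*)-(\d+)", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    return version, int(match.group(2))


def is_at_least(running, baseline="3.8.3-103.fc17"):
    """True if 'running' sorts at or after 'baseline' by version, then release number."""
    return kernel_key(running) >= kernel_key(baseline)


if __name__ == "__main__":
    # The kernel from the original report predates the baseline.
    print("3.6.10-4.fc18.x86_64:", is_at_least("3.6.10-4.fc18.x86_64"))
    try:
        # platform.release() reports the running kernel, as `uname -r` does.
        print(platform.release(), is_at_least(platform.release()))
    except ValueError as error:
        print("could not parse the running kernel release:", error)
```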