Bug 914737

Summary: Fedora 18 on EC2 kernel panic due to cgroups
Product: [Fedora] Fedora
Reporter: Krishna Raman <kraman>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 18
CC: bfan, drjones, gansalmon, itamar, jforbes, jonathan, kernel-maint, ketuzsezr, leiwang, madhu.chinakonda, mattdm, mfisher, moli, qguan, wshi
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 955646
Environment:
Last Closed: 2013-03-22 00:17:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 955646

Description Krishna Raman 2013-02-22 16:31:22 UTC
Description of problem:
Kernel panic when using cgroups on an EC2 F18 machine

Version-Release number of selected component (if applicable):
Fedora 18 AMI from fedoraproject page

How reproducible:
3 out of 4 VMs crash

Steps to Reproduce:
1. Start F18 AMI
2. Use cgroups (cgset/cgget) and perform computation
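
The reproduction steps do not say which controller or workload was used. As a rough sketch only, the following Python builds the kind of libcgroup command lines that step 2 could correspond to. The memory controller is an assumption (chosen because the trace below passes through mem_cgroup_uncharge_page), and the group name "repro", the 256 MiB limit, and the "make -j4" workload are hypothetical placeholders.

```python
import shlex


def cgroup_repro_commands(group, limit_bytes, workload):
    """Build libcgroup command lines for a memory-limited workload.

    Assumes the cgroup v1 memory controller; the controller actually
    used by the reporter is not stated in this bug.
    """
    target = "memory:%s" % group
    return [
        # Create the group under the memory controller.
        ["cgcreate", "-g", target],
        # Set a memory limit on the group.
        ["cgset", "-r", "memory.limit_in_bytes=%d" % limit_bytes, group],
        # Read the limit back to confirm it was applied.
        ["cgget", "-r", "memory.limit_in_bytes", group],
        # Run the workload inside the group.
        ["cgexec", "-g", target] + list(workload),
    ]


# Hypothetical values: group name, limit, and workload are placeholders.
for argv in cgroup_repro_commands("repro", 256 * 1024 * 1024, ["make", "-j4"]):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The script only prints the commands. Running them requires root and the libcgroup tools, and on an affected kernel the crash is reported when processes in the group exit.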
  
Actual results:
[63447736.613490] ------------[ cut here ]------------
[63447736.613711] kernel BUG at arch/x86/mm/fault.c:396!
[63447736.613839] invalid opcode: 0000 [#1] SMP 
[63447736.614003] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack coretemp crc32c_intel microcode xen_netfront xen_blkfront [last unloaded: ip6_tables]
[63447736.614123] CPU 1 
[63447736.614129] Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1  
[63447736.614136] RIP: e030:[<ffffffff816271bf>]  [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208
[63447736.614151] RSP: e02b:ffff8801a55879c8  EFLAGS: 00010046
[63447736.614157] RAX: ffff8801a6003ff8 RBX: ffffe8ffffd00058 RCX: 0000000000000000
[63447736.614162] RDX: 00003ffffffff000 RSI: ffff880000000ff8 RDI: 0000000000000000
[63447736.614167] RBP: ffff8801a55879e8 R08: ffff8801be6c8840 R09: 0000000000000000
[63447736.614173] R10: 0000000000007ff0 R11: 0000000000000001 R12: ffff8801d2f8ee88
[63447736.614178] R13: ffff8801a6003ff8 R14: ffff880000000ff8 R15: 0000000000000002
[63447736.614187] FS:  00007ff2b1d95840(0000) GS:ffff8801dfd00000(0000) knlGS:0000000000000000
[63447736.614193] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[63447736.614197] CR2: ffffe8ffffd00058 CR3: 00000001d2f8e000 CR4: 0000000000002660
[63447736.614204] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[63447736.614209] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[63447736.614214] Process httpd (pid: 10866, threadinfo ffff8801a5586000, task ffff8801a57a0000)
[63447736.614220] Stack:
[63447736.614222]  ffffe8ffffd00058 0000000000000029 ffff8801a5587b08 0000000000000000
[63447736.614230]  ffff8801a5587af8 ffffffff81627759 ffff8801dffedb00 0000000000000000
[63447736.614237]  ffff8801a57a0000 0000000000000060 0000000000000041 ffff8801dffedb08
[63447736.614244] Call Trace:
[63447736.614251]  [<ffffffff81627759>] do_page_fault+0x399/0x4b0
[63447736.614260]  [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110
[63447736.614266]  [<ffffffff81624065>] page_fault+0x25/0x30
[63447736.614276]  [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50
[63447736.614283]  [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350
[63447736.813234]  [<ffffffff8118aac7>] mem_cgroup_uncharge_page+0x57/0x60
[63447736.813245]  [<ffffffff8115fbc0>] page_remove_rmap+0xe0/0x150
[63447736.813252]  [<ffffffff8115311a>] ? vm_normal_page+0x1a/0x80
[63447736.813257]  [<ffffffff81153e61>] unmap_single_vma+0x531/0x870
[63447736.813263]  [<ffffffff81154962>] unmap_vmas+0x52/0xa0
[63447736.813270]  [<ffffffff81007442>] ? pte_mfn_to_pfn+0x72/0x100
[63447736.813276]  [<ffffffff8115c8f8>] exit_mmap+0x98/0x170
[63447736.813281]  [<ffffffff810050d9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[63447736.813290]  [<ffffffff81059ce3>] mmput+0x83/0xf0
[63447736.813297]  [<ffffffff810624c4>] exit_mm+0x104/0x130
[63447736.813302]  [<ffffffff8106264a>] do_exit+0x15a/0x8c0
[63447736.813308]  [<ffffffff810630ff>] do_group_exit+0x3f/0xa0
[63447736.813313]  [<ffffffff81063177>] sys_exit_group+0x17/0x20
[63447736.813321]  [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b
[63447736.813326] Code: 4c 89 e7 e8 12 1a ff ff 4c 89 ef 48 89 de 49 89 c6 e8 04 1a ff ff 48 83 38 00 49 89 c5 0f 84 e5 00 00 00 49 8b 3e 48 85 ff 75 02 <0f> 0b ff 14 25 60 0d c2 81 48 89 c2 49 8b 7d 00 ff 14 25 60 0d 
[63447736.813372] RIP  [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208
[63447736.813380]  RSP <ffff8801a55879c8>
[63447736.813388] ---[ end trace 6a18a32ed5ee7093 ]---
[63447736.813394] Fixing recursive fault but reboot is needed!

Expected results:
No crash

Additional info:
I am working with OpenShift on F18. If you need any additional info or a login into a VM which has the crash, please contact me.
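
For anyone comparing this oops against other reports, a minimal Python sketch for reducing a call trace to its function names follows. The function name parse_call_trace and the regular expression are illustrative, not part of any kernel tooling, and the sample is a shortened excerpt of the trace above.

```python
import re

# Matches oops call-trace entries such as:
#   [63447736.614251]  [<ffffffff81627759>] do_page_fault+0x399/0x4b0
# A "?" before the symbol marks an entry the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_call_trace(oops_text):
    """Return (symbol, reliable) pairs for every frame after 'Call Trace:'."""
    frames = []
    in_trace = False
    for line in oops_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match is None:
            break  # first non-frame line ends the trace (e.g. "Code: ...")
        frames.append((match.group("symbol"), match.group("unreliable") is None))
    return frames


SAMPLE = """\
[63447736.614244] Call Trace:
[63447736.614251]  [<ffffffff81627759>] do_page_fault+0x399/0x4b0
[63447736.614260]  [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110
[63447736.614266]  [<ffffffff81624065>] page_fault+0x25/0x30
[63447736.614276]  [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50
[63447736.614283]  [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350
[63447736.813326] Code: 4c 89 e7
"""

reliable = [symbol for symbol, ok in parse_call_trace(SAMPLE) if ok]
print(reliable)
```

Dropping the "?" entries leaves the path page fault handler -> memory cgroup uncharge, which is the part worth matching against other reports.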

Comment 1 Krishna Raman 2013-02-22 16:32:06 UTC
I have noticed similar crashes with F17 as well

Comment 2 Justin M. Forbes 2013-02-22 23:21:06 UTC
Is this reproducing with the current 3.7 kernels in F17 and F18? There were a number of fixes since 3.6.10

Comment 3 Josh Boyer 2013-02-22 23:21:44 UTC
Konrad, have you seen anything like this oops before?

Comment 4 Josh Boyer 2013-02-22 23:35:30 UTC
Hm.  Looks like someone hit this in Ubuntu with 3.5:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1073238/

and someone reported on the systemd-devel list (??) that it showed up in 3.3 with the memcfg rewrite:

http://lists.freedesktop.org/archives/systemd-devel/2012-December/007826.html

Comment 5 Josh Boyer 2013-02-22 23:59:06 UTC
Same guy, reported to the cgroups list:

http://thread.gmane.org/gmane.linux.kernel.cgroups/5540

thread went nowhere.

Comment 6 Josh Boyer 2013-02-23 00:00:55 UTC
Oh, hey.  Spiffy.  Konrad _has_ seen this:

https://lkml.org/lkml/2013/2/21/271

Comment 7 Krishna Raman 2013-02-23 02:57:30 UTC
I am running 3.6.10-4.fc18.x86_64 on an m1.large machine which is under some pretty heavy load. I do not see kernel 3.7 in updates or updates-testing repo.
Will be glad to try 3.7 if you think it will help.

Comment 8 Josh Boyer 2013-02-23 12:55:39 UTC
(In reply to comment #7)
> I am running 3.6.10-4.fc18.x86_64 on an m1.large machine which is under some
> pretty heavy load. I do not see kernel 3.7 in updates or updates-testing
> repo.

That's pretty baffling.  What happens when you run 'yum update' on your machine?  Are the update repos configured and enabled?  We've had the 3.7 kernel in F18 updates since the initial GA.  The latest one in F18 stable is 3.7.9.

http://mirrors.kernel.org/fedora//updates/18/x86_64/kernel-3.7.9-201.fc18.x86_64.rpm

> Will be glad to try 3.7 if you think it will help.

It probably won't immediately, but when this gets fixed it will go out in a normal kernel update for Fedora, which will be 3.7 (or 3.8) based.  So figuring out why you can't see updates now would be good.
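
A quick way to confirm that a machine is behind the updates repository is to compare its running kernel release with the one in stable. The Python sketch below does this for the two releases named in this thread. The helper names are illustrative, and the comparison is a deliberate simplification, not rpm's own version-comparison algorithm.

```python
import re


def kernel_version_key(release):
    """Turn '3.6.10-4.fc18.x86_64' into ((3, 6, 10), 4) for comparison.

    Simplification: only the numeric upstream version and the leading
    number of the RPM release are compared; this is not rpm's own
    version-comparison algorithm.
    """
    match = re.match(r"(\d+(?:\.\d+)*)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    version = tuple(int(part) for part in match.group(1).split("."))
    return version, int(match.group(2))


def is_older(running, candidate):
    """Return True if the running kernel sorts before the candidate."""
    return kernel_version_key(running) < kernel_version_key(candidate)


# Releases taken from this thread: the reporter's kernel and F18 stable.
print(is_older("3.6.10-4.fc18.x86_64", "3.7.9-201.fc18.x86_64"))
```

On a live system the running release can be read with platform.release() or 'uname -r'. A result of True only shows that updates are not being picked up; it says nothing about whether the newer kernel contains the fix.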

Comment 9 Josh Boyer 2013-02-25 14:14:51 UTC
Krishna, please test this scratch build when it finishes building.  It contains the patch I've linked to.

http://koji.fedoraproject.org/koji/taskinfo?taskID=5053087

Comment 10 Krishna Raman 2013-02-26 00:12:42 UTC
I tried this against my code today and did not run into any kernel panics so looks like this kernel is a lot more stable. Is this change going to make it into fedora updates?

(In reply to comment #9)
> Krishna, please test this scratch build when it finishes building.  It
> contains the patch I've linked to.
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=5053087

Comment 11 Josh Boyer 2013-02-26 13:10:56 UTC
(In reply to comment #10)
> I tried this against my code today and did not run into any kernel panics so
> looks like this kernel is a lot more stable. Is this change going to make it
> into fedora updates?

Yes.  I'll get it in today.

Comment 12 Josh Boyer 2013-02-26 13:18:17 UTC
Committed to all active Fedora branches.  Bodhi will leave the usual comments when it shows up in the updates repositories.

Comment 13 Konrad Rzeszutek Wilk 2013-02-27 00:53:57 UTC
Krishna,

Is it OK to put Tested-by: Krishna Raman <kraman> on the patch (need to resubmit to hpa or ingo).

Comment 14 Krishna Raman 2013-02-27 02:12:01 UTC
(In reply to comment #13)
> Krishna,
> 
> Is it OK to put Tested-by: Krishna Raman <kraman> on the patch
> (need to resubmit to hpa or ingo).

Yep, should be fine. I've tested a bunch of OpenShift builds on it now and have not seen any more failures.

Comment 15 Fedora Update System 2013-02-28 14:52:41 UTC
kernel-3.7.10-101.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.7.10-101.fc17

Comment 16 Josh Boyer 2013-03-01 13:28:03 UTC
Krishna, the upstream kernel developers came up with an additional patch in this area that patches out the function calls when running on bare metal.  I've grabbed that and included it in the scratch build below:

http://koji.fedoraproject.org/koji/taskinfo?taskID=5066692

It would be much appreciated if you could test that in your environment to make sure it's working as expected.  I'll test on bare metal here.

Comment 17 Krishna Raman 2013-03-01 16:52:29 UTC
Thanks. Will test it out over the weekend

Comment 18 Fedora Update System 2013-03-02 20:04:58 UTC
Package kernel-3.7.10-101.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.7.10-101.fc17'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-3252/kernel-3.7.10-101.fc17
then log in and leave karma (feedback).

Comment 19 Fedora Update System 2013-03-08 22:17:00 UTC
kernel-3.8.2-105.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.8.2-105.fc17

Comment 20 Josh Boyer 2013-03-13 13:43:47 UTC
(In reply to comment #17)
> Thanks. Will test it out over the weekend

Did you ever get a chance to test the kernel build out?  Upstream is looking for feedback.

Comment 21 Krishna Raman 2013-03-14 06:50:37 UTC
(In reply to comment #20)
> (In reply to comment #17)
> > Thanks. Will test it out over the weekend
> 
> Did you ever get a chance to test the kernel build out?  Upstream is looking
> for feedback.

Didn't get a chance that weekend. But when I went back to look later, I could not find the compiled kernel rpm. The link you sent me just included the srpm.

Comment 22 Josh Boyer 2013-03-14 12:48:15 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > (In reply to comment #17)
> > > Thanks. Will test it out over the weekend
> > 
> > Did you ever get a chance to test the kernel build out?  Upstream is looking
> > for feedback.
> 
> Didn't get a chance that weekend. But when I went back to look later, I
> could not find the compiled kernel rpm. The link you sent me just included
> the srpm.

That's because you waited a week and koji pruned the scratch build.  It only keeps them for a week.

I've submitted another for you to test here when it finishes building:

http://koji.fedoraproject.org/koji/taskinfo?taskID=5121103

Comment 23 Fedora Update System 2013-03-14 15:20:43 UTC
kernel-3.8.2-105.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/FEDORA-2013-3638/kernel-3.8.2-105.fc17

Comment 24 Fedora Update System 2013-03-14 22:56:33 UTC
kernel-3.8.3-101.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.8.3-101.fc17

Comment 25 Fedora Update System 2013-03-22 00:17:40 UTC
kernel-3.8.3-103.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.