| Summary: | kernel panic in nf_nat_cleanup_conntrack in netns_cleanup_net with Docker | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Alexander Larsson <alexl> |
| Component: | kernel | Assignee: | fedora-kernel-networking |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 20 | CC: | admwiggin, arozansk, fwestpha, gansalmon, grossws, itamar, jbrouer, jonathan, jpoimboe, kernel-maint, madhu.chinakonda, marcelo.barbosa, michele, qqshfox, rkhan, samu.kallio |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-3.15.3-200.fc20 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-07-08 00:59:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Alexander Larsson
2013-10-07 08:33:10 UTC
Possible fix?
http://www.spinics.net/lists/netfilter-devel/msg28026.html

(In reply to Alexander Larsson from comment #1)
> Possible fix ?
> http://www.spinics.net/lists/netfilter-devel/msg28026.html

It's possible, yes. Here is a scratch build with the patch included. Could you please test it when it finishes building and let us know if it solves the issue for you?
http://koji.fedoraproject.org/koji/taskinfo?taskID=6030994

OK, I ran the Docker tests for hours with that kernel, with no crash. This is no guarantee of course, but that's much better than with the other kernels.

(In reply to Alexander Larsson from comment #3)
> Ok, i run hours of the docker tests with that kernel, no crash. This is no
> guarantee of course, but thats far much better than with the other kernels.

Great, thanks. I'll get that patch included.

The same is happening in F19 now, by the way. Did you get it fixed there too?

(In reply to Alexander Larsson from comment #5)
> The same is happening in F19 now btw, did you get it fixed there too?

Yes.

kernel-3.11.4-201.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/kernel-3.11.4-201.fc19

kernel-3.11.4-101.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/kernel-3.11.4-101.fc18

kernel-3.11.4-301.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.11.4-301.fc20

Package kernel-3.11.4-201.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.11.4-201.fc19'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-18820/kernel-3.11.4-201.fc19
then log in and leave karma (feedback).

kernel-3.11.4-301.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.

kernel-3.11.4-201.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.

Ugh, I got this again (same backtrace), with the kernel-3.11.4-300.1.fc20 scratch build above. I've been running this a lot though, and had not previously seen it reproduce with that kernel. I wonder what's up with that.
Backtrace: https://www.dropbox.com/sc/yfvcvho9099cjuw/qe4QYSEU3L

kernel-3.11.4-101.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.

Got it again with 3.11.4-302.fc20.

OK, reopening. Neil, any ideas on this one?

I'd try this one that made 3.13-rc1 (backtrace slightly different but Pablo had CONFIG_DEBUG_OBJECTS_FREE on in this case):
commit 0c3c6c00c69649f4749642b3e5d82125fde1600c
Author: Pablo Neira Ayuso <pablo>
Date: Mon Nov 18 12:53:59 2013 +0100
netfilter: nf_conntrack: decrement global counter after object release
nf_conntrack_free() decrements our counter (net->ct.count)
before releasing the conntrack object. That counter is used in the
nf_conntrack_cleanup_net_list path to check if it's time to
kmem_cache_destroy our cache of conntrack objects. I think we have
a race there that should be easier to trigger (although still hard)
with CONFIG_DEBUG_OBJECTS_FREE as object releases become slowier
according to the following splat:
[ 1136.321305] WARNING: CPU: 2 PID: 2483 at lib/debugobjects.c:260
debug_print_object+0x83/0xa0()
[ 1136.321311] ODEBUG: free active (active state 0) object type:
timer_list hint: delayed_work_timer_fn+0x0/0x20
...
[ 1136.321390] Call Trace:
[ 1136.321398] [<ffffffff8160d4a2>] dump_stack+0x45/0x56
[ 1136.321405] [<ffffffff810514e8>] warn_slowpath_common+0x78/0xa0
[ 1136.321410] [<ffffffff81051557>] warn_slowpath_fmt+0x47/0x50
[ 1136.321414] [<ffffffff812f8883>] debug_print_object+0x83/0xa0
[ 1136.321420] [<ffffffff8106aa90>] ? execute_in_process_context+0x90/0x90
[ 1136.321424] [<ffffffff812f99fb>] debug_check_no_obj_freed+0x20b/0x250
[ 1136.321429] [<ffffffff8112e7f2>] ? kmem_cache_destroy+0x92/0x100
[ 1136.321433] [<ffffffff8115d945>] kmem_cache_free+0x125/0x210
[ 1136.321436] [<ffffffff8112e7f2>] kmem_cache_destroy+0x92/0x100
[ 1136.321443] [<ffffffffa046b806>] nf_conntrack_cleanup_net_list+0x126/0x160 [nf_conntrack]
[ 1136.321449] [<ffffffffa046c43d>] nf_conntrack_pernet_exit+0x6d/0x80 [nf_conntrack]
[ 1136.321453] [<ffffffff81511cc3>] ops_exit_list.isra.3+0x53/0x60
[ 1136.321457] [<ffffffff815124f0>] cleanup_net+0x100/0x1b0
[ 1136.321460] [<ffffffff8106b31e>] process_one_work+0x18e/0x430
[ 1136.321463] [<ffffffff8106bf49>] worker_thread+0x119/0x390
[ 1136.321467] [<ffffffff8106be30>] ? manage_workers.isra.23+0x2a0/0x2a0
[ 1136.321470] [<ffffffff8107210b>] kthread+0xbb/0xc0
[ 1136.321472] [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
[ 1136.321477] [<ffffffff8161b8fc>] ret_from_fork+0x7c/0xb0
[ 1136.321479] [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
[ 1136.321481] ---[ end trace 25f53c192da70825 ]---
Reported-by: Linus Torvalds <torvalds>
Signed-off-by: Pablo Neira Ayuso <pablo>
So far, unable to reproduce it locally.

As per https://github.com/dotcloud/docker/issues/2960#issuecomment-33854171 this still happens in 3.13.1, which has the fix from comment 18, so that did not fix it.

Alex, can you enable kdump and try to recreate?

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.13.4-200.fc20. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

I've not tried 3.13.4, but I've seen it in 3.12.10-300.fc20 (not tried later kernels yet) and others have seen it in 3.13.1, so I don't think this is fixed.

Got this in 3.13.5-202.fc20.x86_64 too.

I asked in the Docker meeting today for people who have seen this, and a bunch of people had never seen it and some had. One thing that seemed to be consistent with not seeing the panic is running the kernel in a VM. So, maybe this only triggers on bare metal.

I have been encountering this same issue (LXC, non-Docker) on Amazon EC2 (see https://bugzilla.kernel.org/show_bug.cgi?id=65191). So this is definitely happening on PV as well.

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.14.4-200.fc20. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

This is still happening.

This seems to be the upstream bug: https://bugzilla.kernel.org/show_bug.cgi?id=65191

Seems like this has a possible fix at: https://bugzilla.kernel.org/show_bug.cgi?id=65191

The fix is now in Linus' tree:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=945b2b2d259d1a4364a2799e80e8ff32f8c6ee6f

The upstream netfilter maintainer has said he has the patch queued to be sent to stable. I'll try and get it queued for the 3.15.y rebase we're working on for F20.

Added in Fedora git. F19 will pick it up via the next 3.14.y stable rebase.

kernel-3.15.3-200.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.15.3-200.fc20

Package kernel-3.15.3-200.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.15.3-200.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-8017/kernel-3.15.3-200.fc20
then log in and leave karma (feedback).

kernel-3.15.3-200.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.

When will the fix be available in RHEL 7/CentOS 7? It's absent in 3.10.0-123.13.2.el7.