Bug 1015989

Summary: kernel panic in nf_nat_cleanup_conntrack in netns_cleanup_net with Docker
Product: [Fedora] Fedora
Reporter: Alexander Larsson <alexl>
Component: kernel
Assignee: fedora-kernel-networking
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 20
CC: admwiggin, arozansk, fwestpha, gansalmon, grossws, itamar, jbrouer, jonathan, jpoimboe, kernel-maint, madhu.chinakonda, marcelo.barbosa, michele, qqshfox, rkhan, samu.kallio
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kernel-3.15.3-200.fc20
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-07-08 00:59:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Alexander Larsson 2013-10-07 08:33:10 UTC
I've been doing some work on Docker, which makes heavy use of containers, including net namespaces. I recently upgraded to F20, and I've hit this panic at least three times now when running the Docker test suite:

https://www.dropbox.com/sc/y6jq2pso2lze82l/k1VfAt0wpV

Some info from the backtrace typed in:

BUG: unable to handle kernel paging request at ffff...

3.11.2-301.fc20.x86_64

Workqueue: netns_cleanup_net

__nf_ct_ext_destroy
nf_conntrack_free
destroy_conntrack
...

Kernel panic - not syncing: Fatal exception in interrupt

Comment 1 Alexander Larsson 2013-10-07 09:40:56 UTC
Possible fix ?
http://www.spinics.net/lists/netfilter-devel/msg28026.html

Comment 2 Josh Boyer 2013-10-07 13:23:30 UTC
(In reply to Alexander Larsson from comment #1)
> Possible fix ?
> http://www.spinics.net/lists/netfilter-devel/msg28026.html

It's possible, yes.

Here is a scratch build with the patch included.  Could you please test it when it finishes building and let us know if it solves the issue for you?

http://koji.fedoraproject.org/koji/taskinfo?taskID=6030994

Comment 3 Alexander Larsson 2013-10-08 11:10:20 UTC
OK, I ran the Docker tests for hours with that kernel and saw no crash. This is no guarantee, of course, but that's far better than with the other kernels.

Comment 4 Josh Boyer 2013-10-08 12:25:37 UTC
(In reply to Alexander Larsson from comment #3)
> OK, I ran the Docker tests for hours with that kernel and saw no crash. This
> is no guarantee, of course, but that's far better than with the other kernels.

Great, thanks.  I'll get that patch included.

Comment 5 Alexander Larsson 2013-10-10 12:38:48 UTC
The same is happening in F19 now btw, did you get it fixed there too?

Comment 6 Josh Boyer 2013-10-10 12:49:15 UTC
(In reply to Alexander Larsson from comment #5)
> The same is happening in F19 now btw, did you get it fixed there too?

Yes.

Comment 7 Fedora Update System 2013-10-10 17:41:39 UTC
kernel-3.11.4-201.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/kernel-3.11.4-201.fc19

Comment 8 Fedora Update System 2013-10-10 17:41:50 UTC
kernel-3.11.4-101.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/kernel-3.11.4-101.fc18

Comment 9 Fedora Update System 2013-10-10 22:34:13 UTC
kernel-3.11.4-301.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.11.4-301.fc20

Comment 10 Fedora Update System 2013-10-11 02:33:10 UTC
Package kernel-3.11.4-201.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.11.4-201.fc19'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-18820/kernel-3.11.4-201.fc19
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2013-10-13 19:56:24 UTC
kernel-3.11.4-301.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 12 Fedora Update System 2013-10-14 07:11:26 UTC
kernel-3.11.4-201.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 13 Alexander Larsson 2013-10-14 10:22:00 UTC
Ugh, I got this again (same backtrace), with the kernel-3.11.4-300.1.fc20 scratch build above.
I've been running this kernel a lot, though, and had not previously seen the panic reproduce with it. I wonder what's up with that.

Backtrace:
https://www.dropbox.com/sc/yfvcvho9099cjuw/qe4QYSEU3L

Comment 14 Fedora Update System 2013-10-14 17:18:17 UTC
kernel-3.11.4-201.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2013-10-18 19:32:05 UTC
kernel-3.11.4-101.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 16 Alexander Larsson 2013-10-21 13:02:33 UTC
Got it again with 3.11.4-302.fc20

Comment 17 Josh Boyer 2013-10-22 12:53:12 UTC
OK, reopening.

Neil, any ideas on this one?

Comment 18 Michele Baldessari 2013-11-30 09:12:19 UTC
I'd try this one that made 3.13-rc1 (backtrace slightly different but Pablo had
CONFIG_DEBUG_OBJECTS_FREE on in this case):
commit 0c3c6c00c69649f4749642b3e5d82125fde1600c
Author: Pablo Neira Ayuso <pablo>
Date:   Mon Nov 18 12:53:59 2013 +0100

    netfilter: nf_conntrack: decrement global counter after object release
    
    nf_conntrack_free() decrements our counter (net->ct.count)
    before releasing the conntrack object. That counter is used in the
    nf_conntrack_cleanup_net_list path to check if it's time to
    kmem_cache_destroy our cache of conntrack objects. I think we have
    a race there that should be easier to trigger (although still hard)
    with CONFIG_DEBUG_OBJECTS_FREE as object releases become slowier
    according to the following splat:
    
    [ 1136.321305] WARNING: CPU: 2 PID: 2483 at lib/debugobjects.c:260
    debug_print_object+0x83/0xa0()
    [ 1136.321311] ODEBUG: free active (active state 0) object type:
    timer_list hint: delayed_work_timer_fn+0x0/0x20
    ...
    [ 1136.321390] Call Trace:
    [ 1136.321398]  [<ffffffff8160d4a2>] dump_stack+0x45/0x56
    [ 1136.321405]  [<ffffffff810514e8>] warn_slowpath_common+0x78/0xa0
    [ 1136.321410]  [<ffffffff81051557>] warn_slowpath_fmt+0x47/0x50
    [ 1136.321414]  [<ffffffff812f8883>] debug_print_object+0x83/0xa0
    [ 1136.321420]  [<ffffffff8106aa90>] ? execute_in_process_context+0x90/0x90
    [ 1136.321424]  [<ffffffff812f99fb>] debug_check_no_obj_freed+0x20b/0x250
    [ 1136.321429]  [<ffffffff8112e7f2>] ? kmem_cache_destroy+0x92/0x100
    [ 1136.321433]  [<ffffffff8115d945>] kmem_cache_free+0x125/0x210
    [ 1136.321436]  [<ffffffff8112e7f2>] kmem_cache_destroy+0x92/0x100
    [ 1136.321443]  [<ffffffffa046b806>] nf_conntrack_cleanup_net_list+0x126/0x160 [nf_conntrack]
    [ 1136.321449]  [<ffffffffa046c43d>] nf_conntrack_pernet_exit+0x6d/0x80 [nf_conntrack]
    [ 1136.321453]  [<ffffffff81511cc3>] ops_exit_list.isra.3+0x53/0x60
    [ 1136.321457]  [<ffffffff815124f0>] cleanup_net+0x100/0x1b0
    [ 1136.321460]  [<ffffffff8106b31e>] process_one_work+0x18e/0x430
    [ 1136.321463]  [<ffffffff8106bf49>] worker_thread+0x119/0x390
    [ 1136.321467]  [<ffffffff8106be30>] ? manage_workers.isra.23+0x2a0/0x2a0
    [ 1136.321470]  [<ffffffff8107210b>] kthread+0xbb/0xc0
    [ 1136.321472]  [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
    [ 1136.321477]  [<ffffffff8161b8fc>] ret_from_fork+0x7c/0xb0
    [ 1136.321479]  [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
    [ 1136.321481] ---[ end trace 25f53c192da70825 ]---
    
    Reported-by: Linus Torvalds <torvalds>
    Signed-off-by: Pablo Neira Ayuso <pablo>
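
The ordering problem described in the commit message above can be illustrated with a small deterministic model. This is only a sketch under stated assumptions, not kernel code: the class and function names are hypothetical, `count` stands in for net->ct.count, and the race is simulated by running the cleanup check between the two steps of a free rather than with real threads.

```python
class ConntrackCacheModel:
    """Toy model of an object cache guarded by a live-object counter."""

    def __init__(self):
        self.count = 0          # stands in for net->ct.count (assumption)
        self.live = set()       # objects still allocated from the cache
        self.destroyed = False  # True once the cache has been torn down

    def alloc(self, name):
        self.live.add(name)
        self.count += 1

    def try_cleanup(self):
        """Destroy the cache if the counter reads zero.

        Returns None if the counter is still positive (the caller would
        retry later), otherwise the set of objects that were still live
        when the cache was destroyed. A non-empty set models the bug.
        """
        if self.count > 0:
            return None
        self.destroyed = True
        return set(self.live)


def free_steps(cache, name, decrement_first):
    """Free one object in two steps, yielding after each step so that a
    caller can interleave the cleanup check between them."""
    def decrement():
        cache.count -= 1

    def release():
        cache.live.discard(name)

    steps = [decrement, release] if decrement_first else [release, decrement]
    for step in steps:
        step()
        yield


def run(decrement_first):
    """Run one free with the cleanup check squeezed in between its steps."""
    cache = ConntrackCacheModel()
    cache.alloc("ct0")
    freeing = free_steps(cache, "ct0", decrement_first)
    next(freeing)                 # first half of the free has run
    result = cache.try_cleanup()  # cleanup observes the intermediate state
    next(freeing)                 # second half of the free
    return result


if __name__ == "__main__":
    print("decrement before release:", run(decrement_first=True))
    print("decrement after release: ", run(decrement_first=False))
```

With the counter decremented first, the cleanup check sees zero while the object is still allocated and tears the cache down underneath it. With the order reversed, as the commit title describes, the check sees a positive counter and backs off. Note that comment 20 below reports the panic still occurring with this change applied, so the model illustrates this commit only, not the eventual fix.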

Comment 19 Aristeu Rozanski 2014-01-16 16:22:00 UTC
So far, unable to reproduce it locally

Comment 20 Alexander Larsson 2014-02-05 20:00:58 UTC
As per https://github.com/dotcloud/docker/issues/2960#issuecomment-33854171 this still happens in 3.13.1, which has the fix from comment 18, so that did not fix it.

Comment 21 Josh Poimboeuf 2014-02-05 20:50:38 UTC
Alex, can you enable kdump and try to recreate?

Comment 22 Justin M. Forbes 2014-02-24 13:51:18 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.13.4-200.fc20.  Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 23 Alexander Larsson 2014-03-03 14:15:57 UTC
I've not tried 3.13.4, but I've seen it in 3.12.10-300.fc20 (I have not tried later kernels yet), and others have seen it in 3.13.1, so I don't think this is fixed.

Comment 24 Alexander Larsson 2014-03-14 10:14:22 UTC
Got this in 3.13.5-202.fc20.x86_64 too.

Comment 25 Alexander Larsson 2014-03-27 17:49:58 UTC
I asked in the Docker meeting today who had seen this: a bunch of people had never seen it, and some had. The one thing that seemed consistent among those who had not seen the panic was running the kernel in a VM. So maybe this only triggers on bare metal.

Comment 26 Samu Kallio 2014-04-04 08:38:00 UTC
I have been encountering this same issue (LXC, non-Docker) on Amazon EC2 (see https://bugzilla.kernel.org/show_bug.cgi?id=65191). So this is definitely happening on PV as well.

Comment 27 Justin M. Forbes 2014-05-21 19:37:00 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.14.4-200.fc20.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 28 Alexander Larsson 2014-06-03 14:03:02 UTC
This is still happening. 

This seems to be the upstream bug: https://bugzilla.kernel.org/show_bug.cgi?id=65191

Comment 29 Alexander Larsson 2014-06-06 08:34:14 UTC
Seems like this has a possible fix at:
https://bugzilla.kernel.org/show_bug.cgi?id=65191

Comment 31 Josh Boyer 2014-06-30 19:18:30 UTC
The upstream netfilter maintainer has said he has the patch queued to be sent to stable.  I'll try and get it queued for the 3.15.y rebase we're working on for F20.

Comment 32 Josh Boyer 2014-06-30 19:22:18 UTC
Added in Fedora git.  F19 will pick it up via the next 3.14.y stable rebase.

Comment 33 Fedora Update System 2014-07-02 02:08:36 UTC
kernel-3.15.3-200.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.15.3-200.fc20

Comment 34 Fedora Update System 2014-07-03 04:06:48 UTC
Package kernel-3.15.3-200.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.15.3-200.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-8017/kernel-3.15.3-200.fc20
then log in and leave karma (feedback).

Comment 35 Fedora Update System 2014-07-08 00:59:00 UTC
kernel-3.15.3-200.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.
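
To check whether a Fedora 20 machine is already running a kernel at or beyond the fixed version named above, a comparison along the following lines may help. It is a rough sketch: the function names are hypothetical, the parsing is a simplification that compares only the leading numeric fields and is not RPM's version-comparison algorithm, and the threshold applies to Fedora 20 kernels only (other releases picked up the fix in different builds).

```python
import platform
import re

# Fixed-in version for Fedora 20, taken from this bug report.
FIXED_F20 = "3.15.3-200.fc20"


def release_key(release):
    """Turn e.g. '3.15.3-200.fc20.x86_64' into (3, 15, 3, 200).

    Simplification: only the three version fields and the first numeric
    part of the release are compared.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(release, fixed=FIXED_F20):
    """Return True if `release` is at or beyond the fixed version."""
    return release_key(release) >= release_key(fixed)


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` prints
    try:
        print(running, "has fix:", has_fix(running))
    except ValueError as err:
        print(err)
```

On a system whose release string does not follow the Fedora pattern, the script reports that it cannot parse the string instead of guessing.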

Comment 36 Konstantin Gribov 2014-12-29 01:38:04 UTC
When will the fix be available in RHEL 7/CentOS 7? It is absent from 3.10.0-123.13.2.el7.