Bug 167389 - Kernel panic with high network load
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Platform: i686 Linux
Priority: medium  Severity: high
Assigned To: David Miller
QA Contact: Brian Brock
Duplicates: 167396
Reported: 2005-09-02 05:48 EDT by Ilkka Pietikäinen
Modified: 2011-08-19 09:27 EDT
CC: 9 users

Doc Type: Bug Fix
Last Closed: 2010-06-07 00:52:05 EDT


Attachments
- More detailed dump (49.73 KB, text/plain), 2005-09-07 03:31 EDT, Ilkka Pietikäinen
- Proposed fix (409 bytes, patch), 2005-09-18 21:07 EDT, James Morris
- workaround for some problems (721 bytes, patch), 2005-09-19 09:06 EDT, Nuutti Kotivuori
Description Ilkka Pietikäinen 2005-09-02 05:48:01 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050827 Firefox/1.0.6

Description of problem:
When we execute high-load tests overnight with our cluster, every morning at least one of the nodes has crashed with a kernel panic. Here is the output from netdump:

Unable to handle kernel NULL pointer dereference at virtual address
00000018
 printing eip:
c01a387f
*pde = 3663c001
Oops: 0000 [#1]
SMP
Modules linked in: arpt_mangle arptable_filter arp_tables iptable_filter
ip_tables ip_queue parport_pc lp parport netconsole netdump autofs4
i2c_dev i2c_core sunrpc dm_mod button battery ac
EIP is at selinux_ip_postroute_last+0x6a/0x1de
eax: 00000000   ebx: 00000000   ecx: f7b04bb0   edx: 00000003
esi: e81ee680   edi: c0455780   ebp: 00000004   esp: f7b04b8c
ds: 007b   es: 007b   ss: 0068
Process dispatcher (pid: 2625, threadinfo=f7b04000 task=f763f3b0)
Stack: 00000000 ed9b8e00 00000000 dc00cd80 00000002 f88a965a 37e51e9e
00000000
       00000206 000000f5 f88a983c c026f163 e81f0580 f72956e8 c02c3188
000000f2
 __kfree_skb+0xf4/0xf7
 [<c02c3188>] packet_rcv+0x2ca/0x2d4
 [<c0273ca8>] dev_queue_xmit_nit+0xc1/0xd3
 [<c01a3a02>] selinux_ipv4_postroute_last+0xf/0x13
 [<c028d11f>] ip_finish_output2+0x0/0x16d
 [<c027cb23>] nf_iterate+0x40/0x81
 [<c028d11f>] ip_finish_output2+0x0/0x16d
 [<c027ce21>] nf_hook_slow+0x47/0xb4
 [<c028d11f>] ip_finish_output2+0x0/0x16d
 [<c028d116>] ip_finish_output+0x1a5/0x1ae
 [<c028d11f>] ip_finish_output2+0x0/0x16d
 [<c028cf66>] dst_output+0xf/0x1a
 [<c027cfdb>] nf_reinject+0x14d/0x1a9
 [<f891401e>] ipq_issue_verdict+0x1e/0x2b [ip_queue]
 [<f8914676>] ipq_set_verdict+0x53/0x5a [ip_queue]
 [<f891472c>] ipq_receive_peer+0x3d/0x46 [ip_queue]
 [<f891487d>] ipq_rcv_sk+0xfc/0x175 [ip_queue]
 [<c0285b11>] netlink_data_ready+0x14/0x44
 [<c028525b>] netlink_sendskb+0x52/0x6c
 [<c028592c>] netlink_sendmsg+0x254/0x263
 [<c011dcf5>] __wake_up+0x29/0x3c
 [<c026b92d>] sock_sendmsg+0xdb/0xf7
 [<c0285ae9>] netlink_recvmsg+0x1ae/0x1c2
 [<c026ba64>] sock_recvmsg+0xef/0x10c
 [<c02c7d34>] common_interrupt+0x18/0x20
 [<c011f6ee>] autoremove_wake_function+0x0/0x2d
 [<c02709ba>] verify_iovec+0x76/0xc2
 [<c026d07c>] sys_sendmsg+0x1ee/0x23b
 [<c026b4fe>] move_addr_to_user+0x67/0x7f
 [<c01335b7>] get_futex_key+0x39/0x108
 [<c0133b04>] unqueue_me+0x73/0x79
 [<c014b9b5>] find_extend_vma+0x12/0x4f
 [<c01335b7>] get_futex_key+0x39/0x108
 [<c026d465>] sys_socketcall+0x1c1/0x1dd
 [<c0125351>] sys_gettimeofday+0x53/0xac
 [<c02c7377>] syscall_call+0x7/0xb
 [<c02c007b>] unix_release_sock+0x15a/0x201
Code: 89 d3 83 c3 2c 0f 84 8c 01 00 00 8b 44 24 7c 31 c9 8d 54 24 24 e8
df 29

The kernel version was 2.6.9-11, patched with the patch from https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=159815
The kernel was compiled for SMP i686.

Version-Release number of selected component (if applicable):
kernel-2.6.9-11

How reproducible:
Always

Steps to Reproduce:
1. Compile the kernel
2. Start the tests with high load (our application does MySQL clustering; there are a lot of connections to the database, the CPU is used heavily, and all connections go through the ip_queue module).

Actual Results:  Within 24 hours the panic occurs.

Expected Results:  The kernel should continue working.

Additional info:

This may be related to

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=159815
Comment 1 Suzanne Hillman 2005-09-02 13:37:56 EDT
*** Bug 167396 has been marked as a duplicate of this bug. ***
Comment 2 Ilkka Pietikäinen 2005-09-07 03:04:55 EDT
We are using the tg3 driver; one test was also executed using the basp driver, with
similar results.
Comment 3 Ilkka Pietikäinen 2005-09-07 03:31:30 EDT
Created attachment 118541 [details]
More detailed dump
Comment 4 Nuutti Kotivuori 2005-09-16 07:39:28 EDT
I have been debugging the problem for EMIC as a consultant for the past week.

The bug does not appear on uniprocessor kernels, only on SMP kernels. It does
not appear on vanilla 2.6.13.1, but does appear on 2.6.9-11.EL1. The patch
mentioned in bug 159815 is necessary, as the machine otherwise crashes much
sooner due to that bug. Disabling SELinux hides the bug, but the machine then
crashed in a totally different way that may or may not be related. All of these
results seem solid, but they are inconclusive, as reproduction of the bug is
random and may take more than 8 hours.

Patrick McHardy provided a good dissertation on what is happening, and I have
confirmed that this is actually happening. The crash happens in
security/selinux/hooks.c:selinux_ip_postroute_last. Here is the piece of code
that matters:

        sk = skb->sk;
        if (!sk)
                goto out;

        sock = sk->sk_socket;
        if (!sock)
                goto out;

        inode = SOCK_INODE(sock);
        if (!inode)
                goto out;

        err = sel_netif_sids(dev, &if_sid, NULL);
        if (err)
                goto out;

        isec = inode->i_security;

        switch (isec->sclass) {

The crash happens on the last line, because isec is NULL. The only reason for
inode->i_security to be NULL should be that the inode has been freed, and the
likely cause of that is that something fails to increment the use count on the
socket, even though it is referenced by the skb.

Our hypothesis is that a packet generated by a raw socket is sent, then queued
to userspace via the QUEUE target. While the packet is waiting for processing,
the socket is freed. The packet then gets reinjected into the kernel and blows
up on the output path because the socket it references no longer exists. But we
haven't been able to find evidence of any mistake in reference handling, or
proof that this is happening.

In any case, the problem is very real for us, and of critical importance.
Comment 5 Nuutti Kotivuori 2005-09-16 10:47:50 EDT
Packet dumps of the offending packets show that they are locally generated
outbound TCP FIN,ACK packets. Somehow the socket they are generated from gets
freed before the packet reaches the network, hence the crash.
Comment 6 David Miller 2005-09-16 17:47:48 EDT
The i_security field can also be NULL'd out if the ->inode_free_security()
method is invoked, which in turn is invoked by destroy_inode(). That means,
as you say, that the inode is released too early for some reason, likely a
missing reference to the socket.

I reverified the socket refcounting in the output path for TCP FIN
packets.  In tcp_transmit_skb() it does the right thing by calling
skb_set_owner_w(), which grabs a socket reference.  At that point,
unless netfilter does something weird with skb->sk or its reference,
the socket should stay around until the skb is freed up.

I also verified ip_queue.c and it does the correct thing if it needs
to expand and thus copy the SKB, namely it propagates the reference
to the socket from the old SKB to the newly allocated expanded one.
So that should be fine as well.

I really can't see anything that could allow a socket to be released
early.

Are you absolutely sure it is a TCP FIN packet that kills the machine?
Perhaps you can add some debugging to the selinux module where it
crashes, dumping out the TCP header when inode->i_security is NULL so
we can see exactly what kind of packet this is.

Do you have any other interesting netfilter rules installed on this
machine other than the ip_queue stuff?  That could play a part in this
bug as well.   Does your ip_queue handler support IPv6 as well? That
is another possibility, as IPv6's netfilter queue module has the exact
same nf_reinject() bug and needs the exact same fix the IPv4 side got
in bug 159815.
Comment 7 Nuutti Kotivuori 2005-09-17 06:33:30 EDT
(In reply to comment #6)
> The i_security field can be NULL'd out if the ->inode_free_security()
> method is invoked as well.  Which also is invoked by destroy_inode(),
> meaning as you say that the inode is released too early for some reason,
> likely a missing reference to the socket.

Right, that was my mistake. I didn't realize the inode could be freed
separately from the socket.

> I really can't see anything that could allow a socket to be released
> early.

I tried to verify this to the best of my ability as well and came to the same
conclusion. And the TCP output path is certainly not the least-tested path, either.

> Are you absolutely sure it is a TCP FIN packet that kills the machine?
> Perhaps you can add some debugging to the selinux module where it
> crashes, dumping out the TCP header when inode->i_security is NULL so
> we can see exactly what kind of packet this is.

I did exactly that. That's how I found out what kind of a packet it is - there
is no doubt about it, it is always a TCP FIN+ACK packet with no data in it.

> Do you have any other interesting netfilter rules installed on this
> machine other than the ip_queue stuff?  That could play a part in this
> bug as well.   Does your ip_queue handler support ipv6 as well? That
> is another possibility as IPV6's netfilter queue module has the same
> exact nf_reinject() bug and needs the same exact fix the ipv4 side got
> in bug 159815

No other netfilter rules. IPv6 is in the kernel, but unused, and ip6tables
is not used. The only additional things to consider are that pcap is
used on the interface as well, and UDP packets are sent via a raw socket.

I've had the suspicion that the packet socket rejecting the packet (which
frees an skb) somehow manages to decrement the use count on the socket or
invoke the destructor or something, which would make this happen - but my
knowledge of the kernel is limited and I couldn't locate anything wrong
with it.
Comment 8 James Morris 2005-09-18 21:07:34 EDT
Created attachment 118954 [details]
Proposed fix

Backport of the upstream fix for IPv4.
See also:
https://lists.netfilter.org/pipermail/netfilter-devel/2005-July/020513.html
Comment 9 Nuutti Kotivuori 2005-09-19 02:39:26 EDT
(In reply to comment #8)
> Created an attachment (id=118954) [edit]
> Proposed fix
> 
> Backport of the upstream fix for IPv4.
> See also:
> https://lists.netfilter.org/pipermail/netfilter-devel/2005-July/020513.html

I just want to clarify: that patch is already included in the kernels being
tested. Without it, the kernel crashes a lot sooner. The bug reported here is a
problem discovered after fixing that one, since the kernel still crashes.
Comment 10 Nuutti Kotivuori 2005-09-19 04:51:29 EDT
A brief note on the reproducibility of this bug. In a three-node cluster
running stress tests for 36 hours, there were 4 packets with i_security == NULL
on the first node, 6 packets on the second node and 9 packets on the last node.
All of these were outgoing TCP packets with the FIN and ACK bits set. Included
is a dump of one such packet.

0x       0: 45 00 00 34 d8 65 40 00 40 06 17 02 c0 a8 65 0a
0x      10: c0 a8 65 01 0c ea af d3 c4 3b af 7c 46 3f 23 01
0x      20: 80 11 08 cc 6b a4 00 00 01 01 08 0a 07 ba 53 21
0x      30: 01 35 c1 28
Comment 11 Nuutti Kotivuori 2005-09-19 09:06:07 EDT
Created attachment 118974 [details]
workaround for some problems

It seems that fixing this bug has taken far too long now, and some sort of
solution, albeit temporary, has to be found.

I have attached a patch which keeps the kernel running far longer than it
otherwise would, but does nothing to fix the actual problem. If the actual
cause is not found, kernels with this patch will end up being shipped to
customers.

As far as I can see, if the problematic packets never traverse a different code
path, and if the freed inode space is not reused in the tiny window before the
packet is sent to the network, this patch should be safe and create no
additional problems.
