Bug 487414

Summary: Panic/Crash 2.6.9-78.0.8.ELsmp in tcp_recvmsg() during heavy load tcp/ip
Product: Red Hat Enterprise Linux 4
Reporter: Seamus Gallagher <sgallagher>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: low
Version: 4.7
CC: davem, djuran, heinzm, rdassen, sgallagher, tgraf
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-10-21 19:31:57 UTC
Attachments:
inet.c source file which is caller of sock_recvmsg
One log dump as provided by netdump
Second log dump as provided by netdump
The kernel source file: /usr/src/debug/kernel-2.6.9/linux-2.6.9/net/ipv4/tcp.c
kernel source file: include/net/sock.h
kernel source file: include/linux/skbuff.h

Description Seamus Gallagher 2009-02-25 21:06:35 UTC
Created attachment 333228 [details]
inet.c source file which is caller of sock_recvmsg

Description of problem:

The crash happens at the following location (netdump log):

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:  
<ffffffff802e0728>{tcp_recvmsg+1577} 
PML4 11168c067 PGD 111675067 PMD 0  
Oops: 0002 [1] SMP  
CPU 0  
Modules linked in: netconsole netdump streams_tp0(U) streams_sfx(U) streams_clnp(U) streams_cots(U) streams_timod(U) streams_ldl(U) parport_pc lp parport streams_inet(U) streams_uw7compat(U) streams_svr4compat(U) streams_svr3compat(U) streams_suncompat(U) streams_osfcompat(U) streams_mpscompat(U) streams_maccompat(U) streams_liscompat(U) streams_irixcompat(U) streams_hpuxcompat(U) streams_aixcompat(U) autofs4 streams(U) specfs(U) i2c_dev i2c_core md5 ipv6 dlm(U) cman(U) sunrpc dm_multipath dm_mod button battery ac uhci_hcd ehci_hcd hw_random tg3 floppy ata_piix libata ext3 jbd cciss sd_mod scsi_mod 
Pid: 3120, comm: kstreamd/0 Tainted: GF     2.6.9-78.0.8.ELsmp 
RIP: 0010:[<ffffffff802e0728>] <ffffffff802e0728>{tcp_recvmsg+1577} 
RSP: 0018:0000010114923ad8  EFLAGS: 00010202 
RAX: 0000000000000000 RBX: 00000000000005a8 RCX: 0000000000000019 
RDX: 0000000000000000 RSI: 000001010de6adfc RDI: 000001010df7f680 
RBP: 0000010111e8fcb0 R08: 0000000000000000 R09: 0000010111e8fcb0 
R10: 0000000000000000 R11: 0000000000000000 R12: 0000010111e8f980 
R13: 00000000ffffff95 R14: 0000000000000000 R15: 000001010df7f680 
FS:  0000000000000000(0000) GS:ffffffff8050d800(0000) knlGS:0000000000000000 
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b 
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 
Process kstreamd/0 (pid: 3120, threadinfo 0000010114922000, task 0000010117313030) 
Stack: 00000001103a8a80 0000000000000000 0000010111e8fe48 00004000000005a8  
       0000000000000000 0000010114923dd8 0000000000000002 0000000000000000  
       0000000000000000 0000010114923dd8  
Call Trace:<ffffffff802b7d62>{sock_common_recvmsg+48} <ffffffff802b4808>{sock_recvmsg+284}  
       <ffffffff80133046>{try_to_wake_up+876} <ffffffffa01ea97a>{:streams:kmem_alloc+74}  
       <ffffffff801360cc>{autoremove_wake_function+0} <ffffffffa02654fa>{:streams_inet:ss_allocb_func+58}  
       <ffffffffa026e591>{:streams_inet:ss_sock_recvmsg+785}  
       <ffffffff8014c5b4>{keventd_create_kthread+0} <ffffffffa0274f17>{:streams_inet:ss_rsrv+23}  
       <ffffffffa01e3d6e>{:streams:__runqueues+446} <ffffffffa01ed110>{:streams:kstreamd+0}  
       <ffffffff8014c5b4>{keventd_create_kthread+0} <ffffffffa01ed165>{:streams:kstreamd+85}  
       <ffffffffa01ed110>{:streams:kstreamd+0} <ffffffff8014c58b>{kthread+200}  
       <ffffffff80110fd3>{child_rip+8} <ffffffff8014c5b4>{keventd_create_kthread+0}  
       <ffffffff8014c4c3>{kthread+0} <ffffffff80110fcb>{child_rip+0}  
        
 
Code: 48 89 10 48 89 42 08 e8 13 7e fd ff eb 45 48 8b 54 24 10 ff  
RIP <ffffffff802e0728>{tcp_recvmsg+1577} RSP <0000010114923ad8> 
CR2: 0000000000000000 

Research done so far:

It seems the socket's sk->sk_receive_queue linked list can at times become
inconsistent (i.e. a prev pointer is reset when it is not expected to be),
which can lead to a crash. One theoretical reason for this is parallel access
to the socket queue, perhaps more likely under high load.

The crash is essentially 100% reproducible on one system, but not on others.
Our feeling is that the hardware makes a difference: the problem could be
related to SMP, the number of processors, and parallel access to certain data
structures. Our review so far suggests that tcp_recvmsg() peeks at an element
of the socket receive_queue but does not take it off the queue until some time
later; perhaps this window leaves room for parallel access (see the sketch
below).
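
A rough sketch of that window (heavily paraphrased from the attached tcp.c for
illustration; see the attachment for the real code):

        lock_sock(sk);
        do {
                /* peek at the head of the queue without removing the skb */
                skb = skb_peek(&sk->sk_receive_queue);
                ...
                /* copy the data out to the caller; the skb stays on the
                 * queue for the duration of the copy */
                ...
        found_fin_ok:
                ++*seq;
                /* only now is the skb unlinked and freed */
                if (!(flags & MSG_PEEK))
                        sk_eat_skb(sk, skb);
                break;
        } while (len > 0);
        ...
        release_sock(sk);

If anything else unlinks or re-queues that skb between the skb_peek() and the
sk_eat_skb(), the next/prev pointers seen by __skb_unlink() are stale.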
 
Our review of the problem so far (details below):

Details: 
 
rpm -qpl kernel-debuginfo-2.6.9-78.0.8.EL.x86_64.rpm | grep tcp.c 
... 
/usr/src/debug/kernel-2.6.9/linux-2.6.9/net/ipv4/tcp.c 
 
Verifying with gdb on /usr/lib/debug/lib/modules/2.6.9-78.0.8.ELsmp/vmlinux, using the call trace as a guide:
 
net/ipv4/tcp.c (from: kernel-debuginfo-2.6.9-78.0.8.EL.x86_64.rpm): 
 
tcp_recvmsg() 
{ 
... 
        found_fin_ok: 
                /* Process the FIN. */ 
                ++*seq; 
                if (!(flags & MSG_PEEK)) 
Crash here -->                        sk_eat_skb(sk, skb); 
} 
 
Inside ./include/net/sock.h: 
 
static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb) 
{ 
-->        __skb_unlink(skb, &sk->sk_receive_queue); 
        __kfree_skb(skb); 
} 
 
extern void        skb_unlink(struct sk_buff *skb); 
static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head  
*list) 
{ 
        struct sk_buff *next, *prev; 
 
        list->qlen--; 
        next       = skb->next; 
        prev       = skb->prev; 
        skb->next  = skb->prev = NULL; 
        skb->list  = NULL; 
        next->prev = prev; 
Crash here, prev=0! -->        prev->next = next; 
} 
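
Our reading of the faulting instruction (our interpretation, not something stated in the dump itself): if the first bytes on the "Code:" line are the instruction at the reported RIP, then 48 89 10 decodes to

        mov    %rdx,(%rax)        /* prev->next = next, with prev in RAX */

and the register dump shows RAX = 0000000000000000. Since next is the first member of struct sk_buff, prev->next sits at offset 0, so a store through prev == NULL lands at address 0, which matches CR2 = 0000000000000000 and the "NULL pointer dereference at 0000000000000000" message. RDX (which would hold next) is also 0 in the dump, as if the skb had already been unlinked once before.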


Version-Release number of selected component (if applicable):

2.6.9-78.0.8.ELsmp 

How reproducible:

Not very reproducible in general; it appears more likely on SMP-enabled systems with a higher number of processors.


Comment 1 Seamus Gallagher 2009-02-25 21:08:14 UTC
I attached a file, inet.c, which shows the mechanism used to call sock_recvmsg() and which, in certain cases, causes a crash.
 
In particular this function is used: 
 
STATIC int
ss_recvmsg(ss_t *ss, struct msghdr *msg, int size)
{
        int res;
        int sflags = MSG_DONTWAIT | MSG_NOSIGNAL;

        ensure(ss, return (-EFAULT));
        ensure(ss->sock, return (-EPROTO));
        ensure(ss->sock->ops, return (-EFAULT));
        ensure(ss->sock->ops->recvmsg, return (-EFAULT));
        ensure(ss->sock->sk, return (-EFAULT));
        ensure(ss->sock->sk->sk_prot, return (-EFAULT));
        ensure(ss->sock->sk->sk_prot->recvmsg, return (-EFAULT));
        {
                mm_segment_t fs = get_fs();

                set_fs(KERNEL_DS);
                res = sock_recvmsg(ss->sock, msg, size, sflags);
                set_fs(fs);
        }
        if (res < 0)
                if (inet_debug & DEBUG_CONN)
                        printd(("%s: %p: ERROR: from sock->ops->recvmsg %d\n", DRV_NAME, ss, res));
        return (res);
}
 
From a user perspective this is governed by a STREAMS framework which has a kernel-thread daemon called kstreamd; you can also see this module name in the stack trace. One kstreamd thread is started per processor on an SMP system. Periodically the function ss_rsrv() is called, which polls the socket for incoming data.
 
ss_rsrv(queue_t *q) 
 ss_r_prim() 
  ss_r_read() 
   ss_sock_recvmsg(queue_t *q) 
    sock_recvmsg() <--- The crash sometimes happens in tcp_recvmsg() while traversing the sk_receive_queue data structure; this is visible in the call trace.
 
The source file inet.c comes from the LFS (Linux Fast Streams) framework.

Comment 2 Seamus Gallagher 2009-02-25 21:09:28 UTC
Created attachment 333230 [details]
One log dump as provided by netdump

Comment 3 Seamus Gallagher 2009-02-25 21:10:01 UTC
Created attachment 333231 [details]
Second log dump as provided by netdump

Comment 4 Seamus Gallagher 2009-02-25 21:25:21 UTC
The way in which it was reproduced:

1. Remove the kernel modules (including the strinet module that inet.c is part of)
2. Reboot
3. Install kernel modules (includes the strinet modules)
4. Start a test sequence which includes 120 tcp/ip connections 
   sending moderate to heavy traffic. 

This test traverses through various kernel layers/modules:

tp0 --> strinet --> tcp --> [other system] --> tcp --> strinet --> tp0

The service provides RFC1006 (OSI over TCP/IP).

Comment 5 Seamus Gallagher 2009-02-25 21:40:36 UTC
Created attachment 333242 [details]
The kernel source file: /usr/src/debug/kernel-2.6.9/linux-2.6.9/net/ipv4/tcp.c

Comment 6 Seamus Gallagher 2009-02-25 21:43:24 UTC
Created attachment 333243 [details]
kernel source file: include/net/sock.h

Comment 7 Seamus Gallagher 2009-02-25 21:44:30 UTC
Created attachment 333244 [details]
kernel source file: include/linux/skbuff.h

Comment 9 Heinz Mauelshagen 2009-03-06 11:39:40 UTC
Yes, the trace shows a race on sk_buff unlinking, which is more likely to trigger the more parallel the system is.

Hint on preventing this, from the vanilla kernel source:

 * sk_eat_skb - Release a skb if it is no longer needed
<SNIP>
 * This routine must be called with interrupts disabled or with the socket
 * locked so that the sk_buff queue operation is ok.   <-----

Do you lock_sock()/release_sock() or disable interrupts?
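
For illustration, the locked pattern that comment asks about would look roughly
like this (a minimal sketch only, not code taken from the kernel or from the
strinet module):

        struct sk_buff *skb;

        lock_sock(sk);
        skb = skb_peek(&sk->sk_receive_queue);
        if (skb != NULL) {
                /* ... consume the data in skb ... */
                sk_eat_skb(sk, skb);    /* unlink + free under the socket lock */
        }
        release_sock(sk);

(or, per the kernel comment, with interrupts disabled around the queue
operation instead of holding the socket lock).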

Comment 10 Seamus Gallagher 2009-03-06 18:46:45 UTC
Regarding the question:

"Do you lock_sock()/release_sock() or disable interrupts ? "

The crash occurs in the function tcp_recvmsg(), which itself does a lock_sock() at the start and a release_sock() at the end. Neither the function nor the file has been changed from the vanilla version (see the attachment).

That is also what is puzzling about this problem. But at least you seem to confirm that it is a race condition in the tcp kernel code, and indeed it seems more likely to appear on SMP-enabled servers with more processors.

If you (or anyone else) have ideas about what could cause the race condition on the sk_buff queue (maybe something adding to the queue from below TCP while the kernel module tried to read it from above in tcp_recvmsg()?), please let us know.

Comment 11 Neil Horman 2009-08-24 16:58:12 UTC
I'm sorry, I'm not sure what we can do here.  The entire TCP v4 receive path is protected by a call to bh_lock_sock() on the socket lock in tcp_v4_rcv(), which is the lowest entry point for frames that are recognized as TCP.  Any calls which enqueue frames to a given socket are made from within tcp_v4_do_rcv() and the functions it calls, so it is all protected by the socket lock.  I'm not saying it is a certainty that there is no path that misses that lock, but a visual inspection says there isn't (there are not many paths from tcp_v4_rcv() to any function which enqueues to a socket buffer).  Given the nature of the reproducer, I think it is far more likely that something in the openss7 streams modules is inappropriately altering the sk_receive_queue.
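
Roughly, the pattern in net/ipv4/tcp_ipv4.c being referred to (paraphrased from the 2.6.9 source for illustration, not a verbatim quote):

        /* in tcp_v4_rcv(), after the socket lookup */
        bh_lock_sock(sk);
        ret = 0;
        if (!sock_owned_by_user(sk)) {
                if (!tcp_prequeue(sk, skb))
                        ret = tcp_v4_do_rcv(sk, skb);   /* may enqueue to sk_receive_queue */
        } else
                sk_add_backlog(sk, skb);        /* deferred until the lock holder releases the sock */
        bh_unlock_sock(sk);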

I would say the best place to start is by obtaining a core from the system and inspecting the sysrq-t output (the backtraces of all processes), e.g. via Alt-SysRq-T on the console or "echo t > /proc/sysrq-trigger".  With luck, we might catch part of a backtrace that shows which context/path is modifying sk_receive_queue without holding the lock.

Alternatively, reproducing this bug without the openss7 modules (which we don't ship) would serve as conclusive proof that the bug is not in that code.

Comment 12 Neil Horman 2009-10-21 19:31:57 UTC
closing, no response.