Created attachment 333228 [details]
inet.c source file, which is the caller of sock_recvmsg

Description of problem:
The crash happens at the following location (netdump log):

Unable to handle kernel NULL pointer dereference at 0000000000000000
RIP: <ffffffff802e0728>{tcp_recvmsg+1577}
PML4 11168c067 PGD 111675067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: netconsole netdump streams_tp0(U) streams_sfx(U) streams_clnp(U) streams_cots(U) streams_timod(U) streams_ldl(U) parport_pc lp parport streams_inet(U) streams_uw7compat(U) streams_svr4compat(U) streams_svr3compat(U) streams_suncompat(U) streams_osfcompat(U) streams_mpscompat(U) streams_maccompat(U) streams_liscompat(U) streams_irixcompat(U) streams_hpuxcompat(U) streams_aixcompat(U) autofs4 streams(U) specfs(U) i2c_dev i2c_core md5 ipv6 dlm(U) cman(U) sunrpc dm_multipath dm_mod button battery ac uhci_hcd ehci_hcd hw_random tg3 floppy ata_piix libata ext3 jbd cciss sd_mod scsi_mod
Pid: 3120, comm: kstreamd/0 Tainted: GF  2.6.9-78.0.8.ELsmp
RIP: 0010:[<ffffffff802e0728>] <ffffffff802e0728>{tcp_recvmsg+1577}
RSP: 0018:0000010114923ad8  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00000000000005a8 RCX: 0000000000000019
RDX: 0000000000000000 RSI: 000001010de6adfc RDI: 000001010df7f680
RBP: 0000010111e8fcb0 R08: 0000000000000000 R09: 0000010111e8fcb0
R10: 0000000000000000 R11: 0000000000000000 R12: 0000010111e8f980
R13: 00000000ffffff95 R14: 0000000000000000 R15: 000001010df7f680
FS:  0000000000000000(0000) GS:ffffffff8050d800(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process kstreamd/0 (pid: 3120, threadinfo 0000010114922000, task 0000010117313030)
Stack: 00000001103a8a80 0000000000000000 0000010111e8fe48 00004000000005a8
       0000000000000000 0000010114923dd8 0000000000000002 0000000000000000
       0000000000000000 0000010114923dd8
Call Trace:
 <ffffffff802b7d62>{sock_common_recvmsg+48}
 <ffffffff802b4808>{sock_recvmsg+284}
 <ffffffff80133046>{try_to_wake_up+876}
 <ffffffffa01ea97a>{:streams:kmem_alloc+74}
 <ffffffff801360cc>{autoremove_wake_function+0}
 <ffffffffa02654fa>{:streams_inet:ss_allocb_func+58}
 <ffffffffa026e591>{:streams_inet:ss_sock_recvmsg+785}
 <ffffffff8014c5b4>{keventd_create_kthread+0}
 <ffffffffa0274f17>{:streams_inet:ss_rsrv+23}
 <ffffffffa01e3d6e>{:streams:__runqueues+446}
 <ffffffffa01ed110>{:streams:kstreamd+0}
 <ffffffff8014c5b4>{keventd_create_kthread+0}
 <ffffffffa01ed165>{:streams:kstreamd+85}
 <ffffffffa01ed110>{:streams:kstreamd+0}
 <ffffffff8014c58b>{kthread+200}
 <ffffffff80110fd3>{child_rip+8}
 <ffffffff8014c5b4>{keventd_create_kthread+0}
 <ffffffff8014c4c3>{kthread+0}
 <ffffffff80110fcb>{child_rip+0}

Code: 48 89 10 48 89 42 08 e8 13 7e fd ff eb 45 48 8b 54 24 10 ff
RIP <ffffffff802e0728>{tcp_recvmsg+1577} RSP <0000010114923ad8>
CR2: 0000000000000000

Research done so far:
It seems the socket's sk->sk_receive_queue list can at times become inconsistent (i.e. a prev pointer is reset when this is not expected), which can lead to a crash. One theoretical explanation is parallel access to the socket queue (perhaps more likely under high load). The problem appears to be 100% reproducible on one system, but not on others; our feeling is that the hardware makes a difference. The issue could be related to SMP, the number of processors, and parallel access to certain data structures: our review shows that tcp_recvmsg() peeks at an element of the socket's receive_queue but does not take it off the queue until some time later.
Perhaps this leaves a window for parallel access.

Our review of this problem so far (details below):

rpm -qpl kernel-debuginfo-2.6.9-78.0.8.EL.x86_64.rpm | grep tcp.c
...
/usr/src/debug/kernel-2.6.9/linux-2.6.9/net/ipv4/tcp.c

Verified with gdb against /usr/lib/debug/lib/modules/2.6.9-78.0.8.ELsmp/vmlinux, using the call trace as a guide.

net/ipv4/tcp.c (from kernel-debuginfo-2.6.9-78.0.8.EL.x86_64.rpm):

tcp_recvmsg()
{
	...
	found_fin_ok:
		/* Process the FIN. */
		++*seq;
		if (!(flags & MSG_PEEK))
Crash here -->		sk_eat_skb(sk, skb);
}

Inside include/net/sock.h:

static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
{
   -->	__skb_unlink(skb, &sk->sk_receive_queue);
	__kfree_skb(skb);
}

Inside include/linux/skbuff.h:

extern void skb_unlink(struct sk_buff *skb);
static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head *list)
{
	struct sk_buff *next, *prev;

	list->qlen--;
	next = skb->next;
	prev = skb->prev;
	skb->next = skb->prev = NULL;
	skb->list = NULL;
	next->prev = prev;
Crash here, prev=0! -->	prev->next = next;
}

Version-Release number of selected component (if applicable):
2.6.9-78.0.8.ELsmp

How reproducible:
Not very reproducible. It appears to be more likely on SMP systems with a higher number of processors.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
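Purely to illustrate the failure mode, here is a minimal userspace sketch (not kernel code; struct node and unlink() are made-up stand-ins for sk_buff and __skb_unlink()). It shows one way prev can end up NULL: if the same element is ever unlinked twice, for instance by two contexts racing on the queue, the first unlink clears its next/prev pointers, so the relink step of the second unlink dereferences address 0, the same class of fault as in the oops (CR2 = 0000000000000000):

/* Userspace model of __skb_unlink(); all names here are hypothetical. */
#include <stdio.h>

struct node {
	struct node *next;
	struct node *prev;
};

static unsigned int qlen;	/* stands in for list->qlen */

/* Same pointer surgery as __skb_unlink() in include/linux/skbuff.h. */
static void unlink(struct node *n)
{
	struct node *next = n->next;
	struct node *prev = n->prev;

	qlen--;
	n->next = n->prev = NULL;
	next->prev = prev;	/* faults at address 0 if next was already NULLed */
	prev->next = next;	/* ditto for prev -- the annotated crash line     */
}

int main(void)
{
	struct node head, a;

	/* circular one-element list, like an sk_buff_head holding one skb */
	head.next = head.prev = &a;
	a.next = a.prev = &head;
	qlen = 1;

	unlink(&a);				/* fine: list becomes empty */
	printf("after 1st unlink: prev=%p next=%p\n",
	       (void *)a.prev, (void *)a.next);	/* both are now NULL        */
	unlink(&a);				/* 2nd unlink: NULL dereference,
						 * same signature as the oops */
	return 0;
}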
I attached a file, inet.c, which shows the mechanism used to call sock_recvmsg(), which in certain cases causes the crash. In particular this function is used:

STATIC int
ss_recvmsg(ss_t *ss, struct msghdr *msg, int size)
{
	int res;
	int sflags = MSG_DONTWAIT | MSG_NOSIGNAL;

	ensure(ss, return (-EFAULT));
	ensure(ss->sock, return (-EPROTO));
	ensure(ss->sock->ops, return (-EFAULT));
	ensure(ss->sock->ops->recvmsg, return (-EFAULT));
	ensure(ss->sock->sk, return (-EFAULT));
	ensure(ss->sock->sk->sk_prot, return (-EFAULT));
	ensure(ss->sock->sk->sk_prot->recvmsg, return (-EFAULT));
	{
		mm_segment_t fs = get_fs();

		set_fs(KERNEL_DS);
		res = sock_recvmsg(ss->sock, msg, size, sflags);
		set_fs(fs);
	}
	if (res < 0)
		if (inet_debug & DEBUG_CONN)
			printd(("%s: %p: ERROR: from sock->ops->recvmsg %d\n", DRV_NAME, ss, res));
	return (res);
}

From a user perspective this is governed by a STREAMS framework which has a kernel-thread daemon called kstreamd; this module name is also visible in the stack trace. One kstreamd thread is started per processor on the system (SMP system). Periodically the function ss_rsrv() is called, which polls the socket for incoming data:

ss_rsrv(queue_t *q)
  ss_r_prim()
    ss_r_read()
      ss_sock_recvmsg(queue_t *q)
        sock_recvmsg()   <--- Crash sometimes happens in tcp_recvmsg while traversing the sk_receive_queue data structure; this is visible in the call trace.

The source file inet.c comes from the LFS framework (Linux Fast-STREAMS).
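For completeness, a hedged sketch of the kind of msghdr/iovec setup a caller like ss_sock_recvmsg() has to prepare before invoking ss_recvmsg() above. The helper name ss_recv_into() and the buf/buflen parameters are hypothetical; the real setup lives in the attached inet.c:

#include <linux/socket.h>
#include <linux/uio.h>

/* Hypothetical helper; buf is a kernel buffer, which is why ss_recvmsg()
 * switches to KERNEL_DS around sock_recvmsg(). */
static int ss_recv_into(ss_t *ss, void *buf, size_t buflen)
{
	struct msghdr msg;
	struct iovec iov;

	iov.iov_base = buf;
	iov.iov_len = buflen;

	msg.msg_name = NULL;		/* peer address not needed here */
	msg.msg_namelen = 0;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = NULL;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;

	return ss_recvmsg(ss, &msg, buflen);	/* -> sock_recvmsg() -> tcp_recvmsg() */
}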
Created attachment 333230 [details]
One log dump as provided by netdump
Created attachment 333231 [details]
Second log dump as provided by netdump
The way in which it was reproduced:

1. Remove the kernel modules (including strinet, to which the attached inet.c belongs)
2. Reboot
3. Install the kernel modules (including the strinet module)
4. Start a test sequence which establishes 120 TCP/IP connections carrying moderate to heavy traffic.

This test traverses various kernel layers/modules:

tp0 --> strinet --> tcp --> [other system] --> tcp --> strinet --> tp0

The service provides RFC 1006 (OSI over TCP/IP).
Created attachment 333242 [details]
The kernel source file: /usr/src/debug/kernel-2.6.9/linux-2.6.9/net/ipv4/tcp.c
Created attachment 333243 [details]
Kernel source file: include/net/sock.h
Created attachment 333244 [details]
Kernel source file: include/linux/skbuff.h
Yes, the trace shows a race on sk_buff unlinking, which is more likely to trigger the more parallel the system is.

A hint on how to prevent this, from the vanilla kernel source:

 * sk_eat_skb - Release a skb if it is no longer needed
<SNIP>
 *
 * This routine must be called with interrupts disabled or with the socket
 * locked so that the sk_buff queue operation is ok.

<----- Do you lock_sock()/release_sock() or disable interrupts?
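In other words, the contract that comment describes is roughly the following (a sketch of the documented usage, not a quote from tcp_recvmsg(); eat_one() is a made-up name):

#include <linux/skbuff.h>
#include <net/sock.h>

/* Hypothetical illustration of the sk_eat_skb() contract: hold the
 * socket lock across the whole peek/consume/eat sequence. */
static void eat_one(struct sock *sk)
{
	struct sk_buff *skb;

	lock_sock(sk);
	skb = skb_peek(&sk->sk_receive_queue);
	if (skb) {
		/* ... consume the data referenced by skb ... */
		sk_eat_skb(sk, skb);	/* __skb_unlink() + __kfree_skb() */
	}
	release_sock(sk);
}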
Regarding the question "Do you lock_sock()/release_sock() or disable interrupts?":

The crash occurs in tcp_recvmsg(), which itself does a lock_sock() at the start and a release_sock() at the end of the function. Neither the function nor the file is changed from the vanilla flavor (see the attachment). That is also what is puzzling about this problem.

But at least you seem to confirm that it is a race condition in the TCP kernel code, and indeed it appears to be more likely on SMP-enabled servers with more processors. If you (or anyone else) have ideas about what could cause the race condition on the sk_buff queue (maybe something adding to the queue from below TCP while the kernel module reads it from above in tcp_recvmsg()?), please let us know.
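For reference, the shape of the function being described, heavily abbreviated from the attached tcp.c (only the locking bracket and the crash site are kept; everything elided is marked with ...):

int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
		size_t len, int nonblock, int flags, int *addr_len)
{
	...
	lock_sock(sk);			/* taken before any receive-queue access */
	...
	do {
		...
		skb = skb_peek(&sk->sk_receive_queue);
		...
	found_fin_ok:
		++*seq;
		if (!(flags & MSG_PEEK))
			sk_eat_skb(sk, skb);	/* the crash site */
		break;
	} while (len > 0);
	...
	release_sock(sk);		/* dropped only on the way out */
	return copied;
}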
I'm sorry, I'm not sure what we can do here. The entire TCP v4 receive path is protected by a call to bh_lock_sock() on the socket lock in tcp_v4_rcv(), which is the lowest entry point for frames that are recognized as TCP. Any calls which enqueue frames to a given socket are made from within tcp_v4_do_rcv() and its called functions, so it is all protected by the socket lock. I'm not saying it is a certainty that there isn't a path that misses that lock, but a visual inspection says there isn't (there are not many paths from tcp_v4_rcv() to any function which enqueues to a socket buffer).

Given the nature of the reproducer, I think it is far more likely that something in the openss7 streams modules is inappropriately altering the sk_receive_queue.

I would say the best place to start is by obtaining a core from the system and inspecting the sysrq-t output (the backtrace of all the processes). With luck, we might catch part of a backtrace that shows which context/path is modifying sk_receive_queue without holding the lock. Alternatively, reproducing this bug without the openss7 modules (which we don't ship) would serve as conclusive proof that the bug is not in that code.
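To make that suspicion concrete, here is an entirely hypothetical sketch (not taken from the openss7 sources) of the kind of pattern in an out-of-tree module that could produce exactly this corruption. skb_dequeue() takes only the queue's internal spinlock, while tcp_recvmsg() unlinks with __skb_unlink() under the socket lock and never takes that spinlock, so the two paths do not exclude each other:

#include <linux/skbuff.h>
#include <net/sock.h>

/* HYPOTHETICAL unsafe pattern -- for illustration only. */
static void drain_queue_unsafely(struct sock *sk)
{
	struct sk_buff *skb;

	/* No lock_sock(sk) here.  skb_dequeue() grabs only
	 * sk_receive_queue.lock, but the TCP receive path serializes
	 * access to this queue with the socket lock instead, so this
	 * loop can run concurrently with tcp_recvmsg() and clear
	 * skb->next/skb->prev under its feet. */
	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL)
		kfree_skb(skb);
}

The fix for such a path would simply be to bracket it with lock_sock(sk)/release_sock(sk), per the sk_eat_skb() comment quoted above.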
closing, no response.