Description of problem:

Customer uploaded a vmcore with a kernel panic and wants to know the root cause.

Warning: kfree_skb passed an skb still on a list (from 39f082c0).
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:228!
invalid operand: 0000 [#1]
SMP
Modules linked in: nfs lockd nfs_acl mptctl mptbase sg gab(U) llt(U) netconsole netdump i2c_dev i2c_core sunrpc ide_dump cciss_dump scsi_dump diskdump zlib_deflate lpfcdfc dmpaa(U) vxspec(U) vxio(U) vxdmp(U) fdd(U) vxportal(U) vxfs(U) dm_mod joydev button battery ac ohci_hcd hw_random tg3 e1000 floppy ext3 jbd lpfc scsi_transport_fc cciss sd_mod scsi_mod
CPU:    3
EIP:    0060:[<022756c0>]    Tainted: PF     VLI
EFLAGS: 00010246   (2.6.9-42.0.8.EL.IT122348HFhugemem)
EIP is at __kfree_skb+0x19/0xf7
eax: 00000045   ebx: f65c0240   ecx: 023c8f70   edx: 022ffe37
esi: f75e9200   edi: f88f285c   ebp: 0000006b   esp: 023c8f6c
ds: 007b   es: 007b   ss: 0068
Process tibemsd (pid: 13849, threadinfo=023c8000 task=f083f430)
Stack: 022ffe37 39f082c0 7b8eaa00 f65c0240 f893922e 00000001 00000000 0000006b
       39f7e6b0 f65c0000 f65c0240 7b8eaa38 f65c0240 f64c9400 f65c02dc 00000040
       f8939070 00000000 00000000 023c8fd4 00000000 00000000 f64c9400 0b0bbd80
Call Trace:
 [<f893922e>] e1000_clean_tx_irq+0xe5/0x293 [e1000]
 [<f8939070>] e1000_clean+0x9d/0x176 [e1000]
 [<0227aaab>] net_rx_action+0xae/0x160
 [<02126424>] __do_softirq+0x4c/0xb1
 [<021084af>] do_softirq+0x4f/0x56
 =======================
 [<0211745a>] smp_apic_timer_interrupt+0x9a/0x9c
 [<0211254b>] get_offset_pmtmr+0x19/0xace
 [<0210b0ce>] do_gettimeofday+0x1a/0x9c
 [<02125d81>] sys_gettimeofday+0x17/0x55
Code: Bad EIP value.

Version-Release number of selected component (if applicable):
2.6.9-42.0.8.EL.IT122348HFhugemem

How reproducible:
One-time event

Steps to Reproduce:
Not reproduced anymore

Additional info:
* Core is on seg.rdu.redhat.com/export/nfs/samfw/136995 case_136955_vmcore
* The kernel is a hotfix kernel

Analysis done so far:

PID = 13849, ...

#9 [ 23c8f7c] e1000_clean_tx_irq at f8939229
   [RA: f8939070  SP: 23c8f80  FP: 23c8fac  SIZE: 48]
     23c8f80: 00000001  00000000  0000006b  39f7e6b0
     23c8f90: f65c0000  f65c0240  7b8eaa38  f65c0240
     23c8fa0: f64c9400  f65c02dc  00000040  f8939070
...

There is a (struct e1000_adapter *) at 0xf65c0240:

crash> e1000_adapter.tx_ring 0xf65c0240
  tx_ring = 0xf75e9200,

and

crash> e1000_tx_ring 0xf75e9200
struct e1000_tx_ring {
  desc = 0x39f7e000,
  dma = 938991616,
  size = 4096,
  count = 256,                 <-------- how many buffers allocated
  next_to_use = 56,
  next_to_clean = 107,
  buffer_info = 0xf88f2000,    <-------- buffers
  tx_lock = {
    lock = 1,
    magic = 3735899821
  },
  tdh = 14352,
  tdt = 14360,
  last_tx_tso = FALSE
}

Iterating from the first buffer_info to the last one we can get every sk_buff and then every ->list pointer, for example:

crash> e1000_buffer.skb 0xf88f2000
  skb = 0xf647b6c0,
crash> sk_buff.list 0xf647b6c0
  list = 0x0,

The result is that only one sk_buff has ->list different from 0x0:

...
crash> sk_buff.list 0x7b8eaa00
  list = 0xb0bbd8c,            <-------- BUG()
crash> sk_buff.list 0x52ba0100
  list = 0x0,
crash> sk_buff.list 0xf21d3180
  list = 0x0
...

We can confirm that by checking the __kfree_skb() stack frame for 0x7b8eaa00:

#8 [ 23c8f68] __kfree_skb at 22756c0
   [RA: f893922e  SP: 23c8f6c  FP: 23c8f7c  SIZE: 20]
     23c8f6c: 022ffe37  39f082c0  *7b8eaa00*  f65c0240
     23c8f7c: f893922e

Checking the sk_buff itself, the fields look correct to me, so it is a valid sk_buff and __kfree_skb() is right to BUG() on it.
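For reference, the check that fires here is the one at the top of __kfree_skb() in net/core/skbuff.c, which refuses to free an skb whose ->list pointer is still set. This is a sketch based on the upstream 2.6.9 source, not on the IT122348 hotfix tree, so the exact wording and line numbers may differ slightly:

void __kfree_skb(struct sk_buff *skb)
{
        /* an skb handed to kfree_skb() must already be unlinked
         * from any sk_buff_head queue */
        if (skb->list) {
                printk(KERN_WARNING "Warning: kfree_skb passed an skb still "
                       "on a list (from %p).\n", NET_CALLER(skb));
                BUG();          /* <-- net/core/skbuff.c:228 in this report */
        }
        ...
}

So the BUG() only tells us that ->list (0xb0bbd8c here) was non-NULL at free time; it does not tell us who queued or recycled the skb.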
What is catching my attention now are the input_dev and dev fields:

crash> sk_buff.dev 7b8eaa00
  dev = 0x233c160,
crash> sk_buff.input_dev 7b8eaa00
  input_dev = 0x233c160,
crash> sk_buff.real_dev 7b8eaa00
  real_dev = 0x0,

Note that dev == input_dev, and:

crash> net_device.name 0x233c160
  name = "lo\000\000\000\000\000\000\000\000\000\000\000\000\000",

This sk_buff was actually received from the loopback interface and should never be on the e1000 tx queue.

The eth0 interface is used by netlog/netdump, so that might be related to our bug, but there is no similar report so far and I didn't find anything pointing in that direction either.

The other processes running are java (pid=4439), cp (pid=9008) and java (pid=16174); they have nothing related to networking, nor anything in their backtraces that could affect our scenario, such as a race condition.

The tx_queue_lock is held and the magic number is valid, so the queue is safe at that point; the other struct fields are ok too.

The next_to_clean is 107, which points to our buffer_info and to the board descriptor (at 0x39f7e6b0):

crash> e1000_tx_desc 39f7e6b0
struct e1000_tx_desc {
  buffer_addr = 0x37efee02,
  lower = {
    data = 0x8b00004b,
    flags = {
      length = 0x4b,
      cso = 0x0,
      cmd = 0x8b
    }
  },
  upper = {
    data = 0x1,
    fields = {
      status = 0x1,
      css = 0x0,
      special = 0x0
    }
  }
}

The e1000_buffer:

crash> e1000_buffer.skb,next_to_watch,dma f88f285c
  skb = 0x7b8eaa00,
  next_to_watch = 0x6b
  dma = 0x37efee02,

Note that descriptor and buffer are consistent with each other and look correct; e1000_buffer->skb points to a valid sk_buff, but that sk_buff carries unexpected contents.

Searching the vmcore memory for references to the skb from other areas, we can see the following:

Addr referencing: Addr referenced

 23c8ebc: 7b8eaa00 --\
 23c8ed8: 7b8eaa00 --+--- softirq_stack
 23c8f74: 7b8eaa00 --/

 b0bbd8c: 7b8eaa00 ----+-- another sk_buff (sk_buff_head)
 b0bbd90: 7b8eaa00 ---/

39fe01b0: 7b8eaa00 -----\
39fe30e8: 7b8eaa00 ------+-- SLAB size-512 / Unidentified
39fe30f0: 7b8eaa00 -----/

f77d61d8: 7b8eaa00 ----+---- Unidentified
f77d61e0: 7b8eaa00 ----/

f88f285c: 7b8eaa00 --------- diskdump storing softirq_stack on disk

There are still two unidentified areas that might tell us what was using this socket buffer and give a clue to the root cause, but this is a partial dump, so we may be missing important information:

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.9-42.0.8.EL.IT122348HFhugemem/vmlinux
    DUMPFILE: case_136955_vmcore  [PARTIAL DUMP]
                                  ^^^^^^^^^^^^^^

I'm thinking that sk_buff 0x7b8eaa00 was correct earlier, because to get into the tx_ring it must have had ->dev set correctly for the dev->hard_start_xmit() call. The entries in the tx_ring are correct, and so are the other sk_buffs; 0x7b8eaa00 is the only entry with the ->list pointer filled in.

The e1000 driver does:

e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
                                 struct e1000_buffer *buffer_info)
{
...
        if (buffer_info->skb)
                dev_kfree_skb_any(buffer_info->skb);

        memset(buffer_info, 0, sizeof(struct e1000_buffer));
}

The code above memset()s the entire buffer_info to zero, including buffer_info->skb, which means this buffer_info was still live in the tx_ring and had not been zeroed yet.

I'm thinking of the following event sequence:

1. the sk_buff was correct and added to the e1000 tx_ring
2. the sk_buff was transmitted but NOT freed yet
3. some time later a softirq woke up and tried to clean all sent sk_buffs from the tx_ring (see the sketch of that path below), finding this sk_buff already altered.
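For step 3 above, here is an abridged sketch of the tx-clean path, paraphrased from the upstream e1000 driver of that era (the IT122348 hotfix source may differ; error handling, TSO and queue-wake logic are omitted). It shows how next_to_clean = 107 leads the softirq to buffer_info[107] = 0xf88f285c, whose end-of-packet descriptor (next_to_watch = 0x6b, i.e. 0x39f7e000 + 0x6b*16 = 0x39f7e6b0) has status = 0x1 (the DD/done bit), and from there to dev_kfree_skb_any() on skb 0x7b8eaa00:

static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter,
                                    struct e1000_tx_ring *tx_ring)
{
        struct e1000_tx_desc *eop_desc;
        struct e1000_buffer *buffer_info;
        unsigned int i, eop;
        boolean_t cleaned;

        i = tx_ring->next_to_clean;                   /* 107 in our vmcore */
        eop = tx_ring->buffer_info[i].next_to_watch;  /* 0x6b */
        eop_desc = E1000_TX_DESC(*tx_ring, eop);      /* 0x39f7e6b0 */

        /* the hardware set the DD bit, so the packet was transmitted and
         * its buffers can be released */
        while (eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) {
                for (cleaned = FALSE; !cleaned; ) {
                        buffer_info = &tx_ring->buffer_info[i];
                        cleaned = (i == eop);

                        /* frees buffer_info->skb via dev_kfree_skb_any()
                         * and then memset()s buffer_info to zero */
                        e1000_unmap_and_free_tx_resource(adapter, buffer_info);

                        if (unlikely(++i == tx_ring->count))
                                i = 0;
                }
                eop = tx_ring->buffer_info[i].next_to_watch;
                eop_desc = E1000_TX_DESC(*tx_ring, eop);
        }
        tx_ring->next_to_clean = i;
        ...
}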
I'm almost certain that e1000 isn't the culprit here; nothing points to it so far, and the new/corrupted sk_buff comes from loopback and is itself correct/valid. With that in mind, I can imagine these two scenarios:

1) The caller of dev->hard_start_xmit() passed an already-freed sk_buff.

   The theory here is that the sk_buff was freed _before_ being passed to
   the e1000 driver and then recycled through the slab cache to another
   interface (loopback), so when the softirq woke up it found an unexpected
   sk_buff (some iptables rules, traffic control, etc.). Low probability.

2) The caller of dev->hard_start_xmit() freed the sk_buff right after
   handing it over.

   The theory here is that the sk_buff was delivered to the e1000 driver
   and then freed, so it was transmitted but also recycled through the slab
   cache, causing the softirq to find an unexpected sk_buff. (A purely
   hypothetical sketch of this pattern is at the end of this comment.)

The problem is finding out where this sk_buff was freed, assuming my reasoning is correct.

Any idea what to do next to find the root cause?

Flavio
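To make scenario 2 concrete, this is a purely hypothetical pattern; bad_xmit() is an invented name and this is not code found in this kernel, only an illustration of the ownership violation I have in mind:

/* Hypothetical -- NOT code from this kernel. */
static int bad_xmit(struct sk_buff *skb, struct net_device *dev)
{
        int rc = dev->hard_start_xmit(skb, dev); /* driver owns the skb now
                                                  * and will free it later
                                                  * from e1000_clean_tx_irq() */

        /* BUG: double ownership.  Between this kfree_skb() and the tx-clean
         * softirq, the slab cache can hand the very same object to another
         * user (e.g. the loopback rx path), which queues it on an
         * sk_buff_head.  When e1000_clean_tx_irq() finally calls
         * dev_kfree_skb_any(), it sees a "foreign" skb with ->list set and
         * __kfree_skb() hits the BUG(). */
        kfree_skb(skb);
        return rc;
}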