Bug 429868

Summary:	kernel BUG at net/core/skbuff.c:228
Product:	Red Hat Enterprise Linux 4	Reporter:	Flavio Leitner <fleitner>
Component:	kernel	Assignee:	Red Hat Kernel Manager <kernel-mgr>
Status:	CLOSED NOTABUG	QA Contact:	Martin Jenner <mjenner>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.4	CC:	tao
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-02-14 21:45:38 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Flavio Leitner 2008-01-23 15:37:25 UTC

Description of problem:
Customer uploaded a vmcore with a kernel panic and wants to know 
the root cause.

 Warning: kfree_skb passed an skb still on a list (from 39f082c0).
 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:228!
 invalid operand: 0000 [#1]
 SMP
 Modules linked in: nfs lockd nfs_acl mptctl mptbase sg gab(U) llt(U) 
 netconsole netdump i2c_dev i2c_core sunrpc ide_dump cciss_dump scsi_dump
 diskdump zlib_deflate lpfcdfc dmpaa(U) vxspec(U) vxio(U) vxdmp(U) fdd(U)
 vxportal(U) vxfs(U) dm_mod joydev button battery ac ohci_hcd hw_random 
 tg3 e1000 floppy ext3 jbd lpfc scsi_transport_fc cciss sd_mod scsi_mod
 CPU:    3
 EIP:    0060:[<022756c0>]    Tainted: PF     VLI
 EFLAGS: 00010246   (2.6.9-42.0.8.EL.IT122348HFhugemem)
 EIP is at __kfree_skb+0x19/0xf7
 eax: 00000045   ebx: f65c0240   ecx: 023c8f70   edx: 022ffe37
 esi: f75e9200   edi: f88f285c   ebp: 0000006b   esp: 023c8f6c
 ds: 007b   es: 007b   ss: 0068
 Process tibemsd (pid: 13849, threadinfo=023c8000 task=f083f430)
 Stack: 022ffe37 39f082c0 7b8eaa00 f65c0240 f893922e 00000001 00000000 0000006b
      39f7e6b0 f65c0000 f65c0240 7b8eaa38 f65c0240 f64c9400 f65c02dc 00000040
      f8939070 00000000 00000000 023c8fd4 00000000 00000000 f64c9400 0b0bbd80
 Call Trace:
 [<f893922e>] e1000_clean_tx_irq+0xe5/0x293 [e1000]
 [<f8939070>] e1000_clean+0x9d/0x176 [e1000]
 [<0227aaab>] net_rx_action+0xae/0x160
 [<02126424>] __do_softirq+0x4c/0xb1
 [<021084af>] do_softirq+0x4f/0x56
 =======================
 [<0211745a>] smp_apic_timer_interrupt+0x9a/0x9c
 [<0211254b>] get_offset_pmtmr+0x19/0xace
 [<0210b0ce>] do_gettimeofday+0x1a/0x9c
 [<02125d81>] sys_gettimeofday+0x17/0x55
 Code:  Bad EIP value.


Version-Release number of selected component (if applicable):
2.6.9-42.0.8.EL.IT122348HFhugemem

How reproducible:
One time event

Steps to Reproduce:
Not reproduced anymore
  
Additional info:
* Core is on seg.rdu.redhat.com/export/nfs/samfw/136995
case_136955_vmcore
* The kernel is a hotfix kernel

Analysis done so far:
PID = 13849,
...
#9 [ 23c8f7c] e1000_clean_tx_irq at f8939229
   [RA: f8939070  SP: 23c8f80  FP: 23c8fac  SIZE: 48]
   23c8f80: 00000001  00000000  0000006b  39f7e6b0
   23c8f90: f65c0000  f65c0240  7b8eaa38  f65c0240
   23c8fa0: f64c9400  f65c02dc  00000040  f8939070
...

There is a (struct e1000_adapter *) on 0xf65c0240,
crash> e1000_adapter.tx_ring 0xf65c0240
 tx_ring = 0xf75e9200,

and
crash> e1000_tx_ring 0xf75e9200
struct e1000_tx_ring {
 desc = 0x39f7e000,
 dma = 938991616,
 size = 4096,
 count = 256,              <-------- how many buffers allocated
 next_to_use = 56,
 next_to_clean = 107,
 buffer_info = 0xf88f2000, <-------- buffers
 tx_lock = {
   lock = 1,
   magic = 3735899821
 },
 tdh = 14352,
 tdt = 14360,
 last_tx_tso = FALSE
}

Iteracting from first buffer_info to last one we can get all sk_buff
and then all ->list, see:
crash> e1000_buffer.skb 0xf88f2000
 skb = 0xf647b6c0,
crash> sk_buff.list 0xf647b6c0
 list = 0x0,

The results is only one sk_buff with ->list not 0x0:
...
crash> sk_buff.list 0x7b8eaa00
 list = 0xb0bbd8c,               <-------- BUG()
crash> sk_buff.list 0x52ba0100
 list = 0x0,
crash> sk_buff.list 0xf21d3180
 list = 0x0
....

We can confirm that checking on __skb_free() stack frame for 0x7b8eaa00:
 #8 [ 23c8f68] __kfree_skb at 22756c0
   [RA: f893922e  SP: 23c8f6c  FP: 23c8f7c  SIZE: 20]
   23c8f6c: 022ffe37  39f082c0  *7b8eaa00*  f65c0240
   23c8f7c: f893922e

Checking the sk_buff the fields appears to be correctly to me, so 
they are valid and __kfree_skb() is right to BUG() on it.

What is calling my attention now are input_dev and dev fields:

crash> sk_buff.dev 7b8eaa00
 dev = 0x233c160,
crash> sk_buff.input_dev 7b8eaa00
 input_dev = 0x233c160,
crash> sk_buff.real_dev 7b8eaa00
 real_dev = 0x0,

Notes dev == input_dev and:
crash> net_device.name 0x233c160
 name = "lo\000\000\000\000\000\000\000\000\000\000\000\000\000",

This sk_buff was actually received from loopback interface and it
never should be on e1000 tx queue.

The eth0 interface is using netlog/netdump so it may be related to 
our bug but there is no similar report until now and I didn't find 
anything that could point that too.

The other processes running are java (pid=4439), cp (pid=9008), 
java (pid=16174) and they haven't nothing related to network or 
any area in backtrace that could impact on our scenario, like a 
race condition or something.

The tx_queue_lock is held and the magic number is valid so the 
queue is safe at that point, other struct fields are ok too.

The next_to_clean is 107, which points to our buffer_info and the 
board descriptor  (at 0x39f7e6b0)

crash> e1000_tx_desc 39f7e6b0
struct e1000_tx_desc {
 buffer_addr = 0x37efee02,
 lower = {
   data = 0x8b00004b,
   flags = {
     length = 0x4b,
     cso = 0x0,
     cmd = 0x8b
   }
 },
 upper = {
   data = 0x1,
   fields = {
     status = 0x1,
     css = 0x0,
     special = 0x0
   }
 }
}

The e1000_buffer:
crash> e1000_buffer.skb,next_to_watch,dma f88f285c
 skb = 0x7b8eaa00,
 next_to_watch = 0x6b
 dma = 0x37efee02,

Note that both are connected and correct although e1000_buffer->skb 
is valid but with another unexpected content.

Looking in vmcore memory finding relations between the skb and 
another areas we can see the following references:

Addr referencing: Addr referenced
23c8ebc: 7b8eaa00 --\
23c8ed8: 7b8eaa00 --+--- softirq_stack
23c8f74: 7b8eaa00 --/

b0bbd8c: 7b8eaa00 ----+-- another sk_buff (sk_buff_head)
b0bbd90: 7b8eaa00 ---/

39fe01b0: 7b8eaa00 -----\
39fe30e8: 7b8eaa00 ------+-- SLAB size-512/Unidentified
39fe30f0: 7b8eaa00 -----/

f77d61d8: 7b8eaa00 ----+---- Unidentified
f77d61e0: 7b8eaa00 ----/

f88f285c: 7b8eaa00 --------- diskdump storing softirq_stack on disk

There are still two areas unidentified that may help us see what 
was using this socket and give a clue to find the root cause, but 
this is a partial dump and we maybe missing important information now:
crash> sys
     KERNEL: /usr/lib/debug/lib/modules/2.6.9-42.0.8.EL.IT122348HFhugemem/vmlinux
   DUMPFILE: case_136955_vmcore  [PARTIAL DUMP]
                                 ^^^^^^^^^^^^^^^

I'm thinking that skb_buff 0x7b8eaa00 was correct before because 
to get in tx_ring it should have had ->dev correctly to call 
dev->hard_start_xmit().

The entries in tx_ring are correct and so the others sk_buff, 
then the 0x7b8eaa00 is the only entry with ->list pointer filled.

The e1000 driver does:
e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
                       struct e1000_buffer *buffer_info)
{
...
       if (buffer_info->skb)
               dev_kfree_skb_any(buffer_info->skb);
       memset(buffer_info, 0, sizeof(struct e1000_buffer));
}

The code above calls memset() to clean with zeros the entire 
buffer_info, which includes buffer_info->skb too, meaning that 
buffer_info was valid in tx_ring and not zeroed yet.

I'm thinking in the following event sequence:

1. the sk_buff was correct and added to e1000 tx_ring

2. sk_buff was transmitted but NOT freed yet

3. some time later a softirq woke up and tried to clean
  all sent sk_buff from tx_ring finding this sk_buff altered.

I'm almost certain that e1000 isn't the culprit here, nothing points 
to it until now, and the new/corrupted sk_buff comes from loopback 
and it is also correct/valid. With that in mind, 
I can imagine these two scenarios:

1) The caller of dev->hard_start_xmit() passed a free sk_buff.

The theory here is sk_buff was freed _before_ be passed to e1000 driver
and then the sk_buff was recycled throw slab cache to another interface
(loopback), then when softirq woke up found an unexpected sk_buff.
(some iptable rules, traffic control and etc..) Low probability.

2) The caller of dev->hard_start_xmit() freed sk_buff right after.

The theory here is sk_buff was delivered to e1000 driver and then
freed, so it was transmitted but also recycled throw slab cache causing
the softirq to find out an unexpected sk_buff.

The problem is finding out where this sk_buff was freed assuming that
my thoughts are correct.

Any idea what to do next to find out the root cause? 

Flavio