Bug 219783

Summary: [RHEL4.6] qla3xxx panics in ql_process_macip_rx_intr
Product: Red Hat Enterprise Linux 4 Reporter: Konrad Rzeszutek <konradr>
Component: kernelAssignee: Konrad Rzeszutek <konradr>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.6CC: andriusb, benjamin.li, coughlan, david.somayajulu, dwm, lcm, mbarrow, mchristi, qlogic-redhat-ext, redhat, ron.mercer
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-02-27 17:51:27 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 217546    
Bug Blocks: 216987    
Attachments:
Description Flags
tcpdump file.
none
Panic. With kdb. Some information. none

Description Konrad Rzeszutek 2006-12-15 13:41:47 UTC
The test was carried on a 2.6.9-42.25 kernel with:
 - add-qla4xxx5.patch   (taken from  Mike Christie from BZ 180363: LTC17917-FEAT
7154: Qlogic iSCSI TOE adapters driver for Power).
 - add-qla3xxx_4.patch (taken from BZ 217546)

With the qla4xxx and qla3xxx loaded and not doing anything, the machine panics 
occasionaly.

The kernel is at:
x86_64:
http://darnok.com/qlogic/kernel-smp-2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9.x86_64.rpm
i686:
http://darnok.com/qlogic/kernel-smp-2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9.i686.rpm
source:
http://darnok.com/qlogic/kernel-2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9.src.rpm

This is the panic output:

------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:91!
invalid operand: 0000 [#1]
SMP
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs
lockd nfs_acl sunrpc dm_multipath button battery ac qla3xxx e100 mii floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla4xxx cpqarray
sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c027b96f>]    Not tainted VLI
EFLAGS: 00010212   (2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9smp)
EIP is at skb_over_panic+0x1f/0x2d
eax: 00000030   ebx: c02f0936   ecx: c03cff5c   edx: c0308906
esi: c02f0936   edi: 00000000   ebp: f6408240   esp: c03cff58
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c03cf000 task=c0323a80)
Stack: c0308906 f88fb81e 0000079a 0000079a c02f0936 f610ca00 0000079a f88fb823
       0000000e f6408000 f610cac0 f640bc50 f640bc3c f6655550 f6408240 00000000
       00000040 f88fb962 f6408000 f88b0000 c03cffc0 f6408000 c03cffd4 f6408240
Call Trace:
 [<f88fb81e>] ql_process_macip_rx_intr+0x114/0x1ca [qla3xxx]
 [<f88fb823>] ql_process_macip_rx_intr+0x119/0x1ca [qla3xxx]
 [<f88fb962>] ql_tx_rx_clean+0x8e/0x1ad [qla3xxx]
 [<f88fbab3>] ql_poll+0x32/0xa5 [qla3xxx]
 [<c0281440>] net_rx_action+0xae/0x160
 [<c0126a08>] __do_softirq+0x4c/0xb1
 [<c010819f>] do_softirq+0x4f/0x56
 =======================
 [<c0107ab4>] do_IRQ+0x1a2/0x1ae
 [<c02d5998>] common_interrupt+0x18/0x20
 [<c0104018>] default_idle+0x0/0x2f
 [<c01e007b>] acpi_ex_load_table_op+0xcd/0x14f
 [<c0104041>] default_idle+0x29/0x2f
 [<c01040a0>] cpu_idle+0x26/0x3b
 [<c0396786>] start_kernel+0x199/0x19d
Code: ec ff c7 43 68 00 00 00 00 5b c3 90 56 be 36 09 2f c0 53 8b 58 18 85 db 0f
44 de 53 52 ff 70 60 51 68 06 89 30 c0 e8 6f 6f ea ff <0f> 0b 5b 00 29 89 30 c0
83 c4 14 5b 5e c3 56 be 36 09 2f c0 53
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 <0>Rebooting in 1 seconds..
Press any key to continue.

From Ron:

"The panic is in the kernel code when it's processing a multicast packet sent 
up by qla3xxx.  The kernel is trying to take a lock that is in a structure 
that is not accessed by the driver.  I am looking to see if there is anything 
the driver could have done to contribute to this, but it would be a good idea 
to have the kernel guys take a look too.
Could I take a look at the test script?  I will try to reproduce it here.  I 
am guessing that it does not need the presence of the iSCSI driver for this to 
happen.
Ron"

Steps to reproduce (it varies)
A)
 1). Boot up machine with this kernel.
 2). Wait fora couple of hours and this happends.

B)
 1). Boot up machine this kernel.
 2). Mount the iSCSI storage device. Start doing in a loop dd on a file, while
at the same time:
   2a). unload/modprobe qla3xxx. Acquire IP address, perform NFS activity, 
        anything that pushes packets across the NIC.
 3). Wait.

The full screen-shot along with the test-cases is in 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217546
in comment #33 and #34:
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=143671

Comment 1 Konrad Rzeszutek 2006-12-15 13:52:19 UTC
Setting some flags.

Comment 2 Ron Mercer 2006-12-15 17:29:34 UTC
Konrad,
This is a different panic than the spinlock we saw yesterday.  This one we can
fix quickly.  I am handing this over to Ben Li and will cc him on this and he
will get a password.
Regards, Ron

Comment 3 Ron Mercer 2006-12-18 20:43:51 UTC
Konrad,
In test case "A", is the card attached to a large network?  
From your steps it looks like you don't even have to start a test.  We are 
unable to reproduce this so far.  I have asked Ben to attach the card to a 
larger network instead of the small private one we use.
Ron


Comment 4 Konrad Rzeszutek 2006-12-18 20:50:55 UTC
It goes through a switch (which then is uplinked to another switch, which goes
to the Red Hat boston entire lab). 

The traffic seen is mostly UDP (multicast) and ARP as seen with ethereal.

Comment 5 Benjamin Li 2006-12-18 22:50:32 UTC
Hi Konrad,

I have setup our RHEL 4 machine with the kernel you had provided earlier with a
similar network setup.  I am looking at the ethereal dump of the traffic going
across and have multicast UDP and ARP traffic.  But, I am still not able to
reproduce the kernel panic.  The machine has been sitting idle for a while.  I
also have tried to pushing TCP traffic loads with varying packet sizes and
haven't been able to reproduce the kernel panic.

What type of machine are you running on?  We are running on a slower 1.4 Ghz
PIII machine.

Thanks again.

-Ben

Comment 6 Konrad Rzeszutek 2006-12-19 16:35:25 UTC
I am running this on a Compaq DL380, Dual PIII 1.0Ghz with 1GB of RAM.

For kicks I ran tcpdump over the eth1 interface as so:

ifconfig                                                                   |
eth1      Link encap:Ethernet  HWaddr 00:C0:DD:08:76:0D                    |
          inet addr:192.168.78.223  Bcast:192.168.79.255  Mask:255.255.252.0
          inet6 addr: fe80::2c0:ddff:fe08:760d/64 Scope:Link               |
          UP BROADCAST RUNNING  MTU:1500  Metric:1-----------
          RX packets:2851270 errors:0 dropped:0 overruns:0 frame:0
          TX packets:50796 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256  OS, 'e' to edit the
          RX bytes:319147747 (304.3 MiB)  TX bytes:55127718 (52.5 MiB)
          Interrupt:201  'c' for a command-line.
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:22225 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22225 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:24599934 (23.4 MiB)  TX bytes:24599934 (23.4 MiB)
[root@perf8 ~]# tcpdump -i eth1 > /eth.dump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:91!
invalid operand: 0000 [#1]
SMP
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs
lockd nfs_acl sunrpc dm_multipath button battery ac qla3xxx e100 mii floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla4xxx cpqarray sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c027b96f>]    Not tainted VLI
EFLAGS: 00010212   (2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9smp)
EIP is at skb_over_panic+0x1f/0x2d
eax: 00000030   ebx: c02f0936   ecx: c03cff5c   edx: c0308906


and I put the dump on:

http://darnok.com/qlogic/eth.dump.gz

Comment 7 Konrad Rzeszutek 2006-12-19 16:55:20 UTC
Sorry about that copy-n-paste job in the previous comment.

Here is the right one:

[root@perf8 ~]# tcpdump -i eth1 > /eth.dump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:91!
invalid operand: 0000 [#1]
SMP
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs
lockd nfs_acl sunrpc dm_multipath button battery ac qla3xxx e100 mii floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla4xxx cpqarray sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c027b96f>]    Not tainted VLI
EFLAGS: 00010212   (2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9smp)
EIP is at skb_over_panic+0x1f/0x2d
eax: 00000030   ebx: c02f0936   ecx: c03cff5c   edx: c0308906
esi: c02f0936   edi: 00000005   ebp: f6968240   esp: c03cff58
ds: 007b   es: 007b   ss: 0068
Process tcpdump (pid: 26558, threadinfo=c03cf000 task=f51942b0)
Stack: c0308906 f88fb81e 0000076c 0000076c c02f0936 f653fe60 0000076c f88fb823
       0000000e f6968000 e6548980 f696a260 f696a24c f6e268d0 f6968240 00000005
       00000040 f88fb962 f6968000 f88b0000 c03cffc0 f6968000 c03cffd4 f6968240
Call Trace:
 [<f88fb81e>] ql_process_macip_rx_intr+0x114/0x1ca [qla3xxx]
 [<f88fb823>] ql_process_macip_rx_intr+0x119/0x1ca [qla3xxx]
 [<f88fb962>] ql_tx_rx_clean+0x8e/0x1ad [qla3xxx]
 [<f88fbab3>] ql_poll+0x32/0xa5 [qla3xxx]
 [<c0281440>] net_rx_action+0xae/0x160
 [<c0126a08>] __do_softirq+0x4c/0xb1
 [<c010819f>] do_softirq+0x4f/0x56
 =======================
 [<c0107ab4>] do_IRQ+0x1a2/0x1ae
 [<c02d5998>] common_interrupt+0x18/0x20
 [<c02d007b>] unix_gc+0x27a/0x32f
Code: ec ff c7 43 68 00 00 00 00 5b c3 90 56 be 36 09 2f c0 53 8b 58 18 85 db 0f
44 de 53 52 ff 70 60 51 68 06 89 30 c0 e8 6f 6f ea ff <0f> 0b 5b 00 29 89 30 c0
83 c4 14 5b 5e c3 56 be 36 09 2f c0 53
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 <0>Rebooting in 1 seconds..


Comment 8 Konrad Rzeszutek 2006-12-19 17:10:54 UTC
Created attachment 144016 [details]
tcpdump file.

I ran another test, this time actually capturing the packets as so:

tcpdump -i eth1 -w /eth.dump

and the crash happend in about 5 minutes.

Attached is the dump file.

Comment 9 Konrad Rzeszutek 2006-12-20 15:25:58 UTC
Interestingly enough, this panic only occurs during lunch hours in this lab.

I will try to setup the switch to mirror the traffic of one of its ports and see
what packets comes across when the driver panics.

Comment 10 Benjamin Li 2006-12-20 16:20:29 UTC
Hi Konrad,

I am working on getting you an RPM of the debug kernel build with additional
debugging messages and kdb.  The machine it is building on is taking a while and
I will get it to you as soon as the RPM is ready to go.

That is very strange with the panic occurring only at lunch,  I would love to
see the traffic dump.  Thanks again.

Comment 11 Benjamin Li 2006-12-20 17:33:08 UTC
I have created a Linux Kernel with all the given qla3xxx patches + some
additional debugging information placed into the Kernel Ring Buffer. We will
need to look at the Kernel Ring Buffer after the panic.  One way to do this
after the panic is to run with kdb.  KDB can be enabled by 

1.  Kernel boot parameter: kdb=on
2.  Enable via the proc filesystem: echo "1" > /proc/sys/kernel/kdb

The RPM with the additional debug information can be found at: 

ftp://ftp.qlogic.com/outgoing/qla3xxx_RHEL4_update2/kernel-smp-2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9_debug.i386.rpm

Please run with this kernel when recreating the panic.

Comment 13 Andrius Benokraitis 2007-01-04 17:16:56 UTC
Konrad, will dependent bug 217546 fix this issue?

Comment 15 Konrad Rzeszutek 2007-01-04 18:32:07 UTC
(In reply to comment #13)
> Konrad, will dependent bug 217546 fix this issue?

Andrius, 

No. We have not yet found a patch for this. Worst yet, those magic packets don't
seem to crep up anymore. I sent an e-mail to the RHKL to see if anybody has been
sending some special packets.

Comment 17 Konrad Rzeszutek 2007-01-11 19:42:16 UTC
Created attachment 145385 [details]
Panic. With kdb. Some information.

Comment 18 Konrad Rzeszutek 2007-01-11 19:50:40 UTC
From looking a bit at the dissambled code and C code it seems that:

        skb_put(skb2, length);  /* Just the second buffer length here. */

wherein the skb2->dev is pointing to NULL.

Comment 19 Andrius Benokraitis 2007-01-24 17:42:05 UTC
This bug has been cloned for RHEL 5.1. See bug 224218.

Comment 20 Konrad Rzeszutek 2007-01-24 17:53:44 UTC
Re-assigned BZ to me since I need to come up with a test-case first.

Comment 22 Konrad Rzeszutek 2007-02-27 17:51:27 UTC
I am closing this BZ as I cannot reproduce it.