Bug 217546 - [RHEL 4.5] qla3xxx panics when eth1 (qla3xxx) is sending pings and when qla4xxx is loaded.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Konrad Rzeszutek
QA Contact: Brian Brock
Depends On:
Blocks: 209341 216986 219783
Reported: 2006-11-28 10:47 EST by Konrad Rzeszutek
Modified: 2007-11-30 17:07 EST

See Also:
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-08 00:16:46 EDT

Attachments
Serial output with dmesg, cat of test script, and the panic. (147.65 KB, text/plain)
2006-11-28 10:47 EST, Konrad Rzeszutek
screen output when qla4xxx inits with extended debugging. (6.75 KB, text/plain)
2006-12-01 17:47 EST, Konrad Rzeszutek
screen output when qla3xxx panics with qla4xxx with extended debugging. (3.89 KB, text/plain)
2006-12-01 17:49 EST, Konrad Rzeszutek
Fixes panic caused by successive inbound ping frames. (7.02 KB, patch)
2006-12-05 17:27 EST, Ron Mercer
Screen log of qla3xxx - v2.02.00-k37RH - crashing. (41.74 KB, text/plain)
2006-12-06 13:30 EST, Konrad Rzeszutek
QLA3xxx driver from main-line with the two patches from this BZ. (127.38 KB, patch)
2006-12-13 16:41 EST, Konrad Rzeszutek
Adds 1us delay on NVRAM access. (4.29 KB, patch)
2006-12-13 19:39 EST, Ron Mercer
QLA3xxx driver from main-line with the three patches from this BZ. (127.38 KB, patch)
2006-12-14 11:59 EST, Konrad Rzeszutek
QLA3xxx driver from main-line with the three patches from this BZ. (127.52 KB, patch)
2006-12-14 12:05 EST, Konrad Rzeszutek
Screen log. (644.19 KB, text/plain)
2006-12-14 13:34 EST, Konrad Rzeszutek

Description Konrad Rzeszutek 2006-11-28 10:47:48 EST
The test was carried out on a 2.6.9-42.25 kernel with:
 - add-qla4xxx2.patch   (taken from  Mike Christie from BZ 180363: LTC17917-FEAT
7154: Qlogic iSCSI TOE adapters driver for Power).
 - qla3xxx.patch   (pretty much a back-port from 2.6.19 of the driver - BZ #
209341: LTC27612-FEAT: 200804: Include the qla3xxx networking driver)
 -  qla_fixes.patch  (back-port of the patch from bugzilla # 215641: Patch to
fix reset issue when ethernet interface is enabled).

With the qla4xxx loaded and not doing anything, the machine panics when a ping
test is performed. 

The patches and the kernel are all available at:
http://www.darnok.org/qlogic/

Attached is the serial output that has: dmesg, cat of the test script, and the
panic.
Comment 1 Konrad Rzeszutek 2006-11-28 10:47:48 EST
Created attachment 142295
Serial output with dmesg, cat of test script, and the panic.
Comment 2 Konrad Rzeszutek 2006-11-28 11:25:48 EST
This is what the lspci tells me (this is a Compaq DL380 box with the iSCSI/NIC
PCI card)

00:00.0 Host bridge: Broadcom CNB20LE Host Bridge (rev 05)
	Flags: bus master, medium devsel, latency 64

00:00.1 Host bridge: Broadcom CNB20LE Host Bridge (rev 05)
	Flags: bus master, medium devsel, latency 64

00:01.0 RAID bus controller: LSI Logic / Symbios Logic 53C1510 (rev 02)
	Subsystem: Compaq Computer Corporation Integrated Array Controller
	Flags: bus master, medium devsel, latency 192, IRQ 177
	I/O ports at 2000 [size=256]
	Memory at c5000000 (32-bit, non-prefetchable) [size=16M]
	Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
	Capabilities: [40] Power Management version 2

00:02.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08)
	Subsystem: Compaq Computer Corporation NC3163 Fast Ethernet NIC (embedded, WOL)
	Flags: bus master, medium devsel, latency 64, IRQ 185
	Memory at c3fff000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at 2400 [size=64]
	Memory at c3e00000 (32-bit, non-prefetchable) [size=1M]
	Capabilities: [dc] Power Management version 2

00:03.0 VGA compatible controller: ATI Technologies Inc 3D Rage IIC 215IIC
[Mach64 GT IIC] (rev 7a) (prog-if 00 [VGA])
	Subsystem: ATI Technologies Inc Rage IIC
	Flags: bus master, stepping, medium devsel, latency 64
	Memory at c2000000 (32-bit, prefetchable) [size=16M]
	I/O ports at 2800 [size=256]
	Memory at c3dff000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [5c] Power Management version 1

00:04.0 System peripheral: Compaq Computer Corporation Advanced System
Management Controller
	Subsystem: Compaq Computer Corporation: Unknown device b0f3
	Flags: medium devsel, IRQ 193
	I/O ports at 1800 [size=256]
	Memory at c3dfef00 (32-bit, non-prefetchable) [size=256]

00:0f.0 ISA bridge: Broadcom OSB4 South Bridge (rev 4f)
	Subsystem: Broadcom OSB4 South Bridge
	Flags: bus master, medium devsel, latency 0

00:0f.1 IDE interface: Broadcom OSB4 IDE Controller (prog-if 8a [Master SecP PriP])
	Flags: bus master, medium devsel, latency 64
	I/O ports at 2c00 [size=16]

03:06.0 Ethernet controller: QLogic Corp. QLA3022 Network Adapter (rev 03)
	Subsystem: QLogic Corp.: Unknown device 0123
	Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 201
	I/O ports at 3000 [size=256]
	Memory at c6fff000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [44] Power Management version 2
	Capabilities: [4c] PCI-X non-bridge device.
	Capabilities: [54] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-

03:06.1 Network controller: QLogic Corp. QLA4022 iSCSI TOE Adapter (rev 03)
	Subsystem: QLogic Corp.: Unknown device 0124
	Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 209
	I/O ports at 3400 [size=256]
	Memory at c6ffe000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [44] Power Management version 2
	Capabilities: [4c] PCI-X non-bridge device.
	Capabilities: [54] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-

Comment 3 Konrad Rzeszutek 2006-11-28 11:35:16 EST
Changing the title as the ql_process_mac_tx_intr is in the QLA3xxx code.
Comment 4 David Somayajulu 2006-11-28 15:52:45 EST
We are working on this issue.
Comment 5 David Somayajulu 2006-11-28 16:32:13 EST
Is it possible to give me the output of ifconfig -a?
Comment 6 Konrad Rzeszutek 2006-11-29 11:11:14 EST

eth0      Link encap:Ethernet  HWaddr 00:02:A5:37:E6:70  
          inet addr:192.168.79.212  Bcast:192.168.79.255  Mask:255.255.252.0
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:15234 errors:0 dropped:0 overruns:0 frame:0
          TX packets:56 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1224360 (1.1 MiB)  TX bytes:5157 (5.0 KiB)

eth1      Link encap:Ethernet  HWaddr 00:C0:DD:08:76:0D  
          inet addr:192.168.78.223  Bcast:192.168.79.255  Mask:255.255.252.0
          inet6 addr: fe80::2c0:ddff:fe08:760d/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:63549 errors:0 dropped:0 overruns:0 frame:0
          TX packets:328 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:5589021 (5.3 MiB)  TX bytes:46294 (45.2 KiB)
          Interrupt:201 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:35 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:4540 (4.4 KiB)  TX bytes:4540 (4.4 KiB)

sit0      Link encap:IPv6-in-IPv4  
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

[root@perf8 ~]# 
Comment 7 David Somayajulu 2006-11-29 13:33:32 EST
I just noticed that the problem is with RHEL4.5. You should not be using the 
qla4xxx open-iscsi driver for this release. You need to use v5.00.04-d4 
qla4xxx driver which is currently in RHEL4.4 along with patches which we are 
going to provide in the next few days.
Comment 8 Konrad Rzeszutek 2006-11-29 14:30:39 EST
David,

The qla4xxx I am using is v5.00.04-d4. In what BZ are the patches you refer to?
Comment 9 David Somayajulu 2006-11-29 14:37:30 EST
Sorry for the confusion, you are using the correct qla4xxx driver. We haven't
submitted the required patches yet and plan to do so in a few days.
Comment 10 Ron Mercer 2006-11-30 14:10:50 EST
Konrad,  Can you increase debug output for the qla4xxx driver?  I want to 
eliminate the possibility of a driver interaction problem. It can be done as 
below.

# insmod ./qla4xxx.ko extended_error_logging=2

Regards, Ron Mercer (author of qla3xxx)
Comment 11 Konrad Rzeszutek 2006-12-01 17:47:53 EST
Created attachment 142635
screen output when qla4xxx inits with extended debugging.
Comment 12 Konrad Rzeszutek 2006-12-01 17:49:09 EST
Created attachment 142636
screen output when qla3xxx panics with qla4xxx with extended debugging.
Comment 13 Ron Mercer 2006-12-05 17:27:23 EST
Created attachment 142898
Fixes panic caused by successive inbound ping frames.

Please do the following, as the patch doesn't contain path information:

# cd /usr/src/YourKernel/drivers/net
# cp /path_to_this_patch/qla3xxx-v2.02.00-k37RH.patch .
# patch -p1 < qla3xxx-v2.02.00-k37RH.patch
	rebuild the drivers/modules

This is the RHEL04 qla3xxx network driver with changes to inbound completion
handling.  It passes Konrad's a.sh ping test in our lab.
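
The apply step above can be made a little safer by checking that the patch applies cleanly before touching the tree. The following is only a minimal sketch and not part of the original instructions: the function name apply_patch_checked is hypothetical, and it assumes a patch(1) that supports --dry-run (GNU patch does). The strip level defaults to 1 to match the -p1 used above.

```bash
#!/bin/bash
# Hypothetical helper (not from the original comment): apply a patch only
# if a dry run succeeds first.
#   $1 = directory to patch (e.g. the kernel's drivers/net)
#   $2 = absolute path to the patch file (absolute because we cd first)
#   $3 = strip level passed to -p (defaults to 1, as in the commands above)
apply_patch_checked() {
    local dir=$1 patchfile=$2 strip=${3:-1}
    (
        cd "$dir" || exit 1
        # --dry-run reports failed hunks without modifying any file
        patch --dry-run -p"$strip" < "$patchfile" >/dev/null 2>&1 || exit 1
        # The dry run was clean, so apply the patch for real
        patch -p"$strip" < "$patchfile" >/dev/null
    )
}
```

It could be called as apply_patch_checked /usr/src/YourKernel/drivers/net /path_to_this_patch/qla3xxx-v2.02.00-k37RH.patch, followed by the usual rebuild of the drivers/modules. A non-zero return means nothing was changed.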
Comment 14 Konrad Rzeszutek 2006-12-06 13:30:25 EST
Created attachment 142975
Screen log of qla3xxx - v2.02.00-k37RH - crashing.

I tested it with your patch and can still reproduce the error. Attached is the
screen log. I booted the kernel using 'selinux=0' so that you won't see all of
those avc:  denied  { rawip_recv } for	pid=6110 comm=... messages.
Comment 16 Konrad Rzeszutek 2006-12-06 13:38:23 EST
Perhaps I am missing a firmware update to the Qlogic card I have? What is the
firmware version for the one that you are using in your lab?
Comment 17 Ron Mercer 2006-12-06 15:22:26 EST
I don't think it is a firmware issue.  I was able to reproduce it with the 
initial driver and no qla4xxx loaded.  I can no longer reproduce it with the 
latest driver.  I will drop your kernel on my machine and continue.
Comment 18 Ron Mercer 2006-12-07 14:40:55 EST
Please comment out the following line in qla3xxx_probe() and run the test 
again:

	ndev->features = NETIF_F_LLTX; 

This is left over from previous code.  It tells the net device layer that the 
driver is locking the tx queue.  We removed the locking a few revs back, but 
never removed the flag.
If the test runs to completion I will issue another release.
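
For anyone repeating this experiment, the manual edit requested above can be scripted. This is a rough sketch only: the function name comment_out_lltx is hypothetical, and it assumes the assignment appears in the source file on a single line exactly as quoted above. It wraps that line in a C comment and leaves everything else untouched.

```bash
#!/bin/bash
# Hypothetical helper (not from the original comment): comment out the
# "ndev->features = NETIF_F_LLTX;" assignment in a driver source file.
#   $1 = path to the source file to edit in place
comment_out_lltx() {
    local src=$1 tmp rc
    tmp=$(mktemp) || return 1
    # Keep the leading indentation (\1) and wrap the statement (\2) in /* */
    sed -E 's|^([[:space:]]*)(ndev->features[[:space:]]*=[[:space:]]*NETIF_F_LLTX;)|\1/* \2 */|' \
        "$src" > "$tmp" && cat "$tmp" > "$src"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

After running it on the driver source, rebuild the module and rerun the ping test as described above.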
Comment 19 Konrad Rzeszutek 2006-12-08 17:32:20 EST
Ron,

That fixed it. The ping tests ran to completion.  Thank you!

I will try next week to do some testing that exercises the iSCSI and the qla3xxx
at the same time, if such a configuration can be done.
Comment 20 Ron Mercer 2006-12-08 19:51:16 EST
Konrad,
Excellent news.  I have a couple of questions.  
First, I would like to update you with my latest driver that supports our 4032 
chip.  I have been pushing the changes to the upstream kernel, and will update 
you when they're accepted.  
Secondly, I was able to get your kernel running, but I am not able to build my 
drivers on it.  Would it be possible to get the source code in a tarball so I 
can build my own image?
Comment 21 Konrad Rzeszutek 2006-12-09 00:08:43 EST
Ron,

Sounds good. If you want you can just send me the qla3xxx.* files (with the
support for the 4032 chipset) and I will replace the patches which I have been
applying to RHEL4 U4 kernel which are:
1) http://darnok.com/qlogic/qla3xxx.patch
2) http://darnok.com/qlogic/qla_fixes.patch
3) http://darnok.com/qlogic/qla3xxx_fix.patch
4) http://darnok.com/qlogic/qla3xxx_fix_2.patch

and instead have just one patch.

The source tarball, with all of these patches and the qla4xxx driver (without the
patch for 4032 support that Mike posted recently) is

http://darnok.com/qlogic/linux-2.6.9-qla4xxx-qla3xxx-fixes_6.tgz


Comment 22 Konrad Rzeszutek 2006-12-09 00:12:30 EST
Ron,

With regard to comment #21, is the patch #2) actually necessary?
Comment 23 Konrad Rzeszutek 2006-12-11 12:29:41 EST
Ron,

Just ignore my comment #22. Can you provide me with the qla3xxx.* files this week,
or should I go ahead with the composite patches (qla3xxx.patch,
qla3xxx_fix.patch without the printk's, qla3xxx_fix_2.patch)?

Thanks.
Comment 24 Ron Mercer 2006-12-11 12:38:15 EST
Konrad,

I will update you with the two source files this week.  I think that will be 
easier.  It should be tomorrow or Wednesday.

Regards,  Ron
Comment 25 Konrad Rzeszutek 2006-12-13 16:41:45 EST
Created attachment 143562
QLA3xxx driver from main-line with the two patches from this BZ.

Ron,

This is the patch that I will propose tomorrow unless you have a more updated
one.
We can always provide bug-fix updates to this driver after the dead-line.
Comment 26 Ron Mercer 2006-12-13 18:10:19 EST
Konrad,
It is better to use your patch as the code I was planning to send you doesn't 
have enough testing done on it.  

Comment 27 Ron Mercer 2006-12-13 19:39:02 EST
Created attachment 143586
Adds 1us delay on NVRAM access.
Comment 28 Konrad Rzeszutek 2006-12-14 10:00:11 EST
Update. I ran the tests overnight with the qla4xxx in action (dd-ing a file to
an iSCSI storage) while rmmod-ing/modprobing the qla3xxx driver, followed
by a ping test. 

The machine panicked with this:
------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:199!
invalid operand: 0000 [#1]
SMP
Modules linked in: qla3xxx qla4xxx md5 ipv6 parport_pc lp parport autofs4
i2c_dev i2c_core sunrpc dm_multipath button battery ac e100 mii floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cpqarray sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c02d3b90>]    Not tainted VLI
EFLAGS: 00010213   (2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_8smp)
EIP is at _read_lock+0x9/0x1d
eax: c1a521f0   ebx: 00000000   ecx: f64fa8c0   edx: 0d0000e1
esi: c1a521e0   edi: 0d0000e1   ebp: 00000011   esp: c03cff00
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c03cf000 task=c0323a80)
Stack: c02be1d3 f64fa8c0 c1a521e0 0d0000e1 f64fa8c0 f1b40000 c0295eae 00000011
       004b0830 db4afec0 db4b0830 db4b0830 db4afec0 f1b40000 c0297ca3 00000000
       f1b40000 db4afec0 c034f318 00000008 00000000 c0281255 db4afec0 00000001
Call Trace:
 [<c02be1d3>] ip_check_mc+0x1b/0x94
 [<c0295eae>] ip_route_input+0xcd/0x16b
 [<c0297ca3>] ip_rcv+0x1d1/0x438
 [<c0281255>] netif_receive_skb+0x2ac/0x2ec
 [<f8901892>] ql_process_macip_rx_intr+0x196/0x1ca [qla3xxx]
 [<f8901954>] ql_tx_rx_clean+0x8e/0x1ad [qla3xxx]
 [<f8901aa5>] ql_poll+0x32/0xa5 [qla3xxx]
 [<c0281440>] net_rx_action+0xae/0x160
 [<c0126a08>] __do_softirq+0x4c/0xb1
 [<c010819f>] do_softirq+0x4f/0x56
 =======================
 [<c0107ab4>] do_IRQ+0x1a2/0x1ae
 [<c02d5998>] common_interrupt+0x18/0x20
 [<c0104018>] default_idle+0x0/0x2f
 [<c01e007b>] acpi_ex_load_table_op+0xcd/0x14f
 [<c0104041>] default_idle+0x29/0x2f
 [<c01040a0>] cpu_idle+0x26/0x3b
 [<c0396786>] start_kernel+0x199/0x19d
Code: 5b c3 81 78 04 ed 1e af de 74 08 0f 0b cf 00 7d 69 2e c0 f0 81 28 00 00 00
01 74 05 e8 7a ea ff ff c3 81 78 04 ed 1e af de 74 08 <0f> 0b c7 00 7d 69 2e c0
f0 83 28 01 79 05 e8 7d ea ff ff c3 81
 <0>Kernel panic - not syncing: Fatal exception in interrupt

I will coordinate with Mike Christie to see if there is another bug for this or
if we should open a new one as this one is fixed (pings don't panic the driver
anymore).
Comment 29 Konrad Rzeszutek 2006-12-14 11:52:41 EST
Ron,

I am testing with your patch from comment #27 along with an updated qla4xxx
driver from Mike Christie.

The drivers are at:

http://darnok.com/qlogic/kernel-2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9.src.rpm

So far the above panic in comment #28 is _not_ showing up.
Comment 31 Konrad Rzeszutek 2006-12-14 12:05:11 EST
Created attachment 143659
QLA3xxx driver from main-line with the three patches from this BZ.

Whoops. Wrong file uploaded.
Comment 32 Ron Mercer 2006-12-14 13:11:48 EST
Konrad,
The patch from #27 is a fix to prevent a problem seen on some platforms when 
the driver is loading.  It just adds a delay to NVRAM access and wouldn't play 
a part in the panic from comment #28.  
The panic is in the kernel code when it's processing a multicast packet sent 
up by qla3xxx.  The kernel is trying to take a lock that is in a structure 
that is not accessed by the driver.  I am looking to see if there is anything 
the driver could have done to contribute to this, but it would be a good idea 
to have the kernel guys take a look too.
Could I take a look at the test script?  I will try to reproduce it here.  I 
am guessing that it does not need the presence of the iSCSI driver for this to 
happen.
Ron
Comment 33 Konrad Rzeszutek 2006-12-14 13:34:35 EST
Created attachment 143671
Screen log. 

Here is the screen output along with the output of the test-scripts.
Comment 34 Konrad Rzeszutek 2006-12-14 13:38:22 EST
Two missing scripts from the screen output:

[konrad@dhcp83-154 tmp]$ more c.sh
#!/bin/bash

cd /mnt/konrad-junk-ignore-it
while (true)
do
 ls -al
 uptime
 sleep 10
done
[konrad@dhcp83-154 tmp]$ more a.sh
#!/bin/bash
COUNT=1
while (true)
do
 expr $COUNT \> 65500  1>/dev/null
 if [ "$?" == 0 ]; then
        exit 0;
 fi
 COUNT=`expr $COUNT + 1` 1>/dev/null
 ping pbladem -c 5 -s$COUNT 1>/dev/null 2>/dev/null &
done
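
For readers who want to rerun the sweep against a different host or a smaller size range, a.sh can be restated as a function. This is a sketch under stated assumptions, not the script that was actually run: the names ping_sweep and PING_CMD are hypothetical. Note that a.sh increments COUNT before each ping, so it covers payload sizes 2 through 65501; the defaults below match that range.

```bash
#!/bin/bash
# Hypothetical restatement of a.sh with the host and size range as parameters.
#   $1 = host to ping (a.sh uses pbladem)
#   $2 = first payload size (default 2, the first size a.sh actually sends)
#   $3 = last payload size (default 65501, the last size a.sh actually sends)
# PING_CMD may be set to substitute another command for ping (an assumption
# added here so the loop can be tried without generating traffic).
ping_sweep() {
    local host=$1 first=${2:-2} last=${3:-65501}
    local size
    for ((size = first; size <= last; size++)); do
        # As in a.sh: 5 echo requests per size, launched in the background
        ${PING_CMD:-ping} "$host" -c 5 -s "$size" >/dev/null 2>&1 &
    done
    # Unlike a.sh, wait for the shell's background jobs before returning
    wait
}
```

A short run such as ping_sweep pbladem 2 100 keeps the number of concurrent background pings manageable while still exercising back-to-back inbound frames of increasing size.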

Comment 35 Mike Christie 2006-12-15 01:32:29 EST
I noticed several places in that qla3xxx patch where the driver sleeps with a
spin lock held and interrupts off. Could you guys fix that up. Doing this fixes
all the soft lockups I am seeing.
Comment 36 Mike Christie 2006-12-15 01:51:00 EST
(In reply to comment #35)
> I noticed several places in that qla3xxx patch where the driver sleeps with a
> spin lock held and interrupts off. Could you guys fix that up. Doing this fixes
> all the soft lockups I am seeing.

Ignore the comment about fixing the soft lockups.
Comment 37 Konrad Rzeszutek 2006-12-15 08:57:12 EST
I opened another BZ: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=219783
": qla3xxx panics in ql_process_macip_rx_intr" to track the panic from comment #28.

I will close this BZ, as the initial problem (panic when sending pings) has been
fixed.
Comment 38 Ron Mercer 2006-12-19 13:53:58 EST
Konrad,
Have you pushed the fixes for this bug into RHEL5, or should we do it?
Ron
Comment 39 Konrad Rzeszutek 2006-12-19 16:19:51 EST
Ron,

I have not pushed the updates to RHEL5. Let me clone this BZ for RHEL5 and ask
RH management to make an exception to get this in.

Have the fixes that are in this BZ been pushed upstream? If so, do you have the
GIT commit?
Comment 42 Jay Turner 2007-01-24 11:18:21 EST
QE ack for RHEL4.5.
Comment 44 Jason Baron 2007-03-20 16:04:18 EDT
committed in stream U5 build 42.38. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 47 Red Hat Bugzilla 2007-05-08 00:16:46 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html
