Bug 219783
Summary: | [RHEL4.6] qla3xxx panics in ql_process_macip_rx_intr | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Konrad Rzeszutek <konradr> | ||||||
Component: | kernel | Assignee: | Konrad Rzeszutek <konradr> | ||||||
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | medium | ||||||||
Version: | 4.6 | CC: | andriusb, benjamin.li, coughlan, david.somayajulu, dwm, lcm, mbarrow, mchristi, qlogic-redhat-ext, redhat, ron.mercer | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2007-02-27 17:51:27 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | 217546 | ||||||||
Bug Blocks: | 216987 | ||||||||
Attachments: |
|
Description
Konrad Rzeszutek
2006-12-15 13:41:47 UTC
Setting some flags. Konrad, This is a different panic than the spinlock we saw yesterday. This one we can fix quickly. I am handing this over to Ben Li and will cc him on this and he will get a password. Regards, Ron Konrad, In test case "A", is the card attached to a large network? From your steps it looks like you don't even have to start a test. We are unable to reproduce this so far. I have asked Ben to attach the card to a larger network instead of the small private one we use. Ron It goes through a switch (which then is uplinked to another switch, which goes to the Red Hat boston entire lab). The traffic seen is mostly UDP (multicast) and ARP as seen with ethereal. Hi Konrad, I have setup our RHEL 4 machine with the kernel you had provided earlier with a similar network setup. I am looking at the ethereal dump of the traffic going across and have multicast UDP and ARP traffic. But, I am still not able to reproduce the kernel panic. The machine has been sitting idle for a while. I also have tried to pushing TCP traffic loads with varying packet sizes and haven't been able to reproduce the kernel panic. What type of machine are you running on? We are running on a slower 1.4 Ghz PIII machine. Thanks again. -Ben I am running this on a Compaq DL380, Dual PIII 1.0Ghz with 1GB of RAM. For kicks I ran tcpdump over the eth1 interface as so: ifconfig | eth1 Link encap:Ethernet HWaddr 00:C0:DD:08:76:0D | inet addr:192.168.78.223 Bcast:192.168.79.255 Mask:255.255.252.0 inet6 addr: fe80::2c0:ddff:fe08:760d/64 Scope:Link | UP BROADCAST RUNNING MTU:1500 Metric:1----------- RX packets:2851270 errors:0 dropped:0 overruns:0 frame:0 TX packets:50796 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:256 OS, 'e' to edit the RX bytes:319147747 (304.3 MiB) TX bytes:55127718 (52.5 MiB) Interrupt:201 'c' for a command-line. lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:22225 errors:0 dropped:0 overruns:0 frame:0 TX packets:22225 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:24599934 (23.4 MiB) TX bytes:24599934 (23.4 MiB) [root@perf8 ~]# tcpdump -i eth1 > /eth.dump tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:91! invalid operand: 0000 [#1] SMP Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc dm_multipath button battery ac qla3xxx e100 mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla4xxx cpqarray sd_mod scsi_mod CPU: 0 EIP: 0060:[<c027b96f>] Not tainted VLI EFLAGS: 00010212 (2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9smp) EIP is at skb_over_panic+0x1f/0x2d eax: 00000030 ebx: c02f0936 ecx: c03cff5c edx: c0308906 and I put the dump on: http://darnok.com/qlogic/eth.dump.gz Sorry about that copy-n-paste job in the previous comment. Here is the right one: [root@perf8 ~]# tcpdump -i eth1 > /eth.dump tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:91! invalid operand: 0000 [#1] SMP Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc dm_multipath button battery ac qla3xxx e100 mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla4xxx cpqarray sd_mod scsi_mod CPU: 0 EIP: 0060:[<c027b96f>] Not tainted VLI EFLAGS: 00010212 (2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9smp) EIP is at skb_over_panic+0x1f/0x2d eax: 00000030 ebx: c02f0936 ecx: c03cff5c edx: c0308906 esi: c02f0936 edi: 00000005 ebp: f6968240 esp: c03cff58 ds: 007b es: 007b ss: 0068 Process tcpdump (pid: 26558, threadinfo=c03cf000 task=f51942b0) Stack: c0308906 f88fb81e 0000076c 0000076c c02f0936 f653fe60 0000076c f88fb823 0000000e f6968000 e6548980 f696a260 f696a24c f6e268d0 f6968240 00000005 00000040 f88fb962 f6968000 f88b0000 c03cffc0 f6968000 c03cffd4 f6968240 Call Trace: [<f88fb81e>] ql_process_macip_rx_intr+0x114/0x1ca [qla3xxx] [<f88fb823>] ql_process_macip_rx_intr+0x119/0x1ca [qla3xxx] [<f88fb962>] ql_tx_rx_clean+0x8e/0x1ad [qla3xxx] [<f88fbab3>] ql_poll+0x32/0xa5 [qla3xxx] [<c0281440>] net_rx_action+0xae/0x160 [<c0126a08>] __do_softirq+0x4c/0xb1 [<c010819f>] do_softirq+0x4f/0x56 ======================= [<c0107ab4>] do_IRQ+0x1a2/0x1ae [<c02d5998>] common_interrupt+0x18/0x20 [<c02d007b>] unix_gc+0x27a/0x32f Code: ec ff c7 43 68 00 00 00 00 5b c3 90 56 be 36 09 2f c0 53 8b 58 18 85 db 0f 44 de 53 52 ff 70 60 51 68 06 89 30 c0 e8 6f 6f ea ff <0f> 0b 5b 00 29 89 30 c0 83 c4 14 5b 5e c3 56 be 36 09 2f c0 53 <0>Kernel panic - not syncing: Fatal exception in interrupt <0>Rebooting in 1 seconds.. Created attachment 144016 [details]
tcpdump file.
I ran another test, this time actually capturing the packets as so:
tcpdump -i eth1 -w /eth.dump
and the crash happend in about 5 minutes.
Attached is the dump file.
Interestingly enough, this panic only occurs during lunch hours in this lab. I will try to setup the switch to mirror the traffic of one of its ports and see what packets comes across when the driver panics. Hi Konrad, I am working on getting you an RPM of the debug kernel build with additional debugging messages and kdb. The machine it is building on is taking a while and I will get it to you as soon as the RPM is ready to go. That is very strange with the panic occurring only at lunch, I would love to see the traffic dump. Thanks again. I have created a Linux Kernel with all the given qla3xxx patches + some additional debugging information placed into the Kernel Ring Buffer. We will need to look at the Kernel Ring Buffer after the panic. One way to do this after the panic is to run with kdb. KDB can be enabled by 1. Kernel boot parameter: kdb=on 2. Enable via the proc filesystem: echo "1" > /proc/sys/kernel/kdb The RPM with the additional debug information can be found at: ftp://ftp.qlogic.com/outgoing/qla3xxx_RHEL4_update2/kernel-smp-2.6.9-42.25.EL_qla3xxx_qla4xxx_qla_fixes_9_debug.i386.rpm Please run with this kernel when recreating the panic. Konrad, will dependent bug 217546 fix this issue? (In reply to comment #13) > Konrad, will dependent bug 217546 fix this issue? Andrius, No. We have not yet found a patch for this. Worst yet, those magic packets don't seem to crep up anymore. I sent an e-mail to the RHKL to see if anybody has been sending some special packets. Created attachment 145385 [details]
Panic. With kdb. Some information.
From looking a bit at the dissambled code and C code it seems that: skb_put(skb2, length); /* Just the second buffer length here. */ wherein the skb2->dev is pointing to NULL. This bug has been cloned for RHEL 5.1. See bug 224218. Re-assigned BZ to me since I need to come up with a test-case first. I am closing this BZ as I cannot reproduce it. |