Bug 212055

Summary: recurring oops in bnx2_poll
Product: Red Hat Enterprise Linux 4
Reporter: Bryn M. Reeves <bmr>
Component: kernel
Assignee: Andy Gospodarek <agospoda>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: urgent
Priority: urgent
Version: 4.4
CC: andriusb, cjk, djoo, konradr, linville, mchan, peterm
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-08 03:53:19 UTC
Bug Blocks: 221457    
Attachments (Description: Flags)
oops message from bnx2 1.4.38 (1): none
oops message from bnx2 1.4.38 (2): none
oops message from bnx2 1.4.43 (1): none
netdev-2.6 commit 104812/2f8af120a159a843948749ea88bcacda9779b132: none
bnx2-mmio.patch: none
bnx2 dump: none
original unpatched bnx2 oops: none
patched kernel oops: none
bnx2-hack.patch: none
debug patch: none
debug patch #2: none
bug fix patch: none
rhel4.patch: none
Revised patch: none

Description Bryn M. Reeves 2006-10-24 19:41:06 UTC
Description of problem:
IBM Bladecenter HS21 blades equipped with Broadcom BCM5708 NICs are experiencing
an oops in the bnx2 module:

Unable to handle kernel NULL pointer dereference at 0000000000000108 RIP:
<ffffffffa0174b28>{:bnx2:bnx2_poll+240}
PML4 16fa70067 PGD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: joydev netconsole netdump lock_dlm(U) gfs(U) lock_harness(U)
dlm(U) cman(U) md5 ipv6 dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd
hw_random shpchp tg3 bnx2 ext3 jbd mppVhba(U) qla2400 qla2322 ata_piix libata
qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptfc mptscsi mptbase
mppUpper(U) sg sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-42.0.3.ELsmp
RIP: 0010:[<ffffffffa0174b28>] <ffffffffa0174b28>{:bnx2:bnx2_poll+240}
RSP: 0018:ffffffff80456e78  EFLAGS: 00010206
RAX: 000000000000dd68 RBX: 00000102266d9530 RCX: 00000102267c4000
RDX: ffffffff804e5180 RSI: 000000000000dd67 RDI: 0000000000000053
RBP: 0000000000000000 R08: 0000010006aa56b0 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000000a R12: 000001022602e380
R13: 000000000000dd53 R14: 0000000000000060 R15: 0000000000000006
FS:  0000000000000000(0000) GS:ffffffff804e5180(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000108 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff804e8000, task ffffffff803d4400)
Stack: 0000000000000000 0000010006aa56b0 ffffffff80114e81 000000000000006c
       00000001429f8010 0000000100006007 00000101429f8010 00d0010006bb2d00
       ffffffff8045f0d1 0000010001066440
Call Trace:<IRQ> <ffffffff80114e81>{end_8259A_irq+96}
<ffffffff8011f06d>{powernowk8_target+373}
       <ffffffff801621dd>{cache_flusharray+107}
<ffffffff802b0960>{net_rx_action+203}
       <ffffffff8013c73c>{__do_softirq+88} <ffffffff8013c7e5>{do_softirq+49}
       <ffffffff80113247>{do_IRQ+328} <ffffffff80110833>{ret_from_intr+0}
        <EOI> <ffffffff8010e84c>{mwait_idle+86} <ffffffff8010e7dc>{cpu_idle+26}
       <ffffffff804eb67b>{start_kernel+470} <ffffffff804eb1d5>{_sinittext+469}


Code: 48 8b 85 08 01 00 00 66 83 78 08 00 74 25 8b 40 04 41 8d 54
RIP <ffffffffa0174b28>{:bnx2:bnx2_poll+240} RSP <ffffffff80456e78>
CR2: 0000000000000108
CPU#0 is executing netdump.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is frozen.
poll_lock is locked, unable to take a dump!
rebooting in 5 seconds

The traces all point to the same instruction causing the oops. Picking the
module apart in crash and stap seems to point at the following code in bnx2.c:

1606 #ifdef BCM_TSO
1607            /* partial BD completions possible with TSO packets */
1608            if (skb_shinfo(skb)->tso_size) {
1609                    u16 last_idx, last_ring_idx;
1610
1611                    last_idx = sw_cons +
1612                            skb_shinfo(skb)->nr_frags + 1;
1613                    last_ring_idx = sw_ring_cons +
1614                            skb_shinfo(skb)->nr_frags + 1;
1615                    if (unlikely(last_ring_idx >= MAX_TX_DESC_CNT)) {
1616                            last_idx++;
1617                    }
1618                    if (((s16) ((s16) last_idx - (s16) hw_cons)) > 0) {
1619                            break;
1620                    }
1621            }
1622 #endif

There are several more traces, all very similar and all seeming to come from
softirq context. Seems like something is freeing the skb before the poll
function gets called.
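
For reference, the cast-to-s16 comparison on line 1618 of the listing is a wraparound-safe way to ask whether one 16-bit index is ahead of another. A minimal standalone sketch follows. The helper name idx_after is made up for illustration and does not exist in bnx2.c, and the code assumes the usual two's-complement narrowing to int16_t:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when index a is strictly ahead of index b,
 * treating both as free-running 16-bit counters that wrap at 0xffff.
 * The unsigned difference is reinterpreted as signed, so distances of
 * up to 0x7fff in either direction compare correctly across the wrap.
 */
static bool idx_after(uint16_t a, uint16_t b)
{
    return (int16_t)(uint16_t)(a - b) > 0;
}
```

In the listing, the loop breaks while last_idx is still ahead of hw_cons, that is, while the hardware has not yet completed every BD of the TSO packet.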

Will attach further traces shortly.

Comment 1 Bryn M. Reeves 2006-10-24 19:46:37 UTC
Shoulda said - the oops trace seems to point to bnx2.c:1608

Comment 2 Bryn M. Reeves 2006-10-24 19:52:22 UTC
Created attachment 139261 [details]
oops message from bnx2 1.4.38 (1)

Comment 3 Bryn M. Reeves 2006-10-24 19:54:54 UTC
Created attachment 139262 [details]
oops message from bnx2 1.4.38 (2)

Comment 4 Bryn M. Reeves 2006-10-24 19:58:57 UTC
Created attachment 139263 [details]
oops message from bnx2 1.4.43 (1)

Similar oops seen on the updated 1.4.43 bnx2 driver in linville's test kernels.
Reporter states that this version oopses more frequently.

Comment 5 Andy Gospodarek 2006-10-24 21:03:10 UTC
Created attachment 139271 [details]
netdev-2.6 commit 104812/2f8af120a159a843948749ea88bcacda9779b132

This patch might be interesting to try.  It came out shortly after the update
provided by linville and it appears to handle a race involving bnx2_tx_int(). 
It seems like it might be a long shot since the scenario described may not
match the situation exactly, but it might be another option.

Comment 6 Marco Bill-Peter 2006-10-25 01:51:32 UTC
raised an exception for 4.5 - significant customer impact in new Korean market
for large customer

thanks

marco

Comment 8 Andy Gospodarek 2006-11-02 20:49:11 UTC
It seems the bnx2 patches in these builds:

http://people.redhat.com/agospoda/#rhel4

resolve this issue.

Comment 9 Daniel Riek 2006-11-14 15:15:08 UTC
PM ACK following Devel ACK

Comment 11 Andy Gospodarek 2006-11-15 21:35:35 UTC
0xffffffffa07eab20 <bnx2_poll+232>:     add    0x38(%r12),%rbx    bp->tx_buf_ring (tx_buf_ring is offset 0x38)
0xffffffffa07eab25 <bnx2_poll+237>:     mov    (%rbx),%rbp        bnx2.c:1605 skb = tx_buf->skb
0xffffffffa07eab28 <bnx2_poll+240>:     mov    0x108(%rbp),%rax   bnx2.c:1608 skb_shinfo(skb) is really skb->end
0xffffffffa07eab2f <bnx2_poll+247>:     cmpw   $0x0,0x8(%rax)     bnx2.c:1608 if (skb_shinfo(skb)->tso_size)
0xffffffffa07eab34 <bnx2_poll+252>:     je     0xffffffffa07eab5b bnx2.c:1608

Basically this code maps to the following lines:

   1604                 tx_buf = &bp->tx_buf_ring[sw_ring_cons];
   1605                 skb = tx_buf->skb;
   1606 #ifdef BCM_TSO 
   1607                 /* partial BD completions possible with TSO packets */
   1608                 if (skb_shinfo(skb)->tso_size) {
   1609                         u16 last_idx, last_ring_idx;

So this problem occurs because the skb coming from the ring tx_buf_ring is NULL.
I suppose that we could add a check to make sure that this isn't NULL before
continuing, but that seems like a hack rather than an actual fix....
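
The arithmetic behind that conclusion can be sketched outside the kernel. The struct below is a hypothetical stand-in, not the real struct sk_buff: it only places an end pointer at offset 0x108, the displacement seen in the mov at bnx2_poll+240. With a NULL base pointer the load address equals that offset, which matches CR2 in the oops.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the skb layout; only the offset of 'end' matters. */
struct mock_skb {
    unsigned char other_fields[0x108]; /* placeholder for everything before 'end' */
    unsigned char *end;                /* skb_shinfo(skb) is derived from skb->end */
};

/* Address the CPU would load from when reading skb->end for a given base. */
static uintptr_t end_field_address(uintptr_t skb_base)
{
    return skb_base + offsetof(struct mock_skb, end);
}
```

With skb_base set to 0, end_field_address() returns 0x108, the faulting address reported as "NULL pointer dereference at 0000000000000108".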



Comment 13 Andy Gospodarek 2006-11-16 04:09:11 UTC
I don't think TSO has much to do with this -- it seems that skbs on the
ring-buffer are already free when we try to access them.  I'm not sure if
checking to see that they are NULL before continuing is the right idea, but it
seems like it could be an option.  I'm going to contact Michael Chan (upstream
maint) and see if he's got any thoughts.



Comment 14 Andy Gospodarek 2006-11-16 14:14:43 UTC
I just traded emails with Michael Chan and he suggested a patch that is not
currently upstream.  I will post a file that can be used for testing shortly.

Comment 15 Andy Gospodarek 2006-11-16 21:27:01 UTC
Created attachment 141417 [details]
bnx2-mmio.patch

Suggested patch from Michael Chan.

Comment 18 Need Real Name 2006-11-17 15:50:19 UTC
Created attachment 141486 [details]
bnx2 dump

Comment 19 Need Real Name 2006-11-17 15:58:44 UTC
Comment #18 is a dump from RHEL4u4 (.0.2 kernel) running on an HP DL360 G5. I am
building the kernel with this and the previous two patches mentioned and will
test asap.

Comment 20 Andy Gospodarek 2006-11-17 16:33:19 UTC
To make your life easy, use these 2 patches:

BZ 202053	bnx2-1_4_43.patch	Experimental
BZ 212055	bnx2-poll-fix.patch	Experimental

from http://people.redhat.com/agospoda/#rhel4 and then apply the patch attached
in Comment #15.  

Comment 22 Need Real Name 2006-11-20 14:58:28 UTC
Those patches are what I have built into our kernels (all three of them) and we
still see the same crash. Interesting bit is that the only machines crashing are
the ones running RHCS/GFS and it's always dlm_recvd.  We are going to run some
stress against our non-cluster bnx2 driven machines to see if we can force a crash.

Anyone have any other ideas?

Is it possible that this "bug" is only exposed by the dlm?


Comment 23 Andy Gospodarek 2006-11-20 15:04:04 UTC
Can you post the panic log from the latest kernel (the one with all 3 patches)?

Comment 24 Need Real Name 2006-11-20 17:34:04 UTC
Created attachment 141671 [details]
original unpatched bnx2 oops

This is the oops from the original RHEL 2.6.9-42.0.2.ELsmp kernel

Comment 25 Need Real Name 2006-11-20 17:35:38 UTC
Created attachment 141673 [details]
patched kernel oops

This is the oops after the latest patch (all three patches applied). There is
little difference and it seems to be crashing the same way.

Comment 26 Need Real Name 2006-11-20 17:39:20 UTC
I've recompiled the kernel again, with an extra bit on the version string for
the bnx2 driver to ensure that the patched version is indeed being used etc. Not
that there is really any question but since the crash dumps are almost
identical, I thought I'd do it just to make sure.  We have other machines using
RHEL4u4 that are not clustered and seem to be running fine. We will be load
testing them further to see if we can bring them down. Basically all we have to
do is one of our rsyncs from a cluster to a non-clustered node and it brings down
the clustered node with the above oops.

Comment 27 Andy Gospodarek 2006-11-20 18:45:28 UTC
Created attachment 141677 [details]
bnx2-hack.patch

I'm not a huge fan of this patch, but if you would like to build a kernel with
it and test it out feel free.  It basically skips around the code that should
free the skb whenever the skb in the ring-buffer has already been freed.  This
doesn't resolve the issue at hand, but it could allow things to continue
running -- or it might cause an infinite loop inside the while loop in
bnx2_tx_int().

Comment 28 Need Real Name 2006-11-20 18:55:49 UTC
Will try this in the morning.  Important to know though that the patched kernel
did in fact run longer than the unpatched one. Used to be we could force the
crash by running our rsync job, now I've been told that the rsync completes
(several hundred gigs)...

Comment 29 Andy Gospodarek 2006-11-20 19:18:00 UTC
Thanks for the feedback and the testing!

Comment 30 Michael Chan 2006-11-20 22:31:25 UTC
Created attachment 141702 [details]
debug patch

This patch will print out some of the tx states when the SKB becomes
unexpectedly NULL.  Please test it and post the debug dmesg.

Comment 31 Andy Gospodarek 2006-11-20 22:46:43 UTC
Michael's patch is preferred to mine so please consider using it instead.

Comment 32 Need Real Name 2006-11-21 12:37:34 UTC
Building it now. Will let you know how things go...  Thanks. This is fairly
important to us as we were having other issues on another system using different
nics and RHEL4u3. Hard to tell if those other problems were fixed if this one
keeps getting in the way.

Comment 33 Jay Turner 2006-11-21 12:52:30 UTC
Deferring QE ack until we sort out what's going on and the proposed fix.

Comment 35 Need Real Name 2006-11-22 16:56:12 UTC
We've applied the patch and have one of the 5 machines complaining loudly about
NULL skb.  There was a ~30 minute period last night during which over 16k of
these messages were logged.

tx_prod b790 tx_cons b67f hw_tx_cons b6a4

There are loads of these, with different values. It's difficult for me to get
the actual files uploaded due to the environment so I am transcribing everything.

In the middle of them all are several instances where the bnx2 link goes down,
after a NETDEV WATCHDOG message then immediately returns to an UP state. There
is also one instance of oom-killer complaining about 

oom-killer: gfp_mask=0xd0

However we have oom-killer disabled so it was bypassed.

The bnx2 errors continued, then cleared up roughly 30 minutes after they started. 

Last weekend, we did have two crashes which were 'pre-debug patch' but also had
the three other patches applied. I suspect it's a matter of time before one of
these nodes gives up.

Unfortunately to ensure our build process is working through kernel patches, we
rebuild the nodes completely and have lost the old logs. 

For what it's worth, we've had some non-cluster machines (same hardware specs)
pounding away at the 1.4.43c drivers as shipped from HP for over a day now
rsyncing 30 GB to 6 clients 13 times now and still chugging..

Anyway, will keep you posted.

Cheers

Comment 36 Michael Chan 2006-11-22 17:19:50 UTC
> tx_prod b790 tx_cons b67f hw_tx_cons b6a4

Thanks for testing.  Something is very strange here.  The tx_prod should never 
be ahead of the tx_cons by more than 0xff because that's the size of the tx 
ring.  Here, the difference between the two is 0x111.
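
A small sketch of the check being described, assuming free-running 16-bit indices as in the dmesg line quoted above. The helper names are made up for illustration, and the limit is passed in rather than hard-coded:

```c
#include <stdbool.h>
#include <stdint.h>

/* Distance from cons to prod, correct across the 0xffff -> 0x0 wrap. */
static uint16_t tx_gap(uint16_t prod, uint16_t cons)
{
    return (uint16_t)(prod - cons);
}

/* True when the producer is further ahead of the consumer than 'limit'. */
static bool tx_gap_too_large(uint16_t prod, uint16_t cons, uint16_t limit)
{
    return tx_gap(prod, cons) > limit;
}
```

For the reported values, tx_gap(0xb790, 0xb67f) is 0x111, so tx_gap_too_large(0xb790, 0xb67f, 0xff) is true.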

Is this the very first NULL skb dmesg in the log?  If not, please post the very 
first one.

Thanks.

Comment 37 Need Real Name 2006-11-22 18:19:15 UTC
Yes, that's the very first one. Anything else I can do? 

Comment 38 Need Real Name 2006-11-22 18:29:17 UTC
Just did some checking and all of the messages I checked are over 0xff diff. In
fact the very next one is 0x1f0 off.

Comment 39 Michael Chan 2006-11-22 19:48:15 UTC
Created attachment 141931 [details]
debug patch #2

I don't have a theory why this is happening, so I am attaching another debug
patch to see if it yields more information.

The tx_prod tells the chip to send new packets.  When the tx packets are
completed, the chip tells the driver and the driver updates tx_cons.  The
difference between the 2 can be at most 255.  When the ring is full, we stop
the tx_queue until at least half the ring is available again.  The start_xmit()
thread updates tx_prod, and the NAPI poll thread updates tx_cons.
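
The flow described above can be modelled in a few lines of userspace C. This is a deliberately simplified sketch, not the driver's code: it uses one index per ring entry, ignores locking between the two paths, and all names (tx_model, model_start_xmit, model_tx_int) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 255u   /* usable entries, per the description above */

struct tx_model {
    uint16_t prod;          /* advanced by the transmit path */
    uint16_t cons;          /* advanced by the completion (poll) path */
    bool queue_stopped;     /* stands in for the stopped tx queue */
};

/* Free entries, computed from the wrapping 16-bit index difference. */
static unsigned tx_free(const struct tx_model *m)
{
    return RING_ENTRIES - (uint16_t)(m->prod - m->cons);
}

/* Transmit path: queue one entry and stop the queue once the ring is full. */
static bool model_start_xmit(struct tx_model *m)
{
    if (m->queue_stopped || tx_free(m) == 0)
        return false;
    m->prod++;
    if (tx_free(m) == 0)
        m->queue_stopped = true;
    return true;
}

/* Completion path: retire n entries and wake the queue at half a ring free. */
static void model_tx_int(struct tx_model *m, unsigned n)
{
    m->cons = (uint16_t)(m->cons + n);
    if (m->queue_stopped && tx_free(m) >= RING_ENTRIES / 2)
        m->queue_stopped = false;
}
```

In this model the difference between prod and cons can never exceed 255, which is why the values in the earlier dmesg output look impossible.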

Please keep the first debug patch and add this one on top.  Please post the
first few dmesg entries when the debug code is triggered.

Thanks.

Comment 40 Need Real Name 2006-11-22 20:34:58 UTC
Will do. Started the build, but due to travel it'll be monday before I get the
results. Thanks for all your help.

Comment 41 Andy Gospodarek 2006-11-28 17:44:46 UTC
cjk, any results to report from the second debug patch?

Comment 42 Need Real Name 2006-12-04 01:39:31 UTC
Sorry, I am out of the country and getting back to my shop is a pain. Here are
some results from the patch.

Bad index in bnx2_start_tx()
tx_prod 0 tx_cons fffd hw_tx_cons fffd tmp 0
Bad index in bnx2_start_tx()
tx_prod 1 tx_cons fffd hw_tx_cons fffd tmp 1
Bad index in bnx2_start_tx()
tx_prod 2 tx_cons fffd hw_tx_cons fffd tmp 2
Bad index in bnx2_start_tx()
tx_prod 3 tx_cons fffd hw_tx_cons fffd tmp 3
Bad index in bnx2_start_tx()
tx_prod 4 tx_cons fffd hw_tx_cons fffd tmp 4
Bad index in bnx2_start_tx()
tx_prod 5 tx_cons fffd hw_tx_cons fffd tmp 5
Bad index in bnx2_start_tx()
tx_prod 6 tx_cons fffd hw_tx_cons fffd tmp 6
Bad index in bnx2_start_tx()
tx_prod 7 tx_cons fffd hw_tx_cons fffd tmp 7
Bad index in bnx2_start_tx()
tx_prod 8 tx_cons 8 hw_tx_cons 8 tmp 8
Bad index in bnx2_start_tx()
tx_prod 0 tx_cons fffd hw_tx_cons fffd tmp 0
Bad index in bnx2_start_tx()
tx_prod 1 tx_cons fffe hw_tx_cons fffe tmp 1
Bad index in bnx2_start_tx()
tx_prod 0 tx_cons 0 hw_tx_cons ffff tmp 0
Bad index in bnx2_start_tx()
tx_prod 9 tx_cons fffb hw_tx_cons fffb tmp 9
Bad index in bnx2_start_tx()
tx_prod c tx_cons c hw_tx_cons c tmp c
Bad index in bnx2_start_tx()
tx_prod 0 tx_cons fffe hw_tx_cons fffe tmp 0

Those are the first occurrences. There are no bnx2_start_init() errors logged as
of yet, I'll keep looking.

I hope this helps....







Comment 43 Michael Chan 2006-12-04 22:33:29 UTC
This dmesg log doesn't make any sense.  The value of tmp should be copied from 
tx_cons in bnx2_start_xmit().  In the dmesg, it seems that it was copied from 
tx_prod.  Was the patch applied properly?

Comment 44 Need Real Name 2006-12-04 23:01:22 UTC
I'll double check the patch and let you know. If I need to I'll re-type/apply it
and try again.

Thanks

Comment 45 Need Real Name 2006-12-05 15:07:38 UTC
I corrected the patch and rebuilt/reinstalled the kernel. Waiting for the
results now...  Thanks again.



Comment 46 Andy Gospodarek 2006-12-05 15:34:48 UTC
New test kernels with the latest debug patch are available here:

http://people.redhat.com/agospoda/#rhel4

Please give them a try and post the results.

Comment 47 Need Real Name 2006-12-06 22:09:23 UTC
Haven't gotten a chance to get the test kernel but I have some results from the
patches.

Bad index in bnx2_start_tx()
tx_prod c885 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx()
tx_prod c898 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx()
tx_prod c8ab tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx()
tx_prod c8be tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx()
tx_prod c8bf tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx()
tx_prod c8c0 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx()
tx_prod c8c1 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx()
tx_prod c8c3 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx()
tx_prod c8c7 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx()
tx_prod c8d6 tx_cons c785 hw_tx_cons c785 tmp c785
NULL skb in bnx2_tx_int()
tx_prod c8d6 tx_cons c785 hw_tx_cons c8d6
Bad index in bnx2_tx_int()
tmp c8d6 tx_prod c8d6 tx_cons c7d6 hw_tx_cons c8d6
Bad index in bnx2_start_tx()
tx_prod c8e5 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c8f4 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c8f5 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c905 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c913 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c922 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c931 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c940 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c94f tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c950 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c95f tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c96d tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c97c tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c97d tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c98c tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c99a tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c9a8 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c9a9 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c9b8 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx()
tx_prod c9c7 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
NULL skb in bnx2_tx_int()
tx_prod c9c7 tx_cons c7d6 hw_tx_cons c9c7
Bad index in bnx2_tx_int()
tmp c9c7 tx_prod c9c7 tx_cons c8c7 hw_tx_cons c9c7
Bad index in bnx2_start_tx()
tx_prod c9d6 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod c9e4 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod c9f3 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca03 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca11 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca18 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca26 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca28 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca2f tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca3d tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca44 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca52 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca59 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx()
tx_prod ca67 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7


that goes on for a bit then this..

NULL skb in bnx2_tx_int()
tx_prod cb19 tx_cons ca28 hw_tx_cons cb19
NETDEV WATCHDOG: eth1: transmit timed out
bnx2: eth1 NIC Link is Down
bnx2: eth1 NIC Link is Up, 1000 Mbps full duplex

The only thing using eth1 is the cluster infrastructure.


Installing the test kernel will likely not happen until next week when I get back
to my office.

Hope this helps...

Comment 48 Michael Chan 2006-12-07 07:33:19 UTC
Created attachment 143037 [details]
bug fix patch

Thanks for testing.  The dmesg log was very helpful.  There is a bug in
bnx2_tx_avail() when the tx ring is completely full.

I'd say the condition to trigger this bug is very rare.  The tx ring rarely
becomes completely full as we always leave room for the biggest possible TSO
packet.
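
The attached patch is not reproduced in this report, so the following is only a guess at the kind of failure involved, consistent with the dmesg in comment 47 where tx_prod minus tx_cons reaches 0x100 (c885 vs c785). If free space were derived from ring-masked indices, a completely full ring would alias to an empty one. Both functions and their names are hypothetical and are not the driver's bnx2_tx_avail():

```c
#include <stdint.h>

#define TX_DESC_CNT     256u                /* index values per ring */
#define MAX_TX_DESC_CNT (TX_DESC_CNT - 1u)  /* 255 usable entries; also the index mask */

/* Fragile variant: only the low 8 bits of the distance survive,
 * so a distance of 0x100 is indistinguishable from 0. */
static uint32_t avail_masked(uint16_t prod, uint16_t cons)
{
    uint32_t used = (uint32_t)(uint16_t)(prod - cons) & MAX_TX_DESC_CNT;
    return MAX_TX_DESC_CNT - used;
}

/* Full-width variant: keep the whole 16-bit distance and cap it at
 * the number of usable entries. */
static uint32_t avail_full_width(uint16_t prod, uint16_t cons)
{
    uint32_t used = (uint16_t)(prod - cons);
    if (used > MAX_TX_DESC_CNT)
        used = MAX_TX_DESC_CNT;
    return MAX_TX_DESC_CNT - used;
}
```

With the logged pair, avail_masked(0xc885, 0xc785) reports 255 free entries while avail_full_width(0xc885, 0xc785) reports 0. A transmit path trusting the first result would keep queuing onto a full ring.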

Please try the patch which should fix the bug.  Thanks a lot.

Comment 49 Andy Gospodarek 2006-12-07 13:45:28 UTC
Thanks for the patch, Michael.  I'll add this to the test kernels I plan to spin
later today and post here when they are ready.

Comment 50 Need Real Name 2006-12-07 21:41:32 UTC
Excellent work. I'll get this in and let it run over the weekend. Thanks...




Comment 51 Andy Gospodarek 2006-12-08 16:14:46 UTC
New test kernels with Michael's fix are available here:

http://people.redhat.com/agospoda/#rhel4

Please report any feedback you have to this Bugzilla.

Comment 52 Andy Gospodarek 2006-12-13 16:18:34 UTC
Any additional updates?

Comment 53 Need Real Name 2006-12-13 16:34:42 UTC
Yes, I was waiting a bit longer before claiming victory. We have had the 
cluster up for almost 6 days now with no debug messages at all. I am not using 
the kernels mentioned above since I didn't want to introduce any other 
variables. We are simply running stock 42.0.2 sources plus the patches from 
here. 

The question was raised about how well HP monitors this work and if they will 
latch onto this and put it into the HP supported bnx2 driver bundle. The bug 
exists in that one as well. So if HP is listening......

Thanks to all who worked on this.

Comment 54 Michael Chan 2006-12-13 16:41:49 UTC
Thanks for testing.  I will push the patch upstream later today.  I will also 
have a new driver sent to HP and other partners.

Comment 55 Andy Gospodarek 2006-12-13 18:47:29 UTC
Created attachment 143541 [details]
rhel4.patch

This is the final RHEL4 patch I plan to push internally.

Comment 57 Michael Chan 2006-12-13 20:45:38 UTC
Sorry, I need to revise the patch a bit.  There's still a potential problem 
when the ring wraps around from 0xffff to 0x0.  What makes the logic so 
difficult is that the ring uses 256 indices for 255 entries.  1 of them is 
unused and needs to be skipped.  I will post the revised patch here later today.
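
The index scheme described above can be sketched as follows. This assumes the unused slot is the last one in each block of 256 (ring index 0xff), which this report does not state explicitly, and the helper names are hypothetical.

```c
#include <stdint.h>

#define TX_DESC_CNT     256u                /* index values per ring */
#define MAX_TX_DESC_CNT (TX_DESC_CNT - 1u)  /* 255 usable entries; also the index mask */

/* Advance a free-running 16-bit index by one usable entry,
 * stepping over the unused slot at ring index 0xff. */
static uint16_t next_tx_idx(uint16_t idx)
{
    idx = (uint16_t)(idx + 1u);
    if ((idx & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT)
        idx = (uint16_t)(idx + 1u);   /* skip the unused slot */
    return idx;
}

/* Advance by a number of usable entries. */
static uint16_t advance_tx_idx(uint16_t idx, unsigned entries)
{
    while (entries--)
        idx = next_tx_idx(idx);
    return idx;
}
```

Queuing 255 entries therefore moves the index by 0x100. For example, advance_tx_idx(0xc785, 255) yields 0xc885, the same tx_cons/tx_prod pair seen in the debug output of comment 47, and next_tx_idx(0xfffe) wraps to 0x0.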

Andy, your combined patch looks ok.  The part to fix this problem will need to 
be revised when my patch is ready.

Thanks.

Comment 58 Michael Chan 2006-12-13 21:33:17 UTC
Created attachment 143559 [details]
Revised patch

This is the revised patch.  I fully expect this to work for cjk, but
please give it a try.  Revert the earlier patch first and then apply this one. 
Thanks a lot.

Comment 59 Andy Gospodarek 2006-12-15 19:06:04 UTC
Builds containing Michael's final patch are available here:

http://people.redhat.com/agospoda/#rhel4



Comment 60 Jay Turner 2007-01-02 13:46:22 UTC
QE ack for RHEL4.5.

Comment 61 Need Real Name 2007-01-02 14:10:06 UTC
Just an update, our cluster has been operational and UP for over three weeks 
now. Thanks to all who helped fix this issue.

Comment 62 Jason Baron 2007-01-05 16:31:02 UTC
committed in stream U5 build 42.38. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 66 Mike Gahagan 2007-03-21 21:11:04 UTC
Confirmed that the patch is in the -50.EL kernel.


Comment 69 Red Hat Bugzilla 2007-05-08 03:53:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html