Bug 212055
| Field | Value |
|---|---|
| Summary | recurring oops in bnx2_poll |
| Product | Red Hat Enterprise Linux 4 |
| Reporter | Bryn M. Reeves <bmr> |
| Component | kernel |
| Assignee | Andy Gospodarek <agospoda> |
| Status | CLOSED ERRATA |
| QA Contact | Brian Brock <bbrock> |
| Severity | urgent |
| Priority | urgent |
| Version | 4.4 |
| CC | andriusb, cjk, djoo, konradr, linville, mchan, peterm |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | All |
| OS | Linux |
| Fixed In Version | RHBA-2007-0304 |
| Doc Type | Bug Fix |
| Last Closed | 2007-05-08 03:53:19 UTC |
| Bug Blocks | 221457 |
Description (Bryn M. Reeves, 2006-10-24 19:41:06 UTC)
Should have said: the oops trace seems to point to bnx2.c:1608.

Created attachment 139261 [details]
oops message from bnx2 1.4.38 (1)
Created attachment 139262 [details]
oops message from bnx2 1.4.38 (2)
Created attachment 139263 [details]
oops message from bnx2 1.4.43 (1)
Similar oops seen on the updated 1.4.43 bnx2 driver in linville's test kernels.
Reporter states that this version oopses more frequently.
Created attachment 139271 [details]
netdev-2.6 commit 104812/2f8af120a159a843948749ea88bcacda9779b132
This patch might be interesting to try. It came out shortly after the update
provided by linville and it appears to handle a race involving bnx2_tx_int().
It may be a long shot, since the scenario described may not match this
situation exactly, but it's another option.
Raised an exception for 4.5: significant customer impact in a new Korean market for a large customer. Thanks, Marco.

It seems the bnx2 patches in these builds resolve this issue: http://people.redhat.com/agospoda/#rhel4

PM ACK following Devel ACK.

```
0xffffffffa07eab20 <bnx2_poll+232>: add  0x38(%r12),%rbx    # bp->tx_buf_ring (tx_buf_ring is offset 0x38)
0xffffffffa07eab25 <bnx2_poll+237>: mov  (%rbx),%rbp        # bnx2.c:1605  skb = tx_buf->skb
0xffffffffa07eab28 <bnx2_poll+240>: mov  0x108(%rbp),%rax   # bnx2.c:1608  skb_shinfo(skb) is really skb->end
0xffffffffa07eab2f <bnx2_poll+247>: cmpw $0x0,0x8(%rax)     # bnx2.c:1608  if (skb_shinfo(skb)->tso_size)
0xffffffffa07eab34 <bnx2_poll+252>: je   0xffffffffa07eab5b # bnx2.c:1608
```

Basically this code maps to the following lines:

```c
1604                 tx_buf = &bp->tx_buf_ring[sw_ring_cons];
1605                 skb = tx_buf->skb;
1606 #ifdef BCM_TSO
1607                 /* partial BD completions possible with TSO packets */
1608                 if (skb_shinfo(skb)->tso_size) {
1609                         u16 last_idx, last_ring_idx;
```

So this problem occurs because the skb coming from tx_buf_ring is NULL. I suppose we could add a check to make sure it isn't NULL before continuing, but that seems like a hack rather than an actual fix...

I don't think TSO has much to do with this; it seems that skbs on the ring buffer are already freed when we try to access them. I'm not sure whether checking that they are NULL before continuing is the right idea, but it could be an option. I'm going to contact Michael Chan (upstream maintainer) and see if he has any thoughts.

I just traded emails with Michael Chan and he suggested a patch that is not currently upstream. I will post a file that can be used for testing shortly.

Created attachment 141417 [details]
bnx2-mmio.patch
Suggested patch from Michael Chan.
Created attachment 141486 [details]
bnx2 dump
Comment #18 is a dump from RHEL4u4 (the .0.2 kernel) running on an HP DL360 G5. I am building the kernel with this and the previous two patches mentioned and will test ASAP.

To make your life easy, use these 2 patches: BZ 202053 bnx2-1_4_43.patch (Experimental) and BZ 212055 bnx2-poll-fix.patch (Experimental), from http://people.redhat.com/agospoda/#rhel4, and then apply the patch attached in Comment #15.

Those patches are what I have built into our kernels (all three of them) and we still see the same crash. The interesting bit is that the only machines crashing are the ones running RHCS/GFS, and it's always dlm_recvd. We are going to run some stress against our non-cluster bnx2-driven machines to see if we can force a crash. Anyone have any other ideas? Is it possible that this "bug" is only exposed by the DLM?

Can you post the panic log from the latest kernel (the one with all 3 patches)?

Created attachment 141671 [details]
original unpatched bnx2 oops
This is the oops from the original RHEL 2.6.9-42.0.2.ELsmp kernel
Created attachment 141673 [details]
patched kernel oops
This is the oops after the latest patch (all three patches applied). There is
little difference and it seems to be crashing the same way.
I've recompiled the kernel again, with an extra bit on the version string for the bnx2 driver to ensure that the patched version is indeed being used. Not that there is really any question, but since the crash dumps are almost identical, I thought I'd do it just to make sure. We have other machines using RHEL4u4 that are not clustered and seem to be running fine. We will be load-testing them further to see if we can bring them down. Basically, all we have to do is run one of our rsyncs from a cluster to a non-clustered node and it brings down the clustered node with the above oops.

Created attachment 141677 [details]
bnx2-hack.patch
I'm not a huge fan of this patch, but if you would like to build a kernel with
it and test it out feel free. It basically skips around the code that should
free the skb whenever the skb in the ring-buffer has already been freed. This
doesn't resolve the issue at hand, but it could allow things to continue
running -- or it might cause an infinite loop inside the while loop in
bnx2_tx_int().
Will try this in the morning. Important to know, though, that the patched kernel did in fact run longer than the unpatched one. It used to be that we could force the crash by running our rsync job; now I've been told that the rsync completes (several hundred gigs)...

Thanks for the feedback and the testing!

Created attachment 141702 [details]
debug patch
This patch will print out some of the tx states when the SKB becomes
unexpectedly NULL. Please test it and post the debug dmesg.
Michael's patch is preferred to mine, so please consider using it instead.

Building it now. Will let you know how things go... Thanks. This is fairly important to us, as we were having other issues on another system using different NICs and RHEL4u3. It's hard to tell whether those other problems were fixed if this one keeps getting in the way.

Deferring QE ack until we sort out what's going on and the proposed fix.

We've applied the patch and have one of the 5 machines complaining loudly about NULL skb. There was a ~30 minute period last night where over 16k of these messages are present:

```
tx_prod b790 tx_cons b67f hw_tx_cons b6a4
```

There are loads of these, with different values. It's difficult for me to get the actual files uploaded due to the environment, so I am transcribing everything. In the middle of them all are several instances where the bnx2 link goes down after a NETDEV WATCHDOG message, then immediately returns to an UP state. There is also one instance of oom-killer complaining (oom-killer: gfp_mask=0xd0); however, we have oom-killer disabled so it was bypassed. The bnx2 errors continued, then cleared up roughly 30 minutes after they started. Last weekend we did have two crashes which were 'pre-debug patch' but also had the three other patches applied. I suspect it's a matter of time before one of these nodes gives up. Unfortunately, to ensure our build process is working through kernel patches, we rebuild the nodes completely and have lost the old logs. For what it's worth, we've had some non-cluster machines (same hardware specs) pounding away at the 1.4.43c drivers as shipped by HP for over a day now, rsyncing 30 GB to 6 clients 13 times, and still chugging. Anyway, will keep you posted. Cheers.

> tx_prod b790 tx_cons b67f hw_tx_cons b6a4
Thanks for testing. Something is very strange here. The tx_prod should never
be ahead of the tx_cons by more than 0xff because that's the size of the tx
ring. Here, the difference between the two is 0x111.
Is this the very first NULL skb dmesg in the log? If not, please post the very
first one.
Thanks.
Yes, that's the very first one. Anything else I can do? Just did some checking and all of the messages I checked are over 0xff diff. In fact, the very next one is 0x1f0 off.

Created attachment 141931 [details]
debug patch #2
I don't have a theory why this is happening, so I am attaching another debug
patch to see if it yields more information.
The tx_prod tells the chip to send new packets. When the tx packets are
completed, the chip tells the driver and the driver updates tx_cons. The
difference between the 2 can be at most 255. When the ring is full, we stop
the tx_queue until at least half the ring is available again. The start_xmit()
thread updates tx_prod, and the NAPI poll thread updates tx_cons.
Please keep the first debug patch and add this one on top. Please post the
first few dmesg entries when the debug code is triggered.
Thanks.
Will do. Started the build, but due to travel it'll be Monday before I get the results. Thanks for all your help.

cjk, any results to report from the second debug patch?

Sorry, I am out of the country and getting back to my shop is a pain. Here are some results from the patch:

```
Bad index in bnx2_start_tx() tx_prod 0 tx_cons fffd hw_tx_cons fffd tmp 0
Bad index in bnx2_start_tx() tx_prod 1 tx_cons fffd hw_tx_cons fffd tmp 1
Bad index in bnx2_start_tx() tx_prod 2 tx_cons fffd hw_tx_cons fffd tmp 2
Bad index in bnx2_start_tx() tx_prod 3 tx_cons fffd hw_tx_cons fffd tmp 3
Bad index in bnx2_start_tx() tx_prod 4 tx_cons fffd hw_tx_cons fffd tmp 4
Bad index in bnx2_start_tx() tx_prod 5 tx_cons fffd hw_tx_cons fffd tmp 5
Bad index in bnx2_start_tx() tx_prod 6 tx_cons fffd hw_tx_cons fffd tmp 6
Bad index in bnx2_start_tx() tx_prod 7 tx_cons fffd hw_tx_cons fffd tmp 7
Bad index in bnx2_start_tx() tx_prod 8 tx_cons 8 hw_tx_cons 8 tmp 8
Bad index in bnx2_start_tx() tx_prod 0 tx_cons fffd hw_tx_cons fffd tmp 0
Bad index in bnx2_start_tx() tx_prod 1 tx_cons fffe hw_tx_cons fffe tmp 1
Bad index in bnx2_start_tx() tx_prod 0 tx_cons 0 hw_tx_cons ffff tmp 0
Bad index in bnx2_start_tx() tx_prod 9 tx_cons fffb hw_tx_cons fffb tmp 9
Bad index in bnx2_start_tx() tx_prod c tx_cons c hw_tx_cons c tmp c
Bad index in bnx2_start_tx() tx_prod 0 tx_cons fffe hw_tx_cons fffe tmp 0
```

Those are the first occurrences. There are no bnx2_start_init() errors logged as of yet; I'll keep looking. I hope this helps...

This dmesg log doesn't make any sense. The value of tmp should be copied from tx_cons in bnx2_start_xmit(). In the dmesg, it seems that it was copied from tx_prod. Was the patch applied properly?

I'll double-check the patch and let you know. If I need to, I'll re-type/apply it and try again. Thanks.

I corrected the patch and rebuilt/reinstalled the kernel. Waiting for the results now... Thanks again.
New test kernels with the latest debug patch are available here: http://people.redhat.com/agospoda/#rhel4 Please give them a try and post the results.

Haven't gotten a chance to get the test kernel, but I have some results from the patches:

```
Bad index in bnx2_start_tx() tx_prod c885 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx() tx_prod c898 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx() tx_prod c8ab tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx() tx_prod c8be tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx() tx_prod c8bf tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx() tx_prod c8c0 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx() tx_prod c8c1 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx() tx_prod c8c3 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx() tx_prod c8c7 tx_cons c785 hw_tx_cons c785 tmp c785
Bad index in bnx2_start_tx() tx_prod c8d6 tx_cons c785 hw_tx_cons c785 tmp c785
NULL skb in bnx2_tx_int() tx_prod c8d6 tx_cons c785 hw_tx_cons c8d6
Bad index in bnx2_tx_int() tmp c8d6 tx_prod c8d6 tx_cons c7d6 hw_tx_cons c8d6
Bad index in bnx2_start_tx() tx_prod c8e5 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c8f4 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c8f5 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c905 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c913 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c922 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c931 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c940 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c94f tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c950 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c95f tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c96d tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c97c tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c97d tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c98c tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c99a tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c9a8 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c9a9 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c9b8 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
Bad index in bnx2_start_tx() tx_prod c9c7 tx_cons c7d6 hw_tx_cons c8d6 tmp c7d6
NULL skb in bnx2_tx_int() tx_prod c9c7 tx_cons c7d6 hw_tx_cons c9c7
Bad index in bnx2_tx_int() tmp c9c7 tx_prod c9c7 tx_cons c8c7 hw_tx_cons c9c7
Bad index in bnx2_start_tx() tx_prod c9d6 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod c9e4 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod c9f3 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca03 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca11 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca18 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca26 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca28 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca2f tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca3d tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca44 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca52 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca59 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
Bad index in bnx2_start_tx() tx_prod ca67 tx_cons c8c7 hw_tx_cons c9c7 tmp c8c7
```

That goes on for a bit, then this:

```
NULL skb in bnx2_tx_int() tx_prod cb19 tx_cons ca28 hw_tx_cons cb19
NETDEV WATCHDOG: eth1: transmit timed out
bnx2: eth1 NIC Link is Down
bnx2: eth1 NIC Link is Up, 1000 Mbps full duplex
```

The only thing using eth1 is the cluster infrastructure. Installing the test kernel will likely not happen until next week when I get back to my office. Hope this helps...

Created attachment 143037 [details]
bug fix patch
Thanks for testing. The dmesg log was very helpful. There is a bug in
bnx2_tx_avail() when the tx ring is completely full.
I'd say the condition to trigger this bug is very rare. The tx ring rarely
becomes completely full as we always leave room for the biggest possible TSO
packet.
Please try the patch which should fix the bug. Thanks a lot.
Thanks for the patch, Michael. I'll add this to the test kernels I plan to spin later today and post here when they are ready.

Excellent work. I'll get this in and let it run over the weekend. Thanks...

New test kernels with Michael's fix are available here: http://people.redhat.com/agospoda/#rhel4 Please report any feedback you have to this Bugzilla.

Any additional updates?

Yes, I was waiting a bit longer before claiming victory. We have had the cluster up for almost 6 days now with no debug messages at all. I am not using the kernels mentioned above since I didn't want to introduce any other variables; we are simply running stock 42.0.2 sources plus the patches from here. The question was raised about how well HP monitors this work and whether they will latch onto this and put it into the HP-supported bnx2 driver bundle. The bug exists in that one as well. So if HP is listening... Thanks to all who worked on this.

Thanks for testing. I will push the patch upstream later today. I will also have a new driver sent to HP and other partners.

Created attachment 143541 [details]
rhel4.patch
This is the final RHEL4 patch I plan to push internally.
Sorry, I need to revise the patch a bit. There's still a potential problem when the ring wraps around from 0xffff to 0x0. What makes the logic so difficult is that the ring uses 256 indices for 255 entries; one of them is unused and needs to be skipped. I will post the revised patch here later today.

Andy, your combined patch looks OK. The part that fixes this problem will need to be revised when my patch is ready. Thanks.

Created attachment 143559 [details]
Revised patch
This is the revised patch. I fully expect this to work for cjk, but
please give it a try. Revert the earlier patch first and then apply this one.
Thanks a lot.
Builds containing Michael's final patch are available here: http://people.redhat.com/agospoda/#rhel4

QE ack for RHEL4.5.

Just an update: our cluster has been operational and up for over three weeks now. Thanks to all who helped fix this issue.

Committed in stream U5 build 42.38. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

Confirmed that the patch is in the -50.EL kernel.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html