Description of problem:

The following 3 commit IDs do a better job of explaining the problem than I can here. Very briefly, it is possible for one MSI-X vector to receive enough traffic that it will overwrite data needed by another MSI-X vector's receive path.

Commit IDs
==========
e4af1af900328e4aa71cd5df75bb22669ab11522
e92967bfb1f4fa7da7c425df9239c4bb615dec30
f89f38b8ec3171664314669a1396ab70b43e8961
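To make the failure mode concrete, here is a minimal userspace sketch (hypothetical names and sizes, not the tg3 code) of the kind of race those commits address: two MSI-X rx return rings refill one shared "standard" producer ring. If each vector advances the shared producer index independently, a busy vector can lap the other and overwrite buffer descriptors the slower vector still needs.

```python
PROD_RING_SIZE = 8

class ProdRing:
    def __init__(self):
        self.idx = 0                           # shared producer index
        self.owner = [None] * PROD_RING_SIZE   # which vector posted each slot

def refill_unsafe(ring, vec, count):
    """Refill as if this vector owned the ring alone (no coordination)."""
    for _ in range(count):
        ring.owner[ring.idx] = vec
        ring.idx = (ring.idx + 1) % PROD_RING_SIZE

ring = ProdRing()
refill_unsafe(ring, 1, 3)   # vector 1 posts 3 buffers (slots 0-2)
refill_unsafe(ring, 2, 7)   # a busier vector 2 posts 7 more and wraps

# Vector 2 wrapped around and overwrote two of vector 1's slots.
overwritten = sum(1 for i in range(3) if ring.owner[i] != 1)
print("slots overwritten:", overwritten)   # -> 2
```

The upstream fix is, in effect, to make the refill accounting per-vector so that no handler can silently reclaim slots another vector still references.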
Matt, we are at the very end of the RHEL 5.5 development cycle - no guarantees on this making it.
Understood. At least the problem is documented and trackable.
An RPM with this fix can be found on my people page. See http://people.redhat.com/jfeeney/.rhel5-tg3 Testing feedback would be appreciated.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
The rx side locks up under mild stress. Can I see the sources?
Unfortunately, John is on PTO until next week. I'm not sure if anyone is filling in for him, though.
(In reply to comment #9)
> The rx side locks up under mild stress. Can I see the sources?

Unfortunately I do not have access to the source, as John is out of town. Is there any interesting output we might be able to use for debugging?
Well, I have a couple of observations, but I'm not sure where they are leading me.

If I turn TSO off, the problem goes away. However, I don't think the problem is with TSO. The problem these patches are trying to solve is brought about by large volumes of traffic, and changing the TSO setting alters the traffic flow seen by the hardware. So on the surface, the behavior looks similar to the behavior of the problem we were trying to solve.

I have the 57765 connected back-to-back with another machine. On the remote machine, I see packets arriving and responses leaving. On the local side, I see the tx and rx stat counters incrementing, but obviously the traffic gets lost somewhere on the rx side. rxbds_empty isn't increasing. If only the first two patches were applied, you would see the hardware being starved of buffers, which would have made this counter increment. So this doesn't appear to hint at a cause yet.

In the back of my head, I'm wondering if it is O.K. in Red Hat's pseudo-NAPI scheme for one NAPI instance to schedule another NAPI instance. Testing so far suggests it is O.K., but this is where my mind runs when I encounter a problem.

All in all, I suspect a backporting problem, but I'm still trying to guess where the problem might be.
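The cross-instance scheduling question above can be modeled in a few lines. This is a toy model (hypothetical names, not the driver code) of a pseudo-NAPI run queue in which one instance's poll routine schedules another instance; the pattern stays safe as long as scheduling is idempotent, i.e. an already-queued instance is not queued twice.

```python
from collections import deque

class Napi:
    def __init__(self, name):
        self.name = name
        self.scheduled = False
        self.polls = 0

run_queue = deque()

def napi_schedule(napi):
    # Idempotent: mirrors the "already scheduled" test real NAPI performs.
    if not napi.scheduled:
        napi.scheduled = True
        run_queue.append(napi)

napi0 = Napi("napi0")
napi1 = Napi("napi1")

def poll(napi):
    napi.polls += 1
    napi.scheduled = False
    if napi is napi1:
        # One NAPI instance scheduling another, as in the backport.
        napi_schedule(napi0)

# Interrupt fires for napi1; run the queue to completion.
napi_schedule(napi1)
while run_queue:
    poll(run_queue.popleft())

print(napi0.polls, napi1.polls)   # -> 1 1
```

Each instance runs exactly once here, which matches the testing observation that cross-instance scheduling appears to be O.K. in itself.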
Created attachment 395580 [details] fix submitted for review Attached is the patch that was submitted internally for review, which I'd assume matches what John had in his test build...
Matt @ BRCM: Based on the issues you found late last week, we are going to hold off on including this bugzilla until the issues are resolved. Jarod's attachment should be the source you were looking for, since John is on vacation.
Actually, the behavior we see now is still better than a crash. The device stall can be quickly cleared with an 'ifconfig eth0 down && ifconfig eth0 up'. But I think I found the problem.

@@ -4435,18 +4442,19 @@ next_pkt_nopost:
 	tpr->rx_std_prod_idx = std_prod_idx % TG3_RX_RING_SIZE;
 	tpr->rx_jmb_prod_idx = jmb_prod_idx % TG3_RX_JUMBO_RING_SIZE;

-	netif_rx_schedule(tp->napi[0].dummy_netdev);
+	if (tnapi != &tp->napi[1])
+		netif_rx_schedule(tp->napi[0].dummy_netdev);

I think this should be:

+		netif_rx_schedule(tp->napi[1].dummy_netdev);
(In reply to comment #15)
> Actually, the behavior we see now is still better than a crash. The device
> stall can be quickly cleared with an 'ifconfig eth0 down && ifconfig eth0 up'.
> But I think I found the problem.
>
> @@ -4435,18 +4442,19 @@ next_pkt_nopost:
> 	tpr->rx_std_prod_idx = std_prod_idx % TG3_RX_RING_SIZE;
> 	tpr->rx_jmb_prod_idx = jmb_prod_idx % TG3_RX_JUMBO_RING_SIZE;
>
> -	netif_rx_schedule(tp->napi[0].dummy_netdev);
> +	if (tnapi != &tp->napi[1])
> +		netif_rx_schedule(tp->napi[0].dummy_netdev);
>
> I think this should be :
>
> +	netif_rx_schedule(tp->napi[1].dummy_netdev);

That looks correct to me, Matt. I will respin the patch and attach it. If I make them against the latest kernels here: http://people.redhat.com/jwilson/el5/ will you be able to test them, or should I spin some test kernels for you?
Created attachment 395777 [details] tg3-fix-panic-under-load.patch I agree with Matt's assessment; this is the patch I would propose in place of the one John originally posted.
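As a sanity check on why the one-line change matters, here is a toy model (hypothetical, not the driver source). Assume, as in this backport, that napi[1] is the instance whose poll routine refills the standard rx producer ring: if the rx path wakes napi[0] instead, the refill work never runs and the ring drains, giving an rx stall rather than a crash.

```python
def run(wake_correct_instance):
    buffers = 4                     # descriptors left in the producer ring
    refills = 0

    def napi0_poll():
        pass                        # napi[0] does not refill the std ring

    def napi1_poll():
        nonlocal buffers, refills
        buffers += 4                # post fresh rx buffers
        refills += 1

    for _ in range(4):              # four bursts of incoming traffic
        buffers -= 1                # hardware consumes a descriptor
        # rx completion decides which instance to wake:
        (napi1_poll if wake_correct_instance else napi0_poll)()
    return buffers, refills

print(run(False))   # wakes napi[0]: -> (0, 0)   ring drains, rx stalls
print(run(True))    # wakes napi[1]: -> (16, 4)  ring stays stocked
```

This matches the symptom reported earlier: counters increment, but rx traffic is eventually lost once the producer ring is exhausted.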
I think I should be able to test against the kernels at jwilson, as long as I have a source RPM to extract the tg3 driver from. It looks like it is there. Which test kernel at jwilson should I apply this patch against?
Matt, they were against 2.6.18-189.el5: http://people.redhat.com/jwilson/el5/189.el5/
The fix is looking good. I'm going to let the machine run in my home-grown stress environment for an hour or two before I give the official thumbs-up though.
Thanks, Matt. I'm glad you were able to get it going (I'm not surprised). :) Sometimes it takes a while to get an official build out of here, so I'm glad you were able to get it built and kick off a test run. As long as this looks good by tomorrow morning, I don't see any problem getting this included in 5.5.
I'm satisfied that this patch fixes the problem. My test ran for long enough without stalls or crashes. I'll submit my test version to our QA dept for more testing.
Thanks, Matt. I will submit this for inclusion in RHEL5.5.
Matt, my test kernels finished, so if you want to pass them on to others on your team feel free. They are available here: http://people.redhat.com/agospoda/#rhel5
Following comment 23, verified on the 57765 with the test version that Matt gave us. Load/unload and copy/compare stress tests pass.
Following comment 26, this test kernel (2.6.18.190.el5_x86_64) still fails on the Broadcom 57765. Ping failures occurred due to dropped Rx packets. Checking statistics with ethtool -S shows rxbds_empty=1, and rx_discard is increasing.
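For repeated checks like the one above, a small helper can pull specific counters out of the statistics dump. This is a sketch that assumes the "name: value" layout that `ethtool -S <dev>` prints; the sample text and counter names here mirror the comment above and may vary by driver version.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines from an `ethtool -S` dump into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().lstrip("-").isdigit():
            stats[name.strip()] = int(value.strip())
    return stats

sample = """NIC statistics:
     rx_packets: 123456
     rxbds_empty: 1
     rx_discard: 42
"""

stats = parse_ethtool_stats(sample)
print(stats["rxbds_empty"], stats["rx_discard"])   # -> 1 42
```

Taking two dumps a few seconds apart and diffing the returned dicts shows at a glance whether rx_discard is still climbing.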
Ben, I'm having a hard time understanding the problems you are seeing. Can you clarify exactly which kernel you were using? (2.6.18-190.el5 will not have the fix for this issue, but 2.6.18-190.el5.gtest.85 will.)
Ah. That may be Ben's problem. I just tested 2.6.18-190.el5.gtest.85. It works fine for me.
Thanks, Matt!
Sorry, it was my fault for using the wrong test kernel. I am now testing with 2.6.18-190.el5.gtest.85, and it works well.
No problem, Ben. Glad this issue is resolved with my test kernels.
in kernel-2.6.18-191.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
For kernel 2.6.18-191.el5, QA has verified the driver load/unload and IPv6 Checksum Offload test. It works well.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html