Bug 508839 - [Emulex 5.4 bug] be2net: traffic stops when using INTx interrupts
Summary: [Emulex 5.4 bug] be2net: traffic stops when using INTx interrupts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
high
high
Target Milestone: rc
: 5.4
Assignee: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On: 490074 508404
Blocks: 461676 508871
TreeView+ depends on / blocked
 
Reported: 2009-06-30 08:31 UTC by Subbu Seetharaman
Modified: 2009-09-03 13:53 UTC (History)
16 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:15:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Patch to fix spurious interrupt and a race in interrupt processing (4.47 KB, patch)
2009-06-30 08:45 UTC, Subbu Seetharaman
no flags Details | Diff
rhel5-be2net-intx-fix-proposed.patch (1.14 KB, patch)
2009-07-01 21:24 UTC, Andy Gospodarek
no flags Details | Diff
rhel5-be2net-intx-fix-proposed2.patch (1.19 KB, patch)
2009-07-01 21:26 UTC, Andy Gospodarek
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1243 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update 2009-09-01 08:53:34 UTC

Description Subbu Seetharaman 2009-06-30 08:31:40 UTC
Description of problem:

OS shuts off interrupt complaining nobody cared since at times we return IRQ_NONE when the  interrupt was generated by our device.

Steps to Reproduce:
Run two port I/O on a platform that does not support MSIx.  Within 30 mins the traffic will stop.

Comment 1 Subbu Seetharaman 2009-06-30 08:45:10 UTC
Created attachment 349918 [details]
Patch to fix spurious interrupt and a race in interrupt processing

This patch fixes the spurious interript problem that could result in the interrupt vector being shut off by OS.

Subbu

Comment 4 Andy Gospodarek 2009-06-30 13:36:17 UTC
Subbu, do you plan to post this patch upstream as well?

Comment 5 Subbu Seetharaman 2009-06-30 16:50:08 UTC
Yes.  It will be submitted upstream tomorrow.

Comment 6 Andy Gospodarek 2009-07-01 21:24:10 UTC
Created attachment 350200 [details]
rhel5-be2net-intx-fix-proposed.patch

The upstream patch differs pretty significantly from the one attached in comment #1.  I really don't want to deviate too much from upstrem, so can we try and keep it close.

It would seem the true backport of the upstream patch would be something like the attached patch.

Thoughts?

Comment 7 Andy Gospodarek 2009-07-01 21:26:01 UTC
Created attachment 350201 [details]
rhel5-be2net-intx-fix-proposed2.patch

Sorry, this patch actually compiles.

It would probably be a better choice.

Comment 8 Subbu Seetharaman 2009-07-02 11:15:43 UTC
Andy,

The reason why the patch I attached to this bug deviates significantly from the upstream patch for this problem is because we did this patch differently to also fix a race in the completion processing that exists only in the RH5U4 version of the driver. Since they were in the same area I decided to fix  both problem in the same patch.  This issue does not exist in the upstream where we use different event queue and interrupt vector for TX and RX completions.

The race that we fixed here is that when we are in be_poll() because of an RX completion, an interrupt could come for TX completion for which we will schedule NAPI; but when we return to be_poll from ISR, we could end up doing a napi_complete() cancelling the NAPI schedule done in TX completion interrupt.  This will result in no NAPI scheduled as well as no interrupt enabled in device brining the traffic to a halt.  We could either go with my first patch or make another patch on top of this to fix this race,  In either case we will end up the same driver.

Thanks.

Subbu

Comment 9 Andy Gospodarek 2009-07-02 18:35:45 UTC
Subbu, thanks for the clarification.  If this passes your test we can include it.  I would like to see you change the upstream driver and get rid of the napi-polling for tx, however.  It really doesn't make much sense and we can then try and bring these drivers back together for RHEL5.5.

Comment 11 Andy Gospodarek 2009-07-02 19:02:28 UTC
Subbu, I also noticed an error in this hunk:

        tx_work = be_poll_tx(adapter, budget);
        rx_work = be_poll_rx(adapter, budget);

-
        /* All consumed */
-       if (rx_work < budget) {
-               compat_napi_complete(napi);
-               be_cq_notify(&adapter->ctrl, rx_cq->id, true, rx_work);
-       } else {
-               /* More to be consumed; continue with interrupts
                disabled */
-               be_cq_notify(&adapter->ctrl, rx_cq->id, false, rx_work);
-       }
+       if (rx_work < budget)
+               napi_complete(napi);

-       if (tx_work)
-               be_cq_notify(&adapter->ctrl, tx_cq->id, true, tx_work);
+       drvr_stats(adapter)->be_num_events += num;
+       if (num)
+               be_eq_notify(ctrl, eq_obj->q.id, true, true, num);

        return rx_work;
 }

with the call to 'napi_complete' so I changed it to 'compat_napi_complete' as that should be the correct call there.  The new hunk looks like this:

        tx_work = be_poll_tx(adapter, budget);
        rx_work = be_poll_rx(adapter, budget);

-
        /* All consumed */
-       if (rx_work < budget) {
-               compat_napi_complete(napi);
-               be_cq_notify(&adapter->ctrl, rx_cq->id, true, rx_work);
-       } else {
-               /* More to be consumed; continue with interrupts
                disabled */
-               be_cq_notify(&adapter->ctrl, rx_cq->id, false, rx_work);
-       }
+       if (rx_work < budget)
+               compat_napi_complete(napi);

-       if (tx_work)
-               be_cq_notify(&adapter->ctrl, tx_cq->id, true, tx_work);
+       drvr_stats(adapter)->be_num_events += num;
+       if (num)
+               be_eq_notify(ctrl, eq_obj->q.id, true, true, num);

        return rx_work;
 }

Comment 12 RHEL Program Management 2009-07-02 19:03:48 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 13 Subbu Seetharaman 2009-07-04 11:26:21 UTC
Your chaange is correct Andy.  Thanks.

Subbu

Comment 14 Subbu Seetharaman 2009-07-06 12:03:52 UTC
Just wanted to add that we have done additional teting with INTx interrupts and the change seems to be fine.  Thanks.

Subbu

Comment 16 Don Zickus 2009-07-07 15:06:04 UTC
in kernel-2.6.18-157.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 18 Subbu Seetharaman 2009-07-09 12:49:25 UTC
Verified in 2.6.18-157.

Subbu

Comment 20 Chris Ward 2009-08-03 15:47:26 UTC
~~ Attention Partners - RHEL 5.4 Snapshot 5 Released! ~~

RHEL 5.4 Snapshot 5 is the FINAL snapshot to be release before RC. It has been 
released on partners.redhat.com. If you have already reported your test results, 
you can safely ignore this request. Otherwise, please notice that there should be 
a fix available now that addresses this particular issue. Please test and report 
back your results here, at your earliest convenience.

If you encounter any issues while testing Beta, please describe the 
issues you have encountered and set the bug into NEED_INFO. If you 
encounter new issues, please clone this bug to open a new issue and 
request it be reviewed for inclusion in RHEL 5.4 or a later update, if it 
is not of urgent severity. If it is urgent, escalate the issue to your partner manager as soon as possible. There is /very/ little time left to get additional code into 5.4 before GA.

Partners, after you have verified, do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other 
appropriate customer representative.

Comment 22 errata-xmlrpc 2009-09-02 08:15:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html


Note You need to log in before you can comment on or make changes to this bug.