Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 508839

Summary: [Emulex 5.4 bug] be2net: traffic stops when using INTx interrupts
Product: Red Hat Enterprise Linux 5 Reporter: Subbu Seetharaman <subbu.seetharaman>
Component: kernelAssignee: Andy Gospodarek <agospoda>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 5.4CC: andriusb, benl, bobby.suber, cward, dzickus, ivecera, jamie.wellnitz, jbao, laurie.barry, mjenner, peterm, sandy.garza, sean.murphy, syeghiay, tcamuso, vaios.papadimitriou
Target Milestone: rcKeywords: OtherQA
Target Release: 5.4   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 08:15:18 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 490074, 508404    
Bug Blocks: 461676, 508871    
Attachments:
Description Flags
Patch to fix spurious interrupt and a race in interrupt processing
none
rhel5-be2net-intx-fix-proposed.patch
none
rhel5-be2net-intx-fix-proposed2.patch none

Description Subbu Seetharaman 2009-06-30 08:31:40 UTC
Description of problem:

OS shuts off interrupt complaining nobody cared since at times we return IRQ_NONE when the  interrupt was generated by our device.

Steps to Reproduce:
Run two port I/O on a platform that does not support MSIx.  Within 30 mins the traffic will stop.

Comment 1 Subbu Seetharaman 2009-06-30 08:45:10 UTC
Created attachment 349918 [details]
Patch to fix spurious interrupt and a race in interrupt processing

This patch fixes the spurious interript problem that could result in the interrupt vector being shut off by OS.

Subbu

Comment 4 Andy Gospodarek 2009-06-30 13:36:17 UTC
Subbu, do you plan to post this patch upstream as well?

Comment 5 Subbu Seetharaman 2009-06-30 16:50:08 UTC
Yes.  It will be submitted upstream tomorrow.

Comment 6 Andy Gospodarek 2009-07-01 21:24:10 UTC
Created attachment 350200 [details]
rhel5-be2net-intx-fix-proposed.patch

The upstream patch differs pretty significantly from the one attached in comment #1.  I really don't want to deviate too much from upstrem, so can we try and keep it close.

It would seem the true backport of the upstream patch would be something like the attached patch.

Thoughts?

Comment 7 Andy Gospodarek 2009-07-01 21:26:01 UTC
Created attachment 350201 [details]
rhel5-be2net-intx-fix-proposed2.patch

Sorry, this patch actually compiles.

It would probably be a better choice.

Comment 8 Subbu Seetharaman 2009-07-02 11:15:43 UTC
Andy,

The reason why the patch I attached to this bug deviates significantly from the upstream patch for this problem is because we did this patch differently to also fix a race in the completion processing that exists only in the RH5U4 version of the driver. Since they were in the same area I decided to fix  both problem in the same patch.  This issue does not exist in the upstream where we use different event queue and interrupt vector for TX and RX completions.

The race that we fixed here is that when we are in be_poll() because of an RX completion, an interrupt could come for TX completion for which we will schedule NAPI; but when we return to be_poll from ISR, we could end up doing a napi_complete() cancelling the NAPI schedule done in TX completion interrupt.  This will result in no NAPI scheduled as well as no interrupt enabled in device brining the traffic to a halt.  We could either go with my first patch or make another patch on top of this to fix this race,  In either case we will end up the same driver.

Thanks.

Subbu

Comment 9 Andy Gospodarek 2009-07-02 18:35:45 UTC
Subbu, thanks for the clarification.  If this passes your test we can include it.  I would like to see you change the upstream driver and get rid of the napi-polling for tx, however.  It really doesn't make much sense and we can then try and bring these drivers back together for RHEL5.5.

Comment 11 Andy Gospodarek 2009-07-02 19:02:28 UTC
Subbu, I also noticed an error in this hunk:

        tx_work = be_poll_tx(adapter, budget);
        rx_work = be_poll_rx(adapter, budget);

-
        /* All consumed */
-       if (rx_work < budget) {
-               compat_napi_complete(napi);
-               be_cq_notify(&adapter->ctrl, rx_cq->id, true, rx_work);
-       } else {
-               /* More to be consumed; continue with interrupts
                disabled */
-               be_cq_notify(&adapter->ctrl, rx_cq->id, false, rx_work);
-       }
+       if (rx_work < budget)
+               napi_complete(napi);

-       if (tx_work)
-               be_cq_notify(&adapter->ctrl, tx_cq->id, true, tx_work);
+       drvr_stats(adapter)->be_num_events += num;
+       if (num)
+               be_eq_notify(ctrl, eq_obj->q.id, true, true, num);

        return rx_work;
 }

with the call to 'napi_complete' so I changed it to 'compat_napi_complete' as that should be the correct call there.  The new hunk looks like this:

        tx_work = be_poll_tx(adapter, budget);
        rx_work = be_poll_rx(adapter, budget);

-
        /* All consumed */
-       if (rx_work < budget) {
-               compat_napi_complete(napi);
-               be_cq_notify(&adapter->ctrl, rx_cq->id, true, rx_work);
-       } else {
-               /* More to be consumed; continue with interrupts
                disabled */
-               be_cq_notify(&adapter->ctrl, rx_cq->id, false, rx_work);
-       }
+       if (rx_work < budget)
+               compat_napi_complete(napi);

-       if (tx_work)
-               be_cq_notify(&adapter->ctrl, tx_cq->id, true, tx_work);
+       drvr_stats(adapter)->be_num_events += num;
+       if (num)
+               be_eq_notify(ctrl, eq_obj->q.id, true, true, num);

        return rx_work;
 }

Comment 12 RHEL Program Management 2009-07-02 19:03:48 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 13 Subbu Seetharaman 2009-07-04 11:26:21 UTC
Your chaange is correct Andy.  Thanks.

Subbu

Comment 14 Subbu Seetharaman 2009-07-06 12:03:52 UTC
Just wanted to add that we have done additional teting with INTx interrupts and the change seems to be fine.  Thanks.

Subbu

Comment 16 Don Zickus 2009-07-07 15:06:04 UTC
in kernel-2.6.18-157.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 18 Subbu Seetharaman 2009-07-09 12:49:25 UTC
Verified in 2.6.18-157.

Subbu

Comment 20 Chris Ward 2009-08-03 15:47:26 UTC
~~ Attention Partners - RHEL 5.4 Snapshot 5 Released! ~~

RHEL 5.4 Snapshot 5 is the FINAL snapshot to be release before RC. It has been 
released on partners.redhat.com. If you have already reported your test results, 
you can safely ignore this request. Otherwise, please notice that there should be 
a fix available now that addresses this particular issue. Please test and report 
back your results here, at your earliest convenience.

If you encounter any issues while testing Beta, please describe the 
issues you have encountered and set the bug into NEED_INFO. If you 
encounter new issues, please clone this bug to open a new issue and 
request it be reviewed for inclusion in RHEL 5.4 or a later update, if it 
is not of urgent severity. If it is urgent, escalate the issue to your partner manager as soon as possible. There is /very/ little time left to get additional code into 5.4 before GA.

Partners, after you have verified, do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other 
appropriate customer representative.

Comment 22 errata-xmlrpc 2009-09-02 08:15:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html