Description of problem: OS shuts off interrupt complaining nobody cared since at times we return IRQ_NONE when the interrupt was generated by our device. Steps to Reproduce: Run two port I/O on a platform that does not support MSIx. Within 30 mins the traffic will stop.
Created attachment 349918 [details] Patch to fix spurious interrupt and a race in interrupt processing This patch fixes the spurious interript problem that could result in the interrupt vector being shut off by OS. Subbu
Subbu, do you plan to post this patch upstream as well?
Yes. It will be submitted upstream tomorrow.
Created attachment 350200 [details] rhel5-be2net-intx-fix-proposed.patch The upstream patch differs pretty significantly from the one attached in comment #1. I really don't want to deviate too much from upstrem, so can we try and keep it close. It would seem the true backport of the upstream patch would be something like the attached patch. Thoughts?
Created attachment 350201 [details] rhel5-be2net-intx-fix-proposed2.patch Sorry, this patch actually compiles. It would probably be a better choice.
Andy, The reason why the patch I attached to this bug deviates significantly from the upstream patch for this problem is because we did this patch differently to also fix a race in the completion processing that exists only in the RH5U4 version of the driver. Since they were in the same area I decided to fix both problem in the same patch. This issue does not exist in the upstream where we use different event queue and interrupt vector for TX and RX completions. The race that we fixed here is that when we are in be_poll() because of an RX completion, an interrupt could come for TX completion for which we will schedule NAPI; but when we return to be_poll from ISR, we could end up doing a napi_complete() cancelling the NAPI schedule done in TX completion interrupt. This will result in no NAPI scheduled as well as no interrupt enabled in device brining the traffic to a halt. We could either go with my first patch or make another patch on top of this to fix this race, In either case we will end up the same driver. Thanks. Subbu
Subbu, thanks for the clarification. If this passes your test we can include it. I would like to see you change the upstream driver and get rid of the napi-polling for tx, however. It really doesn't make much sense and we can then try and bring these drivers back together for RHEL5.5.
Subbu, I also noticed an error in this hunk: tx_work = be_poll_tx(adapter, budget); rx_work = be_poll_rx(adapter, budget); - /* All consumed */ - if (rx_work < budget) { - compat_napi_complete(napi); - be_cq_notify(&adapter->ctrl, rx_cq->id, true, rx_work); - } else { - /* More to be consumed; continue with interrupts disabled */ - be_cq_notify(&adapter->ctrl, rx_cq->id, false, rx_work); - } + if (rx_work < budget) + napi_complete(napi); - if (tx_work) - be_cq_notify(&adapter->ctrl, tx_cq->id, true, tx_work); + drvr_stats(adapter)->be_num_events += num; + if (num) + be_eq_notify(ctrl, eq_obj->q.id, true, true, num); return rx_work; } with the call to 'napi_complete' so I changed it to 'compat_napi_complete' as that should be the correct call there. The new hunk looks like this: tx_work = be_poll_tx(adapter, budget); rx_work = be_poll_rx(adapter, budget); - /* All consumed */ - if (rx_work < budget) { - compat_napi_complete(napi); - be_cq_notify(&adapter->ctrl, rx_cq->id, true, rx_work); - } else { - /* More to be consumed; continue with interrupts disabled */ - be_cq_notify(&adapter->ctrl, rx_cq->id, false, rx_work); - } + if (rx_work < budget) + compat_napi_complete(napi); - if (tx_work) - be_cq_notify(&adapter->ctrl, tx_cq->id, true, tx_work); + drvr_stats(adapter)->be_num_events += num; + if (num) + be_eq_notify(ctrl, eq_obj->q.id, true, true, num); return rx_work; }
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Your chaange is correct Andy. Thanks. Subbu
Just wanted to add that we have done additional teting with INTx interrupts and the change seems to be fine. Thanks. Subbu
in kernel-2.6.18-157.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
Verified in 2.6.18-157. Subbu
~~ Attention Partners - RHEL 5.4 Snapshot 5 Released! ~~ RHEL 5.4 Snapshot 5 is the FINAL snapshot to be release before RC. It has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular issue. Please test and report back your results here, at your earliest convenience. If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. If it is urgent, escalate the issue to your partner manager as soon as possible. There is /very/ little time left to get additional code into 5.4 before GA. Partners, after you have verified, do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html