Description of problem:
If dev->quota in net_rx_action becomes zero or less, and dev->poll() makes a call to netif_rx_complete, we will try to do another list_del back in net_rx_action, leading to an oops on the poisoned pointers in dev->poll_list.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
oops

Expected results:
no oops

Additional info:
Created attachment 327468: netpoll-fix.patch. This, on top of what's currently in RHEL4, seems like a reasonable fix.
Created attachment 327476: new version of same patch. Quick update to Andy's patch. I forgot that my original version included a chunk to do the __LINK_STATE_NETPOLL check in __netif_rx_complete as well as netif_rx_complete. This prevents us from needing to fix up dozens of drivers. We want to keep that one chunk. Everything else is unaltered.
Looks good, Neil.
Thanks gospo, posted for review.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 78.23.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Neil, any feeling for how reproducible this is or steps to trigger it?
This problem would happen every time you processed exactly the number of frames that were in your quota. Vivek seemed to be able to reproduce it pretty easily in RHTS after we added some new code that protected concurrent calls to poll_napi. I think he was doing a connectathon test with netconsole enabled.
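For anyone trying to picture the failure, here is a rough user-space model of the double list_del, assuming a much-simplified flow. It is not the RHEL4 code and not the attached patch: every model_* name, run_scenario, list_del_checked, the `scheduled` flag and the `guard` switch are made up for illustration, and the poison constants are only markers. The checked delete reports the second list_del instead of dereferencing poison, which is where the real kernel would oops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Marker values modeled on the kernel's list poisoning; never dereferenced here. */
#define POISON1 ((struct list_head *)0x00100100)
#define POISON2 ((struct list_head *)0x00200200)

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    assert(n != NULL && head != NULL);
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Returns false instead of touching poisoned pointers (the kernel would oops). */
static bool list_del_checked(struct list_head *e)
{
    if (e->next == POISON1 || e->prev == POISON2)
        return false;
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = POISON1;
    e->prev = POISON2;
    return true;
}

/* Hypothetical stand-in for a net device; `scheduled` models "still on the poll list". */
struct model_dev {
    struct list_head poll_list;
    int quota;
    bool scheduled;
};

/* Model of netif_rx_complete: take the device off the poll list. */
static void model_rx_complete(struct model_dev *dev)
{
    (void)list_del_checked(&dev->poll_list);
    dev->scheduled = false;
}

/* Model of dev->poll(): handle up to quota frames, complete if the backlog is drained.
 * Returns 1 if frames remain, 0 otherwise. */
static int model_poll(struct model_dev *dev, int *pending)
{
    int done = *pending < dev->quota ? *pending : dev->quota;

    *pending -= done;
    dev->quota -= done;
    if (*pending == 0) {
        model_rx_complete(dev);
        return 0;
    }
    return 1;
}

/* Model of the requeue step. Returns false where a second list_del would hit poison.
 * `guard` is an illustrative check only, not the logic of the attached patch. */
static bool model_rx_action(struct model_dev *dev, struct list_head *queue,
                            int pending, bool guard)
{
    int more = model_poll(dev, &pending);

    if (dev->quota <= 0 || more) {
        if (guard && !dev->scheduled)
            return true;            /* already off the list, nothing to requeue */
        if (!list_del_checked(&dev->poll_list))
            return false;           /* second list_del on poisoned pointers */
        list_add_tail(&dev->poll_list, queue);
    }
    return true;
}

/* Run one scenario: a device with `quota`, a backlog of `pending` frames. */
static bool run_scenario(int quota, int pending, bool guard)
{
    struct list_head queue;
    struct model_dev dev = { .quota = quota, .scheduled = true };

    list_init(&queue);
    list_add_tail(&dev.poll_list, &queue);
    return model_rx_action(&dev, &queue, pending, guard);
}
```

In this model only the "backlog equals quota" case trips the second delete: run_scenario(4, 4, false) reports the problem, while a smaller or larger backlog does not, which matches the "exactly the number of frames in your quota" condition above.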
Yes, that's right. Andy hit the nail on the head. Not sure what it is about connectathon, but that did seem to have a 100% reproduction rate.
Thanks Andy & Neil - looks like connectathon tickles this particularly well then (we've had a test kernel running for a few months with all the earlier netpoll/bond patches included but have yet to see an occurrence of this oops).
Would it be possible to get a sample stack trace of this failure? It might be easier to match up to see if customers are encountering this problem. Thanks, Rick
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html