Bug 211270
| Field | Value |
|---|---|
| Summary | e1000 does not exit poll mode correctly |
| Product | Red Hat Enterprise Linux 4 |
| Reporter | Bryn M. Reeves <bmr> |
| Component | kernel |
| Assignee | Bryn M. Reeves <bmr> |
| Status | CLOSED WONTFIX |
| QA Contact | Brian Brock <bbrock> |
| Severity | high |
| Priority | high |
| Version | 4.4 |
| CC | jesse.brandeburg, linville |
| Keywords | OtherQA |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Last Closed | 2008-02-06 18:48:39 UTC |

Description
Bryn M. Reeves
2006-10-18 10:56:34 UTC
Created attachment 138772 [details]
proposed patch for e1000

Isn't this exactly how polling is supposed to work? If the scheduler (or irq balancer) is broken and the system continues to migrate processes to this processor that is completely consumed with softirq (polling) load, then why is it e1000 that is broken?

Aside from that concern, I don't have too much problem with the patch besides that it might increase overall cpu utilization. If anything I would consider this a workaround for a broken environment.

Created attachment 139163 [details]
jwltest-e1000-napi-poll-exit.patch

Test kernels w/ above patch are available here:

http://people.redhat.com/linville/kernels/rhel4/

Please give them a try and post the results here...thanks!

Hi John,

No problem - I'll talk to the TAM & try to figure out what's up here. IBM reported a similar problem when I first asked them to test with your kernels, so I'm guessing it's unrelated to the extra patch now added. I'm officially on vacation at the minute but I'll be able to find some time for this in the next day or two.

Thanks,

I'm very concerned about the performance implications of this patch. I'm attempting to work on numbers, but IBM going directly to you isn't helping me.

Here is some data: I'm sending 64-byte packets to the discard port using pktgen from a remote machine. This is a pretty good indicator of how many packets per second can be received. newnapi is our latest driver with only the patch in this bug applied. Each result is 3 runs with the average computed, on two different systems with kernel.org kernels.

2.6.17 i686 7520 Intel system
baseline: 654200/652601/654349: avg 653716
newnapi: 648985/646299/650799: avg 648694

2.6.18 pSeries 630
baseline: 237926/235999/238146: avg 237357
newnapi: 229909/226574/229849: avg 228777

I can acquire more data, but from this we see that there is a definite decrease in performance for at least this test. It is not very significant, however.

Ugh, upgraded the 7520 to 2.6.18 and now the same two drivers show

baseline: 656616
newnapi: 679900

so maybe the new code is okay. Can't win consistency on a Friday night I guess.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Major release. This request is not yet committed for inclusion.

This is looking like an interrupt delivery problem. The dmesg from the two kernels shows a difference in how they are scanning the PCI bus:

kernel-2.6.9-42.EL:

    Unable to get OpenPIC IRQ for cascade
    NET: Registered protocol family 16
    PCI: Probing PCI hardware
    PCI: Address space collision on region 0 of device 0001:00:02.0 [400bfff8000:400bfff7fff]
    PCI: Address space collision on region 0 of device 0001:00:02.2 [400bfff0000:400bffeffff]
    PCI: Address space collision on region 0 of device 0001:00:02.6 ...
    IOMMU table initialized, virtual merging enabled
    PCI: Probing PCI hardware done
    usbcore: registered new driver usbfs

kernel-2.6.9-42.22.EL.jwltest.175:

    Unable to get OpenPIC IRQ for cascade
    NET: Registered protocol family 16
    PCI: Probing PCI hardware
    IOMMU table initialized, virtual merging enabled
    PCI: Probing PCI hardware done

The resource conflict messages have disappeared in the later kernel, but the ethtool stats suggest something is wrong with interrupt delivery. I'm waiting to get some more info to confirm this.

There was a similar problem discussed on lkml recently relating to an IOAPIC patch in 2.6.18-mm1 - I'll look to see if anything similar went into the later RHEL kernel.

Created attachment 145902 [details]
Revised patch to apply to later RHEL kernels
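
As a cross-check of the pktgen figures quoted earlier in this report, the following is a minimal sketch that recomputes the per-driver averages and the relative change between baseline and newnapi. The readings are copied from the comment above; the names `RUNS`, `summarize` and `percent_change` are hypothetical and only illustrate the arithmetic. Averages are truncated to whole packets per second, which is an assumption made so that they line up with the integer averages as reported.

```python
# Packets-per-second readings copied from the pktgen comparison in this bug.
RUNS = {
    "2.6.17 i686 7520": {
        "baseline": [654200, 652601, 654349],
        "newnapi": [648985, 646299, 650799],
    },
    "2.6.18 pSeries 630": {
        "baseline": [237926, 235999, 238146],
        "newnapi": [229909, 226574, 229849],
    },
}


def percent_change(baseline, patched):
    """Relative change of patched versus baseline, in percent."""
    return 100.0 * (patched - baseline) / baseline


def summarize(runs):
    """Return {system: (baseline_avg, newnapi_avg, percent_change)}.

    Averages use integer division (truncation), an assumption chosen to
    match the whole-number averages quoted in the report.
    """
    summary = {}
    for system, drivers in runs.items():
        base = sum(drivers["baseline"]) // len(drivers["baseline"])
        new = sum(drivers["newnapi"]) // len(drivers["newnapi"])
        summary[system] = (base, new, percent_change(base, new))
    return summary


if __name__ == "__main__":
    for system, (base, new, pct) in summarize(RUNS).items():
        print(f"{system}: baseline {base}, newnapi {new}, change {pct:+.2f}%")
    # Single-run figures from the later 2.6.18 retest on the 7520 system.
    print(f"2.6.18 7520 retest: change {percent_change(656616, 679900):+.2f}%")
```

On these inputs the three-run comparisons both show newnapi below baseline, while the single-run 2.6.18 retest on the 7520 shows it above, which is consistent with the remark that the results were not stable between runs.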

acked-by: Jesse Brandeburg <jesse.brandeburg>

Marking severity on RH side to High to match the priority on the IBM side.

This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.

Please see bug #229603, as it appears this change may be the instigator of a new bug on T60/X60 laptops. I don't know how to link these bugs, can someone do it? At this point there is a strong correlation: reverting this patch makes the problem go away, but debugging continues.

This patch has been reverted upstream as it triggered regressions on some widely deployed e1000 revisions. Given that this code is no longer in mainline and has not been re-proposed upstream, closing this bug WONTFIX.