Bug 211270

Summary: e1000 does not exit poll mode correctly
Product: Red Hat Enterprise Linux 4 Reporter: Bryn M. Reeves <bmr>
Component: kernel Assignee: Bryn M. Reeves <bmr>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: high    
Version: 4.4 CC: jesse.brandeburg, linville
Target Milestone: --- Keywords: OtherQA
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-02-06 18:48:39 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                                    Flags
proposed patch for e1000                       none
jwltest-e1000-napi-poll-exit.patch             none
Revised patch to apply to later RHEL kernels   none

Description Bryn M. Reeves 2006-10-18 10:56:34 UTC
Description of problem:
The current e1000 driver does not correctly exit polling mode when NAPI is
enabled. This is reported to cause load problems on SMP systems with four or
more e1000 NICs installed. Because all NIC activity is being bound to a single
CPU, that processor becomes excessively loaded.

Initially, an interrupt is received on one CPU and is handled by the e1000
driver. The driver disables the interrupt and adds the NIC to that CPU's polling
list. Future tx/rx on this NIC are now bound to this CPU until we return to
interrupt mode.

The NAPI budget & weight controls determine how much work can be done for each
poll. If the budget is exhausted, the NIC is placed back on the tail of the list
and a softirq is generated. If no more packets need to be processed, the NIC
should exit poll mode and re-enable the NIC interrupt.

If the check for packets to be processed always returns true, the NIC becomes
bound to this one CPU indefinitely, leading to a load imbalance for this CPU in
SMP systems.
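
As a rough illustration of the budget rule described above, the following is a simplified model, not driver code. The names poll_once, POLL_REQUEUE and POLL_EXIT are hypothetical, and the model assumes a pass is requeued exactly when it uses its whole budget.

```c
#include <assert.h>

/* Outcome of one poll pass in this simplified model. */
enum poll_result {
    POLL_REQUEUE,   /* budget used up: back on the tail of the poll list */
    POLL_EXIT       /* nothing left: leave poll mode, re-enable the IRQ */
};

/*
 * Hypothetical model of a single poll pass. `pending` is the number of
 * packets waiting on the NIC; at most `budget` of them are processed.
 */
enum poll_result poll_once(int *pending, int budget, int *work_done)
{
    assert(pending && work_done && *pending >= 0 && budget > 0);

    *work_done = (*pending < budget) ? *pending : budget;
    *pending -= *work_done;

    /* Assumption: a pass that consumed its full budget is requeued. */
    return (*work_done == budget) ? POLL_REQUEUE : POLL_EXIT;
}
```

With 10 packets pending and a budget of 4, two passes are requeued and the third (which only finds 2 packets) exits poll mode.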

The check in the current e1000 code uses the following condition:

        /* If no Tx and not enough Rx work done, exit the polling mode */
        if ((!tx_cleaned && (work_done == 0)) ||
           !netif_running(adapter->netdev)) {
quit_polling:
                netif_rx_complete(poll_dev);
                e1000_irq_enable(adapter);
                return 0;
        }

In the case that we have just finished sending packets and had no receive work
to do, tx_cleaned will be true and work_done == 0 so we do not exit poll mode.

If no further packets are received, the driver exits poll mode on the next
polling loop (tx_cleaned is now false). The problem occurs when packets are
received in this period, causing the NIC to remain in polled mode.

Comment 1 Bryn M. Reeves 2006-10-18 10:56:35 UTC
Created attachment 138772 [details]
proposed patch for e1000

Comment 3 Jesse Brandeburg 2006-10-18 23:41:01 UTC
Isn't this exactly how polling is supposed to work?  If the scheduler (or irq
balancer) is broken and the system continues to migrate processes to this
processor that is completely consumed with softirq (polling) load, then why is
it e1000 that is broken?

Aside from that concern, I don't have too much of a problem with the patch,
except that it might increase overall CPU utilization.

If anything I would consider this a workaround for a broken environment.


Comment 5 John W. Linville 2006-10-23 20:47:26 UTC
Created attachment 139163 [details]
jwltest-e1000-napi-poll-exit.patch

Comment 6 John W. Linville 2006-10-24 11:41:55 UTC
Test kernels w/ above patch are available here:

   http://people.redhat.com/linville/kernels/rhel4/

Please give them a try and post the results here...thanks!

Comment 11 Bryn M. Reeves 2006-10-25 17:50:36 UTC
Hi John,

No problem - I'll talk to the TAM & try to figure out what's up here. IBM
reported a similar problem when I first asked them to test with your kernels, so
I'm guessing it's unrelated to the extra patch now added.

I'm officially on vacation at the minute but I'll be able to find some time for
this in the next day or two.

Thanks,


Comment 12 Jesse Brandeburg 2006-10-26 05:38:33 UTC
I'm very concerned about the performance implications of this patch.  I'm
attempting to work on numbers, but IBM going directly to you isn't helping me.

Comment 13 Jesse Brandeburg 2006-10-27 23:21:00 UTC
here is some data: 
I'm sending 64 byte packets to the discard port using pktgen from a remote
machine.  This is a pretty good indicator of how many packets per second can be
received.

newnapi is our latest driver with only the patch in this bug applied, 3 runs,
with avg computed on two different systems with kernel.org kernels.

2.6.17 i686 7520 intel system.
baseline: 654200/652601/654349: avg 653716
newnapi: 648985/646299/650799: avg 648694

2.6.18 pSeries 630
baseline: 237926/235999/238146: avg 237357
newnapi: 229909/226574/229849: avg 228777

I can acquire more data, but from this we see that there is a definite decrease
in performance for at least this test.  It is not very significant however.
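
For anyone wanting to re-check the arithmetic on the figures above, a minimal sketch follows. The helper names mean_pps and percent_change are hypothetical, and the integer division is an assumption chosen because it reproduces the truncated averages quoted in this comment.

```c
#include <assert.h>

/* Integer mean of n packets-per-second samples (fraction is truncated). */
long mean_pps(const long *runs, int n)
{
    assert(runs != 0 && n > 0);

    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += runs[i];
    return sum / n;
}

/* Relative change from baseline to patched, in percent (negative = slower). */
double percent_change(long baseline, long patched)
{
    assert(baseline != 0);
    return 100.0 * (double)(patched - baseline) / (double)baseline;
}
```

Applied to the averages above, percent_change gives roughly -0.8% on the 7520 system and roughly -3.6% on the pSeries 630.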


Comment 14 Jesse Brandeburg 2006-10-28 01:18:49 UTC
ugh, upgraded 7520 to 2.6.18 and now same two drivers show

baseline: 656616
newnapi:  679900

so maybe the new code is okay.  Can't win consistency on a friday night I guess.

Comment 19 RHEL Program Management 2006-11-08 20:00:41 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 22 Bryn M. Reeves 2006-11-09 18:05:14 UTC
This is looking like an interrupt delivery problem. The dmesg from the two
kernels shows a difference in how they are scanning the PCI bus:

kernel-2.6.9-42.EL:
Unable to get OpenPIC IRQ for cascade
NET: Registered protocol family 16
PCI: Probing PCI hardware
PCI: Address space collision on region 0 of device 0001:00:02.0
[400bfff8000:400bfff7fff]
PCI: Address space collision on region 0 of device 0001:00:02.2
[400bfff0000:400bffeffff]
PCI: Address space collision on region 0 of device 0001:00:02.6
...
IOMMU table initialized, virtual merging enabled
PCI: Probing PCI hardware done
usbcore: registered new driver usbfs

kernel-2.6.9-42.22.EL.jwltest.175:
Unable to get OpenPIC IRQ for cascade
NET: Registered protocol family 16
PCI: Probing PCI hardware
IOMMU table initialized, virtual merging enabled
PCI: Probing PCI hardware done

The resource conflict messages have disappeared in the later kernel, but the
ethtool stats suggest something is wrong with interrupt delivery. I'm waiting to
get some more info to confirm this.

There was a similar problem discussed on lkml recently relating to an IOAPIC
patch in 2.6.18-mm1 - I'll look to see if anything similar went into the later
RHEL kernel.


Comment 24 Bryn M. Reeves 2007-01-18 11:05:31 UTC
Created attachment 145902 [details]
Revised patch to apply to later RHEL kernels

Comment 25 Jesse Brandeburg 2007-01-18 17:45:32 UTC
acked-by: Jesse Brandeburg <jesse.brandeburg>

Comment 26 Janice Girouard - IBM on-site partner 2007-02-08 19:36:09 UTC
Marking severity on RH side to High to match the priority on the IBM side.

Comment 29 RHEL Program Management 2007-04-18 22:50:31 UTC
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 30 Jesse Brandeburg 2007-04-18 23:01:35 UTC
please see bug #229603, as it appears this change may be the instigator of a new
bug on T60/X60 laptops.  I don't know how to link these bugs, can someone do it?

At this point there is strong correlation that reverting this patch causes the
problem to go away, but debugging continues.

Comment 33 Bryn M. Reeves 2008-02-06 18:48:39 UTC
This patch has been reverted upstream as it triggered regressions on some widely
deployed e1000 revisions. Given that this code is no longer in mainline and has
not been re-proposed upstream, closing this bug WONTFIX.