Description of problem:

The current e1000 driver does not correctly exit polling mode when NAPI is enabled. This is reported to cause load problems on SMP systems with four or more e1000 NICs installed: because all NIC activity ends up bound to a single CPU, that processor becomes excessively loaded.

Initially, an interrupt is received on one CPU and handled by the e1000 driver. The driver disables the interrupt and adds the NIC to that CPU's polling list. All further tx/rx work for this NIC is now bound to that CPU until the driver returns to interrupt mode. The NAPI budget and weight controls determine how much work may be done on each poll. If the budget is exhausted, the NIC is placed back on the tail of the list and a softirq is raised. If no more packets need to be processed, the NIC should exit poll mode and re-enable its interrupt. If the check for pending work always evaluates true, the NIC stays bound to this one CPU indefinitely, creating a load imbalance on SMP systems.

The check in the current e1000 code uses the following condition:

    /* If no Tx and not enough Rx work done, exit the polling mode */
    if ((!tx_cleaned && (work_done == 0)) ||
        !netif_running(adapter->netdev)) {
    quit_polling:
            netif_rx_complete(poll_dev);
            e1000_irq_enable(adapter);
            return 0;
    }

When we have just finished sending packets and had no receive work to do, tx_cleaned is true and work_done == 0, so we do not exit poll mode. If no further packets arrive, the driver exits poll mode on the next polling pass (tx_cleaned is now false). The problem occurs when packets are received during this window, which keeps the NIC in polled mode.
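For illustration only, here is a small stand-alone C program contrasting the current exit check with one possible relaxation (exit whenever no Rx work was done, regardless of Tx). This is a sketch of the idea, not the attached patch; the helper names are invented for the example and only the boolean logic mirrors the driver.

    /*
     * Toy user-space illustration (not driver code) of why the exit check
     * in e1000_clean() can keep the NIC in polled mode.
     */
    #include <stdbool.h>
    #include <stdio.h>

    /* Original check: only leave polling if no Tx work AND no Rx work was done. */
    static bool exits_polling_current(bool tx_cleaned, int work_done)
    {
            return !tx_cleaned && (work_done == 0);
    }

    /* One possible relaxation (an assumption, not necessarily the attached
     * patch): leave polling whenever no Rx work was done, even if some Tx
     * descriptors were cleaned on this pass. */
    static bool exits_polling_relaxed(bool tx_cleaned, int work_done)
    {
            (void)tx_cleaned;
            return work_done == 0;
    }

    int main(void)
    {
            /* The troublesome case from the description: Tx just completed,
             * no Rx packets processed on this poll. */
            bool tx_cleaned = true;
            int work_done = 0;

            printf("current check exits polling: %s\n",
                   exits_polling_current(tx_cleaned, work_done) ? "yes" : "no");
            printf("relaxed check exits polling: %s\n",
                   exits_polling_relaxed(tx_cleaned, work_done) ? "yes" : "no");
            return 0;
    }

Run standalone, the current check prints "no" for the tx_cleaned/work_done == 0 case while the relaxed check prints "yes", which is exactly the window described above.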
Created attachment 138772 [details] proposed patch for e1000
Isn't this exactly how polling is supposed to work? If the scheduler (or irq balancer) is broken and the system keeps migrating processes to a processor that is completely consumed with softirq (polling) load, why is it e1000 that is broken? Aside from that concern, I don't have much of a problem with the patch, other than that it might increase overall CPU utilization. If anything, I would consider this a workaround for a broken environment.
Created attachment 139163 [details] jwltest-e1000-napi-poll-exit.patch
Test kernels w/ the above patch are available here:

  http://people.redhat.com/linville/kernels/rhel4/

Please give them a try and post the results here...thanks!
Hi John,

No problem - I'll talk to the TAM and try to figure out what's up here. IBM reported a similar problem when I first asked them to test with your kernels, so I'm guessing it's unrelated to the extra patch now added. I'm officially on vacation at the minute, but I'll be able to find some time for this in the next day or two.

Thanks,
I'm very concerned about the performance implications of this patch. I'm attempting to put together numbers, but IBM going directly to you isn't helping me.
Here is some data. I'm sending 64-byte packets to the discard port using pktgen from a remote machine; this is a pretty good indicator of how many packets per second can be received. "newnapi" is our latest driver with only the patch in this bug applied. 3 runs, with the avg computed, on two different systems with kernel.org kernels.

2.6.17, i686 7520 Intel system:
  baseline: 654200 / 652601 / 654349, avg 653716
  newnapi:  648985 / 646299 / 650799, avg 648694

2.6.18, pSeries 630:
  baseline: 237926 / 235999 / 238146, avg 237357
  newnapi:  229909 / 226574 / 229849, avg 228777

I can acquire more data, but from this we see that there is a definite decrease in performance for at least this test. It is not very significant, however.
Ugh - upgraded the 7520 to 2.6.18, and now the same two drivers show:

  baseline: 656616
  newnapi:  679900

So maybe the new code is okay. Can't win on consistency on a Friday night, I guess.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Major release. This request is not yet committed for inclusion.
This is looking like an interrupt delivery problem. The dmesg output from the two kernels shows a difference in how they scan the PCI bus:

kernel-2.6.9-42.EL:
  Unable to get OpenPIC IRQ for cascade
  NET: Registered protocol family 16
  PCI: Probing PCI hardware
  PCI: Address space collision on region 0 of device 0001:00:02.0 [400bfff8000:400bfff7fff]
  PCI: Address space collision on region 0 of device 0001:00:02.2 [400bfff0000:400bffeffff]
  PCI: Address space collision on region 0 of device 0001:00:02.6
  ...
  IOMMU table initialized, virtual merging enabled
  PCI: Probing PCI hardware done
  usbcore: registered new driver usbfs

kernel-2.6.9-42.22.EL.jwltest.175:
  Unable to get OpenPIC IRQ for cascade
  NET: Registered protocol family 16
  PCI: Probing PCI hardware
  IOMMU table initialized, virtual merging enabled
  PCI: Probing PCI hardware done

The resource conflict messages have disappeared in the later kernel, but the ethtool stats suggest something is wrong with interrupt delivery. I'm waiting to get some more info to confirm this. There was a similar problem discussed on lkml recently relating to an IOAPIC patch in 2.6.18-mm1; I'll look to see if anything similar went into the later RHEL kernel.
Created attachment 145902 [details] Revised patch to apply to later RHEL kernels
acked-by: Jesse Brandeburg <jesse.brandeburg>
Marking severity on RH side to High to match the priority on the IBM side.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
Please see bug #229603, as it appears this change may be the instigator of a new bug on T60/X60 laptops. I don't know how to link these bugs - can someone do it? At this point there is a strong correlation that reverting this patch makes the problem go away, but debugging continues.
This patch has been reverted upstream as it triggered regressions on some widely deployed e1000 revisions. Given that this code is no longer in mainline and has not been re-proposed upstream, closing this bug WONTFIX.