Bug 485544 - Allow irqbalance to balance interrupts in pairs
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: irqbalance
Hardware: x86_64 Linux
Priority: low  Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Neil Horman
QA Contact: Red Hat Kernel QE team
Reported: 2009-02-13 20:42 EST by Neal Pitts
Modified: 2010-10-23 03:43 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-03-11 15:07:22 EDT

Description Neal Pitts 2009-02-13 20:42:10 EST
Description of problem: Allow irqbalance to balance interrupts in pairs

How reproducible: N/A

Steps to Reproduce: N/A

Actual results: N/A

Expected results: N/A

Additional info:

Allow irqbalance to balance interrupts in pairs by specifying the interrupt name in a config file.  Requesting this change for both RHEL4 and RHEL5.

Currently, irqbalance will balance interrupts individually across processors.  This is viewable by watching /proc/interrupts on a RHEL system.  However, there are some situations where it would be optimal to process interrupts in pairs on the same processor, such as when two network interfaces are bonded together in a round-robin fashion.

One example involves a RHEL server on an Egenera pBlade.  An Egenera pBlade uses a converged fabric to transmit both Ethernet and SCSI data.  Communication on a pBlade takes place over two cLAN1k NICs, which operate in active-active mode, essentially doing round-robin communication to the Egenera switched fabric.  iPerf network testing with UDP traffic revealed that a certain percentage of datagrams were received out of order.  This was because the cLAN1k interrupts were being processed on different processors, and depending on how busy each processor was, the UDP datagrams could be processed out of order.  When irqbalance was disabled and both cLAN1k interrupts were manually assigned to CPU0 using SMP affinity, iPerf showed a significant reduction in out-of-order datagrams (from 200+ to only 1).  This improved network performance in the iPerf test.  Similar results were observable with TCP when running traffic tests and monitoring netstat -s for the number of segments retransmitted.
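For reference, the manual pinning described above can be sketched in Python as follows.  This is only an illustration, not the exact procedure used on the pBlade: the helper names are hypothetical, the IRQ numbers 177 and 185 are taken from the /proc/interrupts snapshot later in this description, and the sketch assumes a system with 32 or fewer CPUs so that the mask is a single hex word.  It defaults to a dry run, since writing to /proc/irq/<n>/smp_affinity requires root and irqbalance would need to be stopped (or told to ignore these IRQs) for the setting to stick.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_path(irq):
    """Path of the per-IRQ SMP affinity file under /proc."""
    return f"/proc/irq/{irq}/smp_affinity"


def pin_irqs(irqs, cpus, dry_run=True):
    """Plan (and optionally apply) the same CPU mask for every IRQ given.

    Returns a list of (path, mask) pairs. Nothing is written unless
    dry_run is False.
    """
    mask = affinity_mask(cpus)
    planned = [(affinity_path(irq), mask) for irq in irqs]
    if not dry_run:
        for path, value in planned:
            with open(path, "w") as fh:
                fh.write(value)
    return planned


if __name__ == "__main__":
    # Hypothetical example: both cLAN1k IRQs on CPU0, dry run only.
    for path, value in pin_irqs([177, 185], [0]):
        print(f"would write {value} to {path}")
```

Giving both IRQs the same mask is what keeps the pair together; choosing a CPU other than CPU0 only changes the list passed as the second argument.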

Unfortunately, the manual assignment of the interrupt processing adversely affected production backups.  The Symantec NetBackup client ran slowly after this change was made.  After reverting the change, backups over the network returned to their usual performance.  It is likely that CPU0 became very busy during the backup window and was unable to process all of the cLAN1k interrupts in a timely fashion.  This is why having irqbalance handle both cLAN1k interrupts as a pair would be preferable to manual assignment.

This feature could also be useful in other hardware environments, such as a Xen farm, where the Dom0 has two bonded NICs balanced in a round-robin fashion.

Below is a snapshot of /proc/interrupts from a RHEL4 Egenera pServer.  Look for the clan1k interrupts to get a picture of how irqbalance balances them today.

[root@node3 ~]# cat /proc/interrupts
          CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
 0:       1156        394         29      28396       1956      19341     304595 1333825978    IO-APIC-edge  timer
 2:          0          0          0          0          0          0          0          0          XT-PIC  cascade
 8:          0          0          0          0          0          0          0          0    IO-APIC-edge  rtc
177:      63437      64367      60996      50898    3836034 3294397316      56173      66242   IO-APIC-level  clan1k
185:      63961      64653      61006      50949   13599520 3557627640      56372      68878   IO-APIC-level  clan1k
NMI:          0          0          0          0          0          0          0          0
LOC: 1334023026 1334023024 1334023017 1334023024 1334021417 1334021176 1334022865 1334022919
ERR:          0
MIS:          0
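
To read a snapshot like the one above more easily, a small parser can report which CPU has serviced most of each clan1k interrupt.  The following is a rough Python sketch with hypothetical helper names.  It uses the two clan1k rows from the snapshot as sample input and assumes every per-IRQ row carries one counter per CPU.  The counters are cumulative since boot, so they show where interrupts have landed overall, not how they were distributed at any given moment.

```python
SAMPLE = """\
          CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
177:      63437      64367      60996      50898    3836034 3294397316      56173      66242   IO-APIC-level  clan1k
185:      63961      64653      61006      50949   13599520 3557627640      56372      68878   IO-APIC-level  clan1k
ERR:          0
"""


def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        if len(fields) < ncpu:            # rows such as ERR/MIS hold one total
            continue
        counts = [int(f) for f in fields[:ncpu]]
        table[label.strip()] = (counts, " ".join(fields[ncpu:]))
    return table


def irqs_named(table, name):
    """Labels of all rows whose description mentions the given device name."""
    return sorted(lbl for lbl, (_, desc) in table.items() if name in desc.split())


def busiest_cpu(counts):
    """Index of the CPU with the highest count, and its share of the total."""
    cpu = max(range(len(counts)), key=counts.__getitem__)
    return cpu, counts[cpu] / sum(counts)


if __name__ == "__main__":
    table = parse_interrupts(SAMPLE)
    for irq in irqs_named(table, "clan1k"):
        cpu, share = busiest_cpu(table[irq][0])
        print(f"IRQ {irq}: mostly CPU{cpu} ({share:.1%} of all counts)")
```

On a live system, the contents of /proc/interrupts could be passed in place of SAMPLE.  In the pasted snapshot, both clan1k rows have accumulated the bulk of their counts on CPU5, with CPU4 a distant second.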
Comment 1 Neil Horman 2009-02-16 10:47:00 EST
Do you have any metrics showing that this is a clear advantage?  I see how the reasoning above is enticing, but I'm not 100% sure it stands up to scrutiny.  I of course don't have any evidence to the contrary, but I'm not sure that I see a whole lot of performance gain to be had with this feature. 

  First of all, such a feature is not a universal victory.  Just because two interfaces are bonded doesn't mean it's advantageous to process their interrupts on the same CPU.  Performance gains are realized in such a case only if the same NIC driver is used on both interfaces.  If the drivers are different, then the interrupt handlers for each NIC will fight each other for cache space, leading to performance degradation rather than benefit.

   In the case that both NICs are the same and have the same driver, the advantage would seem to me to be more one of conservation of cache lines rather than true speedup (i.e. only one cache need contain the irq handler instructions, rather than 2).  While it would be nice to reduce cache reliance, I don't see any real performance gain there.

   Lastly, this feature implies the mapping of two very high volume interrupts to the same CPU.  This destroys any possibility of doing receive operations in parallel, effectively limiting your receive bandwidth in the bond to that of one interface, which seems to defeat at least half the purpose of bonding.

If you can provide some data showing a clear performance gain, we can discuss this further, but until then, I'm sorry, this will be a devel NAK from me.
Comment 2 Neal Pitts 2009-02-16 14:44:02 EST
Thanks for your reply!  I'll look into getting some proper metrics, especially for receiving traffic.  I agree the performance gains may be small from a driver perspective, but I expect they would be better from an application perspective, by using less CPU to reorder data (in the case of UDP).  Also, I expect the performance would become more deterministic, which is more reassuring to users looking at performance data.
Comment 3 Neil Horman 2009-02-16 15:08:12 EST
I don't think you'll find performance will be deterministic at all, actually, given that after you reach a certain load, your NICs will both wind up dropping data as they start to contend for interrupt servicing, but we can wait for the data to make that determination.

Also, why do you care about message ordering when using UDP?  Isn't that the point at which you should consider TCP (or SCTP if you want partial reliability or simple message order guarantees)?  If you're using that much CPU in an application to do re-ordering, you may want to consider alternate approaches.

Also, are you aware of the IRQBALANCE_BANNED_INTERRUPTS option?  It might be a solution for your use case here.  It allows you to tell irqbalance to ignore certain interrupts in the course of its work.  By specifying your NIC interrupts as banned, you can then manually assign irq affinity for both NIC interrupts to the same CPU and not have irqbalance move them again.  It's one of the best practices that we follow in the real time kernel.
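
As a rough illustration of that option, the Python sketch below renders the setting for a given set of IRQ numbers.  It rests on two assumptions that should be checked against the installed irqbalance version: that the option takes a space-separated list of IRQ numbers, and that it is set as a shell-style variable (for example in /etc/sysconfig/irqbalance).  The helper name is hypothetical, and 177 and 185 are the clan1k IRQs from the snapshot in the description.

```python
def banned_interrupts_setting(irqs):
    """Render the IRQBALANCE_BANNED_INTERRUPTS line for the given IRQs.

    Assumes a space-separated list of integer IRQ numbers is expected.
    Duplicates are dropped and the numbers are sorted for readability.
    """
    unique = sorted({int(irq) for irq in irqs})
    return 'IRQBALANCE_BANNED_INTERRUPTS="{}"'.format(
        " ".join(str(irq) for irq in unique)
    )


if __name__ == "__main__":
    # Hypothetical example: keep irqbalance away from both clan1k IRQs.
    print(banned_interrupts_setting([185, 177]))
```

With the two IRQs banned, their affinity would then be set by hand through /proc/irq/<n>/smp_affinity, and irqbalance would continue to balance everything else.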

Let me know what kind of metrics you get on this.  Thank you!
Comment 4 Neal Pitts 2009-02-16 15:22:55 EST
My application example for UDP is Oracle RAC Cache Fusion traffic.  The idea for this enhancement came while reviewing network performance for a few RAC clusters.

I didn't know about the IRQBALANCE_BANNED_INTERRUPTS option.  If it weren't for the issue I ran into with production backups (detailed above), it would probably be the solution.

I'll get back to you with the metrics.
Comment 9 Neal Pitts 2009-03-11 15:07:22 EDT
I haven't been able to make the time to get these metrics.  However, Calvin Smith (Red Hat GPS), fresh from his performance tuning class, has suggested I look into cpusets as a way to solve my issue.  I believe cpusets are supported on both RHEL4 and RHEL5.  The idea is that I would create a cpuset containing all CPUs minus one to run all processes on the system; the remaining CPU would then be 100% free to service my assigned interrupts.  If that configuration works, I could run some tests to see if the solution is viable in production.
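
The CPU split described above can be sketched as follows.  This Python snippet only computes the two CPU lists in the range format that a cpuset's cpus file accepts (for example "1-7"); it does not create any cpusets.  The function names are hypothetical, and the 8-CPU count comes from the snapshot in the description.

```python
def to_cpu_list(cpus):
    """Format CPU numbers as a compact list string, e.g. [0,1,2,4] -> '0-2,4'."""
    ranges = []
    start = prev = None
    for cpu in sorted(set(cpus)):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu                     # extend the current run
        else:
            ranges.append((start, prev))   # close the run, begin a new one
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def partition_cpus(ncpus, irq_cpu):
    """Split CPUs into (general-purpose list, interrupt-only list)."""
    if not 0 <= irq_cpu < ncpus:
        raise ValueError("irq_cpu must be one of the system's CPUs")
    general = [cpu for cpu in range(ncpus) if cpu != irq_cpu]
    return to_cpu_list(general), to_cpu_list([irq_cpu])


if __name__ == "__main__":
    general, reserved = partition_cpus(8, 0)
    print(f"process cpuset: {general}; reserved for interrupts: {reserved}")
```

The first list would go into the cpuset that holds all processes, and the reserved CPU is the one whose bit would be set in the smp_affinity mask for both NIC interrupts.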

Even if the above doesn't work out, this bug can be closed.  On further review, I realize this solution is really only viable on a very small set of production servers, and is probably not worth a lot of time spent justifying the feature request.
