Description of problem:
Allow irqbalance to balance interrupts in pairs

How reproducible:
N/A

Steps to Reproduce:
N/A

Actual results:
N/A

Expected results:
N/A

Additional info:

FEATURE REQUEST: Allow irqbalance to balance interrupts in pairs by specifying the interrupt name in a config file. Requesting this change for both RHEL4 and RHEL5.

Currently, irqbalance balances interrupts individually across processors; this is visible by watching /proc/interrupts on a RHEL system. However, there are some situations where it would be optimal to process interrupts in pairs on the same processor, such as when two network interfaces are bonded together in a round-robin fashion.

One example involves a RHEL server on an Egenera pBlade. An Egenera pBlade uses a converged fabric to transmit both Ethernet and SCSI data. Communication on a pBlade takes place over two cLAN1k NICs, which operate in active-active mode, essentially doing round-robin communication to the Egenera switched fabric. iPerf network testing with UDP traffic revealed that a certain percentage of datagrams were received out of order. This was because the cLAN1k interrupts were being processed on different processors, and depending on how busy each processor was, the UDP datagrams could be delivered out of order. When irqbalance was disabled and both cLAN1k interrupts were manually assigned to CPU0 using SMP affinity, iPerf showed a significant reduction in out-of-order datagrams (from 200+ to only 1), improving network performance in the iPerf test. Similar results were observable with TCP when running traffic tests and monitoring netstat -s for the number of segments retransmitted.

Unfortunately, the manual assignment of interrupt processing adversely affected production backups: the Symantec NetBackup client ran slowly after the change was made. After reverting the change, backups over the network returned to their usual performance. It is likely that CPU0 became very busy during the backup window and was unable to process all of the cLAN1k interrupts in a timely fashion. This is why having irqbalance handle both cLAN1k interrupts as a pair would be preferable to manual assignment.

This feature could also be useful in other hardware environments, such as a Xen farm where the Dom0 has two bonded NICs balanced in a round-robin fashion.

Below is a snapshot of /proc/interrupts from a RHEL4 Egenera pServer. Look for the clan1k interrupts to get a picture of how irqbalance balances them today.

[root@node3 ~]# cat /proc/interrupts
           CPU0        CPU1        CPU2        CPU3        CPU4        CPU5        CPU6        CPU7
  0:       1156         394          29       28396        1956       19341      304595  1333825978  IO-APIC-edge   timer
  2:          0           0           0           0           0           0           0           0  XT-PIC         cascade
  8:          0           0           0           0           0           0           0           0  IO-APIC-edge   rtc
177:      63437       64367       60996       50898     3836034  3294397316       56173       66242  IO-APIC-level  clan1k
185:      63961       64653       61006       50949    13599520  3557627640       56372       68878  IO-APIC-level  clan1k
NMI:          0           0           0           0           0           0           0           0
LOC: 1334023026  1334023024  1334023017  1334023024  1334021417  1334021176  1334022865  1334022919
ERR:          0
MIS:          0
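To clarify what "manually assigned to CPU0 using SMP affinity" means in practice, this is roughly what was done (a sketch only; the IRQ numbers 177 and 185 come from the snapshot above, and the hex mask 01 selects CPU0):

  # stop irqbalance so it does not rewrite the affinity masks
  service irqbalance stop

  # bind both clan1k interrupts to CPU0 (bitmask 01)
  echo 01 > /proc/irq/177/smp_affinity
  echo 01 > /proc/irq/185/smp_affinity

  # confirm the new masks
  cat /proc/irq/177/smp_affinity /proc/irq/185/smp_affinity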
Do you have any metrics showing that this is a clear advantage? I see how the reasoning above is enticing, but I'm not 100% sure it stands up to scrutiny. I of course don't have any evidence to the contrary, but I'm not sure that I see a whole lot of performance gain to be had with this feature.

First of all, such a feature is not a universal victory. Just because two interfaces are bonded doesn't mean it's advantageous to process their interrupts on the same CPU. Performance gains are realized in such a case only if the same NIC driver is used on both interfaces. If the drivers are different, then the interrupt handlers for each NIC will fight each other for cache space, leading to performance degradation rather than benefit.

In the case where both NICs are the same and use the same driver, the advantage would seem to me to be more one of conservation of cache lines than of true speedup (i.e. only one cache needs to contain the irq handler instructions, rather than two). While it would be nice to reduce cache pressure, I don't see any real performance gain there.

Lastly, this feature implies mapping two very high-volume interrupts to the same CPU. This destroys any possibility of doing receive operations in parallel, effectively limiting your receive bandwidth in the bond to that of one interface, which seems to defeat at least half the purpose of bonding.

If you can provide some data showing a clear performance gain, we can discuss this further, but until then, I'm sorry, this will be a devel nak from me.
Thanks for your reply! I'll look into getting some proper metrics, especially for receiving traffic. I agree the performance gains may be small from a driver perspective, but I expect they would be better from an application perspective, since less CPU would be spent reordering data (in the case of UDP). I also expect performance would become more deterministic, which is more reassuring to users looking at performance data.
I don't think you'll find performance will be deterministic at all, actually, given that after you reach a certain load, your NICs will both wind up dropping data as they start to contend for interrupt servicing, but we can wait for the data to make that determination.

Also, why do you care about message ordering when using UDP? Isn't that the point at which you should consider TCP (or SCTP if you want partial reliability or simple message ordering guarantees)? If you're using that much CPU in an application to do re-ordering, you may want to consider alternate approaches.

Also, are you aware of the IRQBALANCE_BANNED_INTERRUPTS option? It might be a solution for your use case here. It allows you to tell irqbalance to ignore certain interrupts in the course of its work. By specifying your NIC interrupts as banned, you can then manually assign irq affinity for both NIC interrupts to the same CPU and not have irqbalance move them again. It's one of the best practices we follow in the real-time kernel.

Let me know what kind of metrics you get on this. Thank you!
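Something along these lines should do it (a sketch; the variable lives in /etc/sysconfig/irqbalance on RHEL, the IRQ numbers 177 and 185 are taken from the snapshot in the description, and the mask 01 pins both to CPU0):

  # /etc/sysconfig/irqbalance
  IRQBALANCE_BANNED_INTERRUPTS="177 185"

  # restart irqbalance so it picks up the banned list, then pin both
  # clan1k interrupts to the same CPU by hand
  service irqbalance restart
  echo 01 > /proc/irq/177/smp_affinity
  echo 01 > /proc/irq/185/smp_affinity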
My application example for UDP is Oracle RAC Cache Fusion traffic. The idea for this enhancement came while reviewing network performance for a few RAC clusters. I didn't know about the IRQBALANCE_BANNED_INTERRUPTS option. If it weren't for the issue I ran into with production backups (detailed above), it would probably be the solution. I'll get back to you with the metrics.
I haven't been able to make the time to get these metrics. However, Calvin Smith (RedHat GPS), fresh from his performance tuning class, has suggested I look into cpusets as a way to solve my issue. I believe cpusets are supported on both RHEL4 and RHEL5. The concept is that I would create a cpuset containing all CPUs but one to run all processes on the system; the remaining CPU would then be 100% free to run my assigned interrupts (rough sketch below). If that configuration works, I could run some tests to see whether the solution is viable in production.

Even if the above doesn't work out, this bug can be closed. On further review, I realize this solution is really only applicable to a very small set of production servers, and probably not worth the time it would take to justify the feature request.
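For what it's worth, the cpuset idea sketches out roughly like this (assuming the 8-CPU layout from the snapshot above and the plain cpuset filesystem interface; the cpuset name "general" is just an example):

  # mount the cpuset filesystem
  mkdir -p /dev/cpuset
  mount -t cpuset none /dev/cpuset

  # create a cpuset with CPUs 1-7, leaving CPU0 free for the pinned interrupts
  mkdir /dev/cpuset/general
  echo 1-7 > /dev/cpuset/general/cpus
  echo 0   > /dev/cpuset/general/mems

  # move every existing task into the new cpuset
  # (some kernel threads cannot be moved, hence the 2>/dev/null)
  for pid in $(cat /dev/cpuset/tasks); do
      echo $pid > /dev/cpuset/general/tasks 2>/dev/null
  done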