Bug 2052947

Summary: better handling of NIC irqs
Product: Red Hat Enterprise Linux 8
Version: 8.5
Component: irqbalance
Reporter: Paolo Abeni <pabeni>
Assignee: ltao
QA Contact: Jiri Dluhos <jdluhos>
CC: danw, jbainbri, jeder, jmario, jshortt, ruyang, rvr
Status: NEW
Severity: high
Priority: medium
Keywords: Triaged
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description Paolo Abeni 2022-02-10 10:22:59 UTC
Due to the NAPI infrastructure, accounting for the interrupts generated by network drivers is problematic: under low load the interrupt count is proportional to network traffic, but as the network load increases, more and more of the interrupts generated by the NICs are mitigated in software. Under very high network load it is common that no interrupts are generated at all.

As a consequence, actions taken by irqbalance based on NIC interrupt counts are more often than not incorrect, especially in the most relevant scenario: when the network load is high.

The proposed solution is to limit irqbalance to spreading the NIC IRQs across the available CPUs once, and then leaving those IRQs alone.
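For illustration, a minimal sketch of that one-shot spread, assuming the NIC's IRQ numbers and the list of local CPUs are already known (both lists below are made-up examples); a real deployment would also have to tell irqbalance to leave those IRQs alone afterwards, e.g. via its --banirq option:

# One-shot spread: pin each NIC IRQ to one local CPU, round-robin, then stop managing it.
nic_irqs = [65, 66, 67, 68]    # example values, e.g. read from /proc/interrupts
local_cpus = [0, 1, 2, 3]      # example values: CPUs in the NIC-local NUMA node

for i, irq in enumerate(nic_irqs):
    cpu = local_cpus[i % len(local_cpus)]
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(str(cpu))      # needs root; the kernel rejects offline/invalid CPUs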

Comment 1 Paolo Abeni 2022-02-10 10:30:33 UTC
Added Dan to the CC list for OCP-team awareness.

Comment 5 Jamie Bainbridge 2022-08-29 06:53:51 UTC
Paolo and I were discussing this via email.

As I understand it, the optimal configuration for IRQs is to have the number of NIC channels equal to the number of real cores (not HyperThreads) in the NUMA node local to the NIC, and to not handle multiple IRQs for the same device on the same CPU.

Crossing a NUMA node is definitely not good; performance can be as bad as half wirespeed. irqbalance has handled NUMA-local placement properly since irqbalance-1.0.4-10.el6, so that's good:

 Why is irqbalance not balancing interrupts?
 https://access.redhat.com/solutions/677073
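For reference, which node is local to a given (physical) NIC can be read straight from sysfs; a small sketch, with the device name as a placeholder:

# Which NUMA node is the NIC attached to, and which CPUs belong to that node?
dev = "eth0"   # placeholder NIC name
with open(f"/sys/class/net/{dev}/device/numa_node") as f:
    node = int(f.read())
if node < 0:
    print(f"{dev} reports no NUMA locality")           # common for virtual NICs
else:
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        print(f"{dev} is local to node {node}, CPUs {f.read().strip()}")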

We often get customers to ban thread siblings from irqbalance:

 How to calculate hexadecimal bit mask value for "IRQBALANCE_BANNED_CPUS" parameter
 https://access.redhat.com/solutions/3152271
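The value is just a hex bitmask with one bit per banned CPU number; a minimal sketch of the calculation the article describes, with an example banned list (systems with more than 32 CPUs write the mask as comma-separated 32-bit words, which this sketch glosses over):

# IRQBALANCE_BANNED_CPUS is a hex bitmask of CPUs irqbalance must not use.
banned_cpus = [4, 5, 6, 7]     # example: the HT siblings of cores 0-3
mask = 0
for cpu in banned_cpus:
    mask |= 1 << cpu
print(f"IRQBALANCE_BANNED_CPUS={mask:x}")   # -> f0 for CPUs 4-7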

We often get customers to change the number of IRQ channels to the number of real cores:

 How should I configure network interface IRQ channels?
 https://access.redhat.com/solutions/4367191
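A sketch of making that change from a script, shelling out to ethtool ("ethtool -L" is the short form of "--set-channels"; the device name and core count are placeholders):

import subprocess

dev = "eth0"      # placeholder NIC name
real_cores = 4    # placeholder: real cores in the NIC-local NUMA node

# Equivalent to: ethtool -L eth0 combined 4
# (some drivers expose separate rx/tx queue counts instead of "combined")
subprocess.run(["ethtool", "-L", dev, "combined", str(real_cores)], check=True)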

Some thoughts around irqbalance improvements here:

1) My understanding is that there is no advantage to spreading IRQ channels from the same device across HyperThreads, presumably because disabling IRQs is local to the physical core (two HyperThreads), not to the logical core (one HyperThread). I forget where I read this, but it was at least 7 years ago. It would be good to confirm whether this is still the situation on modern CPUs, or whether some behave differently. This might be particularly complex with AMD's "configurable NUMA" features within a single socket on some models, which change the specifics from model to model.

2) If the above holds true, irqbalance could learn the sibling topology from paths like "/sys/devices/system/cpu/cpuX/topology/{core,thread}_siblings_list" and stop considering a sibling as valid for balancing when the first thread of that core is already handling an interrupt, effectively banning HT from irqbalance automatically (a rough sketch of this detection is at the end of this comment). For the additional complexity that depends on CPU brand/family, there are ways to detect those too. Defining the right/wrong way to handle IRQs in code like irqbalance is much better than re-applying internet knowledge I heard in the middle of last decade, which might have changed with new CPU families.

3) irqbalance should not "double up" IRQs for the same device on CPU cores, but it is common for NIC drivers to create "nr_cpus" IRQ channels. On a NUMA system with HT, this results in many more IRQs than really should be created. irqbalance could detect the number of useful cores in a NUMA node and (where possible) issue commands to change the number of channels on a device to the optimal number (see the sketch at the end of this comment). This would have to be "min(real CPUs in the NUMA node, NIC max IRQ channels)", because some devices place a hard limit on IRQ channels, like vmxnet3's maximum of 8 channels. For networking, the ethtool channels interface (the ETHTOOL_SCHANNELS ioctl and its netlink successor, used by "ethtool --set-channels") is the standard.

3a) Some storage HBA drivers also allow SMP affinity to be disabled so that their interrupts can be managed, however this appears to be mostly controlled by module options (megaraid_sas has smp_affinity_enable, qla2xxx has ql2xuctrlirq). Unsure if there is a generic interface for these like netlink. Maybe we need the storage maintainers to develop a generic interface like networking's netlink (or just use netlink).

4) Items 2 and 3 - automatically changing the number of cores/channels to be considered - move irqbalance from an "interrupt balancer" to an "interrupt manager". This is a welcome change: it would proactively solve a number of performance problems that customers contact us about, and it is in line with other auto-performance-tuning tools that Red Hat supplies, like numad. However, this might be outside the scope of what upstream irqbalance wants to do. In that case, it would be necessary to fork irqbalance or develop a new solution.
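For concreteness, a rough sketch of the detection items 2 and 3 describe - not an actual implementation - assuming the standard sysfs topology files and naive parsing of "ethtool -l" output; the device name is a placeholder:

import re, subprocess

dev = "eth0"   # placeholder NIC name

# NUMA node the NIC is attached to, and that node's CPU list.
with open(f"/sys/class/net/{dev}/device/numa_node") as f:
    node = max(int(f.read()), 0)        # treat -1 (no locality) as node 0 here
node_cpus = set()
with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
    for part in f.read().strip().split(","):          # e.g. "0-7,16-23"
        lo, _, hi = part.partition("-")
        node_cpus.update(range(int(lo), int(hi or lo) + 1))

# Count physical cores: HT siblings share the same thread_siblings_list content.
cores = set()
for cpu in node_cpus:
    with open(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list") as f:
        cores.add(f.read().strip())
real_cores = len(cores)

# Device hard limit on combined channels, from the "Pre-set maximums" of ethtool -l.
out = subprocess.run(["ethtool", "-l", dev], capture_output=True, text=True).stdout
m = re.search(r"Combined:\s*(\d+)", out)    # first "Combined:" line is the pre-set maximum
nic_max = int(m.group(1)) if m else real_cores

print(f"{dev}: node {node}, {real_cores} real cores, NIC max {nic_max} "
      f"-> suggest {min(real_cores, nic_max)} channels")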

Comment 6 Jamie Bainbridge 2022-10-28 01:54:05 UTC
Marc and I are discussing updating the old performance tuning whitepaper Jon and I wrote, and moving it into the product documentation, which could address this through manual customer action:

 Red Hat Enterprise Linux Network Performance Tuning Guide
 https://access.redhat.com/articles/1391433

 Create a RHEL network performance tuning guide
 https://issues.redhat.com/browse/RHELPLAN-137653

Comment 7 ltao 2023-02-28 08:39:02 UTC
Hi Jamie,

Currently we are discussing the rebase planning of irqbalance for the upcoming RHEL 9.3 and RHEL 8.9. I see there are no code updates for this bug, only the performance tuning guide you mentioned in comment 6. I don't know whether the documentation is considered to solve the issue; should I close the bug?

Thanks,
Tao Liu

Comment 8 Jamie Bainbridge 2023-02-28 21:35:35 UTC
(In reply to ltao from comment #7)
> should I close the bug?

No, this is a bug for Paolo (and other network developers) to consider long-term improvement of irqbalance.

Please leave this bug as it is.

Comment 9 ltao 2023-03-01 01:09:32 UTC
(In reply to Jamie Bainbridge from comment #8)
> (In reply to ltao from comment #7)
> > should I close the bug?
> 
> No, this is a bug for Paolo (and other network developers) to consider
> long-term improvement of irqbalance.
> 
> Please leave this bug as it is.

OK, Thanks!

Comment 10 Jamie Bainbridge 2023-04-20 09:27:36 UTC
The kernel has since grown an attempt at spreading MSI/MSI-X vectors across CPUs at creation time, in v4.8 (July 2016), starting with:

 genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors
 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5e385a6ef31f

The state of that can be seen in the git log for kernel/irq/affinity.c.

At least in RHEL 9, we can see this kernel-side spreading happen with irqbalance disabled:

rhel9 ~]# systemctl status irqbalance
○ irqbalance.service - irqbalance daemon
     Loaded: loaded (/usr/lib/systemd/system/irqbalance.service; disabled; vendor preset: enabled)
     Active: inactive (dead)
       Docs: man:irqbalance(1)
             https://github.com/Irqbalance/irqbalance

rhel9 ~]# grep virtio7 /proc/interrupts 
 64:          0          0          0          0   PCI-MSI 4718592-edge      virtio7-config
 65:      26744          0          1          0   PCI-MSI 4718593-edge      virtio7-input.0
 66:      23243          0          0          1   PCI-MSI 4718594-edge      virtio7-output.0
 67:          1     225172          0          0   PCI-MSI 4718595-edge      virtio7-input.1
 68:          0     132020          0          0   PCI-MSI 4718596-edge      virtio7-output.1
 69:          0          0     119062          0   PCI-MSI 4718597-edge      virtio7-input.2
 70:          0          0      82214          1   PCI-MSI 4718598-edge      virtio7-output.2
 71:          1          0          0      77294   PCI-MSI 4718599-edge      virtio7-input.3
 72:          0          1          0      39300   PCI-MSI 4718600-edge      virtio7-output.3
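As a quick check, the spread above can be summarised by tallying which CPU takes most of each queue's interrupts; a small sketch matching the virtio7 device in the output above:

# Show the busiest CPU for each virtio7 IRQ in /proc/interrupts.
with open("/proc/interrupts") as f:
    cpus = f.readline().split()          # header row: CPU0 CPU1 ...
    for line in f:
        if "virtio7" not in line:
            continue
        fields = line.split()
        irq = fields[0].rstrip(":")
        counts = [int(x) for x in fields[1:1 + len(cpus)]]
        busiest = counts.index(max(counts))
        print(f"IRQ {irq}: mostly on {cpus[busiest]} ({max(counts)} interrupts)")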

Upstream also had a discussion in Nov 2017 about doing the balancing in the kernel:

 Implementing irqbalance into the Linux Kernel #59
 https://github.com/Irqbalance/irqbalance/issues/59

Ironically, PJ and Neil said it's not the kernel's job to enforce policy like IRQ affinity, which makes me wonder how the above genirq/affinity patches got in then.