Bug 2052947
| Summary: | better handling of NIC irqs | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Paolo Abeni <pabeni> |
| Component: | irqbalance | Assignee: | ltao |
| Status: | NEW --- | QA Contact: | Jiri Dluhos <jdluhos> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 8.5 | CC: | danw, jbainbri, jeder, jmario, jshortt, ruyang, rvr |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Paolo Abeni
2022-02-10 10:22:59 UTC
Added Dan to the CC-list for OCP-team awareness.

Paolo and I were discussing this via email. AIUI the optimal configuration for IRQs is to have NIC channels equal to the number of real cores (not HyperThreads) in the NUMA node local to the NIC, and to not handle multiple IRQs for the same device on the same CPU. Crossing a NUMA node is definitely not good; performance can be as bad as half wirespeed.

This has worked properly since irqbalance-1.0.4-10.el6, so that's good:
Why is irqbalance not balancing interrupts?
https://access.redhat.com/solutions/677073

We often get customers to ban thread siblings from irqbalance:
How to calculate hexadecimal bit mask value for "IRQBALANCE_BANNED_CPUS" parameter
https://access.redhat.com/solutions/3152271

We often get customers to change the number of IRQ channels to the number of real cores:
How should I configure network interface IRQ channels?
https://access.redhat.com/solutions/4367191

Some thoughts around irqbalance improvements here:

1) My understanding is that there is no advantage to spreading IRQ channels from the same device across HyperThreads, presumably because disabling IRQs is local to the physical core (two HyperThreads), not to the logical core (one HyperThread). I forget where I read this, but it was at least 7 years ago. It would be good to confirm whether this is still the case on modern CPUs, or whether some behave differently. This might be particularly complex with AMD's "configurable NUMA" features within a single socket on some models, which change specifics from model to model.

2) If the above holds true, irqbalance could understand siblings from paths like "/sys/devices/system/cpu/cpuX/topology/{core,thread}_siblings_list" and automatically not consider siblings as valid for balancing when the first core is already handling an interrupt, effectively banning HT from irqbalance automatically (a rough sketch of such a sibling ban appears after this comment). For additional complexity depending on CPU brand/family, there are ways to detect these too. Defining the right/wrong way to handle IRQs in code like irqbalance is much better than re-applying the internet knowledge I heard in the middle of last decade, which might have changed with new CPU families.

3) irqbalance should not "double up" IRQs for the same device on CPU cores, but it's common for NIC drivers to create "nr_cpus" IRQ channels. On a NUMA system with HT, this results in many more IRQs than really should be made. irqbalance could detect the number of useful cores in a NUMA node and (where possible) issue commands to change the number of channels on a device to the optimal number. This would have to be "min(real CPUs in NUMA Node, NIC max IRQ channels)" because some devices place a hard limit on IRQ channels, like vmxnet3's maximum of 8 channels (a channel-sizing sketch also follows this comment). For networking, the netlink set-channels interface (used by "ethtool --set-channels") is the standard.

3a) Some storage HBA drivers also allow SMP affinity to be disabled so that their interrupts can be managed, however this appears to be mostly controlled by module options (megaraid_sas has smp_affinity_enable, qla2xxx has ql2xuctrlirq). Unsure if there is a generic interface for these like netlink. Maybe we need storage maintainers to develop a generic interface like networking's netlink (or just use netlink).

4) Items 2 and 3 - automatically changing the number of cores/channels to be considered - move irqbalance from an "interrupt balancer" to an "interrupt manager". This is a welcome change: it would proactively solve a number of performance problems that customers contact us about, and it is in line with other auto-performance-tuning tools that Red Hat supplies, like numad. However, this might be outside the scope of what upstream irqbalance wants to do. In that case, it would be necessary to fork irqbalance or develop a new solution.
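For illustration only, here is a minimal shell sketch of the manual workaround behind the KCS article above and item 2: build an IRQBALANCE_BANNED_CPUS mask that bans every hyperthread except the first thread of each physical core, using the standard sysfs topology files. It is not irqbalance code, and for brevity it assumes at most 63 CPUs; larger systems would need the comma-separated mask chunks described in the KCS article.

```bash
#!/bin/bash
# Sketch only (not irqbalance code): ban every hyperthread except the first
# thread of each physical core via IRQBALANCE_BANNED_CPUS.
# Assumes <= 63 CPUs so a single hex word is enough.
mask=0
for topo in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list; do
    cpu=${topo#/sys/devices/system/cpu/cpu}
    cpu=${cpu%%/*}
    # The sibling list looks like "0,32" or "0-1"; the first entry is the
    # primary thread of this core.
    primary=$(sed 's/[,-].*//' "$topo")
    # Ban this CPU if it is not the primary thread of its core.
    [ "$cpu" != "$primary" ] && mask=$(( mask | (1 << cpu) ))
done
printf 'IRQBALANCE_BANNED_CPUS=%x\n' "$mask"
```

The printed line could go into /etc/sysconfig/irqbalance today; the point of item 2 is that irqbalance could derive the same information internally from the same sysfs files.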
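Likewise, a rough sketch of the channel-sizing rule from item 3, i.e. min(real cores in the local NUMA node, NIC maximum channels). It assumes a PCI NIC whose driver exposes "combined" channels to `ethtool -L`; the interface name `eth0` is only an example, and drivers that expose separate rx/tx counts would need the equivalent rx/tx form.

```bash
#!/bin/bash
# Sketch only: size a NIC's channel count to the number of physical cores in
# its local NUMA node, capped at the device maximum, as described in item 3.
NIC=${1:-eth0}

# NUMA node local to the NIC; -1 (or a missing file) means no locality info.
node=$(cat /sys/class/net/"$NIC"/device/numa_node 2>/dev/null || echo -1)
[ "$node" -lt 0 ] && node=0

# Count physical cores in that node: one per unique (package, core) pair,
# so hyperthread siblings are counted once.
real_cores=$(
    for cpu in /sys/devices/system/node/node"$node"/cpu[0-9]*; do
        echo "$(cat "$cpu"/topology/physical_package_id):$(cat "$cpu"/topology/core_id)"
    done | sort -u | wc -l
)

# Device hard limit from the "Pre-set maximums" section, e.g. vmxnet3 caps
# out at 8 combined channels.
max_combined=$(ethtool -l "$NIC" | awk '/Combined:/ {print $2; exit}')

channels=$(( real_cores < max_combined ? real_cores : max_combined ))
echo "Setting $NIC to $channels combined channels (node $node, $real_cores real cores)"
ethtool -L "$NIC" combined "$channels"
```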
Marc and I are discussing updating Jon's and my old performance tuning whitepaper and moving it into product documentation, which could address this by manual customer action:
Red Hat Enterprise Linux Network Performance Tuning Guide
https://access.redhat.com/articles/1391433
Create a RHEL network performance tuning guide
https://issues.redhat.com/browse/RHELPLAN-137653

Hi Jamie,

Currently we are discussing the rebase planning of irqbalance for the upcoming RHEL 9.3 and RHEL 8.9. I see there are no code updates for this bug, only the performance tuning guide you mentioned in comment 6. I don't know whether the issue is solved by that documentation; should I close the bug?

Thanks,
Tao Liu

(In reply to ltao from comment #7)
> should I close the bug?

No, this is a bug for Paolo (and other network developers) to consider long-term improvement of irqbalance. Please leave this bug as it is.

(In reply to Jamie Bainbridge from comment #8)
> (In reply to ltao from comment #7)
> > should I close the bug?
>
> No, this is a bug for Paolo (and other network developers) to consider
> long-term improvement of irqbalance.
>
> Please leave this bug as it is.

OK, Thanks!

The kernel has since grown an attempt at vector spreading on creation in v4.8 (July 2016), starting with:
genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5e385a6ef31f
The state of that can be seen in the git log for kernel/irq/affinity.c

At least in RHEL 9 we can do something like this now:

```
rhel9 ~]# systemctl status irqbalance
○ irqbalance.service - irqbalance daemon
     Loaded: loaded (/usr/lib/systemd/system/irqbalance.service; disabled; vendor preset: enabled)
     Active: inactive (dead)
       Docs: man:irqbalance(1)
             https://github.com/Irqbalance/irqbalance

rhel9 ~]# grep virtio7 /proc/interrupts
 64:       0       0       0       0  PCI-MSI 4718592-edge  virtio7-config
 65:   26744       0       1       0  PCI-MSI 4718593-edge  virtio7-input.0
 66:   23243       0       0       1  PCI-MSI 4718594-edge  virtio7-output.0
 67:       1  225172       0       0  PCI-MSI 4718595-edge  virtio7-input.1
 68:       0  132020       0       0  PCI-MSI 4718596-edge  virtio7-output.1
 69:       0       0  119062       0  PCI-MSI 4718597-edge  virtio7-input.2
 70:       0       0   82214       1  PCI-MSI 4718598-edge  virtio7-output.2
 71:       1       0       0   77294  PCI-MSI 4718599-edge  virtio7-input.3
 72:       0       1       0   39300  PCI-MSI 4718600-edge  virtio7-output.3
```

(A quick check of the resulting per-IRQ affinities is sketched after this comment.)

Upstream also had a discussion about doing balancing in the kernel here in Nov 2017:
Implementing irqbalance into the Linux Kernel #59
https://github.com/Irqbalance/irqbalance/issues/59

Ironically, PJ and Neil said it's not the kernel's job to enforce policy like IRQ affinity, which makes me wonder how the above genirq/affinity patches got in, then.
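As a small follow-up to the /proc/interrupts output above, here is a sketch of how one might check where the kernel actually placed each vector of a device. The device name `virtio7` is taken from the example above, and `effective_affinity_list` is only present on kernels/architectures that track effective affinity.

```bash
#!/bin/bash
# Sketch: show the configured and effective CPU affinity of every IRQ that
# belongs to one device, using the virtio7 NIC from the output above.
DEV=${1:-virtio7}
for irq in $(awk -F: -v dev="$DEV" '$0 ~ dev {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
    printf 'IRQ %-4s affinity=%-8s effective=%s\n' "$irq" \
        "$(cat /proc/irq/"$irq"/smp_affinity_list)" \
        "$(cat /proc/irq/"$irq"/effective_affinity_list 2>/dev/null || echo n/a)"
done
```

On a guest like the RHEL 9 one above, with irqbalance inactive, this would show the spread that genirq applied when the vectors were allocated.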