Bug 2184735 - irqbalance: silently failing to enforce IRQBALANCE_BANNED_CPULIST
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: irqbalance
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: ltao
QA Contact: Jiri Dluhos
URL:
Whiteboard:
Depends On:
Blocks: 2219830
Reported: 2023-04-05 14:23 UTC by Robin Jarry
Modified: 2023-11-07 11:39 UTC (History)
CC List: 14 users

Fixed In Version: irqbalance-1.9.2-2.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2219830
Environment:
Last Closed: 2023-11-07 08:56:07 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Links
System ID                              Private  Priority  Status  Summary                                                   Last Updated
Github Irqbalance irqbalance pull 269  0        None      Merged  Affinity mapping fixes                                    2023-07-14 08:56:09 UTC
Github Irqbalance irqbalance pull 270  0        None      Merged  activate_mapping: avoid logging error when there is none  2023-07-14 08:56:09 UTC
Red Hat Issue Tracker RHELPLAN-154092  0        None      None    None                                                      2023-04-05 14:27:29 UTC
Red Hat Product Errata RHBA-2023:6688  0        None      None    None                                                      2023-11-07 08:56:19 UTC

Description Robin Jarry 2023-04-05 14:23:43 UTC
Hi folks,

When the APICs of the non-banned CPUs run out of room, irqbalance seems to silently let IRQ affinities spill over onto banned CPUs.

Here are a few traces to highlight the problem. The tool I am using (irqstat) is available here: https://pypi.org/project/linux-tools/

> # rpm -qa irqbalance
> irqbalance-1.9.0-3.el9.x86_64
>
> ~# grep ^IRQBALANCE_BANNED_CPU /etc/sysconfig/irqbalance
> IRQBALANCE_BANNED_CPULIST=2-19,22-39
>
> ~# irqstat -n | grep -v '\<0\>$'
> CPU  AFFINITY-IRQs  EFFECTIVE-IRQs
> 0              114             114
> 1               43              34
> 20             184             179
> 21              61              51
>
> ~# echo 10 > /sys/class/net/ens2f0np0/device/sriov_numvfs
> ~# echo 10 > /sys/class/net/ens2f1np1/device/sriov_numvfs
> ~# echo 10 > /sys/class/net/eno3/device/sriov_numvfs
> ~# echo 10 > /sys/class/net/eno4/device/sriov_numvfs
>
> ~# irqstat -n | grep -v '\<0\>$'
> CPU  AFFINITY-IRQs  EFFECTIVE-IRQs
> 0              224             201
> 1               52              43
> 2               97              78
> 4               12              11
> 6               12              11
> 8               13              12
> 10               6               5
> 12              12              11
> 14              12              11
> 16              13              12
> 18              13              12
> 20             234             201
> 21              52              42
> 22              81              68
>
> ~# irqstat -c 2
> IRQ        AFFINITY  EFFECTIVE-CPU  DESCRIPTION
> 47                2              2  IR-PCI-MSI 12582920-edge i40e-eno1-TxRx-7
> 61                2              2  IR-PCI-MSI 12582934-edge i40e-eno1-TxRx-21
> 62                2              2  IR-PCI-MSI 12582935-edge i40e-eno1-TxRx-22
> 63                2              2  IR-PCI-MSI 12582936-edge i40e-eno1-TxRx-23
> 64                2              2  IR-PCI-MSI 12582937-edge i40e-eno1-TxRx-24
> 66                2              2  IR-PCI-MSI 12582939-edge i40e-eno1-TxRx-26
> 68                2              2  IR-PCI-MSI 12582941-edge i40e-eno1-TxRx-28
> 70                2              2  IR-PCI-MSI 12582943-edge i40e-eno1-TxRx-30
> 77                2              2  IR-PCI-MSI 12582950-edge i40e-eno1-TxRx-37
> 92                2              2  IR-PCI-MSI 12584960-edge i40e-0000:18:00.1:misc
> 97                2              2  IR-PCI-MSI 12584965-edge i40e-eno2-TxRx-4
> 102               2              2  IR-PCI-MSI 12584970-edge i40e-eno2-TxRx-9
> 128               2              2  IR-PCI-MSI 12584988-edge i40e-eno2-TxRx-27
> 133               2              2  IR-PCI-MSI 12584993-edge i40e-eno2-TxRx-32
> 134               2              2  IR-PCI-MSI 12584994-edge i40e-eno2-TxRx-33
> 139               2              2  IR-PCI-MSI 12584999-edge i40e-eno2-TxRx-38
> 141               2              2  IR-PCI-MSI 12585001-edge i40e-0000:18:00.1:fdir-TxRx-0
> 157               2              2  IR-PCI-MSI 49285123-edge ens3f1-rx-3
> 168               2              2  IR-PCI-MSI 49289220-edge ens3f3-rx-4
> 251               2              2  IR-PCI-MSI 12587020-edge i40e-eno3-TxRx-11
> 253               2              2  IR-PCI-MSI 12587022-edge i40e-eno3-TxRx-13
> 254               2              2  IR-PCI-MSI 12587023-edge i40e-eno3-TxRx-14
> 256               2              2  IR-PCI-MSI 12587025-edge i40e-eno3-TxRx-16
> 257               2              2  IR-PCI-MSI 12587026-edge i40e-eno3-TxRx-17
> 260               2              2  IR-PCI-MSI 12587029-edge i40e-eno3-TxRx-20
> 269               2              2  IR-PCI-MSI 12587038-edge i40e-eno3-TxRx-29
> 316       0,2,20,22              2  IR-PCI-MSI 49827840-edge mlx5_comp0@pci:0000:5f:01.2
> 317               2              2  IR-PCI-MSI 49827841-edge mlx5_comp1@pci:0000:5f:01.2
> 329               2              2  IR-PCI-MSI 49829889-edge mlx5_comp1@pci:0000:5f:01.3
> 352       0,2,20,22              2  IR-PCI-MSI 49833984-edge mlx5_comp0@pci:0000:5f:01.5
> 353               2              2  IR-PCI-MSI 49833985-edge mlx5_comp1@pci:0000:5f:01.5
> 364       0,2,20,22              2  IR-PCI-MSI 49836032-edge mlx5_comp0@pci:0000:5f:01.6
> 365               2              2  IR-PCI-MSI 49836033-edge mlx5_comp1@pci:0000:5f:01.6
> 380               2              2  IR-PCI-MSI 49807363-edge mlx5_comp3@pci:0000:5f:00.0
> 382               2              2  IR-PCI-MSI 49807365-edge mlx5_comp5@pci:0000:5f:00.0
> 384               2              2  IR-PCI-MSI 49807367-edge mlx5_comp7@pci:0000:5f:00.0
> 386               2              2  IR-PCI-MSI 49807369-edge mlx5_comp9@pci:0000:5f:00.0
> 387               2              2  IR-PCI-MSI 49807370-edge mlx5_comp10@pci:0000:5f:00.0
> 406               2              2  IR-PCI-MSI 49807389-edge mlx5_comp29@pci:0000:5f:00.0
> 407               2              2  IR-PCI-MSI 49807390-edge mlx5_comp30@pci:0000:5f:00.0
> 408               2              2  IR-PCI-MSI 49807391-edge mlx5_comp31@pci:0000:5f:00.0
> 413               2              2  IR-PCI-MSI 49807396-edge mlx5_comp36@pci:0000:5f:00.0
> 420               2              2  IR-PCI-MSI 12589058-edge i40e-eno4-TxRx-1
> 421               2              2  IR-PCI-MSI 12589059-edge i40e-eno4-TxRx-2
> 423               2              2  IR-PCI-MSI 12589061-edge i40e-eno4-TxRx-4
> 425               2              2  IR-PCI-MSI 12589063-edge i40e-eno4-TxRx-6
> 427               2              2  IR-PCI-MSI 12589065-edge i40e-eno4-TxRx-8
> 436               2              2  IR-PCI-MSI 12589074-edge i40e-eno4-TxRx-17
> 441               2              2  IR-PCI-MSI 12589079-edge i40e-eno4-TxRx-22
> 447               2              2  IR-PCI-MSI 12589085-edge i40e-eno4-TxRx-28
> 450               2              2  IR-PCI-MSI 12589088-edge i40e-eno4-TxRx-31
> 452               2              2  IR-PCI-MSI 12589090-edge i40e-eno4-TxRx-33
> 454               2              2  IR-PCI-MSI 12589092-edge i40e-eno4-TxRx-35
> 457               2              2  IR-PCI-MSI 12589095-edge i40e-eno4-TxRx-38
> 530               2              2  IR-PCI-MSI 49809418-edge mlx5_comp10@pci:0000:5f:00.1
> 531               2              2  IR-PCI-MSI 49809419-edge mlx5_comp11@pci:0000:5f:00.1
> 532               2              2  IR-PCI-MSI 49809420-edge mlx5_comp12@pci:0000:5f:00.1
> 533               2              2  IR-PCI-MSI 49809421-edge mlx5_comp13@pci:0000:5f:00.1
> 534               2              2  IR-PCI-MSI 49809422-edge mlx5_comp14@pci:0000:5f:00.1
> 535               2              2  IR-PCI-MSI 49809423-edge mlx5_comp15@pci:0000:5f:00.1
> 536               2              2  IR-PCI-MSI 49809424-edge mlx5_comp16@pci:0000:5f:00.1
> 537               2              2  IR-PCI-MSI 49809425-edge mlx5_comp17@pci:0000:5f:00.1
> 538               2              2  IR-PCI-MSI 49809426-edge mlx5_comp18@pci:0000:5f:00.1
> 539               2              2  IR-PCI-MSI 49809427-edge mlx5_comp19@pci:0000:5f:00.1
> 611               2              2  IR-PCI-MSI 49838081-edge mlx5_comp1@pci:0000:5f:01.7
> 622       0,2,20,22              2  IR-PCI-MSI 49840128-edge mlx5_comp0@pci:0000:5f:02.0
> 623               2              2  IR-PCI-MSI 49840129-edge mlx5_comp1@pci:0000:5f:02.0
> 635               2              2  IR-PCI-MSI 49842177-edge mlx5_comp1@pci:0000:5f:02.1
> 647               2              2  IR-PCI-MSI 49844225-edge mlx5_comp1@pci:0000:5f:02.2
> 659               2              2  IR-PCI-MSI 49846273-edge mlx5_comp1@pci:0000:5f:02.3
> 732       0,2,20,22              2  IR-PCI-MSI 12756995-edge iavf-eno3v5-TxRx-2
> 734       0,2,20,22              2  IR-PCI-MSI 12759040-edge iavf-0000:18:0a.6:mbx
> 762       0,2,20,22              2  IR-PCI-MSI 12824579-edge iavf-eno4v6-TxRx-2
> 767       0,2,20,22              2  IR-PCI-MSI 12814339-edge iavf-eno4v1-TxRx-2
> 772       0,2,20,22              2  IR-PCI-MSI 12826627-edge iavf-eno4v7-TxRx-2
> 784       0,2,20,22              2  IR-PCI-MSI 12830720-edge iavf-0000:18:0f.1:mbx
> 787       0,2,20,22              2  IR-PCI-MSI 12830723-edge iavf-eno4v9-TxRx-2
> 792       0,2,20,22              2  IR-PCI-MSI 12816387-edge iavf-eno4v2-TxRx-2

The IRQ affinities are spilling over onto banned CPUs. This is probably because the APICs of CPUs 0 and 20 are full, which is purely a hardware limitation.
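
The overspill can be checked mechanically. The following is a minimal Python sketch, not part of irqbalance or irqstat: it expands a CPU list such as the one in IRQBALANCE_BANNED_CPULIST and reports which IRQs have an affinity that touches a banned CPU. The function names are hypothetical, and the sample data is three rows from the `irqstat -c 2` output above plus one made-up IRQ (999).

```python
def parse_cpulist(text):
    """Expand a CPU list such as "2-19,22-39" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def irqs_on_banned_cpus(affinities, banned):
    """Return {irq: offending CPUs} for IRQs whose affinity touches a banned CPU."""
    result = {}
    for irq, cpulist in affinities.items():
        overlap = parse_cpulist(cpulist) & banned
        if overlap:
            result[irq] = sorted(overlap)
    return result


banned = parse_cpulist("2-19,22-39")
# IRQs 47 and 316 are copied from the trace above; IRQ 999 is a hypothetical
# entry that stays on allowed CPUs only.
sample = {47: "2", 316: "0,2,20,22", 999: "0,20"}
print(irqs_on_banned_cpus(sample, banned))  # {47: [2], 316: [2, 22]}
```

On a live system the affinity strings could be read from /proc/irq/<N>/smp_affinity_list instead of being hard-coded.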

Reducing the span of banned cpus fixes the issue:

> # grep ^IRQBALANCE_BANNED_CPU /etc/sysconfig/irqbalance
> IRQBALANCE_BANNED_CPULIST=4-19,24-39
>
> ~# irqstat -n | grep -v '\<0\>$'
> CPU  AFFINITY-IRQs  EFFECTIVE-IRQs
> 0              160             164
> 1               31              22
> 2              162             153
> 3               30              21
> 20             164             159
> 21              31              21
> 22             162             157
> 23              30              21

However, it would be nice if irqbalance could at least log a warning or an error that it failed to set the affinity for a specific irq.

> ~# echo 0,20 > /proc/irq/47/smp_affinity_list
> -bash: echo: write error: No space left on device
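
As an illustration of the requested behaviour, here is a minimal Python sketch of a write to smp_affinity_list that logs a warning instead of failing silently. It is not how irqbalance (a C program) is implemented; `set_affinity`, its `opener` parameter, and the `full_apic` stand-in are hypothetical names introduced only for this example.

```python
import errno
import logging

log = logging.getLogger("affinity-sketch")


def set_affinity(irq, cpulist, proc_root="/proc/irq", opener=open):
    """Write a CPU list to smp_affinity_list and log any failure.

    Returns 0 on success, or the errno of the failed write.
    """
    path = f"{proc_root}/{irq}/smp_affinity_list"
    try:
        # The error may surface on write or on the flush done at close;
        # both happen inside this try block.
        with opener(path, "w") as f:
            f.write(cpulist)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            log.warning("IRQ %s: no room left on CPUs %s (ENOSPC)", irq, cpulist)
        else:
            log.warning("IRQ %s: cannot set affinity to %s: %s", irq, cpulist, exc)
        return exc.errno
    return 0


def full_apic(path, mode):
    # Hypothetical stand-in for the kernel rejecting the write, mimicking the
    # "No space left on device" error shown above.
    raise OSError(errno.ENOSPC, "No space left on device")


logging.basicConfig(level=logging.WARNING)
print(set_affinity(47, "0,20", opener=full_apic) == errno.ENOSPC)  # True
```

Passing the opener as a parameter keeps the sketch runnable without root privileges; with the default `open` it would write to the real /proc/irq tree.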

For the record, here is the platform overview:

> NUMA 0
> ======
> 
> Memory: 187GB
> 2MB hugepages: 0
> 1GB hugepages: 32
> 
> CPUs
> ----
> 
> Model name:                      Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
> Cores IDs:
> 0,20    2,22    4,24    6,26    8,28    10,30   12,32   14,34   16,36   18,38
> 
> NICs
> ----
> 
> SLOT          DRIVER     IFNAME     MAC                LINK/STATE  SPEED   DEVICE
> 0000:18:00.0  i40e       eno1       e4:43:4b:48:6a:20  1/up        1Gb/s   Ethernet Controller X710 for 10GbE SFP+
> 0000:18:00.1  i40e       eno2       e4:43:4b:48:6a:21  1/up        1Gb/s   Ethernet Controller X710 for 10GbE SFP+
> 0000:18:00.2  i40e       eno3       e4:43:4b:48:6a:22  1/up        10Gb/s  Ethernet Controller X710 for 10GbE SFP+
> 0000:18:00.3  i40e       eno4       e4:43:4b:48:6a:23  1/up        10Gb/s  Ethernet Controller X710 for 10GbE SFP+
> 0000:3b:00.0  vfio-pci   -          -                  -/-         -       Ethernet Controller E810-C for QSFP
> 0000:3b:00.1  vfio-pci   -          -                  -/-         -       Ethernet Controller E810-C for QSFP
> 0000:5e:00.0  tg3        ens3f0     00:0a:f7:d9:e4:14  0/down      -       NetXtreme BCM5719 Gigabit Ethernet PCIe
> 0000:5e:00.1  tg3        ens3f1     00:0a:f7:d9:e4:15  0/down      -       NetXtreme BCM5719 Gigabit Ethernet PCIe
> 0000:5e:00.2  tg3        ens3f2     00:0a:f7:d9:e4:16  0/down      -       NetXtreme BCM5719 Gigabit Ethernet PCIe
> 0000:5e:00.3  tg3        ens3f3     00:0a:f7:d9:e4:17  0/down      -       NetXtreme BCM5719 Gigabit Ethernet PCIe
> 0000:5f:00.0  mlx5_core  ens2f0np0  04:3f:72:b8:be:6a  1/up        10Gb/s  MT27800 Family [ConnectX-5]
> 0000:5f:00.1  mlx5_core  ens2f1np1  04:3f:72:b8:be:6b  1/up        10Gb/s  MT27800 Family [ConnectX-5]
> 
> NUMA 1
> ======
> 
> Memory: 188GB
> 2MB hugepages: 0
> 1GB hugepages: 32
> 
> CPUs
> ----
> 
> Model name:                      Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
> Cores IDs:
> 1,21    3,23    5,25    7,27    9,29    11,31   13,33   15,35   17,37   19,39
> 
> NICs
> ----
> 
> SLOT          DRIVER    IFNAME  MAC                LINK/STATE  SPEED   DEVICE
> 0000:af:00.0  i40e      ens4f0  f8:f2:1e:42:5d:70  1/up        10Gb/s  Ethernet Controller X710 for 10GbE SFP+
> 0000:af:00.1  i40e      ens4f1  f8:f2:1e:42:5d:70  1/up        10Gb/s  Ethernet Controller X710 for 10GbE SFP+
> 0000:af:00.2  vfio-pci  -       -                  -/-         -       Ethernet Controller X710 for 10GbE SFP+
> 0000:af:00.3  vfio-pci  -       -                  -/-         -       Ethernet Controller X710 for 10GbE SFP+

Comment 1 ltao 2023-06-19 07:10:43 UTC
(In reply to Robin Jarry from comment #0)

Hi Robin,

Thanks for reporting the issue. I think it is reasonable to log a notification when irqbalance spills IRQs over onto banned CPUs. Just out of curiosity, which Beaker machine did you use for testing? I was unable to simulate a system with enough device IRQs to overspill using QEMU.

Thanks,
Tao Liu

Comment 2 Robin Jarry 2023-06-19 07:37:08 UTC
Hi there, any machine with SR-IOV-capable PCI devices should be enough to produce more than 224 IRQs. I don't think you will be able to test this in QEMU.
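
To check whether a candidate test machine reaches that IRQ count, one option is to count the numbered lines in /proc/interrupts. The Python sketch below assumes the usual layout of that file (a CPU header line, then one `<label>:` line per interrupt); the function name and the sample counters are made up for illustration.

```python
def count_numbered_irqs(interrupts_text):
    """Count lines whose label before the first ':' is a plain IRQ number."""
    count = 0
    for line in interrupts_text.splitlines():
        label, sep, _ = line.partition(":")
        if sep and label.strip().isdigit():
            count += 1
    return count


# Hypothetical excerpt in the style of /proc/interrupts; the counters are
# invented, and named entries such as NMI are deliberately not counted.
sample = """\
           CPU0       CPU1
 47:         10          0  IR-PCI-MSI 12582920-edge  i40e-eno1-TxRx-7
 61:          0          5  IR-PCI-MSI 12582934-edge  i40e-eno1-TxRx-21
NMI:          1          1  Non-maskable interrupts
"""
print(count_numbered_irqs(sample))  # 2
```

On a real host the input would be the contents of /proc/interrupts, for example `count_numbered_irqs(open("/proc/interrupts").read())`.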

Comment 3 ltao 2023-07-04 02:34:39 UTC
Patch[1] posted upstream

[1]: https://github.com/Irqbalance/irqbalance/pull/265

Comment 4 Robin Jarry 2023-07-05 14:47:01 UTC
I just realized that this issue actually hides another regression.

This patch https://github.com/Irqbalance/irqbalance/commit/55c5c321c73e4c9b54e041ba8c7d542598685bae (included in irqbalance 1.7.0) causes any failure to enforce smp_affinity to ban the IRQ for the whole life of the process. APIC being out of space is a transient issue but irqbalance will never try again to move the interrupt to another CPU unless it is restarted.
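
The difference between a permanent ban and a retry can be sketched as follows. This is a Python illustration of the policy described above, not the code from either pull request: the class, its method names, and the choice to treat only ENOSPC as transient are assumptions made for the example.

```python
import errno

# Assumption for this sketch: only ENOSPC is treated as transient, because
# that is the error shown in this report. A real daemon may classify more.
TRANSIENT_ERRNOS = {errno.ENOSPC}


class AffinityTracker:
    """Remember which IRQs are banned for good and which should be retried."""

    def __init__(self, apply_affinity):
        # apply_affinity(irq, cpulist) returns 0 on success or an errno value.
        self.apply_affinity = apply_affinity
        self.banned = set()
        self.pending = {}

    def request(self, irq, cpulist):
        if irq in self.banned:
            return False
        err = self.apply_affinity(irq, cpulist)
        if err == 0:
            self.pending.pop(irq, None)
            return True
        if err in TRANSIENT_ERRNOS:
            self.pending[irq] = cpulist  # try again on the next pass
        else:
            self.banned.add(irq)  # permanent failure: stop touching this IRQ
            self.pending.pop(irq, None)
        return False

    def retry_pending(self):
        for irq, cpulist in list(self.pending.items()):
            self.request(irq, cpulist)


attempts = []


def fake_apply(irq, cpulist):
    # Hypothetical backend: the first write fails with ENOSPC, later ones succeed.
    attempts.append(irq)
    return errno.ENOSPC if len(attempts) == 1 else 0


tracker = AffinityTracker(fake_apply)
tracker.request(47, "0,20")  # fails with ENOSPC, IRQ 47 stays pending
tracker.retry_pending()      # second attempt succeeds
print(tracker.banned, tracker.pending)  # set() {}
```

With the behaviour described in this comment, the first ENOSPC would instead have put IRQ 47 in the banned set until the process was restarted.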

I have submitted another pull request here https://github.com/Irqbalance/irqbalance/pull/266.

Waiting for feedback.

Comment 5 Robin Jarry 2023-07-13 15:33:01 UTC
The issue should now be resolved by the commits in https://github.com/Irqbalance/irqbalance/pull/269

Comment 6 Robin Jarry 2023-07-14 08:55:11 UTC
Hi @ltao I have added a small fix to my patch series. Can you take this commit along with the others?

https://github.com/Irqbalance/irqbalance/commit/bc7794dc78474c463a26926749537f23abc4c082

Thanks!

Comment 7 ltao 2023-07-14 10:39:23 UTC
(In reply to Robin Jarry from comment #6)
> Hi @ltao I have added a small fix to my patch series. Can you
> take this commit along with the others?
> 
> https://github.com/Irqbalance/irqbalance/commit/bc7794dc78474c463a26926749537f23abc4c082
> 
> Thanks!

Hi Robin,

Thanks a lot for your work! I will integrate all patches and make a release next Monday.

Thanks,
Tao Liu

Comment 9 ltao 2023-07-28 02:08:47 UTC
Hi Robin,

Could you please check the irqbalance-1.9.2-2.el9 release to see if it works for you? Thanks!

Thanks,
Tao Liu

Comment 10 Robin Jarry 2023-08-01 14:19:41 UTC
Hi Tao,

Sorry about the delay. Yes, irqbalance-1.9.2-2.el9 contains the required fixes. Thanks!

Comment 11 ltao 2023-08-03 05:50:59 UTC
(In reply to Robin Jarry from comment #10)
> Hi Tao,
> 
> sorry about the delay. Yes irqbalance-1.9.2-2.el9 contains the required
> fixes. Thanks!

Hi Robin,

Thanks for the confirmation!

Thanks,
Tao Liu

Comment 13 ltao 2023-08-18 01:51:16 UTC
No release+ flag has been set, so the bug cannot be added to the errata. I don't know whether this is due to the missing DTM and ITR; I will give it a try.

Hi Jiri,

Could you please help set the ITR, then see if the release+ flag can be set?

Thanks,
Tao Liu

Comment 14 ltao 2023-08-18 14:50:19 UTC
OK, thanks for setting the ITR flag, Víctor. However, the errata tool now reports a new error:

Errata Can only add VERIFIED bugs when advisory is in REL PREP state

So maybe a verified flag is needed from QE?

Thanks,
Tao Liu

Comment 17 Jiri Dluhos 2023-08-21 15:34:45 UTC
Thanks Robin for detailed testing!
Setting VERIFIED+OtherQA.

Comment 18 Jiri Dluhos 2023-08-21 15:54:02 UTC
Apologies, not OtherQA; this was the developer's unit testing.

Comment 21 errata-xmlrpc 2023-11-07 08:56:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (irqbalance bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:6688

