Back to bug 2219830

Who When What Removed Added
RHEL Program Management 2023-07-05 14:52:15 UTC Target Release 17.1 ---
Eran Kuris 2023-07-05 15:37:24 UTC Severity high medium
Red Hat One Jira (issues.redhat.com) 2023-07-05 15:37:56 UTC Link ID Red Hat Issue Tracker OSP-26344
Eran Kuris 2023-07-05 15:45:20 UTC Doc Type If docs needed, set a value Known Issue
Eran Kuris 2023-07-05 15:47:26 UTC Version 17.0 (Wallaby) 17.1 (Wallaby)
Eran Kuris 2023-07-05 15:48:17 UTC Status NEW ASSIGNED
RHEL Program Management 2023-07-05 15:48:27 UTC Target Release --- 17.1
Robin Jarry 2023-07-06 10:50:04 UTC Doc Text Cause
=====

Every provisioned VF is automatically bound to its default Linux driver (which differs depending on the PF model). The default driver creates a corresponding network interface with its own RX queues for every VF. Each RX queue corresponds to a hardware IRQ. If a VF remains "unused" (i.e., not assigned to a VM), it stays bound to the default Linux driver and its network device keeps requiring an IRQ for every RX queue.

Provisioning a large number of VFs during deployment can cause a large number of IRQs to be created. Every IRQ must be bound to a physical CPU. For NFV deployments, all IRQs should be bound only to housekeeping CPUs to avoid packet loss. On x86_64, a single CPU can only handle 224 IRQs. When the number of housekeeping CPUs is too small to handle all IRQs, irqbalance fails to bind them and the IRQs spill over onto isolated CPUs.
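
One way to see how these IRQs add up is to walk the standard sysfs and procfs interfaces. The following is a minimal, read-only sketch (an illustration only, not part of RHOSP or any shipped tooling); it assumes a Linux host where the PCI device behind each interface uses MSI/MSI-X, so its IRQ numbers appear under /sys/class/net/<iface>/device/msi_irqs:

----
# Minimal sketch: list the MSI/MSI-X IRQs allocated for every network
# interface and the CPUs each IRQ is allowed to fire on. Read-only.
import os

def msi_irqs(iface):
    # Each entry in this sysfs directory is one IRQ number owned by the device.
    path = f"/sys/class/net/{iface}/device/msi_irqs"
    if not os.path.isdir(path):
        return []
    return sorted(int(n) for n in os.listdir(path))

def irq_affinity(irq):
    # The CPUs this IRQ may be serviced on, e.g. "0-3,16-19".
    try:
        with open(f"/proc/irq/{irq}/smp_affinity_list") as f:
            return f.read().strip()
    except OSError:
        return "?"

for dev in sorted(os.listdir("/sys/class/net")):
    irqs = msi_irqs(dev)
    if irqs:
        print(f"{dev}: {len(irqs)} IRQs")
        for irq in irqs:
            print(f"  irq {irq}: cpus {irq_affinity(irq)}")
----

Interfaces whose IRQ affinity lists include isolated CPUs are the ones that can cause the overspill described above.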

Consequence
===========

This isolation issue may cause transient packet loss when IRQs trigger involuntary context switches on OVS-DPDK PMD threads or in guests running DPDK applications.
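
Whether this is happening can be checked by watching the involuntary context switch counters that the kernel keeps per thread. A minimal sketch, assuming Linux procfs and taking the PID of the process to inspect (for example ovs-vswitchd or a DPDK guest application) as its only argument; the file name in the usage comment is a placeholder:

----
# Minimal sketch: print the involuntary context switch counter of every
# thread of one process, so that a counter that keeps growing on a pinned
# PMD thread can be spotted. Usage: python3 ctxt_switches.py <pid>
import glob
import sys

pid = sys.argv[1]  # e.g. the PID of ovs-vswitchd or a DPDK application
for status in sorted(glob.glob(f"/proc/{pid}/task/*/status")):
    fields = {}
    with open(status) as f:
        for line in f:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    tid = status.split("/")[4]
    name = fields.get("Name", "?")
    switches = fields.get("nonvoluntary_ctxt_switches", "?")
    print(f"tid {tid} ({name}): {switches} involuntary context switches")
----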

Workaround (if any)
===================

Apply one or more of the following:

1) Reduce the number of provisioned VFs so that unused VFs do not remain bound to their default Linux driver.

2) Increase the number of housekeeping CPUs to handle all IRQs.

3) Force unused VF network interfaces DOWN to prevent their IRQs from interrupting isolated CPUs (see the sketch after this list).

4) Disable multicast and broadcast traffic on unused VF network interfaces to prevent their IRQs from interrupting isolated CPUs.
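
Workarounds 3 and 4 can be scripted with standard ip(8) commands. The following is a minimal sketch; the interface names are placeholders, and deciding which VFs are actually unused is deployment specific. Only the multicast part of workaround 4 is shown, because how broadcast traffic can be filtered depends on the NIC:

----
# Minimal sketch of workarounds 3 and 4: force a list of unused VF
# interfaces DOWN and clear their MULTICAST flag with ip(8).
import subprocess

# Placeholder names: VF interfaces that are not assigned to any VM.
UNUSED_VFS = ["enp4s0f0v2", "enp4s0f0v3"]

def run(*cmd):
    # Run one ip(8) command and fail loudly if it errors.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for vf in UNUSED_VFS:
    # Workaround 3: force the interface administratively DOWN.
    run("ip", "link", "set", "dev", vf, "down")
    # Workaround 4 (multicast part): clear the MULTICAST flag so the
    # interface stops taking part in multicast.
    run("ip", "link", "set", "dev", vf, "multicast", "off")
----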

Result
======

Transient packet loss should no longer occur.
Robin Jarry 2023-07-06 10:50:49 UTC Doc Text
Robin Jarry 2023-08-02 09:26:21 UTC CC smooney
Flags needinfo?(smooney) needinfo?(ralonso)
CC ralonso
Robin Jarry 2023-08-02 09:30:15 UTC Assignee rhosp-nfv-int rjarry
Ricardo Alonso 2023-08-02 09:33:55 UTC Flags needinfo?(ralonso) needinfo-
Robin Jarry 2023-08-02 10:04:47 UTC Flags needinfo?(ralonsoh)
CC ralonsoh
Robin Jarry 2023-08-02 10:05:15 UTC CC ralonso
Ian Frangs 2023-08-03 15:46:23 UTC Flags needinfo?(rjarry)
Robin Jarry 2023-08-04 13:46:40 UTC Flags needinfo?(smooney) needinfo?(ralonsoh) needinfo?(rjarry)
Greg Rakauskas 2023-08-09 20:46:06 UTC Flags needinfo?(rjarry)
CC gregraka
Doc Text
In RHOSP 17.1 GA there is a known issue of transient packet loss in which hardware interrupt requests (IRQs) cause involuntary context switches on OVS-DPDK PMD threads or in guests running DPDK applications.
+
This issue is the result of provisioning large numbers of VFs during deployment. VFs need IRQs, each of which must be bound to a physical CPU. When there are not enough housekeeping CPUs to handle the volume of IRQs, `irqbalance` fails to bind all of them and the IRQs spill over onto isolated CPUs.
+
Workaround: You can try one or more of these actions:

* Reduce the number of provisioned VFs so that unused VFs do not remain bound to their default Linux driver (see the sketch after this list).
* Increase the number of housekeeping CPUs to handle all IRQs.
* Force unused VF network interfaces down to prevent IRQs from interrupting isolated CPUs.
* Disable multicast and broadcast traffic on unused VF network interfaces that are down to prevent IRQs from interrupting isolated CPUs.
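
As a starting point for the first workaround, the number of VFs currently provisioned on each SR-IOV capable PF can be read from the attributes the kernel exposes in sysfs. A minimal, read-only sketch (an illustration, not RHOSP tooling):

----
# Minimal sketch: report how many VFs are provisioned on each SR-IOV
# capable PF, against the hardware maximum. Read-only.
import os

SYS_NET = "/sys/class/net"

for dev in sorted(os.listdir(SYS_NET)):
    numvfs = os.path.join(SYS_NET, dev, "device", "sriov_numvfs")
    totalvfs = os.path.join(SYS_NET, dev, "device", "sriov_totalvfs")
    if not os.path.exists(numvfs):
        continue  # not an SR-IOV physical function
    with open(numvfs) as f:
        current = f.read().strip()
    with open(totalvfs) as f:
        maximum = f.read().strip()
    print(f"{dev}: {current} VFs provisioned (hardware maximum {maximum})")
----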
Greg Rakauskas 2023-08-09 20:51:47 UTC Flags needinfo-
Mike Burns 2023-08-11 13:59:33 UTC Target Milestone z1 z2
Jenny-Anne Lynch 2023-08-17 09:42:29 UTC CC jelynch
Doc Text
In RHOSP 17.1, there is a known issue of transient packet loss in which hardware interrupt requests (IRQs) cause involuntary context switches on OVS-DPDK PMD threads or in guests running DPDK applications.
+
This issue is the result of provisioning large numbers of VFs during deployment. VFs need IRQs, each of which must be bound to a physical CPU. When there are not enough housekeeping CPUs to handle the volume of IRQs, `irqbalance` fails to bind all of them and the IRQs spill over onto isolated CPUs.
+
Workaround: You can try one or more of these actions:

* Reduce the number of provisioned VFs so that unused VFs do not remain bound to their default Linux driver.
* Increase the number of housekeeping CPUs to handle all IRQs.
* Force unused VF network interfaces down to prevent IRQs from interrupting isolated CPUs.
* Disable multicast and broadcast traffic on unused VF network interfaces that are down to prevent IRQs from interrupting isolated CPUs.
