Bug 2224236

Summary: [RHOSP17.1] Virtual Machine With iavf Driver Flaps (up -> down -> up -> down)
Product: Red Hat OpenStack Reporter: Vadim Khitrin <vkhitrin>
Component: openstack-neutron Assignee: Robin Jarry <rjarry>
Status: ASSIGNED --- QA Contact: Eran Kuris <ekuris>
Severity: medium Docs Contact:
Priority: medium    
Version: 17.1 (Wallaby) CC: chrisw, ekuris, gregraka, gurpsing, hakhande, ivecera, jamsmith, jelynch, jlibosva, jpretori, jschluet, lsvaty, mschmidt, pgrist, rjarry, scohen, yanghliu
Target Milestone: z2 Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
In this release of RHOSP, there is a known issue where SR-IOV interfaces that use Intel X710 and E810 series controller virtual functions (VFs) with the iavf driver can experience network connectivity issues that involve link status flapping. The affected guest kernel versions are:
* RHEL 8.7.0 -> 8.7.3 (No fixes planned. End of life.)
* RHEL 8.8.0 -> 8.8.2 (Fix planned in version 8.8.3.)
* RHEL 9.2.0 -> 9.2.2 (Fix planned in version 9.2.3.)
* Upstream Linux 4.9.0 -> 6.4.* (Fix planned in version 6.5.)
Workaround: There is none, other than to use a non-affected guest kernel.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2228156    
Bug Blocks:    

Description Vadim Khitrin 2023-07-20 08:49:02 UTC
Description of problem:
NOTE: Filing this under the `qemu-kvm-rhev` component since we don't have a clear RCA. For now, this should act as a tracker bug for OSP.

When spawning a virtual machine with an SR-IOV VF interface (iavf driver inside the virtual machine) from an Intel X710 NIC (i40e driver on the hypervisor), we observe the interface flapping (up -> down -> up -> down), and the interface is unable to send traffic.
The link flaps approximately every 6 seconds.

On the compute node, we can see this message in `dmesg`:
```
[ 2208.980821] i40e 0000:63:00.2: VF 3 in reset. Try again.
[ 2209.047800] i40e 0000:63:00.2: VF 3 in reset. Try again.
```

On the virtual machine, we can see these messages in `dmesg`:
```
[  456.791744] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
[  462.411964] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
[  468.557711] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
[  474.709731] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
[  480.325747] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
```
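
To confirm the flap interval from a captured guest `dmesg`, a small script along these lines can compute the gaps between consecutive iavf link-up messages. This is a minimal sketch, not part of any Red Hat tooling: the function names, the `SAMPLE` constant, and the `min_repeats` / `max_gap` thresholds are hypothetical choices made for illustration, and the regular expression assumes the exact message format shown above.

```python
import re

# Matches guest dmesg lines in the format shown in this report, e.g.:
# [  456.791744] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
LINK_UP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+iavf\s+(?P<pci>\S+)\s+(?P<dev>\S+):\s+NIC Link is Up"
)

# Sample input copied from the guest dmesg output in this report.
SAMPLE = """\
[  456.791744] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
[  462.411964] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
[  468.557711] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
[  474.709731] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
[  480.325747] iavf 0000:05:00.0 eth1: NIC Link is Up Speed is 10 Gbps Full Duplex
"""


def link_up_intervals(dmesg_text):
    """Return {interface: [seconds between consecutive link-up messages]}."""
    stamps = {}
    for line in dmesg_text.splitlines():
        match = LINK_UP.search(line)
        if match:
            stamps.setdefault(match.group("dev"), []).append(float(match.group("ts")))
    return {
        dev: [round(later - earlier, 3) for earlier, later in zip(ts, ts[1:])]
        for dev, ts in stamps.items()
    }


def looks_like_flapping(intervals, min_repeats=3, max_gap=10.0):
    """Hypothetical heuristic: several link-up events, each within max_gap seconds."""
    return len(intervals) >= min_repeats and all(gap <= max_gap for gap in intervals)


if __name__ == "__main__":
    for dev, gaps in link_up_intervals(SAMPLE).items():
        print(dev, gaps, "flapping" if looks_like_flapping(gaps) else "stable")
```

On the sample above, the gaps fall between roughly 5.6 and 6.2 seconds, which matches the observed flap period. In practice the text would come from `dmesg` output saved on the guest rather than from the embedded sample.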

Rebooting the virtual machine usually solves the issue.

Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230712.n.1
Kernel: 5.14.0-284.23.1.el9_2.x86_64 

How reproducible:
Frequently.

Steps to Reproduce:
1. Deploy OSP 17.1 with SR-IOV-enabled Intel X710 interfaces
2. Spawn virtual machines
3. Check if there is connectivity on the SR-IOV VF interface from the Intel X710 interface

Actual results:
Sometimes, there is no connectivity for the attached SR-IOV VF interface.

Expected results:
VM boots up with connectivity on the SR-IOV VF interface.

Additional info:
* Did not observe similar behavior with other drivers. For example, with a Mellanox ConnectX-6 interface, the `mlx5_core` driver is used inside the virtual machine, and it is stable.
* Reproduced this issue on Intel CPU and AMD CPU deployments.
* Reproduced this issue on Intel X710 NIC and OEM (Dell) Intel X710 NIC.
* Upgraded to newer firmware, `22.0.9`; the issue still occurs.