Bug 2361921

Summary: Network Regression for Fedora 41 and 42 Cloud images on libvirt
Product: [Fedora] Fedora
Reporter: Scott Williams <vwfoxguru>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: NEW
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 42
CC: acaringi, adscvr, airlied, bskeggs, hdegoede, hpa, josef, kernel-maint, linville, masami256, mchehab, ptalbert, steved, suraj.ghimire7
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   

Description Scott Williams 2025-04-23 19:23:21 UTC
1. Please describe the problem:

Running the current Fedora 42 and 41 cloud images in libvirt with a bridged VLAN, network performance is severely degraded (~16-200 KB/s). The same behavior is not seen on current EL9 images, but it was reproduced with the current OpenSUSE Tumbleweed, which is also on kernel 6.14.2.

Here are the relevant errors on boot, repeated many times:

```
[Tue Apr 22 17:15:21 2025] net_ratelimit: 19 callbacks suppressed
[Tue Apr 22 17:15:21 2025] enp1s0: bad gso: type: 4, size: 1448
[Tue Apr 22 17:15:21 2025] enp1s0: bad gso: type: 4, size: 1448
[Tue Apr 22 17:15:21 2025] enp1s0: bad gso: type: 4, size: 1448
```
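For reference, the `type` value in these messages can be decoded, assuming the message comes from the guest's virtio-net driver and reports the raw `gso_type` field of the virtio-net header. The sketch below uses the `VIRTIO_NET_HDR_GSO_*` constants from the virtio specification; the `decode_gso_type` helper itself is hypothetical and only for illustration.

```python
# gso_type values from the virtio specification (struct virtio_net_hdr).
# Assumption: the "type" printed in the "bad gso" message is this raw field.
GSO_NAMES = {
    0: "NONE",
    1: "TCPV4",
    3: "UDP",
    4: "TCPV6",
    5: "UDP_L4",
}
GSO_ECN = 0x80  # flag bit that may be OR'ed into the type


def decode_gso_type(raw: int) -> str:
    """Return a readable name for a raw virtio-net gso_type value."""
    base = raw & ~GSO_ECN
    name = GSO_NAMES.get(base, f"unknown({base})")
    if raw & GSO_ECN:
        name += "|ECN"
    return name


print(decode_gso_type(4))
```

Under that assumption, `type: 4` would correspond to TCP over IPv6 segmentation.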

iperf TCP performance is OK for some reason, but every HTTP/HTTPS client (dnf, wget, curl, etc.) shows the same problem:

```
curl https://hil-speed.hetzner.com/1GB.bin >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 1024M    0 1615k    0     0  36118      0  8:15:28  0:00:45  8:14:43 33159
```

Compare to a freshly deployed AlmaLinux 9.5 on the same hypervisor with an otherwise identical virt config:

```
curl https://hil-speed.hetzner.com/1GB.bin >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 37 1024M   37  381M    0     0  13.9M      0  0:01:13  0:00:27  0:00:46 10.6M
```
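To put the two transfers side by side, the "Average Dload" fields above can be converted to bytes per second and compared. This is a rough sketch, assuming curl's progress meter uses 1024-based suffixes (`k`, `M`, `G`); the `parse_curl_speed` helper is hypothetical.

```python
import re

# Assumption: curl's progress meter suffixes are 1024-based.
_SUFFIX = {"": 1, "k": 1024, "M": 1024**2, "G": 1024**3}


def parse_curl_speed(field: str) -> float:
    """Convert a curl speed field such as '36118' or '13.9M' to bytes/s."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([kMG]?)", field)
    if m is None:
        raise ValueError(f"unrecognized speed field: {field!r}")
    return float(m.group(1)) * _SUFFIX[m.group(2)]


fedora_bps = parse_curl_speed("36118")  # Fedora guest, from the output above
alma_bps = parse_curl_speed("13.9M")    # AlmaLinux 9.5 guest, from the output above
slowdown = alma_bps / fedora_bps
print(f"Fedora: {fedora_bps:.0f} B/s, AlmaLinux: {alma_bps:.0f} B/s, ratio: {slowdown:.0f}x")
```

With these two samples, the AlmaLinux guest downloads roughly 400 times faster than the Fedora guest.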

We have also tried the e1000, e1000e, and rtl8139 NIC models instead of virtio and see the same performance regression.

2. What is the Version-Release number of the kernel:

On Fedora 42, this behavior was seen on these kernel versions:
kernel-core-6.14.0-63.fc42.x86_64
kernel-core-6.14.2-300.fc42.x86_64
kernel-core-6.14.3-300.fc42.x86_64

On Fedora 41:
6.11.4-301.fc41.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Yes - the issue does not occur in the Fedora 40 (40-1.14) cloud image.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Launch a Fedora 42 (or 41) VM from the cloud image with libvirt.
Configure the network as virtio on a bridged VLAN.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Haven't tested this yet.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

```
Apr 22 16:53:48 zoey NetworkManager[978]: <info>  [1745366028.8127] manager: NetworkManager state is now CONNECTED_SITE
Apr 22 16:53:48 zoey NetworkManager[978]: <info>  [1745366028.8128] device (enp1s0): Activation: successful, device activated.
Apr 22 16:53:48 zoey NetworkManager[978]: <info>  [1745366028.8132] manager: NetworkManager state is now CONNECTED_GLOBAL
Apr 22 16:53:48 zoey chronyd[896]: Source 65.74.88.213 online
Apr 22 16:53:48 zoey chronyd[896]: Source 168.235.89.132 online
Apr 22 16:53:48 zoey chronyd[896]: Source 108.61.73.244 online
Apr 22 16:53:48 zoey chronyd[896]: Source 23.168.24.210 online
Apr 22 16:53:49 zoey chronyd[896]: Selected source 23.168.24.210 (2.fedora.pool.ntp.org)
Apr 22 16:53:58 zoey systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Apr 22 16:53:58 zoey audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/syst>
Apr 22 16:54:00 zoey kernel: enp1s0: bad gso: type: 4, size: 1368
Apr 22 16:54:00 zoey kernel: enp1s0: bad gso: type: 4, size: 1368
Apr 22 16:54:00 zoey kernel: enp1s0: bad gso: type: 4, size: 1368
Apr 22 16:54:00 zoey kernel: enp1s0: bad gso: type: 4, size: 1368
Apr 22 16:54:00 zoey kernel: enp1s0: bad gso: type: 4, size: 1368
Apr 22 16:54:01 zoey kernel: enp1s0: bad gso: type: 4, size: 1368
Apr 22 16:54:01 zoey kernel: enp1s0: bad gso: type: 4, size: 1368
Apr 22 16:54:02 zoey kernel: enp1s0: bad gso: type: 4, size: 1448
Apr 22 16:54:02 zoey kernel: enp1s0: bad gso: type: 4, size: 1448
Apr 22 16:54:02 zoey kernel: enp1s0: bad gso: type: 4, size: 1448
Apr 22 16:54:05 zoey kernel: net_ratelimit: 23 callbacks suppressed
```
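A log like the one above can be summarized with a short script. The following is a minimal sketch that matches only the two message formats shown in this report; the `summarize` helper and the sample lines are for illustration.

```python
import re
from collections import Counter

# Patterns for the two kernel messages shown in this report.
BAD_GSO = re.compile(
    r"kernel: (?P<dev>\S+): bad gso: type: (?P<type>\d+), size: (?P<size>\d+)"
)
SUPPRESSED = re.compile(r"net_ratelimit: (?P<n>\d+) callbacks suppressed")


def summarize(lines):
    """Count 'bad gso' messages per (device, type, size) and sum suppressed callbacks."""
    counts = Counter()
    suppressed = 0
    for line in lines:
        m = BAD_GSO.search(line)
        if m:
            counts[(m["dev"], int(m["type"]), int(m["size"]))] += 1
            continue
        m = SUPPRESSED.search(line)
        if m:
            suppressed += int(m["n"])
    return counts, suppressed


sample = [
    "Apr 22 16:54:00 zoey kernel: enp1s0: bad gso: type: 4, size: 1368",
    "Apr 22 16:54:02 zoey kernel: enp1s0: bad gso: type: 4, size: 1448",
    "Apr 22 16:54:05 zoey kernel: net_ratelimit: 23 callbacks suppressed",
]
counts, suppressed = summarize(sample)
for (dev, gso_type, size), n in sorted(counts.items()):
    print(f"{dev}: type={gso_type} size={size} count={n}")
print(f"suppressed by rate limit: {suppressed}")
```

To run it on a full log, pass the lines of `dmesg.txt` (as produced by the `journalctl` command in step 7) to `summarize`. Because of the rate limiting, the counts are a lower bound on the number of rejected packets.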

Reproducible: Always

Comment 1 Scott Williams 2025-04-23 19:34:53 UTC
Libvirt hosts are SUSE Harvester v1.4.1 (released January 2025) running SLE Micro kernel 5.14.21-150500.55.88-default on kubevirt v1.2.2.

Comment 2 Scott Williams 2025-04-23 19:56:30 UTC
Tested in a newer hypervisor environment, Harvester v1.4.2 (released March 11, 2025), which is based on kubevirt v1.3.1, and the issue did not occur there. I therefore suspect that something changed between kubevirt v1.2.2 and v1.3.1, and that the older version is not playing well with 6.14. I also reproduced this on OpenSUSE Tumbleweed (6.14.2) and Ubuntu 25.04 (6.14.0), so it certainly seems like an upstream kernel issue.

Comment 3 Scott Williams 2025-04-23 21:41:14 UTC
Also filed a bug with OpenSUSE Tumbleweed: https://bugzilla.opensuse.org/show_bug.cgi?id=1241662

Comment 4 Scott Williams 2025-04-23 21:52:18 UTC
Beyond the different kubevirt versions, the physical NICs are also different:

Affected hypervisor NICs (the NetXtreme-E's are the relevant ones here):
```
21:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 12)
21:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 12)
63:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
63:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
```

Unaffected hypervisor NICs:
```
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
17:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
17:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
```

Comment 5 Scott Williams 2025-04-24 00:59:50 UTC
After more testing across different clusters and versions, this is not related to the kubevirt version; it is a regression specific to the Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 12) NIC.

The Harvester/kubevirt version didn't matter. On another cluster with mixed NICs, I reproduced the issue, then migrated the VM to a different host in the same cluster with a BCM57416 10G NIC, where it worked. I tested a few different Broadcom NICs, and the BCM57508 was the only problematic one.

Comment 6 Scott Williams 2025-04-24 17:32:55 UTC
Per the cross-reported OpenSUSE ticket, this appears to be related to the following upstream report: https://lore.kernel.org/lkml/1d388413ab9cfd765cd2c5e05b5e69cdb2ec5a10.camel@webked.de/