I maintain Fedora's openQA instance, which runs tests in qemu virtual machines on dedicated hosts. It launches qemu directly; libvirt is not involved. Our openQA setup includes several multi-machine tests, which involve running two or more VMs in such a way that they have network access both to each other and to the internet. This is achieved using openvswitch and tap devices. The setup is documented somewhat at https://open.qa/docs/#_tap_based_network , though we do not use the os-autoinst-setup-multi-machine script (it's all set up via ansible) and we don't use GRE tunnels (related groups of jobs are set up to always run on the same host). Most of the setup is done in https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openqa/worker/tasks/tap-setup.yml#_24 , and there's also some custom firewall stuff in https://pagure.io/fedora-infra/ansible/blob/main/f/inventory/group_vars/openqa_tap_workers . There might be some other bits and pieces, but that's the gist of it: there's an openvswitch bridge and a bunch of tap devices, and the individual qemu processes attach to the tap devices. The network-related bits of a typical qemu invocation are:

-netdev tap,id=qanet0,ifname=tap20,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:01:0d

This has all worked flawlessly for years. But recently I updated the staging instance to Fedora 42, and it started having problems.

The problems seem to affect tests that use DHCP. In some of these multi-machine test clusters we just have each VM use static network configuration, and those test groups work OK. But some other cases use a "support server", which runs a DHCP server (and various other services like NFS and HTTP) that the other VMs in the cluster are expected to use. These are the groups that have trouble. In each of these groups we have more than one VM trying to use DHCP via the support server, and it seems like every single time, they don't all succeed. Usually at least one of the 'client' VMs will be able to bring up its network, but sometimes they all fail. I don't think I've seen one case where *all* the clients succeeded; there's always at least one failed test.

Here's an example group: https://openqa.stg.fedoraproject.org/tests/5021586#dependencies . In the blue-highlighted box, support_server@64bit is the support server. The two "install_default_update_netin..." tests are the ones that try to use DHCP from the support server. In this case, both failed.

I had this group rigged so I could see the logs from dnsmasq-dhcp (the DHCP server) as well as from the clients. The server logs show it getting DHCPDISCOVER requests from both clients and immediately sending back DHCPOFFER messages (starting at 21:25:31). This goes on for quite a while - five exchanges for each client. Then at 21:26:07 it progresses to pairs of DHCPREQUEST and DHCPACK messages (I presume the client sends DHCPREQUEST and the server sends DHCPACK); I see three of these pairs for each client. Then it goes back to five DHCPDISCOVER / DHCPOFFER pairs (at 21:26:13), then three more DHCPREQUEST / DHCPACK pairs (at 21:26:45). This all *seems* like a lot more traffic than should be necessary (though I don't have a 'normal' log to compare to at present). The second set of DHCPREQUEST/DHCPACK pairs ends at 21:26:55, at which point the clients seem to give up; there is no more DHCP traffic.
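For concreteness, the bridge/tap plumbing the ansible role creates amounts to roughly the following. This is only a sketch with illustrative names (br0, tap20); the linked ansible role is the authoritative version:

  # create the openvswitch bridge
  ovs-vsctl add-br br0
  # create a persistent tap device per worker slot, bring it up,
  # and attach it to the bridge; qemu then attaches to the tap
  # device via the -netdev invocation shown above
  ip tuntap add dev tap20 mode tap
  ip link set dev tap20 up
  ovs-vsctl add-port br0 tap20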
The client logs look like this (this is one client, filtered to relevant-looking messages):

01:25:32,570 INFO NetworkManager:<info> [1747963532.5550] dhcp4 (ens4): activation: beginning transaction (timeout in 45 seconds)
01:25:32,570 DEBUG NetworkManager:<debug> [1747963532.5566] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:25:34,690 DEBUG NetworkManager:<debug> [1747963534.6902] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:25:39,180 DEBUG NetworkManager:<debug> [1747963539.1798] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:25:47,461 DEBUG NetworkManager:<debug> [1747963547.4615] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:03,495 DEBUG NetworkManager:<debug> [1747963563.4948] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:08,825 DEBUG NetworkManager:<debug> [1747963568.8252] dhcp4 (ens4): event: received OFFER of 172.16.2.186 from 172.16.2.110
01:26:08,826 DEBUG NetworkManager:<debug> [1747963568.8254] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:11,015 DEBUG NetworkManager:<debug> [1747963571.0153] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:15,447 DEBUG NetworkManager:<debug> [1747963575.4470] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:18,288 DEBUG NetworkManager:<debug> [1747963578.2887] device[1915d724c7c136f1] (ens4): ip:dhcp4: set state fail (was pending)
01:26:18,289 DEBUG NetworkManager:<debug> [1747963578.2890] device[1915d724c7c136f1] (ens4): ip4: set state fail (was pending, reason: check-ip-state)
01:26:18,290 DEBUG NetworkManager:<debug> [1747963578.2890] device[1915d724c7c136f1] (ens4): ip: set (combined) state fail (was pending, reason: check-ip-state)
01:26:18,290 INFO NetworkManager:<info> [1747963578.2891] device (ens4): state change: ip-config -> failed (reason 'ip-config-unavailable', managed-type: 'full')
01:26:18,300 INFO NetworkManager:<info> [1747963578.3008] dhcp4 (ens4): canceled DHCP transaction
01:26:18,301 INFO NetworkManager:<info> [1747963578.3009] dhcp4 (ens4): activation: beginning transaction (timeout in 45 seconds)
01:26:18,301 INFO NetworkManager:<info> [1747963578.3009] dhcp4 (ens4): state changed no lease
01:26:18,328 INFO NetworkManager:<info> [1747963578.3146] dhcp4 (ens4): activation: beginning transaction (timeout in 45 seconds)
01:26:18,329 DEBUG NetworkManager:<debug> [1747963578.3170] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:20,649 DEBUG NetworkManager:<debug> [1747963580.6492] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:25,401 DEBUG NetworkManager:<debug> [1747963585.4010] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:33,654 DEBUG NetworkManager:<debug> [1747963593.6537] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:49,784 DEBUG NetworkManager:<debug> [1747963609.7842] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:49,787 DEBUG NetworkManager:<debug> [1747963609.7875] dhcp4 (ens4): event: received OFFER of 172.16.2.186 from 172.16.2.110
01:26:49,788 DEBUG NetworkManager:<debug> [1747963609.7880] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:52,467 DEBUG NetworkManager:<debug> [1747963612.4675] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:57,340 DEBUG NetworkManager:<debug> [1747963617.3399] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:27:03,040 DEBUG NetworkManager:<debug> [1747963623.0386] device[1915d724c7c136f1] (ens4): ip4: set state none (was fail, reason: ip-state-clear)
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0386] device[1915d724c7c136f1] (ens4): ip6: set state none (was fail, reason: ip-state-clear)
01:27:03,041 WARNING NetworkManager:<warn> [1747963623.0386] device (ens4): Activation: failed for connection 'ens4'
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0387] device[1915d724c7c136f1] (ens4): add_pending_action (2): 'queued-state-change-disconnected'
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0387] device[1915d724c7c136f1] (ens4): queue-state[disconnected, reason:none, id:243]: queue state change
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0387] device[1915d724c7c136f1] (ens4): remove_pending_action (1): 'in-state-change'
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0391] device[1915d724c7c136f1] (ens4): queue-state[disconnected, reason:none, id:243]: change state
01:27:03,041 INFO NetworkManager:<info> [1747963623.0391] device (ens4): state change: failed -> disconnected (reason 'none', managed-type: 'full')
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0391] device[1915d724c7c136f1] (ens4): add_pending_action (2): 'in-state-change'
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0392] device[1915d724c7c136f1] (ens4): ip6: addrgenmode6: set none (already set)
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0395] platform-linux: sysctl: setting '/proc/sys/net/ipv6/conf/ens4/temp_valid_lft' to '604800' (current value is identical)
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0396] platform-linux: sysctl: setting '/proc/sys/net/ipv6/conf/ens4/temp_prefered_lft' to '86400' (current value is identical)
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0397] device[1915d724c7c136f1] (ens4): ip6: addrgenmode6: toggle disable_ipv6 sysctl after disabling addr-gen-mode
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0398] platform-linux: sysctl: setting '/proc/sys/net/ipv6/conf/ens4/disable_ipv6' to '1' (current value is '0')
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0400] platform-linux: sysctl: setting '/proc/sys/net/ipv6/conf/ens4/disable_ipv6' to '0' (current value is '1')
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0401] device[1915d724c7c136f1] (ens4): deactivating device (reason 'none') [0]
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0411] firewalld: [28673aa1a0996ea5,remove*:"ens4"]: firewall zone remove ens4:default (not running, simulate success)
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0412] firewalld: [28673aa1a0996ea5,remove*:"ens4"]: complete: drop request simulating success
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0412] device[1915d724c7c136f1] (ens4): ip:dhcp4: set state none (was fail)
01:27:03,043 WARNING org.fedoraproject.Anaconda.Modules.Network:DEBUG:anaconda.modules.network.network:NeworkManager state changed to <enum NM_STATE_DISCONNECTED of type NM.State>
01:27:03,044 WARNING org.fedoraproject.Anaconda.Modules.Network:DEBUG:anaconda.modules.network.network:Connected to network: False
01:27:03,045 INFO NetworkManager:<info> [1747963623.0450] dhcp4 (ens4): canceled DHCP transaction

so...it kinda looks like very few of the server responses *reach* the client. On the server we see pairs of DHCPDISCOVER received and DHCPOFFER sent, but the client logs sending five DISCOVERs but only getting an OFFER for the last one.
On the server we see three pairs of REQUEST received and ACK sent, but the client logs show it sending three REQUESTs and getting no ACKs, then giving up. The cycle repeats, and then it gives up permanently.

I tried downgrading various things to find the cause of this, and it turns out to be qemu. With the upgrade to Fedora 42, the worker hosts got qemu 9.2.3-1.fc42. I did a scratch build of qemu 9.1.3 for F42 from the current F41 spec (with nfs disabled, as it doesn't build successfully) - https://koji.fedoraproject.org/koji/taskinfo?taskID=133237554 - and downgraded an affected worker host to that build, and I have not seen this problem on that host since. Every DHCP test group it's run since the downgrade has passed.

I don't have any idea what change in qemu 9.2.x might be causing this. I'm going to try bisecting it as far as I can, but building qemu is quite slow, and then I have to run a test group to see if the build is affected, so it's a bit of a slow bisect loop. I'm currently building 9.2.0 to see if the problem was introduced there, or between 9.2.0 and 9.2.3.
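(For anyone who wants to reproduce the downgrade: the scratch build can be pulled from the linked koji task and installed with something like the following - a sketch; exact package set may vary:

  koji download-task 133237554 --arch=x86_64 --arch=noarch
  sudo dnf downgrade ./qemu*.rpm
)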
Looking at the server logs from a test after the downgrade to qemu 9.1.3, they're way cleaner. There's a single set of DISCOVER / OFFER / REQUEST / ACK messages for each client. Four messages per client and everything is good.
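(If it's useful to compare the two cases at the packet level, the DHCP exchange can be watched from the host on the client's tap device or on the bridge, which should show whether the OFFERs/ACKs actually make it across - a sketch, with tap20 as an illustrative device name:

  sudo tcpdump -i tap20 -vnn 'udp port 67 or udp port 68'
)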
9.2.0 is bad, so this broke between 9.1.x and 9.2.0.
Just to verify: in the case of Comment 1, the *only* thing changed to make it "way cleaner" was switching from qemu 9.2.3 to 9.1.3, and nothing else was changed?

Normally a time-based issue like this would lead me to think of the STP learning time when a port is newly connected and forwarding hasn't yet been enabled, which is usually right around 30 seconds - about how long it takes after the first DHCPDISCOVER until the first OFFER is received. (Although... there is also the amount of time it takes for the OS to boot up to the point of starting dhclient, so maybe that's a false clue.) But that would be on the side of the bridge device, not the tap (and certainly not the emulated device in the guest).

(sorry I don't have any concrete advice; just poking to see if some other clue emerges...)
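One quick thing that could rule the STP theory in or out: check from the host whether STP is even enabled on the openvswitch bridge (a sketch; br0 is an illustrative bridge name):

  # returns true/false; if false, STP learning delay can't be the cause
  ovs-vsctl get Bridge br0 stp_enable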
Yep, the qemu version is the only variable between broken and working cases.

I'm bisecting this ATM; the range is down to 'somewhere between 67194c7018b8b06a1c149757f596bb919c683725 and 9.2.0' so far, at least assuming my bisect procedure is valid (I'll verify everything and double-check results when the bisect is complete).
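For the record, the bisect loop is roughly this (a sketch; configure flags trimmed to the one target we actually test, and in reality each build gets deployed to a worker host):

  git bisect start v9.2.0 67194c7018b8b06a1c149757f596bb919c683725
  # at each step: build qemu, install it on a worker, run a DHCP-using test group
  ./configure --target-list=x86_64-softmmu && make -j$(nproc)
  # then mark the result and let git pick the next commit:
  git bisect good    # or: git bisect bad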
Still bisecting, but I took a quick break to verify that a v9.2.0 build with my bisect setup is bad, and it is, so the bisect procedure looks valid. The current range is 61 commits, and it includes this juicy little cluster:

16f6804c46 vhost_net: fix assertion triggered by batch of host notifiers processing
9379ea9db3 virtio-net: Add queues before loading them
7987d2be5a virtio-net: Copy received header to buffer
17437418c4 virtio-net: Initialize hash reporting values
1981fa9d7d virtio-net: Fix hash reporting when the queue changes
162bdb8113 virtio-net: Do not check for the queue before RSS
a8575f7fb2 virtio-net: Fix size check in dhclient workaround
5930e5ccf3 net: checksum: Convert data to void *

So I'm gonna bet we wind up in one of those, maybe the "Fix size check in dhclient workaround" one.
OK, the bisect comes down to:

commit 7987d2be5a8bc3a502f89ba8cf3ac3e09f64d1ce
Author: Akihiko Odaki
Date:   Fri Nov 22 14:03:12 2024 +0900

    virtio-net: Copy received header to buffer

    receive_header() used to cast the const qualifier of the pointer to
    the received packet away to modify the header. Avoid this by copying
    the received header to buffer.

    Signed-off-by: Akihiko Odaki
    Signed-off-by: Jason Wang

CCing them.
Upstream fixed the issue by reverting the commit. The upstream issue is tracked at: https://gitlab.com/qemu-project/qemu/-/issues/2727
ah, thanks. I should've looked there before bisecting again...sigh.
FEDORA-2025-2b4cc4d8cd (qemu-9.2.4-1.fc42) has been submitted as an update to Fedora 42. https://bodhi.fedoraproject.org/updates/FEDORA-2025-2b4cc4d8cd
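Once the update reaches updates-testing, it can be pulled onto an affected host with the usual advisory upgrade (a sketch; repository state may vary):

  sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-2b4cc4d8cd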
FEDORA-2025-2b4cc4d8cd (qemu-9.2.4-1.fc42) has been pushed to the Fedora 42 stable repository. If the problem still persists, please make note of it in this bug report.