I maintain Fedora's openQA instance, which runs tests in qemu virtual machines on dedicated hosts. It launches qemu directly; libvirt is not involved. Our openQA setup includes several multi-machine tests, which involve running two or more VMs in such a way that they have network access both to each other and to the internet. This is achieved using openvswitch and tap devices. The setup is documented somewhat at https://open.qa/docs/#_tap_based_network , though we do not use the os-autoinst-setup-multi-machine script (it's all set up via ansible) and we don't use GRE tunnels (related groups of jobs are set up to always run on the same host). Most of the setup is done in https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openqa/worker/tasks/tap-setup.yml#_24 , and there's also some custom firewall stuff in https://pagure.io/fedora-infra/ansible/blob/main/f/inventory/group_vars/openqa_tap_workers . There might be some other bits and pieces, but that's the gist of it: there's an openvswitch bridge and a bunch of tap devices, and the individual qemu processes attach to the tap devices. The network-related bits of a typical qemu invocation are:

-netdev tap,id=qanet0,ifname=tap20,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:01:0d

This has all worked flawlessly for years. But recently I updated the staging instance to Fedora 42, and it started having problems.

The problems seem to affect tests that use DHCP. In some of these multi-machine test clusters we just have each VM use static network configuration, and those test groups work OK. But some other cases use a "support server", which runs a DHCP server (and various other services like NFS and HTTP) that the other VMs in the cluster are expected to use. These are the groups that have trouble. In each of these groups we have more than one VM trying to use DHCP via the support server, and it seems like every single time, they don't all succeed. Usually at least one of the 'client' VMs will be able to bring up its network, but sometimes they all fail. I don't think I've seen one case where *all* the clients succeeded; there's always at least one failed test.

Here's an example group: https://openqa.stg.fedoraproject.org/tests/5021586#dependencies . In the blue-highlighted box, support_server@64bit is the support server. The two "install_default_update_netin..." tests are the ones that try to use DHCP from the support server. In this case, both failed.

I had this group rigged so I could see the logs from dnsmasq-dhcp (the DHCP server) as well as from the clients. The server logs show it getting DHCPDISCOVER requests from both clients and immediately sending back DHCPOFFER messages (starting at 21:25:31). This goes on for quite a while - five exchanges for each client. Then at 21:26:07 it progresses to pairs of DHCPREQUEST and DHCPACK messages (I presume the client sends DHCPREQUEST and the server sends DHCPACK); I see three of these pairs for each client. Then it goes back to five DHCPDISCOVER / DHCPOFFER pairs (at 21:26:13), then three more DHCPREQUEST / DHCPACK pairs (at 21:26:45). This all *seems* like a lot more traffic than should be necessary (though I don't have a 'normal' log to compare to at present). The second set of DHCPREQUEST/DHCPACK pairs ends at 21:26:55, at which point the clients seem to give up; there is no more DHCP traffic.
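For concreteness, the bridge/tap plumbing the ansible role creates amounts to roughly the following. This is only a sketch with illustrative names (br0, tap20); the linked ansible role is the authoritative version:

  # create the openvswitch bridge
  ovs-vsctl add-br br0
  # create a persistent tap device per worker slot, bring it up,
  # and attach it to the bridge; qemu then attaches to the tap
  # device via the -netdev invocation shown above
  ip tuntap add dev tap20 mode tap
  ip link set dev tap20 up
  ovs-vsctl add-port br0 tap20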
The client logs look like this (this is one client, filtered to relevant-looking messages):

01:25:32,570 INFO NetworkManager:<info> [1747963532.5550] dhcp4 (ens4): activation: beginning transaction (timeout in 45 seconds)
01:25:32,570 DEBUG NetworkManager:<debug> [1747963532.5566] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:25:34,690 DEBUG NetworkManager:<debug> [1747963534.6902] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:25:39,180 DEBUG NetworkManager:<debug> [1747963539.1798] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:25:47,461 DEBUG NetworkManager:<debug> [1747963547.4615] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:03,495 DEBUG NetworkManager:<debug> [1747963563.4948] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:08,825 DEBUG NetworkManager:<debug> [1747963568.8252] dhcp4 (ens4): event: received OFFER of 172.16.2.186 from 172.16.2.110
01:26:08,826 DEBUG NetworkManager:<debug> [1747963568.8254] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:11,015 DEBUG NetworkManager:<debug> [1747963571.0153] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:15,447 DEBUG NetworkManager:<debug> [1747963575.4470] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:18,288 DEBUG NetworkManager:<debug> [1747963578.2887] device[1915d724c7c136f1] (ens4): ip:dhcp4: set state fail (was pending)
01:26:18,289 DEBUG NetworkManager:<debug> [1747963578.2890] device[1915d724c7c136f1] (ens4): ip4: set state fail (was pending, reason: check-ip-state)
01:26:18,290 DEBUG NetworkManager:<debug> [1747963578.2890] device[1915d724c7c136f1] (ens4): ip: set (combined) state fail (was pending, reason: check-ip-state)
01:26:18,290 INFO NetworkManager:<info> [1747963578.2891] device (ens4): state change: ip-config -> failed (reason 'ip-config-unavailable', managed-type: 'full')
01:26:18,300 INFO NetworkManager:<info> [1747963578.3008] dhcp4 (ens4): canceled DHCP transaction
01:26:18,301 INFO NetworkManager:<info> [1747963578.3009] dhcp4 (ens4): activation: beginning transaction (timeout in 45 seconds)
01:26:18,301 INFO NetworkManager:<info> [1747963578.3009] dhcp4 (ens4): state changed no lease
01:26:18,328 INFO NetworkManager:<info> [1747963578.3146] dhcp4 (ens4): activation: beginning transaction (timeout in 45 seconds)
01:26:18,329 DEBUG NetworkManager:<debug> [1747963578.3170] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:20,649 DEBUG NetworkManager:<debug> [1747963580.6492] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:25,401 DEBUG NetworkManager:<debug> [1747963585.4010] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:33,654 DEBUG NetworkManager:<debug> [1747963593.6537] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:49,784 DEBUG NetworkManager:<debug> [1747963609.7842] dhcp4 (ens4): event: send DISCOVER to 255.255.255.255
01:26:49,787 DEBUG NetworkManager:<debug> [1747963609.7875] dhcp4 (ens4): event: received OFFER of 172.16.2.186 from 172.16.2.110
01:26:49,788 DEBUG NetworkManager:<debug> [1747963609.7880] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:52,467 DEBUG NetworkManager:<debug> [1747963612.4675] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:26:57,340 DEBUG NetworkManager:<debug> [1747963617.3399] dhcp4 (ens4): event: send REQUEST of 172.16.2.186 to 255.255.255.255
01:27:03,040 DEBUG NetworkManager:<debug> [1747963623.0386] device[1915d724c7c136f1] (ens4): ip4: set state none (was fail, reason: ip-state-clear)
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0386] device[1915d724c7c136f1] (ens4): ip6: set state none (was fail, reason: ip-state-clear)
01:27:03,041 WARNING NetworkManager:<warn> [1747963623.0386] device (ens4): Activation: failed for connection 'ens4'
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0387] device[1915d724c7c136f1] (ens4): add_pending_action (2): 'queued-state-change-disconnected'
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0387] device[1915d724c7c136f1] (ens4): queue-state[disconnected, reason:none, id:243]: queue state change
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0387] device[1915d724c7c136f1] (ens4): remove_pending_action (1): 'in-state-change'
01:27:03,041 DEBUG NetworkManager:<debug> [1747963623.0391] device[1915d724c7c136f1] (ens4): queue-state[disconnected, reason:none, id:243]: change state
01:27:03,041 INFO NetworkManager:<info> [1747963623.0391] device (ens4): state change: failed -> disconnected (reason 'none', managed-type: 'full')
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0391] device[1915d724c7c136f1] (ens4): add_pending_action (2): 'in-state-change'
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0392] device[1915d724c7c136f1] (ens4): ip6: addrgenmode6: set none (already set)
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0395] platform-linux: sysctl: setting '/proc/sys/net/ipv6/conf/ens4/temp_valid_lft' to '604800' (current value is identical)
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0396] platform-linux: sysctl: setting '/proc/sys/net/ipv6/conf/ens4/temp_prefered_lft' to '86400' (current value is identical)
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0397] device[1915d724c7c136f1] (ens4): ip6: addrgenmode6: toggle disable_ipv6 sysctl after disabling addr-gen-mode
01:27:03,042 DEBUG NetworkManager:<debug> [1747963623.0398] platform-linux: sysctl: setting '/proc/sys/net/ipv6/conf/ens4/disable_ipv6' to '1' (current value is '0')
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0400] platform-linux: sysctl: setting '/proc/sys/net/ipv6/conf/ens4/disable_ipv6' to '0' (current value is '1')
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0401] device[1915d724c7c136f1] (ens4): deactivating device (reason 'none') [0]
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0411] firewalld: [28673aa1a0996ea5,remove*:"ens4"]: firewall zone remove ens4:default (not running, simulate success)
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0412] firewalld: [28673aa1a0996ea5,remove*:"ens4"]: complete: drop request simulating success
01:27:03,043 DEBUG NetworkManager:<debug> [1747963623.0412] device[1915d724c7c136f1] (ens4): ip:dhcp4: set state none (was fail)
01:27:03,043 WARNING org.fedoraproject.Anaconda.Modules.Network:DEBUG:anaconda.modules.network.network:NeworkManager state changed to <enum NM_STATE_DISCONNECTED of type NM.State>
01:27:03,044 WARNING org.fedoraproject.Anaconda.Modules.Network:DEBUG:anaconda.modules.network.network:Connected to network: False
01:27:03,045 INFO NetworkManager:<info> [1747963623.0450] dhcp4 (ens4): canceled DHCP transaction

so...it kinda looks like very few of the server responses *reach* the client. On the server we see pairs of DHCPDISCOVER received and DHCPOFFER sent, but the client logs sending five DISCOVERs but only getting an OFFER for the last one.
On the server we see three pairs of REQUEST received and ACK sent, but the client logs show it sending three REQUESTs and getting no ACKs, then giving up. The cycle repeats, and then it gives up permanently.

I tried downgrading various things to find the cause of this, and it turns out to be qemu. With the upgrade to Fedora 42, the worker hosts got qemu 9.2.3-1.fc42. I did a scratch build of qemu 9.1.3 for F42 from the current F41 spec (with nfs disabled, as it doesn't build successfully) - https://koji.fedoraproject.org/koji/taskinfo?taskID=133237554 - and downgraded an affected worker host to that build, and I have not seen this problem on that host since. Every DHCP test group it's run since the downgrade has passed.

I don't have any idea what change in qemu 9.2.x might be causing this. I'm going to try bisecting it as far as I can, but building qemu is quite slow, and then I have to run a test group to see if the build is affected, so it's a bit of a slow bisect loop. I'm currently building 9.2.0 to see if the problem was introduced there, or between 9.2.0 and 9.2.3.
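(For anyone who wants to reproduce the downgrade: the scratch build can be pulled from the linked koji task and installed with something like the following - a sketch; exact package set may vary:

  koji download-task 133237554 --arch=x86_64 --arch=noarch
  sudo dnf downgrade ./qemu*.rpm
)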
Looking at the server logs from a test after the downgrade to qemu 9.1.3, they're way cleaner. There's a single set of DISCOVER / OFFER / REQUEST / ACK messages for each client. Four messages per client and everything is good.
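(If it's useful to compare the two cases at the packet level, the DHCP exchange can be watched from the host on the client's tap device or on the bridge, which should show whether the OFFERs/ACKs actually make it across - a sketch, with tap20 as an illustrative device name:

  sudo tcpdump -i tap20 -vnn 'udp port 67 or udp port 68'
)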
9.2.0 is bad, so this broke between 9.1.x and 9.2.0.
Just to verify: in the case of Comment 1, the *only* thing changed to make it "way cleaner" was switching from qemu 9.2.3 to 9.1.3, and nothing else was changed?

Normally a time-based issue like this would lead me to think of the STP learning time when a port is newly connected and forwarding hasn't yet been enabled, which is usually right around 30 seconds - about how long it takes after the first DHCPDISCOVER until the first OFFER is received. (Although... there is also the amount of time it takes for the OS to boot up to the point of starting dhclient, so maybe that's a false clue.) But that would be on the side of the bridge device, not the tap (and certainly not the emulated device in the guest).

(sorry I don't have any concrete advice; just poking to see if some other clue emerges...)
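One quick thing that could rule the STP theory in or out: check from the host whether STP is even enabled on the openvswitch bridge (a sketch; br0 is an illustrative bridge name):

  # returns true/false; if false, STP learning delay can't be the cause
  ovs-vsctl get Bridge br0 stp_enable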
Yep, the qemu version is the only variable between broken and working cases.

I'm bisecting this ATM; the range is down to 'somewhere between 67194c7018b8b06a1c149757f596bb919c683725 and 9.2.0' so far, at least assuming my bisect procedure is valid (I'll verify everything and double-check results when the bisect is complete).
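For the record, the bisect loop is roughly this (a sketch; configure flags trimmed to the one target we actually test, and in reality each build gets deployed to a worker host):

  git bisect start v9.2.0 67194c7018b8b06a1c149757f596bb919c683725
  # at each step: build qemu, install it on a worker, run a DHCP-using test group
  ./configure --target-list=x86_64-softmmu && make -j$(nproc)
  # then mark the result and let git pick the next commit:
  git bisect good    # or: git bisect bad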
Still bisecting, but I took a quick break to verify that a v9.2.0 build with my bisect setup is bad, and it is, so the bisect procedure looks valid. The current range is 61 commits, and it includes this juicy little cluster:

16f6804c46 vhost_net: fix assertion triggered by batch of host notifiers processing
9379ea9db3 virtio-net: Add queues before loading them
7987d2be5a virtio-net: Copy received header to buffer
17437418c4 virtio-net: Initialize hash reporting values
1981fa9d7d virtio-net: Fix hash reporting when the queue changes
162bdb8113 virtio-net: Do not check for the queue before RSS
a8575f7fb2 virtio-net: Fix size check in dhclient workaround
5930e5ccf3 net: checksum: Convert data to void *

So I'm gonna bet we wind up in one of those, maybe the "Fix size check in dhclient workaround" one.
OK, the bisect comes down to:

commit 7987d2be5a8bc3a502f89ba8cf3ac3e09f64d1ce
Author: Akihiko Odaki
Date:   Fri Nov 22 14:03:12 2024 +0900

    virtio-net: Copy received header to buffer

    receive_header() used to cast the const qualifier of the pointer to
    the received packet away to modify the header. Avoid this by copying
    the received header to buffer.

    Signed-off-by: Akihiko Odaki
    Signed-off-by: Jason Wang

CCing them.
Upstream fixed the issue by reverting the commit. The upstream issue is tracked at: https://gitlab.com/qemu-project/qemu/-/issues/2727
ah, thanks. I should've looked there before bisecting again...sigh.
FEDORA-2025-2b4cc4d8cd (qemu-9.2.4-1.fc42) has been submitted as an update to Fedora 42. https://bodhi.fedoraproject.org/updates/FEDORA-2025-2b4cc4d8cd
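Once the update reaches updates-testing, it can be pulled onto an affected host with the usual advisory upgrade (a sketch; repository state may vary):

  sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-2b4cc4d8cd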
FEDORA-2025-2b4cc4d8cd (qemu-9.2.4-1.fc42) has been pushed to the Fedora 42 stable repository. If the problem still persists, please make note of it in this bug report.