Description of problem:

This has been seen a few times on RHEL-8.1 (not sure about other versions) with:
libvirt-5.6.0-10.module+el8.1.1+5309+6d656f05.x86_64

Here is what we rarely observe:

0) The libvirtd service gets restarted unconditionally (there might or might not be VMs/networks already configured).

1) We create a number of VMs, each of which has three NICs on three separate networks. These VMs are usually RHEL-8.1 or RHEL-8.2 and use DHCP on each of their three NICs:

[root@undercloud-0 ~]# more /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO="dhcp"
BOOTPROTOv6="dhcp"
ONBOOT="yes"
TYPE="Ethernet"
USERCTL="yes"
PEERDNS="yes"
IPV6INIT="yes"
PERSISTENT_DHCLIENT="1"
DEFROUTE=yes
IPV6_DEFROUTE=yes

2) After the VM creation we run ansible in a loop to find out what IP each VM got on the network called 'management' (we know the MAC address of each VM), so we run:

virsh net-dhcp-leases management | awk '($4 == "ipv4") && ($3 == "52:54:00:6c:51:0e") {print $5}'

and we wait until this loop eventually finds the MAC addresses of all VMs (the loop moves on to the next MAC address as soon as the current one has been seen once).

Normally, I'd say 98% of the time, this all works just fine. But sometimes it does not, and we observe the following:

3) The loop stays blocked looking for a MAC address that will never show up in 'virsh net-dhcp-leases management', even though the corresponding VM is actually up and running and has received an IP address just fine:

May 26 04:04:23 localhost.localdomain NetworkManager[757]: <info> [1590480263.4513] dhcp4 (eth1): activation: beginning transaction (timeout in 45 seconds)
May 26 04:04:26 localhost.localdomain NetworkManager[757]: <info> [1590480266.6213] dhcp4 (eth1): option dhcp_lease_time => '3600'
May 26 04:04:26 localhost.localdomain NetworkManager[757]: <info> [1590480266.6214] dhcp4 (eth1): option domain_name_servers => '172.16.0.1'
May 26 04:04:26 localhost.localdomain NetworkManager[757]: <info> [1590480266.6214] dhcp4 (eth1): option expiry => '1590483866'
May 26 04:04:26 localhost.localdomain NetworkManager[757]: <info> [1590480266.6214] dhcp4 (eth1): option ip_address => '172.16.0.33'
May 26 04:04:26 localhost.localdomain NetworkManager[757]: <info> [1590480266.6214] dhcp4 (eth1): option next_server => '172.16.0.1'
May 26 04:04:26 localhost.localdomain NetworkManager[757]: <info> [1590480266.6410] device (eth1): Activation: successful, device activated.

Yet 'virsh net-dhcp-leases management' never shows '52:54:00:6c:51:0e' in its output, and this stays so until I force a DHCP renewal by restarting NetworkManager on said VM (I guess waiting > 1 hr would also fix it, since we're not specifying a DHCP lease time on the network and dnsmasq's default is 1 hour). After I do that, the output is as expected and the ansible playbook happily continues:

[root@hab-01 ~]# virsh net-dhcp-leases management
 Expiry Time           MAC address         Protocol   IP address       Hostname   Client ID or DUID
 --------------------------------------------------------------------------------------------------------
 2020-05-26 05:04:44   52:54:00:48:8f:94   ipv4       172.16.0.25/24   -          01:52:54:00:48:8f:94
 2020-05-26 05:23:30   52:54:00:6c:51:0e   ipv4       172.16.0.33/24   -          01:52:54:00:6c:51:0e

So to me it seems we're triggering some sort of race, and dnsmasq (and consequently 'virsh net-dhcp-leases') does not retain the DHCP leases of some VMs.
In this situation, when we look at /var/lib/libvirt/dnsmasq/management.status we do *not* see the MAC address/VM that is missing.

Additional info:

While chasing this I noticed that in the logs we usually do see the DHCP sequence complete correctly (the snippet below is from an older failure):

May 12 20:12:10 hab-02.ha.lab.eng.bos.redhat.com dnsmasq-dhcp[21732]: DHCPDISCOVER(management) 52:54:00:c9:ed:dd
May 12 20:12:10 hab-02.ha.lab.eng.bos.redhat.com dnsmasq-dhcp[21732]: DHCPOFFER(management) 172.16.0.27 52:54:00:c9:ed:dd
May 12 20:12:11 hab-02.ha.lab.eng.bos.redhat.com dnsmasq-dhcp[21732]: DHCPREQUEST(management) 172.16.0.27 52:54:00:c9:ed:dd
May 12 20:12:11 hab-02.ha.lab.eng.bos.redhat.com dnsmasq-dhcp[21732]: DHCPACK(management) 172.16.0.27 52:54:00:c9:ed:dd
May 12 20:12:11 hab-02.ha.lab.eng.bos.redhat.com dnsmasq[21732]: script process exited with status 1

Now I added some debug output to libvirt/src/network/leaseshelper.c, and the "script process exited with status 1" above seems to come from:

    /* Try to claim the pidfile, exiting if we can't */
    if ((pid_file_fd = virPidFileAcquirePath(pid_file, false, getpid())) < 0) {
        fprintf(stderr, "virPidFileAcquirePath(...) failed\n"); /* we added this bit */
        goto cleanup;
    }

I am not sure how/if this error is relevant to our issue, but I thought it worth noting.

Side note: it seems to me that no error is ever logged by the leaseshelper binary; maybe it is missing some initialization of the libvirt logging subsystem, or is this by design?

I guess once https://bugzilla.redhat.com/show_bug.cgi?id=913446 lands, I could shorten the lease time enough and just wait until the VM renews the lease, but until I have that on the hypervisor I am a bit out of ideas (beyond 'wait more than the 3600s default lease time').
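To illustrate the failure mode, here is a minimal, self-contained sketch of non-blocking vs. blocking pidfile acquisition. This is illustrative fcntl()-based locking only, not libvirt's actual virPidFileAcquirePath() implementation, and acquire_pidfile() is a made-up name:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
acquire_pidfile(const char *path, int wait_for_lock)
{
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,                 /* length 0 = lock the whole file */
    };
    int fd = open(path, O_WRONLY | O_CREAT, 0644);

    if (fd < 0)
        return -1;

    /* F_SETLK fails immediately if another process holds the lock,
     * while F_SETLKW blocks until the holder releases it. With the
     * non-waiting variant, a second helper instance spawned for a
     * concurrent DHCP event takes this error path and exits -- which
     * would match the "script process exited with status 1" in the
     * dnsmasq log above. */
    if (fcntl(fd, wait_for_lock ? F_SETLKW : F_SETLK, &fl) < 0) {
        close(fd);
        return -1;
    }

    /* record our PID, as a pidfile normally does */
    if (ftruncate(fd, 0) == 0)
        dprintf(fd, "%d\n", (int)getpid());
    return fd;
}

int
main(void)
{
    int fd = acquire_pidfile("/tmp/leaseshelper-demo.pid", 0 /* no wait */);
    if (fd < 0) {
        perror("acquire_pidfile");
        return 1;
    }
    pause();    /* hold the lock; a second copy will fail immediately */
    return 0;
}

Running two copies of this at once shows the second one failing straight away, much like a second leaseshelper invocation would.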
Michal, any idea about this behavior? Thanks.
(In reply to Michele Baldessari from comment #0)
> Now I added some debug output to libvirt/src/network/leaseshelper.c, and
> the "script process exited with status 1" above seems to come from:
>
>     /* Try to claim the pidfile, exiting if we can't */
>     if ((pid_file_fd = virPidFileAcquirePath(pid_file, false, getpid())) < 0) {
>         fprintf(stderr, "virPidFileAcquirePath(...) failed\n"); /* we added this bit */
>         goto cleanup;
>     }
>
> I am not sure how/if this error is relevant to our issue, but I thought it
> worth noting.

Ah, you may be onto something. I will need to check how dnsmasq executes the script [1], but if it spawns it in parallel, then the first leases helper starts updating the JSON file (which is where net-dhcp-leases gets its data from), and when the second instance arrives it tries to acquire the pidfile and fails (because of the "false" passed in, it either locks immediately or fails without any wait), and thus quits without updating the JSON. Hence the JSON is missing the data from this second write attempt, and net-dhcp-leases doesn't print anything for that lease.

What speaks in favor of this theory is that, IIUC, you spawn a lot of machines at once, so they do DHCP at roughly the same time, allowing the race condition to happen.

BTW, for testing purposes (if you want) just do:

diff --git i/src/network/leaseshelper.c w/src/network/leaseshelper.c
index a1780ca4e4..86c847db6d 100644
--- i/src/network/leaseshelper.c
+++ w/src/network/leaseshelper.c
@@ -160,7 +160,7 @@ main(int argc, char **argv)
     pid_file = g_strdup(RUNSTATEDIR "/leaseshelper.pid");
 
     /* Try to claim the pidfile, exiting if we can't */
-    if ((pid_file_fd = virPidFileAcquirePath(pid_file, false, getpid())) < 0)
+    if ((pid_file_fd = virPidFileAcquirePath(pid_file, true, getpid())) < 0)
         goto cleanup;
 
     /* Since interfaces can be hot plugged, we need to make sure that the

and if my hunch is correct the problem goes away.
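Conceptually, with the one-character change above, concurrent helper instances serialize on the pidfile instead of dropping their updates. A rough sketch of that serialized read-modify-write (simplified to plain-text appends rather than libvirt's actual JSON rewrite; record_lease() is a made-up name, and acquire_pidfile() is the illustrative helper sketched earlier, assumed non-static here):

#include <stdio.h>
#include <unistd.h>

/* from the earlier sketch */
int acquire_pidfile(const char *path, int wait_for_lock);

static int
record_lease(const char *pidfile, const char *statusfile, const char *entry)
{
    /* waiting variant: a second instance queues up here instead of
     * failing and exiting, so no lease update is silently lost */
    int lock_fd = acquire_pidfile(pidfile, 1 /* wait */);
    FILE *st;

    if (lock_fd < 0)
        return -1;

    if (!(st = fopen(statusfile, "a"))) {
        close(lock_fd);
        return -1;
    }
    fprintf(st, "%s\n", entry);
    fclose(st);

    close(lock_fd);   /* closing the fd releases the fcntl() lock */
    return 0;
}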
In fact, I am able to reproduce this (if I put a sleep() at the right place - see patch 1/3 from the referenced series).

Hint for QE: it's important to have two separate networks where DHCP occurs at roughly the same time, because a single instance of dnsmasq runs the leases helper sequentially even when DHCP requests arrive concurrently.

Patches posted upstream:
https://www.redhat.com/archives/libvir-list/2020-June/msg00533.html
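For anyone reproducing without the patch series at hand: the general idea of the test change is to hold the helper's lock long enough for a second, concurrent invocation to collide with it. The exact placement of the delay is in patch 1/3 of the posted series; something along these lines (illustrative only, not the actual patch) widens the race window:

    /* illustrative only -- not the actual 1/3 test patch; holding the
     * lock here keeps it busy while a DHCP event on the second network
     * spawns another helper, whose non-waiting acquire then fails */
    if ((pid_file_fd = virPidFileAcquirePath(pid_file, false, getpid())) < 0)
        goto cleanup;
    sleep(10);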
I think I can more or less confirm that your patches fix the issue for me. I ran with your patch for almost two weeks and had no issues; then a server was rolled back by mistake to the unpatched libvirt rpms and I hit the issue again. So from my POV the patches do seem to fix this.

Thanks,
Michele
Merged upstream as:

9ed345ac1a leaseshelper: Report more errors
876211ef4a leaseshelper: Wait to acquire PID file

v6.4.0-245-g9ed345ac1a
I can reproduce this bug on libvirt-libs-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64; it failed 17 times in 20 test rounds.

Steps to reproduce:

1. Prepare a VM named 'rhel' with 2 interfaces connected to 2 separate NAT networks (default and def), then in a loop start the VM and check the net-dhcp-leases output of each network:

#!/bin/bash
for i in `seq 1 20`; do
    rm -fr /var/lib/libvirt/dnsmasq/virbr0.status
    rm -fr /var/lib/libvirt/dnsmasq/virbr1.status
    virsh net-destroy default
    virsh net-destroy def
    virsh net-start default
    virsh net-start def
    virsh start rhel
    sleep 40
    echo "---------loop $i--------" >> /tmp/result.txt
    virsh net-dhcp-leases default >> /tmp/result.txt
    virsh net-dhcp-leases def >> /tmp/result.txt
    virsh destroy rhel
done

2. Check result.txt: it failed in 17 of the 20 rounds; in those 17 rounds, one of the networks could not get the lease info.
Hi Michal,

Thank you very much for your information in comment 4 about how to reproduce it; it saved me a lot of time and effort.

I also found that this bug can only be reproduced when the 2 interfaces have the same model type, i.e. both "virtio" or both "e1000e". If I set the model types to 1 virtio and 1 e1000e, there is no such issue. I'm just curious about it - do you think it makes sense?
Updated libvirt to libvirt-6.5.0-1.module+el8.3.0+7323+d54bb644.x86_64 and retested by the steps in comment 9; the issue is fixed, so I am setting this bug to verified.

1. Test with 2 virtio interfaces for 20 rounds - pass
2. Test with 2 e1000e interfaces for 20 rounds - pass
(In reply to yalzhang from comment #10)
> I'm just curious about it, do you think it makes sense?

That's weird. The only reason I can think of is that the virtio and e1000e drivers don't line up in sending their DHCP packets at once (e.g. a couple of extra CPU cycles spent emulating e1000e?), and thus the transactions don't happen at exactly the same time and the bug doesn't reproduce.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137