Description of problem:
While running VM migrations, a few failed. Running a 60-hour test, 1200 iterations, 2 failures.

Version-Release number of selected component (if applicable):
redhat-release-8.2-1.0.el8.x86_64
rhv-release-4.4.0-31-001.noarch
vdsm-4.40.13-1.el8ev.x86_64

How reproducible:
Running a JMETER script, 1 thread

Steps to Reproduce:
1. Select VM
2. Select Host (randomly)
3. Start Migrate

Actual results:
VM migrate failures

Expected results:
No errors

Additional info:
vdsm.log.61: Log TimeStamp - 2020-05-15 01:46:05,809+0000 (vm_214)
vdsm.log.12: Log TimeStamp - 2020-05-17 02:34:15,550+0000 (vm_219)
MigrationErrors_4.4 - Engine ERRORS
https://drive.google.com/open?id=1Zj0UXSgpug61Dm3Rta_A88lIKJbN01sd
Possibly worth a look, but a clean reproducer may be hard if it only happens 0.16% of the time.
I can see the following errors in the log:

libvirt.libvirtError: operation failed: binding 'vnet54' is already being removed
libvirt.libvirtError: operation failed: binding 'vnet43' already exists

It looks like a race in networking code. The migrations don't seem to be initiated too fast after each other, so it may be a real bug.

David, could you please provide the corresponding logs from the destinations so that we can see more info about the errors? (It's also often helpful to have DEBUG log level enabled.)
Added the vdsm.log.84.xz file (destination host for vm_219):
https://drive.google.com/open?id=1Zj0UXSgpug61Dm3Rta_A88lIKJbN01sd
Thank you, David. I can't see a migration error on the destination. There is a VM.migrationCreate call and then these messages follow:

2020-05-17 02:34:09,968+0000 INFO (dhclient-monitor) [root] Received DHCP response: {'action': 'configure', 'iface': 'vnet54'} (dhclient_monitor:129)
2020-05-17 02:34:09,970+0000 INFO (dhclient-monitor) [root] Interface vnet54 is not a libvirt interface (dhclient_monitor:134)

This matches the message on the source:

2020-05-17 02:34:11,530+0000 ERROR (migsrc/623f6cc1) [virt.vm] (vmId='623f6cc1-d49f-43a4-8170-b5fec01f2897') operation failed: binding 'vnet54' is already being removed (migration:278)

And the VM is destroyed from the source:

2020-05-17 02:34:11,546+0000 INFO (jsonrpc/7) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:SOURCE-IP-ADDRESS,48984, vmId=623f6cc1-d49f-43a4-8170-b5fec01f2897 (api:48)

And the destination reports the VM is already gone:

2020-05-17 02:34:13,422+0000 WARN (vm/623f6cc1) [virt.vm] (vmId='623f6cc1-d49f-43a4-8170-b5fec01f2897') Couldn't destroy incoming VM: Domain not found: no domain with matching uuid '623f6cc1-d49f-43a4-8170-b5fec01f2897' (vm:3673)

The "is already being removed" message is apparently related to device removal in _dhcp_response_handler, which happens after the vnet54 INFO messages above.

Since I can't see any direct migration failure in the destination log, it looks like the migration was stopped by libvirt or QEMU rather than Vdsm. I don't know whether the vnet54 problems started in libvirt, caused the migration failure due to a bug there and made the dhclient-monitor confusion; or whether dhclient-monitor took an action at a wrong time (incoming migration initiation) and thus caused the device error in libvirt and the migration failure.

Aleš, could you please take a look?
(In reply to Milan Zamazal from comment #4)
> Thank you, David. I can't see a migration error on the destination. There is
> a VM.migrationCreate call and then these messages follow:
>
> 2020-05-17 02:34:09,968+0000 INFO (dhclient-monitor) [root] Received DHCP
> response: {'action': 'configure', 'iface': 'vnet54'} (dhclient_monitor:129)
> 2020-05-17 02:34:09,970+0000 INFO (dhclient-monitor) [root] Interface
> vnet54 is not a libvirt interface (dhclient_monitor:134)

This message might be confusing, but it just indicates that the said interface was not configured for DHCP monitoring, that's all. You can see the same message appearing multiple times during successful migrations, e.g.

2020-05-17 02:06:15,461+0000 INFO (dhclient-monitor) [root] Received DHCP response: {'action': 'configure', 'iface': 'vnet128'} (dhclient_monitor:129)
2020-05-17 02:06:15,462+0000 INFO (dhclient-monitor) [root] Interface vnet128 is not a libvirt interface (dhclient_monitor:134)

> This matches the message on the source:
>
> 2020-05-17 02:34:11,530+0000 ERROR (migsrc/623f6cc1) [virt.vm]
> (vmId='623f6cc1-d49f-43a4-8170-b5fec01f2897') operation failed: binding
> 'vnet54' is already being removed (migration:278)
>
> And the VM is destroyed from the source:
>
> 2020-05-17 02:34:11,546+0000 INFO (jsonrpc/7) [api.virt] START
> destroy(gracefulAttempts=1) from=::ffff:SOURCE-IP-ADDRESS,48984,
> vmId=623f6cc1-d49f-43a4-8170-b5fec01f2897 (api:48)
>
> And the destination reports the VM is already gone:
>
> 2020-05-17 02:34:13,422+0000 WARN (vm/623f6cc1) [virt.vm]
> (vmId='623f6cc1-d49f-43a4-8170-b5fec01f2897') Couldn't destroy incoming VM:
> Domain not found: no domain with matching uuid
> '623f6cc1-d49f-43a4-8170-b5fec01f2897' (vm:3673)
>
> The "is already being removed" message is apparently related to device
> removal in _dhcp_response_handler, which happens after the vnet54 INFO
> messages above.

The handler just removes a file that we use for tracking, which in this case does not exist. AFAIK the network code should not touch the libvirt tap device that is connected to the VM.

> Since I can't see any direct migration failure in the destination log, it
> looks like the migration was stopped by libvirt or QEMU rather than Vdsm. I
> don't know whether the vnet54 problems started in libvirt, caused the
> migration failure due to a bug there and made the dhclient-monitor
> confusion; or whether dhclient-monitor took an action at a wrong time
> (incoming migration initiation) and thus caused the device error in libvirt
> and the migration failure. Aleš, could you please take a look?

IMO it seems like a libvirt error, because I can't think of any way we would remove the tap device. Anyway, David, can you please include supervdsm.log around the time of the error? Just to be sure.
Source Host - supervdsm.log.9.xz (F01-H15)
Dest Host - supervdsm.log.8.xz (F02-H19)
Nothing relevant in the supervdsm.log, which further suggests the failure is probably in libvirt. Milane, WDYT?
Thank you, Aleš, for looking. As you describe it, we should report it to libvirt. Since we don't have a reproducer or other proof to file a bug, a question on the libvirt users mailing list may be a better option. Would you like to ask there, since you are familiar with the networking stuff and with what this is all about?

Another option might be to try to reproduce the bug with libvirt debug logging enabled (which we may be asked for anyway). Then we would have info, for us and for the libvirt developers, about what's going on exactly. But I'm not sure it would be feasible and easy enough.
If it can be reproduced, it is IMO a good idea to collect the logs. It might give us an additional hint about what to look for.
David, is there a chance to reproduce the bug with libvirt debug logs enabled?
https://bugzilla.redhat.com/show_bug.cgi?id=1837395#c11
Hm, Aleš, what do you think? Can you ask on the libvirt users list even without additional information? If not, I'm afraid we can't do anything about the bug due to insufficient information.
Aleš asked on the libvirt-users list, which brought me here, so I'm clearing the needinfo you set for him.

> IMO it seems like a libvirt error, because I can't think of any way we would remove the tap device.

Actually libvirt doesn't remove the tap device either; that is done automatically by the kernel when the QEMU process exits. But I believe you are correct that this is a libvirt problem - these libvirt errors:

libvirt.libvirtError: operation failed: binding 'vnet54' is already being removed
libvirt.libvirtError: operation failed: binding 'vnet43' already exists

are produced by libvirt's nwfilter driver. It sounds like another race similar to what's described in this email I wrote last week:

https://www.redhat.com/archives/libvir-list/2020-June/msg00681.html

In the case described by the person starting that thread, the problem was that libvirt was removing a port from an OpenvSwitch switch when a guest was shutting down, but another guest had already been started and was given a tap device with the same name as the just-deleted tap device. (The tap device name is determined by the kernel, which just finds the lowest numbered unused name based on a pattern provided by libvirt; the problem is that the kernel always gives the lowest available number.)

Your problem is an extension of that - the nwfilter rules are also instantiated based on the name of the tap device, and are torn down after the tap device has been removed.

This is even more evidence that we need to follow Dan Berrangé's suggestion (somewhere in that thread) of not relying on the kernel to assign tap device names, but to instead provide the names ourselves and base them on a monotonically increasing counter (with a reasonably large wrap-around to avoid overflowing the netdev name maximum length of 15). This will eliminate the possibility of immediate re-use (except in some extremely dysfunctional circumstance), and thus also eliminate the race condition.

I'm switching this BZ to the RHEL-AV libvirt component and re-assigning it to myself (since I had coincidentally already put this exact task on my personal todo list for the week). If the product needs adjusting to fit what you're using in RHV, feel free to reset it to whatever is appropriate.
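For illustration only, here is a minimal sketch in C of the counter-based naming idea described above. It is not libvirt's actual implementation: generate_tap_name, next_tap_id and MAX_TAP_ID are made-up names, and the wrap-around value is an arbitrary assumption.

#include <stdio.h>

#define TAP_PREFIX "vnet"
#define MAX_TAP_ID 0xFFFFFF   /* assumed wrap point, well below the 15-char netdev name limit */

static unsigned int next_tap_id;   /* hypothetical process-wide counter */

/* Fill 'name' (at least 16 bytes) with the next generated tap device name. */
static int
generate_tap_name(char *name, size_t len)
{
    unsigned int id = next_tap_id;

    next_tap_id = (next_tap_id + 1) % (MAX_TAP_ID + 1);
    if ((size_t)snprintf(name, len, TAP_PREFIX "%u", id) >= len)
        return -1;             /* would exceed the kernel's name length limit */
    return 0;
}

int main(void)
{
    char name[16];

    /* Names are handed out in order and never immediately re-used, so a tap
     * device whose nwfilter binding is still being torn down cannot collide
     * with the device handed to a freshly started guest. */
    for (int i = 0; i < 3; i++) {
        if (generate_tap_name(name, sizeof(name)) == 0)
            printf("%s\n", name);   /* vnet0, vnet1, vnet2 */
    }
    return 0;
}

Compared with letting the kernel pick the lowest free index, the only cost is that names grow over the daemon's lifetime, which is why a large wrap-around rather than re-use of freed indices is sufficient.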
These patches upstream should resolve the issue. They're waiting for libvirt release freeze to be over, and then will be pushed: https://www.redhat.com/archives/libvir-list/2020-August/msg00962.html
These two patches were pushed upstream to fix the problem. They are included in libvirt-6.8.0:

commit d7f38beb2ee072f1f19bb91fbafc9182ce9b069e
Author: Laine Stump <laine>
Date:   Sun Aug 23 14:57:19 2020 -0400

    util: replace macvtap name reservation bitmap with a simple counter

Author: Laine Stump <laine>
Date:   Sun Aug 23 21:20:13 2020 -0400

    util: assign tap device names using a monotonically increasing integer
Accidentally omitted the commit id of the 2nd patch:

commit 95089f481e003d971fe0a082018216c58c1b80e5
Author: Laine Stump <laine>
Date:   Sun Aug 23 21:20:13 2020 -0400

    util: assign tap device names using a monotonically increasing integer
As pointed out by jtomko in Bug 1874304, this patch (fixing an improper initialization I had made in 95089f481e003) is also needed:

commit 2b6cd855042984b87beb7e3c30b67b0f586d89bb
Author: Ján Tomko <jtomko>
CommitDate: 2020-09-14 13:02:56 +0200

    util: virNetDevTapCreate: initialize fd to -1
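For context, a minimal sketch in C of why the fd must start at -1; this is hypothetical code, not the actual virNetDevTapCreate: an error path taken before the device is opened must not accidentally close file descriptor 0.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Close an fd only if it was actually opened; leaves -1 untouched. */
static void
safe_close(int *fd)
{
    if (*fd >= 0) {
        close(*fd);
        *fd = -1;
    }
}

/* Returns an open tun fd, or -1 on error. */
static int
create_tap_sketch(int fail_early)
{
    int fd = -1;               /* the fix: initialized before any error path */

    if (fail_early)
        goto error;            /* error hit before open(): fd is still -1 */

    fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0)
        goto error;

    /* ... TUNSETIFF ioctl and further device setup would follow here ... */
    return fd;

 error:
    safe_close(&fd);           /* harmless when fd was never opened */
    return -1;
}

int main(void)
{
    printf("early failure returns %d\n", create_tap_sketch(1));   /* -1 */
    return 0;
}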
Hi Laine,

Testing on libvirt-6.6.0-5.module+el8.3.0+8092+f9e72d7e.x86_64, it introduced another issue: the vnet* name is kept somewhere and not flushed. Only a libvirtd restart will clear the record. Is this expected?

# virsh start rhel; ifconfig | grep vnet ; virsh destroy rhel; ifconfig | grep vnet ; virsh start rhel; ifconfig | grep vnet ; virsh destroy rhel; ifconfig | grep vnet ; virsh start rhel; ifconfig | grep vnet ;
Domain rhel started

vnet0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
Domain rhel destroyed

Domain rhel started

***vnet1***: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
Domain rhel destroyed

Domain rhel started

***vnet2***: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

Before this bug fix, it should be 'vnet0'.
I don't understand your question. What do you mean by "not flushed"? Where do you continue to see vnet0?

What should happen is that each time a new guest is started (this includes the case of re-starting a guest that had previously been started and then shut down), a new and different tap device name is used, and that name will never be re-used as long as libvirtd is running.

If libvirtd is restarted, it will note the tap device names in use by all guests when it reconnects to the guests at startup, and will continue allocating new tap devices at the index of "highest numbered device + 1". So if you restart libvirtd when many guests have started and stopped, and there are some devices in use, with the highest numbered device currently in use being, e.g., "vnet2753", then the next new tap device will be named "vnet2754". If you happen to shut down *all* running guests, and then restart libvirtd, since it won't find any tap devices in use, the counter will again start at 0.

Are you seeing behavior different from this?

(Looking back after writing this - are you concerned by the fact that when you shut down your guest that had been using vnet0, and then start it again, it now uses a differently-named tap device? If so, then that is the desired behavior.)
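A hypothetical sketch in C of the restart behavior described above (reseed_tap_counter and next_tap_id are illustrative names, not libvirt's code): on daemon startup the counter is re-seeded from the highest numbered tap device still in use, or from 0 when none are found.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static unsigned int next_tap_id;

/* Continue numbering at "highest in-use index + 1", or 0 if none are in use. */
static void
reseed_tap_counter(const char *const *active_names, size_t count)
{
    unsigned int next = 0;

    for (size_t i = 0; i < count; i++) {
        const char *name = active_names[i];

        if (strncmp(name, "vnet", 4) == 0) {
            unsigned long id = strtoul(name + 4, NULL, 10);

            if (id + 1 > next)
                next = id + 1;
        }
    }
    next_tap_id = next;
}

int main(void)
{
    /* e.g. guests reconnected at daemon startup still hold these devices */
    const char *in_use[] = { "vnet12", "vnet2753" };

    reseed_tap_counter(in_use, 2);
    printf("next tap id: %u\n", next_tap_id);   /* 2754 */

    reseed_tap_counter(NULL, 0);                /* all guests shut down */
    printf("next tap id: %u\n", next_tap_id);   /* 0 */
    return 0;
}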
Hi Laine,

Thank you for the explanation, it's very clear. I should have read the comment before asking questions.

I can reproduce the error on libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64 with the steps below:

1. Prepare 2 guests with 2 interfaces each: one NAT interface with an nwfilter; another interface connected to an ovs-bridge.
2. On terminal one, start the first guest rh and wait for a while; then execute "# virsh destroy rh" while at the same time starting the 2nd guest in another terminal; it will fail as:

terminal 1
# virsh destroy rh
Domain rh destroyed

terminal 2
# time virsh start rhel
error: Failed to start domain rhel
error: operation failed: binding 'vnet1' is already being removed

real	0m10.518s
user	0m0.014s
sys	0m0.011s

# grep -i error /var/log/libvirt/libvirtd.log | grep -v info
2020-09-17 05:36:43.824+0000: 106837: error : virNWFilterBindingObjListAddLocked:179 : operation failed: binding 'vnet1' is already being removed
2020-09-17 05:36:53.947+0000: 106837: error : virNetDevVPortProfileOpCommon:1002 : internal error: port-profile setlink timed out
Are you saying that the first guest is using vnet1, and the 2nd guest attempts to also use vnet1? And this is with a build that supposedly has the fix? That should not be possible. Can you verify you're running the newer build?
(In reply to Laine Stump from comment #25)
> Are you saying that the first guest is using vnet1, and the 2nd guest
> attempts to also use vnet1? And this is with a build that supposedly has the
> fix? That should not be possible. Can you verify you're running the newer
> build?

Comment 24 reproduces the bug on libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64, without the fix; the result is as expected.

I have tested on libvirt-6.6.0-5.el8 with the same steps; the issue is fixed and there are no error messages any more. Setting the bug to verified.
Adding other test scenarios:

1. Create tap and macvtap devices with the ip command, then start a VM with nat or macvtap interfaces, or hotplug interfaces to the guest - PASS
2. Restart libvirtd and check that the tap/macvtap device names begin again from vnet0 and macvtap0 - PASS

Details:

# systemctl restart libvirtd

# virsh dumpxml rhel | grep /interface -B9
    <interface type='network'>
      <mac address='52:54:00:f3:41:15'/>
      <source network='default'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </interface>
    <interface type='network'>
      <mac address='52:54:00:3c:4b:df'/>
      <source network='direct'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
    </interface>
    <interface type='direct'>
      <mac address='52:54:00:f2:d7:18'/>
      <source dev='eno1' mode='vepa'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
    </interface>

# ip tuntap add dev vnet0 mod tap
# ip tuntap add dev vnet2 mod tap
# ip link add link eno1 macvtap0 type macvtap
# ip link add link eno1 macvtap2 type macvtap

# virsh start rhel
Domain rhel started

# virsh domiflist rhel
 Interface   Type      Source    Model    MAC
-------------------------------------------------------------
 vnet1       network   default   virtio   52:54:00:f3:41:15
 macvtap1    direct    direct    virtio   52:54:00:3c:4b:df
 macvtap3    direct    eno1      virtio   52:54:00:f2:d7:18

# virsh attach-interface rhel network default
Interface attached successfully

# virsh attach-interface rhel direct eno1
Interface attached successfully

# virsh domiflist rhel
 Interface   Type      Source    Model     MAC
--------------------------------------------------------------
 vnet1       network   default   virtio    52:54:00:f3:41:15
 macvtap1    direct    direct    virtio    52:54:00:3c:4b:df
 macvtap3    direct    eno1      virtio    52:54:00:f2:d7:18
 vnet3       network   default   rtl8139   52:54:00:3a:6f:95
 macvtap4    direct    eno1      rtl8139   52:54:00:45:23:ec

# virsh destroy rhel
Domain rhel destroyed

# virsh start rhel
Domain rhel started

# virsh domiflist rhel
 Interface   Type      Source    Model    MAC
-------------------------------------------------------------
 vnet4       network   default   virtio   52:54:00:f3:41:15
 macvtap5    direct    direct    virtio   52:54:00:3c:4b:df
 macvtap6    direct    eno1      virtio   52:54:00:f2:d7:18

# virsh destroy rhel
# systemctl restart libvirtd

Delete all the pre-created vnet* and macvtap* devices.

# virsh start rhel
# virsh domiflist rhel
 Interface   Type      Source    Model    MAC
-------------------------------------------------------------
 vnet0       network   default   virtio   52:54:00:f3:41:15
 macvtap0    direct    direct    virtio   52:54:00:3c:4b:df
 macvtap1    direct    eno1      virtio   52:54:00:f2:d7:18

The result is as expected.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5137