Bug 1874308

Summary: nwfilter fails trying to bind/unbind rules for tap device that was/is in use by a different domain [rhel-7.9.z]
Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.0
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: Laine Stump <laine>
Assignee: Laine Stump <laine>
QA Contact: yalzhang <yalzhang>
CC: ahadas, amusil, bugs, chhu, dvaanunu, dyuan, jdenemar, jreznik, jsuchane, laine, lmen, mkalinin, mlehrer, mtessun, mzamazal, ribarry, virt-maint, xuzhang, yalzhang
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clone Of: 1837395
Last Closed: 2021-06-07 15:37:09 UTC

Description Laine Stump 2020-09-01 02:17:04 UTC
+++ This bug was initially created as a clone of Bug #1837395 +++

Description of problem:
While running VM migrations, a few of them failed.
Over a 60-hour test run of 1200 iterations, there were 2 failures.

Version-Release number of selected component (if applicable):

redhat-release-8.2-1.0.el8.x86_64
rhv-release-4.4.0-31-001.noarch
vdsm-4.40.13-1.el8ev.x86_64



How reproducible:

Running a JMeter script, 1 thread

Steps to Reproduce:
1. Select VM
2. Select Host (Randomly)
3. Start Migrate

Actual results:

VM migration failures occur.

Expected results:

No Errors


Additional info:

vdsm.log.61: Log TimeStamp - 2020-05-15 01:46:05,809+0000 (vm_214)
vdsm.log.12: Log TimeStamp - 2020-05-17 02:34:15,550+0000 (vm_219)
MigrationErrors_4.4 - Engine ERRORS


https://drive.google.com/open?id=1Zj0UXSgpug61Dm3Rta_A88lIKJbN01sd

--- Additional comment from Ryan Barry on 2020-05-20 01:13:53 UTC ---

Possibly worth a look, but a clean reproducer may be hard if it only happens 0.16% of the time.

--- Additional comment from Milan Zamazal on 2020-05-20 08:13:24 UTC ---

I can see the following errors in the log:

  libvirt.libvirtError: operation failed: binding 'vnet54' is already being removed

  libvirt.libvirtError: operation failed: binding 'vnet43' already exists

It looks like a race in the networking code. The migrations don't seem to be initiated too quickly after each other, so it may be a real bug.

David, could you please provide the corresponding logs from the destinations so that we can see more info about the errors? (It's also often helpful to have DEBUG log level enabled.)

--- Additional comment from David Vaanunu on 2020-05-20 15:10:11 UTC ---



Added the vdsm.log.84.xz file (destination host for vm_219).

https://drive.google.com/open?id=1Zj0UXSgpug61Dm3Rta_A88lIKJbN01sd

--- Additional comment from Milan Zamazal on 2020-05-20 16:58:45 UTC ---

Thank you, David. I can't see a migration error on the destination. There is a VM.migrationCreate call and then these messages follow:

  2020-05-17 02:34:09,968+0000 INFO  (dhclient-monitor) [root] Received DHCP response: {'action': 'configure', 'iface': 'vnet54'} (dhclient_monitor:129)
  2020-05-17 02:34:09,970+0000 INFO  (dhclient-monitor) [root] Interface vnet54 is not a libvirt interface (dhclient_monitor:134)

This matches the message on the source:

  2020-05-17 02:34:11,530+0000 ERROR (migsrc/623f6cc1) [virt.vm] (vmId='623f6cc1-d49f-43a4-8170-b5fec01f2897') operation failed: binding 'vnet54' is already being removed (migration:278)

And the VM is destroyed from the source:

  2020-05-17 02:34:11,546+0000 INFO  (jsonrpc/7) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:SOURCE-IP-ADDRESS,48984, vmId=623f6cc1-d49f-43a4-8170-b5fec01f2897 (api:48)

And the destination reports the VM is already gone:

  2020-05-17 02:34:13,422+0000 WARN  (vm/623f6cc1) [virt.vm] (vmId='623f6cc1-d49f-43a4-8170-b5fec01f2897') Couldn't destroy incoming VM: Domain not found: no domain with matching uuid '623f6cc1-d49f-43a4-8170-b5fec01f2897' (vm:3673)

The "is already being removed" message is apparently related to device removal in _dhcp_response_handler, which happens after the vnet54 INFO messages above.

Since I can't see any direct migration failure in the destination log, it looks like the migration was stopped by libvirt or QEMU rather than Vdsm. I don't know whether the vnet54 problems started in libvirt, caused the migration failure due to a bug there, and led to the dhclient-monitor confusion; or whether dhclient-monitor took an action at the wrong time (incoming migration initiation) and thus caused the device error in libvirt and the migration failure. Aleš, could you please take a look?

--- Additional comment from Ales Musil on 2020-05-25 06:16:00 UTC ---

(In reply to Milan Zamazal from comment #4)
> Thank you, David. I can't see a migration error on the destination. There is
> VM.migrationCreate call and then these messages follow:
> 
>   2020-05-17 02:34:09,968+0000 INFO  (dhclient-monitor) [root] Received DHCP
> response: {'action': 'configure', 'iface': 'vnet54'} (dhclient_monitor:129)
>   2020-05-17 02:34:09,970+0000 INFO  (dhclient-monitor) [root] Interface
> vnet54 is not a libvirt interface (dhclient_monitor:134)

This message might be confusing, but it just indicates that the interface in question was not configured for DHCP monitoring, that's all.
You can see the same message appearing multiple times during successful migrations, e.g.

2020-05-17 02:06:15,461+0000 INFO  (dhclient-monitor) [root] Received DHCP response: {'action': 'configure', 'iface': 'vnet128'} (dhclient_monitor:129)
2020-05-17 02:06:15,462+0000 INFO  (dhclient-monitor) [root] Interface vnet128 is not a libvirt interface (dhclient_monitor:134)

> 
> This matches the message on the source:
> 
>   2020-05-17 02:34:11,530+0000 ERROR (migsrc/623f6cc1) [virt.vm]
> (vmId='623f6cc1-d49f-43a4-8170-b5fec01f2897') operation failed: binding
> 'vnet54' is already being removed (migration:278)
> 
> And the VM is destroyed from the source:
> 
>   2020-05-17 02:34:11,546+0000 INFO  (jsonrpc/7) [api.virt] START
> destroy(gracefulAttempts=1) from=::ffff:SOURCE-IP-ADDRESS,48984,
> vmId=623f6cc1-d49f-43a4-8170-b5fec01f2897 (api:48)
> 
> And the destination reports the VM is already gone:
> 
>   2020-05-17 02:34:13,422+0000 WARN  (vm/623f6cc1) [virt.vm]
> (vmId='623f6cc1-d49f-43a4-8170-b5fec01f2897') Couldn't destroy incoming VM:
> Domain not found: no domain with matching uuid
> '623f6cc1-d49f-43a4-8170-b5fec01f2897' (vm:3673)
> 
> The "is already being removed" message is apparently related to device
> removal in _dhcp_response_handler, which happens after the vnet54 INFO
> messages above.

The handler just removes a file that we use for tracking, which in this case does not exist.
AFAIK the network code should not touch the libvirt tap device that is connected to the VM.


> 
> Since I can't see any direct migration failure in the destination log, it
> looks like the migration was stopped by libvirt or QEMU rather than Vdsm. I
> don't know whether the vnet54 problems started in libvirt, caused the
> migration failure due to a bug there and made the dhclient-monitor
> confusion; or whether dhclient-monitor took an action at a wrong time
> (incoming migration initiation) and thus caused the device error in libvirt
> and migration failure. Aleš, could you please take a look?

IMO it seems like a libvirt error, because I can't think of any way we would remove the tap device.
Anyway, David, can you please include supervdsm.log from around the time of the error? Just to be sure.

--- Additional comment from David Vaanunu on 2020-05-25 09:10:41 UTC ---

Source Host - supervdsm.log.9.xz  (F01-H15)
Dest Host - supervdsm.log.8.xz (F02-H19)

--- Additional comment from Ales Musil on 2020-05-25 09:52:26 UTC ---

There is nothing relevant in supervdsm.log, which suggests even more strongly that the failure is probably in libvirt. Milan, WDYT?

--- Additional comment from Milan Zamazal on 2020-05-25 10:39:34 UTC ---

Thank you, Aleš, for looking. As you describe it, we should report it to libvirt. Since we don't have a reproducer or other proof to file a bug with, a question on the libvirt users mailing list may be a better option. Would you like to ask there, since you are familiar with the networking stuff and with what this is all about?

Another option might be to try to reproduce the bug with libvirt debug logging enabled (which we may be asked for anyway). Then we would have info for us and the libvirt developers about what's going on exactly. But I'm not sure it would be feasible and easy enough.
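(For reference, a sketch of the usual way to enable libvirt debug logging on the hosts, using the standard libvirtd.conf logging options; the filter list below is only an example, and libvirtd has to be restarted afterwards:)

  # /etc/libvirt/libvirtd.conf -- example debug-logging settings
  log_filters="1:qemu 1:libvirt 3:object 3:json 3:event 1:util"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"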

--- Additional comment from Ales Musil on 2020-05-27 08:18:48 UTC ---

If it can be reproduced, it is IMO a good idea to collect the logs. It might give us an additional hint about what to look for.

--- Additional comment from Milan Zamazal on 2020-05-27 08:32:20 UTC ---

David, is there a chance to reproduce the bug with libvirt debug logs enabled?

--- Additional comment from  on 2020-06-15 07:59:57 UTC ---

We cannot reproduce this with libvirt debug logging enabled; it produces too many files in the scale environment. Since the error doesn't happen at a specific time but rather at some point during a scale run, we don't have a way of managing this amount of logs in our environment.

--- Additional comment from David Vaanunu on 2020-06-16 07:50:26 UTC ---

https://bugzilla.redhat.com/show_bug.cgi?id=1837395#c11

--- Additional comment from Milan Zamazal on 2020-06-16 15:27:14 UTC ---

Hm, Aleš, what do you think? Can you ask on the libvirt users list even without additional information? If not, I'm afraid we can't do anything about the bug due to insufficient information.

--- Additional comment from Laine Stump on 2020-06-23 19:38:47 UTC ---

Ales asked on libvirt-users list, which brought me here, so I'm clearing the needinfo you set for him.

>IMO it seems like libivirt error because I can't think of any way how would we remove the tap device. 

Actually, libvirt doesn't remove the tap device either; that is done automatically by the kernel when the QEMU process exits.

But I believe you are correct that this is a libvirt problem - these libvirt errors:

  libvirt.libvirtError: operation failed: binding 'vnet54' is already being removed
  libvirt.libvirtError: operation failed: binding 'vnet43' already exists

are produced by libvirt's nwfilter driver. It sounds like another race similar to what's described in this email I wrote last week:

  
  https://www.redhat.com/archives/libvir-list/2020-June/msg00681.html

In the case described by the person starting that thread, the problem was that libvirt was removing a port from an Open vSwitch switch when a guest was shutting down, but another guest had already been started and was given a tap device with the same name as the just-deleted tap device. (The tap device name is determined by the kernel, which just finds the lowest-numbered unused name based on a pattern provided by libvirt; the problem is that the kernel always gives out the lowest available number.)

Your problem is an extension of that - the nwfilter rules are also instantiated based on the name of the tap device, and are torn down after the tap device has been removed.
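(For context: nwfilter bindings are keyed by the port, i.e. tap, device name, which is why re-use of a just-freed name can collide with teardown of the old binding. As an illustration, assuming a hypothetical tap device name vnet54, the bindings can be inspected with virsh on libvirt >= 4.5.0:)

  # list current nwfilter port bindings (port device -> filter name)
  virsh nwfilter-binding-list
  # dump the binding XML for one tap device (hypothetical name)
  virsh nwfilter-binding-dumpxml vnet54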

This is even more evidence that we need to follow Dan Berrangé's suggestion (somewhere in that thread) of not relying on the kernel to assign tap device names, but to instead provide the names ourselves, based on a monotonically increasing counter (with a reasonably large wrap-around to avoid overflowing the netdev name maximum length of 15 characters). This will eliminate the possibility of immediate re-use (except under some extremely dysfunctional circumstance), and thus also eliminate the race condition.
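To illustrate the proposed approach, here is a minimal sketch (not the actual libvirt patch; the function names and the wrap-around value are made up for illustration, and a real implementation would protect the counter with a lock):

  /* Sketch: assign tap device names from a monotonically increasing counter
   * instead of relying on the kernel's "lowest free number" behavior.
   * Linux limits interface names to 15 visible characters (IFNAMSIZ - 1),
   * so the counter wraps well before "vnet" plus digits could overflow. */
  #include <stdio.h>

  #define TAP_ID_WRAP 0xFFFFFFF          /* arbitrary wrap point for illustration */

  static unsigned int next_tap_id;       /* next name suffix to hand out */

  static void assign_tap_name(char *buf, size_t buflen)
  {
      snprintf(buf, buflen, "vnet%u", next_tap_id);
      next_tap_id = (next_tap_id + 1) % (TAP_ID_WRAP + 1);
  }

  int main(void)
  {
      char name[16];                     /* IFNAMSIZ on Linux, including the NUL */
      assign_tap_name(name, sizeof(name));
      printf("new tap device: %s\n", name);   /* "vnet0", then "vnet1", ... */
      return 0;
  }

The key property is that a name is not handed out again until the counter wraps, so a tap device deleted a moment ago cannot immediately be reassigned to a new guest while nwfilter/OVS teardown for the old one is still in flight.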

I'm switching this BZ to the RHEL-AV libvirt component and re-assigning it to me (since I had coincidentally already put this exact task on my personal todo list for the week). If the product needs adjusting to fit what you're using in RHV, feel free to reset it to whatever is appropriate.

--- Additional comment from Laine Stump on 2020-08-31 14:13:02 UTC ---

These upstream patches should resolve the issue. They're waiting for the libvirt release freeze to be over, and then they will be pushed:

https://www.redhat.com/archives/libvir-list/2020-August/msg00962.html

Comment 2 Laine Stump 2020-09-13 17:41:06 UTC
These two patches were pushed upstream to fix the problem. They are included in libvirt-6.7.0.

commit d7f38beb2ee072f1f19bb91fbafc9182ce9b069e
Author: Laine Stump <laine>
Date:   Sun Aug 23 14:57:19 2020 -0400

    util: replace macvtap name reservation bitmap with a simple counter

Author: Laine Stump <laine>
Date:   Sun Aug 23 21:20:13 2020 -0400

    util: assign tap device names using a monotonically increasing integer

Comment 3 Laine Stump 2020-09-13 17:42:38 UTC
Correction - those patches are not included in libvirt-6.7.0. They will be in libvirt-6.8.0.

Comment 4 Laine Stump 2020-09-13 17:50:14 UTC
Backported to RHEL7.9 and posted to rhvirt-patches to be included in a z-stream (?) build:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2020-September/msg00197.html

Comment 5 Laine Stump 2020-09-14 20:05:02 UTC
1) I accidentally omitted the commit id for the 2nd patch. Here is a more complete description:


commit 95089f481e003d971fe0a082018216c58c1b80e5
Author: Laine Stump <laine>
Date:   Sun Aug 23 21:20:13 2020 -0400

    util: assign tap device names using a monotonically increasing integer


2) As pointed out by jtomko in Bug 1874304 (same bz but for RHEL8), this patch (fixing an improper initialization I had made in 95089f481e003) is also needed.

commit 2b6cd855042984b87beb7e3c30b67b0f586d89bb
Author:     Ján Tomko <jtomko>
CommitDate: 2020-09-14 13:02:56 +0200

    util: virNetDevTapCreate: initialize fd to -1
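For illustration, a minimal sketch (a hypothetical function, not the libvirt code) of why initializing the fd to -1 matters: error-cleanup paths typically do "if (fd >= 0) close(fd)", so a zero-initialized or uninitialized fd could close an unrelated descriptor such as stdin.

  /* Hypothetical sketch showing the pattern the fix is about; not libvirt code. */
  #include <fcntl.h>
  #include <unistd.h>

  int create_tap_sketch(const char *path)
  {
      int fd = -1;                 /* sentinel: nothing opened yet */

      fd = open(path, O_RDWR);
      if (fd < 0)
          goto error;

      /* ... further device setup that may also jump to error ... */
      return fd;

   error:
      if (fd >= 0)                 /* safe only because fd started at -1 */
          close(fd);
      return -1;
  }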

Comment 9 Jiri Denemark 2020-10-09 19:52:09 UTC
OK, moving back to ASSIGNED.

Comment 17 Jaroslav Suchanek 2021-06-07 15:37:09 UTC
This issue has not been identified as critical for rhel-7.9.z. Upgrading to rhel-8 is recommended. Closing as won't fix.