Bug 1740498 - [downstream clone - 4.3.6] Remove nwfilter configuration from the vdsmd service start
Summary: [downstream clone - 4.3.6] Remove nwfilter configuration from the vdsmd servi...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.1
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ovirt-4.3.6
: 4.3.6
Assignee: Dominik Holler
QA Contact: David Vaanunu
URL:
Whiteboard:
Depends On: 1722854
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-08-13 06:16 UTC by RHV bug bot
Modified: 2019-10-10 15:37 UTC (History)
17 users (show)

Fixed In Version: vdsm-4.30.29
Doc Type: Bug Fix
Doc Text:
Previously, while VDSM service was starting, the nwfilter configuration removed and recreated the 'vdsm-no-mac-spoofing' network filter definition to ensure the filter was up to date. If many vNICs were using the filter, that operation took longer than expected and caused the VDSM startup to timeout. The current release fixes this issue by updating the filter, which takes less than a second independent of how many vNICs are using the filter.
Clone Of: 1722854
Environment:
Last Closed: 2019-10-10 15:36:50 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:3009 0 None None None 2019-10-10 15:37:06 UTC
oVirt gerrit 102573 0 'None' MERGED nwfilter: Update vdsm-no-mac-spoofing 2021-02-04 08:45:11 UTC

Description RHV bug bot 2019-08-13 06:16:37 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1722854 +++
======================================================================

Description of problem:
Vdsmd service may fail to start due to timeout. This timeout is caused by NWFilter configuration. The cause is mainly the suboptimal implementation in libvirt.

Version-Release number of selected component (if applicable):
vdsm-4.30.13-4.el7ev.x86_64

How reproducible:
Not completely clear. It is most visible in very loaded environments.

Steps to Reproduce:
1.# time vdsm-tool nwfilter

Actual results:
time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real    2m19.788s
user    0m0.291s
sys     0m0.092s


Expected results:
The nwfilter configuration is not triggered.

Additional info:

I am not sure why we call the nwfilter configuration when starting vdsm. The 
logic is the following:

    - Look up the filter by nwfilterLookupByName. We only look for vdsm-no-mac-spoofing
    - If we do not find it then we do nwfilterDefineXML
    - If we find it then we do undefine and then nwfilterDefineXML

The code is [1]. I think that if a VM has created on the hypervisor then it has to be a VM created by vdsm and the filters must to be set correctly. So there is no point to undefine and defice. Also, there is not reason to define it if it is not there as the filter may be disable for a purpose. So I believe that this part can be completely removed and speed up the vdsmd restart. BTW you cannot undefine any filter as the VMs are runinng anyway. we will get 

    libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use


[1]
        try:
            old_filter = conn.nwfilterLookupByName(self.filterName)
        except libvirt.libvirtError:  # No filter, we can define it.
            pass
        else:
            try:
                old_filter.undefine()
            except libvirt.libvirtError:
                # If the filter is in use it may fail to be removed. In that
                # case we warn of it and consider the task done.
                logging.warning('Failed to remove the old filter %s.',
                                old_filter.name(), exc_info=True)
                return

        nwFilter = conn.nwfilterDefineXML(self.buildFilterXml())
        logging.debug("Filter %s was defined", nwFilter.name())

(Originally by Roman Hodain)

Comment 2 RHV bug bot 2019-08-13 06:16:41 UTC
Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ?
During the start of VDSM, maybe VDSM can just create the network filters if they do not exist (or fail if they not exist)?

(Originally by Dominik Holler)

Comment 3 RHV bug bot 2019-08-13 06:16:43 UTC
One more thing to add. Keep in mind that the vdsm-no-mac-spoofing can be disabled in the portal per network in the vNIC  Profiles. In that case, the vdsm restart could cause the vdsm-no-mac-spoofing to enable even though it is disabled in the UI.

(Originally by Roman Hodain)

Comment 4 RHV bug bot 2019-08-13 06:16:45 UTC
I do not understand what takes so long. It may be the multiple libvirt calls we do.

(In reply to Roman Hodain from comment #3)
> One more thing to add. Keep in mind that the vdsm-no-mac-spoofing can be
> disabled in the portal per network in the vNIC  Profiles. In that case, the
> vdsm restart could cause the vdsm-no-mac-spoofing to enable even though it
> is disabled in the UI.

This is not the case. Creating the filter is one thing, using it is another. In oVirt APIs you can only opt out from using the filter for a vnic profile, but it is always defined.


(In reply to Dominik Holler from comment #2)
> Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool
> nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ?
> During the start of VDSM, maybe VDSM can just create the network filters if
> they do not exist (or fail if they not exist)?

Historically, we wanted to make sure that the host has the most up-to-date definition of the filter, and we wanted to forget about it during run time.
If testing for the filter and creating it is SO slow, we CERTAINLY would not like to do it on every VM start up - it would slow them all down. If we take this path we must @momoize this.

Another option that I considered many many years ago, was to ship the filter xml in vdsm RPM, placing it into libvirt directory structure. This is quite rude, as the directory structure is not an API, and it requires restarting libvirtd after installing vdsm (but I think we already do that). I wanted to be polite, but please reconsider my ancient decision.

(Originally by danken)

Comment 5 RHV bug bot 2019-08-13 06:16:47 UTC
(In reply to Dan Kenigsberg from comment #4)
> I do not understand what takes so long. It may be the multiple libvirt calls
> we do.

According to bug 1720482 comment 5 only a view calls hurt already.

> 
> (In reply to Dominik Holler from comment #2)
> > Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool
> > nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ?
> > During the start of VDSM, maybe VDSM can just create the network filters if
> > they do not exist (or fail if they not exist)?
> 
> Historically, we wanted to make sure that the host has the most up-to-date
> definition of the filter, and we wanted to forget about it during run time.

Maybe re-creating the filter during `vdsm-tool configure` would be enough?

> If testing for the filter and creating it is SO slow, we CERTAINLY would not
> like to do it on every VM start up - it would slow them all down. If we take
> this path we must @momoize this.
> 

Already the one time re-creation during VDSM startup hurts, if VMs are running.

> Another option that I considered many many years ago, was to ship the filter
> xml in vdsm RPM, placing it into libvirt directory structure. This is quite
> rude, as the directory structure is not an API, and it requires restarting

Maybe we could trigger a script during vdsm installation, even I think
to trigger the filter creation during `vdsm-tool configure` would be more correct?

> libvirtd after installing vdsm (but I think we already do that). I wanted to
> be polite, but please reconsider my ancient decision.

(Originally by Dominik Holler)

Comment 6 RHV bug bot 2019-08-13 06:16:49 UTC
> Maybe re-creating the filter during `vdsm-tool configure` would be enough?

I think we had an issue of libvirt not always running when `vdsm-tool configure` is run. But I'm not sure.

> Already the one time re-creation during VDSM startup hurts, if VMs are running.

sure, that's why we must @memoize, so we run it once in vdsm lifetime.

(Originally by danken)

Comment 7 RHV bug bot 2019-08-13 06:16:51 UTC
(In reply to Dan Kenigsberg from comment #6)
> > Maybe re-creating the filter during `vdsm-tool configure` would be enough?
> 
> I think we had an issue of libvirt not always running when `vdsm-tool
> configure` is run. But I'm not sure.
> 

You are right.

(Originally by Dominik Holler)

Comment 8 RHV bug bot 2019-08-13 06:16:53 UTC
Daniel, can you help us to verify that the proposed code change fixes the problem?
A host applied the fix (just a single file is modified) has to restart vdsm succussfully, even if many VMs using the vdsm-no-mac-spoofing are running.
Interesting would be the output of
time vdsm-tool nwfilter
during the VMs are running, to ensure that we are not too close the critical 90 seconds.

(Originally by Dominik Holler)

Comment 12 RHV bug bot 2019-08-13 06:17:01 UTC
Restoring need info from comment 8

(Originally by Michael Burman)

Comment 15 RHV bug bot 2019-08-13 06:17:06 UTC
To confirm the reproduction steps:

#reproduce current state
Stop all vms on host
restart vdsmd 
start vms
run vdsm-tool nwfilter with time command and verify the amount of time taken.


#Verify change
Stop all vms on host
Stop vdsmd
swap nwfilter.py file from patch
Start vdsmd 
Start vms
vdsm-tool nwfilter with time command


Is the above correct?
what about stop and start order of the vms ?

(Originally by Mordechai Lehrer)

Comment 16 RHV bug bot 2019-08-13 06:17:08 UTC
(In reply to mlehrer from comment #15)
> 
> Is the above correct?


The steps are:
1. Create the scenario, especially many vNICs on a host
2. On the host with the many vNICs:
  a) 'time vdsm-tool nwfilter' should be slow and show error, e.g.:
[root@ovirt-master-host2 ~]# ip l | grep vnet | wc -l
44
[root@ovirt-master-host2 ~]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real	0m7.723s
user	0m0.188s
sys	0m0.051s

  b) swap nwfilter.py file from patch, e.g.:
[root@ovirt-master-host2 ~]# cp /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py.orig
[root@ovirt-master-host2 ~]# cp /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py.fixed /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py
cp: overwrite ‘/usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py’? y

  c) run 'time vdsm-tool nwfilter' again, should be faster than 1 second and no error message anymore, e.g.
[root@ovirt-master-host2 ~]# time vdsm-tool nwfilter

real	0m0.299s
user	0m0.216s
sys	0m0.054s

(Originally by Dominik Holler)

Comment 18 RHV bug bot 2019-08-13 06:17:12 UTC
*FYI*

I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1) service stucked. It was impossible to restart the host.

Changing nwfilter.py to the latest version available on github https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved our issue and we was able to start vdsmd.

(Originally by Florian Jacquin)

Comment 19 RHV bug bot 2019-08-13 06:17:14 UTC
Florian, thank you very much for sharing this!
Do you remember an approximate number of vNICs or VMs instantiated on this host?

(In reply to Florian JACQUIN from comment #18)
> *FYI*
> 
> I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1)
> service stucked. It was impossible to restart the host.
> 
> Changing nwfilter.py to the latest version available on github
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved
> our issue and we was able to start vdsmd.

(Originally by Dominik Holler)

Comment 20 RHV bug bot 2019-08-13 06:17:16 UTC
5 Bond with a total of 12 physical NIC
(In reply to Dominik Holler from comment #19)
> Florian, thank you very much for sharing this!
> Do you remember an approximate number of vNICs or VMs instantiated on this
> host?
> 
> (In reply to Florian JACQUIN from comment #18)
> > *FYI*
> > 
> > I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1)
> > service stucked. It was impossible to restart the host.
> > 
> > Changing nwfilter.py to the latest version available on github
> > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved
> > our issue and we was able to start vdsmd.

5 Bond with a total of 12 physical NIC

(Originally by Florian Jacquin)

Comment 21 RHV bug bot 2019-08-13 06:17:18 UTC
(In reply to Florian JACQUIN from comment #20)
> 5 Bond with a total of 12 physical NIC
> (In reply to Dominik Holler from comment #19)
> > Florian, thank you very much for sharing this!
> > Do you remember an approximate number of vNICs or VMs instantiated on this
> > host?
> > 
> > (In reply to Florian JACQUIN from comment #18)
> > > *FYI*
> > > 
> > > I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1)
> > > service stucked. It was impossible to restart the host.
> > > 
> > > Changing nwfilter.py to the latest version available on github
> > > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved
> > > our issue and we was able to start vdsmd.
> 
> 5 Bond with a total of 12 physical NIC

there was 191 VM on that host with minimum of 2 vNIC by VM

(Originally by Florian Jacquin)

Comment 22 RHV bug bot 2019-08-13 06:17:20 UTC
Tested on: 
----------
rhv-release-4.3.5-1-001
vdsm-4.30.19-1


1)   f01-h09-000-1029u.rdu2.scalelab.redhat.com
         151 VMs (VM with 1 NIC)

***orig nwfilter
[root@f01-h09-000-r620 ~]# ip l | grep vnet | wc -l
151

[root@f01-h09-000-r620 ~]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real	0m39.362s
user	0m0.407s
sys	0m0.084s

***fixed nwfilter
[root@f01-h09-000-r620 tool]# ip l | grep vnet | wc -l
151

[root@f01-h09-000-r620 tool]# time vdsm-tool nwfilter

real	0m0.511s
user	0m0.402s
sys	0m0.094s




2)   f01-h08-000-1029u.rdu2.scalelab.redhat.com
     257 VMs (VM_Ovirt* with 2 NICs per VM --> Nested Host)

***orig nwfilter
[root@f01-h08-000-1029u ~]# ip l | grep vnet | wc -l
514

[root@f01-h08-000-1029u ~]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real	2m3.472s
user	0m0.251s
sys	0m0.070s

***fixed nwfilter
[root@f01-h08-000-1029u tool]# ip l | grep vnet | wc -l
514

[root@f01-h08-000-1029u tool]# time vdsm-tool nwfilter

real	0m0.352s
user	0m0.263s
sys	0m0.075s

(Originally by David Vaanunu)

Comment 23 RHV bug bot 2019-08-13 06:17:21 UTC
David, thanks for checking the implementation!
I like to include the in RHV 4.3.6 .
Mordechai, can you please qa_ack this bug for 4.3.6?

(Originally by Dominik Holler)

Comment 26 David Vaanunu 2019-08-29 10:26:32 UTC
Tested on version:

rhv-release-4.3.6-3-001.noarch
vdsm-4.30.26-1.el7ev.x86_64



[root@f02-h18-000-r620 tool]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real	0m35.909s
user	0m0.399s
sys	0m0.092s
[root@f02-h18-000-r620 tool]# ip l | grep vnet | wc -l
147



[root@f02-h25-000-r620 tool]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real	0m37.681s
user	0m0.404s
sys	0m0.111s
[root@f02-h25-000-r620 tool]# ip l | grep vnet | wc -l
153



Verifed on version:

rhv-release-4.3.6-5-001.noarch
vdsm-4.30.29-1.el7ev.x86_64



[root@f02-h18-000-r620 tool]# time vdsm-tool nwfilter

real	0m0.511s
user	0m0.383s
sys	0m0.105s
[root@f02-h18-000-r620 tool]# ip l | grep vnet | wc -l
147



[root@f02-h25-000-r620 ~]# time vdsm-tool nwfilter

real	0m0.525s
user	0m0.411s
sys	0m0.089s
[root@f02-h25-000-r620 ~]# ip l | grep vnet | wc -l
153

Comment 29 errata-xmlrpc 2019-10-10 15:36:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3009


Note You need to log in before you can comment on or make changes to this bug.