Bug 1722854 - Remove nwfilter configuration from the vdsmd service start
Summary: Remove nwfilter configuration from the vdsmd service start
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.4.0
Assignee: Dominik Holler
QA Contact: David Vaanunu
URL:
Whiteboard:
Depends On:
Blocks: 1740498
 
Reported: 2019-06-21 13:59 UTC by Roman Hodain
Modified: 2023-09-07 20:09 UTC
19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, while VDSM was starting, the definition of the network filter vdsm-no-mac-spoofing was removed and recreated to ensure the filter was up to date. This occasionally resulted in a timeout during the start of VDSM. The current release fixes this issue: instead of removing and recreating the filter, the vdsm-no-mac-spoofing filter is updated during the start of VDSM. This update takes less than a second, regardless of the number of vNICs using the filter.
Clone Of:
: 1740498
Environment:
Last Closed: 2020-08-04 13:27:06 UTC
oVirt Team: Scale
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4260231 0 Troubleshoot None Vdsmd service fails to start with a timeout error message 2019-07-02 08:09:45 UTC
Red Hat Product Errata RHEA-2020:3246 0 None None None 2020-08-04 13:27:38 UTC
oVirt gerrit 101577 0 'None' MERGED nwfilter: Update vdsm-no-mac-spoofing 2021-02-04 08:43:14 UTC
oVirt gerrit 102573 0 'None' MERGED nwfilter: Update vdsm-no-mac-spoofing 2021-02-04 08:43:15 UTC

Description Roman Hodain 2019-06-21 13:59:48 UTC
Description of problem:
The vdsmd service may fail to start due to a timeout. This timeout is caused by the nwfilter configuration step; the cause is mainly the suboptimal implementation in libvirt.

Version-Release number of selected component (if applicable):
vdsm-4.30.13-4.el7ev.x86_64

How reproducible:
Not completely clear. It is most visible in very loaded environments.

Steps to Reproduce:
1.# time vdsm-tool nwfilter

Actual results:
time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real    2m19.788s
user    0m0.291s
sys     0m0.092s


Expected results:
The nwfilter configuration is not triggered.

Additional info:

I am not sure why we call the nwfilter configuration when starting vdsm. The 
logic is the following:

    - Look up the filter by nwfilterLookupByName. We only look for vdsm-no-mac-spoofing
    - If we do not find it then we do nwfilterDefineXML
    - If we find it then we do undefine and then nwfilterDefineXML

The code is [1]. I think that if a VM has been created on the hypervisor, then it has to be a VM created by vdsm and the filters must be set correctly, so there is no point in undefining and defining again. Also, there is no reason to define the filter if it is not there, as it may have been disabled on purpose. So I believe this part can be completely removed, which would speed up the vdsmd restart. BTW, you cannot undefine the filter while VMs are running anyway; we will get

    libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use


[1]
        try:
            old_filter = conn.nwfilterLookupByName(self.filterName)
        except libvirt.libvirtError:  # No filter, we can define it.
            pass
        else:
            try:
                old_filter.undefine()
            except libvirt.libvirtError:
                # If the filter is in use it may fail to be removed. In that
                # case we warn of it and consider the task done.
                logging.warning('Failed to remove the old filter %s.',
                                old_filter.name(), exc_info=True)
                return

        nwFilter = conn.nwfilterDefineXML(self.buildFilterXml())
        logging.debug("Filter %s was defined", nwFilter.name())

Comment 2 Dominik Holler 2019-07-02 08:41:25 UTC
Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool nwfilter" in vdsmd_init_common.sh while fixing bug 1150718?
During the start of VDSM, maybe VDSM could just create the network filters if they do not exist (or fail if they do not exist)?

Comment 3 Roman Hodain 2019-07-02 08:58:13 UTC
One more thing to add: keep in mind that vdsm-no-mac-spoofing can be disabled in the portal per network in the vNIC Profiles. In that case, the vdsm restart could cause vdsm-no-mac-spoofing to be enabled even though it is disabled in the UI.

Comment 4 Dan Kenigsberg 2019-07-02 09:26:00 UTC
I do not understand what takes so long. It may be the multiple libvirt calls we do.

(In reply to Roman Hodain from comment #3)
> One more thing to add. Keep in mind that the vdsm-no-mac-spoofing can be
> disabled in the portal per network in the vNIC  Profiles. In that case, the
> vdsm restart could cause the vdsm-no-mac-spoofing to enable even though it
> is disabled in the UI.

This is not the case. Creating the filter is one thing, using it is another. In oVirt APIs you can only opt out from using the filter for a vnic profile, but it is always defined.


(In reply to Dominik Holler from comment #2)
> Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool
> nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ?
> During the start of VDSM, maybe VDSM can just create the network filters if
> they do not exist (or fail if they not exist)?

Historically, we wanted to make sure that the host has the most up-to-date definition of the filter, and we wanted to forget about it during run time.
If testing for the filter and creating it is SO slow, we CERTAINLY would not like to do it on every VM startup - it would slow them all down. If we take this path we must @memoize this.

Another option that I considered many many years ago, was to ship the filter xml in vdsm RPM, placing it into libvirt directory structure. This is quite rude, as the directory structure is not an API, and it requires restarting libvirtd after installing vdsm (but I think we already do that). I wanted to be polite, but please reconsider my ancient decision.

Comment 5 Dominik Holler 2019-07-02 09:45:51 UTC
(In reply to Dan Kenigsberg from comment #4)
> I do not understand what takes so long. It may be the multiple libvirt calls
> we do.

According to bug 1720482 comment 5, even a few calls already hurt.

> 
> (In reply to Dominik Holler from comment #2)
> > Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool
> > nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ?
> > During the start of VDSM, maybe VDSM can just create the network filters if
> > they do not exist (or fail if they not exist)?
> 
> Historically, we wanted to make sure that the host has the most up-to-date
> definition of the filter, and we wanted to forget about it during run time.

Maybe re-creating the filter during `vdsm-tool configure` would be enough?

> If testing for the filter and creating it is SO slow, we CERTAINLY would not
> like to do it on every VM start up - it would slow them all down. If we take
> this path we must @momoize this.
> 

Even the one-time re-creation during VDSM startup hurts if VMs are running.

> Another option that I considered many many years ago, was to ship the filter
> xml in vdsm RPM, placing it into libvirt directory structure. This is quite
> rude, as the directory structure is not an API, and it requires restarting

Maybe we could trigger a script during vdsm installation, though I think
triggering the filter creation during `vdsm-tool configure` would be more correct.

> libvirtd after installing vdsm (but I think we already do that). I wanted to
> be polite, but please reconsider my ancient decision.

Comment 6 Dan Kenigsberg 2019-07-02 10:05:17 UTC
> Maybe re-creating the filter during `vdsm-tool configure` would be enough?

I think we had an issue of libvirt not always running when `vdsm-tool configure` is run. But I'm not sure.

> Already the one time re-creation during VDSM startup hurts, if VMs are running.

sure, that's why we must @memoize, so we run it once in vdsm lifetime.
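To illustrate the memoization idea (a hypothetical sketch, not actual vdsm code; it assumes Python 3's functools.lru_cache and an illustrative FILTER_XML constant), the filter check would hit libvirt at most once per process, so calling it on every VM start becomes cheap:

    import functools

    # Illustrative placeholder; the real XML is built by vdsm's nwfilter tool.
    FILTER_XML = "<filter name='vdsm-no-mac-spoofing' chain='root'/>"

    @functools.lru_cache(maxsize=1)
    def ensure_no_mac_spoofing_filter(conn):
        """Define the vdsm-no-mac-spoofing filter once per process.

        'conn' is an open libvirt connection. Thanks to lru_cache,
        repeated calls (e.g. on every VM start) return immediately
        without touching libvirt again.
        """
        conn.nwfilterDefineXML(FILTER_XML)
        return True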

Comment 7 Dominik Holler 2019-07-08 10:21:46 UTC
(In reply to Dan Kenigsberg from comment #6)
> > Maybe re-creating the filter during `vdsm-tool configure` would be enough?
> 
> I think we had an issue of libvirt not always running when `vdsm-tool
> configure` is run. But I'm not sure.
> 

You are right.

Comment 8 Dominik Holler 2019-07-08 11:24:57 UTC
Daniel, can you help us verify that the proposed code change fixes the problem?
A host with the fix applied (just a single file is modified) has to restart vdsm successfully, even if many VMs using the vdsm-no-mac-spoofing filter are running.
The output of
time vdsm-tool nwfilter
while the VMs are running would also be interesting, to ensure that we are not too close to the critical 90 seconds.

Comment 12 Michael Burman 2019-07-22 12:09:14 UTC
Restoring need info from comment 8

Comment 15 mlehrer 2019-07-24 11:13:47 UTC
To confirm the reproduction steps:

#reproduce current state
Stop all VMs on the host
Restart vdsmd
Start VMs
Run 'vdsm-tool nwfilter' with the time command and verify the amount of time taken.


#Verify change
Stop all VMs on the host
Stop vdsmd
Swap in the nwfilter.py file from the patch
Start vdsmd
Start VMs
Run 'vdsm-tool nwfilter' with the time command


Is the above correct?
What about the stop and start order of the VMs?

Comment 16 Dominik Holler 2019-07-29 16:04:30 UTC
(In reply to mlehrer from comment #15)
> 
> Is the above correct?


The steps are:
1. Create the scenario, especially many vNICs on a host
2. On the host with the many vNICs:
  a) 'time vdsm-tool nwfilter' should be slow and show an error, e.g.:
[root@ovirt-master-host2 ~]# ip l | grep vnet | wc -l
44
[root@ovirt-master-host2 ~]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real	0m7.723s
user	0m0.188s
sys	0m0.051s

  b) swap in the nwfilter.py file from the patch, e.g.:
[root@ovirt-master-host2 ~]# cp /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py.orig
[root@ovirt-master-host2 ~]# cp /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py.fixed /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py
cp: overwrite ‘/usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py’? y

  c) run 'time vdsm-tool nwfilter' again; it should take less than 1 second and show no error message anymore, e.g.:
[root@ovirt-master-host2 ~]# time vdsm-tool nwfilter

real	0m0.299s
user	0m0.216s
sys	0m0.054s
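A small helper to automate the timing check from step (c) could look like this (hypothetical; it only assumes vdsm-tool is on PATH and uses a 1-second budget, well below the critical 90 seconds):

    import subprocess
    import sys
    import time

    def nwfilter_within_budget(budget_seconds=1.0):
        """Run 'vdsm-tool nwfilter' and check it finishes within the budget."""
        start = time.monotonic()
        result = subprocess.run(['vdsm-tool', 'nwfilter'],
                                capture_output=True, text=True)
        elapsed = time.monotonic() - start
        print('elapsed: %.3fs, rc=%d, stderr: %s'
              % (elapsed, result.returncode, result.stderr.strip()))
        return result.returncode == 0 and elapsed < budget_seconds

    if __name__ == '__main__':
        sys.exit(0 if nwfilter_within_budget() else 1)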

Comment 18 Florian JACQUIN 2019-08-01 13:39:28 UTC
*FYI*

I had an issue with an RHV host: after a few live migrations, the vdsmd (v4.20.32-1) service got stuck. It was impossible to restart the host.

Changing nwfilter.py to the latest version available on github https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved our issue and we were able to start vdsmd.

Comment 19 Dominik Holler 2019-08-01 13:52:22 UTC
Florian, thank you very much for sharing this!
Do you remember an approximate number of vNICs or VMs instantiated on this host?

(In reply to Florian JACQUIN from comment #18)
> *FYI*
> 
> I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1)
> service stucked. It was impossible to restart the host.
> 
> Changing nwfilter.py to the latest version available on github
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved
> our issue and we was able to start vdsmd.

Comment 20 Florian JACQUIN 2019-08-01 14:01:12 UTC
5 bonds with a total of 12 physical NICs
(In reply to Dominik Holler from comment #19)
> Florian, thank you very much for sharing this!
> Do you remember an approximate number of vNICs or VMs instantiated on this
> host?
> 
> (In reply to Florian JACQUIN from comment #18)
> > *FYI*
> > 
> > I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1)
> > service stucked. It was impossible to restart the host.
> > 
> > Changing nwfilter.py to the latest version available on github
> > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved
> > our issue and we was able to start vdsmd.


Comment 21 Florian JACQUIN 2019-08-01 14:06:38 UTC
(In reply to Florian JACQUIN from comment #20)
> 5 Bond with a total of 12 physical NIC
> (In reply to Dominik Holler from comment #19)
> > Florian, thank you very much for sharing this!
> > Do you remember an approximate number of vNICs or VMs instantiated on this
> > host?
> > 
> > (In reply to Florian JACQUIN from comment #18)
> > > *FYI*
> > > 
> > > I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1)
> > > service stucked. It was impossible to restart the host.
> > > 
> > > Changing nwfilter.py to the latest version available on github
> > > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved
> > > our issue and we was able to start vdsmd.
> 
> 5 Bond with a total of 12 physical NIC

There were 191 VMs on that host, with a minimum of 2 vNICs per VM.

Comment 22 David Vaanunu 2019-08-11 11:29:00 UTC
Tested on: 
----------
rhv-release-4.3.5-1-001
vdsm-4.30.19-1


1)   f01-h09-000-1029u.rdu2.scalelab.redhat.com
         151 VMs (VM with 1 NIC)

***orig nwfilter
[root@f01-h09-000-r620 ~]# ip l | grep vnet | wc -l
151

[root@f01-h09-000-r620 ~]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real	0m39.362s
user	0m0.407s
sys	0m0.084s

***fixed nwfilter
[root@f01-h09-000-r620 tool]# ip l | grep vnet | wc -l
151

[root@f01-h09-000-r620 tool]# time vdsm-tool nwfilter

real	0m0.511s
user	0m0.402s
sys	0m0.094s




2)   f01-h08-000-1029u.rdu2.scalelab.redhat.com
     257 VMs (VM_Ovirt* with 2 NICs per VM --> Nested Host)

***orig nwfilter
[root@f01-h08-000-1029u ~]# ip l | grep vnet | wc -l
514

[root@f01-h08-000-1029u ~]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real	2m3.472s
user	0m0.251s
sys	0m0.070s

***fixed nwfilter
[root@f01-h08-000-1029u tool]# ip l | grep vnet | wc -l
514

[root@f01-h08-000-1029u tool]# time vdsm-tool nwfilter

real	0m0.352s
user	0m0.263s
sys	0m0.075s

Comment 23 Dominik Holler 2019-08-12 05:46:36 UTC
David, thanks for checking the implementation!
I would like to include this in RHV 4.3.6.
Mordechai, can you please qa_ack this bug for 4.3.6?

Comment 25 Dominik Holler 2019-08-13 06:26:22 UTC
This bug has been cloned to the z-stream bug#1740498
The clone job https://rhv-devops-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/bugzilla_clone_bugs/261/ failed.

Comment 26 Daniel Gur 2019-08-28 13:14:07 UTC
sync2jira

Comment 27 Daniel Gur 2019-08-28 13:18:23 UTC
sync2jira

Comment 28 RHV bug bot 2019-10-22 17:26:08 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 29 RHV bug bot 2019-10-22 17:39:42 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 30 RHV bug bot 2019-10-22 17:46:53 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 31 RHV bug bot 2019-10-22 18:02:42 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 32 RHV bug bot 2019-11-19 11:53:41 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 33 RHV bug bot 2019-11-19 12:03:40 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 35 RHV bug bot 2019-12-13 13:17:40 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 36 RHV bug bot 2019-12-20 17:46:48 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 37 RHV bug bot 2020-01-08 14:50:42 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 38 RHV bug bot 2020-01-08 15:20:50 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 39 RHV bug bot 2020-01-24 19:52:24 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 40 mlehrer 2020-02-20 11:19:02 UTC
Hi - I see we had tested this in 4.3.6 and it is now cloned to 4.4.

The verification steps for 4.4 are:
with 150 VMs with 2 NICs, 'time vdsm-tool nwfilter' returns in less than a second.

Can you confirm that
1) the fix is present in rhv-4.4.0-20 downstream
2) the verification steps written above are correct

Thanks.

Comment 41 Dominik Holler 2020-02-20 11:25:12 UTC
(In reply to mlehrer from comment #40)
> Hi - I see we had tested this 4.3.6 and its now cloned to 4.4
> 
> The verifications steps for 4.4 are
> 150 vms with 2 nics issue 'time vdsm-tool nwfilter'  returns in less than a
> second.
> 

Yes.

> Can you confirm the 
> 1) fix is present in rhv-4.4.0-20 downstream

I can confirm that this has been on the 4.4 branch upstream for 6 months.

> 2) Verification steps written above are correct
> 

Yes, they are.
Please let me know the time it took, if it took more than 0.5 seconds.
I think even 1 second should be no problem, but I expect a much shorter time.

> Thanks.

Comment 42 David Vaanunu 2020-03-04 08:13:09 UTC
Tested on:

rhv-release-4.4.0-13-999.noarch
vdsm-4.40.0-164


I have no VMs with 2 NICs right now; is it OK with 1 NIC?


Host running 160 VMs with 1 NIC

[root@f01-h27-000-r620 ~]# time vdsm-tool nwfilter

real	0m0.825s
user	0m0.724s
sys	0m0.079s
[root@f01-h27-000-r620 ~]# ip l | grep vnet | wc -l
160

Comment 43 mlehrer 2020-03-08 15:23:36 UTC
(In reply to David Vaanunu from comment #42)
> Tested on:
> 
> rhv-release-4.4.0-13-999.noarch
> vdsm-4.40.0-164
> 
> 
> Have no VMs with 2 NICs right now, Is it OK with 1 NIC?
> 
> 
> Host running 160VMs with 1 NIC
> 
> [root@f01-h27-000-r620 ~]# time vdsm-tool nwfilter
> 
> real	0m0.825s
> user	0m0.724s
> sys	0m0.079s
> [root@f01-h27-000-r620 ~]# ip l | grep vnet | wc -l
> 160


Even though the 4.4 scenario uses 1 NIC per VM instead of two, we can see the impact of the improvement by comparing it to the previous testing in 4.3 (https://bugzilla.redhat.com/show_bug.cgi?id=1722854#c22), where 1 NIC was also used and the number of VMs was almost the same (150 vs 160).

In 4.4 the resulting time is 0.8s with the fix.
In 4.3 the resulting time is 0.5s with the fix.
In 4.3 the resulting time is 40s without the fix.



Any objections to the 300 ms difference in fixed times, or do you feel it is important to test again with 2 NICs per VM?
The comparison with 1 NIC per VM shows the improvement in 4.4, as it is under 1 s.
Let us know if this has to be tested with 2 NICs per VM, or if you agree and we can move this to verified.

Comment 44 Dominik Holler 2020-03-09 07:29:53 UTC
(In reply to mlehrer from comment #43)

> 
> 
> Even though the 4.4 scenario uses 1 nic per vm instead of two we can see the
> impact of the improvement by comparing it to previous testing in 4.3 where 1
> nic was also used https://bugzilla.redhat.com/show_bug.cgi?id=1722854#c22
> and the number of vms were almost the same 150 vs 160
> 
> In 4.4 the resulting time is 0.8s with the fix.
> In 4.3 the resulting time is 0.5s with the fix.
> In 4.3 the resulting time is 40s without the fix.
> 
> 
> 
> Any objections with the difference of 300 ms difference in fixed times or if
> you feel it is important to test again with 2 nics per VM.
> The comparison of 1 nic per vm show the improvement in 4.4 as its under 1s.
> Let us know if this has to be tested with 2 nics per vm or you agree and we
> can move this to verified.

From my point of view, this is fine for RHV, because it does not hurt if a reboot takes a second more or less,
and the initial bug was about minutes, not seconds.
But it might be interesting for libvirt QE to get a hint about this.

Comment 51 errata-xmlrpc 2020-08-04 13:27:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246

