Description of problem:
The vdsmd service may fail to start due to a timeout. The timeout is caused by the NWFilter configuration step, and ultimately by a suboptimal implementation in libvirt.

Version-Release number of selected component (if applicable):
vdsm-4.30.13-4.el7ev.x86_64

How reproducible:
Not completely clear. It is most visible in heavily loaded environments.

Steps to Reproduce:
1. # time vdsm-tool nwfilter

Actual results:
# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real    2m19.788s
user    0m0.291s
sys     0m0.092s

Expected results:
The nwfilter configuration is not triggered.

Additional info:
I am not sure why we run the nwfilter configuration when starting vdsm. The logic is the following:

- Look up the filter by nwfilterLookupByName. We only look for vdsm-no-mac-spoofing.
- If we do not find it, we call nwfilterDefineXML.
- If we do find it, we undefine it and then call nwfilterDefineXML.

The code is [1]. I think that if a VM has been created on the hypervisor, then it has to be a VM created by vdsm and the filters must already be set correctly, so there is no point in undefining and redefining. There is also no reason to define the filter if it is not there, as it may have been disabled on purpose. So I believe this part can be removed entirely, which would speed up the vdsmd restart. Besides, you cannot undefine a filter while VMs are running anyway; you will get:

libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

[1]
    try:
        old_filter = conn.nwfilterLookupByName(self.filterName)
    except libvirt.libvirtError:
        # No filter, we can define it.
        pass
    else:
        try:
            old_filter.undefine()
        except libvirt.libvirtError:
            # If the filter is in use it may fail to be removed. In that
            # case we warn of it and consider the task done.
            logging.warning('Failed to remove the old filter %s.',
                            old_filter.name(), exc_info=True)
            return
    nwFilter = conn.nwfilterDefineXML(self.buildFilterXml())
    logging.debug("Filter %s was defined", nwFilter.name())
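The "define only if missing" simplification suggested above can be sketched as follows. This is an illustrative sketch, not vdsm's actual code: `NwFilterError` and `StubConn` are hypothetical stand-ins for `libvirt.libvirtError` and a real libvirt connection, so the logic can be demonstrated without a running libvirtd.

```python
class NwFilterError(Exception):
    """Stands in for libvirt.libvirtError in this sketch."""


def ensure_filter(conn, name, xml):
    """Define the nwfilter only if it does not exist yet.

    Unlike the undefine-then-define cycle quoted above, this never
    calls undefine(), so it cannot hit "nwfilter is in use" while
    VMs are running. Returns True if the filter had to be defined.
    """
    try:
        conn.nwfilterLookupByName(name)
        return False  # already defined, nothing to do
    except NwFilterError:
        conn.nwfilterDefineXML(xml)
        return True


class StubConn:
    """Tiny stand-in for a libvirt connection, for demonstration only."""

    def __init__(self):
        self.filters = {}

    def nwfilterLookupByName(self, name):
        if name not in self.filters:
            raise NwFilterError(name)
        return self.filters[name]

    def nwfilterDefineXML(self, xml):
        # Crude name extraction; real libvirt parses the filter XML.
        name = xml.split("name='")[1].split("'")[0]
        self.filters[name] = xml
```

Running `ensure_filter()` twice against the stub shows the point: the second call finds the filter and does nothing, so restarting vdsm would no longer touch a filter that is in use.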
Dan and Petr, do you remember if there was a reason to keep "vdsm-tool nwfilter" in vdsmd_init_common.sh while fixing bug 1150718? During VDSM startup, maybe VDSM could just create the network filters if they do not exist (or fail if they do not exist)?
One more thing to add: keep in mind that vdsm-no-mac-spoofing can be disabled in the portal per network in the vNIC Profiles. In that case, a vdsm restart could cause vdsm-no-mac-spoofing to become enabled even though it is disabled in the UI.
I do not understand what takes so long. It may be the multiple libvirt calls we do.

(In reply to Roman Hodain from comment #3)
> One more thing to add. Keep in mind that the vdsm-no-mac-spoofing can be
> disabled in the portal per network in the vNIC Profiles. In that case, the
> vdsm restart could cause the vdsm-no-mac-spoofing to enable even though it
> is disabled in the UI.

This is not the case. Creating the filter is one thing, using it is another. In the oVirt APIs you can only opt out of using the filter for a vNIC profile, but it is always defined.

(In reply to Dominik Holler from comment #2)
> Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool
> nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ?
> During the start of VDSM, maybe VDSM can just create the network filters if
> they do not exist (or fail if they not exist)?

Historically, we wanted to make sure that the host has the most up-to-date definition of the filter, and we wanted to forget about it during run time. If testing for the filter and creating it is SO slow, we CERTAINLY would not like to do it on every VM startup - it would slow them all down. If we take this path we must @memoize this.

Another option that I considered many years ago was to ship the filter XML in the vdsm RPM, placing it into the libvirt directory structure. This is quite rude, as the directory structure is not an API, and it requires restarting libvirtd after installing vdsm (but I think we already do that). I wanted to be polite, but please reconsider my ancient decision.
(In reply to Dan Kenigsberg from comment #4)
> I do not understand what takes so long. It may be the multiple libvirt calls
> we do.

According to bug 1720482 comment 5, even a few calls already hurt.

> (In reply to Dominik Holler from comment #2)
> > Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool
> > nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ?
> > During the start of VDSM, maybe VDSM can just create the network filters if
> > they do not exist (or fail if they not exist)?
>
> Historically, we wanted to make sure that the host has the most up-to-date
> definition of the filter, and we wanted to forget about it during run time.

Maybe re-creating the filter during `vdsm-tool configure` would be enough?

> If testing for the filter and creating it is SO slow, we CERTAINLY would not
> like to do it on every VM start up - it would slow them all down. If we take
> this path we must @momoize this.

Already the one-time re-creation during VDSM startup hurts, if VMs are running.

> Another option that I considered many many years ago, was to ship the filter
> xml in vdsm RPM, placing it into libvirt directory structure. This is quite
> rude, as the directory structure is not an API, and it requires restarting
> libvirtd after installing vdsm (but I think we already do that). I wanted to
> be polite, but please reconsider my ancient decision.

Maybe we could trigger a script during vdsm installation, though I think triggering the filter creation during `vdsm-tool configure` would be more correct?
> Maybe re-creating the filter during `vdsm-tool configure` would be enough?

I think we had an issue with libvirt not always running when `vdsm-tool configure` is run. But I'm not sure.

> Already the one time re-creation during VDSM startup hurts, if VMs are running.

Sure, that's why we must @memoize, so we run it only once in vdsm's lifetime.
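The "@memoize, so we run it once in vdsm's lifetime" idea can be sketched with the standard library. vdsm has its own memoization helper, so using `functools.lru_cache` here is an assumption for illustration; the `call_count` dict only exists to make the once-per-process behavior visible.

```python
import functools

call_count = {"n": 0}  # visible side effect to show the body runs only once


@functools.lru_cache(maxsize=None)
def configure_nwfilters_once():
    """Placeholder for the real filter (re)creation.

    The cache guarantees the body executes only on the first call in
    this process; every later call returns the cached result, so VM
    startups after the first would not pay the libvirt round-trips.
    """
    call_count["n"] += 1
    return "configured"
```

Calling the function from every VM-start path would then be cheap: only the first caller in the vdsm process lifetime actually performs the work.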
(In reply to Dan Kenigsberg from comment #6)
> > Maybe re-creating the filter during `vdsm-tool configure` would be enough?
>
> I think we had an issue of libvirt not always running when `vdsm-tool
> configure` is run. But I'm not sure.

You are right.
Daniel, can you help us verify that the proposed code change fixes the problem? A host with the fix applied (just a single file is modified) has to restart vdsm successfully, even if many VMs using vdsm-no-mac-spoofing are running. It would also be interesting to see the output of 'time vdsm-tool nwfilter' while the VMs are running, to ensure that we are not too close to the critical 90 seconds.
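To keep an eye on the "critical 90 seconds" mentioned above (the service start timeout), the measurement can be wrapped in a small timing helper. This is a sketch; the 80-second warning threshold is an arbitrary safety margin, not a value from vdsm or systemd.

```python
import subprocess
import time


def timed_run(cmd, warn_after=80.0):
    """Run cmd, returning (elapsed_seconds, returncode).

    Prints a warning when the runtime approaches the 90-second
    service start timeout; warn_after is an illustrative margin.
    """
    start = time.monotonic()
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - start
    if elapsed > warn_after:
        print("WARNING: %r took %.1fs, close to the 90s timeout" % (cmd, elapsed))
    return elapsed, result.returncode
```

In the verification scenario one would call `timed_run(["vdsm-tool", "nwfilter"])` on the loaded host and record the elapsed time alongside the VM count.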
Restoring need info from comment 8
To confirm the reproduction steps:

# Reproduce current state
Stop all VMs on the host
Restart vdsmd
Start the VMs
Run 'vdsm-tool nwfilter' under the time command and note the time taken

# Verify the change
Stop all VMs on the host
Stop vdsmd
Swap in the nwfilter.py file from the patch
Start vdsmd
Start the VMs
Run 'vdsm-tool nwfilter' under the time command

Is the above correct? What about the stop and start order of the VMs?
(In reply to mlehrer from comment #15)
> Is the above correct?

The steps are:
1. Create the scenario, especially many vNICs on a host.
2. On the host with the many vNICs:

a) 'time vdsm-tool nwfilter' should be slow and show an error, e.g.:

[root@ovirt-master-host2 ~]# ip l | grep vnet | wc -l
44
[root@ovirt-master-host2 ~]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real    0m7.723s
user    0m0.188s
sys     0m0.051s

b) swap in the nwfilter.py file from the patch, e.g.:

[root@ovirt-master-host2 ~]# cp /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py.orig
[root@ovirt-master-host2 ~]# cp /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py.fixed /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py
cp: overwrite ‘/usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py’? y

c) run 'time vdsm-tool nwfilter' again; it should finish in under a second with no error message, e.g.:

[root@ovirt-master-host2 ~]# time vdsm-tool nwfilter

real    0m0.299s
user    0m0.216s
sys     0m0.054s
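The file swap in step (b) can be done a bit more defensively with a small helper that keeps a one-time backup, so the original can always be restored. This is a sketch; the function name is made up, and `cp -n` is used so a repeated run never overwrites the first `.orig` backup.

```shell
# swap_with_backup TARGET NEW_FILE
# Replaces TARGET with NEW_FILE, keeping TARGET.orig as a rollback copy.
swap_with_backup() {
    target="$1"
    new="$2"
    cp -n "$target" "$target.orig"   # -n: never clobber an existing backup
    cp "$new" "$target"
}
```

Usage on the host would then be `swap_with_backup /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py nwfilter.py.fixed`, with `cp "$target.orig" "$target"` as the rollback.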
*FYI*

I had an issue with an RHV host: after a few live migrations the vdsmd (v4.20.32-1) service got stuck, and it was impossible to restart the host. Changing nwfilter.py to the latest version available on GitHub (https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py) resolved our issue and we were able to start vdsmd.
Florian, thank you very much for sharing this! Do you remember an approximate number of vNICs or VMs instantiated on this host?

(In reply to Florian JACQUIN from comment #18)
> *FYI*
>
> I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1)
> service stucked. It was impossible to restart the host.
>
> Changing nwfilter.py to the latest version available on github
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved
> our issue and we was able to start vdsmd.
(In reply to Dominik Holler from comment #19)
> Florian, thank you very much for sharing this!
> Do you remember an approximate number of vNICs or VMs instantiated on this
> host?

5 bonds with a total of 12 physical NICs
(In reply to Florian JACQUIN from comment #20)
> 5 Bond with a total of 12 physical NIC

There were also 191 VMs on that host, with a minimum of 2 vNICs per VM.
Tested on:
----------
rhv-release-4.3.5-1-001
vdsm-4.30.19-1

1) f01-h09-000-1029u.rdu2.scalelab.redhat.com
151 VMs (VM with 1 NIC)

***orig nwfilter
[root@f01-h09-000-r620 ~]# ip l | grep vnet | wc -l
151
[root@f01-h09-000-r620 ~]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real    0m39.362s
user    0m0.407s
sys     0m0.084s

***fixed nwfilter
[root@f01-h09-000-r620 tool]# ip l | grep vnet | wc -l
151
[root@f01-h09-000-r620 tool]# time vdsm-tool nwfilter

real    0m0.511s
user    0m0.402s
sys     0m0.094s

2) f01-h08-000-1029u.rdu2.scalelab.redhat.com
257 VMs (VM_Ovirt* with 2 NICs per VM --> nested host)

***orig nwfilter
[root@f01-h08-000-1029u ~]# ip l | grep vnet | wc -l
514
[root@f01-h08-000-1029u ~]# time vdsm-tool nwfilter
libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use

real    2m3.472s
user    0m0.251s
sys     0m0.070s

***fixed nwfilter
[root@f01-h08-000-1029u tool]# ip l | grep vnet | wc -l
514
[root@f01-h08-000-1029u tool]# time vdsm-tool nwfilter

real    0m0.352s
user    0m0.263s
sys     0m0.075s
David, thanks for checking the implementation! I would like to include this in RHV 4.3.6. Mordechai, can you please qa_ack this bug for 4.3.6?
This bug has been cloned to the z-stream bug#1740498 clone job https://rhv-devops-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/bugzilla_clone_bugs/261/ failed
sync2jira
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
Hi - I see we tested this in 4.3.6 and it is now cloned to 4.4.

The verification steps for 4.4 are: with 150 VMs with 2 NICs each, 'time vdsm-tool nwfilter' returns in less than a second.

Can you confirm that
1) the fix is present in rhv-4.4.0-20 downstream
2) the verification steps written above are correct

Thanks.
(In reply to mlehrer from comment #40)
> Hi - I see we had tested this 4.3.6 and its now cloned to 4.4
>
> The verifications steps for 4.4 are
> 150 vms with 2 nics issue 'time vdsm-tool nwfilter' returns in less than a
> second.

Yes.

> Can you confirm the
> 1) fix is present in rhv-4.4.0-20 downstream

I can confirm that this has been on the 4.4 branch upstream for 6 months.

> 2) Verification steps written above are correct

Yes, they are. Please let me know the time it took if it took more than 0.5 seconds. I think even 1 second should be no problem, but I expect a much shorter time.

> Thanks.
Tested on:

rhv-release-4.4.0-13-999.noarch
vdsm-4.40.0-164

We have no VMs with 2 NICs right now; is it OK with 1 NIC?

Host running 160 VMs with 1 NIC:

[root@f01-h27-000-r620 ~]# time vdsm-tool nwfilter

real    0m0.825s
user    0m0.724s
sys     0m0.079s
[root@f01-h27-000-r620 ~]# ip l | grep vnet | wc -l
160
(In reply to David Vaanunu from comment #42)
> Have no VMs with 2 NICs right now, Is it OK with 1 NIC?
>
> Host running 160VMs with 1 NIC
>
> [root@f01-h27-000-r620 ~]# time vdsm-tool nwfilter
>
> real 0m0.825s
> user 0m0.724s
> sys 0m0.079s

Even though the 4.4 scenario uses 1 NIC per VM instead of two, we can see the impact of the improvement by comparing it to the previous testing in 4.3 (https://bugzilla.redhat.com/show_bug.cgi?id=1722854#c22), where 1 NIC was also used and the number of VMs was almost the same (150 vs. 160):

In 4.4 the resulting time is 0.8s with the fix.
In 4.3 the resulting time is 0.5s with the fix.
In 4.3 the resulting time is 40s without the fix.

Any objections to the 300 ms difference in fixed times, or do you feel it is important to test again with 2 NICs per VM? The 1-NIC-per-VM comparison already shows the improvement in 4.4, as it is under 1s. Let us know if this has to be tested with 2 NICs per VM, or whether you agree and we can move this to verified.
(In reply to mlehrer from comment #43)
> In 4.4 the resulting time is 0.8s with the fix.
> In 4.3 the resulting time is 0.5s with the fix.
> In 4.3 the resulting time is 40s without the fix.
>
> Let us know if this has to be tested with 2 nics per vm or you agree and we
> can move this to verified.

From my point of view, this is fine for RHV, because it does not hurt if a reboot takes a second more or less, and the initial bug was about minutes, not seconds. But it might be interesting for libvirt QE to get a hint about this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:3246