+++ This bug is a downstream clone. The original bug is: +++ +++ bug 1722854 +++ ====================================================================== Description of problem: Vdsmd service may fail to start due to timeout. This timeout is caused by NWFilter configuration. The cause is mainly the suboptimal implementation in libvirt. Version-Release number of selected component (if applicable): vdsm-4.30.13-4.el7ev.x86_64 How reproducible: Not completely clear. It is most visible in very loaded environments. Steps to Reproduce: 1.# time vdsm-tool nwfilter Actual results: time vdsm-tool nwfilter libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use real 2m19.788s user 0m0.291s sys 0m0.092s Expected results: The nwfilter configuration is not triggered. Additional info: I am not sure why we call the nwfilter configuration when starting vdsm. The logic is the following: - Look up the filter by nwfilterLookupByName. We only look for vdsm-no-mac-spoofing - If we do not find it then we do nwfilterDefineXML - If we find it then we do undefine and then nwfilterDefineXML The code is [1]. I think that if a VM has created on the hypervisor then it has to be a VM created by vdsm and the filters must to be set correctly. So there is no point to undefine and defice. Also, there is not reason to define it if it is not there as the filter may be disable for a purpose. So I believe that this part can be completely removed and speed up the vdsmd restart. BTW you cannot undefine any filter as the VMs are runinng anyway. we will get libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use [1] try: old_filter = conn.nwfilterLookupByName(self.filterName) except libvirt.libvirtError: # No filter, we can define it. pass else: try: old_filter.undefine() except libvirt.libvirtError: # If the filter is in use it may fail to be removed. In that # case we warn of it and consider the task done. logging.warning('Failed to remove the old filter %s.', old_filter.name(), exc_info=True) return nwFilter = conn.nwfilterDefineXML(self.buildFilterXml()) logging.debug("Filter %s was defined", nwFilter.name()) (Originally by Roman Hodain)
Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ? During the start of VDSM, maybe VDSM can just create the network filters if they do not exist (or fail if they not exist)? (Originally by Dominik Holler)
One more thing to add. Keep in mind that the vdsm-no-mac-spoofing can be disabled in the portal per network in the vNIC Profiles. In that case, the vdsm restart could cause the vdsm-no-mac-spoofing to enable even though it is disabled in the UI. (Originally by Roman Hodain)
I do not understand what takes so long. It may be the multiple libvirt calls we do. (In reply to Roman Hodain from comment #3) > One more thing to add. Keep in mind that the vdsm-no-mac-spoofing can be > disabled in the portal per network in the vNIC Profiles. In that case, the > vdsm restart could cause the vdsm-no-mac-spoofing to enable even though it > is disabled in the UI. This is not the case. Creating the filter is one thing, using it is another. In oVirt APIs you can only opt out from using the filter for a vnic profile, but it is always defined. (In reply to Dominik Holler from comment #2) > Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool > nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ? > During the start of VDSM, maybe VDSM can just create the network filters if > they do not exist (or fail if they not exist)? Historically, we wanted to make sure that the host has the most up-to-date definition of the filter, and we wanted to forget about it during run time. If testing for the filter and creating it is SO slow, we CERTAINLY would not like to do it on every VM start up - it would slow them all down. If we take this path we must @momoize this. Another option that I considered many many years ago, was to ship the filter xml in vdsm RPM, placing it into libvirt directory structure. This is quite rude, as the directory structure is not an API, and it requires restarting libvirtd after installing vdsm (but I think we already do that). I wanted to be polite, but please reconsider my ancient decision. (Originally by danken)
(In reply to Dan Kenigsberg from comment #4) > I do not understand what takes so long. It may be the multiple libvirt calls > we do. According to bug 1720482 comment 5 only a view calls hurt already. > > (In reply to Dominik Holler from comment #2) > > Dan and Petr, do you remember if there was a reason to keep the "vdsm-tool > > nwfilter" in vdsmd_init_common.sh during fixing bug 1150718 ? > > During the start of VDSM, maybe VDSM can just create the network filters if > > they do not exist (or fail if they not exist)? > > Historically, we wanted to make sure that the host has the most up-to-date > definition of the filter, and we wanted to forget about it during run time. Maybe re-creating the filter during `vdsm-tool configure` would be enough? > If testing for the filter and creating it is SO slow, we CERTAINLY would not > like to do it on every VM start up - it would slow them all down. If we take > this path we must @momoize this. > Already the one time re-creation during VDSM startup hurts, if VMs are running. > Another option that I considered many many years ago, was to ship the filter > xml in vdsm RPM, placing it into libvirt directory structure. This is quite > rude, as the directory structure is not an API, and it requires restarting Maybe we could trigger a script during vdsm installation, even I think to trigger the filter creation during `vdsm-tool configure` would be more correct? > libvirtd after installing vdsm (but I think we already do that). I wanted to > be polite, but please reconsider my ancient decision. (Originally by Dominik Holler)
> Maybe re-creating the filter during `vdsm-tool configure` would be enough? I think we had an issue of libvirt not always running when `vdsm-tool configure` is run. But I'm not sure. > Already the one time re-creation during VDSM startup hurts, if VMs are running. sure, that's why we must @memoize, so we run it once in vdsm lifetime. (Originally by danken)
(In reply to Dan Kenigsberg from comment #6) > > Maybe re-creating the filter during `vdsm-tool configure` would be enough? > > I think we had an issue of libvirt not always running when `vdsm-tool > configure` is run. But I'm not sure. > You are right. (Originally by Dominik Holler)
Daniel, can you help us to verify that the proposed code change fixes the problem? A host applied the fix (just a single file is modified) has to restart vdsm succussfully, even if many VMs using the vdsm-no-mac-spoofing are running. Interesting would be the output of time vdsm-tool nwfilter during the VMs are running, to ensure that we are not too close the critical 90 seconds. (Originally by Dominik Holler)
Restoring need info from comment 8 (Originally by Michael Burman)
To confirm the reproduction steps: #reproduce current state Stop all vms on host restart vdsmd start vms run vdsm-tool nwfilter with time command and verify the amount of time taken. #Verify change Stop all vms on host Stop vdsmd swap nwfilter.py file from patch Start vdsmd Start vms vdsm-tool nwfilter with time command Is the above correct? what about stop and start order of the vms ? (Originally by Mordechai Lehrer)
(In reply to mlehrer from comment #15) > > Is the above correct? The steps are: 1. Create the scenario, especially many vNICs on a host 2. On the host with the many vNICs: a) 'time vdsm-tool nwfilter' should be slow and show error, e.g.: [root@ovirt-master-host2 ~]# ip l | grep vnet | wc -l 44 [root@ovirt-master-host2 ~]# time vdsm-tool nwfilter libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use real 0m7.723s user 0m0.188s sys 0m0.051s b) swap nwfilter.py file from patch, e.g.: [root@ovirt-master-host2 ~]# cp /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py.orig [root@ovirt-master-host2 ~]# cp /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py.fixed /usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py cp: overwrite ‘/usr/lib/python2.7/site-packages/vdsm/tool/nwfilter.py’? y c) run 'time vdsm-tool nwfilter' again, should be faster than 1 second and no error message anymore, e.g. [root@ovirt-master-host2 ~]# time vdsm-tool nwfilter real 0m0.299s user 0m0.216s sys 0m0.054s (Originally by Dominik Holler)
*FYI* I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1) service stucked. It was impossible to restart the host. Changing nwfilter.py to the latest version available on github https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved our issue and we was able to start vdsmd. (Originally by Florian Jacquin)
Florian, thank you very much for sharing this! Do you remember an approximate number of vNICs or VMs instantiated on this host? (In reply to Florian JACQUIN from comment #18) > *FYI* > > I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1) > service stucked. It was impossible to restart the host. > > Changing nwfilter.py to the latest version available on github > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved > our issue and we was able to start vdsmd. (Originally by Dominik Holler)
5 Bond with a total of 12 physical NIC (In reply to Dominik Holler from comment #19) > Florian, thank you very much for sharing this! > Do you remember an approximate number of vNICs or VMs instantiated on this > host? > > (In reply to Florian JACQUIN from comment #18) > > *FYI* > > > > I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1) > > service stucked. It was impossible to restart the host. > > > > Changing nwfilter.py to the latest version available on github > > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved > > our issue and we was able to start vdsmd. 5 Bond with a total of 12 physical NIC (Originally by Florian Jacquin)
(In reply to Florian JACQUIN from comment #20) > 5 Bond with a total of 12 physical NIC > (In reply to Dominik Holler from comment #19) > > Florian, thank you very much for sharing this! > > Do you remember an approximate number of vNICs or VMs instantiated on this > > host? > > > > (In reply to Florian JACQUIN from comment #18) > > > *FYI* > > > > > > I had an issue with an RHV host, after few live migration vdsmd (v4.20.32-1) > > > service stucked. It was impossible to restart the host. > > > > > > Changing nwfilter.py to the latest version available on github > > > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/nwfilter.py resolved > > > our issue and we was able to start vdsmd. > > 5 Bond with a total of 12 physical NIC there was 191 VM on that host with minimum of 2 vNIC by VM (Originally by Florian Jacquin)
Tested on: ---------- rhv-release-4.3.5-1-001 vdsm-4.30.19-1 1) f01-h09-000-1029u.rdu2.scalelab.redhat.com 151 VMs (VM with 1 NIC) ***orig nwfilter [root@f01-h09-000-r620 ~]# ip l | grep vnet | wc -l 151 [root@f01-h09-000-r620 ~]# time vdsm-tool nwfilter libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use real 0m39.362s user 0m0.407s sys 0m0.084s ***fixed nwfilter [root@f01-h09-000-r620 tool]# ip l | grep vnet | wc -l 151 [root@f01-h09-000-r620 tool]# time vdsm-tool nwfilter real 0m0.511s user 0m0.402s sys 0m0.094s 2) f01-h08-000-1029u.rdu2.scalelab.redhat.com 257 VMs (VM_Ovirt* with 2 NICs per VM --> Nested Host) ***orig nwfilter [root@f01-h08-000-1029u ~]# ip l | grep vnet | wc -l 514 [root@f01-h08-000-1029u ~]# time vdsm-tool nwfilter libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use real 2m3.472s user 0m0.251s sys 0m0.070s ***fixed nwfilter [root@f01-h08-000-1029u tool]# ip l | grep vnet | wc -l 514 [root@f01-h08-000-1029u tool]# time vdsm-tool nwfilter real 0m0.352s user 0m0.263s sys 0m0.075s (Originally by David Vaanunu)
David, thanks for checking the implementation! I like to include the in RHV 4.3.6 . Mordechai, can you please qa_ack this bug for 4.3.6? (Originally by Dominik Holler)
Tested on version: rhv-release-4.3.6-3-001.noarch vdsm-4.30.26-1.el7ev.x86_64 [root@f02-h18-000-r620 tool]# time vdsm-tool nwfilter libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use real 0m35.909s user 0m0.399s sys 0m0.092s [root@f02-h18-000-r620 tool]# ip l | grep vnet | wc -l 147 [root@f02-h25-000-r620 tool]# time vdsm-tool nwfilter libvirt: Network Filter Driver error : Requested operation is not valid: nwfilter is in use real 0m37.681s user 0m0.404s sys 0m0.111s [root@f02-h25-000-r620 tool]# ip l | grep vnet | wc -l 153 Verifed on version: rhv-release-4.3.6-5-001.noarch vdsm-4.30.29-1.el7ev.x86_64 [root@f02-h18-000-r620 tool]# time vdsm-tool nwfilter real 0m0.511s user 0m0.383s sys 0m0.105s [root@f02-h18-000-r620 tool]# ip l | grep vnet | wc -l 147 [root@f02-h25-000-r620 ~]# time vdsm-tool nwfilter real 0m0.525s user 0m0.411s sys 0m0.089s [root@f02-h25-000-r620 ~]# ip l | grep vnet | wc -l 153
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3009