Description of problem:

Shutting down multiple VMs with port mirroring set on the same network results in some of them failing to destroy, and therefore showing as not responsive in the UI. VDSM needs to be restarted to clean up.

```
MainProcess|jsonrpc/0::ERROR::2018-09-21 15:21:49,140::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in unsetPortMirroring
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/tc/__init__.py", line 108, in unsetPortMirroring
    _qdisc_del(network, kind='ingress')
  File "/usr/lib/python2.7/site-packages/vdsm/network/tc/__init__.py", line 143, in _qdisc_del
    qdisc.delete(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/tc/qdisc.py", line 57, in delete
    _wrapper.process_request(command)
  File "/usr/lib/python2.7/site-packages/vdsm/network/tc/_wrapper.py", line 41, in process_request
    raise TrafficControlException(retcode, err, command)
TrafficControlException: (22, 'RTNETLINK answers: Invalid argument', ['/sbin/tc', 'qdisc', 'del', 'dev', 'ovirtmgmt', 'ingress'])
```

Version-Release number of selected component (if applicable):
vdsm-4.20.35-1.el7.x86_64
iputils-20160308-10.el7.x86_64
kernel-3.10.0-862.11.6.el7.x86_64

How reproducible:
50%

Steps to Reproduce:
1. Start 3+ VMs, all with Port Mirroring set in the vNIC profile
2. Power them down at the same time

Actual results:
One or two VMs fail to destroy. The domain has quit, but VDSM still reports it to the engine.

Expected results:
Destroy succeeds; the VM reaches the Down state.

Additional info:
1) Once the VMs shut down, the vnetX devices are gone. If we have 3 VMs, all 3 netdevs (vnet0, vnet1 and vnet2) go away at once.
2) In _delTarget() below, the 'acts' list comprehension returns empty because the devices of the VMs do not exist anymore.
3) This makes one of the destroys evaluate the if to false and delete the entire qdisc instead of replacing the filters!
4) The subsequent VMs also find no actions/filters and try to delete the qdisc too; this throws an exception, leaving the VM existing on VDSM.

```
def _delTarget(network, parent, target):
    ...
    devices = set(link.name for link in ipwrapper.getLinks())       # [a]
    acts = [act for act in filt.actions
            if act.target in devices and act.target != target]      # [b]

    if acts:
        filt = Filter(prio=filt.prio, handle=filt.handle, actions=acts)
        _filter_replace(network, parent, filt)
    else:
        tc_filter.delete(network, filt.prio, parent=parent)         # [c]

    return acts
```

[a] This does not return the vnetX devices: because all N VMs were shut down simultaneously, the vnetX devices of the other VMs are also gone.
[b] None of the actions related to the now-gone vnetX devices make it into the 'acts' list.
[c] The if evaluates to false and the entire qdisc is deleted. The next VMs will also call destroy, try to do the same, and fail, because the qdisc does not exist anymore and so cannot be deleted, hence:

TrafficControlException: (22, 'RTNETLINK answers: Invalid argument', ['/sbin/tc', 'qdisc', 'del', 'dev', 'ovirtmgmt', 'ingress'])
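To make steps 2)-4) concrete, here is a hypothetical, self-contained simulation of the race. The tc layer is faked; only _delTarget()'s control flow is mirrored, and all names are illustrative, not vdsm's:

```python
# Fake tc state: one shared ingress qdisc, with mirroring filters per vnetX.
state = {'qdisc': True}
filters = {'vnet0', 'vnet1', 'vnet2'}

def del_target(devices, target):
    # [a]/[b]: every vnetX is already gone, so 'acts' is empty for ALL callers.
    acts = [d for d in filters if d in devices and d != target]
    if acts:
        pass                      # would replace the filter, keeping 'acts'
    elif state['qdisc']:
        state['qdisc'] = False    # [c]: first caller deletes the whole qdisc...
    else:
        # ...and every later caller hits RTNETLINK EINVAL, i.e. the
        # TrafficControlException(22, ...) from the log above.
        raise OSError(22, 'RTNETLINK answers: Invalid argument')

devices = set()                   # getLinks() shows no vnetX anymore
del_target(devices, 'vnet0')      # VM 1: deletes the entire qdisc
try:
    del_target(devices, 'vnet1')  # VM 2: fails, like the second destroy
except OSError as e:
    print('second destroy fails:', e)
```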
Can't reproduce this report. Tested several attempts with the steps in the description. Tried with both power off and shutdown. VM successfully powered off.
I was not able to reproduce it on my cluster with master either. I will be away for the next 5 days; I will try it with 4.2.6 when I'm back. Could you grant me access to a machine that reproduces it, by any chance?
(In reply to Michael Burman from comment #4)
> Can't reproduce this report.
> Tested several attempts with the steps in the description.
> Tried with both power off and shutdown. VM successfully powered off.

I've just reproduced it again. It's not every time that I can reproduce it; for me, a graceful shutdown via guest agent and 10+ VMs seems to be a 100% hit.

```
2018-10-04 13:08:42,775+1000 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:690)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 653, in dispatchLibvirtEvents
    v.onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5580, in onLibvirtLifecycleEvent
    self._onQemuDeath(exit_code, reason)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1067, in _onQemuDeath
    result = self.releaseVm()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5276, in releaseVm
    nic.name)
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
    **kwargs)
  File "<string>", line 2, in unsetPortMirroring
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
TrafficControlException: (22, 'RTNETLINK answers: Invalid argument', ['/sbin/tc', 'qdisc', 'del', 'dev', 'ovirtmgmt', 'ingress'])
```

(In reply to Petr Horáček from comment #5)
> I was not able to reproduce it on my cluster with master either. I will be
> away for the next 5 days; I will try it with 4.2.6 when I'm back. Could you
> grant me access to a machine that reproduces it, by any chance?

I don't even have that system anymore as I constantly break my test env ;) But the log above is from a freshly installed one; the same thing happens.

Also, looking at the code in master, I see the same race condition can happen: all vnet devices are already gone when VM destroy is called, but there are more VM destroys still to be executed. The first one removes the filter/qdisc, so the next VM destroy fails trying to remove it again.
(In reply to Germano Veit Michel from comment #6)
> I've just reproduced it again. It's not every time that I can reproduce it;
> for me, a graceful shutdown via guest agent and 10+ VMs seems to be a 100%
> hit.
[...]
> Also, looking at the code in master, I see the same race condition can
> happen: all vnet devices are already gone when VM destroy is called, but
> there are more VM destroys still to be executed. The first one removes the
> filter/qdisc, so the next VM destroy fails trying to remove it again.

OK, so let's maybe try to understand what the difference between us is :)

What are the OS versions on the host and the guests? Is it 7.6? I heard that there is some regression with tc on 7.6.

Also, I see that the network you are using to reproduce it is ovirtmgmt; I tried with another network, which is not the management one. Should I try with ovirtmgmt?

Thanks,
Correction - managed to reproduce with RHEL 7.5 guest images running on RHEL 7.6 hosts.
Michael, I can hit this on 7.5 hosts, and the customer ticket attached to this case hit it on 7.4. I don't think this is a tc/iproute issue; it looks like a race condition in VDSM to me, as explained in comment #0.

I use RHEL 7.5 guests to reproduce, shutting them down gracefully via the guest agent. Also, the network name should not make any difference: the customer hit it on a custom network and I hit it on ovirtmgmt.
We expect an "RTNETLINK answers: No such file or directory" error (2), yet in these cases we get an "RTNETLINK answers: Invalid argument" error (22). I think this is a tc/netlink race, as VDSM is serializing the mirroring commands: VDSM VM threads send the mirroring commands to the supervdsm process through the proxy connection (which is one per process, so VDSM has only one such connection). We need to try and recreate it using plain tc commands.
We seem to have handled this already for host QoS, by expecting several types of errors when an entity (a qdisc in this case) is missing. I have recreated this as follows:

```
tc qdisc add dev eth3 ingress
tc qdisc replace dev eth3 root prio
tc qdisc del dev eth3 root
tc qdisc del dev eth3 ingress
tc qdisc del dev eth3 root
tc qdisc del dev eth3 ingress
```
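For illustration, a minimal sketch of that error-tolerant approach, assuming only the (retcode, err, command) exception shape visible in the tracebacks above. The helper and the stand-in exception class are hypothetical; this is not the actual vdsm patch:

```python
ENOENT = 2    # 'RTNETLINK answers: No such file or directory'
EINVAL = 22   # 'RTNETLINK answers: Invalid argument' (seen for a missing
              # ingress qdisc in this bug)

class TrafficControlException(Exception):
    """Stand-in with the (retcode, err, command) shape seen in the logs."""

def qdisc_del_tolerant(delete, dev, kind='ingress'):
    """Delete a qdisc, treating 'entity already gone' errors as success."""
    try:
        delete(dev, kind)  # e.g. runs '/sbin/tc qdisc del dev <dev> <kind>'
    except TrafficControlException as e:
        if e.args[0] not in (ENOENT, EINVAL):
            raise  # a real failure, not just a missing entity

# Demo: a delete that fails exactly like the second VM destroy in this bug.
def _fake_delete(dev, kind):
    raise TrafficControlException(
        22, 'RTNETLINK answers: Invalid argument',
        ['/sbin/tc', 'qdisc', 'del', 'dev', dev, kind])

qdisc_del_tolerant(_fake_delete, 'ovirtmgmt')  # swallowed: qdisc already gone
```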
We've missed the 4.2.7 deadline for non-blockers. This would need to wait for 4.2.8.
Verified on - vdsm-4.30.0-631.gitac20250.el7.x86_64 with 4.3.0-0.0.master.20181010155922.gite89babf.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1077