Bug 1631624
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Exception on unsetPortMirroring makes vmDestroy fail. | | |
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | vdsm | Assignee: | Edward Haas <edwardh> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Burman <mburman> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.6 | CC: | danken, dholler, gveitmic, lsurette, phoracek, srevivo, ycui |
| Target Milestone: | ovirt-4.3.0 | Keywords: | ZStream |
| Target Release: | 4.3.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.30.3 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1637549 (view as bug list) | Environment: | |
| Last Closed: | 2019-05-08 12:36:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1637549 | | |
Description (Germano Veit Michel, 2018-09-21 06:12:45 UTC)
Can't reproduce this report. Tested several attempts with the steps in the description. Tried with both power off and shutdown. VM successfully powered off.

I was not able to reproduce it on my cluster with master either. I will be away for the next 5 days. I will try it with 4.2.6 when I'm back. Could you grant me access to a machine that reproduces it by any chance?

(In reply to Michael Burman from comment #4)
> Can't reproduce this report.
> Tested several attempts with the steps in the description.
> Tried with both power off and shutdown. VM successfully powered off.

I've just reproduced it again. It's not every time that I can reproduce it. For me, a graceful shutdown via the guest agent with 10+ VMs is a 100% hit.

```
2018-10-04 13:08:42,775+1000 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:690)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 653, in dispatchLibvirtEvents
    v.onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5580, in onLibvirtLifecycleEvent
    self._onQemuDeath(exit_code, reason)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1067, in _onQemuDeath
    result = self.releaseVm()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5276, in releaseVm
    nic.name)
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
    **kwargs)
  File "<string>", line 2, in unsetPortMirroring
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
TrafficControlException: (22, 'RTNETLINK answers: Invalid argument', ['/sbin/tc', 'qdisc', 'del', 'dev', 'ovirtmgmt', 'ingress'])
```

(In reply to Petr Horáček from comment #5)
> I was not able to reproduce it on my cluster with master either. I will be
> away for the next 5 days. I will try it with 4.2.6 when I'm back. Could you
> grant me access to a machine that reproduces it by any chance?

I don't even have that system anymore as I constantly break my test env ;) But the log above is from a freshly installed one; the same thing happens.

Also, looking at the code in master, I see the same race condition can happen: all the vnet devices are already gone when a VM destroy is called, but there are more VM destroys still to be executed. The first destroy removed the filter, so the next VM destroy fails trying to remove it again.

(In reply to Germano Veit Michel from comment #6)
> I've just reproduced it again. It's not every time that I can reproduce it.
> [...]
> Also, looking at the code in master, I see the same race condition can
> happen: all the vnet devices are already gone when a VM destroy is called,
> but there are more VM destroys still to be executed. The first destroy
> removed the filter, so the next VM destroy fails trying to remove it again.

Ok, so let's maybe try to understand what the difference between us is :) What are the OS versions on the hosts and the guests? Is it 7.6? I heard that there is some regression with tc on 7.6. Also, I see that the network you are using to reproduce it is ovirtmgmt; I tried with another network which is not the management one. Should I try with ovirtmgmt? Thanks

Correction - managed to reproduce with RHEL 7.5 images running on RHEL 7.6 hosts.

Michael, I can hit this on 7.5 hosts, and the customer ticket attached to this case is on 7.4. I don't think this is a tc/iproute issue; it looks like a race condition in vdsm to me, as explained in comment #0. I use RHEL 7.5 guests to reproduce, shutting down gracefully via the guest agent. Also, the network name should not make any difference: the customer hit it on a custom network and I hit it on ovirtmgmt. We expect an "RTNETLINK answers: No such file or directory" error (2), but in these cases we get an "RTNETLINK answers: Invalid argument" error (22).

I think this is a tc/netlink race, as VDSM is serializing the mirroring commands.
vdsm VM threads send the mirroring commands to the supervdsm process through the proxy connection (which is one per process, so vdsm has only one such connection). We need to try and recreate it using plain tc commands. We seem to have handled this already for host QoS, by expecting several types of errors when an entity (a qdisc, in this case) is missing.

I have recreated this as follows:

```
tc qdisc add dev eth3 ingress
tc qdisc replace dev eth3 root prio
tc qdisc del dev eth3 root
tc qdisc del dev eth3 ingress
tc qdisc del dev eth3 root
tc qdisc del dev eth3 ingress
```

We've missed the 4.2.7 deadline for non-blockers. This would need to wait for 4.2.8.

Verified on vdsm-4.30.0-631.gitac20250.el7.x86_64 with 4.3.0-0.0.master.20181010155922.gite89babf.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077
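The fix direction discussed in the comments (tolerating "entity already gone" errors when deleting a qdisc, as was already done for host QoS) could be sketched as below. This is an illustrative sketch only, not the actual vdsm patch: `qdisc_del`, `MISSING_ENTITY_MESSAGES`, and the injectable `run` parameter are hypothetical names, and the set of tolerated RTNETLINK messages is taken from the errors quoted in this bug (2 and 22).

```python
# Sketch (not vdsm code): treat RTNETLINK replies that can mean "the
# qdisc is already gone" as success, so a second vmDestroy racing with
# the first one does not raise TrafficControlException.
import subprocess

# Messages tc may print when the entity no longer exists, per this bug:
#   "RTNETLINK answers: No such file or directory"  (error 2, expected)
#   "RTNETLINK answers: Invalid argument"           (error 22, seen here)
MISSING_ENTITY_MESSAGES = ("No such file or directory", "Invalid argument")

class TrafficControlException(Exception):
    """Carries (returncode, stderr, command), like the tuple in the log."""

def qdisc_del(device, kind="ingress", run=subprocess.run):
    """Delete a qdisc; return True if deleted now, False if already gone.

    `run` is injectable so the error handling can be tested without tc.
    """
    command = ["/sbin/tc", "qdisc", "del", "dev", device, kind]
    result = run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return True  # qdisc existed and was deleted by this call
    if any(msg in result.stderr for msg in MISSING_ENTITY_MESSAGES):
        return False  # already removed by a racing destroy: not an error
    raise TrafficControlException(result.returncode, result.stderr.strip(),
                                  command)
```

With this shape, the second of two racing `vmDestroy` flows gets `False` instead of an exception, so `releaseVm` can proceed; any other tc failure still surfaces as `TrafficControlException`.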