Bug 1631624
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Exception on unsetPortMirroring makes vmDestroy fail. | | |
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | vdsm | Assignee: | Edward Haas <edwardh> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Burman <mburman> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.6 | CC: | danken, dholler, gveitmic, lsurette, phoracek, srevivo, ycui |
| Target Milestone: | ovirt-4.3.0 | Keywords: | ZStream |
| Target Release: | 4.3.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.30.3 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1637549 (view as bug list) | Environment: | |
| Last Closed: | 2019-05-08 12:36:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1637549 | | |
Description (Germano Veit Michel, 2018-09-21 06:12:45 UTC)
Can't reproduce this report. Tested several attempts with the steps in the description. Tried with both power off and shutdown. VM successfully powered off.

I was not able to reproduce it on my cluster with master either. I will be away for the next 5 days. I will try it with 4.2.6 when I'm back. Could you grant me access to a machine that reproduces it by any chance?

(In reply to Michael Burman from comment #4)
> Can't reproduce this report.
> Tested several attempts with the steps in the description.
> Tried with both power off and shutdown. VM successfully powered off.

I've just reproduced it again. It's not every time that I can reproduce it. For me, a graceful shutdown via the guest agent with 10+ VMs is a 100% hit.

```
2018-10-04 13:08:42,775+1000 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:690)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 653, in dispatchLibvirtEvents
    v.onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5580, in onLibvirtLifecycleEvent
    self._onQemuDeath(exit_code, reason)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1067, in _onQemuDeath
    result = self.releaseVm()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5276, in releaseVm
    nic.name)
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
    **kwargs)
  File "<string>", line 2, in unsetPortMirroring
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
TrafficControlException: (22, 'RTNETLINK answers: Invalid argument', ['/sbin/tc', 'qdisc', 'del', 'dev', 'ovirtmgmt', 'ingress'])
```

(In reply to Petr Horáček from comment #5)
> I was not able to reproduce it on my cluster with master either. I will be
> away for the next 5 days. I will try it with 4.2.6 when I'm back. Could you
> grant me access to a machine that reproduces it by any chance?

I don't even have that system anymore as I constantly break my test env ;) But the log above is from a freshly installed one; the same thing happens.

Also, looking at the code in master, I see the same race condition can happen: all the vnet devices are already gone when a VM destroy is called, but there are more VM destroys still to be executed. The first destroy removed the filter, so the next VM destroy fails trying to remove it again.

(In reply to Germano Veit Michel from comment #6)
> I've just reproduced it again. It's not every time that I can reproduce it.
> [...]
> Also, looking at the code in master, I see the same race condition can
> happen: all the vnet devices are already gone when a VM destroy is called,
> but there are more VM destroys still to be executed. The first destroy
> removed the filter, so the next VM destroy fails trying to remove it again.

Ok, so let's maybe try to understand what the difference between us is :) What are the OS versions on the hosts and the guests? Is it 7.6? I heard that there is some regression with tc on 7.6. Also, I see that the network you are using to reproduce it is ovirtmgmt; I tried with another network which is not the management one. Should I try with ovirtmgmt? Thanks

Correction - managed to reproduce with RHEL 7.5 images running on RHEL 7.6 hosts.

Michael, I can hit this on 7.5 hosts, and the customer ticket attached to this case is on 7.4. I don't think this is a tc/iproute issue; it looks like a race condition in vdsm to me, as explained in comment #0. I use RHEL 7.5 guests to reproduce, shutting down gracefully via the guest agent. Also, the network name should not make any difference: the customer hit it on a custom network and I hit it on ovirtmgmt. We expect an "RTNETLINK answers: No such file or directory" error (2), but in these cases we get an "RTNETLINK answers: Invalid argument" error (22).

I think this is a tc/netlink race, as VDSM is serializing the mirroring commands.
vdsm VM threads send the mirroring commands to the supervdsm process through the proxy connection (which is one per process, so vdsm has only one such connection). We need to try and recreate it using plain tc commands. We seem to have handled this already for host QoS, by expecting several types of errors when an entity (a qdisc, in this case) is missing.

I have recreated this as follows:

```
tc qdisc add dev eth3 ingress
tc qdisc replace dev eth3 root prio
tc qdisc del dev eth3 root
tc qdisc del dev eth3 ingress
tc qdisc del dev eth3 root
tc qdisc del dev eth3 ingress
```

We've missed the 4.2.7 deadline for non-blockers. This would need to wait for 4.2.8.

Verified on vdsm-4.30.0-631.gitac20250.el7.x86_64 with 4.3.0-0.0.master.20181010155922.gite89babf.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077
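The fix direction discussed in the comments (tolerating "entity already gone" errors when deleting a qdisc, as was already done for host QoS) could be sketched as below. This is an illustrative sketch only, not the actual vdsm patch: `qdisc_del`, `MISSING_ENTITY_MESSAGES`, and the injectable `run` parameter are hypothetical names, and the set of tolerated RTNETLINK messages is taken from the errors quoted in this bug (2 and 22).

```python
# Sketch (not vdsm code): treat RTNETLINK replies that can mean "the
# qdisc is already gone" as success, so a second vmDestroy racing with
# the first one does not raise TrafficControlException.
import subprocess

# Messages tc may print when the entity no longer exists, per this bug:
#   "RTNETLINK answers: No such file or directory"  (error 2, expected)
#   "RTNETLINK answers: Invalid argument"           (error 22, seen here)
MISSING_ENTITY_MESSAGES = ("No such file or directory", "Invalid argument")

class TrafficControlException(Exception):
    """Carries (returncode, stderr, command), like the tuple in the log."""

def qdisc_del(device, kind="ingress", run=subprocess.run):
    """Delete a qdisc; return True if deleted now, False if already gone.

    `run` is injectable so the error handling can be tested without tc.
    """
    command = ["/sbin/tc", "qdisc", "del", "dev", device, kind]
    result = run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return True  # qdisc existed and was deleted by this call
    if any(msg in result.stderr for msg in MISSING_ENTITY_MESSAGES):
        return False  # already removed by a racing destroy: not an error
    raise TrafficControlException(result.returncode, result.stderr.strip(),
                                  command)
```

With this shape, the second of two racing `vmDestroy` flows gets `False` instead of an exception, so `releaseVm` can proceed; any other tc failure still surfaces as `TrafficControlException`.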