Bug 1061569 - Concurrent host network changes are allowed but prone to fail
Summary: Concurrent host network changes are allowed but prone to fail
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Network
Version: ---
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium with 1 vote
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: eraviv
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On: 1477599
Blocks: 1300220 1850104
 
Reported: 2014-02-05 06:02 UTC by Meni Yakove
Modified: 2020-11-14 04:33 UTC (History)
10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if you requested multiple concurrent network changes on a host, some requests were not handled due to a 'reject on busy' service policy. The current release fixes this issue with a new service policy: if the server lacks the resources to handle a request, it queues the request for a configurable period. If server resources become available within this period, the server handles the request. Otherwise, it rejects the request. There is no guarantee for the order in which queued requests are handled.
Clone Of:
Clones: 1300220 1850104
Environment:
Last Closed: 2020-05-20 20:01:32 UTC
oVirt Team: Network
Embargoed:
danken: ovirt-4.4?
ylavi: planning_ack?
rule-engine: devel_ack+
mburman: testing_ack+


Attachments
engine vdsm and supervdsm logs (1.06 MB, application/zip)
2014-02-05 06:02 UTC, Meni Yakove
no flags
vdsm log (1.21 MB, application/x-gzip)
2015-10-15 12:26 UTC, Michael Burman
no flags
new engine log_ (86.17 KB, application/x-gzip)
2016-04-06 08:12 UTC, Michael Burman
no flags


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 46431 0 master MERGED engine: add locking on (Host)SetupNetworksCommand 2020-11-03 07:52:48 UTC
oVirt gerrit 46669 0 ovirt-engine-3.6 MERGED engine: add locking on (Host)SetupNetworksCommand 2020-11-03 07:52:48 UTC
oVirt gerrit 98587 0 None MERGED core: setup networks with timed wait 2020-11-03 07:52:48 UTC
oVirt gerrit 98617 0 None MERGED core: introduce wait with timeout to lock manager 2020-11-03 07:52:48 UTC
oVirt gerrit 99318 0 None MERGED core: add wait with timeout option to lock properties 2020-11-03 07:52:48 UTC

Description Meni Yakove 2014-02-05 06:02:05 UTC
Created attachment 859467 [details]
engine vdsm and supervdsm logs

Description of problem:
On 3.4, if a network is attached to the host, updating that network on the DC sends a setupNetworks command to sync the network on the host.
If the network update is sent more than once too quickly, the second internal setupNetworks fails with the error:
Operation Failed: [Resource unavailable]

Version-Release number of selected component (if applicable):
ovirt-engine-3.4.0-0.5.beta1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create 2 networks and attach them to the host using setupNetworks
2. Update the first network and immediately update the second one

Actual results:
The second network never gets synced on the host

Expected results:
Both networks should be synced on the host

Additional info:

Comment 1 Moti Asayag 2014-02-05 08:08:09 UTC
You might end up with the same result if you execute 2 consecutive 'setup networks' commands from 2 different clients.

But now with the multi-host network configuration and network labels feature we might hit it more often.

The result of such a failure is a network that is not synced on the host. This is not a blocking issue:
1. The failure is reported via an event log specifying that the change wasn't configured on the specific host.
2. The user can sync the network via 'setup networks', as before, if the change failed.

As for the specific issue:
VDSM has a lock on the 'setupNetworks' which rejects any consecutive call with 'resource unavailable' error.

There are a couple of ways to handle it, and it deserves its own thread on engine-devel and vdsm-devel.
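
For illustration, the 'reject on busy' policy described above boils down to a non-blocking lock acquisition: a second caller is refused immediately instead of waiting for the running verb to finish. A minimal Java sketch of the pattern follows; the class and names are hypothetical, and VDSM's actual implementation is Python, not this code:

import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the old reject-on-busy policy: a second caller is
// refused immediately instead of waiting for the running verb to finish.
public class RejectOnBusy {
    private final ReentrantLock inProgress = new ReentrantLock();

    public void setupNetworks(Runnable request) {
        if (!inProgress.tryLock()) { // returns false at once if another verb is running
            throw new IllegalStateException("Resource unavailable");
        }
        try {
            request.run();
        } finally {
            inProgress.unlock();
        }
    }
}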

Comment 2 Itamar Heim 2014-02-09 08:52:56 UTC
Setting target release to current version for consideration and review. please
do not push non-RFE bugs to an undefined target release to make sure bugs are
reviewed for relevancy, fix, closure, etc.

Comment 3 Sandro Bonazzola 2014-03-04 09:22:42 UTC
This is an automated message.
Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

Comment 4 Sandro Bonazzola 2015-09-04 09:02:34 UTC
This is an automated message.
This Bugzilla report has been opened on a version which is not maintained anymore.
Please check if this bug is still relevant in oVirt 3.5.4.
If it's not relevant anymore, please close it (you may use EOL or CURRENT RELEASE resolution)
If it's an RFE please update the version to 4.0 if still relevant.

Comment 5 Michael Burman 2015-10-15 12:21:03 UTC
I managed to reproduce this bug on 3.6.0.1-0.1.el6 with vdsm-4.17.9-1.el7ev.noarch

I have 2 VLAN non-VM networks attached to 1 NIC, and I managed to update them quickly from non-VM to VM networks; the result was that I ended up with 2 unsynced networks.

Property - Bridged
Host - false
DC - true

Attaching vdsm.log

Comment 6 Michael Burman 2015-10-15 12:26:24 UTC
Created attachment 1083241 [details]
vdsm log

Comment 7 Red Hat Bugzilla Rules Engine 2015-10-19 10:49:41 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not in the MODIFIED state, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 8 Yaniv Lavi 2015-10-29 12:04:46 UTC
In oVirt, testing is done on a single stream by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.

Comment 9 Dan Kenigsberg 2015-11-10 16:45:03 UTC
(In reply to Michael Burman from comment #6)
> Created attachment 1083241 [details]
> vdsm log

#1
This log does not show the telltale 'concurrent network verb already executing' error. Hence, I assume that the engine-side protection actually worked.

#2
I do not understand how the error

Thread-10905::DEBUG::2015-10-15 13:41:32,803::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'net-2': {u'nic': u'ens1f0', u'bridged': u'false', u'mtu': u'1500'}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 120}}
...
ConfigNetworkError: (21, "interface u'ens1f0' cannot be defined with this network since it is already defined with network net-1")

is related to your reproduction.

#3
The patch solves a race between two single-host setupNetwork commands. 
You are right that it does not solve the reported problem of a DC-level change. We change the DC-level property, then go to change it on the host(s) to which the network is attached. Any of these host-level commands may fail and leave the host in an unsynced state.
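
For illustration, the engine-side protection the patch adds amounts to serializing setup-networks commands per host. A minimal Java sketch of that idea, assuming a hypothetical per-host lock map (not the actual command code):

import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: serialize setup-networks commands per host, so two
// single-host commands on the same host cannot race; commands on different
// hosts still run in parallel.
public class PerHostCommandSerializer {
    private final ConcurrentMap<UUID, Object> hostLocks = new ConcurrentHashMap<>();

    public void runExclusively(UUID hostId, Runnable command) {
        Object lock = hostLocks.computeIfAbsent(hostId, id -> new Object());
        synchronized (lock) { // a second command on the same host blocks here
            command.run();
        }
    }
}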

Comment 10 Dan Kenigsberg 2015-11-10 16:56:03 UTC
I believe that the problem stems from our multi-host (background) operations. When changing the DC-level property, a background process starts updating all relevant hosts.

Another change of a DC-level property would spawn another background multi-host process that is likely to collide with the first one.

Solving this is not easy. We may want to fail all subsequent multi-host processes while the first one has not finished, but this may harm usability if the first one handles many slow hosts. We can block subsequent processes only if they handle the same hosts as a concurrent process.

Another idea is to provide a mechanism to report the current state of multi-host processes, so the user can tell whether it is safe to start a new process.
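
For illustration, the "block subsequent processes only if they handle the same hosts" option above could look like the following minimal Java sketch, with hypothetical names (not actual engine code): each multi-host process claims all of its hosts up front, all-or-nothing, and is rejected only when it overlaps with a running process. Claiming in a fixed order avoids deadlock when two processes overlap on several hosts.

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.stream.Collectors;

public class MultiHostClaim {
    private final ConcurrentMap<UUID, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Claim every host in the set, or none: a process is rejected only when
    // its host set actually overlaps with a running process.
    public boolean tryClaim(List<UUID> hostIds) {
        List<UUID> ordered = hostIds.stream().sorted().collect(Collectors.toList()); // fixed order avoids deadlock
        List<ReentrantLock> acquired = new ArrayList<>();
        for (UUID id : ordered) {
            ReentrantLock lock = locks.computeIfAbsent(id, k -> new ReentrantLock());
            if (!lock.tryLock()) {                       // another process holds this host
                acquired.forEach(ReentrantLock::unlock); // roll back the partial claim
                return false;
            }
            acquired.add(lock);
        }
        return true;
    }

    // Release after the multi-host operation finishes (same thread that claimed).
    public void release(List<UUID> hostIds) {
        for (UUID id : hostIds) {
            locks.get(id).unlock();
        }
    }
}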

Comment 11 Michael Burman 2016-04-06 08:11:16 UTC
Another scenario for this bug:

1) Attach 3 VLAN networks to a NIC via label
2) Remove the 3 networks from the DC in one action
Result:
- 1 network was removed from the host (with the label)
- 2 networks remain as 'unmanaged' networks on the host (the 'remove' setup networks command was sent, but the first operation was still busy on vdsm)

jsonrpc.Executor/6::DEBUG::2016-04-06 10:50:44,341::__init__::511::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'f3': {'remove': 'true'}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}
mailbox.SPMMonitor::DEBUG::2016-04-06 10:50:44,365::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0191812 s, 53.4 MB/s\n'; <rc> = 0
jsonrpc.Executor/3::DEBUG::2016-04-06 10:50:44,374::__init__::511::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.ping' in bridge with {}
jsonrpc.Executor/3::DEBUG::2016-04-06 10:50:44,375::__init__::539::jsonrpc.JsonRpcServer::(_serveRequest) Return 'Host.ping' in bridge with True
jsonrpc.Executor/5::DEBUG::2016-04-06 10:50:44,689::__init__::511::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'f2': {'remove': 'true'}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}
jsonrpc.Executor/5::WARNING::2016-04-06 10:50:44,690::API::1459::vds::(setupNetworks) concurrent network verb already executing
jsonrpc.Executor/7::DEBUG::2016-04-06 10:50:44,693::__init__::511::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.ping' in bridge with {}
jsonrpc.Executor/7::DEBUG::2016-04-06 10:50:44,693::__init__::539::jsonrpc.JsonRpcServer::(_serveRequest) Return 'Host.ping' in bridge with True
jsonrpc.Executor/1::DEBUG::2016-04-06 10:50:45,027::__init__::511::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'f4': {'remove': 'true'}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}

Attaching engine log

Tested on 4.0.0-0.0.master.20160404161620.git4ffd5a4.el7.centos
and vdsm-4.17.999-879.git565cb2e.el7.centos.noarch

Comment 12 Michael Burman 2016-04-06 08:12:42 UTC
Created attachment 1144119 [details]
new engine log_

Comment 13 Sandro Bonazzola 2016-05-02 09:56:10 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.

Comment 14 Yaniv Lavi 2016-05-23 13:17:34 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 15 Yaniv Lavi 2016-05-23 13:24:06 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 16 lgh 2018-05-15 22:50:59 UTC
Hi.

I think I'm hitting this bug when deleting multiple networks in a batch. I delete, for example, 10 networks, but only 4 get deleted and the other 6 remain as "unmanaged".

Since this hasn't been solved for a long time, do you know of any workarounds that I can apply? Is there any way of deleting all the unmanaged networks on a host, or something similar?

Thanks!

Comment 17 Michael Burman 2019-06-17 08:53:10 UTC
Eitan, please add all relevant patches that you already merged for this bug report. Thanks!

Comment 18 eraviv 2019-06-17 08:58:08 UTC
(In reply to Michael Burman from comment #17)
> Eitan, please add all relevant patches that you already merged for this bug
> report. Thanks!

done

Comment 19 Michael Burman 2019-06-23 09:12:42 UTC
The expected behavior after the fix is to avoid collision messages in the engine when performing multiple host network requests, and to avoid failures if such collisions happen.
We should no longer see the "Can't perform setup networks because another setup networks is running" error in the engine.
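
For illustration, the merged patches ("setup networks with timed wait", "introduce wait with timeout to lock manager") correspond to replacing the immediate rejection with a bounded wait, as in the Doc Text above. A minimal Java sketch of the policy, with hypothetical names (not the actual engine lock manager API):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the wait-with-timeout policy: a request waits a
// configurable period for the host to become free instead of failing at once.
public class TimedWaitHostLock {
    // Non-fair lock: queued waiters are not served in a guaranteed order,
    // matching the documented "no order guarantee" behavior.
    private final ReentrantLock lock = new ReentrantLock();

    public boolean runSetupNetworks(Runnable setupNetworks, long waitSeconds)
            throws InterruptedException {
        if (!lock.tryLock(waitSeconds, TimeUnit.SECONDS)) {
            return false; // still busy after the configured wait: reject
        }
        try {
            setupNetworks.run();
            return true;
        } finally {
            lock.unlock();
        }
    }
}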

Comment 20 Michael Burman 2020-03-30 13:56:22 UTC
We haven't seen these collisions for a long time now. Looks good. We will report a new bug if we see this or something similar again.

Verified on - 4.4.0-0.27.master.el8ev with 
vdsm-4.40.7-1.el8ev.x86_64
nmstate-0.2.6-4.el8.noarch

Comment 21 Sandro Bonazzola 2020-05-20 20:01:32 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

