Bug 1300220 - [downstream clone] Concurrent multi-host network changes are allowed but prone to fail
Summary: [downstream clone] Concurrent multi-host network changes are allowed but prone to fail
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.3.5
Assignee: eraviv
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On: 1061569 1850104
Blocks: 1369064 1520566
 
Reported: 2016-01-20 10:02 UTC by Pavel Zhukov
Modified: 2023-09-07 18:44 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: The change provides the user with information about when a background multi-host operation has finished. Reason: Some network changes (e.g. attaching/detaching a labeled network, updating network definitions) cause a background multi-host operation to be spawned. Because the user is not aware of that background operation, they might initiate another network change that spawns a second background operation, which can collide with the first one if it has not finished yet. Result: The user is made aware of the background processes they start, so they can avoid starting multiple processes that are prone to fail.
Clone Of: 1061569
Environment:
Last Closed: 2019-06-17 08:47:44 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1358501 0 high CLOSED [RFE] multihost network change - notify when done 2021-02-22 00:41:40 UTC
oVirt gerrit 46431 0 'None' MERGED engine: add locking on (Host)SetupNetworksCommand 2020-04-11 21:38:40 UTC
oVirt gerrit 46669 0 'None' MERGED engine: add locking on (Host)SetupNetworksCommand 2020-04-11 21:38:39 UTC
oVirt gerrit 61350 0 'None' MERGED engine: add finish report on multi-host network commands 2020-04-11 21:38:39 UTC
oVirt gerrit 62081 0 'None' MERGED engine: add finish report on multi-host network commands 2020-04-11 21:38:39 UTC
oVirt gerrit 62891 0 'None' MERGED core: add setCustomValues to AuditLogableBase 2020-04-11 21:38:39 UTC
oVirt gerrit 62892 0 'None' MERGED engine: add finish report on multi-host network commands 2020-04-11 21:38:39 UTC

Internal Links: 1358501

Description Pavel Zhukov 2016-01-20 10:02:04 UTC
+++ This bug was initially created as a clone of Bug #1061569 +++

Description of problem:
On 3.4, if a network is attached to a host, updating that network on the DC sends a setupNetworks command to sync the network on the host.
If the network update is sent more than once too quickly, the second internal setupNetworks call fails with the error:
Operation Failed: [Resource unavailable]

Version-Release number of selected component (if applicable):
ovirt-engine-3.4.0-0.5.beta1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create 2 networks and attach them to the host using setupNetworks
2. Update the first network and immediately update the second one

Actual results:
The second network never got synced on the host.

Expected results:
Both networks should be synced on the host.

Additional info:

--- Additional comment from Moti Asayag on 2014-02-05 03:08:09 EST ---

You might end up with the same result if you execute 2 consecutive 'setup networks' commands from 2 different clients.

But now, with the multi-host network configuration and network labels features, we might hit it more often.

The result of such a failure is a network that is not synced on the host. This is not a blocking issue:
1. The failure is reported via an event log entry specifying that the change was not configured on the specific host.
2. The user can sync the network via 'setup networks', as was done before when a sync failed.

As for the specific issue:
VDSM holds a lock on 'setupNetworks' that rejects any concurrent call with a 'resource unavailable' error.
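
For illustration only, a minimal Python sketch (not actual VDSM code) of such a reject-if-busy guard; _apply_network_config() is a hypothetical placeholder for the real host-side work:

import threading

_setup_lock = threading.Lock()

class ResourceUnavailable(Exception):
    """Raised when another setupNetworks call is already executing."""

def setup_networks(networks, bondings, options):
    # acquire(blocking=False) returns immediately instead of waiting,
    # mirroring the "concurrent network verb already executing" rejection.
    if not _setup_lock.acquire(blocking=False):
        raise ResourceUnavailable("Resource unavailable: setupNetworks is already running")
    try:
        _apply_network_config(networks, bondings, options)
    finally:
        _setup_lock.release()

def _apply_network_config(networks, bondings, options):
    # Hypothetical placeholder for the real host reconfiguration.
    pass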

There are a couple of ways to handle it, and it deserves its own thread on engine-devel and vdsm-devel.

--- Additional comment from Itamar Heim on 2014-02-09 03:52:56 EST ---

Setting target release to the current version for consideration and review. Please
do not push non-RFE bugs to an undefined target release, to make sure bugs are
reviewed for relevancy, fix, closure, etc.

--- Additional comment from Sandro Bonazzola on 2014-03-04 04:22:42 EST ---

This is an automated message.
Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

--- Additional comment from Sandro Bonazzola on 2015-09-04 05:02:34 EDT ---

This is an automated message.
This Bugzilla report has been opened on a version which is not maintained anymore.
Please check if this bug is still relevant in oVirt 3.5.4.
If it's not relevant anymore, please close it (you may use the EOL or CURRENT RELEASE resolution).
If it's an RFE, please update the version to 4.0 if still relevant.

--- Additional comment from Michael Burman on 2015-10-15 08:21:03 EDT ---

I managed to reproduce this bug on 3.6.0.1-0.1.el6 with vdsm-4.17.9-1.el7ev.noarch

I have 2 VLAN non-VM networks attached to 1 NIC. I managed to update them quickly from non-VM to VM networks, and the result was that I ended up with 2 unsynced networks.

Property - Bridged
Host - false
DC- true

Attaching vdsm.log

--- Additional comment from Michael Burman on 2015-10-15 08:26 EDT ---



--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-10-19 06:49:41 EDT ---

Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

--- Additional comment from Yaniv Dary on 2015-10-29 08:04:46 EDT ---

In oVirt, testing is done on a single stream by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.

--- Additional comment from Dan Kenigsberg on 2015-11-10 11:45:03 EST ---

(In reply to Michael Burman from comment #6)
> Created attachment 1083241 [details]
> vdsm log

#1
This log does not show the telltale 'concurrent network verb already executing' error. Hence, I assume that the Engine-side protection actually worked.

#2
I do not understand how the error

Thread-10905::DEBUG::2015-10-15 13:41:32,803::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'net-2': {u'nic': u'ens1f0', u'bridged': u'false', u'mtu': u'1500'}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 120}}
...
ConfigNetworkError: (21, "interface u'ens1f0' cannot be defined with this network since it is already defined with network net-1")

is related to your reproduction.

#3
The patch solves a race between two single-host setupNetwork commands. 
You are right that it does not solve the reported problem of a DC-level change. We change the DC-level property, then go to change it on the host(s) to which the network is attached. Any of these host-level commands may fail and leave the host in an unsync'ed state.
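
As a rough illustration of that flow (not engine code; the host names and helpers are made up), the DC-level update is pushed host by host, and each per-host call can fail independently, leaving that host unsynced:

def update_network_on_dc(network, attached_hosts, push_to_host):
    # push_to_host(host, network) stands in for the per-host setupNetworks call.
    failed = []
    for host in attached_hosts:
        try:
            push_to_host(host, network)
        except Exception as err:        # e.g. "Resource unavailable" from a colliding call
            failed.append((host, err))  # this host is left with an unsynced network
    return failed                       # failures are only reported via the event log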

--- Additional comment from Dan Kenigsberg on 2015-11-10 11:56:03 EST ---

I believe that the problem stems from our multi-host (background) operations. When changing the DC-level property, a background process starts updating all relevant hosts.

Another change of a DC-level property would spawn another background multi-host process that is likely to collide with the first one.

Solving this is not easy. We may want to fail all subsequent multi-host processes while the first one has not finished, but this may harm usability if the first one handles many slow hosts. We can block subsequent processes only if they handle the same hosts as a concurrent process.
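
A possible shape of the "block only when the same hosts are involved" option, as a hedged Python sketch (the shared registry and its callers are assumptions, not the engine's actual lock manager):

import threading

_busy_hosts = set()
_registry_lock = threading.Lock()

def try_reserve_hosts(hosts):
    """Reserve every host needed by a multi-host operation, or none of them."""
    with _registry_lock:
        if _busy_hosts & set(hosts):       # some host is already being reconfigured
            return False                   # reject (or postpone) the new process
        _busy_hosts.update(hosts)
        return True

def release_hosts(hosts):
    """Free the hosts once the multi-host operation has finished."""
    with _registry_lock:
        _busy_hosts.difference_update(hosts)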

Another idea is to provide a mechanism to report the current state of multi-host processes, so the user can tell whether it is safe to start a new process.

Comment 1 Pavel Zhukov 2016-01-20 10:05:16 UTC
Steps to reproduce:

1) Attach label "test" to NIC
2) Using API create 10 networks and attach the label to all of them
3) Attach networks to the Cluster

Result:
The engine tries to call setupNetworks multiple times, once for every labeled network, but vdsm rejects the calls with:
Thread-363622::WARNING::2016-01-20 04:48:10,134::API::1393::vds::(setupNetworks) concurrent network verb already executing

Comment 9 Yevgeny Zaspitsky 2016-07-17 11:00:54 UTC
When registering a new host, only ovirtmgmt is automatically created. Then, a label can be assigned to a NIC, and thereby multiple networks need to be attached to the host. Is that what the customer is doing? (If so, it is unrelated to this bug, which speaks about *multi-host* operations.)

Or is the customer defining new networks such as "vlan_1307" etc and assigning a label to them?

Comment 15 Dan Kenigsberg 2016-07-19 06:47:13 UTC
Bimal, as far as I understand, the only current workaround is to wait. After assigning a label to a network, the network is applied to all hosts. The user must wait until this process ends before assigning a label to another network. Note that there is no indication that the process has ended.

Comment 19 Yevgeny Zaspitsky 2016-07-26 08:44:29 UTC
The patch [1] intends to provide an audit log message when the engine finishes applying a network change to the hosts, so the user is informed of when another network change can safely be submitted.

[1] https://gerrit.ovirt.org/#/c/61350/
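
Conceptually (a simplified sketch, not the actual patch), the engine only needs to count per-host completions for one multi-host command and emit a final audit-log entry once the last host has been handled; the log callback below is a stand-in for the real audit-log facility:

import threading

class MultiHostProgress:
    def __init__(self, network, total_hosts, log):
        self.network = network
        self.total = total_hosts
        self.done = 0
        self.failed = []
        self.log = log                      # stand-in for the audit log
        self._lock = threading.Lock()

    def host_finished(self, host, ok):
        with self._lock:
            self.done += 1
            if not ok:
                self.failed.append(host)
            self.log("(%d/%d) %s on host %s for network %s"
                     % (self.done, self.total,
                        "applied changes" if ok else "failed to apply changes",
                        host, self.network))
            if self.done == self.total:     # last host -> emit the finish report
                if self.failed:
                    self.log("Update network %s failed on: %s"
                             % (self.network, ", ".join(self.failed)))
                else:
                    self.log("Update network %s has finished successfully."
                             % self.network)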

Comment 22 Yaniv Lavi 2016-08-15 14:13:53 UTC
Agreed to include the 3.6 change to have an audit log entry for network action completion.

Comment 26 Michael Burman 2016-09-04 12:33:35 UTC
Tested on - 4.0.4-0.1.el7ev

When the multi-host operation failed:

Failed to apply changes of Network complex-net-2 on Data Center: DC1 to the hosts. Failed hosts are: puma22.scl.lab.tlv.redhat.com, orchid-vds2.qa.lab.tlv.redhat.com.
(2/2): Failed to apply changes for network(s) complex-net-2 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Failed to apply changes for network(s) complex-net-2 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network complex-net-2 on Data Center: DC1 was started.


When the multi-host operation succeeded:

Update network new_network1 on Data Center: DC1 has finished successfully.
(1/2): Successfully applied changes for network(s) new_network1 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Successfully applied changes for network(s) new_network1 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network new_network1 on Data Center: DC1 was started.

When the multi-host operation partially failed (only one host out of 2):

Failed to apply changes of Network n3 on Data Center: DC1 to the hosts. Failed hosts are: orchid-vds2.qa.lab.tlv.redhat.com
(2/2): Successfully applied changes for network(s) n3 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Failed to apply changes for network(s) n3 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network n3 on Data Center: DC1 was started.


So what we have now is an indication that the operation has started, an indication of whether it ended in success or failure at the DC level, and on which hosts it failed or succeeded.
But there is no indication of the reason for the failure.
Shouldn't the error messages include the reason for the failure, or is this the intended fix?

Comment 27 Dan Kenigsberg 2016-09-05 10:36:51 UTC
We are about to revert the patches. Customers are requested to watch the event log for a (1/N) alert (where N is the number of hosts in the cluster), and not to perform any other multi-host operation until the (N/N) alert is shown in the log.

Comment 28 Yaniv Lavi 2016-09-07 09:23:58 UTC
Can you communicate comment #27 to the customer and see if this is good enough to resolve the use case for knowing when network changes are complete via audit log?

Comment 32 Dan Kenigsberg 2016-09-11 10:52:11 UTC
(In reply to Dan Kenigsberg from comment #27)

I have to correct my previous comment, as the (N/N) alert may arrive before alerts from other hosts. The correct statement is:

Customers are requested to watch the event log for (n/N) alerts (where N is the number of hosts in the cluster and n is in the 1..N range), and not to perform any other multi-host operation until all (n/N) alerts have shown up in the log.
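
A client-side sketch of this workaround (fetch_event_messages() is a hypothetical helper returning the latest event-log lines as strings): parse the (n/N) markers and only proceed once every n in 1..N has appeared. In practice you would match only the per-host completion messages rather than every line containing a marker.

import re
import time

MARKER = re.compile(r"\((\d+)/(\d+)\)")

def wait_for_all_hosts(fetch_event_messages, poll_seconds=5, timeout=600):
    """Return True once all (n/N) alerts for the current operation have shown up."""
    seen, total = set(), None
    deadline = time.time() + timeout
    while time.time() < deadline:
        for msg in fetch_event_messages():
            m = MARKER.search(msg)
            if m:
                seen.add(int(m.group(1)))
                total = int(m.group(2))
        if total is not None and seen >= set(range(1, total + 1)):
            return True              # safe to start the next multi-host change
        time.sleep(poll_seconds)
    return False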

Comment 48 Raz Tamir 2018-08-15 18:52:42 UTC
Michael has tested it (comment #26) already.
Michael, could you please provide more info on Marina's questions in comment #47?

Comment 51 Marina Kalinin 2018-08-17 14:42:13 UTC
Cool. Sounds great.
Thanks for testing it, Michael.
Seems like we can close it CURRENTRELEASE.
I will let you or Dan do that.

Comment 55 Dan Kenigsberg 2018-12-11 11:49:17 UTC
We have a reasonable workaround as of 4.2.7: we have an indication that a multi-host action is going on, and we have a SyncAll button to regenerate a failed multi-host connection.

Comment 56 Dominik Holler 2019-02-06 12:24:53 UTC
*** Bug 1672762 has been marked as a duplicate of this bug. ***

Comment 62 eraviv 2019-04-15 09:27:43 UTC
In reply to comment 60:

According to the attached screenshot (attachment 1555116 [details]), none of the attached networks is out of sync, so there is no reason for the host's 'sync all networks' button to be enabled.

Comment 63 Michael Burman 2019-06-17 08:47:44 UTC
In 4.3 we have the following issues addressed.

The current version (rhvm-4.3.5-0.1.el7.noarch) has:

1. engine: add finish report on multi-host network commands
2. core: add setCustomValues to AuditLogableBase
3. engine: add locking on (Host)SetupNetworksCommand, in order to prevent multiple commands from being executed on a single host.
4. An indication that a multi-host action is going on, and a SyncAll button to regenerate a failed multi-host connection.

The original report and issue will be addressed and fixed in BZ 1061569, for 4.4 only.

Comment 64 eraviv 2019-06-23 06:51:48 UTC
(In reply to Michael Burman from comment #63)
> [...]

Burman,
Can you please specify on BZ 1061569 what the original problem is, with a reproduction scenario, and what the expected behaviour is for 4.4?
thanks

Comment 65 Michael Burman 2019-06-23 08:54:56 UTC
(In reply to eraviv from comment #64)
> (In reply to Michael Burman from comment #63)
> [...]
> 
> Burman,
> Can you please specify on BZ 1061569 what the original problem is, with a
> reproduction scenario, and what the expected behaviour is for 4.4?
> thanks

Hi Eitan,
The original problem is described in the description (that is one example); there are many scenarios in which this issue shows up, not just one. We saw it on the DEV and QE side in several scenarios, some of which do not always reproduce.
The expected behavior for 4.4 will be to not allow the collision and to fail. I believe you added the wait timeout for the lock manager.
We will need to ensure that such collisions between setup networks requests won't fail with your fix in 4.4.
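
For context, a minimal sketch of a per-host lock with a wait timeout (illustrative only; not the engine's actual lock-manager API): a colliding request waits for a bounded time instead of failing immediately.

import threading

_host_locks = {}
_locks_guard = threading.Lock()

def _lock_for(host):
    # One lock per host, created lazily.
    with _locks_guard:
        return _host_locks.setdefault(host, threading.Lock())

def run_setup_networks(host, apply_config, wait_timeout=60):
    lock = _lock_for(host)
    # Bounded wait instead of an instant "resource unavailable" rejection.
    if not lock.acquire(timeout=wait_timeout):
        raise TimeoutError("setup networks on %s is still busy after %ss" % (host, wait_timeout))
    try:
        apply_config(host)
    finally:
        lock.release()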

