Bug 1300220 - [downstream clone] Concurrent multi-host network changes are allowed but prone to fail
Status: NEW
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.5
Hardware: x86_64  OS: Linux
Priority: high  Severity: high
Target Milestone: ovirt-4.3.0
Target Release: ---
Assigned To: nobody
QA Contact: Michael Burman
Keywords: Reopened, ZStream
Depends On: 1061569
Blocks: 1369064
Reported: 2016-01-20 05:02 EST by Pavel Zhukov
Modified: 2017-08-15 11:16 EDT
CC List: 17 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: This change provides the user with information about when a background multi-host operation has finished.
Reason: Some network changes (e.g. attaching/detaching a labeled network, updating a network definition) spawn a background multi-host operation. Because the user is not aware of that background operation, they might initiate another network change that spawns a second background operation, which can collide with the first one if it has not finished yet.
Result: The user is made aware of the background processes they start, so they can avoid running multiple concurrent processes that are prone to fail.
Story Points: ---
Clone Of: 1061569
Clones: 1369064
Environment:
Last Closed: 2016-11-09 04:40:00 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 46431 None None None 2016-01-20 05:02 EST
oVirt gerrit 46669 None None None 2016-01-20 05:02 EST
oVirt gerrit 61350 ovirt-engine-3.6 MERGED engine: add finish report on multi-host network commands 2016-08-29 05:33 EDT
oVirt gerrit 62081 master MERGED engine: add finish report on multi-host network commands 2016-08-28 04:24 EDT
oVirt gerrit 62891 ovirt-engine-4.0 MERGED core: add setCustomValues to AuditLogableBase 2016-08-28 10:57 EDT
oVirt gerrit 62892 ovirt-engine-4.0 MERGED engine: add finish report on multi-host network commands 2016-08-28 10:57 EDT

Description Pavel Zhukov 2016-01-20 05:02:04 EST
+++ This bug was initially created as a clone of Bug #1061569 +++

Description of problem:
On 3.4, if a network is attached to a host, updating that network on the DC causes a setupNetworks command to be sent to sync the network on the host.
If the network is updated more than once too quickly, the second internal setupNetworks fails with the error:
Operation Failed: [Resource unavailable]

Version-Release number of selected component (if applicable):
ovirt-engine-3.4.0-0.5.beta1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create 2 networks and attach them to the host using setupNetworks
2. Update the first network and immediately update the second one

Actual results:
The second network never gets synced on the host.

Expected results:
Both networks should be synced on the host.

--- Additional comment from Moti Asayag on 2014-02-05 03:08:09 EST ---

You might end up with the same result if you execute 2 consecutive 'setup networks' commands from 2 different clients.

But now, with the multi-host network configuration and network labels features, we might hit it more often.

The result of such a failure is a network that is not synced on the host. This is not a blocking issue:
1. The failure is reported via an event log message specifying that the change wasn't configured on the specific host.
2. The user can sync the network via 'setup networks', as was done before whenever it failed.

As for the specific issue:
VDSM has a lock on 'setupNetworks' which rejects any concurrent call with a 'resource unavailable' error.

There are a couple of ways to handle it, and it deserves its own thread on engine-devel and vdsm-devel.
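
[Editorial note] For readers unfamiliar with the VDSM behaviour described above, here is a minimal Python sketch of such a reject-instead-of-queue guard. It is an illustration only, not VDSM's actual code; _apply_network_config is a hypothetical placeholder.

import threading

_setup_lock = threading.Lock()

class ResourceUnavailable(Exception):
    """Raised when another network verb is already executing."""

def _apply_network_config(networks, bondings, options):
    # Stand-in for the real work of reconfiguring host networking.
    pass

def guarded_setup_networks(networks, bondings, options):
    # acquire(blocking=False) returns immediately instead of waiting, so a
    # second caller that arrives while the first is still running is rejected
    # rather than queued -- the 'resource unavailable' behaviour seen by the engine.
    if not _setup_lock.acquire(blocking=False):
        raise ResourceUnavailable("concurrent network verb already executing")
    try:
        return _apply_network_config(networks, bondings, options)
    finally:
        _setup_lock.release()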

--- Additional comment from Itamar Heim on 2014-02-09 03:52:56 EST ---

Setting target release to current version for consideration and review. Please do not push non-RFE bugs to an undefined target release, to make sure bugs are reviewed for relevancy, fix, closure, etc.

--- Additional comment from Sandro Bonazzola on 2014-03-04 04:22:42 EST ---

This is an automated message.
Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

--- Additional comment from Sandro Bonazzola on 2015-09-04 05:02:34 EDT ---

This is an automated message.
This Bugzilla report has been opened on a version which is not maintained anymore.
Please check if this bug is still relevant in oVirt 3.5.4.
If it's not relevant anymore, please close it (you may use the EOL or CURRENT RELEASE resolution).
If it's an RFE, please update the version to 4.0 if it is still relevant.

--- Additional comment from Michael Burman on 2015-10-15 08:21:03 EDT ---

I managed to reproduce this bug on 3.6.0.1-0.1.el6 with vdsm-4.17.9-1.el7ev.noarch.

I have 2 VLAN non-VM networks attached to 1 NIC. I managed to update them quickly from non-VM to VM networks, and the result was that I ended up with 2 unsynced networks.

Property - Bridged
Host - false
DC - true

Attaching vdsm.log

--- Additional comment from Michael Burman on 2015-10-15 08:26 EDT ---

Created attachment 1083241 [details]
vdsm log

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-10-19 06:49:41 EDT ---

Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

--- Additional comment from Yaniv Dary on 2015-10-29 08:04:46 EDT ---

In oVirt, testing is done on a single stream by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.

--- Additional comment from Dan Kenigsberg on 2015-11-10 11:45:03 EST ---

(In reply to Michael Burman from comment #6)
> Created attachment 1083241 [details]
> vdsm log

#1
This log does not show the tell-tale 'concurrent network verb already executing' error. Hence, I assume that the engine-side protection actually worked.

#2
I do not understand how the error

Thread-10905::DEBUG::2015-10-15 13:41:32,803::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'net-2': {u'nic': u'ens1f0', u'bridged': u'false', u'mtu': u'1500'}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 120}}
...
ConfigNetworkError: (21, "interface u'ens1f0' cannot be defined with this network since it is already defined with network net-1")

is related to your reproduction.

#3
The patch solves a race between two single-host setupNetworks commands.
You are right that it does not solve the reported problem of a DC-level change: we change the DC-level property, then go on to change it on the host(s) to which the network is attached. Any of these host-level commands may fail and leave the host in an unsynced state.

--- Additional comment from Dan Kenigsberg on 2015-11-10 11:56:03 EST ---

I believe that the problem stems from our multi-host (background) operations. When changing the DC-level property, a background process starts updating all relevant hosts.

Another change of a DC-level property would spawn another background multi-host process that is likely to collide with the first one.

Solving this is not easy. We may want to fail all subsequent multi-host processes while the first one has not finished, but this may harm usability if the first one handles many slow hosts. We can block subsequent processes only if they handle the same hosts as a concurrent process.

Another idea is to provide a mechanism to report the current state of multi-host processes, so the user can tell whether it is safe to start a new process.
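
[Editorial note] To make the second idea above concrete, here is a minimal Python sketch (an illustration, not engine code) of a guard that rejects a new multi-host operation only when its host set overlaps one that is still running. Host IDs and the per-host apply_change callback are hypothetical placeholders.

import threading

_busy_hosts = set()
_guard = threading.Lock()

class ConcurrentOperation(Exception):
    """Raised when the requested hosts are still busy with a previous operation."""

def start_multi_host_operation(host_ids, apply_change):
    # Reject the new operation only if it touches hosts that a still-running
    # operation is already updating; disjoint host sets may proceed in parallel.
    with _guard:
        overlap = _busy_hosts & set(host_ids)
        if overlap:
            raise ConcurrentOperation(
                "hosts %s are still being updated by a previous operation"
                % sorted(overlap))
        _busy_hosts.update(host_ids)

    def _run():
        try:
            for host_id in host_ids:
                apply_change(host_id)  # per-host setupNetworks; may fail per host
        finally:
            with _guard:
                _busy_hosts.difference_update(host_ids)

    threading.Thread(target=_run, daemon=True).start()
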
Comment 1 Pavel Zhukov 2016-01-20 05:05:16 EST
Steps to reproduce:

1) Attach the label "test" to a NIC
2) Using the API, create 10 networks and attach the label to all of them
3) Attach the networks to the cluster

Result:
The engine tries to call setupNetworks multiple times (once per labeled network), but VDSM rejects the calls with:
Thread-363622::WARNING::2016-01-20 04:48:10,134::API::1393::vds::(setupNetworks) concurrent network verb already executing
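
[Editorial note] For convenience, a rough Python sketch of scripting step 2 against the engine REST API. The endpoint path, XML payload, URL, credentials and data-center name below are assumptions for illustration only; check the oVirt REST API reference for your version before relying on them.

import requests

ENGINE = "https://engine.example.com/ovirt-engine/api"   # placeholder URL
AUTH = ("admin@internal", "password")                     # placeholder credentials
HEADERS = {"Content-Type": "application/xml"}

def create_labeled_network(name, label="test", datacenter="DC1"):
    # Assumed request shape: POST to the /networks collection with an XML body
    # that names the data center and carries the label.
    body = (
        "<network>"
        "<name>%s</name>"
        "<data_center><name>%s</name></data_center>"
        "<labels><label id=\"%s\"/></labels>"
        "</network>" % (name, datacenter, label)
    )
    resp = requests.post("%s/networks" % ENGINE, data=body,
                         headers=HEADERS, auth=AUTH, verify=False)
    resp.raise_for_status()

# Creating many labeled networks back to back makes the engine spawn
# overlapping setupNetworks calls on the labeled NIC, which VDSM rejects.
for i in range(10):
    create_labeled_network("net-%d" % i)
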
Comment 9 Yevgeny Zaspitsky 2016-07-17 07:00:54 EDT
When registering a new host, only ovirtmgmt is automatically created. Then a label can be assigned to a NIC, and by that multiple networks need to be attached to the host. Is that what the customer is doing? (If so, it is unrelated to this bug, which is about *multi-host* operations.)

Or is the customer defining new networks such as "vlan_1307" etc. and assigning a label to them?
Comment 15 Dan Kenigsberg 2016-07-19 02:47:13 EDT
Bimal, as far as I understand, the only current workaround is to wait. After assigning a label to a network, the network is applied to all hosts. The user must wait until this process ends before assigning a label to another network. Note that there is no indication that the process has ended.
Comment 19 Yevgeny Zaspitsky 2016-07-26 04:44:29 EDT
The patch [1] intends to provide an audit log message upon finishing the application of a network change to the hosts, so a user would be informed of when another network change can be submitted.

[1] https://gerrit.ovirt.org/#/c/61350/
Comment 22 Yaniv Lavi (Dary) 2016-08-15 10:13:53 EDT
Agreed to include a 3.6 change to add an audit log message for network action completion.
Comment 26 Michael Burman 2016-09-04 08:33:35 EDT
Tested on - 4.0.4-0.1.el7ev

When the multi-host operation failed -

Failed to apply changes of Network complex-net-2 on Data Center: DC1 to the hosts. Failed hosts are: puma22.scl.lab.tlv.redhat.com, orchid-vds2.qa.lab.tlv.redhat.com.
(2/2): Failed to apply changes for network(s) complex-net-2 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Failed to apply changes for network(s) complex-net-2 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network complex-net-2 on Data Center: DC1 was started.


When the multi-host operation succeeded -

Update network new_network1 on Data Center: DC1 has finished successfully.
(1/2): Successfully applied changes for network(s) new_network1 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Successfully applied changes for network(s) new_network1 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network new_network1 on Data Center: DC1 was started.

When the multi-host operation partially failed (only one host out of 2) -

Failed to apply changes of Network n3 on Data Center: DC1 to the hosts. Failed hosts are: orchid-vds2.qa.lab.tlv.redhat.com
(2/2): Successfully applied changes for network(s) n3 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Failed to apply changes for network(s) n3 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network n3 on Data Center: DC1 was started.


- So what we have now is an indication that the operation has started, and an indication of whether it ended with success or failure in the DC and on which hosts it failed.
But there is no indication of the reason for the failure.
Shouldn't the error messages include the reason for the failure? Or is this the intended fix?
Comment 27 Dan Kenigsberg 2016-09-05 06:36:51 EDT
We are about to revert the patches. Customers are requested to watch the event log for a (1/N) alert (where N is the number of hosts in the cluster), and not to perform any other multi-host operation until the (N/N) alert is shown in the log.
Comment 28 Yaniv Lavi (Dary) 2016-09-07 05:23:58 EDT
Can you communicate comment #27 to the customer and see if this is good enough to resolve the use case for knowing when network changes are complete via the audit log?
Comment 32 Dan Kenigsberg 2016-09-11 06:52:11 EDT
(In reply to Dan Kenigsberg from comment #27)

I have to correct my previous comment, as the (N/N) alert may arrive before alerts from other hosts. The correct statement is:

Customers are requested to watch the event log for (n/N) alerts (where N is the number of hosts in the cluster and n is in the 1..N range), and not to perform any other multi-host operation until all (n/N) alerts have shown up in the log.
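
[Editorial note] As an illustration of that workaround, a small hypothetical Python helper that scans engine event-log text for those per-host (n/N) alerts and reports when all of them have appeared. The message format is taken from the samples in comment 26 and may differ between engine versions; adjust the regex accordingly.

import re

# Matches lines such as:
# (2/2): Successfully applied changes for network(s) n3 on host puma22.scl.lab.tlv.redhat.com.
ALERT = re.compile(r"\((\d+)/(\d+)\):.*network\(s\) (?P<net>\S+) on host")

def multi_host_operation_finished(event_lines, network):
    """Return True once a completion alert has shown up for every host."""
    seen, total = set(), None
    for line in event_lines:
        m = ALERT.search(line)
        if not m or m.group("net") != network:
            continue
        if "Successfully applied" in line or "Failed to apply" in line:
            seen.add(int(m.group(1)))   # n of the (n/N) counter
            total = int(m.group(2))     # N, the number of hosts
    return total is not None and len(seen) == total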
