+++ This bug was initially created as a clone of Bug #1061569 +++

Description of problem:
On 3.4, if a network is attached to the host, updating that network on the DC sends a setupNetworks command to sync the network on the host. If the network is updated more than once too quickly, the second internal setupNetworks fails with the error:
Operation Failed: [Resource unavailable]

Version-Release number of selected component (if applicable):
ovirt-engine-3.4.0-0.5.beta1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create 2 networks and attach them to the host using setupNetworks
2. Update the first network and immediately update the second one

Actual results:
The second network never gets synced on the host

Expected results:
Both networks should be synced on the host

--- Additional comment from Moti Asayag on 2014-02-05 03:08:09 EST ---

You might end up with the same result if you execute 2 consecutive 'setup networks' commands from 2 different clients. But now, with the multi-host network configuration and network labels features, we might hit it more often. The result of such a failure will be a network which is not synced on the host.

This is not a blocking issue:
1. The failure is reported via an event log specifying that the change wasn't configured on the specific host.
2. The user can sync the network via 'setup networks' if it failed, as was done before.

As for the specific issue: VDSM has a lock on 'setupNetworks' which rejects any consecutive call with a 'resource unavailable' error. There are a couple of ways to handle it, and it deserves its own thread on engine-devel and vdsm-devel.

--- Additional comment from Itamar Heim on 2014-02-09 03:52:56 EST ---

Setting target release to current version for consideration and review. Please do not push non-RFE bugs to an undefined target release, to make sure bugs are reviewed for relevancy, fix, closure, etc.

--- Additional comment from Sandro Bonazzola on 2014-03-04 04:22:42 EST ---

This is an automated message.
Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

--- Additional comment from Sandro Bonazzola on 2015-09-04 05:02:34 EDT ---

This is an automated message. This Bugzilla report has been opened on a version which is no longer maintained. Please check whether this bug is still relevant in oVirt 3.5.4. If it is no longer relevant, please close it (you may use the EOL or CURRENT RELEASE resolution). If it is an RFE, please update the version to 4.0 if still relevant.

--- Additional comment from Michael Burman on 2015-10-15 08:21:03 EDT ---

I managed to reproduce this bug on 3.6.0.1-0.1.el6 with vdsm-4.17.9-1.el7ev.noarch. I have 2 VLAN non-VM networks attached to 1 NIC, and when I updated them quickly from non-VM to VM networks, I ended up with 2 unsynced networks:

Property - Bridged
Host - false
DC - true

Attaching vdsm.log

--- Additional comment from Michael Burman on 2015-10-15 08:26 EDT ---

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-10-19 06:49:41 EDT ---

Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

--- Additional comment from Yaniv Dary on 2015-10-29 08:04:46 EDT ---

In oVirt, testing is done on a single stream by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.

--- Additional comment from Dan Kenigsberg on 2015-11-10 11:45:03 EST ---

(In reply to Michael Burman from comment #6)
> Created attachment 1083241 [details]
> vdsm log

#1 This log does not show the telltale 'concurrent network verb already executing' error. Hence, I assume that the Engine-side protection actually worked.
#2 I do not understand how the error

Thread-10905::DEBUG::2015-10-15 13:41:32,803::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'net-2': {u'nic': u'ens1f0', u'bridged': u'false', u'mtu': u'1500'}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 120}}
...
ConfigNetworkError: (21, "interface u'ens1f0' cannot be defined with this network since it is already defined with network net-1")

is related to your reproduction.

#3 The patch solves a race between two single-host setupNetworks commands. You are right that it does not solve the reported problem of a DC-level change. We change the DC-level property, then go on to change it on the host(s) to which the network is attached. Any of these host-level commands may fail and leave the host in an unsynced state.

--- Additional comment from Dan Kenigsberg on 2015-11-10 11:56:03 EST ---

I believe that the problem stems from our multi-host (background) operations. When changing a DC-level property, a background process starts updating all relevant hosts. Another change of a DC-level property would spawn another background multi-host process that is likely to collide with the first one.

Solving this is not easy. We may want to fail all subsequent multi-host processes while the first one has not finished, but this may harm usability if the first one handles many slow hosts. We could block subsequent processes only if they handle the same hosts as a concurrent process. Another idea is to provide a mechanism to report the current state of multi-host processes, so the user can tell whether it is safe to start a new one.
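[Editorial note: the "block subsequent processes only if they handle the same hosts" idea above can be sketched as a small guard that reserves host sets. This is a hypothetical illustration, not engine code; the class and method names are invented.]

```python
# Hypothetical sketch of blocking a new multi-host operation only when it
# overlaps with the host set of an in-flight operation, as suggested above.
import threading

class MultiHostOperationGuard:
    def __init__(self):
        self._lock = threading.Lock()
        self._busy_hosts = set()  # hosts with an in-flight network update

    def try_start(self, hosts):
        """Reserve the given hosts; refuse if any of them is already busy."""
        with self._lock:
            if self._busy_hosts & set(hosts):
                return False  # collides with a concurrent operation
            self._busy_hosts.update(hosts)
            return True

    def finish(self, hosts):
        """Release the hosts once the multi-host operation has completed."""
        with self._lock:
            self._busy_hosts.difference_update(hosts)

guard = MultiHostOperationGuard()
print(guard.try_start(["host1", "host2"]))  # True: nothing in flight
print(guard.try_start(["host2", "host3"]))  # False: host2 is still busy
guard.finish(["host1", "host2"])
print(guard.try_start(["host2", "host3"]))  # True: host2 was released
```

A disjoint host set (e.g. a second DC) would still be allowed to proceed concurrently, which addresses the usability concern about slow hosts.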
Steps to reproduce:
1) Attach label "test" to a NIC
2) Using the API, create 10 networks and attach the label to all of them
3) Attach the networks to the Cluster

Result: Engine tries to call setupNetworks multiple times, once for every labeled network, but VDSM rejects the calls with:
Thread-363622::WARNING::2016-01-20 04:48:10,134::API::1393::vds::(setupNetworks) concurrent network verb already executing
When registering a new host, only ovirtmgmt is automatically created. Then, a label can be assigned to a NIC, and by that multiple networks need to be attached to the host. Is that what the customer is doing? (If so, it is unrelated to this bug, which is about *multi-host* operations.) Or is the customer defining new networks such as "vlan_1307" etc. and assigning a label to them?
Bimal, as far as I understand, the only current workaround is to wait. After assigning a label to a network, the network is applied to all hosts. The user must wait until this process ends before assigning a label to another network. Note that there is no indication that the process has ended.
The patch [1] intends to provide an audit log message upon finishing the application of a network change to the hosts, so a user would know when another network change can safely be submitted.

[1] https://gerrit.ovirt.org/#/c/61350/
Agreed to include a 3.6 change adding an audit log message upon completion of multi-host network actions.
Tested on - 4.0.4-0.1.el7ev

When the multi-host operation failed:
Failed to apply changes of Network complex-net-2 on Data Center: DC1 to the hosts. Failed hosts are: puma22.scl.lab.tlv.redhat.com, orchid-vds2.qa.lab.tlv.redhat.com.
(2/2): Failed to apply changes for network(s) complex-net-2 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Failed to apply changes for network(s) complex-net-2 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network complex-net-2 on Data Center: DC1 was started.

When the multi-host operation succeeded:
Update network new_network1 on Data Center: DC1 has finished successfully.
(1/2): Successfully applied changes for network(s) new_network1 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Successfully applied changes for network(s) new_network1 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network new_network1 on Data Center: DC1 was started.

When the multi-host operation partially failed (one host out of 2):
Failed to apply changes of Network n3 on Data Center: DC1 to the hosts. Failed hosts are: orchid-vds2.qa.lab.tlv.redhat.com
(2/2): Successfully applied changes for network(s) n3 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Failed to apply changes for network(s) n3 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network n3 on Data Center: DC1 was started.

So what we have now is an indication that the operation has started, an indication of whether it ended with success or failure in the DC, and an indication of which host, if any, it failed on. But there is no indication of the reason it failed. Shouldn't the error messages include the reason for the failure? Or is this considered out of scope for this fix?
We are about to revert the patches. Customers are requested to watch the event log for a (1/N) alert (where N is the number of hosts in the cluster), and not to perform any other multi-host operation until the (N/N) alert is shown in the log.
Can you communicate comment #27 to the customer and see if this is good enough to resolve the use case of knowing, via the audit log, when network changes are complete?
(In reply to Dan Kenigsberg from comment #27)

I have to correct my previous comment, as the (N/N) alert may arrive before alerts from other hosts. The correct statement is: customers are requested to watch the event log for (n/N) alerts (where N is the number of hosts in the cluster and n is in the 1..N range), and not to perform any other multi-host operation until all (n/N) alerts have shown up in the log.
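[Editorial note: the corrected watching rule above can be made precise with a small sketch that scans event log lines for (n/N) markers and declares the operation done only when every n in 1..N has appeared, in any order. The function name and the assumption that alerts start with "(n/N):" are based on the log excerpts quoted earlier in this bug.]

```python
# Sketch: the multi-host operation is complete only when all of
# (1/N) .. (N/N) alerts have shown up, regardless of arrival order.
import re

def multi_host_done(log_lines, total_hosts):
    seen = set()
    for line in log_lines:
        m = re.match(r"\((\d+)/(\d+)\):", line)
        if m and int(m.group(2)) == total_hosts:
            seen.add(int(m.group(1)))
    return seen == set(range(1, total_hosts + 1))

log = [
    "(2/2): Successfully applied changes for network(s) n3 on host puma22.",
    "(1/2): Failed to apply changes for network(s) n3 on host orchid-vds2.",
]
print(multi_host_done(log, 2))       # True: both (1/2) and (2/2) appeared
print(multi_host_done(log[:1], 2))   # False: only (2/2) so far
```

Note that the second case is exactly the correction above: seeing (N/N) alone does not mean the operation has finished.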
Michael has tested it (comment #26) already. Michael, could you please provide more info on Marina's questions in comment #47?
Cool, sounds great. Thanks for testing it, Michael. It seems we can close it CURRENTRELEASE. I will let you or Dan do that.
We have a reasonable workaround as of 4.2.7: we have an indication that a multi-host action is going on, and we have a SyncAll button to regenerate a failed multi-host connection.
*** Bug 1672762 has been marked as a duplicate of this bug. ***
In reply to comment 60: according to the attached screenshot (attachment 1555116 [details]), none of the attached networks are out of sync, so there is no reason for the host's 'sync all networks' button to be enabled.
In 4.3 we have the following issues addressed. The current version (rhvm-4.3.5-0.1.el7.noarch) has:

1. engine: add finish report on multi-host network commands
2. add setCustomValues to AuditLogableBase
3. engine: add locking on (Host)SetupNetworksCommand, in order to prevent multiple commands being executed on a single host
4. an indication that a multi-host action is going on, and a SyncAll button to regenerate a failed multi-host connection

The original report and issue will be addressed and fixed in BZ 1061569, only for 4.4.
(In reply to Michael Burman from comment #63)
> In 4.3 we have the current issues addressed:
>
> The current version(rhvm-4.3.5-0.1.el7.noarch) have:
>
> 1. engine: add finish report on multi-host network commands
> 2. add setCustomValues to AuditLogableBase
> 3. engine: add locking on (Host)SetupNetworksCommand in order to prevent
> multiple commands being executed on a single host.
> 4. we have an indication that a multi-host action is going on, and we have a
> SyncAll button to regenerate a failed multi-host connection.
>
> The origin report and issue will be addressed and fixed in BZ 1061569 only
> for 4.4

Burman,
Can you please specify on BZ 1061569 what the original problem is, with a reproduction scenario, and what the expected behaviour is for 4.4?
Thanks
(In reply to eraviv from comment #64)
> (In reply to Michael Burman from comment #63)
> > In 4.3 we have the current issues addressed:
> >
> > The current version(rhvm-4.3.5-0.1.el7.noarch) have:
> >
> > 1. engine: add finish report on multi-host network commands
> > 2. add setCustomValues to AuditLogableBase
> > 3. engine: add locking on (Host)SetupNetworksCommand in order to prevent
> > multiple commands being executed on a single host.
> > 4. we have an indication that a multi-host action is going on, and we have
> > a SyncAll button to regenerate a failed multi-host connection.
> >
> > The origin report and issue will be addressed and fixed in BZ 1061569 only
> > for 4.4
>
> Burman,
> Can you please specify on BZ 1061569 what is the origim problem with a
> reproduce scenario and what is the expected behaviour for 4.4
> thanks

Hi Eitan,
The original problem is described in the description (one example); there are many scenarios that trigger this issue, not just one. We saw it on the DEV and QE side in several scenarios, some of them not always reproducible. The expected behavior for 4.4 is that such requests no longer fail on collision. I believe you added the wait timeout for the lock manager. We will need to ensure that such collisions between setup networks requests won't fail with your fix on 4.4.
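[Editorial note: the "wait timeout for the lock manager" mentioned above can be contrasted with the earlier non-blocking rejection. This is a hypothetical sketch with invented names, not the actual engine lock manager: a colliding request waits up to a timeout for the previous one to finish instead of failing immediately.]

```python
# Sketch: blocking acquire with a timeout, so back-to-back setup-networks
# requests for the same host queue up briefly instead of failing with
# 'resource unavailable'.
import threading

_host_lock = threading.Lock()

def setup_networks_with_wait(networks, timeout=5.0):
    # acquire(timeout=...) waits up to `timeout` seconds for the lock.
    if not _host_lock.acquire(timeout=timeout):
        raise RuntimeError("timed out waiting for a concurrent setupNetworks")
    try:
        # Placeholder for the real per-host configuration work.
        return sorted(networks)
    finally:
        _host_lock.release()

print(setup_networks_with_wait({"net-1": {}, "net-2": {}}))  # ['net-1', 'net-2']
```

Under this scheme the collisions seen throughout this bug would resolve themselves as long as each individual host-level command finishes within the timeout; only a genuinely stuck command would still surface an error.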