Bug 1300220
| Summary: | [downstream clone] Concurrent multi-host network changes are allowed but prone to fail | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Pavel Zhukov <pzhukov> | |
| Component: | ovirt-engine | Assignee: | eraviv | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Michael Burman <mburman> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 3.5.5 | CC: | bcholler, bugs, danken, dholler, eraviv, fgarciad, jcall, lsurette, masayag, mburman, mkalinin, myakove, obockows, Rhev-m-bugs, sasundar, srevivo | |
| Target Milestone: | ovirt-4.3.5 | Keywords: | Reopened, ZStream | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Enhancement | ||
| Doc Text: | Feature: 
The change informs the user when a background multi-host operation has finished.
Reason: 
Some network changes (e.g. attach/detach labeled network, update network definitions) cause a background multi-host operation to be spawned. Because the user is not aware of that background operation, they might initiate another network change that spawns a second background operation, which collides with the first one if it has not finished yet.
Result: 
Make the user aware of the background processes that they start, so that they can avoid starting multiple processes that are prone to fail. | Story Points: | --- | |
| Clone Of: | 1061569 | |||
| : | 1369064 (view as bug list) | Environment: | ||
| Last Closed: | 2019-06-17 08:47:44 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1061569, 1850104 | |||
| Bug Blocks: | 1369064, 1520566 | |||
| 
        
Description
Pavel Zhukov, 2016-01-20 10:02:04 UTC

Steps to reproduce:
1) Attach label "test" to a NIC
2) Using the API, create 10 networks and attach the label to all of them
3) Attach the networks to the cluster

Result: The engine tries to call setupNetwork multiple times, once for every labeled network, but vdsm rejects the calls with:
Thread-363622::WARNING::2016-01-20 04:48:10,134::API::1393::vds::(setupNetworks) concurrent network verb already executing

When registering a new host, only ovirtmgmt is automatically created. Then a label can be assigned to a NIC, and as a result multiple networks need to be attached to the host. Is that what the customer is doing? (If so, it is unrelated to this bug, which speaks about *multi-host* operations.) Or is the customer defining new networks such as "vlan_1307" etc. and assigning a label to them?

Bimal, as far as I understand, the only current workaround is to wait. After assigning a label to a network, the network is applied to all hosts. The user must wait until this process ends before assigning a label to another network. Note that there is no indication that the process has ended.

The patch [1] intends to provide an audit log message when a network change has finished being applied to the hosts, so a user would be informed of when another network change can be submitted.
[1] https://gerrit.ovirt.org/#/c/61350/

Agreed to include a 3.6 change to have an audit log for network actions completion.

Tested on - 4.0.4-0.1.el7ev

When the multi-host operation failed:
Failed to apply changes of Network complex-net-2 on Data Center: DC1 to the hosts. Failed hosts are: puma22.scl.lab.tlv.redhat.com, orchid-vds2.qa.lab.tlv.redhat.com.
(2/2): Failed to apply changes for network(s) complex-net-2 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Failed to apply changes for network(s) complex-net-2 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network complex-net-2 on Data Center: DC1 was started.

When the multi-host operation succeeded:
Update network new_network1 on Data Center: DC1 has finished successfully.
(1/2): Successfully applied changes for network(s) new_network1 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Successfully applied changes for network(s) new_network1 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network new_network1 on Data Center: DC1 was started.

When the multi-host operation partially failed (only one host out of 2):
Failed to apply changes of Network n3 on Data Center: DC1 to the hosts. Failed hosts are: orchid-vds2.qa.lab.tlv.redhat.com
(2/2): Successfully applied changes for network(s) n3 on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Failed to apply changes for network(s) n3 on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(1/2): Applying network's changes on host orchid-vds2.qa.lab.tlv.redhat.com. (User: admin@internal-authz)
(2/2): Applying network's changes on host puma22.scl.lab.tlv.redhat.com. (User: admin@internal-authz)
Update network n3 on Data Center: DC1 was started.

So what we have now is an indication that the operation has started, and an indication of whether it ended with success or failure in the DC and on which hosts it failed. But there is no indication of the reason for the failure. Shouldn't the error messages include the reason we failed? Or is this the fix for this?

We are about to revert the patches.

Customers are requested to watch the event log for a (1/N) alert (where N is the number of hosts in the cluster), and not to do any other multi-host operation until the (N/N) alert is shown in the log.

Can you communicate comment #27 to the customer and see if this is good enough to resolve the use case for knowing when network changes are complete via audit log?

(In reply to Dan Kenigsberg from comment #27)
I have to correct my previous comment, as alert (N/N) may arrive before alerts from other hosts. The correct statement is: Customers are requested to watch the event log for a (n/N) alert (where N is the number of hosts in the cluster and n is in the 1..N range), and not to do any other multi-host operation until all (n/N) alerts have shown up in the log.

Michael has tested it (comment #26) already. Michael, could you please provide more info on Marina's questions in comment #47?

Cool. Sounds great. Thanks for testing it, Michael. Seems like we can close it CURRENTRELEASE. I will let you or Dan do that.

We have a reasonable workaround as of 4.2.7: we have an indication that a multi-host action is going on, and we have a SyncAll button to regenerate a failed multi-host connection.

*** Bug 1672762 has been marked as a duplicate of this bug. ***

In reply to comment 60: According to the attached screenshot (attachment 1555116), none of the attached networks are out-of-sync, so there is no reason for the 'sync all networks' button of the host to be enabled.

In 4.3 we have the current issues addressed:
The current version (rhvm-4.3.5-0.1.el7.noarch) has:
1. engine: add finish report on multi-host network commands
2. add setCustomValues to AuditLogableBase
3. engine: add locking on (Host)SetupNetworksCommand
   Add locking on (Host)SetupNetworksCommand in order to prevent multiple commands being executed on a single host.
4. we have an indication that a multi-host action is going on, and we have a SyncAll button to regenerate a failed multi-host connection.

The original report and issue will be addressed and fixed in BZ 1061569 only for 4.4.

(In reply to Michael Burman from comment #63)
> In 4.3 we have the current issues addressed:
>
> The current version(rhvm-4.3.5-0.1.el7.noarch) have:
>
> 1. engine: add finish report on multi-host network commands
> 2. add setCustomValues to AuditLogableBase
> 3. engine: add locking on (Host)SetupNetworksCommand Add locking on
> (Host)SetupNetworksCommand in order to prevent multiple commands being
> executed on a single host.
> 4. we have an indication that a multi-host action is going on, and we have a
> SyncAll button to regenerate a failed multi-host connection.
>
> The original report and issue will be addressed and fixed in BZ 1061569 only
> for 4.4

Burman,
Can you please specify on BZ 1061569 what the original problem is, with a reproduction scenario, and what the expected behaviour for 4.4 is?
Thanks

(In reply to eraviv from comment #64)
> (In reply to Michael Burman from comment #63)
> > In 4.3 we have the current issues addressed:
> >
> > The current version(rhvm-4.3.5-0.1.el7.noarch) have:
> >
> > 1. engine: add finish report on multi-host network commands
> > 2. add setCustomValues to AuditLogableBase
> > 3. engine: add locking on (Host)SetupNetworksCommand Add locking on
> > (Host)SetupNetworksCommand in order to prevent multiple commands being
> > executed on a single host.
> > 4. we have an indication that a multi-host action is going on, and we have a
> > SyncAll button to regenerate a failed multi-host connection.
> >
> > The original report and issue will be addressed and fixed in BZ 1061569 only
> > for 4.4
>
> Burman,
> Can you please specify on BZ 1061569 what the original problem is, with a
> reproduction scenario, and what the expected behaviour for 4.4 is?
> Thanks

Hi Eitan,
The original problem is described in the description (one example), and there are many scenarios in which this issue can be seen, not just one. We saw it on both the DEV and QE sides in several scenarios, some of which do not always reproduce.
The expected behavior for 4.4 will be to not allow and fail. I believe you added the wait timeout for the lock manager. We will need to ensure that such collisions on setup networks requests won't fail with your fix on 4.4 |
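
The workaround from the corrected comment #27 statement (wait until all (n/N) alerts have shown up before starting another multi-host operation) could be checked roughly as below. This is only a sketch under stated assumptions: the function name `multi_host_operation_finished` is hypothetical, the messages are assumed to be plain strings belonging to a single operation, and the per-host result lines ("Successfully applied" / "Failed to apply") are assumed to be the alerts that mark a host as done, based on the message wording quoted in this bug. It does not query the engine.

```python
import re

# Matches the "(n/N): ..." per-host result messages quoted in this bug.
# Assumption: only result lines count; "(n/N): Applying ..." lines do not.
RESULT = re.compile(r"^\((\d+)/(\d+)\): (Successfully applied|Failed to apply) changes")


def multi_host_operation_finished(messages):
    """Return True once every index n in 1..N has reported a result.

    `messages` is an iterable of audit log message strings that are
    assumed to belong to one multi-host operation (hypothetical input).
    """
    seen = set()
    total = None
    for message in messages:
        match = RESULT.match(message)
        if match is None:
            continue  # start/finish summaries and "Applying" lines are skipped
        n, reported_total = int(match.group(1)), int(match.group(2))
        if total is None:
            total = reported_total
        elif total != reported_total:
            raise ValueError("messages from different operations were mixed")
        seen.add(n)
    # Order does not matter: (N/N) may arrive before alerts from other hosts.
    return total is not None and seen == set(range(1, total + 1))
```

Collecting indices in a set rather than waiting for the (N/N) line reflects the correction in the thread that the (N/N) alert may arrive before alerts from other hosts.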