Created attachment 892293 [details]
logs from engine and vdsm

Description of problem:
Even though the network-replacement operation failed because vdsm could not re-connect to the storage server over a different network (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1094025), the engine does not revert the operation. The networks in the bond are not updated back to the ones that were in place before the failed change.

Version-Release number of selected component (if applicable):
AV7
vdsm-4.14.7-0.1.beta3.el6ev.x86_64
rhevm-3.4.0-0.15.beta3.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
On a shared DC with active iSCSI storage domain(s):
1. Create 3 new networks and attach them to the cluster with the 'Required' check-box checked.
2. Attach the networks to the cluster hosts' NICs.
3. Create a new iSCSI multipath bond (DC tab -> pick the relevant DC -> iSCSI multipath sub-tab -> New) and add 2 of the new networks, along with the targets, to it.
4. Put the iSCSI domain into maintenance and activate it, so that the connection to the storage is established over the new networks.
5. After the iSCSI domain is active, edit the multipath bond, uncheck the checked networks and pick the third network. Click 'OK'.
6. VDSM fails to perform the operation with IscsiNodeError.

Actual results:
The engine fails to catch the error from vdsm (as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1094023). After the failure, the networks under the bond are the new ones even though the operation failed. The engine does not roll back the failed iSCSI multipath bond update.

Expected results:
If the iSCSI multipath bond update fails, the engine should revert the changes; the update should not be left partially applied.
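The expected revert-on-failure behavior described above can be sketched as follows. This is a minimal illustration only, not actual oVirt engine code; all class, function, and network names are hypothetical:

```python
# Sketch of the expected behavior: if reconnecting to storage over the
# new networks fails, restore the bond's previous networks instead of
# leaving a partial update. All names here are hypothetical.

class IscsiBond:
    def __init__(self, networks):
        self.networks = list(networks)

def connect_storage_over(networks):
    """Stand-in for vdsm's connectStorageServer call; here it always
    fails, mimicking the IscsiNodeError seen in this bug."""
    raise RuntimeError("IscsiNodeError: connection timed out")

def edit_iscsi_bond(bond, new_networks):
    """Apply the new networks, reverting to the old ones if the storage
    reconnection fails (the behavior this report expects)."""
    old_networks = list(bond.networks)
    bond.networks = list(new_networks)
    try:
        connect_storage_over(bond.networks)
    except RuntimeError:
        bond.networks = old_networks  # roll back the partial update
        return False
    return True

bond = IscsiBond(["net1", "net2"])
ok = edit_iscsi_bond(bond, ["net3"])
print(ok, bond.networks)  # False, and the original networks are restored
```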
Additional info: logs from engine and vdsm

Error on vdsm:
Thread-1846::ERROR::2014-05-04 15:40:44,971::hsm::2379::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 359, in connect
    iscsi.addIscsiNode(self._iface, self._target, self._cred)
  File "/usr/share/vdsm/storage/iscsi.py", line 166, in addIscsiNode
    iscsiadm.node_login(iface.name, portalStr, targetName)
  File "/usr/share/vdsm/storage/iscsiadm.py", line 295, in node_login
    raise IscsiNodeError(rc, out, err)
IscsiNodeError: (8, ['Logging in to [iface: eth0.1, target: iqn.2008-05.com.xtremio:001e675b8ee1, portal: 10.35.160.3,3260] (multiple)'], ['iscsiadm: Could not login to [iface: eth0.1, target: iqn.2008-05.com.xtremio:001e675b8ee1, portal: 10.35.160.3,3260].', 'iscsiadm: initiator reported error (8 - connection timed out)', 'iscsiadm: Could not log into all portals'])

Error in engine:
2014-05-04 15:39:56,678 ERROR [org.ovirt.engine.core.bll.storage.EditIscsiBondCommand] (ajp-/127.0.0.1:8702-10) [37fea003] Command org.ovirt.engine.core.bll.storage.EditIscsiBondCommand throw exception: java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)
After discussing the bug with Elad and Allon, we decided not to roll back the iSCSI bond update, since there may be a scenario in which some of the hosts can connect to the storage while others cannot. Instead, we let the user decide whether to keep the situation as it is or to roll back manually. I've added a warning audit log to indicate that the operation succeeded but the engine encountered issues with the storage connection.
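The behavior chosen above can be sketched as follows: keep the bond update even when some hosts fail to connect, and emit a warning audit log instead of rolling back. This is a minimal illustration with hypothetical names, not actual engine code:

```python
# Sketch of the chosen behavior: on partial failure, do NOT revert the
# bond update; record a warning for the administrator instead. All
# names (hosts, networks, functions) are hypothetical.

def connect_host(host, networks):
    """Stand-in for a per-host storage connection attempt; returns True
    on success. Here 'hostB' fails, to illustrate partial success."""
    return host != "hostB"

def update_iscsi_bond(hosts, new_networks, audit_log):
    results = {h: connect_host(h, new_networks) for h in hosts}
    failed = [h for h, ok in results.items() if not ok]
    if failed:
        # Partial failure: some hosts are already connected over the new
        # networks, so rolling back could break them. Leave the decision
        # to the user and warn via the audit log.
        audit_log.append(
            "WARNING: iSCSI bond updated, but storage connection "
            "failed on hosts: %s" % ", ".join(failed))
    return results

log = []
results = update_iscsi_bond(["hostA", "hostB"], ["net3"], log)
print(results)  # {'hostA': True, 'hostB': False}
print(log)      # one warning entry naming hostB
```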
I'm unable to reach a situation in which vdsm fails to connect to the storage server via a network that participates in the iSCSI bond, as explained in the description (and as happened in bz #1102687). Therefore, closing as UPSTREAM.