Created attachment 892293 [details]
logs from engine and vdsm

Description of problem:
Even though the network-replacement operation failed because vdsm could not re-connect to the storage server over a different network (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1094025), the engine does not revert the operation. The networks in the bond are not updated back to the ones that were in place before the failed change.

Version-Release number of selected component (if applicable):
AV7
vdsm-4.14.7-0.1.beta3.el6ev.x86_64
rhevm-3.4.0-0.15.beta3.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
On a shared DC with active iSCSI storage domain(s):
1. Create 3 new networks and attach them to the cluster with the 'Required' check-box checked.
2. Attach the networks to the cluster hosts' NICs.
3. Create a new iSCSI multipath bond (DC tab -> pick the relevant DC -> iSCSI multipath sub-tab -> New) and add 2 of the new networks, along with the targets, to it.
4. Put the iSCSI domain into maintenance and activate it, so that the connection to the storage is established over the new networks.
5. After the iSCSI domain is active, edit the multipath bond, uncheck the checked networks and pick the third network. Click 'OK'.
6. VDSM fails to perform the operation with IscsiNodeError.

Actual results:
The engine fails to catch the error from vdsm (as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1094023). After the failure, the networks under the bond are the new ones even though the operation failed. The engine does not roll back the failed iSCSI multipath bond update.

Expected results:
If the iSCSI multipath bond update fails, the engine should revert the changes; the update should not be left partially applied.
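The expected revert-on-failure behavior described above can be sketched as follows. This is a minimal illustration only, not actual oVirt engine code; all class, function, and network names are hypothetical:

```python
# Sketch of the expected behavior: if reconnecting to storage over the
# new networks fails, restore the bond's previous networks instead of
# leaving a partial update. All names here are hypothetical.

class IscsiBond:
    def __init__(self, networks):
        self.networks = list(networks)

def connect_storage_over(networks):
    """Stand-in for vdsm's connectStorageServer call; here it always
    fails, mimicking the IscsiNodeError seen in this bug."""
    raise RuntimeError("IscsiNodeError: connection timed out")

def edit_iscsi_bond(bond, new_networks):
    """Apply the new networks, reverting to the old ones if the storage
    reconnection fails (the behavior this report expects)."""
    old_networks = list(bond.networks)
    bond.networks = list(new_networks)
    try:
        connect_storage_over(bond.networks)
    except RuntimeError:
        bond.networks = old_networks  # roll back the partial update
        return False
    return True

bond = IscsiBond(["net1", "net2"])
ok = edit_iscsi_bond(bond, ["net3"])
print(ok, bond.networks)  # False, and the original networks are restored
```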
Additional info: logs from engine and vdsm

Error on vdsm:
Thread-1846::ERROR::2014-05-04 15:40:44,971::hsm::2379::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2376, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 359, in connect
    iscsi.addIscsiNode(self._iface, self._target, self._cred)
  File "/usr/share/vdsm/storage/iscsi.py", line 166, in addIscsiNode
    iscsiadm.node_login(iface.name, portalStr, targetName)
  File "/usr/share/vdsm/storage/iscsiadm.py", line 295, in node_login
    raise IscsiNodeError(rc, out, err)
IscsiNodeError: (8, ['Logging in to [iface: eth0.1, target: iqn.2008-05.com.xtremio:001e675b8ee1, portal: 10.35.160.3,3260] (multiple)'], ['iscsiadm: Could not login to [iface: eth0.1, target: iqn.2008-05.com.xtremio:001e675b8ee1, portal: 10.35.160.3,3260].', 'iscsiadm: initiator reported error (8 - connection timed out)', 'iscsiadm: Could not log into all portals'])

Error in engine:
2014-05-04 15:39:56,678 ERROR [org.ovirt.engine.core.bll.storage.EditIscsiBondCommand] (ajp-/127.0.0.1:8702-10) [37fea003] Command org.ovirt.engine.core.bll.storage.EditIscsiBondCommand throw exception: java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)
After discussing the bug with Elad and Allon, we decided not to roll back the iSCSI bond update, since there may be a scenario in which some of the hosts can connect to the storage while others cannot. Instead, we let the user decide whether to keep the situation as it is or to roll back manually. I've added a warning audit log to indicate that the operation succeeded but the engine encountered issues with the storage connection.
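The behavior chosen above can be sketched as follows: keep the bond update even when some hosts fail to connect, and emit a warning audit log instead of rolling back. This is a minimal illustration with hypothetical names, not actual engine code:

```python
# Sketch of the chosen behavior: on partial failure, do NOT revert the
# bond update; record a warning for the administrator instead. All
# names (hosts, networks, functions) are hypothetical.

def connect_host(host, networks):
    """Stand-in for a per-host storage connection attempt; returns True
    on success. Here 'hostB' fails, to illustrate partial success."""
    return host != "hostB"

def update_iscsi_bond(hosts, new_networks, audit_log):
    results = {h: connect_host(h, new_networks) for h in hosts}
    failed = [h for h, ok in results.items() if not ok]
    if failed:
        # Partial failure: some hosts are already connected over the new
        # networks, so rolling back could break them. Leave the decision
        # to the user and warn via the audit log.
        audit_log.append(
            "WARNING: iSCSI bond updated, but storage connection "
            "failed on hosts: %s" % ", ".join(failed))
    return results

log = []
results = update_iscsi_bond(["hostA", "hostB"], ["net3"], log)
print(results)  # {'hostA': True, 'hostB': False}
print(log)      # one warning entry naming hostB
```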
I'm unable to reach a situation in which vdsm fails to connect to the storage server via a network that participates in the iSCSI bond, as explained in the description (and as happened in bz #1102687). Therefore, closing as UPSTREAM.