Bug 1787192 - Host fails to activate in RHV and goes to non-operational status when some of the iSCSI targets are down
Summary: Host fails to activate in RHV and goes to non-operational status when some of the iSCSI targets are down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.5.0
Target Release: 4.5.0
Assignee: Vojtech Juranek
QA Contact: Shir Fishbain
URL:
Whiteboard:
Duplicates: 1777888
Depends On:
Blocks:
 
Reported: 2020-01-01 07:15 UTC by nijin ashok
Modified: 2024-03-25 15:35 UTC
CC List: 17 users

Fixed In Version: vdsm-4.50.0.3
Doc Type: Bug Fix
Doc Text:
Cause: Login into multiple iSCSI targets/portals was done in a serial manner.
Consequence: When two or more portals/targets were unavailable, the engine command for connecting iSCSI domains timed out on the host, which resulted in the host being set as non-operational.
Fix: iSCSI logins now run in concurrent threads; the number of concurrent threads is configurable (via iscsi.parallel_logins) and defaults to 10 threads.
Result: Logins to unavailable iSCSI portals/targets still time out, but this doesn't affect the host itself (under the assumption that the number of iSCSI login threads is higher than the number of unavailable targets/portals).
Clone Of:
Environment:
Last Closed: 2022-05-26 17:22:44 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4619211 0 None None None 2020-01-01 07:24:53 UTC
Red Hat Product Errata RHSA-2022:4764 0 None None None 2022-05-26 17:23:06 UTC
oVirt gerrit 109933 0 master ABANDONED iscsi: Add config for irs.iscsi_node_login_retries 2021-02-17 12:06:39 UTC
oVirt gerrit 117582 0 master MERGED storage: move storage server related helpers from hsm 2021-12-06 16:19:23 UTC
oVirt gerrit 117583 0 master MERGED storage: move connectStorageOverIser() function into IscsiConnection 2021-12-15 16:52:05 UTC
oVirt gerrit 117584 0 master MERGED storage: add property for credentials 2021-12-15 16:57:05 UTC
oVirt gerrit 117585 0 master MERGED storage: move connect to iSCSI target into dedicated function 2021-12-15 16:57:35 UTC
oVirt gerrit 117586 0 master ABANDONED storage: add method for bulk iSCSI login 2021-12-20 14:18:02 UTC
oVirt gerrit 117634 0 master ABANDONED storage: create private helpers for (dis)connect iSCSI server 2021-12-20 14:17:56 UTC
oVirt gerrit 117661 0 master MERGED storage: refactor preparation of connection objects 2021-12-15 16:57:37 UTC
oVirt gerrit 117662 0 master MERGED storage: create storage server connection parent 2021-12-17 15:54:01 UTC
oVirt gerrit 117663 0 master ABANDONED storage: make storageServer.prepare_connections() private 2021-11-24 10:03:36 UTC
oVirt gerrit 117772 0 master MERGED storage: use specific class for connection 2021-12-17 15:54:08 UTC
oVirt gerrit 117773 0 master MERGED storage: run login to iSCSI nodes in parallel 2021-12-20 14:17:34 UTC
oVirt gerrit 117859 0 master MERGED storage: unify output of Connection (dis)connect methods 2021-12-17 15:54:04 UTC
oVirt gerrit 117860 0 master MERGED storage: provide (dis)connect functions for storageServer module 2021-12-17 15:54:06 UTC
oVirt gerrit 117886 0 master ABANDONED storage: don't test iSCSI connection over iSER 2021-12-06 12:28:49 UTC
oVirt gerrit 118018 0 master MERGED storage: move udev settle into dedicated method 2021-12-17 15:54:10 UTC
oVirt gerrit 118052 0 master MERGED storage: refactor error translation 2021-12-20 14:17:39 UTC
oVirt gerrit 118053 0 master MERGED storage: move logger to module level 2021-12-20 14:17:41 UTC
oVirt gerrit 118074 0 master MERGED iscsiadm: Remove unneeded locking 2021-12-20 14:17:36 UTC

Description nijin ashok 2020-01-01 07:15:51 UTC
Description of problem:

There are 3 targets for the iSCSI storage domain in three different networks, which are reachable through the interfaces shown below.

====
iscsiadm -m session -P3|egrep "Current Portal:|Iface Name:"
	Current Portal: 192.168.1.2:3260,1
		Iface Name: eth2
	Current Portal: 192.168.3.4:3260,1
		Iface Name: eth4
	Current Portal: 192.168.2.4:3260,1
		Iface Name: eth3
===

Three iSCSI multipath bonds are created, one for each of the 3 targets, under the Data Center. However, if any two of the targets go down and you activate a host from maintenance mode, it will fail to activate and will go into "non-operational" status, since "ConnectStorageServerVDSCommand" will time out on the engine side.

===
 2020-01-01 00:52:26,067-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [62de0282] Command 'ConnectStorageServerVDSCommand(HostName = 192.168.0.8, StorageServerConnectionManagementVDSParameters:{hostId='59fbfff0-826b-4cb9-a0ca-6a4bff87e41f', storagePoolId='7b3b5614-d62c-11e9-9f8c-566f0ebd0003', storageType='ISCSI', connectionList='[StorageServerConnections:{id='d480c8a1-f08b-49a5-ace2-19d89399f9c2', connection='192.168.1.2', iqn='iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='eth2', netIfaceName='eth2'}, StorageServerConnections:{id='d0956332-bd3a-41d4-a381-5958d8557df2', connection='192.168.2.4', iqn='iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='eth3', netIfaceName='eth3'}, StorageServerConnections:{id='374247ba-68c7-4beb-a64b-d89084ab7670', connection='192.168.3.4', iqn='iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='eth4', netIfaceName='eth4'}]', sendNetworkEventOnFailure='true'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

2020-01-01 00:52:26,078-05 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-21641) [62de0282] EVENT_ID: VDS_HOST_NOT_RESPONDING(9,027), Host 192.168.0.8 is not responding. Host cannot be fenced automatically because power management for the host is disabled.

2020-01-01 00:52:26,318-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [11871461] Host '192.168.0.8'(59fbfff0-826b-4cb9-a0ca-6a4bff87e41f) is already in NonOperational status for reason 'STORAGE_DOMAIN_UNREACHABLE'. SetNonOperationalVds command is skipped.
===

The issue occurs because connectStorageServer takes more than 3 minutes on the vdsm side, as each iSCSI login to an unreachable target takes around 2 minutes to time out.

===
2020-01-01 11:19:25,855+0530 INFO  (jsonrpc/5) [vdsm.api] START connectStorageServer(domType=3, spUUID=u'7b3b5614-d62c-11e9-9f8c-566f0ebd0003', conList=[{u'netIfaceName': u'eth2', u'id': u'd480c8a1-f08b-49a5-ace2-19d89399f9c2', u'connection': u'192.168.1.2', u'iqn': u'iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'ifaceName': u'eth2', u'password': '********', u'port': u'3260'}, {u'netIfaceName': u'eth3', u'id': u'd0956332-bd3a-41d4-a381-5958d8557df2', u'connection': u'192.168.2.4', u'iqn': u'iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'ifaceName': u'eth3', u'password': '********', u'port': u'3260'}, {u'netIfaceName': u'eth4', u'id': u'374247ba-68c7-4beb-a64b-d89084ab7670', u'connection': u'192.168.3.4', u'iqn': u'iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'ifaceName': u'eth4', u'password': '********', u'port': u'3260'}], options=None) from=::ffff:192.168.0.7,51106, flow_id=62de0282, task_id=9d1e713f-cffd-4147-b048-5f9164f24990 (api:48)

2020-01-01 11:19:26,390+0530 DEBUG (jsonrpc/5) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b -I eth3 -p 192.168.2.4:3260,1 -l (cwd None) (commands:198)

<----- Took 2 minutes to timeout --->

2020-01-01 11:21:26,610+0530 DEBUG (jsonrpc/5) [storage.Misc.excCmd] FAILED: <err> = 'iscsiadm: Could not login to [iface: eth3, target: iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b, portal: 192.168.2.4,3260].\niscsiadm: initiator reported error (8 - connection timed out)\niscsiadm: Could not log into all portals\n'; <rc> = 8 (commands:219)

2020-01-01 11:21:27,084+0530 DEBUG (jsonrpc/5) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b -I eth4 -p 192.168.3.4:3260,1 -l (cwd None) (commands:198)

<----- Took 2 minutes to timeout --->


2020-01-01 11:23:27,296+0530 DEBUG (jsonrpc/5) [storage.Misc.excCmd] FAILED: <err> = 'iscsiadm: Could not login to [iface: eth4, target: iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b, portal: 192.168.3.4,3260].\niscsiadm: initiator reported error (8 - connection timed out)\niscsiadm: Could not log into all portals\n'; <rc> = 8 (commands:219)

2020-01-01 11:23:27,892+0530 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 242.04 seconds (__init__:312)
====

So it took around 4 minutes to complete connectStorageServer; the engine times out before this and marks the host as "non-operational".
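For reference, a minimal sketch of the timing arithmetic (the 120-second per-login timeout and the engine's 180-second vdsTimeout default are taken from the comments below; the numbers are illustrative only):

# Why two unreachable targets are enough to exceed the engine-side timeout
# when logins run serially (illustrative numbers, not measured here).
LOGIN_TIMEOUT = 120       # seconds an iscsiadm login to an unreachable target takes to fail
ENGINE_TIMEOUT = 180      # engine 'vdsTimeout' default, per comment 3 below

unreachable_targets = 2
serial_login_time = unreachable_targets * LOGIN_TIMEOUT    # 240 seconds
print(serial_login_time > ENGINE_TIMEOUT)                  # True -> host goes non-operational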


Version-Release number of selected component (if applicable):

vdsm-4.30.24-2.el7ev.x86_64
rhvm-4.3.6.7-0.1.el7.noarch

How reproducible:

100%

Steps to Reproduce:

1. Create an iSCSI storage with 3 target IPs.

2. Add the storage domain in RHV by logging in to all 3 targets.

3. Add 3 iSCSI bonds, one for each target.

4. Shut down the interfaces of two of the targets on the storage side.

5. Put one host into maintenance mode and try to activate it. It will go into non-operational status.

Actual results:

Host fails to activate in RHV and goes to non-operational status when some of the iSCSI targets are down.

Expected results:

Since the storage is still available through one path, the host should not be marked as "non-operational". I think there should be a different timeout for connectStorageServer, or it should somehow intelligently detect whether the iSCSI targets are accessible before trying to log in, so that it does not wait 2 minutes for the login to each unavailable target to time out.
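For illustration only, the kind of reachability pre-check suggested above could look like the following (a hypothetical helper, not part of VDSM; it just tests whether the portal's TCP port accepts a connection before spending 2 minutes on an iscsiadm login):

import socket

def portal_reachable(host, port=3260, timeout=5):
    """Return True if the iSCSI portal accepts a TCP connection within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: skip (or deprioritize) login attempts for portals that are clearly down.
for portal in ["192.168.1.2", "192.168.2.4", "192.168.3.4"]:
    print(portal, "reachable" if portal_reachable(portal) else "down - would skip login attempt")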

Additional info:

Comment 2 nijin ashok 2020-01-01 07:24:53 UTC
*** Bug 1777888 has been marked as a duplicate of this bug. ***

Comment 3 Amit Bawer 2020-01-14 11:34:21 UTC
Snippet from Vdsm hsm.connectStorageServer():

        res = []
        connections = []
        for conDef in conList:
            conInfo = _connectionDict2ConnectionInfo(domType, conDef)
            conObj = storageServer.ConnectionFactory.createConnection(conInfo)
            try:
                self._connectStorageOverIser(conDef, conObj, domType)
                conObj.connect()
            except Exception as err:
                self.log.error(
                    "Could not connect to storageServer", exc_info=True)
                status, _ = self._translateConnectionError(err)
            else:
                status = 0
                connections.append(conObj)

            res.append({'id': conDef["id"], 'status': status})


Engine uses the 'vdsTimeout' config (default is 180 seconds) to determine that a vds command has timed out.
iscsiadm default command execution timeout is 120 seconds (set by node.session.timeo.replacement_timeout in iscsi conf).

If we parallelized the connections loop, we could* finish all connect() attempts within 2 minutes, and the Engine would not be expected to deem this a timed-out vds command.

* Disclaimers:

The IscsiConnection.connect() runs the following commands which have to be executed in sequence:

1. iscsiadm new node
2. iscsiadm login node
3. iscsiadm update node
4. udev settle (for 5 seconds by default)

If any command in (1-3) hits its timeout, we break out of connect() with an IscsiNodeError exception;
however, a bad scenario could be that we spend 1:59 minutes on each step successfully, totalling almost 6 minutes for the entire connect() flow.

So I think that we should set the default iscsiadm timeout to 40 seconds at most, assuming we go for parallelizing the connect() flows per target connection.
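A rough sketch of what parallelizing the loop in the snippet above could look like (illustrative only; it uses a plain ThreadPoolExecutor rather than VDSM's own helpers, and prepare_connection()/translate_error() are hypothetical stand-ins for the factory and error-translation calls shown above):

from concurrent.futures import ThreadPoolExecutor

def connect_one(con_def):
    # Stand-in for createConnection() + _connectStorageOverIser() + conObj.connect()
    con_obj = prepare_connection(con_def)          # hypothetical helper
    try:
        con_obj.connect()
        return {'id': con_def['id'], 'status': 0}, con_obj
    except Exception as err:
        return {'id': con_def['id'], 'status': translate_error(err)}, None   # hypothetical helper

def connect_all(con_list, max_workers=10):
    # Each unreachable target still needs ~120 s for iscsiadm to give up, but the
    # waits now overlap instead of adding up, so the engine-side timeout is not hit
    # (as long as the number of workers covers the number of dead targets).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(connect_one, con_list))
    statuses = [status for status, _ in results]
    connections = [obj for _, obj in results if obj is not None]
    return statuses, connections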

Comment 4 Amit Bawer 2020-01-14 18:01:33 UTC
(In reply to Amit Bawer from comment #3)
> Snippet from Vdsm hsm.connectStorageServer():
> 
>         res = []
>         connections = []
>         for conDef in conList:
>             conInfo = _connectionDict2ConnectionInfo(domType, conDef)
>             conObj =
> storageServer.ConnectionFactory.createConnection(conInfo)
>             try:
>                 self._connectStorageOverIser(conDef, conObj, domType)
>                 conObj.connect()
>             except Exception as err:
>                 self.log.error(
>                     "Could not connect to storageServer", exc_info=True)
>                 status, _ = self._translateConnectionError(err)
>             else:
>                 status = 0
>                 connections.append(conObj)
> 
>             res.append({'id': conDef["id"], 'status': status})
> 
> 
> Engine uses 'vdsTimeout' config (default is 180 seconds) to determine that
> vds command has timed out.
> iscsiadm default command execution timeout is 120 seconds (set by
> node.session.timeo.replacement_timeout in iscsi conf).

In my recreation, and also in the original vdsm.log, the hanging command is the iscsiadm login (-l) for 120 seconds.
The iscsiadm login timeout is manipulated by the following /etc/iscsi/iscsid.conf parameters:

 node.session.initial_login_retry_max = 8
 node.conn[0].timeo.login_timeout = 15 

setting the iscsiadm -l command timeout to 8*15=120 seconds by default.

Setting "node.session.initial_login_retry_max = 1" and restarting iscsid on the host allowed connecting to the SD when 3 of the 4 target interfaces were down while re-activating the host from maintenance, so the current workaround for this issue is to adjust the iscsid setup on the host.
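For convenience, the workaround described above roughly amounts to the following /etc/iscsi/iscsid.conf change on the host (values taken from this comment; iscsid has to be restarted afterwards, and the change applies host-wide, so treat it as a workaround rather than a permanent setting):

# /etc/iscsi/iscsid.conf -- workaround only: 1 retry * 15 s = ~15 s per failing login
node.session.initial_login_retry_max = 1
node.conn[0].timeo.login_timeout = 15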

> 
> If we would parallelize the connections loop, we could* finish all connect()
> attempts within 2 minutes and Engine is not expected to deem this as a
> timed-out vds command.
> 
> * Disclaimers:
> 
> The IscsiConnection.connect() runs the following commands which have to be
> executed in sequence:
> 
> 1. iscsiadm new node
> 2. iscsiadm login node
> 3. iscsiadm update node
> 4. udev settle (for 5 seconds by default)
> 
> If any command in (1-3) hits timeout we break from connect() with
> IscsiNodeError exception,
> however bad scenario could be that we spend 1:59 minutes on each step
> successfully, totalling 6 minutes for the entire connect() flow.
> 
> So I think that we should set the default iscsiadm timeout to 40 seconds
> tops, assuming we go for parallezing the connect() flows per each target
> connection.

Comment 5 Michal Skrivanek 2020-03-17 15:56:20 UTC
Does this track making the calls in parallel, changing the iscsid.conf timeout to 40s, or both?

Comment 6 Amit Bawer 2020-03-17 16:09:58 UTC
It provides information on what the issue is and what the resolution might be:

a. Parallelize the iSCSI connect calls for multiple target interfaces.
b. Reduce the wait time for a login attempt on an offline target interface.
c. Any combination of the above.

Comment 8 Amit Bawer 2020-06-17 13:24:16 UTC
It seems that concurrent iscsiadm logins cannot be achieved, as iscsiadm itself is racy and we have a transaction lock for vdsm iscsiadm command sessions, added
back in https://gerrit.ovirt.org/#/c/2750/ for bz802347.

So using a thread per connection attempt will be effectively sequential, as each thread would have to wait for another failing interface login to spend its 120 seconds
of retries before acquiring the transaction lock.

Hence, IMO it should be a config change for iscsiadm to shorten the retry period in case of login failures.

Comment 9 Nir Soffer 2020-06-29 16:07:56 UTC
(In reply to Amit Bawer from comment #8)
> Seems that concurrent iscsiadm logins cannot be achieved as iscsiadm itself
> is racy and we have transaction lock for vdsm iscsiadm command sessions added
> back on https://gerrit.ovirt.org/#/c/2750/ for bz802347.

Bug 802347 does not contain any useful information about the problem, and there
is no analysis of the issue. There is also no input from iscsiadm developers
about the "race" in iscsiadm.

The patch also does not contain any justification about the global lock added.

I think we need to check if the lock is really required and remove it if
possible.

We also need to discuss with iscsiadm authors why we have so many login 
attempts and if using lower number of attempts is a good idea.

Comment 10 Amit Bawer 2020-06-30 13:11:19 UTC
(In reply to Nir Soffer from comment #9)
> (In reply to Amit Bawer from comment #8)
> > Seems that concurrent iscsiadm logins cannot be achieved as iscsiadm itself
> > is racy and we have transaction lock for vdsm iscsiadm command sessions added
> > back on https://gerrit.ovirt.org/#/c/2750/ for bz802347.
> 
> Bug 802347 does not contain any useful information about the problem, and
> there
> is no analysis of the issue. There is also no input from iscsiadm developers
> about the "race" in iscsiadm.
> 
> The patch also does not contain any justification about the global lock
> added.
> 
> I think we need to check if the lock is really required and remove it if
> possible.

Would try a lockless version.

> 
> We also need to discuss with iscsiadm authors why we have so many login 
> attempts and if using lower number of attempts is a good idea.

Raised questions at the open-iscsi mailing list: https://groups.google.com/g/open-iscsi/c/OHOdIm1W274/m/KW-2saC7AgAJ

Comment 11 Amit Bawer 2020-07-06 09:49:08 UTC
According to the reply from the open-iscsi maintainer about iscsiadm being racy [1], and the iSCSI login errors I got when
trying a parallel lockless version, it seems more promising at this point to use only a configuration change,
as applied in the attached patch.

[1] https://groups.google.com/forum/embed/#!msg/open-iscsi/OHOdIm1W274/9l5NcPQHBAAJ

Comment 12 Amit Bawer 2020-07-29 13:58:14 UTC
Pending input from open-iscsi [2] about testing different login schemes:

We have a limitation on our RHEV system where all logins to the listed iSCSI targets should finish within 180 seconds in total.
In our current implementation we serialize the iscsiadm node logins one after the other,
each for a specific target and portal. In such a scheme, each login would wait 120 seconds in case a portal is down
(default 15-second login timeout * 8 login retries), so if we have 2 or more connections down, we spend at least 240 seconds,
which exceeds our 180-second time limit and the entire operation is considered failed (RHEV-wise).

Testing [1] of different login schemes is summarized in the following table (logins to 2 targets with 2 portals each).
It seems that logging in to all nodes at once after creating them, as suggested in the previous answer here, is comparable in total time spent
to doing specific node logins concurrently (i.e. running iscsiadm -m node -T target -p portal -I interface -l in parallel for
each target-portal), both when all portals are online and when one portal is down:

Login scheme               On-line  Portals       Active Sessions      Total Login Time (sec)
---------------------------------------------------------------------------------------------
 All at once                    2/2                      4                     2.1
 All at once                    1/2                      2                     120.2
 Serial target-portal           2/2                      4                     8.5
 Serial target-portal           1/2                      2                     243.5
 Concurrent target-portal       2/2                      4                     2.1
 Concurrent target-portal       1/2                      2                     120.1


Using concurrent target-portal logins seems preferable from our perspective, as it allows connecting only to the
specified target and portals without the risk of intermixing with other potential iSCSI targets.

The node creation part is kept serial in all tests here, as we have seen it may result in iSCSI DB issues if run in parallel.
But running only the node logins in parallel doesn't seem to cause issues for at least 1000 tries of our tests.

The question to be asked here is: is this advisable by open-iscsi?
I know I have already been answered that iscsiadm is racy, but does that apply to node logins as well?

The other option is to use one login-all call without parallelism, but that would have other implications on our system to consider.

[1] https://gerrit.ovirt.org/#/c/110432
[2] https://groups.google.com/g/open-iscsi/c/OHOdIm1W274/m/XpFw5ikzBwAJ
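A minimal sketch of the scheme described above (serial node creation, then concurrent per-target-portal logins), using plain subprocess and threads rather than VDSM's internal helpers; the target IQN and portal list are examples taken from this bug, and the whole thing is illustrative only:

import subprocess
from concurrent.futures import ThreadPoolExecutor

TARGET = "iqn.2003-01.org.linux-iscsi.storage.x8664:sn.52fd3f4b2f0b"     # example IQN from this bug
PORTALS = ["192.168.1.2:3260,1", "192.168.2.4:3260,1", "192.168.3.4:3260,1"]

def iscsiadm(*args):
    return subprocess.run(["iscsiadm", "-m", "node", "-T", TARGET, *args],
                          capture_output=True, text=True)

# 1. Node creation kept serial (running it in parallel caused iSCSI DB issues in testing).
for portal in PORTALS:
    iscsiadm("-I", "default", "-p", portal, "--op=new")

# 2. Logins run concurrently - a dead portal costs one 120 s wait, not one wait per portal.
with ThreadPoolExecutor(max_workers=len(PORTALS)) as pool:
    results = list(pool.map(lambda p: iscsiadm("-I", "default", "-p", p, "-l"), PORTALS))

for portal, result in zip(PORTALS, results):
    print(portal, "OK" if result.returncode == 0 else "failed (rc=%d)" % result.returncode)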

Comment 13 Amit Bawer 2020-07-29 14:09:23 UTC
Chris, could you provide your advice about concurrent iscsiadm logins discussed in comment #12?

Comment 14 Amit Bawer 2020-08-06 08:39:23 UTC
Another point I'd like to ask about is the iSER fallback that we have:

Currently we check during the connection flow if 'iser' is set in iscsi_default_ifaces in
our configuration. If yes, we first check whether it's working on the server side by attempting:

iscsiadm -m node -T target -I iser -p portal -l
iscsiadm -m node -T target -I iser -p portal -u

If the login/logout worked, the 'iser' interface setup is kept instead of 'default'; otherwise
it falls back to 'default'. This is used later for the actual node login.

The thing is that this check can also waste valuable time when the portal is down.
Is there a way to fall back within the iscsiadm command itself, or to prefer a specific interface type
when trying all/parallel logins for the same target+portal but with different interface types?

Comment 18 Chris Leech 2021-10-05 17:31:07 UTC
(In reply to Amit Bawer from comment #12)
> Pending input from open-iscsi [2] about testing different login schemes:
> 
> We have a limitation on our RHEV system where all logins to listed iscsi
> targets should finish within 180 seconds in total.
> In our current implementation we serialize the iscsiadm node logins one
> after the other,
> each is for specific target and portal. In such scheme, each login would
> wait 120 seconds in case a portal is down
> (default 15 seconds login timeout * 8 login retries), so if we have 2 or
> more connections down, we spend at least 240 seconds
> which exceeds our 180 seconds time limit and the entire operation is
> considered to be failed (RHEV-wise).
> 
> Testing [1] different login schemes is summarized in the following table
> (logins to 2 targets with 2 portals each).
> It seems that either login-all nodes after creating them, as suggested in
> previous answer here, compares in total time spent
> with doing specific node logins concurrently (i.e. running iscsiadm -m node
> -T target -p portal -I interface -l in parallel per
> each target-portal), for both cases of all portals being online and when one
> portal is down:
> 
> Login scheme               On-line  Portals       Active Sessions      Total Login Time (sec)
> ---------------------------------------------------------------------------------------------
>  All at once                    2/2                      4                     2.1
>  All at once                    1/2                      2                     120.2
>  Serial target-portal           2/2                      4                     8.5
>  Serial target-portal           1/2                      2                     243.5
>  Concurrent target-portal       2/2                      4                     2.1
>  Concurrent target-portal       1/2                      2                     120.1
> 
> 
> Using concurrent target-portal logins seems to be preferable in our
> perspective as it allows to connect only to the
> specified target and portals without the risk of intermixing with other
> potential iscsi targets.
> 
> The node creation part is kept serial in all tests here and we have seen it
> may result in the iscsi DB issues if run in parallel.
> But using only node logins in parallel doesn't seems to have issues for at
> least 1000 tries of out tests.
> 
> The question to be asked here is it advisable by open-iscsi?
> I know I have been answered already that iscsiadm is racy, but does it
> applies to node logins as well?
> 
> The other option is to use one login-all call without parallelism, but that
> would have other implications on our system to consider.
> 
> [1] https://gerrit.ovirt.org/#/c/110432
> [2] https://groups.google.com/g/open-iscsi/c/OHOdIm1W274/m/XpFw5ikzBwAJ

(apologies for the stale needinfos, but I see this is still active again so here goes)

Concurrent iscsiadm commands should be fine; iscsid processes them in the order received and handles async tasks for long-running commands. The iscsiadm-to-iscsid control messaging is over connected UNIX socketpairs, so responses are targeted to the correct iscsiadm process. With a single iscsid directing the kernel code over netlink there should be no issues.

The big iscsistart issues were a combination of netlink responses from the kernel being broadcast confusing multiple iscsistart "back half" (essentially iscsid) instances, and UNIX abstract namespace collisions causing multiple iscsistart "front half" (essentially iscsiadm) to end up talking to a single back half which could be terminated before handling all the requests.  These should be addressed now, but using iscsid and iscsiadm is preferred.

Comment 20 Vojtech Juranek 2021-12-20 21:46:23 UTC
Verification flow:
==================

* create 3 NICs on the iSCSI target server (if it's a VM, by adding NICs into HW)
* create 3 portals on the iSCSI target server (e.g. targetcli /iscsi/iqn.2016-01.com.ovirt:444/tpg1/portals create $PORTAL_IP)
* in oVirt discover all three portals and log into all of them
* create iSCSI SD from LUN(s) provided by these portals
* put one of the hosts into maintenance
* make sure the host is already disconnected from all iSCSI targets (e.g. by running iscsiadm -m session -P3|egrep "Current Portal:|Iface Name:" - it should not show anything)
* on the iSCSI target put 2 NICs down (e.g. nmcli connection down $CONN_NAME)
* put the host back online


Without fix:
============

engine command times out:

2021-12-20 06:42:41,010-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-86) [4bf1869b] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM stream8-host-test1 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
...
2021-12-20 06:42:41,035-05 ERROR [org.ovirt.engine.core.bll.storage.pool.ConnectHostToStoragePoolServersCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-86) [4bf1869b] Command 'org.ovirt.engine.core.bll.storage.pool.ConnectHostToStoragePoolServersCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)

and host is set as non-operational:

2021-12-20 06:42:41,114-05 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-86) [274c39b] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host stream8-host-test1 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center Default. Setting Host state to Non-Operational. 


With fix (vdsm log with debug enabled):
=======================================

1. API call
-------------

2021-12-20 07:39:13,888-0500 INFO  (jsonrpc/7) [vdsm.api] START connectStorageServer(domType=3, spUUID='19016d46-17b4-11ec-9517-5254003aa1f8', conList=[{'password': '********', 'port': '3260', 'iqn': 'iqn.2016-01.com.ovirt:444', 'connection': '192.168.122.254', 'ipv6_enabled': 'false', 'id': '994a711a-60f3-411a-aca2-0b60f01e8b8c', 'user': '', 'tpgt': '1'}, {'password': '********', 'port': '3260', 'iqn': 'iqn.2016-01.com.ovirt:444', 'connection': '192.168.122.115', 'ipv6_enabled': 'false', 'id': '60b3ae07-db32-470d-9ed8-d1f99efcc99c', 'user': '', 'tpgt': '1'}, {'password': '********', 'port': '3260', 'iqn': 'iqn.2016-01.com.ovirt:444', 'connection': '192.168.122.210', 'ipv6_enabled': 'false', 'id': '14cd4d9d-d7ee-4a3a-9928-d9c37ec34a6f', 'user': '', 'tpgt': '1'}, {'password': '********', 'port': '3260', 'iqn': 'iqn.2016-01.com.ovirt:444', 'connection': '192.168.122.235', 'ipv6_enabled': 'false', 'id': 'f10ac130-6988-4e34-ae46-fef4fd9e7de2', 'user': '', 'tpgt': '1'}, {'password': '********', 'port': '3260', 'iqn': 'iqn.2016-01.com.ovirt:444', 'connection': '192.168.122.72', 'ipv6_enabled': 'false', 'id': '971386cf-1b46-4dea-92f4-9439237c9767', 'user': '', 'tpgt': '1'}]) from=::ffff:192.168.122.192,55666, flow_id=27e1c881, task_id=6b67e007-7e70-4dc8-995f-7d4146d814ee (api:48)


2. Serial part - setting up nodes
--------------------------------------------

2021-12-20 07:39:13,888-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.254:3260,1 --op=new (cwd None) (commands:154)
2021-12-20 07:39:13,929-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:13,929-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.254:3260,1 -n node.startup -v manual --op=update (cwd None) (commands:154)
2021-12-20 07:39:13,968-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:13,968-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m iface -I default (cwd None) (commands:154)
2021-12-20 07:39:14,008-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,009-0500 DEBUG (jsonrpc/7) [storage.iscsi] iface.net_ifacename not provided, skipping rp filter setup (iscsi:586)
2021-12-20 07:39:14,009-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.115:3260,1 --op=new (cwd None) (commands:154)
2021-12-20 07:39:14,043-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,044-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.115:3260,1 -n node.startup -v manual --op=update (cwd None) (commands:154)
2021-12-20 07:39:14,073-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,074-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m iface -I default (cwd None) (commands:154)
2021-12-20 07:39:14,101-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,102-0500 DEBUG (jsonrpc/7) [storage.iscsi] iface.net_ifacename not provided, skipping rp filter setup (iscsi:586)
2021-12-20 07:39:14,102-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.210:3260,1 --op=new (cwd None) (commands:154)
2021-12-20 07:39:14,129-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,129-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.210:3260,1 -n node.startup -v manual --op=update (cwd None) (commands:154)
2021-12-20 07:39:14,156-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,156-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m iface -I default (cwd None) (commands:154)
2021-12-20 07:39:14,183-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,183-0500 DEBUG (jsonrpc/7) [storage.iscsi] iface.net_ifacename not provided, skipping rp filter setup (iscsi:586)
2021-12-20 07:39:14,184-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.235:3260,1 --op=new (cwd None) (commands:154)
2021-12-20 07:39:14,211-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,211-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.235:3260,1 -n node.startup -v manual --op=update (cwd None) (commands:154)
2021-12-20 07:39:14,238-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,238-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m iface -I default (cwd None) (commands:154)
2021-12-20 07:39:14,266-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,266-0500 DEBUG (jsonrpc/7) [storage.iscsi] iface.net_ifacename not provided, skipping rp filter setup (iscsi:586)
2021-12-20 07:39:14,266-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.72:3260,1 --op=new (cwd None) (commands:154)
2021-12-20 07:39:14,294-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,294-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.72:3260,1 -n node.startup -v manual --op=update (cwd None) (commands:154)
2021-12-20 07:39:14,321-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,322-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m iface -I default (cwd None) (commands:154)
2021-12-20 07:39:14,350-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)


3. Concurrent part - login start
------------------------------------------

2021-12-20 07:39:14,351-0500 DEBUG (iscsi-login/0) [root] START thread <Thread(iscsi-login/0, started daemon 139636961634048)> (func=<function tmap.<locals>.worker at 0x7effd843ca60>, args=(), kwargs={}) (concurrent:258)
2021-12-20 07:39:14,351-0500 DEBUG (iscsi-login/0) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.254:3260,1 -l (cwd None) (commands:154)
2021-12-20 07:39:14,354-0500 DEBUG (iscsi-login/1) [root] START thread <Thread(iscsi-login/1, started daemon 139636928063232)> (func=<function tmap.<locals>.worker at 0x7effd843ca60>, args=(), kwargs={}) (concurrent:258)
2021-12-20 07:39:14,356-0500 DEBUG (iscsi-login/1) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.115:3260,1 -l (cwd None) (commands:154)
2021-12-20 07:39:14,362-0500 DEBUG (iscsi-login/2) [root] START thread <Thread(iscsi-login/2, started daemon 139636902885120)> (func=<function tmap.<locals>.worker at 0x7effd843ca60>, args=(), kwargs={}) (concurrent:258)
2021-12-20 07:39:14,364-0500 DEBUG (iscsi-login/2) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.210:3260,1 -l (cwd None) (commands:154)
2021-12-20 07:39:14,370-0500 DEBUG (iscsi-login/3) [root] START thread <Thread(iscsi-login/3, started daemon 139637372745472)> (func=<function tmap.<locals>.worker at 0x7effd843ca60>, args=(), kwargs={}) (concurrent:258)
2021-12-20 07:39:14,372-0500 DEBUG (iscsi-login/3) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.235:3260,1 -l (cwd None) (commands:154)
2021-12-20 07:39:14,378-0500 DEBUG (iscsi-login/4) [root] START thread <Thread(iscsi-login/4, started daemon 139636944848640)> (func=<function tmap.<locals>.worker at 0x7effd843ca60>, args=(), kwargs={}) (concurrent:258)
2021-12-20 07:39:14,379-0500 DEBUG (iscsi-login/4) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.72:3260,1 -l (cwd None) (commands:154)


4. Concurrent part - login to available targets succeeded
------------------------------------------------------------------------------

2021-12-20 07:39:14,516-0500 DEBUG (iscsi-login/1) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,516-0500 DEBUG (iscsi-login/1) [root] FINISH thread <Thread(iscsi-login/1, stopped daemon 139636928063232)> (concurrent:261)
2021-12-20 07:39:14,627-0500 DEBUG (iscsi-login/0) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,627-0500 DEBUG (iscsi-login/0) [root] FINISH thread <Thread(iscsi-login/0, stopped daemon 139636961634048)> (concurrent:261)
2021-12-20 07:39:14,845-0500 DEBUG (iscsi-login/2) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:39:14,845-0500 DEBUG (iscsi-login/2) [root] FINISH thread <Thread(iscsi-login/2, stopped daemon 139636902885120)> (concurrent:261)


5. Concurrent part - login to unavailable targets times out
--------------------------------------------------------------------------------

2021-12-20 07:41:14,830-0500 DEBUG (iscsi-login/3) [common.commands] FAILED: <err> = b'iscsiadm: Could not login to [iface: default, target: iqn.2016-01.com.ovirt:444, portal: 192.168.122.235,3260].\niscsiadm: initiator reported error (8 - connection timed out)\niscsiadm: Could not log into all portals\n'; <rc> = 8 (commands:99)
2021-12-20 07:41:14,830-0500 DEBUG (iscsi-login/3) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m iface (cwd None) (commands:154)
2021-12-20 07:41:14,840-0500 DEBUG (iscsi-login/4) [common.commands] FAILED: <err> = b'iscsiadm: Could not login to [iface: default, target: iqn.2016-01.com.ovirt:444, portal: 192.168.122.72,3260].\niscsiadm: initiator reported error (8 - connection timed out)\niscsiadm: Could not log into all portals\n'; <rc> = 8 (commands:99)
2021-12-20 07:41:14,843-0500 DEBUG (iscsi-login/4) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m iface (cwd None) (commands:154)
2021-12-20 07:41:14,893-0500 DEBUG (iscsi-login/4) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:41:14,893-0500 DEBUG (iscsi-login/4) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.72:3260,1 -u (cwd None) (commands:154)
2021-12-20 07:41:14,903-0500 DEBUG (iscsi-login/3) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:41:14,930-0500 DEBUG (iscsi-login/4) [common.commands] FAILED: <err> = b'iscsiadm: No matching sessions found\n'; <rc> = 21 (commands:99)
2021-12-20 07:41:14,930-0500 DEBUG (iscsi-login/4) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m iface (cwd None) (commands:154)
2021-12-20 07:41:14,958-0500 DEBUG (iscsi-login/4) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:41:14,959-0500 DEBUG (iscsi-login/4) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.72:3260,1 --op=delete (cwd None) (commands:154)
2021-12-20 07:41:14,988-0500 DEBUG (iscsi-login/4) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:41:14,988-0500 DEBUG (iscsi-login/4) [storage.iscsi] iface.net_ifacename not provided, skipping rp filter setup (iscsi:586)
2021-12-20 07:41:14,988-0500 DEBUG (iscsi-login/4) [root] FINISH thread <Thread(iscsi-login/4, stopped daemon 139636944848640)> (concurrent:261)
2021-12-20 07:41:14,989-0500 DEBUG (iscsi-login/3) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.235:3260,1 -u (cwd None) (commands:154)
2021-12-20 07:41:14,992-0500 ERROR (jsonrpc/7) [storage.storageServer] Could not login to storageServer (storageServer:637)
NoneType: None
2021-12-20 07:41:15,018-0500 DEBUG (iscsi-login/3) [common.commands] FAILED: <err> = b'iscsiadm: No matching sessions found\n'; <rc> = 21 (commands:99)
2021-12-20 07:41:15,019-0500 DEBUG (iscsi-login/3) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m iface (cwd None) (commands:154)
2021-12-20 07:41:15,046-0500 DEBUG (iscsi-login/3) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:41:15,046-0500 DEBUG (iscsi-login/3) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2016-01.com.ovirt:444 -I default -p 192.168.122.235:3260,1 --op=delete (cwd None) (commands:154)
2021-12-20 07:41:15,075-0500 DEBUG (iscsi-login/3) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:41:15,075-0500 DEBUG (iscsi-login/3) [storage.iscsi] iface.net_ifacename not provided, skipping rp filter setup (iscsi:586)
2021-12-20 07:41:15,076-0500 DEBUG (iscsi-login/3) [root] FINISH thread <Thread(iscsi-login/3, stopped daemon 139637372745472)> (concurrent:261)
2021-12-20 07:41:15,076-0500 ERROR (jsonrpc/7) [storage.storageServer] Could not login to storageServer (storageServer:637)


6. Waiting for udev events
------------------------------------

2021-12-20 07:41:15,076-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /sbin/udevadm settle --timeout=5 (cwd None) (commands:154)
2021-12-20 07:41:15,090-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)


7. Refreshing domain after login
--------------------------------------------

2021-12-20 07:41:15,091-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Refreshing storage domain cache (resize=True) (sdc:80)
2021-12-20 07:41:15,091-0500 DEBUG (jsonrpc/7) [storage.samplingmethod] Trying to enter sampling method (vdsm.storage.iscsi.<function rescan at 0x7f0028047268>) (misc:343)
2021-12-20 07:41:15,091-0500 DEBUG (jsonrpc/7) [storage.samplingmethod] Got in to sampling method (misc:346)
2021-12-20 07:41:15,091-0500 INFO  (jsonrpc/7) [storage.iscsi] Scanning iSCSI devices (iscsi:449)
2021-12-20 07:41:15,091-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/iscsiadm -m session -R (cwd None) (commands:154)
2021-12-20 07:41:15,133-0500 INFO  (jsonrpc/7) [storage.iscsi] Scanning iSCSI devices: 0.04 seconds (utils:390)
2021-12-20 07:41:15,133-0500 DEBUG (jsonrpc/7) [storage.samplingmethod] Returning last result (misc:353)
2021-12-20 07:41:15,133-0500 DEBUG (jsonrpc/7) [storage.samplingmethod] Trying to enter sampling method (vdsm.storage.hba.<function rescan at 0x7f00287b2d08>) (misc:343)
2021-12-20 07:41:15,133-0500 DEBUG (jsonrpc/7) [storage.samplingmethod] Got in to sampling method (misc:346)
2021-12-20 07:41:15,133-0500 INFO  (jsonrpc/7) [storage.hba] Scanning FC devices (hba:59)
2021-12-20 07:41:15,190-0500 INFO  (jsonrpc/7) [storage.hba] Scanning FC devices: 0.06 seconds (utils:390)
2021-12-20 07:41:15,190-0500 DEBUG (jsonrpc/7) [storage.samplingmethod] Returning last result (misc:353)
2021-12-20 07:41:15,190-0500 INFO  (jsonrpc/7) [storage.multipath] Waiting until multipathd is ready (multipath:112)
2021-12-20 07:41:15,190-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /sbin/udevadm settle --timeout=10 (cwd None) (commands:154)
2021-12-20 07:41:15,205-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:99)
2021-12-20 07:41:17,266-0500 INFO  (jsonrpc/7) [storage.multipath] Waited 2.08 seconds for multipathd (tries=2, ready=2) (multipath:139)
2021-12-20 07:41:17,266-0500 INFO  (jsonrpc/7) [storage.multipath] Resizing multipath devices (multipath:220)
2021-12-20 07:41:17,283-0500 INFO  (jsonrpc/7) [storage.multipath] Resizing multipath devices: 0.02 seconds (utils:390)
2021-12-20 07:41:17,283-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Refreshing storage domain cache: 2.19 seconds (utils:390)
2021-12-20 07:41:17,283-0500 DEBUG (jsonrpc/7) [storage.samplingmethod] Returning last result (misc:353)
2021-12-20 07:41:17,288-0500 DEBUG (jsonrpc/7) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405088203d71b564fe0aaac74f6c$|^/dev/mapper/3600140514f301cb91f04badae7506e46$|^/dev/mapper/360014055ba493238c484d9284005d1bf$|^/dev/mapper/360014059fff83490d9b4ddfbc7e43c20$|^/dev/mapper/36001405a77a1ee25cbf4439b7ddd2062$|^/dev/mapper/36001405cd5268d7e3b64b40a16200729$|^/dev/mapper/36001405e7bafcb012c74e59a51e98ccb$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  prioritise_write_locks=1  wait_for_locks=1  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:154)
2021-12-20 07:41:17,358-0500 DEBUG (jsonrpc/7) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:186)
2021-12-20 07:41:17,358-0500 DEBUG (jsonrpc/7) [storage.hsm] Found SD uuids: ('66bbf96e-a291-4975-8773-4e6db522dd28', '7f61fd81-9d64-44ce-b9db-825befa0d0d5', 'ed3fb387-44be-4464-9625-f36b3a2cff19', 'f345fb3b-c0e9-461a-9f90-dcde6a1d9eb7') (hsm:2147)
2021-12-20 07:41:17,358-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain 66bbf96e-a291-4975-8773-4e6db522dd28 from storage domain cache (sdc:211)
2021-12-20 07:41:17,358-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain 7f61fd81-9d64-44ce-b9db-825befa0d0d5 from storage domain cache (sdc:211)
2021-12-20 07:41:17,358-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain ed3fb387-44be-4464-9625-f36b3a2cff19 from storage domain cache (sdc:211)
2021-12-20 07:41:17,358-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain f345fb3b-c0e9-461a-9f90-dcde6a1d9eb7 from storage domain cache (sdc:211)
2021-12-20 07:41:17,358-0500 DEBUG (jsonrpc/7) [storage.hsm] Found SD uuids: ('66bbf96e-a291-4975-8773-4e6db522dd28', '7f61fd81-9d64-44ce-b9db-825befa0d0d5', 'ed3fb387-44be-4464-9625-f36b3a2cff19', 'f345fb3b-c0e9-461a-9f90-dcde6a1d9eb7') (hsm:2147)
2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain 66bbf96e-a291-4975-8773-4e6db522dd28 from storage domain cache (sdc:211)
2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain 7f61fd81-9d64-44ce-b9db-825befa0d0d5 from storage domain cache (sdc:211)
2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain ed3fb387-44be-4464-9625-f36b3a2cff19 from storage domain cache (sdc:211)
2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain f345fb3b-c0e9-461a-9f90-dcde6a1d9eb7 from storage domain cache (sdc:211)
2021-12-20 07:41:17,359-0500 DEBUG (jsonrpc/7) [storage.hsm] Found SD uuids: ('66bbf96e-a291-4975-8773-4e6db522dd28', '7f61fd81-9d64-44ce-b9db-825befa0d0d5', 'ed3fb387-44be-4464-9625-f36b3a2cff19', 'f345fb3b-c0e9-461a-9f90-dcde6a1d9eb7') (hsm:2147)
2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain 66bbf96e-a291-4975-8773-4e6db522dd28 from storage domain cache (sdc:211)
2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain 7f61fd81-9d64-44ce-b9db-825befa0d0d5 from storage domain cache (sdc:211)
2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain ed3fb387-44be-4464-9625-f36b3a2cff19 from storage domain cache (sdc:211)
2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Removing domain f345fb3b-c0e9-461a-9f90-dcde6a1d9eb7 from storage domain cache (sdc:211)
2021-12-20 07:41:17,359-0500 DEBUG (jsonrpc/7) [storage.hsm] knownSDs: {eb405d82-0cd7-45c0-bba0-3795a34a6e5b: vdsm.storage.nfsSD.findDomain, 66bbf96e-a291-4975-8773-4e6db522dd28: vdsm.storage.blockSD.findDomain, 7f61fd81-9d64-44ce-b9db-825befa0d0d5: vdsm.storage.blockSD.findDomain, ed3fb387-44be-4464-9625-f36b3a2cff19: vdsm.storage.blockSD.findDomain, f345fb3b-c0e9-461a-9f90-dcde6a1d9eb7: vdsm.storage.blockSD.findDomain} (hsm:2200)
2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [storage.storagedomaincache] Invalidating storage domain cache (sdc:74)


8. API return
-----------------

2021-12-20 07:41:17,359-0500 INFO  (jsonrpc/7) [vdsm.api] FINISH connectStorageServer return={'statuslist': [{'id': '60b3ae07-db32-470d-9ed8-d1f99efcc99c', 'status': 0}, {'id': '994a711a-60f3-411a-aca2-0b60f01e8b8c', 'status': 0}, {'id': '14cd4d9d-d7ee-4a3a-9928-d9c37ec34a6f', 'status': 0}, {'id': '971386cf-1b46-4dea-92f4-9439237c9767', 'status': 465}, {'id': 'f10ac130-6988-4e34-ae46-fef4fd9e7de2', 'status': 465}]} from=::ffff:192.168.122.192,55666, flow_id=27e1c881, task_id=6b67e007-7e70-4dc8-995f-7d4146d814ee (api:54)


Connections to unavailable portals time out, but the host itself is NOT set as non-operational; it works normally and the SD is available on the host.
This holds for 2 unavailable portals. By default vdsm uses a maximum of 10 login threads, so if there are more than 10 unavailable portals, the original issue still pops up and the admin has to increase the number of parallel iSCSI logins in vdsm.
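As a rough illustration of that note (numbers taken from this comment; this is plain arithmetic, not a VDSM API):

import math

LOGIN_TIMEOUT = 120          # seconds an iscsiadm login to a dead portal takes to fail
parallel_logins = 10         # vdsm default for [iscsi] parallel_logins
unavailable_portals = 12     # example: more dead portals than login threads

# Logins run in batches of `parallel_logins`, so with more dead portals than
# threads the 120 s waits start stacking up again:
worst_case = math.ceil(unavailable_portals / parallel_logins) * LOGIN_TIMEOUT
print(worst_case)            # 240 -> exceeds the engine's 180 s timeout; raise parallel_logins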

Comment 21 Vojtech Juranek 2021-12-22 10:10:09 UTC
Note: to change the number of parallel login threads, one needs to create a custom vdsm config file in /etc/vdsm/vdsm.conf.d/ (e.g. /etc/vdsm/vdsm.conf.d/99-admin.conf) with

[iscsi]
parallel_logins=16 

In this case, a maximum of 16 login threads will be used.
This config has to be pushed to all hosts.

Comment 31 Shir Fishbain 2022-03-14 09:15:05 UTC
Thank you Nir for your description and response about the WARN attached.
Moving to Verified.

rhv-4.5.0-596.7434eeb1dde2.226-DS-nightly
vdsm-4.50.0.8-1.el8ev.x86_64

Comment 39 errata-xmlrpc 2022-05-26 17:22:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: RHV RHEL Host (ovirt-host) [ovirt-4.5.0] security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4764

