Bug 1788631 - Unnecessary iSCSI login attempts to the same target fail with "failed to setup iscsi subsystem" and cause host activation to fail, leaving the host in Non Operational state
Summary: Unnecessary iSCSI login attempts to the same target fail with "failed to setup iscsi subsystem" and cause host activation to fail, leaving the host in Non Operational state
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.1
Target Release: ---
Assignee: Ahmad Khiet
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On: 1812264
Blocks:
 
Reported: 2020-01-07 16:28 UTC by Avihai
Modified: 2020-08-05 09:11 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-29 12:09:59 UTC
oVirt Team: Storage
Embargoed:
pagranat: needinfo+
pm-rhel: ovirt-4.4+
michal.skrivanek: blocker?
pelauter: planning_ack+


Attachments
2 different envs with different storages logs (2.14 MB, application/gzip), 2020-01-07 16:28 UTC, Avihai
journalctl log (1.15 MB, application/gzip), 2020-01-07 16:58 UTC, Avihai
engine and vdsm logs (5.25 MB, application/gzip), 2020-02-16 13:09 UTC, Polina
Logs copied from the RHEL bug (1.45 MB, application/zip), 2020-06-23 13:38 UTC, Nir Soffer

Description Avihai 2020-01-07 16:28:15 UTC
Created attachment 1650473 [details]
2 different envs with different storages logs

Description of problem:
Seen in the automation test case dc_upgrade_4_2_to_4_3 in rhv-4.4.0-13, on several different environments with different storage backends.

When the host (which resides in a DC/cluster with 4.2 compatibility level) tries to log in to an iSCSI target via the REST API, the login fails with the following errors:


Engine:
2020-01-07 17:54:03,399+02 ERROR [org.ovirt.engine.core.bll.storage.connection.ISCSIStorageHelper] (default task-144) [hosts_syncAction_70177856-58c7-4124] The connection with details '00000000-0000-0000-0000-000000000000' failed because of error code '465' and error message is: failed to setup iscsi subsystem
2020-01-07 17:54:03,411+02 ERROR [org.ovirt.engine.core.bll.storage.connection.ConnectStorageToVdsCommand] (default task-144) [hosts_syncAction_70177856-58c7-4124] Transaction rolled-back for command 'org.ovirt.engine.core.bll.storage.connection.ConnectStorageToVdsCommand'.
2020-01-07 17:54:03,418+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-144) [] Operation Failed: [failed to setup iscsi subsystem]


VDSM (host log time is 7 hours behind the engine):
2020-01-07 10:54:00,439-0500 INFO  (jsonrpc/3) [vdsm.api] START connectStorageServer(domType=3, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'password': '********', 'port': '3260', 'iqn': 'iqn.2000-05.com.3pardata:20210002ac021f6b', 'connection': '3par-iscsi-1.scl.lab.tlv.redhat.com', 'id': '00000000-0000-0000-0000-000000000000', 'user': '', 'tpgt': '1'}], options=None) from=::ffff:10.35.162.6,42990, flow_id=hosts_syncAction_70177856-58c7-4124, task_id=aa06fce6-8e82-4ab8-9dd9-4059487fc416 (api:48)
2020-01-07 10:54:02,755-0500 ERROR (jsonrpc/3) [storage.HSM] Could not connect to storageServer (hsm:2418)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/iscsiadm.py", line 326, in node_login
    "-p", portal, "-l"])
  File "/usr/lib/python3.6/site-packages/vdsm/storage/iscsiadm.py", line 129, in run_cmd
    out = commands.run(cmd, sudo=True)
  File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101, in run
    raise cmdutils.Error(args, p.returncode, out, err)
vdsm.common.cmdutils.Error: Command ['/sbin/iscsiadm', '-m', 'node', '-T', 'iqn.2000-05.com.3pardata:20210002ac021f6b', '-I', 'default', '-p', '3par-iscsi-1.scl.lab.tlv.redhat.com:3260,1', '-l'] failed with rc=15 out=b'Logging in to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260]\nLogging in to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260]\nLogin to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] successful.\n' err=b'iscsiadm: Could not login to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260].\niscsiadm: initiator reported error (15 - session exists)\niscsiadm: Could not log into all portals\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2415, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 487, in connect
    iscsi.addIscsiNode(self._iface, self._target, self._cred)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/iscsi.py", line 219, in addIscsiNode
    iscsiadm.node_login(iface.name, target.address, target.iqn)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/iscsiadm.py", line 334, in node_login
    raise IscsiNodeError(e.rc, e.out, e.err)
vdsm.storage.iscsiadm.IscsiNodeError: (15, b'Logging in to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260]\nLogging in to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260]\nLogin to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] successful.\n', b'iscsiadm: Could not login to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260].\niscsiadm: initiator reported error (15 - session exists)\niscsiadm: Could not log into all portals\n')
2020-01-07 10:54:02,756-0500 INFO  (jsonrpc/3) [storage.StorageDomainCache] Refreshing storage domain cache (resize=True) (sdc:80)

2020-01-07 10:54:03,337-0500 ERROR (jsonrpc/3) [storage.Multipath] Could not resize device 360002ac0000000000000003d00021f6b (multipath:111)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 109, in resize_devices
    _resize_if_needed(guid)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 117, in _resize_if_needed
    for slave in devicemapper.getSlaves(name)]
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 117, in <listcomp>
    for slave in devicemapper.getSlaves(name)]
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 187, in getDeviceSize
    bs, phyBs = getDeviceBlockSizes(devName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 179, in getDeviceBlockSizes
    QUEUE, "logical_block_size"))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 198, in read_int
    with open(path, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/block/sdb/queue/logical_block_size'
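
To make the failure mode concrete: iscsiadm exits with code 15 ("session exists") when a session to the target is already established, and vdsm's node_login treats any nonzero exit code as fatal. Below is a minimal, hypothetical sketch (not vdsm's actual code) of a login wrapper that would tolerate the duplicate-session case; note that, per the resolution later in this bug, the real answer is to avoid issuing the duplicate login in the first place.

import subprocess

SESS_EXISTS = 15  # matches "initiator reported error (15 - session exists)" above

def node_login_tolerant(target, portal, iface="default"):
    # Same iscsiadm invocation as in the traceback above (must run as root).
    cmd = ["/sbin/iscsiadm", "-m", "node", "-T", target,
           "-I", iface, "-p", portal, "-l"]
    p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if p.returncode not in (0, SESS_EXISTS):
        raise RuntimeError("iscsiadm login failed: rc=%d err=%r"
                           % (p.returncode, p.stderr))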


Version-Release number of selected component (if applicable):
engine 4.4.0-0.13.master.el7
vdsm-4.40.0-164.git38a19bb.el8ev.x86_64
libvirt-client-5.6.0-6.module+el8.1.0+4244+9aa4e6bb.x86_64
qemu-kvm-4.1.0-14.module+el8.1.0+4548+ed1300f4.x86_64
host OS is Red Hat Enterprise Linux release 8.1 (Ootpa)
Kernel 4.18.0-147.3.1.el8_1.x86_64

How reproducible:
The issue is 100% reproducible via automation, but it is harder to reproduce manually (managed to reproduce it once).

Steps to Reproduce:
1. Create 4.2 DC (with compatibility level 4.2)
2. Create 4.2 cluster 
3. Add host to this cluster
4. Discovery of the storage server's iSCSI targets via the REST API works.
Discovery REST request:
2020-01-07 17:53:59,702 - MainThread - hosts - DEBUG - Action request content is --  url:/ovirt-engine/api/hosts/a5e32c47-498b-4d24-92a1-9a8bf332a03e/iscsidiscover body:<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <iscsi>
        <address>3par-iscsi-1.scl.lab.tlv.redhat.com</address>
    </iscsi>
</action>

Reply:
2020-01-07 17:54:00,228 - MainThread - hosts - DEBUG - Response body for action request is: 
<action>
    <async>false</async>
    <discovered_targets>
        <iscsi_details>
            <address>10.35.146.1</address>
            <port>3260</port>
            <portal>10.35.146.1:3260,21</portal>
            <target>iqn.2000-05.com.3pardata:20210002ac021f6b</target>
        </iscsi_details>
        <iscsi_details>
            <address>10.35.146.2</address>
            <port>3260</port>
            <portal>10.35.146.2:3260,121</portal>
            <target>iqn.2000-05.com.3pardata:21210002ac021f6b</target>
        </iscsi_details>
    </discovered_targets>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <iscsi>
        <address>3par-iscsi-1.scl.lab.tlv.redhat.com</address>
    </iscsi>
    <iscsi_targets>
        <iscsi_target>iqn.2000-05.com.3pardata:20210002ac021f6b</iscsi_target>
        <iscsi_target>iqn.2000-05.com.3pardata:21210002ac021f6b</iscsi_target>
    </iscsi_targets>
    <status>complete</status>
</action>


5. Log in via the REST API to one of the discovered targets (a combined discover/login sketch follows after these steps):

2020-01-07 17:54:00,324 - MainThread - hosts - DEBUG - Action request content is --  url:/ovirt-engine/api/hosts/a5e32c47-498b-4d24-92a1-9a8bf332a03e/iscsilogin 

body:<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <iscsi>
        <address>3par-iscsi-1.scl.lab.tlv.redhat.com</address>
        <target>iqn.2000-05.com.3pardata:20210002ac021f6b</target>
    </iscsi>
</action>
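
For reference, here is a minimal sketch of steps 4-5 as raw REST calls. The endpoints and XML bodies are taken from the requests above; the engine URL and credentials are placeholders, and real automation would normally use the oVirt Python SDK instead of raw requests.

import requests

ENGINE = "https://engine.example.com"  # placeholder
AUTH = ("admin@internal", "password")  # placeholder
HOST_ID = "a5e32c47-498b-4d24-92a1-9a8bf332a03e"
HEADERS = {"Content-Type": "application/xml"}

# Step 4: discover targets on the storage server.
discover_body = """<action>
    <iscsi>
        <address>3par-iscsi-1.scl.lab.tlv.redhat.com</address>
    </iscsi>
</action>"""
r = requests.post("%s/ovirt-engine/api/hosts/%s/iscsidiscover" % (ENGINE, HOST_ID),
                  data=discover_body, headers=HEADERS, auth=AUTH, verify=False)
print(r.status_code, r.text)  # reply contains <discovered_targets>, as shown above

# Step 5: log in to one of the discovered targets.
login_body = """<action>
    <iscsi>
        <address>3par-iscsi-1.scl.lab.tlv.redhat.com</address>
        <target>iqn.2000-05.com.3pardata:20210002ac021f6b</target>
    </iscsi>
</action>"""
r = requests.post("%s/ovirt-engine/api/hosts/%s/iscsilogin" % (ENGINE, HOST_ID),
                  data=login_body, headers=HEADERS, auth=AUTH, verify=False)
print(r.status_code, r.text)  # 400 "failed to setup iscsi subsystem" in this bug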


Actual results:
iSCSI login fails with the following REST response:
2020-01-07 17:54:03,431 - MainThread - api_utils - ERROR - Failed to syncAction element NOT as expected:
        Status: 400
        Reason: Bad Request
        Detail: [failed to setup iscsi subsystem]

2020-01-07 17:54:03,435 - MainThread - hosts - ERROR - Response code is not valid, expected is: [200, 201], actual is: 400 


Expected results:


Additional info:
Logging in to the same target via the webadmin UI works.

The issue occurs in different environments and on the different storage backends we usually work with: NetApp and HP 3PAR.

Comment 1 Avihai 2020-01-07 16:58:35 UTC
Created attachment 1650478 [details]
journalctl log

Comment 2 Avihai 2020-01-07 16:58:51 UTC
This is what I see in the journalctl log of the host (its clock is 8 hours behind the engine log) in the second environment's reproduction.
I have also attached it; I hope it helps.

Jan 07 10:54:01 storage-ge7-vdsm2.scl.lab.tlv.redhat.com iscsid[1261]: iscsid: Connection2:0 to [target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] through [iface: default] is operational now

Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com kernel: scsi 3:0:0:1: alua: Detached
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com kernel: scsi 3:0:0:2: alua: Detached
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com kernel: scsi 3:0:0:3: alua: Detached
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com kernel: scsi 3:0:0:4: alua: Detached
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com kernel: scsi 3:0:0:5: alua: Detached
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com iscsid[1261]: iscsid: Connection2:0 to [target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] through [iface: default] is shutdown.
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com sudo[7662]: pam_unix(sudo:session): session closed for user root
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com systemd[1]: Started Session c44978 of user root.
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com sudo[7683]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com sudo[7683]: pam_unix(sudo:session): session closed for user root
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: 8:16: cannot find block device
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: checker failed path 8:16 in map 360002ac0000000000000003d00021f6b
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com kernel: device-mapper: multipath: Failing path 8:16.
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: 360002ac0000000000000003d00021f6b: Entering recovery mode: max_retries=4
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: 360002ac0000000000000003d00021f6b: remaining active paths: 0
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com systemd[1]: Started Session c44979 of user root.
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com sudo[7687]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com sudo[7687]: pam_unix(sudo:session): session closed for user root
Jan 07 10:54:03 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: 8:16: cannot find block device
Jan 07 10:54:03 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: 360002ac0000000000000003d00021f6b Last path deleted, disabling queueing
Jan 07 10:54:03 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: 360002ac0000000000000003d00021f6b: map in use
Jan 07 10:54:03 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: 360002ac0000000000000003d00021f6b: can't flush
Jan 07 10:54:03 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: 360002ac0000000000000003d00021f6b: queueing disabled
Jan 07 10:54:03 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: 360002ac0000000000000003d00021f6b: load table [0 314572800 multipath 0 1 alua 0 0]
Jan 07 10:54:03 storage-ge7-vdsm2.scl.lab.tlv.redhat.com multipathd[818]: sdb [8:16]: path removed from map 360002ac0000000000000003d00021f6b

Comment 3 Ahmad Khiet 2020-02-12 14:51:03 UTC
I sent a question to your email, but the info is also relevant here:
does the iSCSI target need a CHAP user/pass, or empty credentials?

Comment 4 Avihai 2020-02-12 15:33:55 UTC
(In reply to Ahmad Khiet from comment #3)
> I sent a question to your email, but the info is also relevant here:
> does the iSCSI target need a CHAP user/pass, or empty credentials?

Answered :) 
No CHAP user/pass is required.

Comment 5 Polina 2020-02-16 13:09:25 UTC
Created attachment 1663324 [details]
engine and vdsm logs

I see the same issue in automation runs in several 4.4 environments.
It is not triggered by any special scenario: the tests send deactivate/activate actions several times, and these succeed. Then we see in engine.log 'failed because of error code '465' and error message is: failed to setup iscsi subsystem.'

Here are some lines from the engine log related to this issue:

2020-02-12 05:30:16,350+02 INFO  [org.ovirt.engine.core.bll.storage.connection.ISCSIStorageHelper] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-77) [50c57553] The lun with id 'ytuOXG-2Ucm-lw5I-GRkh-MfRJ-lIuI-DaRCoz' was reported as problematic
2020-02-12 05:30:16,355+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-77) [50c57553] EVENT_ID: STORAGE_DOMAIN_ERROR(996), The error message for connection beetle-iscsi-lif2.lab.eng.tlv2.redhat.com iqn.1992-08.com.netapp:vserver-rhv-qe (LUN 3600a09803830447a4f244c4657595037) returned by VDSM was: Failed to setup iSCSI subsystem
2020-02-12 05:30:16,355+02 ERROR [org.ovirt.engine.core.bll.storage.connection.ISCSIStorageHelper] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-77) [50c57553] The connection with details 'beetle-iscsi-lif2.lab.eng.tlv2.redhat.com iqn.1992-08.com.netapp:vserver-rhv-qe (LUN 3600a09803830447a4f244c4657595037)' failed because of error code '465' and error message is: failed to setup iscsi subsystem
2020-02-12 05:30:16,374+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RegisterLibvirtSecretsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-77) [50c57553] START, RegisterLibvirtSecretsVDSCommand(HostName = host_mixed_1, RegisterLibvirtSecretsVDSParameters:{hostId='be689ba9-3b37-49e5-8d46-3b021e74a1ff', libvirtSecrets='[org.ovirt.engine.core.common.businessentities.storage.LibvirtSecret@227f4b02]', clearUnusedSecrets='false'}), log id: 72943d21

2020-02-12 05:35:23,208+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-3855) [aa52e26] Domain '77bc098f-cd6c-4d3c-a531-d98cf50a1b88:iscsi_1' was reported with error code '2001'

2020-02-12 05:35:23,209+02 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-3855) [aa52e26] Storage Domain 'iscsi_1' of pool 'golden_env_mixed' is in problem in host 'host_mixed_2'
2020-02-12 05:35:23,212+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-3855) [aa52e26] EVENT_ID: VDS_STORAGE_VDS_STATS_FAILED(189), Host host_mixed_2 reports about one of the Active Storage Domains as Problematic.

Comment 7 Ahmad Khiet 2020-02-17 13:31:24 UTC
Can you provide system logs?

Comment 11 Shir Fishbain 2020-02-27 13:03:58 UTC
Hi,

Please find below the summary of my meeting with Benny and Ahmad. 
This bug was escalated to the RHEL DEV team (Chris Leech):

1) In both versions (4.3 and 4.4) the configuration/creation of the iSCSI login process is the same.
The suggestion from RHV was to delete the nodes from the host, add them again, and then try to log in to iSCSI:

The following shows the iSCSI login error on an RHV 4.4 / RHEL 8.2 host:
# iSCSI logout:
[anaconda root@storage-ge11-vdsm1 ~]# iscsiadm -m session -u                    
iscsiadm: No matching sessions found

# Removing the nodes:
[anaconda root@storage-ge11-vdsm1 iqn.2000-05.com.3pardata:20210002ac021f6b]# rm -rf /var/lib/iscsi/nodes/*

# Adding the nodes
[anaconda root@storage-ge11-vdsm1 iqn.2000-05.com.3pardata:20210002ac021f6b]# /sbin/iscsiadm -m node -T iqn.2000-05.com.3pardata:20210002ac021f6b -I default -p 10.35.146.1:3260,1:3260,1 --op=new
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.35.146.1,3260,1 iqn.2000-05.com.3pardata:20210002ac021f6b] added
[anaconda root@storage-ge11-vdsm1 iqn.2000-05.com.3pardata:20210002ac021f6b]# /sbin/iscsiadm -m node -T iqn.2000-05.com.3pardata:20210002ac021f6b -I default -p 3par-iscsi-1.scl.lab.tlv.redhat.com:3260,1 --op=new
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 3par-iscsi-1.scl.lab.tlv.redhat.com,3260,1
iqn.2000-05.com.3pardata:20210002ac021f6b] added

# iSCSI login to one target - the login fails:
[anaconda root@storage-ge11-vdsm1 iqn.2000-05.com.3pardata:20210002ac021f6b]# /sbin/iscsiadm -m node -T iqn.2000-05.com.3pardata:20210002ac021f6b -I default -p 10.35.146.1:3260,1 -l
Logging in to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260]
Logging in to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 10.35.146.1,3260]
Login to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] successful.
iscsiadm: Could not login to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 10.35.146.1,3260].
iscsiadm: initiator reported error (15 - session exists)
iscsiadm: Could not log into all portals

The following shows the iSCSI login on an RHV 4.3 / RHEL 7.7 host:
# iSCSI logout:
[root@storage-ge2-vdsm1 ~]# iscsiadm -m session -u
Logging out of session [sid: 58, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260]
Logout of [sid: 58, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] successful.

# Removing the nodes
[root@storage-ge2-vdsm1 ~]# rm -rf /var/lib/iscsi/nodes/*

# Adding the nodes
[root@storage-ge2-vdsm1 ~]# /sbin/iscsiadm -m node -T iqn.2000-05.com.3pardata:20210002ac021f6b -I default -p 3par-iscsi-1.scl.lab.tlv.redhat.com:3260,1 --op=new
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 3par-iscsi-1.scl.lab.tlv.redhat.com,3260,1 iqn.2000-05.com.3pardata:20210002ac021f6b] added
[root@storage-ge2-vdsm1 ~]# /sbin/iscsiadm -m node -T iqn.2000-05.com.3pardata:20210002ac021f6b -I default -p 10.35.146.1:3260,1 --op=new
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.35.146.1,3260,1 iqn.2000-05.com.3pardata:20210002ac021f6b] added

# iSCSI login to one target - the login succeeds:
[root@storage-ge2-vdsm1 ~]# /sbin/iscsiadm -m node -T iqn.2000-05.com.3pardata:20210002ac021f6b -I default -p 10.35.146.1:3260,1 -l
Logging in to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] (multiple)
Logging in to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 10.35.146.1,3260] (multiple)
Login to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] successful.
Login to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 10.35.146.1,3260] successful.

We drilled down into the iSCSI node DB. After removing the sessions, we saw on RHV 4.3 / RHEL 7.7 that there are multiple node records with different connection addresses for the same target (node.conn[0].address = 3par-iscsi-1.scl.lab.tlv.redhat.com, node.conn[0].address = 10.35.146.2, node.conn[0].address = 10.35.146.1).
The RHEL 7.7 host can deal with the multiple node records and log in successfully; the RHV 4.4 / RHEL 8.2 host cannot. (A small node-DB inspection sketch follows after point 4 below.)
How is the iSCSI login supposed to work? Are multiple node records per target supposed to be supported on RHV 4.4 / RHEL 8.2?

2) Also, the failed iSCSI login leaves the host in a Non Operational state indefinitely, because the host loses its connectivity/multipath/iSCSI connections:

# multipath is dead on this host alone (all the other hosts in the same cluster are up and connected to the storage targets):

* Host was in maintenance -> manually ran iSCSI logout from all targets + iSCSI login
* Activate host -> the host fails to activate with the following errors in VDSM, and indeed all multipath connections are gone:

3) We would like to know whether there will be a fix from RHEL, and when.
4) Will this issue be fixed on the RHV side or the RHEL side?
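
The node-DB inspection sketch referenced in point 1: a small, hypothetical helper that lists the iscsiadm node records and groups the portals by target, so duplicate records for one target (e.g. one by DNS name and one by IP) become visible. It assumes the usual one-record-per-line "portal,tpgt target" output of iscsiadm -m node.

import subprocess
from collections import defaultdict

def nodes_by_target():
    # Each output line looks like:
    # "10.35.146.1:3260,1 iqn.2000-05.com.3pardata:20210002ac021f6b"
    # Note: iscsiadm exits nonzero when the node DB is empty, so check=True raises.
    out = subprocess.run(["/sbin/iscsiadm", "-m", "node"],
                         stdout=subprocess.PIPE, universal_newlines=True,
                         check=True).stdout
    records = defaultdict(list)
    for line in out.splitlines():
        portal, target = line.split()
        records[target].append(portal)
    return records

for target, portals in nodes_by_target().items():
    if len(portals) > 1:
        print("target %s has %d node records: %s" % (target, len(portals), portals))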

Comment 12 Martin Tessun 2020-03-10 13:14:41 UTC
Looking at the initial iscsid error:

Jan 07 10:54:02 storage-ge7-vdsm2.scl.lab.tlv.redhat.com iscsid[1261]: iscsid: Connection2:0 to [target: iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] through [iface: default] is shutdown.

means that the interface used for the iSCSI connection has been shut down. This does not really look like an iSCSI issue, but rather as if the interface was unavailable for some time - thus iscsid terminated the connection.

Cheers,
Martin

Comment 13 Nir Soffer 2020-06-08 16:29:31 UTC
(In reply to Shir Fishbain from comment #11)
> # Adding the nodes
> [anaconda root@storage-ge11-vdsm1
> iqn.2000-05.com.3pardata:20210002ac021f6b]# /sbin/iscsiadm -m node -T
> iqn.2000-05.com.3pardata:20210002ac021f6b -I default -p
> 10.35.146.1:3260,1:3260,1 --op=new

We added 10.35.146.1:3260...

> iqn.2000-05.com.3pardata:20210002ac021f6b]# /sbin/iscsiadm -m node -T
> iqn.2000-05.com.3pardata:20210002ac021f6b -I default -p
> 3par-iscsi-1.scl.lab.tlv.redhat.com:3260,1 --op=new

What is the ip address of 3par-iscsi-1.scl.lab.tlv.redhat.com?

Why do you mix IP addresses and DNS names? This is not how the engine adds
targets.
 
> # ISCSI login to one target - the login fails: 
> [anaconda root@storage-ge11-vdsm1
> iqn.2000-05.com.3pardata:20210002ac021f6b]# /sbin/iscsiadm -m node -T
> iqn.2000-05.com.3pardata:20210002ac021f6b -I default -p 10.35.146.1:3260,1 -l
> Logging in to [iface: default, target:
> iqn.2000-05.com.3pardata:20210002ac021f6b, portal:
> 3par-iscsi-1.scl.lab.tlv.redhat.com,3260]
> Logging in to [iface: default, target:
> iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 10.35.146.1,3260]
> Login to [iface: default, target: iqn.2000-05.com.3pardata:20210002ac021f6b,
> portal: 3par-iscsi-1.scl.lab.tlv.redhat.com,3260] successful.
> iscsiadm: Could not login to [iface: default, target:
> iqn.2000-05.com.3pardata:20210002ac021f6b, portal: 10.35.146.1,3260].
> iscsiadm: initiator reported error (15 - session exists)
> iscsiadm: Could not log into all portals

Is 3par-iscsi-1.scl.lab.tlv.redhat.com's IP address 10.35.146.1?

The flow described here is not what RHV does when you discover and add
targets, so the fact that it fails should not affect RHV.

Comment 14 Nir Soffer 2020-06-08 16:33:57 UTC
This looks like a duplicate of bug 1812264.

Comment 15 Michal Skrivanek 2020-06-18 14:33:14 UTC
Either NOTABUG, or possibly an engine fix; pending a QE update (from https://bugzilla.redhat.com/show_bug.cgi?id=1812264#c25).

Comment 18 Nir Soffer 2020-06-23 13:38:01 UTC
Created attachment 1698460 [details]
Logs copied from the RHEL bug

Copied from https://bugzilla.redhat.com/attachment.cgi?id=1698215

Comment 21 Shir Fishbain 2020-06-24 06:45:43 UTC
Nir, please look at the answers above. This is a very important issue, still in NEW status and targeted for GA (4.4.1).
Let's try to fix it for the 4.4 GA, or retarget it (not a good thing).

Comment 23 Nir Soffer 2020-06-24 19:47:55 UTC
Summary:

1. There is no issue if you log in to the targets discovered using the
   IP address reported by vdsm.

Discovery should be done using either the DNS name:
3par-iscsi-1.example.com

Or the IP address:

[root@green-vdsd ~]# host 3par-iscsi-1.example.com
3par-iscsi-1.example.com has address 10.35.146.1

Discovery returns targets with IP addresses. In this environment:
- 10.35.146.1
- 10.35.146.2

The engine should keep these addresses in the database.

2. Log in to the target using the addresses discovered in step 1.

The engine should reject login attempts to addresses that were
not discovered.

3. Do not mix IP addresses and DNS names when logging in to targets.

Mixing them can attempt to create two sessions to the same target,
once using the DNS name and once using the IP address; this
configuration is not supported on RHEL 8.

This used to work on RHEL 7, hiding the issue in RHV.

So for the tests in automation, fix the code according to these
guidelines.

We can file a new bug to improve validation on the engine side to
avoid the bad logins, and to document how the API should be used.
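
To make these guidelines concrete, here is a hypothetical host-side sketch of the intended sequence, assuming the standard sendtargets output format: discover using a single address form, then log in only with the exact portal/target pairs that discovery returned, never substituting the DNS name for a discovered IP (or vice versa).

import subprocess

def discover(address):
    # sendtargets discovery; each output line looks like
    # "10.35.146.1:3260,21 iqn.2000-05.com.3pardata:20210002ac021f6b"
    out = subprocess.run(
        ["/sbin/iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", address],
        stdout=subprocess.PIPE, universal_newlines=True, check=True).stdout
    return [tuple(line.split()) for line in out.splitlines() if line.strip()]

def login(portal, target):
    # Log in using exactly the portal string that discovery reported.
    subprocess.run(["/sbin/iscsiadm", "-m", "node", "-T", target,
                    "-I", "default", "-p", portal, "-l"], check=True)

# Discover via one address form (the DNS name here), then log in with the
# discovered values only:
for portal, target in discover("3par-iscsi-1.example.com"):
    login(portal, target)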

Comment 24 RHEL Program Management 2020-06-29 10:07:40 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 25 Shir Fishbain 2020-06-29 12:35:04 UTC
I'm fine with the solution offered: add doc text (to the administration doc) stating that the expected values must come only from the engine. It should be made clear to customers what they can use (REST/UI/CLI) before or after doing discovery/login via RHV with the same IP/FQDN.
Need to describe:
* Using the SDK / REST API (values received from the engine)
* Using the UI (values received from the engine)
* Using the host CLI, just running iscsiadm commands (values not received from the engine)

Only values received from the engine should be used for discovery and login. If we begin the discovery with an FQDN, then the other operations must use the FQDN only; the same applies to IP addresses. We use only the values that the engine returns.

Comment 26 Shir Fishbain 2020-07-05 11:30:15 UTC
The solution offered for this bug, that the expected values for iSCSI discovery and login must come only from the engine, was added to the administration doc.
I opened a documentation bug: https://bugzilla.redhat.com/show_bug.cgi?id=1853898

Comment 27 Roni 2020-07-26 13:27:05 UTC
(In reply to Nir Soffer from comment #23)
> Summary:
> 
> 1. There is no issue if you log in to the targets discovered using the
>    IP address reported by vdsm.
> 
> Discovery should be done using either the DNS name:
> 3par-iscsi-1.example.com
> 
> Or the IP address:
> 
> [root@green-vdsd ~]# host 3par-iscsi-1.example.com
> 3par-iscsi-1.example.com has address 10.35.146.1
> 
> Discovery returns targets with IP addresses. In this environment:
> - 10.35.146.1
> - 10.35.146.2
> 
> The engine should keep these addresses in the database.
> 
> 2. Log in to the target using the addresses discovered in step 1.
> 
> The engine should reject login attempts to addresses that were
> not discovered.
> 
> 3. Do not mix IP addresses and DNS names when logging in to targets.
> 
> Mixing them can attempt to create two sessions to the same target,
> once using the DNS name and once using the IP address; this
> configuration is not supported on RHEL 8.
> 
> This used to work on RHEL 7, hiding the issue in RHV.
> 
> So for the tests in automation, fix the code according to these
> guidelines.
> 
> We can file a new bug to improve validation on the engine side to
> avoid the bad logins, and to document how the API should be used.

Hi Nir

1. Do you see any issue when the FQDN resolves to both IPv4 & IPv6 addresses?
e.g.:
[root@localhost]# host beetle-iscsi-lif2.lab.eng.tlv2.redhat.com
beetle-iscsi-lif2.lab.eng.tlv2.redhat.com has address 10.46.16.10
beetle-iscsi-lif2.lab.eng.tlv2.redhat.com has IPv6 address 2620:52:0:2e10:a0:98ff:febf:f1ea

2. How should the setup handle a mix of hosts, as follows:
e.g.: IPv4 only, IPv6 only, or hosts with both IPv4 & IPv6 addresses?

Thx
Roni
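
For context on question 1, a small sketch showing the dual-stack resolution (mirroring the host output above); which address family discovery and login end up using depends on which address string is handed to iscsiadm.

import socket

# Resolve the portal FQDN for TCP port 3260 (iSCSI); a dual-stack name
# yields both an IPv4 and an IPv6 result.
for family, _, _, _, sockaddr in socket.getaddrinfo(
        "beetle-iscsi-lif2.lab.eng.tlv2.redhat.com", 3260,
        proto=socket.IPPROTO_TCP):
    print(socket.AddressFamily(family).name, sockaddr[0])
# Expected here, per the host output above:
#   AF_INET  10.46.16.10
#   AF_INET6 2620:52:0:2e10:a0:98ff:febf:f1ea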

