Bug 1977792 - iscsi volume connection gets stuck if multipath is enabled and "iscsiadm -m session" fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-os-brick
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z7
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Pablo Caruana
QA Contact: Tzach Shefi
URL:
Whiteboard:
Duplicates: 1942487 1977796
Depends On: 1923975
Blocks:
 
Reported: 2021-06-30 13:42 UTC by Pablo Caruana
Modified: 2021-12-09 20:20 UTC
CC: 14 users

Fixed In Version: python-os-brick-2.10.5-1.20210706143310.634fb4a.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, exceptions raised while connecting to iSCSI portals (for example, a failure of `iscsiadm -m session`) were not handled. In some failure patterns the `_connect_vol` threads could abort unexpectedly, and this abort caused subsequent steps to hang while waiting for results from the `_connect_vol` threads. With this update, any exception raised during connection to an iSCSI portal is handled correctly in the `_connect_vol` method, which avoids an unexpected abort that would leave the thread results unrecorded.
Clone Of: 1923975
Environment:
Last Closed: 2021-12-09 20:20:10 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1915678 0 None None None 2021-06-30 14:07:25 UTC
Red Hat Issue Tracker OSP-5668 0 None None None 2021-11-18 11:34:53 UTC
Red Hat Product Errata RHBA-2021:3762 0 None None None 2021-12-09 20:20:38 UTC

Description Pablo Caruana 2021-06-30 13:42:25 UTC
+++ This bug was initially created as a clone of Bug #1923975 +++ targeted for 16.2
This is the 16.1 branch.
Description of problem:

When os-brick attaches an iSCSI volume, it first executes "iscsiadm -m session" to obtain
all existing sessions, then executes "iscsiadm -m node -T <target> -p <portal> --login"
if it has not yet logged in to that portal.

If multipath is enabled, it runs these steps in threads, executing the iscsi commands concurrently
for the multiple iSCSI devices under the multipath device being attached.
However, the current implementation does not handle failures in "iscsiadm -m session",
and if that command fails the volume attachment never completes.

Comment 1 Takashi Kajinami 2021-06-30 14:01:38 UTC
*** Bug 1977796 has been marked as a duplicate of this bug. ***

Comment 4 Pablo Caruana 2021-07-01 09:52:14 UTC
*** Bug 1942487 has been marked as a duplicate of this bug. ***

Comment 11 Tzach Shefi 2021-08-04 12:07:21 UTC
Verified on:
python3-os-brick-2.10.5-1.20210706143310.634fb4a.el8ost.noarch


On a multipath deployment using NetApp iSCSI as the Cinder backend,
I ran the scripts below in parallel terminals:

[root@controller-2 ~]# cat iscsiwatch.sh
# session check loop
while true; do
date ; iscsiadm -m session
echo "==================================================================================="
done

[root@controller-2 ~]# cat lin.sh
# login loop (10.XXXXXXXX is the masked internal portal IP)
while true; do
date ; iscsiadm -m node -T iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 -p 10.XXXXXXXX:3260 --login
echo "==================================================================================="
done

[root@controller-2 ~]# cat lout.sh
# logout loop
while true; do
date ; iscsiadm -m node -T iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 -p 10.XXXXXXXXX:3260 --logout
echo "==================================================================================="
done
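The three loops above can also be driven from a single script. Here is a hypothetical Python equivalent; the `TARGET` and `PORTAL` values are placeholders, and the `base` parameter exists only so the command can be swapped out for testing:

```python
import subprocess
import threading
import time

# Placeholders; substitute the real target IQN and portal.
TARGET = 'iqn.1992-08.com.netapp:example'
PORTAL = '10.0.0.1:3260'

def churn(cmd, stop, codes):
    # Run one iscsiadm command in a tight loop, recording each exit code.
    # Failures are expected: the point is to race the attach code path.
    while not stop.is_set():
        try:
            codes.append(subprocess.run(cmd, capture_output=True).returncode)
        except FileNotFoundError:
            codes.append(-1)

def run_churn(duration=5.0, base=('iscsiadm',)):
    stop = threading.Event()
    cmds = {
        'session': [*base, '-m', 'session'],
        'login':   [*base, '-m', 'node', '-T', TARGET, '-p', PORTAL, '--login'],
        'logout':  [*base, '-m', 'node', '-T', TARGET, '-p', PORTAL, '--logout'],
    }
    codes = {name: [] for name in cmds}
    threads = [threading.Thread(target=churn, args=(cmd, stop, codes[name]))
               for name, cmd in cmds.items()]
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    return codes
```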


With all three loops running simultaneously on the controller hosting c-vol,
I successfully created 4 volumes from a RHEL image:
(overcloud) [stack@undercloud-0 ~]$ cinder list
+--------------------------------------+-----------+----------+------+-------------+----------+-------------+
| ID                                   | Status    | Name     | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+----------+------+-------------+----------+-------------+
| 69f5db2e-32ac-47ca-a721-290afe36c3a8 | available | rhelvol3 | 10   | tripleo     | true     |             |
| 74125dde-db5e-40ef-bb5f-edc359c01e51 | available | rhelvol1 | 10   | tripleo     | true     |             |
| 9816afd4-3182-4994-8395-c0b9dde2e5d7 | available | rhelvol2 | 10   | tripleo     | true     |             |
| ea8902f4-34cd-4ea5-8d96-5754ad0d6ccd | available | rhelvol4 | 10   | tripleo     | true     |             |
+--------------------------------------+-----------+----------+------+-------------+----------+-------------+


Now let's test volume attachment; I'll run the same 3 loops on the compute node hosting my instance.

Before the loops are running:
(overcloud) [stack@undercloud-0 ~]$ nova volume-attach inst1 74125dde-db5e-40ef-bb5f-edc359c01e51
+-----------------------+--------------------------------------+
| Property              | Value                                |                                                                                                                                                                             
+-----------------------+--------------------------------------+                                                                                                                                                                             
| delete_on_termination | False                                |                                                                                                                                                                             
| device                | /dev/vdb                             |                                                                                                                                                                             
| id                    | 74125dde-db5e-40ef-bb5f-edc359c01e51 |                                                                                                                                                                             
| serverId              | 5a6b5a47-1d33-4324-a486-96a9d02f7f42 |
| tag                   | -                                    |
| volumeId              | 74125dde-db5e-40ef-bb5f-edc359c01e51 |
+-----------------------+--------------------------------------+

We see the positive flow, 4 sessions/paths:
[root@compute-0 ~]# iscsiadm -m session
tcp: [6] 10.xxxxxxxx:3260,1039 iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 (non-flash)
tcp: [7] 10.xxxxxxxx:3260,1045 iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 (non-flash)
tcp: [8] 10.xxxxxxxx:3260,1046 iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 (non-flash)
tcp: [9] 10.xxxxxxxx:3260,1047 iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 (non-flash)
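Each output line above follows the pattern `transport: [sid] ip:port,tpgt iqn (type)`. A small parsing sketch, assuming that fixed layout (this helper is not part of os-brick):

```python
import re

# Matches e.g. "tcp: [6] 10.0.0.1:3260,1039 iqn.1992-08.com.example:x (non-flash)"
SESSION_RE = re.compile(
    r'^(?P<transport>\S+): \[(?P<sid>\d+)\] '
    r'(?P<portal>\S+?),(?P<tpgt>\d+) (?P<iqn>\S+)(?: \((?P<type>[^)]*)\))?$')

def parse_sessions(output):
    """Parse "iscsiadm -m session" output into one dict per session line."""
    return [m.groupdict()
            for line in output.splitlines()
            if (m := SESSION_RE.match(line.strip()))]
```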


Now let's detach and reattach with the loops running:
Attachment works fine;
we notice only 3 sessions, as the 4th session is missing due to the constant logout/login attempts.

| 74125dde-db5e-40ef-bb5f-edc359c01e51 | in-use    | rhelvol1 | 10   | tripleo     | true     | 5a6b5a47-1d33-4324-a486-96a9d02f7f42 |

tcp: [11] 10.xxxxxxxx:3260,1045 iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 (non-flash)
tcp: [13] 10.xxxxxxxx:3260,1046 iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 (non-flash)
tcp: [14] 10.xxxxxxxx:3260,1047 iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 (non-flash)

Good to verify.

Comment 22 errata-xmlrpc 2021-12-09 20:20:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762

