Description of problem: When os-brick attaches an iSCSI volume, it first executes "iscsiadm -m session" to obtain all existing sessions, then executes "iscsiadm -m node -T <target> -p <portal> --login" if it has not yet logged into that portal. If multipath is enabled, it runs these steps in threads and executes iscsi commands concurrently for the multiple iSCSI devices under the multipath device being attached. However, the current implementation does not handle failures of "iscsiadm -m session", and if that command fails the volume attachment never completes.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
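The attach flow described above can be sketched roughly as follows. This is a hypothetical minimal illustration, not the actual os-brick code: the helper names (connect_portal, connect_multipath) and the injectable run_cmd callable are assumptions made for clarity. The point it shows is that each worker thread lists sessions before logging in, and an unhandled failure of the session listing propagates out of the thread.

```python
# Hypothetical sketch of the attach flow (NOT the real os-brick code).
# run_cmd(argv) -> (returncode, stdout) stands in for executing iscsiadm.
import concurrent.futures


def connect_portal(target, portal, run_cmd):
    # Step 1: list all existing sessions ("iscsiadm -m session").
    rc, out = run_cmd(["iscsiadm", "-m", "session"])
    if rc != 0:
        # In the old behaviour this failure was not handled, so the
        # exception escaped the worker thread and the attachment hung.
        raise RuntimeError("iscsiadm -m session failed (rc=%d)" % rc)
    logged_in = any(target in line and portal in line
                    for line in out.splitlines())
    # Step 2: log in only if this target/portal has no session yet.
    if not logged_in:
        rc, _ = run_cmd(["iscsiadm", "-m", "node", "-T", target,
                         "-p", portal, "--login"])
        if rc != 0:
            raise RuntimeError("login to %s failed" % portal)


def connect_multipath(target, portals, run_cmd):
    # With multipath enabled, one thread per portal runs concurrently.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(connect_portal, target, p, run_cmd)
                   for p in portals]
        for f in futures:
            f.result()  # re-raises any failure from a worker thread
```

Injecting run_cmd keeps the sketch testable without a real iSCSI setup; the real code shells out to iscsiadm directly.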
It turned out we are hitting an issue with the iscsiadm command in RHEL 8.2. Since current os-brick executes the volume connection ("iscsiadm -m session" and "iscsiadm -m node ... --login") in threads, we are likely to hit this bug. On controller nodes we do not keep iSCSI connections open but create them when creating a volume from an image, so we are more likely to hit this there, but we can hit the same problem on a compute node when attaching the first volume to that node.
Pablo, is there a way I could verify this? I see iSCSI and multipath mentioned, and I suspect this also depends on the specific driver implementation ("it executes "iscsiadm -m session" first to obtain all existing sessions"). Do you happen to know whether I should use 3par iSCSI or NetApp iSCSI for this case? Would the reproduction steps be just successfully attaching a volume to an instance?
Tzach, it was reported on different backend iterations using iSCSI, such as:
- NetApp storage backend
- iSCSI storage (Dell EMC XtremIO)

The requirement is that iscsi-initiator-utils is installed in the nova-compute and cinder-volume containers, as in https://bugzilla.redhat.com/show_bug.cgi?id=1924768 (for RHEL8.4) and https://bugzilla.redhat.com/show_bug.cgi?id=1940666 (for RHEL8.2). You can use both the NetApp and the 3par if your time allows.

Basically the test needs iSCSI storage with multipath enabled, so that when attaching a volume Cinder runs iscsi and multipath commands to initialize the connection and the multipath device on multiple iSCSI devices. Run an attach-volume command against a specific instance and observe. When Cinder connects the volume, it first runs `iscsiadm -m session` to get a list of all sessions, and then runs `iscsiadm -m node -T <target> -p <portal> --login` if the target/portal is not yet logged in. If multipath is enabled, it executes these steps in threads and runs iscsi commands concurrently for the multiple iSCSI devices under the multipath device being attached.

The old os-brick code doesn't properly catch some possible exceptions during connecting to iSCSI portals, like failures in "iscsiadm -m session"; if that command fails, the volume attachment never completes. If the volume is attached successfully, that should be enough for the test to pass.

Thanks for double checking on this. Regards.
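For context, one way the session listing could be made tolerant of such failures is sketched below. This is a hedged illustration, not the actual os-brick patch: the function name get_sessions, the retry parameters, and the choice to retry are assumptions. The exit code 21 is iscsiadm's "no objects found" code, which a naive non-zero check would wrongly treat as an error when there are simply no sessions yet.

```python
# Hedged sketch (NOT the real os-brick fix): tolerate "no sessions" and
# retry transient failures of "iscsiadm -m session" instead of aborting.
# run_cmd(argv) -> (returncode, stdout) stands in for executing iscsiadm.
import time

NO_OBJS_FOUND_RC = 21  # iscsiadm exit code for "no objects found"


def get_sessions(run_cmd, retries=3, delay=0.0):
    rc = -1
    for _ in range(retries):
        rc, out = run_cmd(["iscsiadm", "-m", "session"])
        if rc == 0:
            return out.splitlines()
        if rc == NO_OBJS_FOUND_RC:
            return []  # having no sessions yet is a normal condition
        time.sleep(delay)  # transient failure: back off and retry
    raise RuntimeError("iscsiadm -m session kept failing (rc=%d)" % rc)
```

The design point is that a failed session listing should not immediately abort the whole attachment when the other concurrent portal threads could still succeed.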
Verified on python3-os-brick-2.10.7-2.20210528134947.el8ost.2.noarch

On a multipath deployment using NetApp iSCSI as the Cinder backend, I ran the scripts below in parallel terminals:

[root@controller-2 ~]# cat iscsiwatch.sh
# session check loop
while true; do
    date
    iscsiadm -m session
    echo "==================================================================================="
done

[root@controller-2 ~]# cat lin.sh
# login loop
while true; do
    date
    iscsiadm -m node -T iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 -p 10.XXXXXXXX:3260 --login    (xxx -> internal IP)
    echo "==================================================================================="
done

[root@controller-2 ~]# cat lout.sh
# logout loop
while true; do
    date
    iscsiadm -m node -T iqn.1992-08.com.netapp:sn.83806661cc2f11eba182d039ea28c8f6:vs.19 -p 10.XXXXXXXXX:3260 --logout
    echo "==================================================================================="
done

To try to simulate an "iscsiadm -m session" failure, I had even downgraded to iscsi-initiator-utils-6.2.0.878-5, yet I still never managed to hit it. Several cycles of volume attach/detach to an instance completed without any issues while the loops were executing in parallel on the compute node. I then tested several cycles of volume create from a rhel.qcow2 image while running the loops on the controller where c-vol was running; again no issues were hit and all volumes were available. Looks good to verify.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483