Bug 170656

Summary:

iSCSI connection recovery uses session address instead of portal address

Product:

Red Hat Enterprise Linux 4

Reporter:

Cesar Garde <cgarde>

Component:

kernel

Assignee:

Mike Christie <mchristi>

Status:

CLOSED ERRATA

QA Contact:

Brock Organ <borgan>

Severity:

high

Docs Contact:

Priority:

medium

Version:

4.0

CC:

coughlan, mchristi, poelstra, rkenna

Target Milestone:

---

Target Release:

---

Hardware:

i386

OS:

Linux

Whiteboard:

Fixed In Version:

RHSA-2006-0132

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2006-03-07 20:24:24 UTC

Bug Blocks:

168429

Attachments:

Description	Flags
retry portal addr after 5 retries	none
retry portal address	none
retry portal address immediately	none
network trace of 11-16-05 patch retest	none
patch retest with login_timeout=5	none
messages file from 12-7-05 retest	none
network trace of 12-7-05 retest	none
add debug output	none

Description Cesar Garde 2005-10-13 15:44:39 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7 (ax)

Description of problem:
When connecting to the portal address on the Equallogic array, the array will issue a temporary login redirect. The redirected address, however, is not guaranteed to be there if the connection is interrupted. In the case of a connection loss, the initiator must reconnect to the portal address, not the last connection address. Since the EQL array is unique in its use of login redirect, the initiator could use the temporary redirect as the heuristic for deciding whether to go back to the portal address or retry the last address.

Version-Release number of selected component (if applicable):
kernel-2.6.9-22

How reproducible:
Always

Steps to Reproduce:
1. Create a volume on the EQL array and enable at least 2 of the ethernet ports
2. Connect the RH software initiator to the volume on the Equallogic array and start some I/O load.
3. Use the GUI to determine which ethernet port is being used, and disable the port.
4. The load application will suspend and not resume.
5. Network traces show the initiator attempting to reconnect to the ethernet port that it was previously connected to, rather than the portal address.

Actual Results: The initiator will continue to retry, and connections are lost to the volume. Utilities such as iscsi -ls and /etc/init.d/iscsi reload hang and require a ctrl-c to break out.

Expected Results: The initiator should connect to the portal address. As mentioned before, since the EQL array issues the temporary login redirect, this could be the indicator to the initiator that connection retry should go back to the portal address.
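
A minimal sketch of the heuristic proposed above, assuming the initiator remembers whether its last login was answered with a temporary redirect. The `Session` type, its field names, and the addresses are hypothetical and are not taken from the iscsi-sfnet driver.

```python
from dataclasses import dataclass


@dataclass
class Session:
    portal_addr: str               # address the initiator was configured to log in to
    current_addr: str              # address the session actually ended up connected to
    temp_redirected: bool = False  # True if the last login got a temporary redirect


def reconnect_address(session: Session) -> str:
    """Pick the address to retry after a connection loss."""
    # A temporary redirect target is not guaranteed to still exist,
    # so go back to the portal and let the array redirect again.
    if session.temp_redirected:
        return session.portal_addr
    # No redirect was involved: retrying the last address is reasonable.
    return session.current_addr


# Hypothetical addresses from the documentation range.
s = Session(portal_addr="192.0.2.10", current_addr="192.0.2.11", temp_redirected=True)
print(reconnect_address(s))
```

With `temp_redirected=False` the same call returns the last connection address, which matches the behavior reported in this bug.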

Additional info:

Comment 7 Mike Christie 2005-10-24 16:37:06 UTC

Changing Component to kernel. In the future please try to select
iscsi-initiator-utils for userspace problems and kernel for driver problems. If
both a kernel and a userspace change are required then make two :(. iSCSI is a
legacy Component from when it was all bundled in one rpm. Thanks.

Note that if we add some sort of heuristics to determine when to switch paths,
userspace changes will be necessary.

Comment 8 Cesar Garde 2005-10-24 18:16:35 UTC

When you have a fix identified, would it be available for us in case we have
customers that run into this problem?

Comment 9 Mike Christie 2005-11-15 03:34:32 UTC

Created attachment 121044 [details]
retry portal addr after 5 retries

Cesar, could you verify this patch works for you? It is the simple fix we
discussed earlier. To fully support login redirect, we need some fixes in
userspace for iscsi management and that might not be done in U3 (U4 for sure).
For now though, I would at least like to get something that works for you guys.

Comment 10 Mike Christie 2005-11-15 03:35:31 UTC

oh yeah, that patch should work against the current RHEL4 U2 kernel source.

Comment 11 Mike Christie 2005-11-15 04:06:04 UTC

Comment on attachment 121044 [details]
retry portal addr after 5 retries

Wrong path. Do not test this one. I will upload a correct patch in a minute.

Comment 12 Mike Christie 2005-11-15 04:26:51 UTC

Created attachment 121045 [details]
retry portal address

Ok Cesar, please try this patch.

Comment 13 Cesar Garde 2005-11-15 20:05:19 UTC

Thanks Mike.  I pulled the patch in comment 12.  We'll let you know how it works.

Comment 14 Mike Christie 2005-11-16 16:05:33 UTC

Created attachment 121133 [details]
retry portal address immediately

Cesar, sorry about this. Cisco has sent us a fix for this that they prefer. It
turns out it is also what your engineer had mentioned. Please verify this
patch works for you guys.

Comment 15 Cesar Garde 2005-11-16 16:09:08 UTC

Just to be clear, I should use the attachment in comment 14 instead of the one
in comment 12, correct?

Comment 16 Mike Christie 2005-11-16 17:00:38 UTC

Yes.

Comment 19 Cesar Garde 2005-12-06 21:25:28 UTC

Created attachment 121946 [details]
network trace of 11-16-05 patch retest

Comment 20 Cesar Garde 2005-12-06 21:27:15 UTC

Sorry for the delayed response.  We finally had a chance to retest the patch and
found that the initiator does not attempt a login to the portal address.  I
included the trace in comment 19.

Comment 21 Mike Christie 2005-12-06 21:49:51 UTC

This trace only shows the first relogin attempt, right? It does not fall back to
the portal address until the relogin times out.

BTW, you should be setting the login_timeout to a much shorter value to avoid delays.
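
A rough sketch of the fallback order described above, assuming the initiator first retries the last address and only moves on to the portal address once that attempt times out. The `relogin` function, its `try_login` callback, and the addresses are hypothetical. This is not the code from the attached patch.

```python
from typing import Callable, List, Optional


def relogin(last_addr: str, portal_addr: str,
            try_login: Callable[[str, int], bool],
            login_timeout: int = 5, max_rounds: int = 3) -> Optional[str]:
    """Return the address that accepted the login, or None if all rounds fail.

    try_login(addr, timeout) is assumed to return False when the attempt
    times out, so a long login_timeout delays the fallback to the portal.
    """
    for _ in range(max_rounds):
        if try_login(last_addr, login_timeout):
            return last_addr
        # The last address timed out: fall back to the portal address.
        if try_login(portal_addr, login_timeout):
            return portal_addr
    return None


attempts: List[str] = []


def fake_login(addr: str, timeout: int) -> bool:
    # Stand-in for a real login: only the (hypothetical) portal answers.
    attempts.append(addr)
    return addr == "192.0.2.10"


print(relogin("192.0.2.11", "192.0.2.10", fake_login))
print(attempts)
```

In this model a network trace that stops before the first timeout expires would show only the attempt against the last address.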

Comment 22 Cesar Garde 2005-12-06 22:52:01 UTC

Created attachment 121949 [details]
patch retest with login_timeout=5

A little more detail about the retest:

login_timeout was set to 5 in iscsi.conf and iscsid was restarted.

Initiator (172.19.31.8) connected to 3 volumes on the EQL array.  2 connected
to IP address 172.19.102.161, and 1 connected to IP address .162.

Ethernet port for the .161 address was shut down.

fdisk command on the iscsi targets was issued.

Response came back from .162 (not shut down), and the command did not return. 
Trace was stopped after 5 minutes.

Comment 23 Mike Christie 2005-12-06 23:18:41 UTC

could you post the /var/log/messages too?

Comment 24 Mike Christie 2005-12-06 23:20:32 UTC

also I am not sure how an iscsid restart will work if the session you are
resetting the login_timeout for was created already. Could you just do a:

echo 5 > /sys/class/scsi_host/hostN/login_timeout or set the login_timeout
before iscsid is run for the first time.
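
For scripted retests, the same sysfs write could be sketched as below. The attribute path is the one given in the command above. The helper name `set_login_timeout` and its `sysfs_root` parameter are hypothetical and exist only so the demo can run against a temporary directory. Writing to the real /sys tree would need root and a host that exposes this attribute.

```python
import tempfile
from pathlib import Path


def set_login_timeout(host_no: int, seconds: int,
                      sysfs_root: str = "/sys/class/scsi_host") -> Path:
    """Write login_timeout for hostN, like `echo 5 > .../hostN/login_timeout`."""
    if seconds <= 0:
        raise ValueError("login_timeout must be a positive number of seconds")
    attr = Path(sysfs_root) / f"host{host_no}" / "login_timeout"
    attr.write_text(f"{seconds}\n")
    return attr


# Demo against a temporary directory standing in for /sys/class/scsi_host.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "host0").mkdir()
    attr = set_login_timeout(0, 5, sysfs_root=root)
    print(attr.read_text().strip())
```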

Comment 25 Cesar Garde 2005-12-07 15:41:00 UTC

Created attachment 121983 [details]
messages file from 12-7-05 retest

We re-ran the test today and started with a reboot of the system.  This is the
messages file.  The network trace will be in the next attachment.

Comment 26 Cesar Garde 2005-12-07 15:43:18 UTC

Created attachment 121984 [details]
network trace of 12-7-05 retest

Network trace from today's retest of the 11-16 patch.  login_timeout was set to
5.

Comment 27 Mike Christie 2005-12-07 20:25:30 UTC

Thanks Cesar, I have not gotten to look at the network trace, but wrt the
messages could you verify for me that for the session on host0 you did something
that forced a logout and then that session came back fine, but for the sessions
connected to host1 and host2 you just pulled a cable?

Comment 28 Cesar Garde 2005-12-07 20:40:43 UTC

We didn't do anything to force a logout on host0.  After the boot, connections
were made to 3 volumes on the array (which show up as host0, host1, host2). 
Host0 connected to one of the 3 ethernet ports, and host1 and host2 connected to
another.  Nothing was done with the host0 connection.  We shut down the ethernet
port that host1 and host2 were using.  Would the login redirect show up as a
logout, then a login?

Comment 29 Mike Christie 2005-12-08 00:50:27 UTC

No. In the messages the logout is showing up as a result of an AEN.

Dec  7 10:19:38 serverb kernel: iscsi-sfnet:host0: Target requests logout within 3 seconds for session
Dec  7 10:19:38 serverb kernel: iscsi-sfnet:host0: Session logged out
Dec  7 10:19:38 serverb kernel: iscsi-sfnet:host0: Session dropped
Dec  7 10:19:39 serverb kernel: iscsi-sfnet:host0: Login failed to authenticate with target iqn.2001-05.com.equallogic:6-8a0900-7f4a52a01-60a000038e64395a-vol1
Dec  7 10:19:39 serverb kernel: iscsi-sfnet:host0: Session established
Comment 30 Mike Christie 2005-12-08 01:09:56 UTC

I may have misunderstood your question.

I guess when we login we will get an error value indicating that the login failed
because the target wants us to try another address.
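
For reference, the iSCSI Login Response carries a Status-Class and Status-Detail pair, and RFC 3720 assigns class 0x01 to redirection (detail 0x01 for a temporary move, 0x02 for a permanent one). A small sketch of classifying those values follows. The function name and the returned labels are hypothetical, and how iscsi-sfnet reports these internally is not shown in this bug.

```python
def classify_login_status(status_class: int, status_detail: int) -> str:
    """Map a Login Response Status-Class/Status-Detail pair to a label."""
    if status_class == 0x00:
        return "success"
    if status_class == 0x01:
        # Redirection: the target wants the initiator to use another address.
        if status_detail == 0x01:
            return "redirect-temporary"
        if status_detail == 0x02:
            return "redirect-permanent"
        return "redirect-unknown"
    if status_class == 0x02:
        return "initiator-error"
    if status_class == 0x03:
        return "target-error"
    return "unknown"


print(classify_login_status(0x01, 0x01))
```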

Then there is that case above where for host0 we are logged in, then the target
logged us out. I did not see that in your trace so I am not sure what happened
as far as what addresses we used.

Comment 31 Mike Christie 2005-12-08 01:30:57 UTC

Created attachment 122017 [details]
add debug output

Could you grab the kernel from here
http://people.redhat.com/~jbaron/rhel4/
and apply this patch and send me all the log messages (login and failure).

Could you also maybe simplify the problem and not use so many ports or
something?

Comment 32 Cesar Garde 2005-12-08 22:21:39 UTC

We took the update 2 initiator and built it with the kernel in
http://people.redhat.com/~jbaron/rhel4/, and it works fine for us.  We'll check
it out in U3.  In the interim, can we get a hotfix that we can make available to
customers that may run into this problem before U3 is available? 

The fix resolves this bug from our point of view.  I'll let you move its status
according to your process.  Thanks for your help.

Comment 34 Red Hat Bugzilla 2006-03-07 20:24:24 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html