Bug 170656 - iSCSI connection recovery uses session address instead of portal address
Summary: iSCSI connection recovery uses session address instead of portal address
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i386
OS: Linux
medium
high
Target Milestone: ---
: ---
Assignee: Mike Christie
QA Contact: Brock Organ
URL:
Whiteboard:
Depends On:
Blocks: 168429
 
Reported: 2005-10-13 15:44 UTC by Cesar Garde
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-07 20:24:24 UTC
Target Upstream Version:
Embargoed:


Attachments
retry portal addr after 5 retries (1.05 KB, patch)
2005-11-15 03:34 UTC, Mike Christie
retry portal address (1.47 KB, patch)
2005-11-15 04:26 UTC, Mike Christie
retry portal address immediately (689 bytes, patch)
2005-11-16 16:05 UTC, Mike Christie
network trace of 11-16-05 patch retest (169.07 KB, application/octet-stream)
2005-12-06 21:25 UTC, Cesar Garde
patch retest with login_timeout=5 (196.17 KB, application/octet-stream)
2005-12-06 22:52 UTC, Cesar Garde
messages file from 12-7-05 retest (95.37 KB, text/plain)
2005-12-07 15:41 UTC, Cesar Garde
network trace of 12-7-05 retest (180.56 KB, application/octet-stream)
2005-12-07 15:43 UTC, Cesar Garde
add debug output (2.54 KB, patch)
2005-12-08 01:30 UTC, Mike Christie


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:808 0 normal SHIPPED_LIVE Important: kernel security update 2005-10-27 04:00:00 UTC
Red Hat Product Errata RHSA-2006:0132 0 qe-ready SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 3 2006-03-09 16:31:00 UTC

Description Cesar Garde 2005-10-13 15:44:39 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7 (ax)

Description of problem:
When connecting to the portal address on the Equallogic array, the array will issue a temporary login redirect.  The redirected address, however, is not guaranteed to be there if the connection is interrupted.  In the case of a connection loss, the initiator must reconnect to the portal address, not the last connection address.  Since the EQL array is unique in its use of login redirect, the initiator could use the temporary redirect as the heuristic for deciding whether to go back to the portal address or retry the last address.

Version-Release number of selected component (if applicable):
kernel-2.6.9-22

How reproducible:
Always

Steps to Reproduce:
1. Create a volume on the EQL array and enable at least 2 of the ethernet ports
2. Connect the RH software initiator to the volume on the Equallogic array and start some I/O load.  
3. Use the GUI to determine which ethernet port is being used, and disable the port.
4. The load application will suspend and not resume.
5. Network traces show the initiator attempting to reconnect to the ethernet port that it was previously connected to, rather than the portal address.
  

Actual Results:  The initiator will continue to retry, and connections are lost to the volume.  Utilities such as iscsi -ls and /etc/init.d/iscsi reload hang and require a ctrl-c to break out.

Expected Results:  The initiator should connect to the portal address.  As mentioned before, since the EQL array issues the temporary login redirect, this could be the indicator to the initiator that connection retry should go back to the portal address.
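
For illustration only, the heuristic requested above could be sketched as follows. This is not the driver's code: the Session record and its field names are hypothetical, and the only rule it encodes is the one described in this report (a session that got where it is through a temporary redirect goes back to the portal address on recovery).

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical session record; field names are illustrative only."""
    portal_addr: str        # address configured by the administrator
    current_addr: str       # address the session is actually connected to
    temp_redirected: bool   # True if the login was redirected temporarily


def reconnect_address(session: Session) -> str:
    """Pick the address to use for connection recovery."""
    if session.temp_redirected:
        # A temporary redirect target may be gone after a failure,
        # so start again from the portal and let it redirect us.
        return session.portal_addr
    # No temporary redirect was involved, so retrying the last
    # address is a reasonable default.
    return session.current_addr
```

With this rule a session that was redirected from the portal to another port would retry the portal, while a session that never saw a temporary redirect keeps retrying its last address.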

Additional info:

Comment 7 Mike Christie 2005-10-24 16:37:06 UTC
Changing Component to kernel. In the future please try to select
iscsi-initiator-utils for userspace problems and kernel for driver problems. If
there is a kernel and userspace change required then make two :(. iSCSI is a
legacy Component from when it was all bundled in one rpm. Thanks.

Note that if we add some sort of heuristics to determine when to switch paths
userspace changes will be necessary.

Comment 8 Cesar Garde 2005-10-24 18:16:35 UTC
When you have a fix identified, would it be available for us in case we have
customers that run into this problem?  


Comment 9 Mike Christie 2005-11-15 03:34:32 UTC
Created attachment 121044 [details]
retry portal addr after 5 retries

Cesar, could you verify this patch works for you? It is the simple fix we
discussed earlier. To fully support login redirect, we need some fixes in
userspace for iscsi management and that might not be done in U3 (U4 for sure).
For now though, I would at least like to get something that works for you guys.

Comment 10 Mike Christie 2005-11-15 03:35:31 UTC
oh yeah, that patch should work against the current RHEL4 U2 kernel source.

Comment 11 Mike Christie 2005-11-15 04:06:04 UTC
Comment on attachment 121044 [details]
retry portal addr after 5 retries

Wrong path. Do not test this one. I will upload a correct patch in a minute.

Comment 12 Mike Christie 2005-11-15 04:26:51 UTC
Created attachment 121045 [details]
retry portal address

Ok Cesar, please try this patch.

Comment 13 Cesar Garde 2005-11-15 20:05:19 UTC
Thanks Mike.  I pulled the patch in comment 12.  We'll let you know how it works.



Comment 14 Mike Christie 2005-11-16 16:05:33 UTC
Created attachment 121133 [details]
retry portal address immediately

Cesar, sorry about this. Cisco has sent us a fix for this they prefer. It
turns out it is also what your engineer had mentioned too. Please verify this
patch works for you guys.

Comment 15 Cesar Garde 2005-11-16 16:09:08 UTC
Just to be clear, I should use the attachment in comment 14 instead of the one
in comment 12, correct?

Comment 16 Mike Christie 2005-11-16 17:00:38 UTC
Yes.

Comment 19 Cesar Garde 2005-12-06 21:25:28 UTC
Created attachment 121946 [details]
network trace of 11-16-05 patch retest

Comment 20 Cesar Garde 2005-12-06 21:27:15 UTC
Sorry for the delayed response.  We finally had a chance to retest the patch and
found that the initiator does not attempt a login to the portal address.  I
included the trace in comment 19.


Comment 21 Mike Christie 2005-12-06 21:49:51 UTC
This trace only shows the first relogin attempt, right? It does not fall back to
the portal address until the relogin times out.

BTW, you should be setting the login_timeout to a much shorter value to avoid delays.
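
To make the ordering in this comment concrete, here is a rough sketch of the recovery sequence as described: the first relogin goes to the last address, and the portal address is only tried once that attempt fails or times out. It is an assumption-laden model, not the patch itself; recover() and the try_login callback are hypothetical names.

```python
from typing import Callable, Optional


def recover(last_addr: str, portal_addr: str,
            try_login: Callable[[str, int], bool],
            login_timeout: int) -> Optional[str]:
    """Return the address that accepted the login, or None.

    try_login(addr, timeout) is a hypothetical callback that returns
    True on success and False if the login failed or timed out.
    """
    # The first relogin attempt goes to the address used before the failure.
    if try_login(last_addr, login_timeout):
        return last_addr
    # Only after that attempt fails or times out is the portal tried.
    if portal_addr != last_addr and try_login(portal_addr, login_timeout):
        return portal_addr
    return None
```

Under this model a trace that stops before the first login attempt has timed out would not show any traffic to the portal address, which is why a short login_timeout matters for the test.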

Comment 22 Cesar Garde 2005-12-06 22:52:01 UTC
Created attachment 121949 [details]
patch retest with login_timeout=5

A little more detail about the retest:

login_timeout was set to 5 in iscsi.conf and iscsid was restarted.

Initiator (172.19.31.8) connected to 3 volumes on the EQL array.  2 connected
to IP address 172.19.102.161, and 1 connected to IP address .162.

Ethernet port for the .161 address was shut down.

fdisk command on the iscsi targets was issued.

Response came back from .162 (not shut down), and the command did not return. 
Trace was stopped after 5 minutes.

Comment 23 Mike Christie 2005-12-06 23:18:41 UTC
could you post the /var/log/messages too?

Comment 24 Mike Christie 2005-12-06 23:20:32 UTC
also I am not sure how an iscsid restart will work if the session you are
resetting the login_timeout for was created already. Could you just do a:

echo 5 > /sys/class/scsi_host/hostN/login_timeout

or set the login_timeout before iscsid is run for the first time.

Comment 25 Cesar Garde 2005-12-07 15:41:00 UTC
Created attachment 121983 [details]
messages file from 12-7-05 retest

We re-ran the test today and started with a reboot of the system.  This is the
messages file.  The network trace will be in the next attachment.

Comment 26 Cesar Garde 2005-12-07 15:43:18 UTC
Created attachment 121984 [details]
network trace of 12-7-05 retest

Network trace from today's retest of the 11-16 patch.  login_timeout was set to
5.

Comment 27 Mike Christie 2005-12-07 20:25:30 UTC
Thanks Cesar, I have not gotten to look at the network trace, but wrt the
messages could you verify for me that for the session on host0 you did something
that forced a logout and then that session came back fine, but for the sessions
connected to host1 and host2 did you just pull a cable?

Comment 28 Cesar Garde 2005-12-07 20:40:43 UTC
We didn't do anything to force a logout on host0.  After the boot, connections
were made to 3 volumes on the array (which show up as host0, host1, host2). 
Host0 connected to one of the 3 ethernet ports, and host1 and host2 connected to
another.  Nothing was done with the host0 connection.  We shut down the ethernet
port that host1 and host2 were using.  Would the login redirect show up as a
logout, then a login?

Comment 29 Mike Christie 2005-12-08 00:50:27 UTC
No. In the messages the logout is showing up as a result of an AEN.

Dec  7 10:19:38 serverb kernel: iscsi-sfnet:host0: Target requests logout within
3 seconds for session
Dec  7 10:19:38 serverb kernel: iscsi-sfnet:host0: Session logged out
Dec  7 10:19:38 serverb kernel: iscsi-sfnet:host0: Session dropped
Dec  7 10:19:39 serverb kernel: iscsi-sfnet:host0: Login failed to authenticate
with target iqn.2001-05.com.equallogic:6-8a0900-7f4a52a01-60a000038e64395a-vol1
Dec  7 10:19:39 serverb kernel: iscsi-sfnet:host0: Session established

Comment 30 Mike Christie 2005-12-08 01:09:56 UTC
I may have misunderstood your question.

I guess when we login we will get an error value indicating that the login failed
because the target wants us to try another address.

Then there is that case above where for host0 we are logged in, then the target
logged us out. I did not see that in your trace so I am not sure what happened
as far as what addresses we used.
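
For reference, the "error value" on login mentioned above corresponds to the Status-Class and Status-Detail fields of the iSCSI Login Response defined in RFC 3720. A minimal sketch of how those two bytes could be classified is below. The numeric values are from the RFC; the function and constant names are made up for this example and are not taken from the driver.

```python
# Status-Class values from the iSCSI Login Response (RFC 3720).
SUCCESS = 0x00
REDIRECTION = 0x01
INITIATOR_ERROR = 0x02
TARGET_ERROR = 0x03

# Status-Detail values within the Redirection class.
TARGET_MOVED_TEMPORARILY = 0x01
TARGET_MOVED_PERMANENTLY = 0x02


def classify_login_status(status_class: int, status_detail: int) -> str:
    """Map a Login Response status to a short description."""
    if status_class == SUCCESS:
        return "success"
    if status_class == REDIRECTION:
        if status_detail == TARGET_MOVED_TEMPORARILY:
            return "temporary redirect"
        if status_detail == TARGET_MOVED_PERMANENTLY:
            return "permanent redirect"
        return "redirect (unknown detail)"
    if status_class == INITIATOR_ERROR:
        return "initiator error"
    if status_class == TARGET_ERROR:
        return "target error"
    return "unknown"
```

A status of class 0x01, detail 0x01 is the temporary redirect this bug is about; in that case the initiator is expected to log in again at the address the target supplies.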

Comment 31 Mike Christie 2005-12-08 01:30:57 UTC
Created attachment 122017 [details]
add debug output

Could you grab the kernel from here
http://people.redhat.com/~jbaron/rhel4/
and apply this patch and send me all the log messages (login and failure).

Could you also maybe simplify the problem and not use so many ports or
something?

Comment 32 Cesar Garde 2005-12-08 22:21:39 UTC
We took the update 2 initiator and built it with the kernel in
http://people.redhat.com/~jbaron/rhel4/, and it works fine for us.  We'll check
it out in U3.  In the interim, can we get a hotfix that we can make available to
customers that may run into this problem before U3 is available? 

The fix resolves this bug from our point of view.  I'll let you move its status
according to your process.  Thanks for your help.

Comment 34 Red Hat Bugzilla 2006-03-07 20:24:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html


