Red Hat Bugzilla – Bug 170656
iSCSI connection recovery uses session address instead of portal address
Last modified: 2007-11-30 17:07:21 EST
Description of problem:
When connecting to the portal address on the Equallogic array, the array will issue a temporary login redirect. The redirected address, however, is not guaranteed to be there if the connection is interrupted. In the case of a connection loss, the initiator must reconnect to the portal address, not the last connection address. Since the EQL array is unique in its use of login redirect, the initiator could use the temporary redirect as the heuristic for deciding whether to go back to the portal address or retry the last address.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a volume on the EQL array and enable at least 2 of the ethernet ports
2. Connect the RH software initiator to the volume on the Equallogic array and start some I/O load.
3. Use the GUI to determine which ethernet port is being used, and disable the port.
4. The load application will suspend and not resume.
5. Network traces show the initiator attempting to reconnect to the ethernet port that it was previously connected to, rather than the portal address.
Actual Results: The initiator will continue to retry, and connections are lost to the volume. Utilities such as iscsi -ls and /etc/init.d/iscsi reload hang and require a ctrl-c to break out.
Expected Results: The initiator should connect to the portal address. As mentioned before, since the EQL array issues the temporary login redirect, this could be the indicator to the initiator that connection retry should go back to the portal address.
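The heuristic proposed in this report can be sketched roughly as below. This is only an illustration in Python, not the driver code; the function name choose_reconnect_address and its parameters are hypothetical, and it assumes the initiator remembers whether the last login followed a temporary redirect.

```python
def choose_reconnect_address(portal_addr, last_addr, was_redirected):
    """Pick the address for the next login attempt after a connection loss.

    Assumption: was_redirected is True when the session's last successful
    login followed a temporary login redirect issued by the target.
    """
    # A temporarily redirected address is not guaranteed to still exist,
    # so start over at the portal and let the target redirect again.
    if was_redirected:
        return portal_addr
    # No redirect was involved, so the last address is the one to retry.
    return last_addr
```

With this rule a target that never redirects keeps the existing behaviour, and only redirected sessions go back to the portal.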
Changing Component to kernel. In the future please try to select
iscsi-initiator-utils for userspace problems and kernel for driver problems. If
there is a kernel and userspace change required then make two :(. iSCSI is
a legacy Component from when it was all bundled in one rpm. Thanks.
Note that if we add some sort of heuristics to determine when to switch paths,
userspace changes will be necessary.
When you have a fix identified, would it be available for us in case we have
customers that run into this problem?
Created attachment 121044 [details]
retry portal addr after 5 retries
Cesar, could you verify this patch works for you? It is the simple fix we
discussed earlier. To fully support login redirect, we need some fixes in
userspace for iscsi management and that might not be done in U3 (U4 for sure).
For now though, I would at least like to get something that works for you guys.
oh yeah, that patch should work against the current RHEL4 U2 kernel source.
Comment on attachment 121044 [details]
retry portal addr after 5 retries
Wrong path. Do not test this one. I will upload a correct patch in a minute.
Created attachment 121045 [details]
retry portal address
Ok Cesar, please try this patch.
Thanks Mike. I pulled the patch in comment 12. We'll let you know how it works.
Created attachment 121133 [details]
retry portal address immediately
Cesar, sorry about this. Cisco has sent us a fix for this that they prefer. It
turns out it is also what your engineer had mentioned too. Please verify this
patch works for you guys.
Just to be clear, I should use the attachment in comment 14 instead of the one
in comment 12, correct?
Created attachment 121946 [details]
network trace of 11-16-05 patch retest
Sorry for the delayed response. We finally had a chance to retest the patch and
found that the initiator does not attempt a login to the portal address. I
included the trace in comment 19.
This trace only shows the first relogin attempt, right? It does not fall back to
the portal address until the relogin times out.
BTW, you should be setting the login_timeout to a much shorter value to avoid delays.
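To illustrate why the timeout matters, here is a rough Python model of the fallback order described in this comment. It is a sketch under assumptions, not the patch itself: relogin and the try_login callable are hypothetical names, and the real driver logic may differ.

```python
def relogin(try_login, last_addr, portal_addr, login_timeout):
    """Return the address that accepted the login, or None if both failed.

    try_login(addr, timeout) is a hypothetical callable that returns True
    on success and False when the attempt fails or times out.
    """
    # The first relogin attempt goes to the address the session last used.
    if try_login(last_addr, login_timeout):
        return last_addr
    # The portal is tried only after that attempt has failed, so the delay
    # before the fallback is bounded by login_timeout.
    if try_login(portal_addr, login_timeout):
        return portal_addr
    return None
```

In this model a dead port costs one full login_timeout before the portal is contacted, which is why a short value reduces the stall.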
Created attachment 121949 [details]
patch retest with login_timeout=5
A little more detail about the retest:
login_timeout was set to 5 in iscsi.conf and iscsid was restarted.
Initiator (172.19.31.8) connected to 3 volumes on the EQL array. 2 connected
to IP address 172.19.102.161, and 1 connected to IP address .162.
Ethernet port for the .161 address was shut down.
fdisk command on the iscsi targets was issued.
Response came back from .162 (not shut down), and the command did not return.
Trace was stopped after 5 minutes.
could you post the /var/log/messages too?
Also, I am not sure how an iscsid restart will work if the session you are
resetting the login_timeout for was created already. Could you just do a:
echo 5 > /sys/class/scsi_host/hostN/login_timeout or set the login_timeout
before iscsid is run for the first time.
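If there are several hosts, the echo above could be repeated for each one with a small helper like this. It is a sketch only: set_login_timeout is a hypothetical name, it assumes the login_timeout attribute lives under /sys/class/scsi_host/hostN as in the command above, and writing to sysfs needs root.

```python
import glob
import os


def set_login_timeout(seconds, sysfs_root="/sys/class/scsi_host"):
    """Write seconds to every host*/login_timeout file under sysfs_root.

    Returns the list of paths that were written. Hosts that do not expose
    a login_timeout attribute are skipped because the glob does not match.
    """
    written = []
    pattern = os.path.join(sysfs_root, "host*", "login_timeout")
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as attr:
            attr.write(f"{seconds}\n")
        written.append(path)
    return written
```

The sysfs_root parameter is there only so the helper can be pointed at a scratch directory for a dry run.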
Created attachment 121983 [details]
messages file from 12-7-05 retest
We re-ran the test today and started with a reboot of the system. This is the
messages file. The network trace will be in the next attachment.
Created attachment 121984 [details]
network trace of 12-7-05 retest
Network trace from today's retest of the 11-16 patch. login_timeout was set to
Thanks Cesar, I have not gotten to look at the network trace, but wrt the
messages, could you verify for me that for the session on host0 you did something
that forced a logout and then that session came back fine, but for the sessions
connected to host1 and host2 you just pulled a cable?
We didn't do anything to force a logout on host0. After the boot, connections
were made to 3 volumes on the array (which show up as host0, host1, host2).
Host0 connected to one of the 3 ethernet ports, and host1 and host2 connected to
another. Nothing was done with the host0 connection. We shut down the ethernet
port that host1 and host2 were using. Would the login redirect show up as a
logout, then a login?
No. In the messages the logout is showing up as a result of an AEN.
Dec 7 10:19:38 serverb kernel: iscsi-sfnet:host0: Target requests logout within 3 seconds for session
Dec 7 10:19:38 serverb kernel: iscsi-sfnet:host0: Session logged out
Dec 7 10:19:38 serverb kernel: iscsi-sfnet:host0: Session dropped
Dec 7 10:19:39 serverb kernel: iscsi-sfnet:host0: Login failed to authenticate with target iqn.2001-05.com.equallogic:6-8a0900-7f4a52a01-60a000038e64395a-vol1
Dec 7 10:19:39 serverb kernel: iscsi-sfnet:host0: Session established
I may have misunderstood your question.
I guess when we log in we will get an error value indicating that the login failed
because the target wants us to try another address.
Then there is that case above where for host0 we are logged in, then the target
logged us out. I did not see that in your trace so I am not sure what happened
as far as what addresses we used.
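To compare what each host logged, the iscsi-sfnet lines in /var/log/messages could be grouped per host with something like the following. This is a sketch: events_by_host and the regular expression are my own, and they assume each message sits on one line in the "iscsi-sfnet:hostN: text" form shown in the excerpt above.

```python
import re
from collections import defaultdict

# Matches the driver tag, the host name, and the message text after it.
EVENT_RE = re.compile(r"iscsi-sfnet:(host\d+): (.*)$")


def events_by_host(lines):
    """Group iscsi-sfnet message texts by host, preserving log order."""
    events = defaultdict(list)
    for line in lines:
        match = EVENT_RE.search(line.rstrip("\n"))
        if match:
            events[match.group(1)].append(match.group(2))
    return dict(events)
```

Passing the open messages file to events_by_host would give one ordered list of events per host, which makes it easier to see which sessions were logged out by the target and which simply dropped.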
Created attachment 122017 [details]
add debug output
Could you grab the kernel from here
and apply this patch and send me all the log messages (login and failure).
Could you also maybe simplify the problem and not use so many ports or
We took the update 2 initiator and built it with the kernel in
http://people.redhat.com/~jbaron/rhel4/, and it works fine for us. We'll check
it out in U3. In the interim, can we get a hotfix that we can make available to
customers that may run into this problem before U3 is available?
The fix resolves this bug from our point of view. I'll let you move its status
according to your process. Thanks for your help.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.