From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7 (ax)

Description of problem:
When connecting to the portal address on the Equallogic array, the array will issue a temporary login redirect. The redirected address, however, is not guaranteed to remain available if the connection is interrupted. In the case of a connection loss, the initiator must reconnect to the portal address, not the last connection address. Since the EQL array is unique in its use of login redirect, the initiator could use the temporary redirect as the heuristic for deciding whether to go back to the portal address or retry the last address.

Version-Release number of selected component (if applicable):
kernel-2.6.9-22

How reproducible:
Always

Steps to Reproduce:
1. Create a volume on the EQL array and enable at least 2 of the Ethernet ports.
2. Connect the RH software initiator to the volume on the Equallogic array and start some I/O load.
3. Use the GUI to determine which Ethernet port is being used, and disable the port.
4. The load application will suspend and not resume.
5. Network traces show the initiator attempting to reconnect to the Ethernet port that it was previously connected to, rather than the portal address.

Actual Results:
The initiator will continue to retry, and connections to the volume are lost. Utilities such as iscsi -ls and /etc/init.d/iscsi reload hang and require a ctrl-c to break out.

Expected Results:
The initiator should connect to the portal address. As mentioned before, since the EQL array issues the temporary login redirect, this could be the indicator to the initiator that a connection retry should go back to the portal address.

Additional info:
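The heuristic proposed in the description can be sketched roughly as follows. This is only an illustration in Python, not the driver code: the `Session` class, both function names, and the 192.0.2.x addresses in the usage note are hypothetical. The status values are the Redirection class codes from the iSCSI login response in RFC 3720. Updating the portal address on a permanent redirect is an assumption of the sketch, not something the report asks for.

```python
from dataclasses import dataclass

# Login response status codes, Redirection class (RFC 3720).
STATUS_CLASS_REDIRECT = 0x01
DETAIL_MOVED_TEMPORARILY = 0x01
DETAIL_MOVED_PERMANENTLY = 0x02


@dataclass
class Session:
    """Hypothetical per-session state kept by an initiator."""
    portal_addr: str               # address the administrator configured
    current_addr: str              # address of the live connection
    temp_redirected: bool = False  # True if current_addr came from a temporary redirect


def apply_login_redirect(session, status_class, status_detail, new_addr):
    """Record a login redirect received from the target."""
    if status_class != STATUS_CLASS_REDIRECT:
        return
    if status_detail == DETAIL_MOVED_TEMPORARILY:
        session.current_addr = new_addr
        session.temp_redirected = True
    elif status_detail == DETAIL_MOVED_PERMANENTLY:
        # Assumption: a permanent move replaces the configured address.
        session.current_addr = new_addr
        session.portal_addr = new_addr
        session.temp_redirected = False


def reconnect_address(session):
    """Pick the address to use after the connection is lost."""
    if session.temp_redirected:
        # The redirected address may be gone, so start over at the portal.
        return session.portal_addr
    return session.current_addr
```

For example, a session configured for 192.0.2.10 that was temporarily redirected to 192.0.2.11 would reconnect to 192.0.2.10, which is the behaviour requested under Expected Results.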
Changing Component to kernel. In the future please try to select iscsi-initiator-utils for userspace problems and kernel for driver problems. If there is a kernel and userspace change required then make two :(. iSCSI is a legacy Component from when it was all bundled in one rpm. Thanks. Note that if we add some sort of heuristics to determine when to switch paths, userspace changes will be necessary.
When you have a fix identified, would it be available for us in case we have customers that run into this problem?
Created attachment 121044
retry portal addr after 5 retries

Cesar, could you verify this patch works for you? It is the simple fix we discussed earlier. To fully support login redirect, we need some fixes in userspace for iscsi management, and that might not be done in U3 (U4 for sure). For now though, I would at least like to get something that works for you guys.
oh yeah, that patch should work against the current RHEL4 U2 kernel source.
Comment on attachment 121044
retry portal addr after 5 retries

Wrong path. Do not test this one. I will upload a correct patch in a minute.
Created attachment 121045
retry portal address

Ok Cesar, please try this patch.
Thanks Mike. I pulled the patch in comment 12. We'll let you know how it works.
Created attachment 121133
retry portal address immediately

Cesar, sorry about this. Cisco has sent us a fix for this that they prefer. It turns out it is also what your engineer had mentioned. Please verify this patch works for you guys.
Just to be clear, I should use the attachment in comment 14 instead of the one in comment 12, correct?
Yes.
Created attachment 121946
network trace of 11-16-05 patch retest
Sorry for the delayed response. We finally had a chance to retest the patch and found that the initiator does not attempt a login to the portal address. I included the trace in comment 19.
This trace only shows the first relogin attempt, right? It does not fall back to the portal address until the relogin times out. BTW, you should be setting the login_timeout to a much shorter value to avoid delays.
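To make the difference between the two fallback behaviours mentioned in this thread concrete, here is a rough Python sketch. It is based only on the attachment titles ("after 5 retries") and on the comment above (fall back once the first relogin times out); the actual patch contents are not shown here. The function name, the `fallback_after` parameter, and the labels "portal" and "last" are hypothetical.

```python
def login_targets(portal_addr, last_addr, fallback_after, attempts):
    """List the address used for each consecutive relogin try.

    Assumes every try fails (for example, the port was disabled).
    fallback_after is the number of failed tries against last_addr
    that are tolerated before switching to portal_addr.
    """
    targets = []
    for failed_so_far in range(attempts):
        if failed_so_far < fallback_after:
            targets.append(last_addr)
        else:
            targets.append(portal_addr)
    return targets


# Fall back after the first relogin times out.
print(login_targets("portal", "last", fallback_after=1, attempts=3))
# Fall back after 5 failed retries.
print(login_targets("portal", "last", fallback_after=5, attempts=7))
```

Under this model each failed try costs up to one login timeout, which is why a shorter login_timeout shortens the wait before the portal address is tried.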
Created attachment 121949
patch retest with login_timeout=5

A little more detail about the retest: login_timeout was set to 5 in iscsi.conf and iscsid was restarted. The initiator (172.19.31.8) connected to 3 volumes on the EQL array: 2 connected to IP address 172.19.102.161, and 1 connected to IP address .162. The Ethernet port for the .161 address was shut down. An fdisk command on the iscsi targets was issued. A response came back from .162 (not shut down), and the command did not return. The trace was stopped after 5 minutes.
could you post the /var/log/messages too?
Also, I am not sure how an iscsid restart will work if the session you are resetting the login_timeout for was already created. Could you just do:
    echo 5 > /sys/class/scsi_host/hostN/login_timeout
or set the login_timeout before iscsid is run for the first time?
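If the timeout has to be set on several hosts, the echo command above could be wrapped in a small script. The following is a minimal Python sketch, assuming the sysfs attribute path given in this comment; the function name and the `sysfs_root` parameter (there so the helper can be tried against a scratch directory) are hypothetical. Writing to the real sysfs path would normally require root.

```python
from pathlib import Path


def set_login_timeout(host_number, seconds, sysfs_root="/sys/class/scsi_host"):
    """Write a login timeout (in seconds) to one host's sysfs attribute.

    Equivalent in intent to:
        echo <seconds> > /sys/class/scsi_host/host<N>/login_timeout
    Returns the path that was written.
    """
    if seconds <= 0:
        raise ValueError("login timeout must be a positive number of seconds")
    attr = Path(sysfs_root) / f"host{host_number}" / "login_timeout"
    attr.write_text(f"{seconds}\n")
    return attr
```

Calling `set_login_timeout(1, 5)` would then correspond to the echo command with hostN replaced by host1.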
Created attachment 121983
messages file from 12-7-05 retest

We re-ran the test today and started with a reboot of the system. This is the messages file. The network trace will be in the next attachment.
Created attachment 121984
network trace of 12-7-05 retest

Network trace from today's retest of the 11-16 patch. login_timeout was set to 5.
Thanks Cesar. I have not gotten to look at the network trace yet, but with respect to the messages, could you verify something for me? For the session on host0, did you do something that forced a logout, after which that session came back fine? And for the sessions connected to host1 and host2, did you just pull a cable?
We didn't do anything to force a logout on host0. After the boot, connections were made to 3 volumes on the array (which show up as host0, host1, host2). Host0 connected to one of the 3 ethernet ports, and host1 and host2 connected to another. Nothing was done with the host0 connection. We shut down the ethernet port that host1 and host2 were using. Would the login redirect show up as a logout, then a login?
No. In the messages the logout is showing up as a result of an AEN.

Dec 7 10:19:38 serverb kernel: iscsi-sfnet:host0: Target requests logout within 3 seconds for session
Dec 7 10:19:38 serverb kernel: iscsi-sfnet:host0: Session logged out
Dec 7 10:19:38 serverb kernel: iscsi-sfnet:host0: Session dropped
Dec 7 10:19:39 serverb kernel: iscsi-sfnet:host0: Login failed to authenticate with target iqn.2001-05.com.equallogic:6-8a0900-7f4a52a01-60a000038e64395a-vol1
Dec 7 10:19:39 serverb kernel: iscsi-sfnet:host0: Session established
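When several sessions are involved, it can help to split messages like the ones above by host before reading them. The following is a small Python sketch for doing that; the regular expression is an assumption based only on the five lines quoted here, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Dec 7 10:19:38 serverb kernel: iscsi-sfnet:host0: Session dropped
# The pattern is inferred from the quoted lines only.
LINE_RE = re.compile(r"kernel: iscsi-sfnet:host(\d+): (.+)$")


def events_by_host(lines):
    """Group iscsi-sfnet kernel messages by SCSI host number."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match:
            events[int(match.group(1))].append(match.group(2))
    return dict(events)
```

Feeding it the contents of /var/log/messages would give one ordered list of events per host, so the logout and login sequence on host0 can be read separately from host1 and host2.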
I may have misunderstood your question. I guess when we log in we will get an error value indicating that the login failed because the target wants us to try another address. Then there is the case above where, for host0, we were logged in and then the target logged us out. I did not see that in your trace, so I am not sure what happened as far as which addresses we used.
Created attachment 122017
add debug output

Could you grab the kernel from http://people.redhat.com/~jbaron/rhel4/, apply this patch, and send me all the log messages (login and failure)? Could you also maybe simplify the problem and not use so many ports or something?
We took the update 2 initiator and built it with the kernel in http://people.redhat.com/~jbaron/rhel4/, and it works fine for us. We'll check it out in U3. In the interim, can we get a hotfix that we can make available to customers that may run into this problem before U3 is available? The fix resolves this bug from our point of view. I'll let you move its status according to your process. Thanks for your help.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html