Bug 168057
Summary: linux-iscsi initiator in rhel4 u2 does not recover from tpgt change via "iscsi reload"
Product: Red Hat Enterprise Linux 4
Reporter: Dave Wysochanski <davidw>
Component: iscsi-initiator-utils
Assignee: Mike Christie <mchristi>
Status: CLOSED WONTFIX
QA Contact: Brock Organ <borgan>
Severity: medium
Priority: medium
Version: 4.0
CC: 157070.alewis, coughlan
Target Milestone: ---
Target Release: ---
Hardware: alpha
OS: Linux
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2012-06-20 16:11:05 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Description
Dave Wysochanski
2005-09-11 21:41:16 UTC
Created attachment 118697 [details]
iscsi-kill-session script
I heard back from someone internally who has dealt with this issue a lot before. As I thought, the right behavior is to tear down the session, fail any I/Os, then start all over. I don't think this was the case in previous linux-iscsi drivers (I/Os would not fail, but would just be resubmitted under the covers w/out LUN map revalidation, etc). This previous behavior was not really technically correct, but is probably practically ok. Here's the precise wording from the developer:

My expectation of an initiator when the TPGT changes is for it to recognize that something has changed and re-discover its iSCSI targets and LUNs. My expectation of the target is to close all existing sessions tied to the target portal whose TPGT has changed (which we do). If the initiator doesn't do the right thing, the next best thing it can do is ignore the TPGT change altogether. Ignoring TPGT is not an expected behavior according to the RFC and we shouldn't ask any initiator developers to implement it as such. If possible, we should encourage them to do the right thing.

Just confirmed that on RHEL3 linux-iscsi, an "iscsi reload" will just cause an update of the tpgt, not an invalidation of the LUNs. I had a LUN mounted with I/O going to it, then changed the tpgt on the target, which shut down the session. The initiator was trying to reconnect and couldn't b/c of the tpgt change. I then issued "iscsi reload", and the initiator reconnected and I/O resumed w/out any I/O errors.

Looks like if we change the target name though, the LUN gets removed and I/Os will fail (see below). Not sure if you change an IP address if you'll get the tpgt change behavior or the target nodename change behavior - my guess is the former.
Sep 12 15:28:46 rachman kernel: iSCSI: session dfff0000 portal group tag mismatch, expected 1001, received 1
Sep 12 15:28:46 rachman kernel: iSCSI: session dfff0000 retrying login to portal 1 at 6744844
Sep 12 15:28:46 rachman kernel: iSCSI: session dfff0000 to iqn.1992-08.com.netapp:sn.50393227.yanni waiting 1 seconds before next login attempt
Sep 12 15:28:46 rachman iscsid[28350]: Connected to Discovery Address 10.60.155.94
Sep 12 15:28:46 rachman iscsid[28108]: updating bus 0 target 3 to configuration #2
Sep 12 15:28:47 rachman kernel: iSCSI: bus 0 target 3 updating configuration of session dfff0000 to iqn.1992-08.com.netapp:sn.50393227.yanni
Sep 12 15:28:47 rachman kernel: iSCSI: bus 0 target 3 = iqn.1992-08.com.netapp:sn.50393227.yanni
Sep 12 15:28:47 rachman kernel: iSCSI: bus 0 target 3 portal 0 = address 10.60.155.94 port 3260 group 1
Sep 12 15:28:47 rachman kernel: iSCSI: bus 0 target 3 portal 1 = address 10.60.155.95 port 3260 group 1
Sep 12 15:28:47 rachman kernel: iSCSI: bus 0 target 3 portals have changed, failed to find a new portal in portal group 1002, session dfff0000 trying portal 0 group 1
Sep 12 15:28:47 rachman kernel: iSCSI: bus 0 target 3 configuration updated at 6744961 while session dfff0000 to iqn.1992-08.com.netapp:sn.50393227.yanni is not established
Sep 12 15:28:47 rachman kernel: iSCSI: bus 0 target 3 trying to establish session dfff0000 to portal 0, address 10.60.155.94 port 3260 group 1
Sep 12 15:28:49 rachman kernel: iSCSI: bus 0 target 3 established session dfff0000 #3, portal 0, address 10.60.155.94 port 3260 group 1
Sep 12 15:28:49 rachman kernel: iSCSI: session dfff0000 recv_cmd d35a6600, cdb 0x2a, status 0x2, response 0x0, senselen 22, key 06, ASC/ASCQ 29/00, itt 140554 task ce6ec358 to (5 0 3 0), iqn.1992-08.com.netapp:sn.50393227.yanni
Sep 12 15:28:49 rachman kernel: iSCSI: Sense f0000600 0000000e 00000000 29000000 0000
Sep 12 15:28:49 rachman kernel: iSCSI: session dfff0000 recv_cmd d35a6600, itt 140554, task ce6ec358 to (5 0 3 0), cdb 0x2a, U underflow, received 0, residual 8192, expected 8192
Sep 12 15:42:50 rachman kernel: iSCSI: session dfff024c target dropping all connections, reconnect min 0 max 0
Sep 12 15:42:50 rachman iscsid[28350]: Connection to Discovery Address 10.60.155.94 closed
Sep 12 15:42:50 rachman kernel: iSCSI: session dfff0000 closed by target iqn.1992-08.com.netapp:sn.50393227.yanni at 6829312
Sep 12 15:42:50 rachman kernel: iSCSI: session dfff0000 to iqn.1992-08.com.netapp:sn.50393227.yanni dropped
Sep 12 15:42:51 rachman kernel: iSCSI: bus 0 target 3 trying to establish session dfff0000 to portal 0, address 10.60.155.94 port 3260 group 1
Sep 12 15:42:51 rachman kernel: iSCSI: session dfff0000 to iqn.1992-08.com.netapp:sn.50393227.yanni failed to connect, rc -111, Connection refused
Sep 12 15:42:51 rachman kernel: iSCSI: session dfff0000 connect failed at 6829358
Sep 12 15:42:51 rachman kernel: iSCSI: session dfff0000 to iqn.1992-08.com.netapp:sn.50393227.yanni waiting 1 seconds before next login attempt
Sep 12 15:42:52 rachman kernel: iSCSI: bus 0 target 3 trying to establish session dfff0000 to portal 0, address 10.60.155.94 port 3260 group 1
Sep 12 15:42:52 rachman kernel: iSCSI: session dfff0000 login rejected: initiator error - target not found (02/03)
Sep 12 15:42:52 rachman kernel: iSCSI: session dfff0000 giving up at 6829499
Sep 12 15:42:52 rachman kernel: iSCSI: session dfff0000 terminating, failing all SCSI commands
Sep 12 15:42:52 rachman kernel: iSCSI: session dfff0000 failing command d35a6600 cdb 0x28 to (5 0 3 0) at 6829512
Sep 12 15:42:52 rachman kernel: iSCSI: session dfff0000 failing command cdb1d600 cdb 0x28 to (5 0 3 0) at 6829512
Sep 12 15:42:52 rachman kernel: iSCSI: session dfff0000 failing command d0831400 cdb 0x2a to (5 0 3 0) at 6829512
Sep 12 15:42:52 rachman kernel: Device 08:30 not ready.
Sep 12 15:42:52 rachman kernel: I/O error: dev 08:30, sector 4460560
Sep 12 15:42:53 rachman kernel: iSCSI: session dfff0000 terminating, failing to queue d0831000 cdb 0x28 and any following commands to (5 0 3 0), iqn.1992-08.com.netapp:sn.50393227.yanni
Sep 12 15:42:53 rachman kernel: Device 08:30 not ready.
Sep 12 15:42:53 rachman kernel: I/O error: dev 08:30, sector 4460768
Sep 12 15:42:53 rachman kernel: Device 08:30 not ready.
Sep 12 15:42:53 rachman kernel: I/O error: dev 08:30, sector 61160
Sep 12 15:42:53 rachman kernel: scsi5: remove-single-device 0 3 0 failed, device busy(1).
Sep 12 15:42:53 rachman kernel: iSCSI: session dfff0000 error -16 writing 'scsi remove-single-device 5 0 3 0' to /proc/scsi/scsi
Sep 12 15:42:53 rachman kernel: Device 08:30 not ready.
Sep 12 15:42:53 rachman kernel: I/O error: dev 08:30, sector 61208
Sep 12 15:42:53 rachman kernel: Device 08:30 not ready.
Sep 12 15:42:53 rachman kernel: I/O error: dev 08:30, sector 4460560
Sep 12 15:42:53 rachman kernel: Device 08:30 not ready.
Sep 12 15:42:53 rachman kernel: I/O error: dev 08:30, sector 4460560
Sep 12 15:42:56 rachman kernel: iSCSI: queuecommand d0831400 failed to find a session for HBA cdb1db00, (5 0 3 0)
Sep 12 15:42:56 rachman kernel: Device 08:30 not ready.
Sep 12 15:42:56 rachman kernel: I/O error: dev 08:30, sector 61216
Sep 12 15:42:56 rachman kernel: Device 08:30 not ready.
Sep 12 15:42:56 rachman kernel: I/O error: dev 08:30, sector 61232
Sep 12 15:43:10 rachman kernel: Device 08:30 not ready.
Sep 12 15:43:10 rachman kernel: I/O error: dev 08:30, sector 4419952
Sep 12 15:43:11 rachman kernel: Device 08:30 not ready.
Sep 12 15:43:11 rachman kernel: I/O error: dev 08:30, sector 4456448
Sep 12 15:43:11 rachman kernel: Device 08:30 not ready.
Sep 12 15:43:11 rachman kernel: I/O error: dev 08:30, sector 6815744

Looks like in RHEL3 as long as the target nodename doesn't change, the IPs or tpgts can change without it affecting I/Os (mainly just an FYI).
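The two failure modes in the log above (a portal group tag mismatch versus a rejected login after a target name change) can be told apart mechanically. The following is a minimal sketch, not part of the driver or of iscsid: the function name `classify` and the event labels are made up for illustration, and the patterns are copied from the message wording quoted in this report, so other driver versions may log these events differently.

```python
import re

# Patterns taken from the kernel messages quoted in this report.
# Assumption: other linux-iscsi versions may word these lines differently.
TPGT_MISMATCH = re.compile(
    r"session (?P<session>[0-9a-f]+) portal group tag mismatch, "
    r"expected (?P<expected>\d+), received (?P<received>\d+)"
)
LOGIN_REJECTED = re.compile(
    r"session (?P<session>[0-9a-f]+) login rejected: "
    r"(?P<reason>.+?) \((?P<code>[0-9a-f]{2}/[0-9a-f]{2})\)"
)


def classify(line):
    """Return a tuple describing the event on this log line, or None.

    The labels "tpgt_mismatch" and "login_rejected" are illustrative only.
    """
    m = TPGT_MISMATCH.search(line)
    if m:
        return ("tpgt_mismatch", m.group("session"),
                int(m.group("expected")), int(m.group("received")))
    m = LOGIN_REJECTED.search(line)
    if m:
        return ("login_rejected", m.group("session"),
                m.group("reason"), m.group("code"))
    return None


if __name__ == "__main__":
    import sys
    # Usage: python classify.py < /var/log/messages
    for raw in sys.stdin:
        event = classify(raw)
        if event:
            print(event)
```

Run against the excerpt above, this would flag the 15:28:46 line as a tag mismatch (which the RHEL3 driver recovered from after "iscsi reload") and the 15:42:52 line as a rejected login (after which the session gave up and I/O failed).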
Things like IPs and tpgts just cause a logout, then the commands are internally failed (like they are failed for our erl=0 handling) and set up for a requeue. Because that driver does scanning internally, it can also probe LUNs without failing I/O to upper layers and causing sd? to change. If you turn on debugging in iscsid (iscsid -d 8), you can tell for sure whether we are doing a LUN probe on this reconfig. I mean fo

I'd be ok with that if it wasn't for the fact that the scanning it does internally is just based on LUN # - if "LUN 0" was mapped before, and "LUN 0" is still mapped, it might be a different LUN but the rhel3 driver (or other drivers) won't recognize it (they do LUN map validation based on bitmaps, not INQUIRY pg 0x83 data or something like that). It's probably mostly a pathological case, but something to be aware of for rhel3 and prior drivers. We hit this when doing LUN resizes (in rhel3 if you resize a LUN, then do an "iscsi reload", the kernel will still think the LUN is the old size b/c the iscsi driver just sees that LUs are still mapped at the same #'s so it doesn't tell the kernel anything new).

OK with what? I was only describing the code and I thought we were only trying to fix RHEL4 to do something. Are you also wanting to fix RHEL3?

Sorry - was mainly just pointing out differences between rhel4 and rhel3 - not suggesting we change rhel3 behavior.

Created attachment 118738 [details]
messages file snippet showing scsi_lib error after iscsi-kill-session when tpgt changed
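To make the LUN validation concern concrete, here is a rough model of the difference between checking only which LUN numbers are mapped and checking what is actually behind each number. This is an illustration, not the rhel3 driver's code: the `Lun` class, its field names, and the identifier strings are hypothetical, with `device_id` standing in for whatever a driver would read from INQUIRY page 0x83.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lun:
    number: int       # LUN number as mapped on the target
    device_id: str    # hypothetical stand-in for an INQUIRY page 0x83 identifier
    size_blocks: int  # capacity the kernel would learn from a rescan


def changed_by_number(old, new):
    """Bitmap-style check: only compares which LUN numbers are mapped."""
    return {lun.number for lun in old} != {lun.number for lun in new}


def changed_by_identity(old, new):
    """Identity-style check: also compares device identifier and size."""
    old_map = {lun.number: lun for lun in old}
    new_map = {lun.number: lun for lun in new}
    return old_map != new_map


before = [Lun(0, "id-A", 1000)]
resized = [Lun(0, "id-A", 2000)]   # same LUN, grown on the target
swapped = [Lun(0, "id-B", 1000)]   # different LUN mapped at the same number

for label, after in (("resized", resized), ("swapped", swapped)):
    print(label,
          "number-only:", changed_by_number(before, after),
          "identity:", changed_by_identity(before, after))
```

In both cases the number-only comparison reports no change, which matches the resize symptom described above: the kernel is never told anything new after "iscsi reload".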
In rhel4 I'm just looking for the best way to recover from this scenario. Right now the best I can come up with is just "iscsi stop/start" or something like that. Maybe that's the best answer right now since "iscsi reload" doesn't work like it used to and maybe it's too hard to get the old behavior back. It might be nice to be able to kill the one session and keep going but I'm not sure it's worth the effort - Mike you can probably make the call. Maybe it would be nice to make "iscsi-kill-session" safe and fix errors like the scsi_lib one but then again maybe not.

Changing Component to iscsi-initiator-utils for now. In the future please try to select iscsi-initiator-utils for userspace problems and kernel for driver problems. iSCSI is a legacy Component from when it was all bundled in one rpm. Thanks.

Mike, just to let you know we're probably going to go with a workaround on our end so that getting the driver to recover from a tpgt change is probably not as important anymore. Not sure if you were working on it or not but you can probably close this or at least lower the priority until further notice and I can re-open if need be (unless you're getting other vendors that want this fixed). Thanks.

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.