Bug 144293
| Summary: | The iscsi SW initiator (Cisco) fail to discover all the LUNs from a target | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Sorin Faibish <sfaibish> |
| Component: | kernel | Assignee: | Mike Christie <mchristi> |
| Status: | CLOSED NOTABUG | | |
| Severity: | medium | | |
| Priority: | medium | | |
| Version: | 3.0 | CC: | berthiaume_wayne, bpeck, conway_heather, coughlan, kaufman_susan, perez-kolk_santiago, rkenna, sfaibish, smitha.narayan |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Doc Type: | Bug Fix | | |
| Last Closed: | 2005-10-25 18:04:09 UTC | | |
| Bug Blocks: | 170445 | | |
Description
Sorin Faibish
2005-01-05 18:08:06 UTC
Created attachment 109384
tcpdump file in text format including the correct case
Created attachment 109385
tcpdump in text format with the erroneous iscsi initiator behavior
Sorin,

I have tested with over 60 LUNs and I cannot reproduce this problem. I have attached a debug version of the driver. It prints a lot of output to /var/log/messages, which will hopefully uncover the problem on your system. This driver is built for 2.4.21-27.ELsmp.

Please do the following:

gunzip iscsi_sfnet.o.gz
service iscsi stop
rmmod iscsi_sfnet
insmod iscsi_sfnet.o
service iscsi start

Check to see if all the LUNs are configured. You can re-test with "service iscsi restart" if needed. Post the results.

Tom

Created attachment 109743
debug driver
Created attachment 110229
This includes 2 files: iscsi-ls.1 and iscsi_messages.1
After several additional tests using the proposed debugging tools, I hope I was able to understand in which cases the error shows up.

First I tried the same setup as Tom, using a Celerra target with 40 LUNs, and I was not able to reproduce the problem. Then I used iscsi-ls -l to check that all the LUNs were discovered, and I realized that the Cisco iSCSI switch behaves differently from the Celerra. When connecting to the Cisco MDS switch, iscsi-ls -l shows that there are 2 targets (see the attached file), while the Celerra had only one. The missing LUNs were on one or the other of the targets. In the attached log file iscsi-ls.1 there were supposed to be 52 LUNs (26 on each target), and the log shows that there are only 27 on both targets.

Also attached are the /var/log/messages sections for different tests:
1. the sequence service iscsi stop, rmmod iscsi_sfnet, insmod iscsi_sfnet.o, service iscsi start
2. service iscsi restart
3. service iscsi reload
These are marked in the second file, iscsi_messages.1, which captures only the iscsi commands and debug information.

This is where I am today. I am planning to simulate the MDS behavior on the Celerra to see if I can reproduce the problem without using the Cisco MDS iSCSI bridge/switch.

Sorin,

The attached log was not generated by the debug version of the iscsi driver that I posted. Please retry your test with the debug driver. It should provide the information we need.

Also, the attached log does not show any LUNs being configured at all. I thought that about half the LUNs were working okay. Is this so? Any idea why they don't show up in the log?

Tom

Created attachment 110521
New Debug data using iscsi_sfnet.o patch
Debug data
On the Celerra, the driver is configuring LUNs 0-7. This is because the Report LUNs command is failing. From the log:

Feb 1 13:56:50 localhost kernel: iSCSI: session c2384000 REPORT_LUNS failed, falling back to INQUIRY

When the driver falls back to sending Inquiry, it will only attempt to configure LUNs 0-7. From the code:

/* if REPORT_LUNS failed, then either it's a SCSI-2 device
 * that doesn't understand the command, or it's a SCSI-3
 * device that only has one LUN and decided not to implement
 * REPORT_LUNS. In either case, we're safe just probing LUNs
 * 0-7 with INQUIRY, since SCSI-2 can't have more than 8 LUNs,
 * and SCSI-3 should do REPORT_LUNS if it has more than 1 LUN.
 */

Now I know the Celerra is capable of doing Report LUNs. Do you have any idea why it is failing on your system? This is a different scenario than the original report, where Report LUNs was working but was not returning the full list of LUNs.

Can you re-run the packet trace, so we can see why Report LUNs is failing? If not, I can give you another debug driver that will trace this area in more detail.

On the MDS, the Report LUNs command is working. I see 52 Clariion LUNs being configured, 29 on target 4 and 23 on target 5. Is this not correct?

Tom - I believe the REPORT_LUNS failure, with only the first 8 LUNs being reported for the Celerra, has to do with Bugzilla #130365. I don't have access to a Celerra though to confirm it's reporting in /proc. Sorin - would you please refer to Bugzilla #130365? Thanks,

As I said in Bugzilla #130365, the 2.4 Cisco iSCSI driver ignores the SCSI midlayer whitelist, so adding the Celerra to the whitelist will not make any difference. The code comment in #9 above is straight from the Cisco iSCSI driver, indicating that the driver will only scan LUNs 0-7. The midlayer scanning is ignored. In fact, the 2.4 SCSI midlayer does not issue the Report LUNs command at all. This is only done by the iSCSI driver.
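
The fallback described in the quoted driver comment can be illustrated with a small sketch. This is not the Cisco driver's code: the function name luns_to_probe and its parameters are hypothetical, chosen only to show the decision between using the REPORT_LUNS list and probing LUNs 0-7 with INQUIRY.

```c
#include <stddef.h>

/* SCSI-2 limit mentioned in the driver comment: LUNs 0-7. */
#define FALLBACK_MAX_LUNS 8

/*
 * Hypothetical helper (not from the driver): fill `out` with the LUN
 * numbers the initiator would go on to probe.
 *   report_luns_ok  nonzero if REPORT_LUNS succeeded
 *   reported        LUN numbers returned by REPORT_LUNS
 *   n_reported      number of entries in `reported`
 *   out, out_cap    destination array and its capacity
 * Returns the number of LUNs written to `out`.
 */
size_t luns_to_probe(int report_luns_ok,
                     const unsigned *reported, size_t n_reported,
                     unsigned *out, size_t out_cap)
{
    size_t n = 0;

    if (report_luns_ok) {
        /* Trust the list the target returned. */
        for (size_t i = 0; i < n_reported && n < out_cap; i++)
            out[n++] = reported[i];
    } else {
        /* REPORT_LUNS failed: probe only LUNs 0-7 with INQUIRY. */
        for (unsigned lun = 0; lun < FALLBACK_MAX_LUNS && n < out_cap; lun++)
            out[n++] = lun;
    }
    return n;
}
```

With this logic, any failure of REPORT_LUNS caps the scan at LUNs 0-7 regardless of how many LUNs the target exports, which matches the symptom in the log above.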
The critical question is why Report LUNs fails on Sorin's Celerra, and not on the one here at Red Hat.

Created attachment 110771
Debug data using Celerra iSCSI target
I repeated the tests after some discussions with our iSCSI team. We think we
understand the problem at least with Celerra. The attachment includes 3 files:
iscsi_dump.3 - the tcpdump file; to be viewed using ethereal
iscsi_messages.3 - the /var/log/messages for several cases separated by >>>>>
iscsi-ls.3.log - the output of the iscsi-ls -l
The problem is as follows:
If we configure 0-31 LUNs on Celerra, the initiator discovers all the LUNs.
When we configure 33 LUNs on Celerra, which is more than 32 (264 bytes
worth), the initiator falls back to SCSI-2 behavior and only reports 8 LUNs, 0-7.
If any more LUNs are added to the Celerra, the result is the same when we do a
restart.
If I remove the extra LUNs above 32 and do a reload, the initiator again sees
all of them.
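
The "264 bytes" figure follows from the standard REPORT LUNS parameter data layout: an 8-byte header (a 4-byte LUN list length plus 4 reserved bytes) followed by one 8-byte entry per LUN. A minimal sketch of that arithmetic, with hypothetical helper names that do not come from the driver:

```c
#include <stddef.h>

#define REPORT_LUNS_HEADER_BYTES 8 /* 4-byte LUN list length + 4 reserved */
#define REPORT_LUNS_ENTRY_BYTES  8 /* one 8-byte entry per LUN */

/* Bytes of REPORT LUNS parameter data needed to describe n_luns LUNs. */
size_t report_luns_bytes(size_t n_luns)
{
    return REPORT_LUNS_HEADER_BYTES + n_luns * REPORT_LUNS_ENTRY_BYTES;
}

/* Number of complete LUN entries that fit in a buffer of buf_bytes. */
size_t luns_that_fit(size_t buf_bytes)
{
    if (buf_bytes < REPORT_LUNS_HEADER_BYTES)
        return 0;
    return (buf_bytes - REPORT_LUNS_HEADER_BYTES) / REPORT_LUNS_ENTRY_BYTES;
}
```

For 32 LUNs this gives 8 + 32 * 8 = 264 bytes, and for 33 LUNs it gives 272 bytes, so the 33rd LUN is the first one that no longer fits in a 264-byte buffer.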
The log file represents the following sequence:
Create LUNs 0-7 10-33 on Celerra target
Execute on the Linux client: service iscsi restart
Result: all the LUNs are reported
Add LUN 34 on Celerra target
Execute on the Linux client: service iscsi restart
Result: only LUNs 0-7 are discovered if they exist; if not no LUNs are reported
(if there are only LUNs 3-5 for example it will report only 3-5)
Add LUNs 35-51 to Celerra
Execute on the Linux client: service iscsi restart
Result: only LUNs 0-7 are discovered
Remove LUNs 34-51 on Celerra
Execute on the Linux client: service iscsi reload
Result: All the 32 LUNs are discovered.
I hope this will clarify the situation. We do not believe this has anything to
do with Bugzilla #130365. It looks like it does, but it doesn't.
Created attachment 110772
Attachment as binary this time

I made a mistake with the attachment. This is the correct attached file: bug144293.3.tar.gz

Please repeat this test:
Add LUNs 35-51 to Celerra
Execute on the Linux client: service iscsi restart
with the debug iSCSI driver that I provided. You only had the debug driver loaded for the first test, when there were fewer than 33 LUNs configured. Please post the full output from /var/log/messages.

Created attachment 110844
Repeat of last test
I hope this time it is OK. The problem is that I have to rmmod and insmod before
each test; otherwise the debug info is not in the messages.
Sorin,

Would you please post the iscsi.conf file that was in use on the system when the most recent set of traces was captured?

Thank you,
Tom

Created attachment 111643
second debug iscsi driver, for 2.4.21-27.ELsmp
This driver adds more debug statements. I expect the log from running this
driver will finally indicate what is causing this problem. I have tried several
configurations on the Celerra in our lab, and it never fails, even with 60
LUNs. So I'll need your help again to test this.
Same procedure as the other debug driver I sent. If you have switched to a
different SMP kernel, you can still run this driver. Just ignore the warning
when it loads.
You only need to run one test, on a system with more than 32 LUNs that fails to
configure all the LUNs. Post the full /var/log/messages output. I don't need an
ethereal trace this time.
Thanks,
Tom
*** Bug 150483 has been marked as a duplicate of this bug. ***

The Report LUNs response from the MDS sets an overflow flag in the SCSI response. The driver treats this as an error, so Report LUNs fails and an inquiry of up to 8 LUNs is done. The first Report LUNs command is sent with a buffer sized to accommodate 32 LUNs. If the target has more than 32 LUNs, the buffer size is increased and the command is sent again, but in this case it fails because of the overflow flag. It's being fixed from the MDS side. As a workaround, you can ignore the overflow checking:

diff -Narup linux-iscsi/driver/iscsi.c linux-iscsi_new/driver/iscsi.c
--- linux-iscsi/driver/iscsi.c 2005-07-15 19:41:17.000000000 +0530
+++ linux-iscsi_new/driver/iscsi.c 2005-08-30 15:29:23.000000000 +0530
@@ -3470,8 +3470,9 @@ process_task_response(iscsi_session_t *
     } else if (stsrh->flags & ISCSI_FLAG_CMD_OVERFLOW) {
         ISCSI_TRACE(ISCSI_TRACE_RxOverflow, sc, task,
                     ntohl(stsrh->residual_count), expected);
-        sc->result = HOST_BYTE(DID_ERROR) | STATUS_BYTE(stsrh->cmd_status);
-        sc->resid = expected;
+        /* FIXME: Do not know how to report overflow to scsi ml. Ignoring
+         * overflow
+         */
     } else if (task->rxdata < expected) {
         /* All the read data did not arrive. This can
          * happen without an underflow indication from

Cisco, this is a reminder to indicate whether you need a fix similar to the one in comment 19, or whether your HW/firmware has been fixed. Thanks.

Changing Component to kernel. In the future please try to select iscsi-initiator-utils for userspace problems and kernel for driver problems. If there is a kernel and userspace change required then make two :(. iSCSI is a legacy Component from when it was all bundled in one rpm. Thanks.

Fixed in HW for both EMC and Cisco. Closing BZ.

Moving off proposed list.
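
The two-pass Report LUNs behavior described in the MDS analysis above (a first request sized for 32 LUNs, then a retry with a larger buffer) can be sketched as follows. The thread does not say how the driver computes the larger size, so this sketch assumes the initiator reads the LUN list length field from the first response header, which is the standard 4-byte big-endian count of LUN entry bytes. The function names are hypothetical and are not taken from linux-iscsi.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * LUN list length: first 4 bytes of REPORT LUNS parameter data,
 * big-endian, counting the bytes of LUN entries the target has
 * (not the bytes actually transferred).
 */
uint32_t lun_list_length(const uint8_t *resp)
{
    return ((uint32_t)resp[0] << 24) | ((uint32_t)resp[1] << 16) |
           ((uint32_t)resp[2] << 8)  |  (uint32_t)resp[3];
}

/*
 * Hypothetical retry decision: returns 0 if a buffer of buf_bytes held
 * the whole list, otherwise the buffer size needed for a second request.
 * Assumes `resp` contains at least the 8-byte header.
 */
size_t report_luns_retry_size(const uint8_t *resp, size_t buf_bytes)
{
    size_t needed = 8 + (size_t)lun_list_length(resp);

    return needed > buf_bytes ? needed : 0;
}
```

Under this assumption, a target with 52 LUNs answering a 264-byte request reports a list length of 416 bytes, so the retry would ask for 424 bytes. The first, truncated response is where a target may set the overflow flag, and per the analysis above the driver treated that flag as an error instead of proceeding to the retry.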