Description of problem:

I am trying to connect the iSCSI initiator to a target that exposes a relatively large number of LUNs, specifically 52. The target was either a FC/iSCSI Cisco MDS 9000 switch or an iSCSI target on a Celerra NS704G NAS server connected to a CLARiiON CX700.

After configuring the target I run "$ service iscsi restart", which starts the login, authentication and initial handshake, and then tries to discover the LUNs using the Report LUNs command. After this I run "$ iscsi-ls -l" and find that the initiator has discovered only 33 LUNs. Repeating the restart gives the same result in most cases: missing LUNs. Sometimes, apparently at random, the initiator discovers all the LUNs. I also tried replacing the restart with a stop/start, with several seconds of delay between the two commands, with the same results.

After many trials I took network traces using tcpdump as follows:

  $ tcpdump -i eth1 -w iscsi_LUN_Report_BUG host testMDS1 -s 512 -c 10000

Attached you will find a text version of the iSCSI packets collected during the test. The dump shows that after the Login, NOOP In and NOOP Out commands, the iSCSI initiator (172.24.107.198) issues a Report LUNs command. In the SCSI section of the iSCSI command, the size of the report expected by the initiator (which has no knowledge of the total size of the LUN list) is set to 264 bytes (probably the default); see TCP Frame 36. The target replies with a "Data In" command in Frame 38. In the reply the LUN list size is reported as 416 bytes (52 LUNs x 8 bytes/LUN = 416 bytes). Of course, as the initiator allocated only 264 bytes, the list is truncated and includes only 33 LUNs. As a result the initiator issues a new Report LUNs command, but the allocation size is again 264 bytes. To the best of my understanding, the initiator should in this case allocate 416 bytes or more, since it knows the size of the LUN list from the first reply. As a result the list is of course truncated in the same way.

In some rare cases all the LUNs are discovered, as shown in file iscsi_LUN_Report_Correct, in Frames 48 and 50 (first request/reply, 264 bytes) and Frames 53 and 55 (request/reply with size 424 bytes). In this case the allocated buffer size of 424 is greater than the 416 reported by the target, and all 52 LUNs are discovered.

In summary, the problem appears to be in the SCSI part of the iSCSI protocol implementation. It has nothing to do with network latency.

Version-Release number of selected component (if applicable):

This is the iSCSI initiator included in the latest RH distribution: RHEL3-U4-re1216.0-i386-AS.

How reproducible:

98% of the cases after a "service iscsi restart" command is issued.

Steps to Reproduce:
1. Configure the iSCSI initiator (/etc/iscsi.conf and /etc/initiatorname.iscsi), and grant the initiator access to the iSCSI target and to N > 33 LUNs.
2. Run: "service iscsi restart"
3. Run: "iscsi-ls -l" and check whether the number of LUNs equals N.

Actual results:

In most attempts you will probably see M LUNs, with M = 32 < N.

Expected results:

You should count exactly N LUNs.

Additional info:

As a workaround, after the "service iscsi restart" run "service iscsi reload". In this case the initiator will discover all the LUNs.
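The buffer sizes quoted in the report can be checked with a little arithmetic. The sketch below assumes the standard REPORT LUNS parameter data layout: an 8-byte header (a 4-byte big-endian LUN list length followed by 4 reserved bytes), then one 8-byte entry per LUN. The function names and the simplified LUN encoding in `build_response` are hypothetical, chosen only for illustration; this is not the driver's code.

```python
import struct

HEADER_LEN = 8  # 4-byte LUN list length + 4 reserved bytes
ENTRY_LEN = 8   # each LUN entry occupies 8 bytes


def alloc_len_for(num_luns):
    """Allocation length needed to receive the complete list for num_luns LUNs."""
    return HEADER_LEN + ENTRY_LEN * num_luns


def luns_that_fit(alloc_len):
    """Number of complete LUN entries that fit into a buffer of alloc_len bytes."""
    return max(0, (alloc_len - HEADER_LEN) // ENTRY_LEN)


def build_response(num_luns, alloc_len):
    """Hypothetical target reply, truncated to the initiator's allocation length.

    Simplified encoding (an assumption): the LUN number goes into the
    second byte of each entry and the remaining bytes are zero.
    """
    header = struct.pack(">I4x", ENTRY_LEN * num_luns)
    entries = b"".join(bytes([0, lun]) + bytes(6) for lun in range(num_luns))
    return (header + entries)[:alloc_len]


def parse_report_luns(buf):
    """Return (LUNs the target says it has, LUN entries actually received)."""
    (list_len,) = struct.unpack_from(">I", buf, 0)
    total = list_len // ENTRY_LEN
    received = min(total, (len(buf) - HEADER_LEN) // ENTRY_LEN)
    return total, received


def next_alloc_len(buf):
    """Allocation length a second REPORT LUNS would need, from the first reply."""
    (list_len,) = struct.unpack_from(">I", buf, 0)
    return HEADER_LEN + list_len


first_reply = build_response(52, 264)
print(parse_report_luns(first_reply), next_alloc_len(first_reply))
```

Under these assumptions a 264-byte buffer holds 32 entries after the header, which matches the M = 32 under "Actual results", and the second request needs 416 + 8 = 424 bytes, which matches the allocation size seen in the correct trace.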
Created attachment 109384 [details] tcpdump file in text format including the correct case
Created attachment 109385 [details] tcpdump in text format with the erroneous iscsi initiator behavior
Sorin,

I have tested with over 60 LUNs and I cannot reproduce this problem. I have attached a debug version of the driver. It will print lots of stuff in /var/log/messages. Hopefully it will uncover the problem on your system. This driver is built for 2.4.21-27.ELsmp.

Please do the following:

  gunzip iscsi_sfnet.o.gz
  service iscsi stop
  rmmod iscsi_sfnet
  insmod iscsi_sfnet.o
  service iscsi start

Check to see if all the LUNs are configured. You can re-test with "service iscsi restart" if needed. Post the results.

Tom
Created attachment 109743 [details] debug driver
Created attachment 110229 [details] This includes 2 files: iscsi-ls.1 and iscsi_messages.1
After several additional tests using the proposed debugging tools, I hope I now understand in which cases the error shows up.

First I tried the same setup as Tom, using a Celerra target with 40 LUNs, and I was not able to reproduce the problem. Then I used iscsi-ls -l to check that all the LUNs were discovered, and I realized that the Cisco iSCSI switch behaves differently from the Celerra. When connecting to the Cisco MDS switch, iscsi-ls -l shows that there are 2 targets (see the attached file), while the Celerra had only one. The missing LUNs were on one or the other of the targets. In the attached log file iscsi-ls.1 there were supposed to be 52 LUNs (26 on each target) and the log shows that there are only 27 on both targets.

Also attached are the /var/log/messages sections for different tests:
1. The sequence: service iscsi stop, rmmod iscsi_sfnet, insmod iscsi_sfnet.o, service iscsi start
2. service iscsi restart
3. service iscsi reload

These are marked in the second file, iscsi_messages.1, which only captures the iSCSI commands and debug information.

This is where I am today. I am planning to simulate the MDS behavior on the Celerra to see if I can reproduce the problem without using the Cisco MDS iSCSI bridge/switch.
Sorin,

The attached log was not generated by the debug version of the iSCSI driver that I posted. Please retry your test with the debug driver. It should provide the information we need.

Also, the attached log does not show any LUNs being configured at all. I thought that about half the LUNs were working okay. Is this so? Any idea why they don't show up in the log?

Tom
Created attachment 110521 [details] New Debug data using iscsi_sfnet.o patch Debug data
On the Celerra, the driver is configuring LUNs 0-7. This is because the Report LUNs command is failing. From the log:

  Feb 1 13:56:50 localhost kernel: iSCSI: session c2384000 REPORT_LUNS failed, falling back to INQUIRY

When the driver falls back to sending Inquiry, it will only attempt to configure LUNs 0-7. From the code:

  /* if REPORT_LUNS failed, then either it's a SCSI-2 device
   * that doesn't understand the command, or it's a SCSI-3
   * device that only has one LUN and decided not to implement
   * REPORT_LUNS. In either case, we're safe just probing LUNs
   * 0-7 with INQUIRY, since SCSI-2 can't have more than 8 LUNs,
   * and SCSI-3 should do REPORT_LUNS if it has more than 1 LUN.
   */

Now I know the Celerra is capable of doing Report LUNs. Do you have any idea why it is failing on your system? This is a different scenario from the original report, where Report LUNs was working but was not getting the full list of LUNs. Can you re-run the packet trace, so we can see why Report LUNs is failing? If not, I can give you another debug driver that will trace this area in more detail.

On the MDS, the Report LUNs command is working. I see 52 Clariion LUNs being configured, 29 on target 4 and 23 on target 5. Is this not correct?
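The fallback described in that code comment can be sketched as follows. This is a minimal illustration, not the driver's implementation: `discover_luns`, `report_luns` and `inquiry` are hypothetical names, and a failed REPORT LUNS is modelled simply as a `None` return value.

```python
def discover_luns(report_luns, inquiry, max_fallback_lun=7):
    """Return the list of LUNs found on one target.

    report_luns: callable returning a list of LUN numbers, or None when
                 the REPORT LUNS command failed (hypothetical interface).
    inquiry:     callable taking a LUN number and returning True when an
                 INQUIRY to that LUN finds a device (hypothetical interface).
    """
    luns = report_luns()
    if luns is not None:
        return list(luns)
    # REPORT LUNS failed: probe only LUNs 0-7 with INQUIRY, as the
    # code comment quoted above describes.
    return [lun for lun in range(max_fallback_lun + 1) if inquiry(lun)]


# Example: 33 LUNs are configured on the target, but REPORT LUNS fails.
configured = set(range(0, 8)) | set(range(10, 35))
found = discover_luns(lambda: None, lambda lun: lun in configured)
print(found)
```

With this model, any LUN numbered 8 or higher is invisible once REPORT LUNS fails, regardless of how many LUNs the target actually exports.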
Tom - I believe the REPORT_LUNs failing and only the first 8 LUNs being reported for the Celerra has to do with Bugzilla #130365. I don't have access to a Celerra though to confirm it's reporting in /proc. Sorin - would you please refer to Bugzilla #130365? Thanks,
As I said in Bugzilla #130365, the 2.4 Cisco iSCSI driver ignores the SCSI midlayer whitelist. So adding Celerra to the whitelist will not make any difference. The code comment in #9 above is straight from the Cisco iSCSI driver, indicating that the driver will only scan LUNs 0-7. The midlayer scanning is ignored. In fact, the 2.4 SCSI midlayer does not issue the Report LUNs command at all. This is only done by the iSCSI driver. The critical question is why Report LUNs fails on Sorin's Celerra, and not on the one here at Red Hat.
Created attachment 110771 [details] Debug data using Celerra iSCSI target

I repeated the tests after some discussions with our iSCSI team. We think we understand the problem, at least with the Celerra. The attachment includes 3 files:
- iscsi_dump.3: the tcpdump file, to be viewed using ethereal
- iscsi_messages.3: the /var/log/messages for several cases, separated by >>>>>
- iscsi-ls.3.log: the output of iscsi-ls -l

The problem is as follows:
- If we configure up to 32 LUNs on the Celerra, the initiator discovers all the LUNs.
- When we configure 33 LUNs on the Celerra, which is more than 32 (264 bytes worth), the initiator falls back to SCSI-2 behavior and only reports the 8 LUNs 0-7.
- If any more LUNs are added to the Celerra, the result is the same when we do a restart.
- If I remove the extra LUNs above 32 and do a reload, the initiator sees all of them again.

The log file represents the following sequence:
1. Create LUNs 0-7 and 10-33 on the Celerra target. Execute on the Linux client: service iscsi restart. Result: all the LUNs are reported.
2. Add LUN 34 on the Celerra target. Execute on the Linux client: service iscsi restart. Result: only LUNs 0-7 are discovered, if they exist; if not, no LUNs are reported (if there are only LUNs 3-5, for example, it will report only 3-5).
3. Add LUNs 35-51 to the Celerra. Execute on the Linux client: service iscsi restart. Result: only LUNs 0-7 are discovered.
4. Remove LUNs 34-51 on the Celerra. Execute on the Linux client: service iscsi reload. Result: all 32 LUNs are discovered.

I hope this will clarify the situation. We do not believe this has anything to do with Bugzilla #130365. It looks related, but it is not.
Created attachment 110772 [details] Attachment as binary this time

I made a mistake with the attachment. This is the correct attached file: bug144293.3.tar.gz
Please repeat this test:

  Add LUNs 35-51 to Celerra
  Execute on the Linux client: service iscsi restart

with the debug iSCSI driver that I provided. You only had the debug driver loaded for the first test, when there were fewer than 33 LUNs configured. Please post the full output from /var/log/messages.
Created attachment 110844 [details] Repeat of last test I hope this time is OK. The problem is that I have to rmmod and insmod before each test otherwise the debug info is not in the messages.
Sorin,

Would you please post the iscsi.conf file that was in use on the system when the most recent set of traces was captured?

Thank you,
Tom
Created attachment 111643 [details] second debug iscsi driver, for 2.4.21-27.ELsmp

This driver adds more debug statements. I expect the log from running this driver will finally indicate what is causing this problem. I have tried several configurations on the Celerra in our lab, and it never fails, even with 60 LUNs. So I'll need your help again to test this.

Same procedure as for the other debug driver I sent. If you have switched to a different SMP kernel, you can still run this driver. Just ignore the warning when it loads.

You only need to run one test, on a system with more than 32 LUNs that fails to configure all the LUNs. Post the full /var/log/messages output. I don't need an ethereal trace this time.

Thanks,
Tom
*** Bug 150483 has been marked as a duplicate of this bug. ***
The Report LUNs response from the MDS sets an overflow flag in the SCSI response. The driver treats this as an error, so Report LUNs fails and an Inquiry of up to 8 LUNs is done instead. The first Report LUNs command is sent with a buffer sized to accommodate 32 LUNs. If the target has more than 32 LUNs, the size of the buffer is increased and the command is sent again, but in this case it fails because of the overflow flag. This is being fixed on the MDS side. As a workaround, you can ignore the overflow checking:

diff -Narup linux-iscsi/driver/iscsi.c linux-iscsi_new/driver/iscsi.c
--- linux-iscsi/driver/iscsi.c	2005-07-15 19:41:17.000000000 +0530
+++ linux-iscsi_new/driver/iscsi.c	2005-08-30 15:29:23.000000000 +0530
@@ -3470,8 +3470,9 @@ process_task_response(iscsi_session_t *
 	} else if (stsrh->flags & ISCSI_FLAG_CMD_OVERFLOW) {
 		ISCSI_TRACE(ISCSI_TRACE_RxOverflow, sc, task,
 			    ntohl(stsrh->residual_count), expected);
-		sc->result = HOST_BYTE(DID_ERROR) | STATUS_BYTE(stsrh->cmd_status);
-		sc->resid = expected;
+		/* FIXME: Do not know how to report overflow to scsi ml. Ignoring
+		 * overflow
+		 */
 	} else if (task->rxdata < expected) {
 		/* All the read data did not arrive. This can
 		 * happen without an underflow indication from
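The effect of the overflow check on the two-pass Report LUNs scan can be modelled in a few lines. This is a rough sketch under stated assumptions, not the driver's code: the target is assumed to set the overflow flag whenever the full list does not fit in the allocation, the flag value 0x04 follows the residual-overflow bit in the iSCSI SCSI Response flags, and `target_reply` and `scan` are hypothetical names.

```python
HEADER_LEN = 8        # REPORT LUNS header: list length + reserved bytes
ENTRY_LEN = 8         # one 8-byte entry per LUN
FLAG_OVERFLOW = 0x04  # residual overflow bit (assumed value, see lead-in)


def target_reply(num_luns, alloc_len):
    """Hypothetical target: return (flags, lun_list_length, bytes_sent)."""
    full_len = HEADER_LEN + ENTRY_LEN * num_luns
    flags = FLAG_OVERFLOW if full_len > alloc_len else 0
    return flags, ENTRY_LEN * num_luns, min(full_len, alloc_len)


def scan(num_luns, ignore_overflow, first_alloc=264):
    """Model the scan: a first pass sized for 32 LUNs, then one resized pass."""
    alloc = first_alloc
    for _ in range(2):
        flags, list_len, _sent = target_reply(num_luns, alloc)
        if flags & FLAG_OVERFLOW and not ignore_overflow:
            # Overflow treated as an error: REPORT LUNS is considered failed.
            return "fallback to INQUIRY (LUNs 0-7)"
        needed = HEADER_LEN + list_len
        if needed <= alloc:
            return "REPORT LUNS ok: %d LUNs" % (list_len // ENTRY_LEN)
        alloc = needed  # retry with a buffer large enough for the whole list
    return "fallback to INQUIRY (LUNs 0-7)"


for n in (32, 33, 52):
    print(n, scan(n, ignore_overflow=False), "|", scan(n, ignore_overflow=True))
```

In this model, 32 LUNs succeed either way, while 33 or more only succeed when the overflow indication on the first, undersized pass is tolerated. That is consistent with the 32-LUN boundary reported earlier in this bug.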
Cisco, this is a reminder: please indicate whether you need a fix similar to the one in comment 19, or whether your HW/firmware has been fixed. Thanks.
Changing Component to kernel. In the future please try to select iscsi-initiator-utils for userspace problems and kernel for driver problems. If there is a kernel and userspace change required then make two :(. iSCSI is legacy Component from when it was all bundled in one rpm. Thanks.
fixed in HW for both EMC and Cisco. Closing BZ.
Moving off proposed list.