Description of problem:

rescan-scsi-bus.sh script is not working as expected on a RHEL5.4 machine. Following issues are seen:

1. It does not delete LUN0. LUN0 gets deleted only if it is the last lun to be deleted. The following code snippet is the reason for this:

    if [ "${#oldsearch}" = "${#newsearch}" ] ; then
        # Stale lun
        lunremove="$lunremove $lun"

   LUN0 never gets added to the list.

2. Sometimes it does not scan all the LUNs. For example:

    [root@lnx-200-175 ~]# rescan-scsi-bus.sh -r
    Host adapter 0 (mptsas) found.
    Host adapter 29 (iscsi_tcp) found.
    Host adapter 30 (iscsi_tcp) found.
    Host adapter 31 (iscsi_tcp) found.
    Host adapter 32 (iscsi_tcp) found.
    Scanning SCSI subsystem for new devices
     and remove devices that have disappeared
    Scanning host 0 channels 0 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
    Scanning for device 0 0 0 0 ...
    OLD: Host: scsi0 Channel: 00 Id: 00 Lun: 00
          Vendor: IBM-ESXS Model: VPA073C3-ETS10 N Rev: A49B
          Type: Direct-Access ANSI SCSI revision: 05
    Scanning for device 0 0 1 0 ...
    OLD: Host: scsi0 Channel: 00 Id: 01 Lun: 00
          Vendor: IBM-ESXS Model: VPA073C3-ETS10 N Rev: A49B
          Type: Direct-Access ANSI SCSI revision: 05
    Scanning host 29 channels 0 for SCSI target IDs 0 1 2 3 4 5 6 7, all LUNs
    Scanning for device 29 0 0 0 ...
    OLD: Host: scsi29 Channel: 00 Id: 00 Lun: 00
          Vendor: NETAPP Model: LUN Rev: 7310
          Type: Direct-Access ANSI SCSI revision: 04
    Scanning for device 29 0 0 2 ...
    NEW: Host: scsi29 Channel: 00 Id: 00 Lun: 02
          Vendor: NETAPP Model: LUN Rev: 7310
          Type: Direct-Access ANSI SCSI revision: 04
    sg3 changed: device 29 0 0 1 ...
    LU not available (PQual 1) 00 Id: 00 Lun: 01
    REM: Host: scsi29 Channel: 00 Id: 00 Lun: 01
          Rev: 7310
    DEL: Type: Direct-Access ANSI SCSI revision: 04
    Scanning host 30 channels 0 for SCSI target IDs 0 1 2 3 4 5 6 7, LUNs 2
    Scanning for device 30 0 0 2 ...
    NEW: Host: scsi30 Channel: 00 Id: 00 Lun: 02
          Vendor: NETAPP Model: LUN Rev: 7310
          Type: Direct-Access ANSI SCSI revision: 04
    Scanning host 31 channels 0 for SCSI target IDs 0 1 2 3 4 5 6 7, LUNs 2
    Scanning for device 31 0 0 2 ...
    NEW: Host: scsi31 Channel: 00 Id: 00 Lun: 02
          Vendor: NETAPP Model: LUN Rev: 7310
          Type: Direct-Access ANSI SCSI revision: 04
    Scanning host 32 channels 0 for SCSI target IDs 0 1 2 3 4 5 6 7, LUNs 2
    Scanning for device 32 0 0 2 ...
    NEW: Host: scsi32 Channel: 00 Id: 00 Lun: 02
          Vendor: NETAPP Model: LUN Rev: 7310
          Type: Direct-Access ANSI SCSI revision: 04
    4 new device(s) found.
    1 device(s) removed.

   Here, I added LUN2 and deleted LUN1. Instead of scanning all LUNs, the script scanned only LUN2 for many of the hosts. Hence all the entries of LUN1 could not get deleted. The reason for this behavior is the stale values present in lunsearch variable of doreportlun(). Changing lunsearch to lun_search solves the problem.

3. If first LUN to be mapped is not LUN0, it does not get detected. In doreportlun(), lun is initialized as 0 and it is tried to add LUN0. As LUN0 is not present, it fails and getluns() return null. Hence in absence of LUN0, it never tries to add other LUNs.

Version-Release number of selected component (if applicable):
sg3_utils-1.27-17.7

How reproducible:
Always

Steps to Reproduce:
1. map some luns to the host
2. unmap lun0
3. run rescan-scsi-bus.sh -r
4. the lun does not get deleted

Actual Results:
lun0 does not get deleted and the entry remain present in /proc/scsi/scsi and /sys/class/scsi_device
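The stale-value problem in item 2 can be reproduced in isolation. What follows is only a minimal sketch, not the script's actual code: describe_scan, report_luns_leaky and report_luns_scoped are hypothetical names, and the value "2" is made up. The only thing taken from the report is that a helper and its caller shared the variable name lunsearch, and that renaming the helper's copy to lun_search fixed it.

```shell
#!/bin/bash
# Hypothetical stand-ins; only the variable-sharing pattern comes from the report.

# Prints which LUNs the caller believes it was asked to scan.
describe_scan() {
    if [ -z "$lunsearch" ]; then
        echo "host $1: all LUNs"
    else
        echo "host $1: LUNs $lunsearch"
    fi
}

# Leaky variant: uses the caller's global name as scratch space.
report_luns_leaky() {
    lunsearch="2"            # made-up result of a LUN query
}

# Scoped variant: keeps its working list under a different, local name.
report_luns_scoped() {
    local lun_search="2"
}

lunsearch=""                 # empty means "scan all LUNs"
describe_scan 29
report_luns_leaky
describe_scan 30             # the leftover value now restricts this host

lunsearch=""
describe_scan 29
report_luns_scoped
describe_scan 30             # caller's setting is untouched
```

With the leaky variant the second host is reported as "LUNs 2", which is the same shape as the "Scanning host 30 ... LUNs 2" lines in the output above.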
Created attachment 349054 [details]
patch to handle LUN0
I have made some changes to the script which fix the first two issues.
(In reply to comment #0)
> Version-Release number of selected component (if applicable):
> sg3_utils-1.27-17.7

There is no such package version in RHEL or Fedora, but thanks for the report and patch, I will track the issue here and forward the patch to the upstream author.
My mistake.

    [root@lnx-200-175 ~]# rpm -qf /usr/bin/rescan-scsi-bus.sh
    sg3_utils-1.25-3.el5

Following is the version of the script:

    $Id: rescan-scsi-bus.sh-1.29,v 1.1 2009/03/12 11:03:19
From bug 427259: sg3_utils-1.26-2.fc10 has been submitted as an update for Fedora 10. http://admin.fedoraproject.org/updates/sg3_utils-1.26-2.fc10
I'm thinking this update should happen after the solution actually gets pulled in from Fedora first, no? Would the rescan script be pulled in for 5.4 or 5.5?
Created attachment 349362 [details]
modified version 1.29 from OpenSuSE package

Tanvi, I have extracted a modified version (1.29 + some fixes) of the rescan-scsi-bus.sh script from the OpenSUSE sg3_utils package. Could you please test it in your environment?
Some more issues seen with the rescan script:

1. On a FC host, the rescan script will detect new LUNs only when we issue a LIP. LUNs can be detected without issuing a LIP. The following patch will fix this issue.

    @@ -534,8 +535,8 @@
         # It's pointless to do a target scan on FC
         if test -n "$lipreset" ; then
           echo 1 > /sys/class/fc_host/host$host/issue_lip 2> /dev/null;
    -      echo "- - -" > /sys/class/scsi_host/host$host/scan 2> /dev/null;
         fi
    +    echo "- - -" > /sys/class/scsi_host/host$host/scan 2> /dev/null;
         channelsearch=""
         idsearch=""
       fi

2. When no LUNs are mapped and you run this script, it will print a lot of meaningless messages. The following change will fix this issue.

    @@ -230,6 +230,7 @@
       local tmpchan
       for dev in /sys/class/scsi_device/${host}:* ; do
    +    [ -d $dev ] || continue;
         hcil=${dev##*/}
         cil=${hcil#*:}
         chan=${cil%%:*}

   (The patches are w.r.t. the rescan script shipped in 5.4 alpha.)

3. To resize the scsi device on the host, we need to 'rescan' that device using the following command:

    echo 1 > /sys/class/scsi_device/<sd device>/device/rescan

   This part of the code lies in the "remove" section of the script. So this has to be moved from there, as resize and remove are different. So I think we should add one more option for resizing the devices and move the corresponding code.
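The reason the `[ -d $dev ] || continue` guard in the second patch helps: by default, when a glob matches nothing, the shell leaves the pattern in place as a literal string, so the loop body runs once with a non-existent path. A small sketch of that behaviour, using a temporary directory as a stand-in for /sys/class/scsi_device (list_devices is a hypothetical helper, not part of the script):

```shell
#!/bin/bash
# Prints the last path component of each directory matching BASE/HOST:*.
# BASE is a stand-in for /sys/class/scsi_device; nothing here touches sysfs.
list_devices() {
    local base="$1" host="$2" dev
    for dev in "$base"/"$host":*; do
        # With no match the pattern stays unexpanded, so skip non-directories.
        [ -d "$dev" ] || continue
        echo "${dev##*/}"
    done
}

tmp=$(mktemp -d)
mkdir "$tmp/29:0:0:0" "$tmp/29:0:0:2"
list_devices "$tmp" 29       # one line per existing entry
list_devices "$tmp" 30       # no entries, so nothing is printed
rm -rf "$tmp"
```

Without the guard, the second call would print the literal `30:*`, which is the kind of meaningless message described above.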
(In reply to comment #8)
> Created an attachment (id=349362) [details]
> modified version 1.29 from OpenSuSE package
>
> Tanvi, I have extracted a modified version (1.29 + some fixes) of the
> rescan-scsi-bus.sh script from the OpenSUSE sg3_utils package. Could you please
> test it in your environment?

I tested the above script. All three issues explained in the Description section are still present.
Increasing the severity.
Created attachment 350798 [details]
version 1.29 + OpenSUSE updates + NetApp updates

I have merged the updates from OpenSUSE and both of NetApp's, and prepared a new version of the script. IMHO it could solve the 3rd issue in comment #9 when the user uses the "--forcerescan" command line option.
It solved the 2nd issue described in the description section, but the script is still unable to delete LUN0. If we change lun_search="`getluns`" to lun_search=" `getluns`" (a space is inserted before `getluns`) in doreportlun(), LUN0 gets deleted properly.
(In reply to comment #18)
> It solved the 2nd issue described in the description section, but the script is
> still unable to delete LUN0.
> If we change lun_search="`getluns`" to lun_search=" `getluns`" (a space is
> inserted before `getluns`) in doreportlun(), LUN0 gets deleted properly.

Ah, it's my fault, when I was merging the changes I removed the space, because its purpose was unclear to me.
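One way a single leading space can decide whether a LUN is treated as stale, given that the description shows the script comparing string lengths (`${#oldsearch}` against `${#newsearch}`). This is a sketch under an assumption that is not confirmed anywhere in this report: that the new list is rebuilt by appending each surviving LUN with a space in front. is_stale is a hypothetical helper written for illustration.

```shell
#!/bin/bash
# Hypothetical helper: $1 is a LUN found in sysfs, $2 is the space-separated
# list of LUNs the target reported. Prints "stale" or "kept".
is_stale() {
    local lun="$1" oldsearch="$2" newsearch="" tmplun
    for tmplun in $oldsearch; do
        # Assumption: every surviving entry is re-added with a leading space.
        [ "$tmplun" -eq "$lun" ] || newsearch="$newsearch $tmplun"
    done
    # Equal lengths are read as "nothing was dropped", i.e. LUN not reported.
    if [ "${#oldsearch}" = "${#newsearch}" ]; then
        echo stale
    else
        echo kept
    fi
}

is_stale 0 "1 2"     # no leading space: lengths 3 and 4 differ
is_stale 0 " 1 2"    # leading space: lengths match
is_stale 0 ""        # empty list: lengths match
```

Under that assumption, an unmapped LUN0 is missed when the reported list has no leading space, is caught once the space is added, and is also caught when the list is empty, which would fit the observation that LUN0 was removed only when it was the last LUN.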
(In reply to comment #15)
> Created an attachment (id=350798) [details]
> version 1.29 + OpenSUSE updates + NetApp updates
>
> I have merged the updates from OpenSUSE and both of NetApp's, and prepared a
> new version of the script. IMHO it could solve the 3rd issue in comment #9 when
> the user uses the "--forcerescan" command line option.

The "--forcerescan" option also enables removing the devices. IMHO it would be better if the "rescan" option only rescanned the devices and did not remove them.

Thanks for incorporating the other fixes in comment 9.
For 5.4 I have committed the last version from the attachments. Please open a new bug to track the additional deficiencies so they can be solved in later releases.
Please test Snapshot 3 when it is released - this will contain the follow-on fixes.
I have tested it on snapshot3. All the issues except the 3rd issue in the description and the 3rd issue in comment #9 have been addressed. Thank you.
Tanvi - I think we'll have to document these known issues and defer fixing them until 5.5. Please open a new BZ with the last remaining issues to do for 5.5, and reference this BZ.
Hi Tanvi,

The issues you noted in comment 28 need to be documented such that customers will understand the deficiencies before they start trying to use the script. I will be attempting to update the 'online storage configuration guide' with information that the script is available, and what problems it currently has.

Issues:

iscsi:

I tried the version attached in this bugzilla with iscsi and found the script hung briefly and then didn't work when I removed a lun and tried to run rescan to see the lun get unconfigured. Have you tried using the script with iscsi at all?

My thought on preventing customers from seeing issues using rescan-scsi-bus.sh with iscsi would be to qualify use of rescan-scsi-bus.sh to be used only with Fibre Channel. If you have other experience with iscsi and rescan-scsi-bus.sh, and think that this should be documented differently, please let me know.

The 3rd issue in the description:

> 3. If first LUN to be mapped is not LUN0, it does not get detected. In
> doreportlun(),lun is initialized as 0 and it is tried to add LUN0. As LUN0 is
> not present, it fails and getluns() return null. Hence in absence of LUN0, it
> never tries to add other LUNs.

Looking at the help output of rescan-scsi-bus.sh, I see:

"--nooptscan: don't stop looking for LUNs is 0 is not found"

Can you see if this gets around the problem with lun0 not being mapped?

3rd issue in comment 9:

> 3. To resize the scsi device on the host,we need to 'rescan' that device using
> following command.
>
> echo 1 > /sys/class/scsi_device/<sd device>/device/rescan
>
> This part of the code lies in "remove" section of the script.
> So this has to be moved from here as resize and remove are different.
> So I think we should add one more option for resizing the devices and move the
> corresponding code.

Not sure exactly how I would characterize this for a customer trying to use this script. Maybe something like:

Due to a bug in the rescan-scsi-bus.sh script, the functionality to recognize a change in the size of a lun executes when the --remove option is used.

Is this enough to characterize the problem?

Rob
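For documentation purposes, the per-device rescan quoted from comment 9 can be shown on its own, independent of the --remove option. This is only a sketch: rescan_device is a hypothetical helper, the sysfs root is passed in as a parameter so the function can be tried against a scratch directory, and the device is addressed by its host:channel:target:lun name as it appears under /sys/class/scsi_device.

```shell
#!/bin/bash
# Hypothetical helper: asks the kernel to re-read one device's properties
# (including its size) by writing 1 to that device's rescan attribute.
# $1 is the sysfs mount point (normally /sys), $2 is the H:C:T:L address.
rescan_device() {
    local sysfs="$1" hctl="$2"
    local attr="$sysfs/class/scsi_device/$hctl/device/rescan"
    if [ ! -e "$attr" ]; then
        echo "no such device: $hctl" >&2
        return 1
    fi
    echo 1 > "$attr"
}

# Example (needs root on a real system; the address is illustrative):
#   rescan_device /sys 29:0:0:2
```

The write itself is the same `echo 1 > .../device/rescan` command quoted above; the wrapper only adds an existence check.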
(In reply to comment #30)
> Issues:
>
> iscsi:
>
> I tried the version attached in this bugzilla with iscsi and found the script
> hung briefly and then didn't work when I removed a lun and tried to run rescan
> to see the lun get unconfigured. Have you tried using the script with iscsi at
> all?

Yes, I used the script to add/delete iscsi devices and I did not see any hang. The devices get unconfigured when the -r option is used. Even if I don't use the -r option, I don't see any brief hang, but the devices do not get unconfigured either.

> My thought on preventing customers from seeing issues using rescan-scsi-bus.sh
> with iscsi would be to qualify use of rescan-scsi-bus.sh to be used only with
> Fibre Channel. If you have other experience with iscsi and rescan-scsi-bus.sh,
> and think that this should be documented differently, please let me know.

IMO, customers should not be prevented from using the script for iscsi devices.

> The 3rd issue in the description:
>
> > 3. If first LUN to be mapped is not LUN0, it does not get detected. In
> > doreportlun(),lun is initialized as 0 and it is tried to add LUN0. As LUN0 is
> > not present, it fails and getluns() return null. Hence in absence of LUN0, it
> > never tries to add other LUNs.
>
> Looking at the help output of rescan-scsi-bus.sh, I see:
>
> "--nooptscan: don't stop looking for LUNs is 0 is not found"
>
> Can you see if this gets around the problem with lun0 not being mapped?

I tried using the --nooptscan option; it did not help.

> 3rd issue in comment 9:
>
> > 3. To resize the scsi device on the host,we need to 'rescan' that device using
> > following command.
> >
> > echo 1 > /sys/class/scsi_device/<sd device>/device/rescan
> >
> > This part of the code lies in "remove" section of the script.
> > So this has to be moved from here as resize and remove are different.
> > So I think we should add one more option for resizing the devices and move the
> > corresponding code.
>
> Not sure exactly how I would characterize this for a customer trying to use
> this script. Maybe something like:
>
> Due to a bug in the rescan-scsi-bus.sh script, the functionality to recognize a
> change in the size of a lun executes when the --remove option is used.
>
> Is this enough to characterize the problem?

Yes, that should be OK.

Apart from the above open issues, there is one more issue. We need to scan twice when LUNs are mapped for the first time. During the first scan, only LUN0 gets added; the other LUNs get added during the second scan. This is a timing-related issue: if we add a sleep statement as follows in doreportlun(), the issue is not seen.

    #If not a single LUN is present then assign lun=0
    if [ -z $lun ]; then
      lun=0
      devnr="$host $channel $id $lun"
      echo "Scanning for device $devnr ..."
      printf "${yellow}OLD: $norm"
      testexist
      if test -z "$SCSISTR"; then
        # Device does not exist, try to add
        printf "\r${green}NEW: $norm"
        if test -e /sys/class/scsi_host/host${host}/scan; then
          echo "$channel $id $lun" > /sys/class/scsi_host/host${host}/scan 2> /dev/null
          sleep 1
        else

But it delays the entire rescanning process. IMO we can ignore the issue and have it documented.
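For reference, a fixed `sleep 1` is paid in full on every target, even when the device shows up almost immediately. One possible way to bound that cost, which is not something proposed or tested in this report, is to poll for the expected path with a short interval and an upper limit. wait_for_path is a hypothetical helper, and the sketch assumes a sleep implementation that accepts fractional seconds (GNU coreutils does).

```shell
#!/bin/bash
# Hypothetical helper: waits until path $1 exists, checking every 0.1 s and
# giving up after $2 checks. Returns 0 if the path appeared, 1 otherwise.
wait_for_path() {
    local path="$1" tries="$2"
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        sleep 0.1            # assumption: sleep accepts fractional seconds
        tries=$((tries - 1))
    done
    [ -e "$path" ]           # final check after the last wait
}

# Example: wait up to about one second for a device directory to appear
# (the address is illustrative):
#   wait_for_path /sys/class/scsi_device/29:0:0:1 10
```

Whether polling is worth the added complexity compared with simply documenting the second scan is a judgment call for the maintainers.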
See bz264001 comment 66 for recommendations on documenting sg3_utils:rescan-scsi-bus.sh and its current limitations.
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-1357.html