Bug 507379 - [NetApp 5.4 bug] Issues with rescan-scsi-bus.sh script
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: sg3_utils
Version: 5.4
Hardware: All Linux
Priority: medium  Severity: high
Target Milestone: rc
Target Release: 5.4
Assigned To: Dan Horák
QA Contact: BaseOS QE
Keywords: OtherQA
Depends On: 427259
Blocks: 461680 538787
Reported: 2009-06-22 11:25 EDT by Tanvi
Modified: 2009-11-19 07:06 EST (History)
Doc Type: Bug Fix
Clone Of:
: 538787 (view as bug list)
Last Closed: 2009-09-02 07:23:39 EDT

Attachments
patch to handle LUN0 (1.57 KB, patch)
2009-06-23 04:10 EDT, Tanvi
modified version 1.29 from OpenSuSE package (15.50 KB, application/x-shellscript)
2009-06-25 05:19 EDT, Dan Horák
version 1.29 + OpenSUSE updates + NetApp updates (15.97 KB, application/x-shellscript)
2009-07-07 09:51 EDT, Dan Horák

Description Tanvi 2009-06-22 11:25:04 EDT
Description of problem:

The rescan-scsi-bus.sh script does not work as expected on a RHEL 5.4 machine. The following issues are seen:

1. It does not delete LUN0. LUN0 gets deleted only if it is the last LUN to be deleted. The following code snippet is the reason for this:

if [ "${#oldsearch}" = "${#newsearch}" ] ; then
        # Stale lun
        lunremove="$lunremove $lun"

LUN0 never gets added to the list.

2. Sometimes it does not scan all the LUNs. For example:

[root@lnx-200-175 ~]# rescan-scsi-bus.sh -r
Host adapter 0 (mptsas) found.
Host adapter 29 (iscsi_tcp) found.
Host adapter 30 (iscsi_tcp) found.
Host adapter 31 (iscsi_tcp) found.
Host adapter 32 (iscsi_tcp) found.
Scanning SCSI subsystem for new devices
 and remove devices that have disappeared
Scanning host 0 channels 0 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning for device 0 0 0 0 ...
OLD: Host: scsi0 Channel: 00 Id: 00 Lun: 00
      Vendor: IBM-ESXS Model: VPA073C3-ETS10 N Rev: A49B
      Type:   Direct-Access                    ANSI SCSI revision: 05
Scanning for device 0 0 1 0 ...
OLD: Host: scsi0 Channel: 00 Id: 01 Lun: 00
      Vendor: IBM-ESXS Model: VPA073C3-ETS10 N Rev: A49B
      Type:   Direct-Access                    ANSI SCSI revision: 05
Scanning host 29 channels 0 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning for device 29 0 0 0 ...
OLD: Host: scsi29 Channel: 00 Id: 00 Lun: 00
      Vendor: NETAPP   Model: LUN              Rev: 7310
      Type:   Direct-Access                    ANSI SCSI revision: 04
Scanning for device 29 0 0 2 ...
NEW: Host: scsi29 Channel: 00 Id: 00 Lun: 02
      Vendor: NETAPP   Model: LUN              Rev: 7310
      Type:   Direct-Access                    ANSI SCSI revision: 04
sg3 changed: device 29 0 0 1 ...
LU not available (PQual 1) 00 Id: 00 Lun: 01
REM: Host: scsi29 Channel: 00 Id: 00 Lun: 01   Rev: 7310
DEL:  Type:   Direct-Access                    ANSI SCSI revision: 04
Scanning host 30 channels 0 for  SCSI target IDs  0 1 2 3 4 5 6 7, LUNs  2
Scanning for device 30 0 0 2 ...
NEW: Host: scsi30 Channel: 00 Id: 00 Lun: 02
      Vendor: NETAPP   Model: LUN              Rev: 7310
      Type:   Direct-Access                    ANSI SCSI revision: 04
Scanning host 31 channels 0 for  SCSI target IDs  0 1 2 3 4 5 6 7, LUNs  2
Scanning for device 31 0 0 2 ...
NEW: Host: scsi31 Channel: 00 Id: 00 Lun: 02
      Vendor: NETAPP   Model: LUN              Rev: 7310
      Type:   Direct-Access                    ANSI SCSI revision: 04
Scanning host 32 channels 0 for  SCSI target IDs  0 1 2 3 4 5 6 7, LUNs  2
Scanning for device 32 0 0 2 ...
NEW: Host: scsi32 Channel: 00 Id: 00 Lun: 02
      Vendor: NETAPP   Model: LUN              Rev: 7310
      Type:   Direct-Access                    ANSI SCSI revision: 04
4 new device(s) found.
1 device(s) removed.

Here, I added LUN2 and deleted LUN1. Instead of scanning all LUNs, the script scanned only LUN2 on hosts 30, 31 and 32, so the LUN1 entries on those hosts were not deleted. The reason for this behavior is a stale value left in the lunsearch variable of doreportlun(). Renaming lunsearch to lun_search solves the problem.
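
The stale-value effect can be reproduced in isolation. The sketch below is not code from rescan-scsi-bus.sh: scan_host_leaky, scan_host_scoped and describe are hypothetical stand-ins, and it assumes (as the output above suggests) that an empty lunsearch means "all LUNs". It only shows how an assignment inside a shell function can leak into the caller when both use the same variable name.

```shell
#!/bin/sh
# Caller's search list; empty is taken here to mean "scan all LUNs".
lunsearch=""

# Hypothetical helper that reuses the caller's variable name.
# Shell variables are global unless declared local, so this
# assignment is still in effect when the next host is scanned.
scan_host_leaky() {
  lunsearch="2"
}

# Same helper with a distinct, function-local name (lun_search).
# The caller's lunsearch is left untouched.
scan_host_scoped() {
  local lun_search
  lun_search="2"
}

# Prints what the caller would scan for the next host.
describe() {
  if [ -z "$lunsearch" ]; then
    echo "all LUNs"
  else
    echo "LUNs $lunsearch"
  fi
}
```

Calling scan_host_leaky and then describe prints "LUNs 2", which mirrors the narrowed scan seen for hosts 30 to 32; after scan_host_scoped the caller still sees "all LUNs".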

3. If the first LUN to be mapped is not LUN0, it does not get detected. In doreportlun(), lun is initialized to 0 and the script tries to add LUN0. As LUN0 is not present, this fails and getluns() returns null. Hence, in the absence of LUN0, the script never tries to add the other LUNs.


Version-Release number of selected component (if applicable):
sg3_utils-1.27-17.7

How reproducible:
Always

Steps to Reproduce:
1. Map some LUNs to the host.
2. Unmap LUN0.
3. Run rescan-scsi-bus.sh -r.
4. The LUN does not get deleted.
Actual Results:
LUN0 does not get deleted and its entry remains present in /proc/scsi/scsi and /sys/class/scsi_device.
Comment 1 Tanvi 2009-06-23 04:10:42 EDT
Created attachment 349054 [details]
patch to handle LUN0
Comment 2 Tanvi 2009-06-23 04:12:51 EDT
I have made some changes to the script which fix the first two issues.
Comment 3 Dan Horák 2009-06-23 04:57:54 EDT
(In reply to comment #0)
> Version-Release number of selected component (if applicable):
> sg3_utils-1.27-17.7

There is no such package version in RHEL or Fedora, but thanks for the report and patch, I will track the issue here and forward the patch to the upstream author.
Comment 4 Tanvi 2009-06-23 05:10:20 EDT
My mistake.
[root@lnx-200-175 ~]# rpm -qf /usr/bin/rescan-scsi-bus.sh
sg3_utils-1.25-3.el5

Following is the version of the script
$Id: rescan-scsi-bus.sh-1.29,v 1.1 2009/03/12 11:03:19
Comment 5 Andrius Benokraitis 2009-06-23 23:23:57 EDT
From bug 427259:

sg3_utils-1.26-2.fc10 has been submitted as an update for Fedora 10.
http://admin.fedoraproject.org/updates/sg3_utils-1.26-2.fc10
Comment 6 Andrius Benokraitis 2009-06-23 23:54:54 EDT
I'm thinking this update should happen after the solution actually gets pulled in from Fedora first, no? Would the rescan script be pulled in for 5.4 or 5.5?
Comment 8 Dan Horák 2009-06-25 05:19:09 EDT
Created attachment 349362 [details]
modified version 1.29 from OpenSuSE package

Tanvi, I have extracted a modified version (1.29 + some fixes) of the rescan-scsi-bus.sh script from  OpenSUSE sg3_utils package. Could you, please, test it in your environment?
Comment 9 Naveen Reddy 2009-06-25 05:20:44 EDT
Some more issues seen with the rescan script:

1. On an FC host, the rescan script detects new LUNs only when a LIP is issued, although LUNs can be detected without issuing a LIP.
The following patch fixes this issue.

@@ -534,8 +535,8 @@
     # It's pointless to do a target scan on FC
     if test -n "$lipreset" ; then
       echo 1 > /sys/class/fc_host/host$host/issue_lip 2> /dev/null;
-      echo "- - -" > /sys/class/scsi_host/host$host/scan 2> /dev/null;
     fi
+    echo "- - -" > /sys/class/scsi_host/host$host/scan 2> /dev/null;
     channelsearch=""
     idsearch=""
   fi
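
As a standalone illustration of the control flow after this patch (not the script's actual function), the sketch below keeps the LIP optional and always requests the wildcard scan. scan_scsi_host and its parameters are hypothetical; the sysfs root is passed in as an argument only so that the logic can be tried against a scratch directory instead of /sys.

```shell
#!/bin/sh
# Hypothetical helper.
# $1 = sysfs root (normally /sys), $2 = host number,
# $3 = non-empty to issue a LIP first (FC hosts only).
scan_scsi_host() {
  local sysfs="$1" host="$2" lipreset="$3"
  local lip="$sysfs/class/fc_host/host$host/issue_lip"
  local scan="$sysfs/class/scsi_host/host$host/scan"

  # The LIP stays optional ...
  if [ -n "$lipreset" ] && [ -e "$lip" ]; then
    echo 1 > "$lip" 2> /dev/null
  fi

  # ... while the wildcard scan ("- - -" = all channels, targets
  # and LUNs) is requested whether or not a LIP was issued.
  [ -e "$scan" ] || return 1
  echo "- - -" > "$scan" 2> /dev/null
}
```

On a real system this would be called as scan_scsi_host /sys <host number> "", which writes to the host's scan attribute without touching issue_lip.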


2. When no LUNs are mapped, running this script prints a lot of meaningless messages. The following change fixes this issue.

@@ -230,6 +230,7 @@
   local tmpchan

   for dev in /sys/class/scsi_device/${host}:* ; do
+    [ -d $dev ] || continue;
     hcil=${dev##*/}
     cil=${hcil#*:}
     chan=${cil%%:*}

(The patches are w.r.t the rescan script shipped in 5.4 alpha.)
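
The reason the [ -d $dev ] guard helps: when a glob matches nothing, the shell leaves the pattern itself as the loop value, so the loop body runs once on a path that does not exist. A minimal sketch of the guarded loop follows; list_host_devices is a hypothetical helper, not a function of the script, and the sysfs root is a parameter only so that it can be exercised on a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: prints the H:C:I:L address of every SCSI device
# registered for one host.
# $1 = sysfs root (normally /sys), $2 = host number.
list_host_devices() {
  local sysfs="$1" host="$2" dev
  for dev in "$sysfs"/class/scsi_device/"$host":*; do
    # With no match the unexpanded pattern ends up in $dev,
    # so skip anything that is not a real directory.
    [ -d "$dev" ] || continue
    echo "${dev##*/}"
  done
}
```

For a host with no mapped LUNs the function prints nothing instead of processing the literal "host:*" pattern.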

3. To resize a SCSI device on the host, we need to 'rescan' that device using the following command.

	echo 1 > /sys/class/scsi_device/<sd device>/device/rescan

This part of the code lies in the "remove" section of the script, but resize and remove are different operations, so it should be moved out of there.
I think we should add one more option for resizing devices and move the corresponding code to it.
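
A sketch of what a rescan step separated from removal could look like is given below. It is only an illustration of the proposal: rescan_device is a hypothetical name, no such function or option is claimed to exist in the script, and the sysfs root is a parameter so the logic can be tried outside /sys.

```shell
#!/bin/sh
# Hypothetical helper: asks the kernel to re-read one device (for
# example after the LUN was resized on the target) without removing
# anything.
# $1 = sysfs root (normally /sys), $2 = H:C:I:L address, e.g. 29:0:0:2.
rescan_device() {
  local sysfs="$1" hcil="$2"
  local attr="$sysfs/class/scsi_device/$hcil/device/rescan"
  # Fail quietly if the device is not present.
  [ -e "$attr" ] || return 1
  echo 1 > "$attr"
}
```

Keeping this in its own function means a resize can be picked up without going through any of the removal logic.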
Comment 10 Tanvi 2009-06-25 06:06:59 EDT
(In reply to comment #8)
> Created an attachment (id=349362) [details]
> modified version 1.29 from OpenSuSE package
> 
> Tanvi, I have extracted a modified version (1.29 + some fixes) of the
> rescan-scsi-bus.sh script from  OpenSUSE sg3_utils package. Could you, please,
> test it in your environment?  

I tested the above script. All three issues explained in the Description section are still present.
Comment 12 Tanvi 2009-06-30 02:25:17 EDT
Increasing the severity.
Comment 15 Dan Horák 2009-07-07 09:51:55 EDT
Created attachment 350798 [details]
version 1.29 + OpenSUSE updates + NetApp updates

I have merged the updates from OpenSUSE and both sets of NetApp updates and prepared a new version of the script. IMHO it could solve the 3rd issue in comment #9 when the user uses the "--forcerescan" command line option.
Comment 18 Tanvi 2009-07-08 01:46:12 EDT
It solved the 2nd issue described in the description section, but the script is still unable to delete LUN0.
If we change lun_search="`getluns`" to lun_search=" `getluns`" (a space is inserted before `getluns`) in doreportlun(), LUN0 gets deleted properly.
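
Why a single space matters can be shown with a reconstruction of the length comparison quoted in the description. This is a sketch under assumptions, not the script's code: find_stale_luns is a hypothetical function, and it assumes the surrounding loop rebuilds newsearch with a space in front of every entry and carries it over as the next search list.

```shell
#!/bin/sh
# Hypothetical reconstruction of the stale-LUN check.
# $1 = LUNs reported by the target, $2 = LUNs known to the host,
# $3 = prefix for the initial search list ("" or " ").
# Prints the LUNs that would be queued for removal.
find_stale_luns() {
  local reported="$1" existing="$2" lead="$3"
  local lunsearch="${lead}${reported}"
  local lunremove="" lun tmplun oldsearch newsearch
  for lun in $existing; do
    oldsearch="$lunsearch"
    newsearch=""
    # Rebuild the list without $lun; every entry gets a leading space.
    for tmplun in $lunsearch; do
      if [ "$tmplun" -ne "$lun" ]; then
        newsearch="$newsearch $tmplun"
      fi
    done
    # Equal lengths are taken to mean "$lun was not in the list".
    if [ "${#oldsearch}" = "${#newsearch}" ]; then
      lunremove="$lunremove $lun"
    fi
    lunsearch="$newsearch"
  done
  echo "${lunremove# }"
}
```

Under these assumptions, find_stale_luns "1 2" "0 1 2" "" prints nothing: on the first pass the old list has no leading space while the rebuilt one does, so the lengths differ and LUN0 is not treated as stale. With " " as the third argument the same call prints 0. An empty reported list also yields 0, which is consistent with LUN0 being removed only when it is the last LUN.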
Comment 19 Dan Horák 2009-07-08 02:47:27 EDT
(In reply to comment #18)
> It solved 2nd issue described in the description section, but the script is
> still unable to delete LUN0. 
> If we change lun_search="`getluns`" to lun_search=" `getluns`" (a space is
> inserted before `getluns`) in doreportlun(), LUN0 gets deleted properly.  

Ah, it's my fault, when I was merging the changes I removed the space, because its purpose was unclear to me.
Comment 21 Naveen Reddy 2009-07-09 03:09:11 EDT
(In reply to comment #15)
> Created an attachment (id=350798) [details]
> version 1.29 + OpenSUSE updates + NetApp updates
> 
> I have merged the updates from OpenSUSE and both the NetApp's ones and prepared
> new version of the script. IMHO it could solve the 3rd issue in comment #9 when
> the user uses "--forcerescan" command line option.  

The "--forcerescan" option also enables removing the devices. IMHO it would be better if the "rescan" option only rescanned the devices and did not remove them.
Thanks for incorporating the other fixes in comment 9.
Comment 24 Dan Horák 2009-07-14 04:02:38 EDT
For 5.4 I have committed the last version from the attachments. Please open a new bug to track the additional deficiencies so they can be solved in the next releases.
Comment 27 Andrius Benokraitis 2009-07-17 09:08:14 EDT
Please test Snapshot 3 when it is released - this will contain the follow-on fixes.
Comment 28 Tanvi 2009-07-27 08:53:33 EDT
I have tested it on Snapshot 3. All the issues except the 3rd issue in the description and the 3rd issue in comment #9 have been addressed. Thank you.
Comment 29 Andrius Benokraitis 2009-07-27 11:18:49 EDT
Tanvi - I think we'll have to document these known issues and defer fixing them until 5.5. Please open a new BZ with the last remaining issues to do for 5.5, and reference this BZ.
Comment 30 Rob Evers 2009-07-28 11:18:55 EDT
Hi Tanvi,

The issues you noted in comment 28 need to be documented such that customers will understand the deficiencies before they start trying to use the script.

I will be attempting to update the 'online storage configuration guide' with information that the script is available, and what problems it currently has.

Issues:

iscsi:

I tried the version attached in this bugzilla with iscsi and found the script hung briefly and then didn't work when I removed a lun and tried to run rescan to see the lun get unconfigured.  Have you tried using the script with iscsi at all?

My thought on preventing customers from seeing issues using rescan-scsi-bus.sh with iscsi would be to qualify rescan-scsi-bus.sh for use only with Fibre Channel.  If you have other experience with iscsi and rescan-scsi-bus.sh, and think that this should be documented differently, please let me know.

The 3rd issue in the description:

> 3. If first LUN to be mapped is not LUN0, it does not get detected. In
> doreportlun(),lun is initialized as 0 and it is tried to add LUN0. As LUN0 is
> not present, it fails and getluns() return null. Hence in absence of LUN0, it
> never tries to add other LUNs.

Looking at the help output of rescan-scsi-bus.sh, I see:

 "--nooptscan:     don't stop looking for LUNs is 0 is not found"

Can you see if this gets around the problem with lun0 not being mapped?

3rd issue in comment 9:

> 3. To resize the scsi device on the host,we need to 'rescan' that device using
> following command.
>
>  echo 1 > /sys/class/scsi_device/<sd device>/device/rescan
>
> This part of the code lies in "remove" section of the script. 
> So this has to be moved from here as resize and remove are different.
> So I think we should add one more option for resizing the devices and move the
> corresponding code.

Not sure exactly how I would characterize this for a customer trying to use this script.  Maybe something like:

Due to a bug in the rescan-scsi-bus.sh script, the functionality to recognize a change in the size of a lun executes when the --remove option is used.

Is this enough to characterize the problem?

Rob
Comment 31 Tanvi 2009-07-29 02:14:38 EDT
(In reply to comment #30)
> Issues:
> 
> iscsi:
> 
> I tried the version attached in this bugzilla with iscsi and found the script
> hung briefly and then didn't work when I removed a lun and tried to run rescan
> to see the lun get unconfigured.  Have you tried using the script with iscsi at
> all?

Yes, I used the script to add/delete iscsi devices and I did not see any hang. The devices get unconfigured when -r option is used. Even if I don't use -r option, I don't see any brief hang, but devices do not get unconfigured either.

> My thought on preventing customers from seeing issues using rescan-scsi-bus.sh
> with iscsi would to qualify use of rescan-scsi-bus.sh to be used only with
> Fibre Channel.  If you have other experience with iscsi and rescan-scsi-bus.sh,
> and think that this should be documented differently, please let me know.
IMO, customers should not be prevented from using the script for iscsi devices.

> The 3rd issue in the description:
> 
> > 3. If first LUN to be mapped is not LUN0, it does not get detected. In
> > doreportlun(),lun is initialized as 0 and it is tried to add LUN0. As LUN0 is
> > not present, it fails and getluns() return null. Hence in absence of LUN0, it
> > never tries to add other LUNs.
> 
> Looking at the help output of rescan-scsi-bus.sh, I see:
> 
>  "--nooptscan:     don't stop looking for LUNs is 0 is not found"
> 
> Can you see if this gets around the problem with lun0 not being mapped?

I tried using --nooptscan option, it did not help.

> 3rd issue in comment 9:
> 
> > 3. To resize the scsi device on the host,we need to 'rescan' that device using
> > following command.
> >
> >  echo 1 > /sys/class/scsi_device/<sd device>/device/rescan
> >
> > This part of the code lies in "remove" section of the script. 
> > So this has to be moved from here as resize and remove are different.
> > So I think we should add one more option for resizing the devices and move the
> > corresponding code.
> 
> Not sure exactly how I would characterize this for a customer trying to use
> this script.  Maybe something like:
> 
> Due to a bug in the rescan-scsi-bus.sh script, the functionality to recognize a
> change in the size of a lun executes when the --remove option is used.
> 
> Is this enough to characterize the problem?
Yes, that should be OK.

Apart from the open issues above, there is one more issue. We need to scan twice when LUNs are mapped for the first time. During the first scan only LUN0 gets added; the other LUNs get added during the second scan. This is a timing-related issue: if we add a sleep statement to doreportlun() as follows, the issue is not seen.

 #If not a single LUN is present then assign lun=0
  if [ -z $lun ]; then
    lun=0
    devnr="$host $channel $id $lun"
    echo "Scanning for device $devnr ..."
    printf "${yellow}OLD: $norm"
    testexist
    if test -z "$SCSISTR"; then
      # Device does not exist, try to add
      printf "\r${green}NEW: $norm"
      if test -e /sys/class/scsi_host/host${host}/scan; then
        echo "$channel $id $lun" > /sys/class/scsi_host/host${host}/scan 2> /dev/null
        sleep 1
      else

However, it delays the entire rescanning process. IMO we can ignore the issue and have it documented.
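
One way to limit the delay, offered only as an untested idea and not something proposed in this bug, would be to swap the fixed sleep for a short polling loop that returns as soon as the expected sysfs entry shows up. wait_for_path and its parameters are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: waits until a path exists.
# $1 = path to wait for, $2 = maximum number of checks,
# $3 = seconds to sleep between checks.
# Returns 0 as soon as the path exists, 1 if it never appears.
wait_for_path() {
  local path="$1" tries="$2" interval="$3"
  while [ "$tries" -gt 0 ]; do
    if [ -e "$path" ]; then
      return 0
    fi
    sleep "$interval"
    tries=$((tries - 1))
  done
  return 1
}
```

For instance, wait_for_path /sys/class/scsi_device/29:0:0:0 5 1 would wait at most about five seconds and return immediately when the device is already there. Whether waiting on that entry is sufficient for the remaining LUNs to be reported is an assumption that would need to be checked on real hardware.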
Comment 32 Rob Evers 2009-07-29 13:21:18 EDT
see bz264001 comment 66 for info on recommendations about documenting sg3_utils:rescan-scsi-bus.sh and its current limitations.
Comment 34 errata-xmlrpc 2009-09-02 07:23:39 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1357.html
