143151 – iscsi software initiator in rhel3, u4, does not properly automount LUNs

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	iscsi-initiator-utils
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	AJ Lewis
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-12-16 21:30 UTC by Dave Wysochanski
Modified:	2008-04-07 12:35 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-05-20 03:25:51 UTC
Target Upstream Version:
Embargoed:

Attachments
iscsi shutdown errors (67.08 KB, image/jpeg) 2005-01-21 19:30 UTC, Josh Hildebrand	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2005:168	0	low	SHIPPED_LIVE	iscsi-initiator-utils bug fix	2005-05-19 04:00:00 UTC

Description Dave Wysochanski 2004-12-16 21:30:46 UTC

Description of problem:
The iscsi software initiator in rhel3, u4, does not properly automount
LUNs.  The reason for this is related to the bringup order of the
iscsi service in relation to other things, and the fact that the
mounting is not retried for a long enough period of time.  For
example, the iscsi readme file located in
/usr/share/doc/iscsi-initiator-utils-3.6.2/README states to use
the /etc/fstab file with _netdev option.  However, if the network
takes a little while to come up (as is the case with certain switch
ports becoming active), it's very likely that the iSCSI LUNs will not
be available at the time the netfs service is run.  Another thing
listed in the README is devlabel.  However, since devlabel is run from
rc.sysinit, this is before the iscsi driver gets loaded and as a
result, any labels that used to exist get removed.  Thus, you cannot
use these in the /etc/fstab file.
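
To make the _netdev mechanism concrete: on RHEL the netfs service is understood to act only on /etc/fstab entries whose option field contains _netdev, so anything it mounts must already exist as a device when netfs runs. The following is a minimal sketch, not the netfs script itself, that picks such entries out of an fstab-style file. The function name list_netdev_mounts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of netfs or iscsi-initiator-utils).
# Print the mount point of every fstab entry whose options include _netdev.
# The file is passed as an argument so the sketch can run on a sample file.
list_netdev_mounts() {
    # Skip comment lines and short lines; match _netdev as a whole option.
    awk '$1 !~ /^#/ && NF >= 4 && $4 ~ /(^|,)_netdev(,|$)/ { print $2 }' "$1"
}
```

Running list_netdev_mounts /etc/fstab shows which mounts depend on the iSCSI devices being present by the time netfs starts.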

Version-Release number of selected component (if applicable):
- iscsi-initiator-utils-3.6.2-1.4,
- iscsi kernel module v3.6.1

How reproducible:
I could not get automounting to work, no matter what I tried - every
method I tried failed.

Steps to Reproduce:
Follow the instructions in the README file to set up automounting of
iSCSI LUNs.  The recommended methods for achieving device naming
persistency do not work with automounting of LUNs.  These methods need
to be reviewed and perhaps adjustments need to be made in terms of
service loading, etc, for automounting of iSCSI LUNs.
  
Actual results:
Errors in the /var/log/messages file indicating iSCSI devices do not
exist at the time the mounting is attempted.

Expected results:
Either background the mount, or do something so that iSCSI LUNs are
eventually mounted automatically upon reboot, once these LUNs become
available.

Additional info:
As a result of this bug, upon reboot manual intervention seems to
always be needed in order to get iSCSI LUNs mounted.  This isn't the
end of the world, but is definitely something that should be fixed for
better iSCSI usability.

Comment 1 AJ Lewis 2004-12-22 19:58:49 UTC

Taking ownership of this one - planning on adding a method to the iscsi init
script that checks for session establishment with a user-configurable timeout.

Comment 2 AJ Lewis 2004-12-22 20:00:10 UTC

Note that with the equipment I have tested on, sessions are established before
the init scripts that depend on them run, so this issue is very dependent on
hardware.

Comment 3 Michael Dunlap 2005-01-10 15:29:33 UTC

I was using the linux-iscsi package from SourceForge before Red Hat
started including iscsi in the kernel, and they had a way of
automounting that I found quite acceptable.  The iscsi volumes would
be entered in /etc/fstab.iscsi and would be automounted when the iscsi
daemon started.  Would this be so difficult for RH to implement?  Is
there a reason it and the iscsi-mount and iscsi-mountall tools were
not put in the iscsi-initiator-utils package?

Comment 4 AJ Lewis 2005-01-10 16:51:20 UTC

The reason it was not implemented was because it is a non-standard
interface. It was yet another place users needed to add mount
information and none of the RHEL tools knew anything about it.  We
also ran into issues with the iscsi-(u)mountall scripts during
testing, IIRC.

Also, using /etc/fstab.iscsi does not solve the devlabel issue - it
only solves the netfs one - so you still don't have persistent disk
names without a volume manager.

I plan on fixing both the automount and devlabel issue by modifying
the iscsi initscript to see if the session has been established, and
once it has, (or a user-definable timeout has been reached,) reload
devlabel.

That being said, you certainly can grab the linux-iscsi tarball from
sourceforge and use the old initscript and iscsi-(u)mountall scripts -
it just will not be supported by Red Hat.
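
The wait-with-timeout idea described in this comment can be sketched as below. This is an assumption-laden illustration, not the shipped initscript: it uses the appearance of a device node as a stand-in for "session established", and the function name wait_for_path is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: wait_for_path PATH TIMEOUT
# Poll once per second until PATH exists or TIMEOUT seconds have elapsed.
# Returns 0 if the path appeared, 1 if the timeout was reached first.
wait_for_path() {
    path=$1
    timeout=$2
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

An init script could call something like wait_for_path /dev/sdc1 30 and only reload device labels or mount filesystems once it returns, continuing with the boot either way so that a missing target delays startup by at most the timeout.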

Comment 5 Dave Wysochanski 2005-01-11 13:35:59 UTC

Will your fix contain something to auto-umount iSCSI mounted 
filesystems when the driver is stopped?

This is another case that may warrant another bugzilla altogether.
Basically, when the driver is stopped, the script tries to unload
the module.  If there are existing iSCSI filesystems mounted, it'll
fail with something like this:
[root@dell2650-rtp8 /]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1             35001508   4634016  28589500  14% /
/dev/sda1               101089     25826     70044  27% /boot
none                   1027728         0   1027728   0% /dev/shm
/dev/sdc1             10321192     32828   9764080   1% /iscsi_mnt/large_luns/10-1
[root@dell2650-rtp8 /]# /etc/init.d/iscsi stop
Stopping iSCSI: iscsidiscsi_sfnet: Device or resource busy
Unable to remove iscsi kernel driver - devices may still be in use

It's probably unacceptable to just remove the module removal code,
since the user might reload the driver.

Let me know if you want me to file another bugzilla for this case,
as it might be somewhat involved.

Thanks.

Comment 6 Michael Dunlap 2005-01-11 14:39:35 UTC

Thanks for the info AJ - sounds like you have this issue well in hand.
I'll wait for your next rpm update, and until then make do by
mounting the iSCSI targets manually after reboot.

Comment 7 AJ Lewis 2005-01-11 14:40:54 UTC

In response to comment #5:
I wasn't planning on it, especially since the initscript doesn't mount
the devices.  I see it like any other scsi device driver - the fact
that we have a daemon running to monitor for multipath failure cases
shouldn't affect the driver - so if someone stops the daemon, they
shouldn't have to unmount the filesystem.  (and if they start the
daemon again, it should reconnect to the driver and go on like nothing
happened) The error/warning that is printed out is there to make sure
the user knows there's still an iscsi device in use.  It's up to them
to unmount it.  I'm especially opposed to using fuser to kill
processes running on the scsi device - which is what iscsi-umountall does.

The system shutdown path will umount iscsi devices before the iscsi
service is shut down, so the only time someone will see this is if
they are manually running 'service iscsi stop'.  In my opinion, it is
the responsibility of the person issuing the command to unmount iscsi
devices at that point.

As far as removing the module removal code, since the service will
load the driver if necessary, I don't see why we should remove it. 
But if the module is already loaded when the service script is run, no
error will be issued.
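
For anyone stopping the service by hand, a check along the following lines shows what is still mounted from a given disk before running 'service iscsi stop'. This is only a sketch: the function name mounts_on_device is hypothetical, and the device prefix has to be supplied by the user, since nothing here determines which sd devices are iSCSI-backed.

```shell
#!/bin/sh
# Hypothetical helper: mounts_on_device DEVICE_PREFIX [MOUNTS_FILE]
# Print the mount point of every mounted filesystem whose device name
# starts with DEVICE_PREFIX. Reads /proc/mounts unless another file is given.
# Note: a plain prefix match, so /dev/sdc also matches /dev/sdc1, /dev/sdc2.
mounts_on_device() {
    prefix=$1
    mounts_file=${2:-/proc/mounts}
    awk -v p="$prefix" 'index($1, p) == 1 { print $2 }' "$mounts_file"
}
```

For the df output shown in comment #5, mounts_on_device /dev/sdc would list /iscsi_mnt/large_luns/10-1, which is the filesystem to unmount before the driver can be unloaded.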

Comment 8 Dave Wysochanski 2005-01-11 14:52:38 UTC

Ok, that approach seems reasonable.

However, did you know there's a panic in the shutdown path if you
take this approach and someone tries to reboot the machine with
an iSCSI LUN mounted?


INIT: Stopping Red Hat Network Daemon: [  OK  ]
Stopping atd: [  OK  ]
Stopping cups: [  OK  ]
Shutting down xfs: [  OK  ]
Shutting down console mouse services: [  OK  ]
Stopping sshd:[  OK  ]
Stopping xinetd: [  OK  ]
Stopping crond: [  OK  ]
Stopping automount:umount2: Device or resource busy
umount: /u: device is busy
[FAILED]
Shutting down NIS services: [  OK  ]
Shutting down ntpd: [  OK  ]
Unmounting NFS filesystems:  [  OK  ]
Saving random seed:  [  OK  ]
Killing mdmonitor: [  OK  ]
Stopping NFS statd: [  OK  ]
Stopping portmapper: [  OK  ]
Shutting down kernel logger: [  OK  ]
Shutting down system logger: [  OK  ]
Stopping iscsi:  Stopping iSCSI: iscsidiscsi_sfnet: Device or resource busy
Unable to remove iscsi kernel driver - devices may still be in use

[FAILED]
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Shutting down audit subsystem[  OK  ]
Starting killall:  [  OK  ]
Sending all processes the TERM signal...
RPC: sendmsg returned error 101
portmap: RPC call returned error 101
Sending all processes the KILL siSCSI: session f7000000 has ended
quickly 1 times, login delay 1 seconds
iSCSI: tx thread 15601 received SIGKILL, killing rx thread 15602
NMI Watchdog detected LOCKUP on CPU3, eip c0138fdf, registers:
iscsi_sfnet nfs lockd sunrpc usbserial lp parport autofs4 audit e1000
tg3 floppy sg microcode keybdev mousedev hid input usb-ohci up
CPU:    3
EIP:    0060:[<c0138fdf>]    Not tainted
EFLAGS: 00000082

EIP is at __group_send_sig_info [kernel] 0x3ef (2.4.21-27.ELsmp/i686)
eax: c73af100   ebx: 00000286   ecx: 00000000   edx: f6ede000
esi: 00000012   edi: c7304000   ebp: f6edff40   esp: f6edfeec
ds: 0068   es: 0068   ss: 0068
Process killall5 (pid: 16369, stackpage=f6edf000)
Stack: 00000012 f6edff40 c7304000 c7304000 f6edff40 00000012 f6ede000 c01368bc
       00000012 f6edff40 c7304000 00000010 00000000 f6ede000 00003ff1 00003d02
       bfffcba8 c0137a87 00000012 f6edff40 ffffffff 00000012 00000000 00000000
Call Trace:   [<c01368bc>] kill_something_info [kernel] 0xcc (0xf6edff08)
[<c0137a87>] sys_kill [kernel] 0x57 (0xf6edff30)
[<c017822e>] vfs_readdir [kernel] 0xae (0xf6edff54)
[<c017d740>] dput [kernel] 0x30 (0xf6edff64)
[<c016531b>] __fput [kernel] 0xbb (0xf6edff78)
[<c01634be>] filp_close [kernel] 0x8e (0xf6edff94)
[<c0163566>] sys_close [kernel] 0x66 (0xf6edffb0)

Code: 80 b8 04 05 00 00 00 f3 90 7e f5 e9 23 d6 ff ff e8 bc 1f fd

console shuts up ...
N<MI4> N rMI:  W8,a  1de02te4 ctKBed)
OC NKMUI PWatchdog detected LOCKUP    L

Comment 9 AJ Lewis 2005-01-11 14:57:23 UTC

heh - nope, didn't know that - you mind bugging that separately with
me as the owner?

Comment 10 Dave Wysochanski 2005-01-11 15:11:40 UTC

Ok, filed bugzilla 144781.

Thanks.

Comment 11 Josh Hildebrand 2005-01-21 19:30:30 UTC

Created attachment 110068
iscsi shutdown errors

It stops for about 5 minutes before it finally reboots.

Comment 12 Josh Hildebrand 2005-01-21 19:31:25 UTC

I'm running:

iscsi-initiator-utils-3.6.2-4
kernel-smp-2.4.21-27.0.2.EL

I had to add a 30 second delay to the beginning of the iscsi init 
script under the "start" section to give the e1000 NIC time to start 
passing packets with my Cisco 3750G switch.

I'd recommend adding a user-configurable startup delay (along with 
the other TIMEOUT variables you already hard code in the top of the 
script) into a /etc/sysconfig/iscsi config file, so the changes 
aren't overwritten with each kernel update.

With the 30 second sleep, iscsi starts up properly and the system 
mounts my e2label'd partitions specified in /etc/fstab like so:
LABEL=/oracle/home      /oracle/home            ext3    defaults        1 2

However, I still have issues rebooting.  It fails to stop the iscsi 
services due to a device still in use.  No processes were running 
that would hold it locked, but for some reason the umount-all stuff 
doesn't seem to be unmounting one or all of the iscsi partitions.  
Then the system seems to hang for about 5 minutes before rebooting 
while it's sending KILL signals.  

See the JPG attachment.  It was the only way I could capture this 
event. ;(

Comment 13 AJ Lewis 2005-02-16 18:12:18 UTC

There is an i386 rpm at
http://people.redhat.com/alewis/rpms/iscsi-initiator-utils-3.6.2-6.i386.rpm that
includes a workaround for this bug - there is now an /etc/sysconfig/iscsi
configuration file that allows you to increase the timeouts at startup - you
should set ESTABLISHTIMEOUT to the maximum time it takes to establish the iscsi
sessions.

I still want to find a way to check on the session status, but I haven't found a
good way to code it yet.
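
As a rough illustration of how an override file of this kind usually works, an init script can set a built-in default and then source the configuration file if it exists. The sketch below assumes that pattern; it is not taken from the actual initscript. ESTABLISHTIMEOUT and /etc/sysconfig/iscsi are the names used in this report, the 15-second default is the value given in comment #25, and the function name load_iscsi_config is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: load_iscsi_config [CONFIG_FILE]
# Set a built-in default for ESTABLISHTIMEOUT, then let the configuration
# file (by default /etc/sysconfig/iscsi) override it if it is readable.
load_iscsi_config() {
    config_file=${1:-/etc/sysconfig/iscsi}
    ESTABLISHTIMEOUT=15
    if [ -r "$config_file" ]; then
        # The file is plain shell, e.g. a line such as ESTABLISHTIMEOUT=60
        . "$config_file"
    fi
}
```

Because the override lives outside the script, a value such as ESTABLISHTIMEOUT=60 does not have to be edited back into the script by hand after a package update.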

Comment 14 AJ Lewis 2005-02-16 18:13:06 UTC

Please note the rpm in comment #13 is unsupported.  This is a test rpm to verify
the workaround works.

Comment 15 Dave Wysochanski 2005-04-07 03:59:26 UTC

AJ, I tested this in RHEL4 U5 beta but I had to set non-default values to get it
to automount.  Is that your intent?  Could we set reasonable default values so
that it works for *most* cases?

Thanks.

Comment 16 Dave Wysochanski 2005-04-07 04:00:37 UTC

I set ESTABLISHTIMEOUT=60 for my tests - your mileage may vary.

Comment 17 AJ Lewis 2005-04-07 13:38:13 UTC

The timeouts that are default in the beta version are easily enough for my
setup.  How common do you think your router/timeout issue is going to be?  I
really don't want the iscsi initscripts holding up boot by 60+ seconds if it's
not necessary.

Comment 18 Dave Wysochanski 2005-04-07 15:09:15 UTC

Not sure what you mean by "router/timeout issue". Are you saying the defaults
mounted the luns in your setup?  How thoroughly did you test this to arrive at
the defaults - was it only one machine, one target, etc?  I'm just asking
because I've only tested on one machine so far (ibm x325) with one target and it
failed to mount every time until I set the ESTABLISHTIMEOUT.  I think I tried 15
and it didn't work, then tried 60.

I'm not suggesting the correct default should be 60.  But I think the fact that
the first machine I tried failed every time probably indicates the current
defaults are not adequate.  I can try to test a few more machines and recommend
a default if you want, or maybe you guys want to do that?

Thanks.

Comment 21 AJ Lewis 2005-04-07 16:12:09 UTC

The defaults work for me on a single machine connected to a single netapp
exporting 27 LUs to my initiator ID.  I have no other targets to test against,
so if you can do some checking and find a reasonable default, please do so.

Comment 22 Dave Wysochanski 2005-04-07 16:50:08 UTC

What's the architecture of the host?  Did you put it in a reboot loop or just
try it a few times?  Can you tell if it's *close* to failing or there's a lot of
margin?

It looks like ESTABLISHTIMEOUT=30 works on my ibm x325.

I will try another architecture machine.

I'm not sure what the right default is for *most* customers.  Do you think most
iSCSI customers will care that much about bootup times?  Tradeoff seems to be
whether bootup time is more important than automounting working for iSCSI customers.

Comment 23 AJ Lewis 2005-04-07 17:45:48 UTC

Dual 2.4 GHz Xeon system - so i686.  I think automounting working is more
important than bootup time, but we can't guarantee it'll work for everyone out
of the box with this method.  If we want to do that, we'll have to set it to
3600, and even then it might not catch all cases.

30 seconds sounds reasonable to me though, so unless you see a problem with that
on another system/setup you test, I'll plan on rolling another package with the
timeout cranked up to that by default.

Comment 25 AJ Lewis 2005-05-18 14:38:29 UTC

ESTABLISHTIMEOUT defaults didn't get changed in time for the RHEL3-U5 cutoff, so
the default is still 15 seconds.  The errata advisory mentions that this may
need to be modified in /etc/sysconfig/iscsi.  On the upside, the current default
worked fine for our testing.

Comment 26 Dennis Gregorovic 2005-05-20 03:25:51 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-168.html
