Bug 253989 - iSCSI Hangs on reboot
Summary: iSCSI Hangs on reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Mike Christie
QA Contact: Brock Organ
URL:
Whiteboard:
Depends On:
Blocks: 227737 372911 420521 422431 422441
 
Reported: 2007-08-23 13:35 UTC by Rob Kenna
Modified: 2008-05-21 14:53 UTC

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 14:53:34 UTC
Target Upstream Version:
Embargoed:


Attachments
Console screen image of hang (353.35 KB, image/png)
2007-08-23 13:41 UTC, Rob Kenna
hack halt script so iscsid is not killed too early (763 bytes, patch)
2007-08-30 23:18 UTC, Mike Christie


Links
System: Red Hat Product Errata
ID: RHBA-2008:0314
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages for Red Hat Enterprise Linux 5.2
Last Updated: 2008-05-20 18:43:34 UTC

Description Rob Kenna 2007-08-23 13:35:24 UTC
Description of problem:
Rebooting a node with a mounted iSCSI LUN hangs (see enclosed image of the failure)

Version-Release number of selected component (if applicable):
 2.6.18-42 (RHEL 5.1 beta, multiple versions)

How reproducible:
100%

Steps to Reproduce:
1. Mount iSCSI LUN
2. Reboot
Note: This is in a cluster configuration
  
Actual results:
Hangs with the following console output:

iscsi: can not unicast skb (-111)
iscsi: can not broadcast skb (-3)
connection6:0 iscsi: conn error  (1011)
Unmounting file systems: iscsi: can not unicast skb (-111)
...
...

Expected results:


Additional info:

Comment 1 Rob Kenna 2007-08-23 13:37:44 UTC
This is potentially a blocker.  We need to get this vetted ASAP.

Comment 2 Rob Kenna 2007-08-23 13:41:04 UTC
Created attachment 168592 [details]
Console screen image of hang

Comment 3 Mike Christie 2007-08-23 16:18:56 UTC
Some questions:

1. What file system was this mounted with? Was it GFS?
2. Did you mount the LUN by hand, or did you add it to /etc/fstab with the
_netdev option?
3. Is the iscsi init script properly installed? Run chkconfig --list iscsi to
make sure that the iscsi script is installed and on.
4. How are you setting up the network? Are you using NetworkManager, or did you
just set up networking during installation of the system and leave it
unchanged?

Comment 4 Rob Kenna 2007-08-23 18:22:30 UTC
Response to Comment #3

1) There are 3 nodes in a 5.1 cluster configuration.  The nodes share a GFS file
system (/guest_roots) which is iSCSI based from an EqualLogic array.

2) Mounted via /etc/fstab

3)[root@et-virt05 ~]# chkconfig --list iscsi
iscsi           0:off   1:off   2:off   3:on    4:on    5:on    6:off

4) These are server machines not using NetworkManager.  DHCP assigns the addresses.

Also, if I umount the GFS fs (the one on the iSCSI LUN) and then reboot, there
are no problems.  Could this be a race between the umount and stopping iscsi?


Comment 5 Mike Christie 2007-08-23 18:28:29 UTC
Could you post your fstab?

Comment 6 Rob Kenna 2007-08-23 18:30:36 UTC
/dev/VolGroup00/LogVol00 /                       ext3    defaults        1 1
LABEL=/boot             /boot                   ext3    defaults        1 2
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
sysfs                   /sys                    sysfs   defaults        0 0
/dev/VolGroup00/LogVol01 swap                    swap    defaults        0 0
/dev/guest_vg/guest_roots /guest_roots gfs2 defaults 0 0
covert:/cluster/mounts/nfs/sandbox /sandbox nfs defaults 0 0


Comment 7 Mike Christie 2007-08-23 22:16:45 UTC
Ok so I guess this is not going to be an easy one. I installed the snapshot from
the 22nd here and it works fine for me with netapp and IET and tgt.

What might be happening is this: /etc/init.d/halt kills all processes, including
iscsid, before the same script umounts the FS. If the EqualLogic box sends a nop
in that window, the nop times out (iscsid was killed by halt so it cannot
respond) and the target drops the session. We then get connection errors, and
because halt killed iscsid we cannot recover. We end up hanging because the
kernel is waiting for userspace to come back.

The targets I tested with do not send nops as pings so we would not hit the problem.

If we just add the disk you are mounting to the fstab with the _netdev option,
as in the mail I sent, then we do not have to worry about any of this since the
FS will be umounted at the right time. So I am not sure this is a regression. It
might have existed in 5.0, but the timing was not right. If it is the iscsi
ping/nop problem I think it is, then this was in 5.0.
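
For illustration, this is roughly what that would look like for the GFS2 entry
from the fstab posted in Comment 6. It is a sketch, not a tested configuration:
keeping "defaults" alongside _netdev is an assumption, and the other fields are
copied unchanged from the posted line.

```
/dev/guest_vg/guest_roots /guest_roots gfs2 defaults,_netdev 0 0
```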

Let me do some more checking to make sure it is the problem I think it is though.

Comment 10 Mike Christie 2007-08-30 23:18:26 UTC
Created attachment 182301 [details]
hack halt script so iscsid is not killed too early

I am also attaching my hacky workaround patch. It works, but it may have lots of
side effects, and we do not have time to test it out. Also, as a note to users
who find this bugzilla: just put your mounts in fstab with _netdev, or remember
to unmount your manual mounts.
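
A quick way to check whether an existing entry already carries the option is
sketched below. has_netdev is a hypothetical helper name, not part of any
package. It only looks at the options column of the fstab file it is given and
exits 0 when _netdev is listed for the given mount point; it does not know
whether the device really sits on iSCSI.

```shell
# Hypothetical helper (an assumption for illustration, not from any package).
# Usage: has_netdev FSTAB_FILE MOUNT_POINT
# Exit status 0 if the entry for MOUNT_POINT lists _netdev in its options,
# 1 if the entry is missing or does not list it.
has_netdev() {
    awk -v mp="$2" '
        # Skip comment lines; match on the mount point column (field 2).
        $1 !~ /^#/ && $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "_netdev") found = 1
        }
        END { exit !found }
    ' "$1"
}
```

For example, running: has_netdev /etc/fstab /guest_roots || echo "missing _netdev"
against the fstab as posted in Comment 6 would print the warning, since that
entry only lists "defaults".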

If you are doing iscsi root then you may still have this problem, but to work
around it you can turn nops off on your target. I think EqualLogic allows this,
but when using their multipath there may be side effects, so you may just want
to increase the timeout to allow for the time it takes to do shutdown. For
other targets like the DS300, you can probably just turn them off since the
initiator does nops itself and will figure things out.

Comment 12 Don Zickus 2008-01-10 20:42:29 UTC
in 2.6.18-66.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 15 errata-xmlrpc 2008-05-21 14:53:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


