Red Hat Bugzilla – Bug 253989
iSCSI Hangs on reboot
Last modified: 2008-05-21 10:53:34 EDT
Description of problem:
Rebooting a node with a mounted iSCSI LUN hangs (see the attached console image of the failure).
Version-Release number of selected component (if applicable):
2.6.18-42 (RHEL 5.1 beta, multiple versions)
Steps to Reproduce:
1. Mount iSCSI LUN
Note: This is in a cluster configuration
iscsi: can not unicast skb (-111)
iscsi: can not broadcast skb (-3)
connection6:0 iscsi: conn error (1011)
Unmounting file systems: iscsi: can not unicast skb (-111)
This is potentially a blocker. We need to get this vetted ASAP.
Created attachment 168592
Console screen image of hang
1. What file system was this mounted with? Was it GFS?
2. Did you mount the lun by hand or did you enter it in your /etc/fstab with the
3. Is the iscsi init script properly installed? Run chkconfig --list iscsi, to
make sure that the iscsi script is installed and on.
4. How are you setting up the network? Are you using NetworkManager, or did you
just set up the networking during the installation of the system and have not
worried about it since?
Response to Comment #3
1) There are 3 nodes in a 5.1 cluster configuration. The nodes share a GFS file
system (/guest_roots), which is iSCSI based from an EqualLogic array.
2) Mounted via /etc/fstab
3)[root@et-virt05 ~]# chkconfig --list iscsi
iscsi 0:off 1:off 2:off 3:on 4:on 5:on 6:off
4) These are server machines not using NetworkManager. DHCP assigns the addresses.
Also, if I umount the GFS fs (the one on the iSCSI LUN) and then reboot, there
are no problems. Could this be a race between the umount and stopping iscsi?
Could you post your fstab?
/dev/VolGroup00/LogVol00 / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
devpts /dev/pts devpts gid=5,mode=620 0 0
tmpfs /dev/shm tmpfs defaults 0 0
proc /proc proc defaults 0 0
sysfs /sys sysfs defaults 0 0
/dev/VolGroup00/LogVol01 swap swap defaults 0 0
/dev/guest_vg/guest_roots /guest_roots gfs2 defaults 0 0
covert:/cluster/mounts/nfs/sandbox /sandbox nfs defaults 0 0
OK, so I guess this is not going to be an easy one. I installed the snapshot from
the 22nd here and it works fine for me with NetApp, IET, and tgt.
Here is what might be happening. Between the time iscsid is killed
(/etc/init.d/halt kills all processes) and the time the FS is unmounted by the
same halt script, the EqualLogic box sends a nop. That nop then times out
(iscsid was killed by halt, so it cannot respond) and the target drops the
session. We then get connection errors, and because halt killed iscsid we
cannot recover. We end up hanging because the kernel is waiting for userspace
to come back.
The targets I tested with do not send nops as pings so we would not hit the problem.
If we just add the disk you are mounting to the fstab with the _netdev option,
like in the mail I sent, then we do not have to worry about any of this, since
the FS will be unmounted at the right time. So I am not sure this is a
regression. It might have existed in 5.0, but the timing was not right. If it
is the iscsi ping/nop problem I think it is, then this was already in 5.0.
Let me do some more checking to make sure it is the problem I think it is though.
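For readers following along, here is a sketch of what that fstab change could look like for the reporter's GFS2 entry. It reuses the device and mount point from the fstab posted above and only adds the option; the mount option is spelled _netdev (with a leading underscore). Keeping "defaults" alongside it is an assumption, not something taken from the mail mentioned above.

```
# Before (from the posted fstab):
# /dev/guest_vg/guest_roots /guest_roots gfs2 defaults 0 0
# After: mark the file system as network-backed
/dev/guest_vg/guest_roots /guest_roots gfs2 defaults,_netdev 0 0
```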
Created attachment 182301
hack halt script so iscsid is not killed too early
I am also attaching my hacky workaround patch, which works but may have lots of
side effects, and we do not have time to test it out. Also, as a note to users
who find this bugzilla: just put your mounts in fstab with _netdev,
or remember to unmount your manual mounts.
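A rough way to spot entries that still need the option is to scan the fstab. The sketch below is illustrative only: check_netdev is a hypothetical helper name, and it assumes that every gfs or gfs2 entry in the file sits on iSCSI storage, which holds for the setup in this report but not in general.

```shell
# Sketch only: report mount points of gfs/gfs2 entries that lack _netdev.
# check_netdev is a hypothetical helper name, not part of any package.
check_netdev() {
    # $1: fstab-style file to scan (defaults to /etc/fstab)
    awk '
        $1 ~ /^#/ || NF < 4 { next }        # skip comments and malformed lines
        $3 == "gfs" || $3 == "gfs2" {
            has_netdev = 0
            n = split($4, opts, ",")        # options are comma-separated
            for (i = 1; i <= n; i++)
                if (opts[i] == "_netdev") has_netdev = 1
            if (!has_netdev) print $2       # print the mount point
        }
    ' "${1:-/etc/fstab}"
}
```

Run as "check_netdev /etc/fstab". Against the fstab posted earlier in this bug it would print /guest_roots, since that entry uses only "defaults".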
If you are doing iscsi root then you may still have this problem, but to work
around it you can turn nops off on your target. I think EqualLogic allows this,
but when using their multipath there may be side effects, so you may just want
to increase the timeout to allow for the time it takes to do shutdown. For
other targets like the DS300, you can probably just turn them off, since the
initiator does nops itself and will figure things out.
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.