Description of problem:
Rebooting a node with a mounted iSCSI LUN hangs (see enclosed image of the failure).

Version-Release number of selected component (if applicable):
2.6.18-42 (RHEL 5.1 beta, multiple versions)

How reproducible:
100%

Steps to Reproduce:
1. Mount iSCSI LUN
2. Reboot

Note: This is in a cluster configuration.

Actual results:
Hangs with

iscsi: can not unicast skb (-111)
iscsi: can not broadcast skb (-3)
connection6:0 iscsi: conn error (1011)
Unmounting file systems:
iscsi: can not unicast skb (-111)
...
...

Expected results:

Additional info:
This is potentially a blocker. We need to get this vetted ASAP.
Created attachment 168592 [details] Console screen image of hang
Some questions:

1. What file system was this mounted with? Was it GFS?
2. Did you mount the LUN by hand, or did you enter it in your /etc/fstab with the _netdev option?
3. Is the iscsi init script properly installed? Run chkconfig --list iscsi to make sure that the iscsi script is installed and on.
4. How are you setting up the network? Are you using NetworkManager, or did you just set up the networking during the installation of the system and not worry about it since?
Response to Comment #3

1) There are 3 nodes in a 5.1 cluster configuration. The nodes share a GFS file system (/guest_roots), which is iSCSI-based from an EqualLogic array.
2) Mounted via /etc/fstab
3) [root@et-virt05 ~]# chkconfig --list iscsi
iscsi 0:off 1:off 2:off 3:on 4:on 5:on 6:off
4) These are server machines not using NetworkManager. DHCP assigns the addresses.

Also, if I umount the GFS fs (the one on the iSCSI LUN) and then reboot, there are no problems. Could this be a race between the umount and stopping iscsi?
Could you post your fstab?
/dev/VolGroup00/LogVol00 / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
devpts /dev/pts devpts gid=5,mode=620 0 0
tmpfs /dev/shm tmpfs defaults 0 0
proc /proc proc defaults 0 0
sysfs /sys sysfs defaults 0 0
/dev/VolGroup00/LogVol01 swap swap defaults 0 0
/dev/guest_vg/guest_roots /guest_roots gfs2 defaults 0 0
covert:/cluster/mounts/nfs/sandbox /sandbox nfs defaults 0 0
OK, so I guess this is not going to be an easy one. I installed the snapshot from the 22nd here and it works fine for me with NetApp, IET, and tgt.

What might be happening is this: /etc/init.d/halt kills all processes, including iscsid, before the same script unmounts the FS. If the EqualLogic box sends a nop in that window, the nop can time out (iscsid was killed by halt, so it cannot respond) and the target will drop the session. We then get connection errors, and because halt killed iscsid we cannot recover. We end up hanging because the kernel is waiting for userspace to come back. The targets I tested with do not send nops as pings, so they would not hit the problem.

If we just add the disk you are mounting to the fstab with the _netdev option, as in the mail I sent, then we do not have to worry about any of this, since the FS will be umounted at the right time.

So I am not sure this is a regression. It might have existed in 5.0, with the timing just not lining up. If it is the iSCSI ping/nop problem I think it is, then it was in 5.0 as well. Let me do some more checking to make sure it is the problem I think it is, though.
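To make the timing argument above concrete, here is a minimal sketch of the suspected race in Python. It only models the reasoning in this comment, not the actual halt script, iscsid, or any target. The function name and all of the numbers are made up for illustration.

```python
def session_dropped_before_umount(first_nop_after_kill, nop_timeout, umount_done):
    """Model of the suspected race; all times are seconds after halt kills iscsid.

    first_nop_after_kill: when the target sends its next nop (None if the
                          target does not send nops as pings)
    nop_timeout:          how long the target waits for a reply before it
                          drops the session
    umount_done:          when the halt script has finished unmounting the FS
    """
    if first_nop_after_kill is None:
        return False  # no nop from the target, so nothing can time out
    # iscsid is gone, so the nop is never answered and expires at this time.
    drop_time = first_nop_after_kill + nop_timeout
    return drop_time < umount_done


# Hypothetical numbers, chosen only to show the three cases.
print(session_dropped_before_umount(1, 5, 20))     # nop expires during shutdown
print(session_dropped_before_umount(1, 60, 20))    # timeout outlasts shutdown
print(session_dropped_before_umount(None, 5, 20))  # target sends no nops
```

In this model the first case is the hang reported here, while the other two correspond to a target with a longer nop timeout and to targets such as the ones I tested with, which do not send nops as pings.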
Created attachment 182301 [details] hack halt script so iscsid is not killed too early

I am also attaching my hacky workaround patch. It works, but it may have lots of side effects, and we do not have time to test it out.

Also, as a note to users who find this bugzilla: just put your mounts in fstab with _netdev, or remember to unmount your manual mounts. If you are doing iSCSI root then you may still have this problem, but to work around it you can turn nops off on your target. I think EqualLogic allows this, but when using their multipath there may be side effects, so you may just want to increase the timeout to allow for the time it takes to do shutdown. For other targets like the DS300, you can probably just turn them off, since the initiator does nops itself and will figure things out.
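For anyone who wants to check their own fstab, here is a rough Python sketch that flags entries missing _netdev. An fstab entry alone does not say whether a device is iSCSI-backed (the GFS volume in this report is an LVM volume on top of the LUN), so the helper assumes you pass in the mount points you know sit on iSCSI. The function name is hypothetical, and the sample entries are taken from the fstab posted earlier in this bug.

```python
def mounts_missing_netdev(fstab_text, iscsi_mount_points):
    """Return the given mount points whose fstab options lack _netdev."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        mount_point, options = fields[1], fields[3]
        if mount_point in iscsi_mount_points and "_netdev" not in options.split(","):
            missing.append(mount_point)
    return missing


# Entry as posted in this bug: no _netdev on the iSCSI-backed GFS2 mount.
AS_REPORTED = """\
/dev/VolGroup00/LogVol00 / ext3 defaults 1 1
/dev/guest_vg/guest_roots /guest_roots gfs2 defaults 0 0
"""

# Same entry with the _netdev option set.
WITH_NETDEV = """\
/dev/VolGroup00/LogVol00 / ext3 defaults 1 1
/dev/guest_vg/guest_roots /guest_roots gfs2 _netdev 0 0
"""

print(mounts_missing_netdev(AS_REPORTED, {"/guest_roots"}))  # ['/guest_roots']
print(mounts_missing_netdev(WITH_NETDEV, {"/guest_roots"}))  # []
```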
in 2.6.18-66.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html