Bug 253989

Summary: iSCSI Hangs on reboot
Product: Red Hat Enterprise Linux 5
Reporter: Rob Kenna <rkenna>
Component: kernel
Assignee: Mike Christie <mchristi>
Status: CLOSED ERRATA
QA Contact: Brock Organ <borgan>
Severity: medium
Priority: high
Version: 5.1
CC: 157070.alewis, bpeters, coughlan, dshaks, rsarraf
Hardware: All
OS: Linux
Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 14:53:34 UTC
Bug Blocks: 227737, 372911, 420521, 422431, 422441

Attachments:

Description	Flags
Console screen image of hang	none
hack halt script so iscsid is not killed too early	none

Description Rob Kenna 2007-08-23 13:35:24 UTC

Description of problem:
Rebooting a node with a mounted iSCSI LUN hangs (see the attached console image of the failure)

Version-Release number of selected component (if applicable):
 2.6.18-42 (RHEL 5.1 beta, multiple versions)

How reproducible:
100%

Steps to Reproduce:
1. Mount iSCSI LUN
2. Reboot
Note: This is in a cluster configuration
  
Actual results:
Hangs with the following console output:

iscsi: can not unicast skb (-111)
iscsi: can not broadcast skb (-3)
connection6:0 iscsi: conn error  (1011)
Unmounting file systems: iscsi: can not unicast skb (-111)
...
...

Expected results:


Additional info:

Comment 1 Rob Kenna 2007-08-23 13:37:44 UTC

This is potentially a blocker.  We need to get this vetted ASAP.

Comment 2 Rob Kenna 2007-08-23 13:41:04 UTC

Created attachment 168592 [details]
Console screen image of hang

Comment 3 Mike Christie 2007-08-23 16:18:56 UTC

Some questions:

1. What file system was this mounted with? Was it GFS?
2. Did you mount the LUN by hand or did you enter it in your /etc/fstab with the
_netdev option?
3. Is the iscsi init script properly installed? Run chkconfig --list iscsi to
make sure that the iscsi script is installed and on.
4. How are you setting up the network? Are you using NetworkManager, or did you
just set up the networking during the installation of the system and have not
worried about it since?

Comment 4 Rob Kenna 2007-08-23 18:22:30 UTC

Response to Comment #3

1) There are 3 nodes in a 5.1 cluster configuration.  The nodes share a GFS file
system (/guest_roots) which is iSCSI based from an EqualLogic array.

2) Mounted via /etc/fstab

3) [root@et-virt05 ~]# chkconfig --list iscsi
iscsi           0:off   1:off   2:off   3:on    4:on    5:on    6:off

4) These are server machines not using NetworkManager.  DHCP assigns the addresses.

Also, if I umount the GFS fs (the one on the iSCSI LUN) and then reboot, there
are no problems.  Could this be a race between the umount and stopping iscsi?

Comment 5 Mike Christie 2007-08-23 18:28:29 UTC

Could you post your fstab?

Comment 6 Rob Kenna 2007-08-23 18:30:36 UTC

/dev/VolGroup00/LogVol00 /                       ext3    defaults        1 1
LABEL=/boot             /boot                   ext3    defaults        1 2
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
sysfs                   /sys                    sysfs   defaults        0 0
/dev/VolGroup00/LogVol01 swap                    swap    defaults        0 0
/dev/guest_vg/guest_roots /guest_roots gfs2 defaults 0 0
covert:/cluster/mounts/nfs/sandbox /sandbox nfs defaults 0 0

Comment 7 Mike Christie 2007-08-23 22:16:45 UTC

Ok, so I guess this is not going to be an easy one. I installed the snapshot from
the 22nd here and it works fine for me with NetApp, IET and tgt targets.

What might be happening is this: /etc/init.d/halt kills all processes, including
iscsid, before the same script unmounts the FS. If the EqualLogic box sends a nop
in that window, the nop times out (iscsid was killed by halt so it cannot
respond) and the target drops the session. We then get connection errors, and
because halt killed iscsid we cannot recover, so we end up hanging because the
kernel is waiting for userspace to come back.

The targets I tested with do not send nops as pings, so I would not hit the
problem with them.

If we just add the disk you are mounting to the fstab with the _netdev option,
like in the mail I sent, then we do not have to worry about any of this since the
FS will be unmounted at the right time. So I am not sure this is a regression. It
might have existed in 5.0, but the timing was not right to trigger it. If it is
the iscsi ping/nop problem I think it is, then this was in 5.0 too.

Let me do some more checking to make sure it is the problem I think it is though.
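
The fstab change suggested above can be sketched as a small shell helper, using
the /guest_roots entry from Comment 6 as the example. This is an illustration
only, not a tested fix: add_netdev is a hypothetical name, and the helper prints
the edited table to stdout instead of touching /etc/fstab, so the result can be
reviewed first.

```shell
#!/bin/sh
# add_netdev FSTAB MOUNTPOINT   (hypothetical helper, not part of any package)
# Print FSTAB to stdout with ",_netdev" appended to the options field of the
# entry whose mount point is MOUNTPOINT. Entries that already carry the
# option, comment lines and all other entries are printed unchanged.
add_netdev() {
    awk -v mp="$2" '
        # Field 2 is the mount point, field 4 the comma-separated options.
        $1 !~ /^#/ && $2 == mp && $4 !~ /(^|,)_netdev(,|$)/ {
            $4 = $4 ",_netdev"   # awk rejoins the fields with single spaces
        }
        { print }
    ' "$1"
}

# Example (review the output before copying it over /etc/fstab):
#   add_netdev /etc/fstab /guest_roots > /tmp/fstab.new
```

With the fstab from Comment 6, the /guest_roots line would come out with
"defaults,_netdev" in the options field; whether that alone is enough for a
given cluster setup is not something this sketch verifies.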

Comment 10 Mike Christie 2007-08-30 23:18:26 UTC

Created attachment 182301 [details]
hack halt script so iscsid is not killed too early

I am also attaching my hacky workaround patch. It works, but it may have lots of
side effects and we do not have time to test it out properly. Also, as a note to
users that find this bugzilla: just put your mounts in fstab with _netdev, or
remember to unmount your manual mounts before rebooting.

If you are doing iscsi root then you may still have this problem, but to work
around it you can turn nops off on your target. I think EqualLogic allows this,
but when using their multipath there may be side effects, so you may just want
to increase the timeout to allow for the time it takes to do shutdown. For
other targets like the DS300, you can probably just turn them off since the
initiator does nops itself and will figure things out.
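
For the "remember to unmount your manual mounts" advice, a small helper can list
what is still mounted before a reboot. This is only a sketch: mounts_of_type is
a hypothetical name, it assumes the usual /proc/mounts field layout (device,
mount point, type), and it filters by file system type (gfs2 here, as in this
report) rather than detecting which devices are actually iSCSI-backed.

```shell
#!/bin/sh
# mounts_of_type FSTYPE [MOUNTS_FILE]   (hypothetical helper)
# Print the mount point of every mounted file system of type FSTYPE.
# MOUNTS_FILE defaults to /proc/mounts; field 2 is the mount point and
# field 3 the file system type.
mounts_of_type() {
    awk -v t="$1" '$3 == t { print $2 }' "${2:-/proc/mounts}"
}

# Example: unmount every gfs2 file system before rebooting and report
# any that could not be unmounted.
#   for mp in $(mounts_of_type gfs2); do
#       umount "$mp" || echo "could not unmount: $mp" >&2
#   done
```

Filtering by type keeps the sketch short; on a system with local file systems of
the same type, the list would need to be narrowed by hand.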

Comment 12 Don Zickus 2008-01-10 20:42:29 UTC

in 2.6.18-66.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 15 errata-xmlrpc 2008-05-21 14:53:34 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html