Bug 209523 - Problem stopping iscsi initiator service causes shutdown to hang
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: iscsi-initiator-utils
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Chris Leech
QA Contact: Brock Organ
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-10-05 19:35 UTC by Chris Evich
Modified: 2014-03-07 14:55 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-03-07 14:55:55 UTC
Target Upstream Version:
Embargoed:



Description Chris Evich 2006-10-05 19:35:38 UTC
Steps to Reproduce:
1. Install xen guest and include iscsi-initiator-utils package
2. Use iscsiadm to discover, login, and access an iscsi target
3. sync disks
4. hard-reboot domain: xm destroy dom1
5. start up domain: xm create dom1
6. Make sure the iscsi device shows up in /proc/scsi/scsi on guest
7. shutdown guest: xm shutdown dom1
8. monitor progress: xm console dom1
9. system will never disappear from output of: xm list

Actual results:
Near the end of the shutdown process you see:

...
Shutting down system logger: [  OK  ]
Shutting down hidd: [  OK  ]
Stopping iSCSI initiator service: KERNEL: assertion
(!atomic_read(&sk->sk_rmem_alloc)) failed at net/netlink/af_netlink.c (145)
[  OK  ]
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
...
Turning off swap:
Turning off quotas:
Unmounting pipe file systems:
Unmounting file systems:
Halting system...
md: stopping all md devices.
Synchronizing SCSI cache for disk sda:
iscsi: can not broadcast skb (-3)
 connection0:0: iscsi: detected conn error (1011)

-hangs-



Expected results:
Near the end of the shutdown process you see:
...
Shutting down system logger: [  OK  ]
Shutting down hidd: [  OK  ]
Stopping iSCSI initiator service: [  OK  ]
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
...
Turning off swap:
Turning off quotas:
Unmounting pipe file systems:
Unmounting file systems:
Halting system...
md: stopping all md devices.

-domain disappears from output of: xm list-

Comment 1 Mike Christie 2006-10-05 20:08:28 UTC
(In reply to comment #0)
> Stopping iSCSI initiator service: KERNEL: assertion
> (!atomic_read(&sk->sk_rmem_alloc)) failed at net/netlink/af_netlink.c (145)
> [  OK  ]

This assert is fixed now. It is not the cause of your problem. I just am writing
this so people reading the BZ do not get worried about it.

> Shutting down interface eth0:  [  OK  ]
> Shutting down interface eth1:  [  OK  ]
> ...


This may be the problem. Are these the interfaces the iscsi traffic goes
through? If so then that is why the shutdown hangs down below.


> Turning off swap:
> Turning off quotas:
> Unmounting pipe file systems:
> Unmounting file systems:
> Halting system...
> md: stopping all md devices.
> Synchronizing SCSI cache for disk sda:
> iscsi: can not broadcast skb (-3)
>  connection0:0: iscsi: detected conn error (1011)
> 

Down here the network is not up, but the iscsi disks are still running, so we are
sort of screwed. During kernel shutdown the scsi layer will send a cache sync
command for each disk and then wait for it to finish. But because the network is
not up we cannot send iscsi commands, and since userspace is not up iscsid cannot
handle the error and fail the command, so we are stuck and wait around forever.

So currently you have to manually disable the network shutdown step.

Bill or Miloslav,

A while back I had proposed something like this

--- S00killall.orig	2006-05-03 11:29:17.000000000 -0500
+++ S00killall.work	2006-05-03 11:29:48.000000000 -0500
@@ -20,8 +20,10 @@ for i in /var/lock/subsys/* ; do
 	# Get the subsystem name.
 	subsys=${i#/var/lock/subsys/}
 	
-	# Networking could be needed for NFS root.
+	# Networking could be needed for NFS root or services like raid
+	# or multipath over iscsi
 	[ $subsys = network ] && continue
+	[ $subsys = iscsi ] && continue
 
 	# Bring the subsystem down.
 	if [ -f /etc/init.d/$subsys.init ]; then


But you guys did not like it. Are you still against it? For RHEL5/FC6 we have to
support iscsi root boot and iscsi over lots of stuff so it would be nice to make
it so the user does not have to touch anything. Should I instead do something
like this in the iscsi script?

+       # we do not want iscsi or network to run during system shutdown
+       # in case there are RAID or multipath devices using
+       # iscsi disks
+       chkconfig --level 06 network off
+       rm /etc/rc0.d/*network
+       rm /etc/rc6.d/*network
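
The effect of the first proposal in this comment can be tried outside the init scripts. The following is a minimal sketch, assuming a directory of lock files laid out like /var/lock/subsys. The function name subsystems_to_stop and its directory argument are hypothetical and are not part of initscripts. It only prints the subsystems that a killall-style loop would bring down, skipping network and iscsi as the patch suggests.

```shell
#!/bin/sh
# Hypothetical helper: list the subsystems a killall-style loop would stop.
# $1 is a directory of lock files (assumed default: /var/lock/subsys).
subsystems_to_stop() {
    dir=${1:-/var/lock/subsys}
    for i in "$dir"/*; do
        # Skip the unexpanded glob when the directory is empty.
        [ -e "$i" ] || continue
        # Get the subsystem name from the lock file path.
        subsys=${i#"$dir"/}
        # Networking (and, per the proposal, iscsi) may still be needed by
        # network-backed storage, so leave both running.
        [ "$subsys" = network ] && continue
        [ "$subsys" = iscsi ] && continue
        echo "$subsys"
    done
}
```

Running subsystems_to_stop with no argument on a test machine shows which services would still be stopped. Nothing is actually shut down by this sketch.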

Comment 2 Bill Nottingham 2006-10-05 20:58:40 UTC
Are your filesystems mounted with the _netdev option?

Comment 3 Chris Evich 2006-10-06 14:01:21 UTC
No they're not mounted with _netdev.  The only iscsi device (sda) contains a
GFS2 filesystem which was mounted at the time the shutdown was issued via xm. 
However, my understanding is the GFS2 service should have come down prior to the
iscsi and network services.  That means the filesystem should not have been
mounted when iscsi was shut down (the GFS2 service should have unmounted it).  

Seems to me, for iscsi devices, we should do the final sync when the iscsi
service shuts down, and somehow make bloody sure it also removes any hooks which
could cause a final SCSI sync on any iscsi devices later on as they're
guaranteed to fail.  That's just my non-developer opinion though :)

Comment 4 Bill Nottingham 2006-10-06 15:09:14 UTC
The reason I asked about _netdev is that is the option used by the network
scripts to know whether or not to shut down the network; it's somewhat
tangential to the iscsi/gfs shutdown order, but it's needed so that the network
script knows you have a network FS.
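
To make the _netdev idea concrete, here is a minimal sketch of the kind of check described above. It is not the actual /etc/init.d/network code. The function names netdev_mounts and has_netdev_mounts are hypothetical, and the sketch assumes an fstab/mtab-format file in which the fourth field holds the mount options.

```shell
#!/bin/sh
# Hypothetical helper: print mount points whose options include _netdev or
# _rnetdev. $1 is an fstab/mtab-format file (assumed default: /etc/mtab).
netdev_mounts() {
    awk '$1 !~ /^#/ && $4 ~ /(^|,)_r?netdev(,|$)/ { print $2 }' "${1:-/etc/mtab}"
}

# Succeeds (exit status 0) if at least one such mount is listed.
has_netdev_mounts() {
    [ -n "$(netdev_mounts "$1")" ]
}
```

A stop script could call has_netdev_mounts and leave the interfaces up when it succeeds. An entry such as "/dev/sda1 /mnt/gfs gfs2 defaults,_netdev 0 0" (device and mount point are illustrative) would be reported as /mnt/gfs.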

Comment 5 Chris Evich 2006-10-06 16:08:26 UTC
I'll try setting it and see what happens.

Comment 6 Chris Evich 2006-10-06 18:53:25 UTC
Uh-oh...I get:

kernel: GFS2: fsid=: unknown option: _netdev
kernel: GFS2: fsid=: invalid mount option(s)
kernel: GFS2: can't parse mount arguments

Though this seems like a separate problem.  Shall I open another bug on it or
is it a known problem?

Comment 7 Bill Nottingham 2006-10-06 18:55:16 UTC
Is this being mounted by mount(8), or mount(2)?

Comment 8 Chris Evich 2006-10-09 18:43:59 UTC
mount(8), those were the entries in /var/log/messages.

Comment 9 Bill Nottingham 2006-11-02 19:34:49 UTC
Argh, I wasn't very clear, and sorry about the delay.

_netdev is used in fstab to characterize network filesystems. Is this fs in
fstab, or mounted by hand?

Comment 10 Chris Evich 2006-11-02 20:58:45 UTC
Yes, I put _netdev in fstab when I got that GFS2 error.  Though I think those
GFS2 errors are a separate problem.  The filesystem was in fstab and it was
being mounted by hand.  

Wouldn't it be valid to use _netdev with an ext3 filesystem over iSCSI in a
similar manner?  Unfortunately my reproducer for this problem has been formatted
and re-installed.  However it should be fairly easy to recreate the setup in the
lab.  

Comment 11 Wolfram Richter 2007-05-24 13:09:08 UTC
I'm not sure whether this shouldn't be cloned to RHEL 5 final.
A RHEL 5 machine with a GFS2 FS on an iSCSI target still does not reboot.

I haven't yet found an easy workaround except manual shutdown.

Comment 12 Vadym Chepkov 2010-06-10 12:22:47 UTC
Was this ever solved?

Comment 13 Vadym Chepkov 2010-06-10 13:29:50 UTC
In Redhat 5.3 this code was present in init.d/iscsid:


chkconfig --level 06 network off
rm -f /etc/rc0.d/*network
rm -f /etc/rc6.d/*network

But in 5.5 it was removed, and now the system hangs during shutdown.
Was another solution introduced in 5.5 that I have to enable?

Thanks

Comment 14 Bill Nottingham 2010-06-10 15:37:54 UTC
For properly configured iscsi devices (with _netdev/_rnetdev set, etc.) the network scripts will not shut down networking.

Comment 15 Vadym Chepkov 2010-06-10 18:09:29 UTC
well, that would imply they have to be mounted from /etc/fstab

what about autofs? pacemaker?

Comment 16 Chris Evich 2010-06-10 18:16:54 UTC
By my reading of /etc/init.d/network in RHEL 5.5 the network scripts check /proc/mounts and/or /etc/mtab so the situations you're concerned about should be covered (assuming autofs/pacemaker specify _netdev option for iscsi filesystems).

Comment 17 Vadym Chepkov 2010-06-10 19:55:02 UTC
But it might not even be a file system,
just a block device for the sbd daemon, for example.

If you just log in to the targets and don't do anything at all, you won't be able to reboot your server.

That's what is happening to me, for example:


Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
Waiting for corosync services to unload:..[  OK  ]
Stopping sshd: [  OK  ]
Shutting down sm-client: [  OK  ]
Shutting down sendmail: [  OK  ]
Shutting down ntpd: [  OK  ]
Stopping system message bus: [  OK  ]
Shutting down kernel logger: [  OK  ]
Shutting down system logger: [  OK  ]
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Starting killall:  [  OK  ]
Sending all processes the TERM signal... 
Sending all processes the KILL signal... 
Saving random seed:  
Syncing hardware clock to system time Cannot access the Hardware Clock via any known method.
Use the --debug option to see the details of our search for an access method.

Turning off swap:  
Please stand by while rebooting the system...
md: stopping all md devices.
Synchronizing SCSI cache for disk sdb: 
 connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4299361323, last ping 4299362602, now 4299363876
 connection1:0: detected conn error (1011)

That's it, system is toasted
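
Since the case above involves no mounted filesystem, a mount-option check cannot see it. One possible alternative, shown here as a minimal sketch, is to look for active sessions directly. It assumes the kernel exposes sessions as sessionN entries under /sys/class/iscsi_session. The function name iscsi_session_count is hypothetical and is not part of iscsi-initiator-utils.

```shell
#!/bin/sh
# Hypothetical helper: count iSCSI sessions known to the kernel.
# $1 is the sysfs class directory (assumed default: /sys/class/iscsi_session).
iscsi_session_count() {
    dir=${1:-/sys/class/iscsi_session}
    count=0
    for s in "$dir"/session*; do
        # Skip the unexpanded glob when there are no sessions.
        [ -e "$s" ] || continue
        count=$((count + 1))
    done
    echo "$count"
}
```

A shutdown script could compare the result with 0 and keep the interfaces up while any session remains logged in, whether or not a filesystem is mounted on it.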

Comment 18 Mike Christie 2010-06-10 20:21:27 UTC
Vadym,

You are hitting this:
https://bugzilla.redhat.com/show_bug.cgi?id=583218

Comment 19 RHEL Program Management 2014-03-07 13:33:26 UTC
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.

Comment 20 Chris Evich 2014-03-07 14:55:55 UTC
Based on comment 14, and since these are documented configuration requirements, I think we can just close this.

