Red Hat Bugzilla – Bug 144781
Kernel panic in shutdown path when iSCSI LUNs are mounted
Last modified: 2007-11-30 17:07:05 EST
Description of problem: Kernel panic in shutdown path when iSCSI LUNs are mounted
Version-Release number of selected component (if applicable):
How reproducible: Happens every time
Steps to Reproduce:
1. Boot machine with "nmi_watchdog=1" kernel option
2. Mount iSCSI LUN
3. Shut down or reboot the machine
Actual results: Kernel panic on shutdown
Expected results: Machine should shut down without a panic.
Here's the panic:
Red Hat Enterprise Linux AS release 3 (Taroon Update 4)
Kernel 2.4.21-27.ELsmp on an i686
INIT: Stopping Red Hat Network Daemon: [ OK ]
Stopping atd: [ OK ]
Stopping cups: [ OK ]
Shutting down xfs: [ OK ]
Shutting down console mouse services: [ OK ]
Stopping sshd:[ OK ]
Stopping xinetd: [ OK ]
Stopping crond: [ OK ]
Stopping automount:[ OK ]
Shutting down NIS services: [ OK ]
Shutting down ntpd: [ OK ]
Saving random seed: [ OK ]
Killing mdmonitor: [ OK ]
Stopping NFS statd: [ OK ]
Stopping portmapper: [ OK ]
Shutting down kernel logger: [ OK ]
Shutting down system logger: [ OK ]
Stopping iscsi: Stopping iSCSI: iscsidiscsi_sfnet: Device or resource
Unable to remove iscsi kernel driver - devices may still be in use
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down loopback interface: [ OK ]
Shutting down audit subsystem[ OK ]
Starting killall: [ OK ]
Sending all processes the TERM signal...
Sending all prociSCSI: tx thread 1691 received SIGKILL, killing rx
iSCSI: session dd1e2000 has ended quickly 1 times, login delay 1 seconds
NMI Watchdog detected LOCKUP on CPU3, eip c0138fe6, registers:
iscsi_sfnet nfs lockd sunrpc usbserial lp parport autofs4 audit e1000
tg3 floppy sg microcode keybdev mousedev hid input usb-ohci up
EIP: 0060:[<c0138fe6>] Not tainted
EIP is at __group_send_sig_info [kernel] 0x3f6 (2.4.21-27.ELsmp/i686)
eax: ddf31880 ebx: 00000282 ecx: 00000000 edx: dcc2c000
esi: 00000012 edi: dbe4a000 ebp: dcc2df40 esp: dcc2deec
ds: 0068 es: 0068 ss: 0068
Process killall5 (pid: 2686, stackpage=dcc2d000)
Stack: 00000012 dcc2df40 dbe4a000 dbe4a000 dcc2df40 00000012 dcc2c000
00000012 dcc2df40 dbe4a000 00000011 00000000 dcc2c000 00000a7e
bfff9d98 c0137a87 00000012 dcc2df40 ffffffff 00000012 00000000
Call Trace: [<c01368bc>] kill_something_info [kernel] 0xcc (0xdcc2df08)
[<c0137a87>] sys_kill [kernel] 0x57 (0xdcc2df30)
[<c0125ed4>] context_switch [kernel] 0xa4 (0xdcc2df60)
[<c0123f14>] schedule [kernel] 0x2f4 (0xdcc2df7c)
Code: f3 90 7e f5 e9 23 d6 ff ff e8 bc 1f fd ff e9 d9 d6 ff ff e8
console shuts up ...
Reassigning to Tom to look at a kernel-side fix for this.
Created attachment 110211
iscsi shutdown errors
I am having a similar issue. The machine locks up after the end of what you
see in the screen shot attachment. Then the machine's watchdog notices the
machine is locked up (after a few minutes pass) and power cycles it. This is a
Dell PowerEdge 2850. I opened a case with Dell (754209) but they can't do
anything more than enter a bug (like this one) on bugzilla. ;(
I'll be happy to help debug this issue. I do not understand why the iscsi
mounts are not unmounting properly. Nothing was holding them hostage.
I added -o _netdev to mount as recommended in the README file;
however, this doesn't seem to have any effect. Obviously, if I
unmount the filesystems then I'm able to stop or restart iscsid or
reboot/shutdown the server. The filesystems have no processes
attached or running that would cause a lock to exist.
There are two problems here. The first is that the netfs service is not
unmounting the iSCSI devices. This fails even though the device is
mounted with "_netdev". I did find that "umount -a -O _netdev" works,
but "service netfs stop" does not. Would you look into this?
The second problem is a kernel hang when we get to "Sending all
processes the KILL signal" when there are still iSCSI devices mounted.
I will have to look into this one. It will usually be avoided if the
first problem is fixed, but it needs to be fixed anyway.
I'll see if I can figure out what's going on with the netfs script.
The netfs issue is being dealt with in BZ 147610 (thanks AJ). I'm reassigning
this to me, to deal with the kernel part of it.
Has this been addressed in Update5?
This is not fixed in U5. Our expectation is that this problem will not occur in
normal circumstances, once BZ 147610 is fixed. The latter is planned for U5.
This bug is on the list for U6.
Created attachment 115957
fix sigkill shutdown
On SIGKILL, the function that kills the other process does not
return until that process has run. Unfortunately, we hold the
signal lock at that point, and the other process will also try
to acquire it.
To fix this, the patch takes the values we need and then
releases the lock. If a signal comes along while the lock is
dropped, we will get it on the next spin; that does not happen
often, and this is not a performance-critical operation.
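For readers without access to the attachment, here is a rough user-space analogy of the pattern described above. It is not the kernel patch. It uses a pthread mutex in place of the signal lock and pthread_join() in place of waiting for the other process to run. All names (sig_state, target_thread, kill_and_wait, demo) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state guarded by the signal lock. */
struct sig_state {
    pthread_mutex_t lock;  /* plays the role of the signal lock */
    int pending_sig;       /* a value the sender needs to read */
    bool target_ran;       /* set by the target once it has run */
};

/* The target cannot make progress until it gets the lock. */
static void *target_thread(void *arg)
{
    struct sig_state *st = arg;

    pthread_mutex_lock(&st->lock);
    st->target_ran = true;
    pthread_mutex_unlock(&st->lock);
    return NULL;
}

/*
 * The sender snapshots what it needs under the lock, drops the lock,
 * and only then waits for the target. Waiting with the lock still
 * held would hang forever, because the target blocks on the same lock.
 */
static int kill_and_wait(struct sig_state *st)
{
    pthread_t target;
    int sig;

    assert(st != NULL);

    pthread_mutex_lock(&st->lock);
    if (pthread_create(&target, NULL, target_thread, st) != 0) {
        pthread_mutex_unlock(&st->lock);
        return -1;
    }
    sig = st->pending_sig;            /* take the values we need */
    pthread_mutex_unlock(&st->lock);  /* release before waiting */

    pthread_join(target, NULL);       /* target can now take the lock */

    return st->target_ran ? sig : -1;
}

/* Small driver: returns the signal number if the target ran. */
static int demo(void)
{
    struct sig_state st;
    int sig;

    pthread_mutex_init(&st.lock, NULL);
    st.pending_sig = 9;               /* SIGKILL */
    st.target_ran = false;

    sig = kill_and_wait(&st);
    pthread_mutex_destroy(&st.lock);
    return sig;
}
```

Calling demo() from a main() should return 9 once the target thread has run. Moving the pthread_mutex_unlock() call to after pthread_join() reproduces the hang in miniature.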
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.11.EL).
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.