Description of problem:
Kernel panic in shutdown path when iSCSI LUNs are mounted

Version-Release number of selected component (if applicable):
3.6.2

How reproducible:
Happens every time

Steps to Reproduce:
1. Boot machine with "nmi_watchdog=1" kernel option
2. Mount iSCSI LUN
3. Shut down or reboot the machine

Actual results:
Kernel panic on shutdown

Expected results:
Machine should shut down without a panic.

Additional info:
Here's the panic:

Red Hat Enterprise Linux AS release 3 (Taroon Update 4)
Kernel 2.4.21-27.ELsmp on an i686

INIT: Stopping Red Hat Network Daemon: [ OK ]
Stopping atd: [ OK ]
Stopping cups: [ OK ]
Shutting down xfs: [ OK ]
Shutting down console mouse services: [ OK ]
Stopping sshd:[ OK ]
Stopping xinetd: [ OK ]
Stopping crond: [ OK ]
Stopping automount:[ OK ]
Shutting down NIS services: [ OK ]
Shutting down ntpd: [ OK ]
Saving random seed: [ OK ]
Killing mdmonitor: [ OK ]
Stopping NFS statd: [ OK ]
Stopping portmapper: [ OK ]
Shutting down kernel logger: [ OK ]
Shutting down system logger: [ OK ]
Stopping iscsi: Stopping iSCSI: iscsidiscsi_sfnet: Device or resource busy
Unable to remove iscsi kernel driver - devices may still be in use [FAILED]
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down loopback interface: [ OK ]
Shutting down audit subsystem[ OK ]
Starting killall: [ OK ]
Sending all processes the TERM signal...
Sending all prociSCSI: tx thread 1691 received SIGKILL, killing rx thread 1692
iSCSI: session dd1e2000 has ended quickly 1 times, login delay 1 seconds
NMI Watchdog detected LOCKUP on CPU3, eip c0138fe6, registers:
iscsi_sfnet nfs lockd sunrpc usbserial lp parport autofs4 audit e1000 tg3 floppy sg microcode keybdev mousedev hid input usb-ohci up
CPU: 3
EIP: 0060:[<c0138fe6>] Not tainted
EFLAGS: 00000082
EIP is at __group_send_sig_info [kernel] 0x3f6 (2.4.21-27.ELsmp/i686)
eax: ddf31880 ebx: 00000282 ecx: 00000000 edx: dcc2c000
esi: 00000012 edi: dbe4a000 ebp: dcc2df40 esp: dcc2deec
ds: 0068 es: 0068 ss: 0068
Process killall5 (pid: 2686, stackpage=dcc2d000)
Stack: 00000012 dcc2df40 dbe4a000 dbe4a000 dcc2df40 00000012 dcc2c000 c01368bc
       00000012 dcc2df40 dbe4a000 00000011 00000000 dcc2c000 00000a7e 000007c9
       bfff9d98 c0137a87 00000012 dcc2df40 ffffffff 00000012 00000000 00000000
Call Trace: [<c01368bc>] kill_something_info [kernel] 0xcc (0xdcc2df08)
[<c0137a87>] sys_kill [kernel] 0x57 (0xdcc2df30)
[<c0125ed4>] context_switch [kernel] 0xa4 (0xdcc2df60)
[<c0123f14>] schedule [kernel] 0x2f4 (0xdcc2df7c)
Code: f3 90 7e f5 e9 23 d6 ff ff e8 bc 1f fd ff e9 d9 d6 ff ff e8

console shuts up ...
Reassigning to Tom to look at a kernel-side fix for this.
Created attachment 110211 [details]
iscsi shutdown errors

I am having a similar issue. The machine locks up after the end of what you see in the screenshot attachment. Then the machine's watchdog notices the machine is locked up (after a few minutes pass) and power cycles it. This is a Dell PowerEdge 2850. I opened a case with Dell (754209) but they can't do anything more than enter a bug (like this one) on bugzilla. ;( I'll be happy to help debug this issue. I do not understand why the iscsi mounts are not unmounting properly. Nothing was holding them hostage.
I added -o _netdev to mount as recommended in the README file; however, this doesn't seem to have any effect. Obviously, if I unmount the filesystems first, I'm able to stop or restart iscsid or reboot/shut down the server. The filesystems have no processes attached or running that would cause a lock to exist.
AJ,

There are two problems here. The first is that the netfs service is not unmounting the iSCSI devices. This fails even though the device is mounted with "_netdev". I did find that "umount -a -O _netdev" works, but "service netfs stop" does not. Would you look into this?

The second problem is a kernel hang when we get to "Sending all processes the KILL signal" while iSCSI devices are still mounted. I will have to look into this one. It will usually be avoided if the first problem is fixed, but it needs to be fixed anyway.

Tom
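For anyone following along, the option-based selection that makes "umount -a -O _netdev" pick out the iSCSI mounts can be illustrated roughly as below. This is only a sketch of matching entries by mount option. It is not how umount or the netfs script is implemented, and the function name, the sample device names, and the sample mount points are all hypothetical.

```python
# Hypothetical mount table in the usual "device mountpoint fstype options dump pass" layout.
SAMPLE_TABLE = """\
# device      mountpoint  fstype  options          dump pass
/dev/sda1     /           ext3    defaults         1 1
/dev/sdb1     /mnt/iscsi  ext3    _netdev,noatime  0 0
server:/home  /home       nfs     rw               0 0
"""


def netdev_mount_points(table_text):
    """Return the mount points whose option list contains _netdev."""
    selected = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete entry
        mount_point, options = fields[1], fields[3]
        # Options are comma-separated; match the whole option, not a substring.
        if "_netdev" in options.split(","):
            selected.append(mount_point)
    return selected


if __name__ == "__main__":
    print(netdev_mount_points(SAMPLE_TABLE))
```

The point of matching on the option rather than on the filesystem type is that an iSCSI LUN carries an ordinary local filesystem type (ext3 in the sample), so a type-based filter alone would not recognize it as network-backed.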
I'll see if I can figure out what's going on with the netfs script.
The netfs issue is being dealt with in BZ 147610 (thanks AJ). I'm reassigning this to me, to deal with the kernel part of it.
Has this been addressed in Update5?
This is not fixed in U5. Our expectation is that this problem will not occur in normal circumstances, once BZ 147610 is fixed. The latter is planned for U5. This bug is on the list for U6.
Created attachment 115957 [details]
fix sigkill shutdown

For SIGKILL, when we go to kill the other process, the function does not return until the other process has run. Unfortunately, we hold the signal lock while waiting, and the other process also tries to acquire it, so neither side can make progress. To fix this, the patch just takes the values we need and then releases the lock. If a signal comes along while the lock is dropped, we will get it on the next spin; that does not happen too often, and this is not a performance-critical operation.
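The locking pattern described above (copy what you need under the lock, drop the lock, then wait) can be sketched in userspace as follows. This is an analogy using Python threads, not the attached kernel patch. The class, the function names, and the pending-mask value are hypothetical stand-ins.

```python
import threading


class Session:
    """Hypothetical shared state guarded by a lock that plays the role of the signal lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.pending = 0x100              # pretend pending-signal mask
        self.handled = threading.Event()  # set once the other thread has run


def other_thread(session):
    # The other thread must take the same lock before it can make progress.
    with session.lock:
        session.pending = 0
    session.handled.set()


def kill_and_wait(session, timeout=2.0):
    # Take the values we need while holding the lock...
    with session.lock:
        snapshot = session.pending
    # ...and release it before waiting, so the other thread can acquire it.
    worker = threading.Thread(target=other_thread, args=(session,))
    worker.start()
    finished = session.handled.wait(timeout)
    worker.join(timeout)
    return snapshot, finished


if __name__ == "__main__":
    print(kill_and_wait(Session()))
```

If the wait were moved inside the `with session.lock:` block, `other_thread` could never acquire the lock and the wait would only end by timing out, which is the userspace equivalent of the lockup the NMI watchdog reported.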
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.11.EL).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html