Bug 144781

Summary: Kernel panic in shutdown path when iSCSI LUNs are mounted
Product: Red Hat Enterprise Linux 3
Reporter: Dave Wysochanski <davidw>
Component: kernel
Assignee: Mike Christie <mchristi>
Status: CLOSED ERRATA
QA Contact: Brock Organ <borgan>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.0
CC: berthiaume_wayne, conway_heather, coughlan, davidw, josh, kaufman_susan, petrides, poelstra, rkenna
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-09-28 14:41:31 UTC
Type: ---
Bug Depends On: 147610    
Bug Blocks: 156320    
Attachments:
Description              Flags
iscsi shutdown errors    none
fix sigkill shutdown     none

Description Dave Wysochanski 2005-01-11 15:06:02 UTC
Description of problem: Kernel panic in shutdown path when iSCSI LUNs
are mounted


Version-Release number of selected component (if applicable):
3.6.2

How reproducible:
Happens every time

Steps to Reproduce:
1. Boot machine with "nmi_watchdog=1" kernel option
2. Mount iSCSI LUN
3. Shut down or reboot the machine
  
Actual results:
Kernel panic on shutdown

Expected results:
Machine should shut down without a panic.

Additional info:

Comment 1 Dave Wysochanski 2005-01-11 15:06:49 UTC
Here's the panic:

Red Hat Enterprise Linux AS release 3 (Taroon Update 4)
Kernel 2.4.21-27.ELsmp on an i686

INIT: Stopping Red Hat Network Daemon: [  OK  ]
Stopping atd: [  OK  ]
Stopping cups: [  OK  ]
Shutting down xfs: [  OK  ]
Shutting down console mouse services: [  OK  ]
Stopping sshd:[  OK  ]
Stopping xinetd: [  OK  ]
Stopping crond: [  OK  ]
Stopping automount:[  OK  ]
Shutting down NIS services: [  OK  ]
Shutting down ntpd: [  OK  ]
Saving random seed:  [  OK  ]
Killing mdmonitor: [  OK  ]
Stopping NFS statd: [  OK  ]
Stopping portmapper: [  OK  ]
Shutting down kernel logger: [  OK  ]
Shutting down system logger: [  OK  ]
Stopping iscsi:  Stopping iSCSI: iscsidiscsi_sfnet: Device or resource
busy
Unable to remove iscsi kernel driver - devices may still be in use

[FAILED]
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Shutting down audit subsystem[  OK  ]
Starting killall:  [  OK  ]
Sending all processes the TERM signal...
Sending all prociSCSI: tx thread 1691 received SIGKILL, killing rx
thread 1692
iSCSI: session dd1e2000 has ended quickly 1 times, login delay 1 seconds
NMI Watchdog detected LOCKUP on CPU3, eip c0138fe6, registers:
iscsi_sfnet nfs lockd sunrpc usbserial lp parport autofs4 audit e1000
tg3 floppy sg microcode keybdev mousedev hid input usb-ohci up
CPU:    3
EIP:    0060:[<c0138fe6>]    Not tainted
EFLAGS: 00000082

EIP is at __group_send_sig_info [kernel] 0x3f6 (2.4.21-27.ELsmp/i686)
eax: ddf31880   ebx: 00000282   ecx: 00000000   edx: dcc2c000
esi: 00000012   edi: dbe4a000   ebp: dcc2df40   esp: dcc2deec
ds: 0068   es: 0068   ss: 0068
Process killall5 (pid: 2686, stackpage=dcc2d000)
Stack: 00000012 dcc2df40 dbe4a000 dbe4a000 dcc2df40 00000012 dcc2c000
c01368bc
       00000012 dcc2df40 dbe4a000 00000011 00000000 dcc2c000 00000a7e
000007c9
       bfff9d98 c0137a87 00000012 dcc2df40 ffffffff 00000012 00000000
00000000
Call Trace:   [<c01368bc>] kill_something_info [kernel] 0xcc (0xdcc2df08)
[<c0137a87>] sys_kill [kernel] 0x57 (0xdcc2df30)
[<c0125ed4>] context_switch [kernel] 0xa4 (0xdcc2df60)
[<c0123f14>] schedule [kernel] 0x2f4 (0xdcc2df7c)

Code: f3 90 7e f5 e9 23 d6 ff ff e8 bc 1f fd ff e9 d9 d6 ff ff e8

console shuts up ...



Comment 2 AJ Lewis 2005-01-11 17:58:58 UTC
Reassigning to Tom to look at a kernel-side fix for this.

Comment 3 Josh Hildebrand 2005-01-25 19:33:36 UTC
Created attachment 110211
iscsi shutdown errors

I am having a similar issue.  The machine locks up after the end of what you
see in the screen shot attachment.  Then the machine's watchdog notices the
machine is locked up (after a few minutes pass) and power cycles it.  This is a
Dell PowerEdge 2850.  I opened a case with Dell (754209) but they can't do
anything more than enter a bug (like this one) on bugzilla. ;(

I'll be happy to help debug this issue.  I do not understand why the iscsi
mounts are not unmounting properly.  Nothing was holding them hostage.

Comment 4 Wayne Berthiaume 2005-02-04 18:22:12 UTC
I added -o _netdev to mount as recommended in the README file; 
however, this doesn't seem to have any effect. Obviously, if I 
unmount the filesystems, then I'm able to stop or restart iscsid or 
reboot/shut down the server. The filesystems have no processes 
attached or running that would cause a lock to exist.

Comment 5 Tom Coughlan 2005-02-09 12:57:15 UTC
AJ,

There are two problems here. The first is that the netfs service is not
unmounting the iSCSI devices. This fails even though the device is
mounted with "_netdev". I did find that "umount -a -O _netdev" works,
but "service netfs stop" does not. Would you look into this?

The second problem is a kernel hang when we get to "Sending all
processes the KILL signal" when there are still iSCSI devices mounted.
I will have to look into this one. It will usually be avoided if the
first problem is fixed, but it needs to be fixed anyway.

Tom

Comment 6 AJ Lewis 2005-02-09 17:43:09 UTC
I'll see if I can figure out what's going on with the netfs script.

Comment 7 Tom Coughlan 2005-02-09 20:36:48 UTC
The netfs issue is being dealt with in BZ 147610 (thanks AJ). I'm reassigning
this to me, to deal with the kernel part of it.

Comment 9 Dave Wysochanski 2005-04-07 03:57:40 UTC
Has this been addressed in Update 5?

Comment 10 Tom Coughlan 2005-04-07 19:09:46 UTC
This is not fixed in U5. Our expectation is that this problem will not occur in
normal circumstances, once BZ 147610 is fixed. The latter is planned for U5.
This bug is on the list for U6.

Comment 11 Mike Christie 2005-06-24 21:37:35 UTC
Created attachment 115957
fix sigkill shutdown

With SIGKILL, when we go to kill the other process,
the function does not return until the other process
has run. Unfortunately, we hold the signal lock, which
the other process will also try to acquire.

To fix this, the patch just takes the values we need
and then releases the lock. If a signal comes along
while the lock is dropped, we will get it on the
next spin. That does not happen too often, and this
is not a performance-critical operation.
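
For readers following along, the lock ordering described above can be
illustrated with a small single-threaded model. This is only a sketch
under stated assumptions, not the attached patch and not the iscsi_sfnet
code: struct model, try_lock, peer_step, wait_for_peer and both
kill_peer_* functions are hypothetical names, the "signal lock" is a
plain flag, and the peer is advanced by explicit calls instead of a real
scheduler. A bounded retry count stands in for what would be a lockup on
real hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names come from the driver or patch. */
struct model {
    bool sig_lock_held;   /* stands in for the signal lock */
    bool peer_kill_sent;  /* request for the peer to exit */
    bool peer_exited;     /* set once the peer has run */
};

static bool try_lock(struct model *m)
{
    if (m->sig_lock_held)
        return false;
    m->sig_lock_held = true;
    return true;
}

static void unlock(struct model *m)
{
    assert(m->sig_lock_held);
    m->sig_lock_held = false;
}

/* One scheduling slice of the peer: it must take the lock to exit. */
static void peer_step(struct model *m)
{
    if (!try_lock(m))
        return;               /* a real thread would keep spinning here */
    if (m->peer_kill_sent)
        m->peer_exited = true;
    unlock(m);
}

/* Wait for the peer, giving up after max_spins slices instead of hanging. */
static int wait_for_peer(struct model *m, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        peer_step(m);
        if (m->peer_exited)
            return 0;
    }
    return -1;                /* the peer never got the lock */
}

/* Problem ordering: wait for the peer while still holding the lock. */
static int kill_peer_holding_lock(struct model *m)
{
    if (!try_lock(m))
        return -1;
    m->peer_kill_sent = true;
    int rc = wait_for_peer(m, 1000);
    unlock(m);
    return rc;
}

/* Fixed ordering: record what is needed, drop the lock, then wait. */
static int kill_peer_dropping_lock(struct model *m)
{
    if (!try_lock(m))
        return -1;
    m->peer_kill_sent = true;
    unlock(m);
    return wait_for_peer(m, 1000);
}
```

In this model, kill_peer_holding_lock() gives up with -1 because the
peer can never take the lock, while kill_peer_dropping_lock() returns 0
once the peer has run. The real code paths involve SMP spinlocks and
signal delivery, which this model deliberately leaves out.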

Comment 18 Ernie Petrides 2005-07-13 23:35:26 UTC
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.11.EL).


Comment 23 Red Hat Bugzilla 2005-09-28 14:41:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html