Bug 228273

Summary: nfsd process, kernel BUG at include/asm/spinlock.h:133!
Product: Red Hat Enterprise Linux 4 Reporter: Primoz Tolar <primoz.tolar>
Component: kernelAssignee: Peter Staubach <staubach>
Status: CLOSED DUPLICATE QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.4CC: leroy.vanlogchem, steved
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-02-28 21:33:43 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
kernel panic messages none

Description Primoz Tolar 2007-02-12 12:59:03 UTC
Description of problem:


 Kernel panic every day or two days. Always fails in nfsd process with the
message: kernel BUG at include/asm/spinlock.h:133!
 We have a ProLiant D380G5 with double Xeon, 8Gb of RAM and MAS1000 disk array.
 It happens on CentOS 4.4 but I guess it's relevant to you guys too.
 I attached the kernel messages respective to versions below.

Version-Release number of selected component (if applicable):

Feb  5 14:23:50 adore kernel: Linux version 2.6.9-42.0.8.ELsmp
(mockbuild.org) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3))
#1 SMP Tue Jan 30 12:33:47 EST 2007
Feb  7 14:13:29 adore kernel: Linux version 2.6.9-42.0.8.ELsmp
(mockbuild.org) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3))
#1 SMP Tue Jan 30 12:33:47 EST 2007
Feb  8 13:54:36 adore kernel: Linux version 2.6.9-42.0.8.ELsmp
(mockbuild.org) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3))
#1 SMP Tue Jan 30 12:33:47 EST 2007
Feb  9 09:39:43 adore kernel: Linux version 2.6.9-42.0.8.ELsmp
(mockbuild.org) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3))
#1 SMP Tue Jan 30 12:33:47 EST 2007
Feb  9 11:04:52 adore kernel: Linux version 2.6.9-42.0.3.ELsmp
(buildsvn@build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Fri
Oct 6 06:21:39 CDT 2006
Feb  9 11:15:39 adore kernel: Linux version 2.6.9-42.0.3.ELsmp
(buildsvn@build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Fri
Oct 6 06:21:39 CDT 2006
Feb  9 11:22:47 adore kernel: Linux version 2.6.9-42.0.8.ELsmp
(mockbuild.org) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3))
#1 SMP Tue Jan 30 12:33:47 EST 2007
Feb  9 11:27:36 adore kernel: Linux version 2.6.9-42.0.3.ELsmp
(buildsvn@build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Fri
Oct 6 06:21:39 CDT 2006
Feb  9 11:38:32 adore kernel: Linux version 2.6.9-42.ELsmp
(buildcentos@build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP
Sat Aug 12 09:39:11 CDT 2006


How reproducible:

I guess some nfs usage triggers it. No idea how to reproduce it.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Primoz Tolar 2007-02-12 12:59:03 UTC
Created attachment 147889 [details]
kernel panic messages

Comment 2 Primoz Tolar 2007-02-13 07:39:33 UTC
Tonight, a different kernel module crashed the server:

Feb 13 01:26:11 adore kernel: eip: c0172b81
Feb 13 01:26:11 adore kernel: ------------[ cut here ]------------
Feb 13 01:26:11 adore kernel: kernel BUG at include/asm/spinlock.h:133!
Feb 13 01:26:11 adore kernel: invalid operand: 0000 [#1]
Feb 13 01:26:11 adore kernel: SMP
Feb 13 01:26:11 adore kernel: Modules linked in: mptctl mptbase 8021q sg
cpqci(U) nfs nfsd exportfs lockd nfs_acl parport_pc lp parport autofs4 i2c_dev
i2c_core sunrpc button battery ac md5 ipv6 joydev ehci_hcd uhci_hcd bnx2 qla2400
qla2300 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2322 cciss qla2xxx
scsi_transport_fc sd_mod scsi_mod
Feb 13 01:26:11 adore kernel: CPU:    2
Feb 13 01:26:11 adore kernel: EIP:    0060:[<c02d3296>]    Tainted: P      VLI
Feb 13 01:26:11 adore kernel: EFLAGS: 00010216   (2.6.9-42.ELsmp) 
Feb 13 01:26:11 adore kernel: EIP is at _spin_lock+0x1c/0x34
Feb 13 01:26:11 adore kernel: eax: c02e6f92   ebx: f708221c   ecx: df2adee0  
edx: c0172b81
Feb 13 01:26:11 adore kernel: esi: f7082214   edi: f708221c   ebp: 00000020  
esp: df2adee4
Feb 13 01:26:11 adore kernel: ds: 007b   es: 007b   ss: 0068
Feb 13 01:26:11 adore kernel: Process vbda (pid: 17506, threadinfo=df2ad000
task=f7102d30)
Feb 13 01:26:11 adore kernel: Stack: 00000000 c0172b81 00000000 df2adf20
f58669d8 000001f0 c0172482 f7082214
Feb 13 01:26:11 adore kernel:        45d105a3 00000000 f5866a48 00000000
f58669d8 df2adf60 c0159c47 000001f0
Feb 13 01:26:11 adore kernel:        00000000 00000000 00000000 f7e2c600
f890e7f5 45d092cc 00000000 45ca4bd7
Feb 13 01:26:11 adore kernel: Call Trace: 
Feb 13 01:26:11 adore kernel:  [<c0172b81>] dnotify_parent+0x1b/0x6e
Feb 13 01:26:11 adore kernel:  [<c0172482>] notify_change+0x1e2/0x1ec
Feb 13 01:26:11 adore kernel:  [<c0159c47>] sys_utime+0x110/0x134
Feb 13 01:26:11 adore kernel:  [<c01268d8>] current_fs_time+0x44/0x4c
Feb 13 01:26:11 adore kernel:  [<c016b158>] filldir64+0x0/0x11a
Feb 13 01:26:11 adore kernel:  [<c0171bc9>] update_atime+0x6e/0x93
Feb 13 01:26:11 adore kernel:  [<c016f070>] dput+0x34/0x1a7 
Feb 13 01:26:11 adore kernel:  [<c015bc5b>] __fput+0xda/0x100
Feb 13 01:26:11 adore kernel:  [<c02d4703>] syscall_call+0x7/0xb 
Feb 13 01:26:11 adore kernel:  [<c02d007b>] packet_rcv+0x242/0x307
Feb 13 01:26:11 adore kernel: Code: 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8 c3
53 89 c3 81 78 04 ad 4e ad de 74 18 ff 74 24 04 68 92 6f 2e c0 e8 0a f6 e4 ff 58
5a <0f> 0b 85 00 fd 5f 2e c0 f0 fe 0b 79 09 f3 90 80 3b 00 7e f9 eb 
Feb 13 01:26:11 adore kernel:  <0>Fatal exception: panic in 5 seconds


Comment 3 Leroy van Logchem 2007-02-13 22:16:24 UTC
Yesterday we had three of these panics too. Listed below are two of them:

Feb 12 09:58:15 kernel: eip: c016ff1d
Feb 12 09:58:15 kernel: ------------[ cut here ]------------
Feb 12 09:58:15 kernel: kernel BUG at include/asm/spinlock.h:133!
Feb 12 09:58:15 kernel: invalid operand: 0000 [0000001]
Feb 12 09:58:15 kernel: SMP
Feb 12 09:58:15 kernel: Modules linked in: nfsd exportfs nfs lockd nfs_acl
sunrpc drbd(U) md5 ipv6 autofs4 dm_mod button battery ac uhci_hcd ehci_hcd
e752x_edac edac_mc e1000 floppy aic79xx sg ext3 jbd arcmsr(U) sd_mod scsi_mod
Feb 12 09:58:15 kernel: CPU: 0
Feb 12 09:58:15 kernel: EIP: 0060:[<c02d335a>] Not tainted VLI
Feb 12 09:58:15 kernel: EFLAGS: 00010216 (2.6.9-42.0.3.ELsmp)
Feb 12 09:58:15 kernel: EIP is at _spin_lock+0x1c/0x34
Feb 12 09:58:15 kernel: eax: c02e7076 ebx: d64656dc ecx: cbd98d38 edx: c016ff1d
Feb 12 09:58:15 kernel: esi: d646573c edi: 00000000 ebp: edd84e8c esp: cbd98d3c
Feb 12 09:58:15 kernel: ds: 007b es: 007b ss: 0068
Feb 12 09:58:15 kernel: Process smbd (pid: 13923, threadinfo=cbd98000 task=e5ba42b0)
Feb 12 09:58:15 kernel: Stack: d64656d4 c016ff1d 00000000 c5782058 eb0296a9
e6648c8e 00000007 cbd98db4
Feb 12 09:58:15 kernel: e6648c8e cbd98f0c 00000000 cbd98db4 c0166777 ee942100
cbd98dac e6648c8e
Feb 12 09:58:15 kernel: d9e3c0f8 e6648c8e cbd98f0c c0166b19 00000000 00000003
e032ba94 00000000
Feb 12 09:58:15 kernel: Call Trace:
Feb 12 09:58:15 kernel: [<c016ff1d>] __d_lookup+0x8a/0x109
Feb 12 09:58:15 kernel: [<c0166777>] do_lookup+0x1f/0x8f
Feb 12 09:58:15 kernel: [<c0166b19>] __link_path_walk+0x332/0xbb5
Feb 12 09:58:15 kernel: [<c01673df>] link_path_walk+0x43/0xbe
Feb 12 09:58:15 kernel: [<c016720b>] __link_path_walk+0xa24/0xbb5
Feb 12 09:58:15 kernel: [<c01673df>] link_path_walk+0x43/0xbe
Feb 12 09:58:15 kernel: [<c0167774>] path_lookup+0x14b/0x17f
Feb 12 09:58:15 kernel: [<c01678bc>] __user_walk+0x21/0x51
Feb 12 09:58:15 kernel: [<c0162b41>] vfs_stat+0x14/0x3a
Feb 12 09:58:15 kernel: [<c016314a>] sys_stat64+0xf/0x23
Feb 12 09:58:15 kernel: [<c02d47cb>] syscall_call+0x7/0xb
Feb 12 09:58:15 kernel: [<c02d007b>] packet_rcv+0x17e/0x307
Feb 12 09:58:15 kernel: Code: 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8 c3 53 89
c3 81 78 04 ad 4e ad de 74 18 ff 74 24 04 68 76 70 2e c0 e8 46 f5 e4
ff 58 5a <0f> 0b 85 00 e1 60 2e c0 f0 fe 0b 79 09 f3 90 80 3b 00 7e f9 eb
Feb 12 09:58:15 kernel: <0>Fatal exception: panic in 5 seconds

And another one a while later


Feb 12 10:30:26 kernel: eip: c016ff1d
Feb 12 10:30:26 kernel: ------------[ cut here ]------------
Feb 12 10:30:26 kernel: kernel BUG at include/asm/spinlock.h:133!
Feb 12 10:30:26 kernel: invalid operand: 0000 [0000001]
Feb 12 10:30:26 kernel: SMP
Feb 12 10:30:26 kernel: Modules linked in: nfsd exportfs lockd nfs_acl sunrpc
drbd(U) md5 ipv6 autofs4 dm_mod button battery ac uhci_hcd ehci_hcd e752x_edac
edac_mc e1000 floppy aic79xx sg ext3 jbd arcmsr(U) sd_mod scsi_mod
Feb 12 10:30:26 kernel: CPU: 0
Feb 12 10:30:26 kernel: EIP: 0060:[<c02d335a>] Not tainted VLI
Feb 12 10:30:26 kernel: EFLAGS: 00010216 (2.6.9-42.0.3.ELsmp)
Feb 12 10:30:26 kernel: EIP is at _spin_lock+0x1c/0x34
Feb 12 10:30:26 kernel: eax: c02e7076 ebx: d46fb644 ecx: d1eace9c edx: c016ff1d
Feb 12 10:30:26 kernel: esi: d46fb6a4 edi: e8e6c2ac ebp: e8e6c2ac esp: d1eacea0
Feb 12 10:30:26 kernel: ds: 007b es: 007b ss: 0068
Feb 12 10:30:26 kernel: Process nfsd (pid: 5103, threadinfo=d1eac000 task=ef4ed6b0)
Feb 12 10:30:26 kernel: Stack: d46fb63c c016ff1d 00000000 c57bdbf0 d1ea1090
11dd5c7d 00000005 d1eacf00
Feb 12 10:30:26 kernel: 00000000 d1eacf00 e8e6c2ac 00000000 c01664dc 00000000
e8e6c2ac d1eacf00
Feb 12 10:30:26 kernel: 00000000 c01677ee d1f9cbc0 ffffffff d1ea1095 11dd5c7d
e8e6c2ac c016788c
Feb 12 10:30:26 kernel: Call Trace:
Feb 12 10:30:26 kernel: [<c016ff1d>] __d_lookup+0x8a/0x109
Feb 12 10:30:26 kernel: [<c01664dc>] cached_lookup+0xf/0x56
Feb 12 10:30:26 kernel: [<c01677ee>] __lookup_hash+0x46/0x89
Feb 12 10:30:26 kernel: [<c016788c>] lookup_one_len+0x54/0x63
Feb 12 10:30:26 kernel: [<fac0fe18>] nfsd_lookup+0x321/0x3ad [nfsd]
Feb 12 10:30:26 kernel: [<fabbb442>] svcauth_unix_set_client+0xa7/0xb5 [sunrpc]
Feb 12 10:30:26 kernel: [<fac18275>] nfsd3_proc_lookup+0xa9/0xb3 [nfsd]
Feb 12 10:30:26 kernel: [<fac1a253>] nfs3svc_decode_diropargs+0x0/0xfa [nfsd]
Feb 12 10:30:26 kernel: [<fac0d681>] nfsd_dispatch+0xba/0x16d [nfsd]
Feb 12 10:30:26 kernel: [<fabb8603>] svc_process+0x432/0x6e3 [sunrpc]
Feb 12 10:30:26 kernel: [<fac0d45a>] nfsd+0x1cc/0x339 [nfsd]
Feb 12 10:30:26 kernel: [<fac0d28e>] nfsd+0x0/0x339 [nfsd]
Feb 12 10:30:26 kernel: [<c01041f5>] kernel_thread_helper+0x5/0xb
Feb 12 10:30:26 kernel: Code: 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8 c3 53 89
c3 81 78 04 ad 4e ad de 74 18 ff 74 24 04 68 76 70 2e c0 e8 46 f5 e4 ff 58 5a
<0f> 0b 85 00 e1 60 2e c0 f0 fe 0b 79 09 f3 90 80 3b 00 7e f9 eb
Feb 12 10:30:26 kernel: <0>Fatal exception: panic in 5 seconds

The netdump service has been enabled, next time I hope to gather a coredump.

Comment 4 Gatis Visnevskis 2007-02-17 13:11:49 UTC
Reproduce is easy.
Make shell script:
mount -o vers=2 nfsserver:/share /point
cp some file
sleep 1;
umount
repeat

disabling hyper-threading in cpu helps, also booting uni-cpu kernel makes crash
not soo often and in different place

Comment 5 Leroy van Logchem 2007-02-17 16:49:31 UTC
http://people.redhat.com/~jbaron/rhel4/SRPMS.kernel/kernel-2.6.9-48.EL.src.rpm
solved the problem here, YMMV.

Comment 6 Gatis Visnevskis 2007-02-19 13:25:48 UTC
I think, this is duplicate of BUG 191831.

Thanks for this.

Is this latest kernel tested? Is it supported?

We are going to put it on production server.
(after some testing, of course :)


Comment 7 Gatis Visnevskis 2007-02-20 15:05:26 UTC
Hmm, i expected different result with -48 kernel:
(from Solaris 8/Sparc box)

mount -o vers=2 srv2:/share /mnt/nfs/share
NFS getattr failed for server srv2: error 9 (RPC: Program/version mismatch)
nfs mount: mount: /mnt/nfs/share: I/O error


Comment 8 Peter Staubach 2007-02-28 21:33:43 UTC
The situation described in Comment #7 is a bug in Solaris.  I recall
fixing it quite sometime ago in a newer Solaris release.  I would
suggest upgrading or at least complaining to the Solaris folks.  It
is definitely fixed in Solaris 10 GA.

*** This bug has been marked as a duplicate of 191831 ***