Bug 191831

Summary: kernel BUG at include/asm/spinlock.h:133!
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Keywords: Regression
Reporter: Cheryl L. Southard <cld>
Assignee: Peter Staubach <staubach>
QA Contact: Brian Brock <bbrock>
CC: cwebster, jas, jbaron, jeffery.hanano, jmccann, ken.depetris, paulw, primoz.tolar, racedo, raines, raymond.marx, richard.cunningham, santoshbr, steved
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-08 01:18:21 UTC
Attachments:
  Description             Flags
  /var/log/messages       none
  /var/log/messages       none
  /var/log/messages       none
  /var/log/message dump   none
  Proposed patch          none

Description Cheryl L. Southard 2006-05-15 23:46:19 UTC
Description of problem: 

Ever since we started upgrading our computers from Red Hat Enterprise WS
Update 2 to Update 3, we've been seeing some pretty weird crashes.  A
computer will crash, seemingly at random, in the middle of the night; then with
every subsequent reboot it will consistently crash again near the end of
the boot process.  These computers will reliably boot into single user
mode.  Two of them were crashing when they got to the exportfs command
in /etc/init.d/nfs, so we removed the /etc/exports file, booted the
computers successfully, and restored the /etc/exports file; the computers
booted fine after that.  With the other computer, we downgraded the kernel
from 2.6.9-34.ELsmp back to the Update 2 kernel, 2.6.9-22.0.1.ELsmp,
and it's been fine.

The /etc/exports file is very simple:
     /export/home    @astro-net(rw,insecure) agn(ro)

Version-Release number of selected component (if applicable):

2.6.9-34.ELsmp



How reproducible:

Once the computer is in this state, it reliably crashes every time we
boot into multi-user mode.  It also crashes every time we boot into single
user mode and run the exportfs command or /etc/init.d/nfs start.

Once we use one of the above two fixes, it is difficult to get the computer
back into this weird state.  We can't even get them to reliably crash
if we re-upgrade the kernel to 2.6.9-34.ELsmp AND put the /etc/exports
file back.


Comment 1 Cheryl L. Southard 2006-05-15 23:46:19 UTC
Created attachment 129142 [details]
/var/log/messages

Comment 2 Cheryl L. Southard 2006-05-25 18:51:36 UTC
The new 2.6.9-34.0.1.EL kernel does not fix the problem.  I just tried installing it on another computer 
that was crashing and it continued to crash with these spinlock errors.  Also, this problem occurs on both 
smp and non-smp computers.  

Comment 3 Calvin Webster 2006-05-31 15:32:59 UTC
We just experienced the same problem on our Dell PowerEdge 2800 running RHEL 4 AS.
It happened in the middle of a build on a remote machine using an NFS-mounted
share served from this machine.

Has a fix been issued yet?

This problem has never occurred before. This production server has been returned
to service and I do not wish to try to induce the problem. I will report if it
happens again.

Kernel: 2.6.9-34.ELsmp

Relevant /var/log/messages entries immediately prior to failure:
---------------------------------
May 31 10:42:19 pegasus kernel: eip: f8ecdc00
May 31 10:42:19 pegasus kernel: ------------[ cut here ]------------
May 31 10:42:19 pegasus kernel: kernel BUG at include/asm/spinlock.h:133!
May 31 10:42:19 pegasus kernel: invalid operand: 0000 [#1]
May 31 10:42:19 pegasus kernel: SMP
May 31 10:42:19 pegasus kernel: Modules linked in: parport_pc lp parport autofs4
i2c_dev i2c_core nfsd exportfs lockd nfs_acl sunrpc md5 ipv6 dm_mirror dm_mod
button battery ac uhci_hcd ehci_hcd hw_random shpchp e1000 bonding(U) floppy st
sg ext3 jbd megaraid_mbox megaraid_mm aic7xxx sd_mod scsi_mod
May 31 10:42:19 pegasus kernel: CPU:    2
May 31 10:42:19 pegasus kernel: EIP:    0060:[<c02d11e8>]    Not tainted VLI
May 31 10:42:19 pegasus kernel: EFLAGS: 00010216   (2.6.9-34.ELsmp)
May 31 10:42:19 pegasus kernel: EIP is at _spin_lock+0x1c/0x34
May 31 10:42:19 pegasus kernel: eax: c02e4ca6   ebx: c6453644   ecx: f61d5c70
edx: f8ecdc00
May 31 10:42:19 pegasus kernel: esi: c645363c   edi: f716a700   ebp: 46000000
esp: f61d5c74
May 31 10:42:19 pegasus kernel: ds: 007b   es: 007b   ss: 0068
May 31 10:42:19 pegasus kernel: Process nfsd (pid: 2761, threadinfo=f61d5000
task=f75748b0)
May 31 10:42:19 pegasus kernel: Stack: f61c9810 f8ecdc00 f61c9810 00000001
00000000 f8975084 00000000 c6b4904c
May 31 10:42:19 pegasus kernel:        f88f0aa8 ffffff8c f61ca000 c645363c
f61d5ec8 c586b400 c6b490c8 0000007c
May 31 10:42:19 pegasus kernel:        f5c60718 0000007c 0000007c c027bec6
c60d30cc 00025200 0000fa4b f5c60718
May 31 10:42:19 pegasus kernel: Call Trace:
May 31 10:42:19 pegasus kernel:  [<f8ecdc00>] nfsd_acceptable+0x48/0xba [nfsd]
May 31 10:42:19 pegasus kernel:  [<f8975084>] find_exported_dentry+0x84/0x5e8 [exportfs]
May 31 10:42:19 pegasus kernel:  [<c027bec6>] skb_copy_datagram_iovec+0x53/0x1e5
May 31 10:42:19 pegasus kernel:  [<c027992b>] release_sock+0xf/0x4f
May 31 10:42:19 pegasus kernel:  [<c029e1a2>] tcp_recvmsg+0x64a/0x681
May 31 10:42:19 pegasus kernel:  [<c0279a58>] sock_common_recvmsg+0x30/0x46
May 31 10:42:19 pegasus kernel:  [<c0276720>] sock_recvmsg+0xef/0x10c
May 31 10:42:19 pegasus kernel:  [<c02765e9>] sock_sendmsg+0xdb/0xf7
May 31 10:42:19 pegasus kernel:  [<c011cbf2>] recalc_task_prio+0x128/0x133
May 31 10:42:19 pegasus kernel:  [<c011cc85>] activate_task+0x88/0x95
May 31 10:42:19 pegasus kernel:  [<c011d1a3>] try_to_wake_up+0x281/0x28c
May 31 10:42:19 pegasus kernel:  [<c011e75d>] __wake_up_common+0x36/0x51
May 31 10:42:19 pegasus kernel:  [<c011e7a1>] __wake_up+0x29/0x3c
May 31 10:42:19 pegasus kernel:  [<f8ed250b>] svc_expkey_lookup+0x1f0/0x322 [nfsd]
May 31 10:42:19 pegasus kernel:  [<f897588e>] export_decode_fh+0x61/0x6d [exportfs]
May 31 10:42:19 pegasus kernel:  [<f8ecdbb8>] nfsd_acceptable+0x0/0xba [nfsd]
May 31 10:42:19 pegasus kernel:  [<f897582d>] export_decode_fh+0x0/0x6d [exportfs]
May 31 10:42:19 pegasus kernel:  [<f8ece067>] fh_verify+0x3f5/0x5f6 [nfsd]
May 31 10:42:19 pegasus kernel:  [<f8ecdbb8>] nfsd_acceptable+0x0/0xba [nfsd]
May 31 10:42:19 pegasus kernel:  [<f8eceb3c>] nfsd_lookup+0x45/0x3ad [nfsd]
May 31 10:42:19 pegasus kernel:  [<f8eb1383>] svcauth_unix_set_client+0xa7/0xb5
[sunrpc]
May 31 10:42:19 pegasus kernel:  [<f8eccfb0>] nfsd_proc_lookup+0x5f/0x71 [nfsd]
May 31 10:42:19 pegasus kernel:  [<f8ed4c1d>] nfssvc_decode_diropargs+0x0/0xa7
[nfsd]
May 31 10:42:19 pegasus kernel:  [<f8ecc681>] nfsd_dispatch+0xba/0x16d [nfsd]
May 31 10:42:19 pegasus kernel:  [<f8eae55b>] svc_process+0x432/0x6d7 [sunrpc]
May 31 10:42:19 pegasus kernel:  [<f8ecc45a>] nfsd+0x1cc/0x339 [nfsd]
May 31 10:42:19 pegasus kernel:  [<f8ecc28e>] nfsd+0x0/0x339 [nfsd]
May 31 10:42:19 pegasus kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
May 31 10:42:19 pegasus kernel: Code: 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8
c3 53 89 c3 81 78 04 ad 4e ad de 74 18 ff 74 24 04 68 a6 4c 2e c0 e8 54 14 e5 ff
58 5a <0f> 0b 85 00 60 3d 2e c0 f0 fe 0b 79 09 f3 90 80 3b 00 7e f9 eb
May 31 10:42:19 pegasus kernel:  <0>Fatal exception: panic in 5 seconds
May 31 10:42:21 pegasus ntpd[2948]: synchronized to 192.168.2.5, stratum 2

(System hung at this point - all functions, including console, are unavailable)

May 31 10:56:49 pegasus syslogd 1.4.1: restart.
...
---------------------------------
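
A note on reading the "Code:" line in the dump above, offered as a minimal sketch rather than a diagnosis. The bytes "81 78 04 ad 4e ad de" look like an i386 cmpl of a 32-bit immediate against the word at offset 4 from %eax, and the immediate is stored least-significant byte first. The helper name le32 is ours, and reading the value as the spinlock debug magic number is an assumption, not something the log states.

```shell
# Reassemble a little-endian 32-bit immediate from oops "Code:" bytes.
# Arguments are the four bytes in memory order, e.g. "ad 4e ad de".
le32() {
    printf '0x%s%s%s%s\n' "$4" "$3" "$2" "$1"
}

le32 ad 4e ad de   # prints 0xdead4ead
```

A value of 0xdead4ead would be consistent with a magic-number check failing just before the "<0f> 0b" trap, that is, with _spin_lock being handed memory that is not, or is no longer, an initialized lock.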


Excerpt from /etc/exports:
---------------------------------
/home/cwebster/aegis    av8bdev(rw,async) av8bios(rw,async)
/home/lachman/aegis     av8bdev(rw,async) av8bios(rw,async)
...
/archive/trainer        av8bdev(rw,async) av8bios(rw,async)
---------------------------------

Each developer has a development directory exported from his home. These are
automounted in the same place on each of two legacy development systems. One is
a Concurrent Computer Corp. PowerHawk (ppc) running PowerMAXOS 4.3 and the other
is a Sun SuperSparc running Solaris 6.

/archive/trainer is an exported source code repository, also mounted on the two
legacy systems.



Comment 4 Calvin Webster 2006-05-31 16:55:52 UTC
Okay, it happened again. As soon as I got to the same point on a remote build it
immediately hung again.

At the point the RHEL server panics, the "gmake" on the PowerHawk is executing
an "rsh" to the Sparc. Both the PowerHawk and Sparc are executing commands on
files located in directories NFS mounted from the RHEL server.

I have not changed anything on this server since the last up2date session on Mon
13 Mar 2006 11:01:37 PM EST. I've done over 100 similar builds since then
without any indications of a problem. Now, all of a sudden, we're having these
NFS-related kernel panics.

I have reverted to kernel version 2.6.9-22.0.2.ELsmp and I've been able to get
through a successful build without another kernel panic... so far. Kernel panic
messages are very similar for both failures.

Relevant excerpts from /var/log/messages:
----------------------------------------------------------
May 31 11:47:36 pegasus kernel: eip: f8ecdc00
May 31 11:47:36 pegasus kernel: ------------[ cut here ]------------
May 31 11:47:36 pegasus kernel: kernel BUG at include/asm/spinlock.h:133!
May 31 11:47:36 pegasus kernel: invalid operand: 0000 [#1]
May 31 11:47:36 pegasus kernel: SMP
May 31 11:47:36 pegasus kernel: Modules linked in: parport_pc lp parport autofs4
i2c_dev i2c_core nfsd exportfs lockd nfs_acl sunrpc md5 ipv6 dm_mirror dm_mod
button battery ac uhci_hcd ehci_hcd hw_random shpchp e1000 bonding(U) floppy st
sg ext3 jbd megaraid_mbox megaraid_mm aic7xxx sd_mod scsi_mod
May 31 11:47:36 pegasus kernel: CPU:    2
May 31 11:47:36 pegasus kernel: EIP:    0060:[<c02d11e8>]    Not tainted VLI
May 31 11:47:36 pegasus kernel: EFLAGS: 00010216   (2.6.9-34.ELsmp)
May 31 11:47:36 pegasus kernel: EIP is at _spin_lock+0x1c/0x34
May 31 11:47:36 pegasus kernel: eax: c02e4ca6   ebx: f4a6dd64   ecx: f69e2ca4
edx: f8ecdc00
May 31 11:47:36 pegasus kernel: esi: f4a6dd5c   edi: f712b8c0   ebp: 46000000
esp: f69e2ca8
May 31 11:47:36 pegasus kernel: ds: 007b   es: 007b   ss: 0068
May 31 11:47:36 pegasus kernel: Process nfsd (pid: 2773, threadinfo=f69e2000
task=f69e06b0)
May 31 11:47:36 pegasus kernel: Stack: f4a6dd5c f8ecdc00 f5b74010 00000001
00000000 f8975084 c027bec6 f65f48cc
May 31 11:47:36 pegasus kernel:        f88f0aa8 ffffff8c f4e18b98 f482676c
f69e2efc c5861e00 f4e18980 c027992b
May 31 11:47:36 pegasus kernel:        0000006c 00000246 0000006c c029e1a2
00000004 f5b4a880 00000001 00000000
May 31 11:47:36 pegasus kernel: Call Trace:
May 31 11:47:36 pegasus kernel:  [<f8ecdc00>] nfsd_acceptable+0x48/0xba [nfsd]
May 31 11:47:36 pegasus kernel:  [<f8975084>] find_exported_dentry+0x84/0x5e8 [exportfs]
May 31 11:47:36 pegasus kernel:  [<c027bec6>] skb_copy_datagram_iovec+0x53/0x1e5
May 31 11:47:36 pegasus kernel:  [<c027992b>] release_sock+0xf/0x4f
May 31 11:47:36 pegasus kernel:  [<c029e1a2>] tcp_recvmsg+0x64a/0x681
May 31 11:47:36 pegasus kernel:  [<c0279a58>] sock_common_recvmsg+0x30/0x46
May 31 11:47:36 pegasus kernel:  [<c0276720>] sock_recvmsg+0xef/0x10c
May 31 11:47:36 pegasus kernel:  [<c02765e9>] sock_sendmsg+0xdb/0xf7
May 31 11:47:36 pegasus kernel:  [<c011cbf2>] recalc_task_prio+0x128/0x133
May 31 11:47:36 pegasus kernel:  [<c011cc85>] activate_task+0x88/0x95
May 31 11:47:36 pegasus kernel:  [<c011d1a3>] try_to_wake_up+0x281/0x28c
May 31 11:47:36 pegasus kernel:  [<c011e75d>] __wake_up_common+0x36/0x51
May 31 11:47:36 pegasus kernel:  [<c011e7a1>] __wake_up+0x29/0x3c
May 31 11:47:36 pegasus kernel:  [<f8eae9d3>] svc_sock_enqueue+0x1d3/0x20f [sunrpc]
May 31 11:47:36 pegasus kernel:  [<f8eaf963>] svc_tcp_recvfrom+0x304/0x376 [sunrpc]
May 31 11:47:36 pegasus kernel:  [<f8ed250b>] svc_expkey_lookup+0x1f0/0x322 [nfsd]
May 31 11:47:36 pegasus kernel:  [<f897588e>] export_decode_fh+0x61/0x6d [exportfs]
May 31 11:47:36 pegasus kernel:  [<f8ecdbb8>] nfsd_acceptable+0x0/0xba [nfsd]
May 31 11:47:36 pegasus kernel:  [<f897582d>] export_decode_fh+0x0/0x6d [exportfs]
May 31 11:47:36 pegasus kernel:  [<f8ece067>] fh_verify+0x3f5/0x5f6 [nfsd]
May 31 11:47:36 pegasus kernel:  [<f8ecdbb8>] nfsd_acceptable+0x0/0xba [nfsd]
May 31 11:47:36 pegasus kernel:  [<f8ed6c5b>] nfsacld_proc_getattr+0x6a/0x6f [nfsd]
May 31 11:47:36 pegasus kernel:  [<f8ed6df2>]
nfsaclsvc_decode_fhandleargs+0x0/0x21 [nfsd]
May 31 11:47:36 pegasus kernel:  [<f8ecc681>] nfsd_dispatch+0xba/0x16d [nfsd]
May 31 11:47:36 pegasus kernel:  [<f8eae55b>] svc_process+0x432/0x6d7 [sunrpc]
May 31 11:47:36 pegasus kernel:  [<f8ecc45a>] nfsd+0x1cc/0x339 [nfsd]
May 31 11:47:36 pegasus kernel:  [<f8ecc28e>] nfsd+0x0/0x339 [nfsd]
May 31 11:47:36 pegasus kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
May 31 11:47:36 pegasus kernel: Code: 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8
c3 53 89 c3 81 78 04 ad 4e ad de 74 18 ff 74 24 04 68 a6 4c 2e c0 e8 54 14 e5 ff
58 5a <0f> 0b 85 00 60 3d 2e c0 f0 fe 0b 79 09 f3 90 80 3b 00 7e f9 eb
May 31 11:47:36 pegasus kernel:  <0>Fatal exception: panic in 5 seconds

May 31 11:58:54 pegasus syslogd 1.4.1: restart.
----------------------------------------------------------
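
To compare traces like the two above, it can help to reduce each one to its sequence of symbol names. The following is a minimal sketch, assuming a grep that supports -o (GNU grep does); the function name trace_symbols is ours. It matches any "symbol+0xOFF/0xLEN" token, so it also picks up the "EIP is at" line and entries whose address was wrapped onto the previous line.

```shell
# Print the symbol name of every "symbol+0xOFF/0xLEN" token, in order.
# Reads the named files, or standard input when no file is given.
trace_symbols() {
    grep -oh '[A-Za-z_][A-Za-z0-9_]*+0x[0-9a-f]*/0x[0-9a-f]*' "$@" |
        sed 's/+.*//'
}

# Example (not run here): count how often each symbol shows up.
#   trace_symbols /var/log/messages | sort | uniq -c | sort -rn
```

Saving the output for each crash and running diff on the two files shows where the call paths diverge.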





Comment 5 Jeff 2006-06-14 18:28:33 UTC
Created attachment 130903 [details]
/var/log/messages

Comment 6 Jeff 2006-06-14 18:32:54 UTC
Hi Folks,

I'm also encountering this problem.  The unit is a Dell PowerEdge 2850.
The relevant log entries are posted above.  I'd be very interested if there is a
solution to this issue.  Also, has going back to the old kernel worked as a
workaround until a solution is found?

Thanks,

- Jeff

Comment 7 Calvin Webster 2006-06-14 19:13:18 UTC
I originally reverted to kernel version 2.6.9-22.0.2.ELsmp where it doesn't
panic. However, Red Hat Support gave me a workaround that seems to be working.

Add the "no_subtree_check" option to /etc/exports entries.
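
As a sketch of what that edit could look like in bulk: the filter below appends no_subtree_check to every parenthesized option list. The function name add_no_subtree_check is ours. It assumes each client already has an option list; clients listed without parentheses and empty "()" lists would need to be handled by hand. Whether the option actually avoids the panic is not established by this sketch.

```shell
# Append no_subtree_check to each "(...)" option list in exports-style input.
# Comment lines and lines that already mention the option are left alone.
add_no_subtree_check() {
    sed -e '/^[[:space:]]*#/b' \
        -e '/no_subtree_check/b' \
        -e 's/(\([^)]*\))/(\1,no_subtree_check)/g' "$@"
}

# Example (not run here): write a candidate file and review it first.
#   add_no_subtree_check /etc/exports > /tmp/exports.new
#   diff /etc/exports /tmp/exports.new
```

After reviewing and installing the new file, the exports would be reloaded with "exportfs -ra".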

I've added this to all my /etc/exports entries and installed a test kernel that
allows the netdump client to run with my bonded Ethernet interface. The current
kernels will not support bonded interfaces with netdump.

I wanted to get a good crash dump if it did panic again so I've since upgraded
to a test kernel [2.6.9-37.ELsmp] available from
[http://people.redhat.com/~jbaron/rhel4/RPMS.kernel/]. The test kernel does not
fix the panic bug, but it does allow me to run a netdump client with my bonded
Ethernet interface.

Red Hat Support informs me that the fix for this issue will likely be in RHEL 4
update 4 release. Last word I got was that it is still in testing.


Comment 9 Paul Raines 2006-06-15 15:16:40 UTC
After running for weeks on 2.6.9-34.ELsmp, I have suddenly had several
systems start to get the spinlock.h panic.  In one, it would panic within
seconds of a reboot until we rebooted it to the old 2.6.9-22.0.2.ELsmp
kernel.  I am not sure I buy the "no_subtree_check" bug as this system
was exporting only whole filesystems from their root, no subdirectories
of filesystems.  Does this subtree check still happen then?

Comment 11 Calvin Webster 2006-06-15 16:02:55 UTC
Adding the "no_subtree_check" was a workaround that Red Hat Support suggested
for my circumstances. I'm pretty sure that the "bug" is not in NFS itself, but
in the kernel source code related to it.

As you can see from the description above, we _do_ export subdirectories of
/home and /archive. So, this is a reasonable workaround for us. I can't say for
sure if it is circumventing the problem completely. All I know is that I haven't
had a kernel panic since. If I do, however, I'll have some good crash dump info
to provide Red Hat since I'm running netdump.

Here is the full text of the "workaround" message from Red Hat Support. Again,
this was based upon the symptoms we are seeing at our site with this platform
and configuration. It may not work for you.

-----------------------------------------------------
This is not a solution but a temporary work around. 

You need to mount the NFS shares using NFSv3 protocol in the client side.

mount -o nfsvers=3 172.16.36.109:/backup /mnt/rem-backup

There is one more suggested work around is to specify " no_subtree_check "
option in the exports(NFS server)
-----------------------------------------------------

Using the "nfsvers=3" option on the client side was not possible for us because
we use "legacy" nfs clients that do not allow this option. This may be a viable
workaround for you, though.


Comment 12 Paul Raines 2006-06-26 15:12:49 UTC
Adding no_subtree_check to the exports options does NOT work.  Our main home
directory server, after upgrading to 4.3, would crash within 20 minutes
of being booted with this "kernel BUG at include/asm/spinlock.h" error
even after putting no_subtree_check in all the exports (which, by the way, are
all full filesystem mounts, not subdirectories).  Other less loaded servers may
take days to panic.

The only solution is to boot into the old 2.6.9-22.0.2.ELsmp kernel.
Even the kernel 2.6.9-34 that was last update for 4.2 crashes so it is 
definitely some change in the 2.6.9-22 to 2.6.9-34 move.

We cannot force our 300+ clients to all use nfsvers=3

Comment 13 Calvin Webster 2006-06-26 15:41:40 UTC
I feel your pain Paul.

Have you submitted a support request? That's the preferred method of resolving
critical issues. If you paid for the support, why not use it? When lots of
customers register an issue it gives them an idea of just how critical it is.

I was recently notified that there's a test kernel available at:

http://people.redhat.com/~jbaron/rhel4/RPMS.kernel/

It contains a patch for the kernel bug as well as the "netdump-bonded-interface"
patch so netdump will work on bonded Ethernet interfaces. I went ahead and
installed it and haven't seen any problems to date. My problem wasn't as
persistent as yours, though.

I'm told that RHEL4 U4 beta is available too.

If you don't want to try a test kernel or beta release, you're better off
staying with the old kernel and waiting until the official release of RHEL4 U4
in the next few months.



Comment 14 Jason Baron 2006-07-19 14:56:54 UTC
We believe that this is a duplicate of bz #178848, which is resolved in the U4
beta available from: http://people.redhat.com/~jbaron/rhel4/

I'd appreciate if somebody could verify the fix with the U4 beta kernel.

Comment 15 Calvin Webster 2006-07-21 14:18:45 UTC
I've installed the U4 beta kernel and will be booting into it at the end of the
day today. I'll post the crash dump if it panics. I'm running netdump to another
RHEL 4 server.

Comment 16 Calvin Webster 2006-07-26 14:42:51 UTC
Three days of development using the new kernel, including a series of intense
test builds, have elapsed without incident. The test builds were designed to
simulate the same conditions under which the previous kernels panicked.

I could not access bug #178848 to compare symptoms.

Comment 17 Cheryl L. Southard 2006-08-01 16:34:13 UTC
Created attachment 133414 [details]
/var/log/messages

Comment 18 Cheryl L. Southard 2006-08-01 16:37:44 UTC
Hi,

We tried adding "no_subtree_check" to /etc/exports but that didn't fix the 
problem.

We also tried upgrading to the beta kernel, 2.6.9-42.ELsmp, and one of our
computers crashed with the spinlock bug within a day.  The above attachment
from our /var/log/messages file shows the crash.

Comment 19 Cheryl L. Southard 2006-08-21 15:51:30 UTC
We now have about 5 computers at 2.6.9-42.ELsmp that crash with this spinlock bug.

Comment 20 Paul Raines 2006-08-21 17:53:18 UTC
I have a server that until this weekend was running kernel-smp-2.6.9-34.
I did a long overdue update on it and it got updated to kernel-smp-2.6.9-34.0.2
as well as glibc-2.3.4-2.19 (a total of 172 rpm updates).  Within minutes,
and sometimes seconds, of this box booting and running NFS, it would
crash with this spinlock.h panic.  I tried going back to the 2.6.9-34 kernel,
which it was happily using the day before, but it still panicked, which makes me
believe now that it has something more to do with the glibc update.  Anyway, I
tried the beta 2.6.9-42.ELsmp kernel and it still panicked.  I have now installed
2.6.9-22.0.2.ELsmp from 4.2 and it is stable.

I upgraded several other servers in exactly the same way and they still have
the new 2.6.9-34.0.2 running and have not shown the problem (yet).  The server
that does have the problem is different in that it is the busiest and also
runs samba as a PDC, a Flexnet license server, and is an ntpd master.
This also means it is a critical server and I cannot afford to do beta
testing on it.

Anyway, I think it might be glibc instead of the kernel that is the source
of the problem.


Comment 21 Shalom 2006-09-03 08:16:16 UTC
Hi,
 
We have a Red Hat AS 4.0 U3 system with the latest kernel, and it crashes when
we try to export the NFS file system there.
The latest patches didn't solve the problem, so I put together a workaround
that took me some time, and it seems to be working.
I have forced the server to work with NFS version 3 and let mountd work with
versions 2 and 3.
I have also set the firewall to block TCP connections to port 2049, and the
machine seems to be working fine, but over UDP.
 
Regards,
Shalom

Comment 22 Peter Staubach 2006-09-05 13:36:07 UTC
Would it be possible for someone, who can reproduce the problem, to try
using an update 4 kernel, please?

Comment 23 Shalom 2006-09-07 08:26:47 UTC
I have used the Update 4 kernel, 2.6.9-42.ELsmp, and it didn't solve the problem.
I wrote by mistake that we have U3.
The first thing that I did was install the latest patches.

Comment 24 Peter Staubach 2006-09-07 12:30:47 UTC
Thanx for trying.

Just to be sure -- which combinations of NFS protocol version and
transport choice work and which ones do not?  Out of NFSv2/UDP,
NFSv2/TCP, NFSv3/UDP, and NFSv3/TCP, what works and what does not?

Comment 25 Shalom 2006-09-07 14:01:33 UTC
I have tried NFSv2/TCP, NFSv3/TCP, NFSv2/UDP and NFSv3/UDP.
works - all the options of UDP.
doesn't work - all the options of TCP.
I have blocked it with the iptables.
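
The exact firewall rule is not given above. One way it could look is sketched below; the function names and the DRY_RUN wrapper are ours, and using REJECT with a TCP reset (rather than DROP) is an assumption made so that clients fail quickly instead of hanging.

```shell
# Run a command, or just print it when DRY_RUN is set (handy for review).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Refuse incoming NFS-over-TCP connections; NFS over UDP is not affected.
block_nfs_tcp() {
    run iptables -I INPUT -p tcp --dport 2049 -j REJECT --reject-with tcp-reset
}

# Example (not run here): show the rule without applying it.
#   DRY_RUN=1 block_nfs_tcp
```

The rule is not persistent; it would have to be saved with the distribution's usual iptables mechanism to survive a reboot.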

Comment 26 Peter Staubach 2006-09-07 14:38:35 UTC
Thanx!  That will help me to look at the correct areas.

Comment 29 John F. Siebenand 2006-09-07 18:04:41 UTC
Have duplicated this problem on RHEL4 U3 when
client is Solaris Sparc which has NFS mounted with -o vers=2
then performs a file copy operation.

Comment 30 Jeff Layton 2006-09-07 18:09:17 UTC
Yes, that problem should be resolved in U4. Solaris attempts to mount NFSv2 with
ACL's enabled and that was triggering the bug. U4 disables ACL checking in NFSv2
and should work around this. Apparently, however, there are other ways to
trigger this problem that don't involve ACL's on NFSv2.


Comment 31 Peter Staubach 2006-09-07 18:30:56 UTC
John, how reproducible is your situation?

Comment 32 John F. Siebenand 2006-09-12 15:40:57 UTC
Created attachment 136084 [details]
/var/log/message dump

Comment 37 Ken DePetris 2006-09-15 21:43:37 UTC
kernel: kernel BUG at include/asm/spinlock.h:133!

I am also experiencing this problem, we have two identical HP servers. Both are 
configured and patched exactly the same, running the same software. The kernels 
were upgraded (along with all other RHN patches) about 1 month ago, the 2.6.9-
42.ELsmp kernel has been running without incident since the patches. Today the 
server crashed, and would not recover on reboot. Hung at different points of 
booting, regressing to 2.6.9-34.ELsmp has caused the problem to (for now?) go 
away. The second (less busy) server has had no problems at all. These servers 
handle thousands of mount/unmount NFS requests daily, from Sun, AIX, and other 
Linux servers.
Since this is a production unit, I cannot test (or try) methods of fixing. Are 
there any certain methods to offer greater stability? I have read this thread 
but am unsure if any of the 'solutions' are actual fixes since this problem 
seems difficult to reproduce.

Comment 43 Peter Staubach 2006-10-18 13:46:23 UTC
For the deployments which are seeing these problems, is NFSv2 being
forced on the clients?  There was some mention of legacy clients,
what are these?

Comment 44 Cheryl L. Southard 2006-10-19 19:36:11 UTC
Hi,

We are not forcing the clients to mount with NFSv2.  It's not the
clients that are crashing though.  Only the NFS servers get the spinlock
error.  And they seem to crash when we run the exportfs command in the nfs startup
scripts.

Comment 45 Peter Staubach 2006-10-19 19:39:36 UTC
If the clients are not being forced to NFSv2, then why are they mounting
using NFSv2?  The only clients which support the NFS_ACL protocol default
to using NFSv3 or now NFSv4.  Things have to go really wrong before they
will revert back to NFSv2 unless they are told to.

So, the question again -- what are these legacy clients which are
causing the NFS server to fail?

Comment 47 Jeff Layton 2006-10-24 16:29:55 UTC
One possible workaround (Peter correct me if I'm wrong), would be to have mountd
deny NFSv2 mount requests. To do this, you'd want to add this line to
/etc/sysconfig/nfs:

MOUNTD_NFS_V2=no

...of course, this assumes you are not using NFSv2 at all, which should be the
case on any reasonably modern OS.

This will, unfortunately, not have any effect on hosts that already have NFSv2
mounts on this server (since mounts persist across server reboots), so you'd
need to check all the clients and make sure that no such mounts exist.

To this day we still don't have *any* confirmed cases where this problem
occurred and there was no NFSv2 traffic. We'd definitely be interested if anyone
can come up with such a case.
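
A rough server-side check for NFSv2 traffic, as a sketch only: it assumes the kernel exposes per-version call counters in /proc/net/rpc/nfsd (the file nfsstat reads), with a line of the form "proc2 <count> <counter> <counter> ...". The function name nfsv2_calls is ours. The total includes NULL-procedure pings, and the counters start from zero when the server is booted, so a zero only covers the time since then.

```shell
# Sum the NFSv2 procedure call counters from an nfsd statistics file.
# Pass "-" to read the statistics from standard input.
nfsv2_calls() {
    awk '$1 == "proc2" { for (i = 3; i <= NF; i++) n += $i }
         END { print n + 0 }' "${1:-/proc/net/rpc/nfsd}"
}

# Example (not run here): a non-zero result means some client spoke NFSv2.
#   nfsv2_calls
```

A panic on a server where this still reports 0 would be the kind of case asked for above.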


Comment 59 Jason Baron 2006-11-10 15:01:29 UTC
committed in stream U5 build 42.24. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/

i hope it fixes this :)

Comment 63 RHEL Program Management 2006-11-28 03:35:06 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 68 RHEL Program Management 2006-12-12 17:01:03 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 70 Joshua Myles 2006-12-15 22:13:01 UTC
(In reply to comment #47)
> One possible workaround (Peter correct me if I'm wrong), would be to have mountd
> deny NFSv2 mount requests. To do this, you'd want to add this line to
> /etc/sysconfig/nfs:
> 
> MOUNTD_NFS_V2=no

This setting only affects mountd. nfsd can still provide v2 unless the following
is also set in /etc/sysconfig/nfs:

RPCNFSDARGS="-N 2"

I was able to reliably crash a 2.6.9-42.EL machine by mounting from a Solaris
client using "-o vers=2", even with MOUNTD_NFS_V2=no. Adding RPCNFSDARGS as
above resolved the problem.
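
To confirm which NFS versions the server is still offering after such a change, the portmapper listing can be filtered. This is a minimal sketch: the function name nfs_registered is ours, and it assumes the usual "program vers proto port" column layout printed by rpcinfo -p.

```shell
# List the version/transport pairs registered for the NFS RPC program.
# Expects the output of "rpcinfo -p" on standard input.
nfs_registered() {
    # 100003 is the RPC program number assigned to NFS.
    awk '$1 == 100003 { print "v" $2 "/" $3 }' | sort -u
}

# Example (not run here):
#   rpcinfo -p localhost | nfs_registered
```

With nfsd started with "-N 2", no v2 entries would be expected in the output. mountd registers under a different program number and is not shown by this filter.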

Comment 71 Jay Turner 2007-01-02 13:43:26 UTC
QE ack for 4.5.

Comment 73 Cheryl L. Southard 2007-01-04 19:34:48 UTC
(In reply to comment #59)
> committed in stream U5 build 42.24. A test kernel with this patch is available
> from http://people.redhat.com/~jbaron/rhel4/
> 
> i hope it fixes this :)

Hi Jason,

Thanks for the beta kernel.  We tried a bunch of them from that directory:
2.6.9-42.24
2.6.9-42.32
2.6.9-42.36

They all seem to fix the spinlock problem!  Thanks for the fix!  We've
tried it on about 4 computers so far that were spinlocking, and so far
the problem has not reoccurred.

However, we have noticed on all 3 of these beta kernels that they cause
a somewhat different NFS bug to appear.  Where do I report problems
with these beta kernels?

Thanks again for the fix!

Comment 74 Jeff Layton 2007-01-04 20:06:10 UTC
What sort of bug?


Comment 75 Cheryl L. Southard 2007-01-04 20:42:34 UTC
(In reply to comment #74)
> What sort of bug?
> 

Well, I haven't had time to go through all of the source code and figure out exactly what's
going on, but the problem is with systems running the 2.6.9-42.24, 2.6.9-42.32 and
2.6.9-42.36 kernels and with the mail client called "pine".   When we are running pine
and are in the "MESSAGE INDEX" routine looking at the index of a mail file that is NFS
mounted, we are never notified of new e-mails.   We have to exit from the current mail file
then re-open it, or quit pine and restart it to see any incoming e-mail.  

Comment 79 Jason Baron 2007-01-09 21:30:20 UTC
*** Bug 220771 has been marked as a duplicate of this bug. ***

Comment 86 Peter Staubach 2007-02-14 22:18:40 UTC
Okie doke.  I'm glad that that was easily resolved. :-)

Comment 87 Peter Staubach 2007-02-28 21:33:48 UTC
*** Bug 228273 has been marked as a duplicate of this bug. ***

Comment 88 Jason Baron 2007-03-12 17:41:50 UTC
*** Bug 230094 has been marked as a duplicate of this bug. ***

Comment 92 Red Hat Bugzilla 2007-05-08 01:18:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html

Comment 101 Issue Tracker 2007-06-20 17:35:56 UTC
Closing per today's call - issue resolved with the given fix.

Internal Status set to 'Resolved'
Status set to: Closed by Tech
Resolution set to: 'NotABug'

This event sent from IssueTracker by sfolkwil 
 issue 123455

Comment 102 Cheryl L. Southard 2007-06-22 16:30:05 UTC
I just wanted to let you know that I found a workaround for the pine problem
which started happening with this "fixed" kernel.  Ever since the spinlock
problem was fixed with these beta kernels, and even now with the newly
released RHEL 4 U5 kernel (2.6.9-55.EL), we've been having problems with pine
seeing new e-mails.  I mentioned it in one of my comments above.  The
workaround is to NFS mount our /var/mail spool directory with "udp" instead of
"tcp".  With "tcp" we are never notified of new incoming e-mails.  With "udp"
everything works as expected.  This is probably mentioned in another
bugzilla, but I wanted to follow up in this one because I brought it up back in
January.  Thanks again for the fix.