Bug 140319 - NFS server hangs overnight
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Steve Dickson
Reported: 2004-11-22 05:56 EST by Tim Towers
Modified: 2007-11-30 17:07 EST

Doc Type: Bug Fix
Last Closed: 2007-10-19 15:13:36 EDT


Attachments
output of ksymoops-2.4.9 (2.53 KB, text/plain)
2004-12-02 05:44 EST, Tim Towers

Description Tim Towers 2004-11-22 05:56:05 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041111 Firefox/1.0

Description of problem:
I am seeing a problem with a Fedora Core 3 client accessing a Red Hat
AS3 server over NFS. The AS3 machine did not go down as far as I am
aware, but sometime overnight one of my mounted partitions got into a
dead state. All accesses to it hung and did not time out. I was able
to log onto the server and visit the exported filesystems to validate
that they were OK. I was only able to fix it on the client by making
sure nothing was accessing it, forcing the unmount with "umount -f",
and remounting by hand. The following errors were in dmesg:

nfs_statfs: statfs error = 512
RPC: error 5 connecting to server filehost
RPC: error 512 connecting to server filehost
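The manual client-side recovery described above can be sketched roughly as follows. This is a hedged sketch, not the reporter's exact commands: the mount point `/mnt/data` and the export `filehost:/export/data` are placeholders.

```shell
# Sketch of the recovery described above (paths are placeholders).
fuser -vm /mnt/data                            # check nothing is still using the mount
umount -f /mnt/data                            # force-unmount the dead NFS mount
mount -t nfs filehost:/export/data /mnt/data   # remount by hand
```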

This has happened every night since I upgraded to a Fedora Core 3
client. It only happens to one of the three partitions that I mount on
my client, and I suspect it is load-triggered, because the problem
partition can get busy overnight with batch jobs. (That is why I did
not log it as a Fedora bug.)

Version-Release number of selected component (if applicable):
kernel-2.4.21-4.EL

How reproducible:
Always

Steps to Reproduce:
1. Mount partitions over NFS from an Advanced Server 3 server to a
Fedora Core 3 client.
2. Do lots of I/O on one of the partitions; rsync is our recommended
method.
3. Wait.
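A minimal script along the lines of these steps might look like the following. The host name, export path, and directories are assumptions, not values from the report:

```shell
#!/bin/sh
# Sketch of the reproduction steps above (host and paths are placeholders).

# 1. Mount an export from the AS3 server on the FC3 client:
mount -t nfs as3server:/export/busy /mnt/busy

# 2. Generate sustained I/O on the mounted partition with rsync:
while true; do
    rsync -a /mnt/busy/source/ /mnt/busy/copy/
done

# 3. Wait (typically overnight) for the mount to wedge.
```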
    

Actual Results:  One of the three NFS partitions becomes unavailable.
Every command that touches it goes into a waiting state and does not
return. As this partition is included in my $PATH, this means that lots
of things break; for example, xlock will not unlock the screen but
stops whilst displaying "Checking..." after I type my password.

Expected Results:  intermittent outages over NFS may cause pauses, but
the requests should be retried and should repair themselves.

Additional info:
 Whilst this is slightly similar to bug 118413, we do not use autofs.
Also, I don't get the kernel oops message.
Comment 1 Ernie Petrides 2004-11-22 15:48:10 EST
Can this problem be reproduced on RHEL3 U3 (released)
or on RHEL3 U4 (currently in the RHN beta channel)?
Comment 2 Tim Towers 2004-11-29 19:01:25 EST
Currently reproduced on Advanced Server 3, released last year.

When it failed this morning, unmounting the filesystem with "umount
-f" twice caused an oops on my Fedora Core 3 client. Regardless of what
the server is sending over the wire, it should not cause an oops on my
client, so I am now sure there is something wrong with the shipped
Fedora Core 3 kernel. I have since upgraded from the initial
2.6.9.667smp kernel to 2.6.9.68?smp using yum.

I will provide an update in a few days about the status of the bug,
but it looks like the AS3 (kernel 2.4.21-4.EL) NFS server tickles a
fatal error in the Fedora Core 3 kernel 2.6.9. This may mean that I
assigned the bug partially to the wrong category.
Comment 3 Steve Dickson 2004-11-30 09:01:02 EST
If possible, could you please post the oops the next time it happens?
Comment 4 Tim Towers 2004-12-01 19:51:27 EST
After deciding that the Core 3 kernel I had was fatally flawed, I
updated using yum. I don't know whether this has guaranteed a fix, but
my workstation did not crash last night. If I get another kernel oops,
I will post it here. I will also hunt in my log files for one of the
previous oopses.
Comment 5 Tim Towers 2004-12-02 05:28:53 EST
My client was hung this morning.
After 2 x "umount -f /mount/point" there was a slight delay of about 15
seconds, then the following oops. The machine is completely wedged
(e.g. Caps/Num Lock doesn't work, cannot scroll with Shift-PgUp):

kernel BUG at kernel/timer.c:416!
invalid operand: 0000 [#1]
SMP
Modules linked in: mga parport_pc lp parport autofs4 i2c_dev i2c_core
nfs lockd sunrpc microcode button battery ac uhci_hcd ehci_hcd
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd_page_alloc gameport snd_mpu401_uart snd_rawmidi
snd_seq_device snd soundcore tg3 floppy dm_snapshot dm_zero dm_mirror
ext3 jbd dm_mod aic79xx sd_mod scsi_mod
CPU: 1
EIP: 00600:[<02126d31>] Not tainted VLI
EFLAGS: 00010087 (2.6.9-1.681_FC3smp)
EIP is at cascade+0x18/0x37
eax: 03816760 ebx: 038170b0 ecx: 00000028 edx: 39e88d1c
esi: 39e88d1c edi: 03816760 epb: 00000028 esp: 023b4fb4
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=023b4000 task=03ab1080
Stack: 00000000 02377b88 03816760 023b4fcc 021271a0 00000246 023b4fcc
023b4fcc
       0000000a 00000001 02377b88 0000000a 00000001 02123eb8 39ed0f70
00000046
       023a4184 00000000 021082a9
Call Trace:
Stack pointer is garbage, not printing trace
Code: e8 51 ff ff ff 85 c0 74 08 0f 0b 89 01 17 f4 2c 02 5b c3 55 89
cd 57 89 c7
 56 53 8d 1c ca 8b 33 39 de 74 1a 39 7e 20 89 f2 74 08 <0f> 0b a0 01
17 f4 2c 02
 8b 36 89 f8 e8 86 fc ff ff eb e2 89 1b

I've done my best to type it in accurately, but no guarantee.
Comment 6 Tim Towers 2004-12-02 05:44:02 EST
Created attachment 107762 [details]
output of ksymoops-2.4.9

command used was ksymoops-2.4.9/ksymoops -m /boot/System.map-2.6.9-1.681_FC3smp
< ~/tmp/oops
Comment 7 Steve Dickson 2004-12-02 08:33:06 EST
hmm... I'm running fc3 on a number of desktops and I'm not
seeing this problem.... but it does appear that something
is seriously wrong... 

Although it's not clear how much faith we can put in that
oops trace you're seeing (the stack seems to be in pretty bad
shape), it appears the oops happened in the swapper
process, which leads me to wonder: how much memory
does this machine have?

It also appears there are two problems:
1) the mount hanging
2) the oops that is caused by doing "umount -f" twice.

Would it be possible to get an AltSysRq-T system trace
after the mount hangs and before the "umount -f" commands are done?

The easiest way to get a system trace is to run
"echo t > /proc/sysrq-trigger", then use dmesg
to capture the trace into a file.
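The capture procedure above amounts to something like this (the output file name is a placeholder; the sysrq-enable step is an assumption for kernels where the magic sysrq key is disabled by default):

```shell
# Capture an AltSysRq-T task trace (run as root); output path is a placeholder.
echo 1 > /proc/sys/kernel/sysrq     # make sure the magic sysrq key is enabled
echo t > /proc/sysrq-trigger        # dump every task's state to the kernel log
dmesg > /tmp/sysrq-t.txt            # save the trace into a file
```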
Comment 8 Tim Towers 2004-12-03 05:16:19 EST
It has occurred to me (because I had forgotten about my comms room
temperature monitor) that I had a cron process writing to the NFS
filesystem every 5 minutes. This caused NFS traffic from my
workstation whilst the server disks were very busy. I would imagine
that a quiescent NFS mount would never cause a problem. Because we
have a different method of monitoring the temperature, I have turned
off my cron job.

If you wish to repeat the access pattern, then append a single line to
a 2GB+ file every 5 minutes whilst you stress your server disk with
parallel rsyncs of multiple slowly-growing ~600MB log files.
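The shape of that workload (a periodic appender racing against parallel copiers of the same data) can be simulated locally to sanity-check a test rig, with no NFS or rsync involved. This is a hedged, scaled-down sketch: all paths are scratch placeholders, `cp` stands in for rsync, and the loop stands in for the 5-minute cron cadence.

```shell
# Local, scaled-down simulation of the access pattern described above:
# a cron-style writer appends one line to a growing log while a
# parallel copier (standing in for rsync) reads the same file.
WORKDIR="${WORKDIR:-$(mktemp -d)}"
mkdir -p "$WORKDIR/export" "$WORKDIR/backup"
: > "$WORKDIR/export/biglog"          # in the real scenario this file is 2GB+

for i in 1 2 3; do
    date >> "$WORKDIR/export/biglog" &                         # the periodic append
    cp "$WORKDIR/export/biglog" "$WORKDIR/backup/biglog.$i" &  # the rsync-style copy
    wait                                                       # let both finish
done

echo "appended $(wc -l < "$WORKDIR/export/biglog") lines"
```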

I will see if I can get a system trace if it hangs again, which I now
doubt - but if you want me to re-enable my cron job to trigger a hang
let me know.
Comment 9 Steve Dickson 2004-12-03 06:25:48 EST
Just to be clear: disabling the cron job stops the system
from hanging, and re-enabling it causes the system
to hang?
Comment 10 Tim Towers 2004-12-06 06:03:38 EST
I can confirm that my client workstation has not crashed since I
stopped writing to the share overnight.
The cron job is something like "date > filename", every 5 minutes.
Even with the cron job going I cannot guarantee a hang overnight;
maybe a 50% likelihood.
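In crontab form, a job of that shape would look roughly like this; the target path is a placeholder, since the real filename was not given:

```shell
# m   h  dom mon dow   command
*/5   *   *   *   *    date > /mnt/nfs/monitor/timestamp
```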
Comment 11 Tim Towers 2004-12-27 13:01:40 EST
Still no crash.
I see there is another AS3 kernel available. Should I apply it and
re-enable the cron job that keeps the NFS share busy overnight?
Comment 12 Aleksey Nogin 2005-04-22 00:40:05 EDT
I am also seeing "RPC: error 5 connecting to server <_client's_ IP address>" in
dmesg on a RHEL AS 3 (fully updated, 2.4.21-27.0.2.ELsmp). The client is a RHEL
WS 4.

The error seems to happen when init on the client is "switching modes". Namely:
 - On boot, after /etc/rc.d/rc is done, but before /sbin/mingetty is
started.
 - On shutdown, after "init: Switching to runlevel: 6" is logged, but before
the "system is going down for reboot" message appears.

In both cases the following sequence of events happens:
 1) The client appears to be completely frozen, and the server is spewing "RPC:
error 5 connecting to server <_client's_ IP address>" every 10 seconds or so.
This lasts for about a minute or two.
 2) The client unfreezes, and the server stops spewing messages. Running ps on the
client shows that init is stuck in the "D" state. Sometimes that clears out
after a while.

One of the "non-standard" things we do is that the client is set up with
root-over-NFS. We also had to apply the patches discussed in bug 152557 to the
client's kernel (before that, the "frozen state" described above would be
followed by a kernel panic instead of clearing up, and after that the "error 5"
messages would be spewed forever by the server) - see bug 152557 comment #4.
Comment 15 RHEL Product and Program Management 2007-10-19 15:13:36 EDT
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
