From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041111 Firefox/1.0 Description of problem: I am seeing a problem with a Fedora core 3 client accessing a RedHat AS3 server over nfs. The AS3 machine did not go down as far as I am aware, but sometime overnight one of my mounted partitions got into a dead state. All accesses to it hung and did not time out. I was able to log onto the server and visit the exported filesystems to validate that they were ok. I was only able to fix it on the client by making sure nothing was accessing it and using "umount -f" and remounting by hand. The following errors were in dmesg: nfs_statfs: statfs error = 512 RPC: error 5 connecting to server filehost RPC: error 512 connecting to server filehost This has happened every night since I upgraded to a Fedora core 3 client. It only happens to one of the three partitions that I mount on my client, and I suspect it is load-triggered because the problem partition can get busy overnight with batch jobs. (that is why I did not log it as a fedora bug). Version-Release number of selected component (if applicable): kernel-2.4.21-4.EL How reproducible: Always Steps to Reproduce: 1. mount partitions over nfs from an advance server 3 server to a fedora core 3 client. 2. do lots of I/O on one of the partitions. rsync is our recommended method. 3. wait Actual Results: one of the three nfs partitions becomes unavailable. every command that touches it goes into a waiting state and does not return. As this partition is included in my $PATH this means that lots of things break, for example xlock will not unlock the screen but stops whilst displaying "Checking..." after I type my password. Expected Results: intermittent outages over NFS may cause pauses, but the requests should be retried and should repair themselves. Additional info: Whilst this is slightly similar to bug 118413 we do not use autofs. Also, I dont get the kernel oops message.
Can this problem be reproduced on RHEL3 U3 (released) or on RHEL3 U4 (currently in the RHN beta channel)?
currently produced on advance server 3, released last year. When it failed this morning, unmounting the filesystem with "umount -f" twice caused an oops on my fedora core3 client. Regardless of what the server is sending over the wire it should not cause an oops on my client so I am now sure there is something wrong with the shipped fedora core3 kernel. I have since upgraded from the initial 2.6.9.667smp kernel to 2.6.9.68?smp using yum. I will provide an update in a few days about the status of the bug, but it looks like the AS3 (kernel 2.4.21-4.EL) NFS server tickles a fatal error in the fedora core3 kernel 2.6.9. This may mean that I assigned the bug partially in the wrong category.
If possible, could you please post the oops the next time it happens?
After deciding that the core 3 kernel I had was fatally flawed I updated using yum. I dont know whether this has guaranteed a fix but my workstation did not crash last night. If I get another kernel oops I will post it here. I will also hunt in my log files for one of the previous oopses.
my client was hung this morning. after 2 x umount -f /mount/point there was a slight delay of about 15 seconds then the following oops. The machine is completely wedged (e.g. caps/numlock doesnt work, cannot scroll with shift-pgup) kernel BUG at kernel/timer.c:416! invalid operand: 0000 [#1] SMP Modules linked in: mga parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd sunrpc microcode button battery ac uhci_hcd ehci_hcd snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc gameport snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod aic79xx sd_mod scsi_mod CPU: 1 EIP: 00600:[<02126d31>] Not tainted VLI EFLAGS: 00010087 (2.6.9-1.681_FC3smp) EIP is at cascade+0x18/0x37 eax: 03816760 ebx: 038170b0 ecx: 00000028 edx: 39e88d1c esi: 39e88d1c edi: 03816760 epb: 00000028 esp: 023b4fb4 ds: 007b es: 007b ss: 0068 Process swapper (pid: 0, threadinfo=023b4000 task=03ab1080 Stack: 00000000 02377b88 03816760 023b4fcc 021271a0 00000246 023b4fcc 023b4fcc 0000000a 00000001 02377b88 0000000a 00000001 02123eb8 39ed0f70 00000046 023a4184 00000000 021082a9 Call Trace: Stack pointer is garbage, not printing trace Code: e8 51 ff ff ff 85 c0 74 08 0f 0b 89 01 17 f4 2c 02 5b c3 55 89 cd 57 89 c7 56 53 8d 1c ca 8b 33 39 de 74 1a 39 7e 20 89 f2 74 08 <0f> 0b a0 01 17 f4 2c 02 8b 36 89 f8 e8 86 fc ff ff eb e2 89 1b I've done my best to type it in accurately, but no guarantee.
Created attachment 107762 [details] output of ksymoops-2.4.9 command used was ksymoops-2.4.9/ksymoops -m /boot/System.map-2.6.9-1.681_FC3smp < ~/tmp/oops
hmm... I'm running fc3 on a number of desktops and I'm not seeing this problem.... but it does appear that something is seriously wrong... Although its not clear how much faith we can put in that oops trace your seeing (the stack seems to be in pretty bad shape), it appears the oops happen in the swapper process. Which leads me to wonder how much memory does this machine have? It also appears there are two problems: 1) the mount hanging 2) the oops that is caused by doing mount -f twice. Would it be possible to get a AltSysRq-T system trace after the mount hangs and before the mount -f are done? The easiest way to get a system trace is to echo t > /proc/sysrq-trigger, then use dmesg to capture the trace into file.
It has occurred to me (because I had forgotten about my comms room temperature monitor) that I had a cron process writing to the nfs filesystem every 5 minutes. This caused nfs traffic from my workstation whilst the server disks were very busy. I would imagine that a quiescent NFS mount would never cause a problem. Because we have a different method of monitoring the temperature I have turned off my cron job. If you wish to repeat the access pattern then append a single line to a 2Gb+ file every 5 minutes whilst you stress your server disk with parallel rsyncs of multiple slowly-growing ~600Mb log files. I will see if I can get a system trace if it hangs again, which I now doubt - but if you want me to re-enable my cron job to trigger a hang let me know.
Just to be clear, disabling the cron job stop the system from hanging and re-enabling it causes the system to hang.
I can confirm that my client workstation has not crashed since I stopped writing to the share overnight. The cron job is something like "date > filename", every 5 minutes. Even with the cron job going I cannot guarantee a hang overnight, maybe 50% liklihood.
Still no crash. I see there is another AS3 kernel available, Should I apply it and re-enable the cron job that keeps the nfs share busy overnight?
I am also seing "RPC: error 5 connecting to server <_client's_ IP address>" in dmesg on a RHEL AS 3 (fully updated, 2.4.21-27.0.2.ELsmp). The client is a RHEL WS4. The error seems to happen when init on the client is "switching modes". Namely, - On boot, after the /etc/rc.d/rc is done, but before the /sbin/mingetty is started. - On shutdown, after the "init: Switching to runlevel: 6" is logged, but before the "system is going down for reboot" message appears. In both cases the following sequence of events happens: 1) The client appears to be completely frozen, the server is spewing "RPC: error 5 connecting to server <_client's_ IP address>" every 10 seconds or so. This lasts for about a minute or two. 2) The client unfreeze, the server stops spewing messages. Running ps on the client shows that init is stuck in the "D" state. Sometimes that clears out after a while. One of the "non-standard" thing we do is that the client is set up with root-over-NFS. We also had to apply the patches discussed in bug 152557 to the client's kernel (before that the "frozen state" described above would be followed by the kernel panic instead of clearing up and after that the "error 5" messages would be spewed forever by the server) - see bug 152557 comment #4.
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet that criteria, it is now being closed. For more information of the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.