Bug 140319
| Field | Value |
|---|---|
| Summary | NFS server hangs overnight |
| Product | Red Hat Enterprise Linux 3 |
| Component | kernel |
| Version | 3.0 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | medium |
| Reporter | Tim Towers <tim> |
| Assignee | Steve Dickson <steved> |
| CC | aleksey, crt, jerome, mkpai, petrides, riel, tao |
| Doc Type | Bug Fix |
| Last Closed | 2007-10-19 19:13:36 UTC |

Description
Tim Towers
2004-11-22 10:56:05 UTC
Can this problem be reproduced on RHEL3 U3 (released) or on RHEL3 U4 (currently in the RHN beta channel)?

Currently reproduced on Advanced Server 3, released last year. When it failed this morning, unmounting the filesystem with "umount -f" twice caused an oops on my Fedora Core 3 client. Regardless of what the server is sending over the wire, it should not cause an oops on my client, so I am now sure there is something wrong with the shipped Fedora Core 3 kernel. I have since upgraded from the initial 2.6.9.667smp kernel to 2.6.9.68?smp using yum. I will provide an update in a few days about the status of the bug, but it looks like the AS3 (kernel 2.4.21-4.EL) NFS server tickles a fatal error in the Fedora Core 3 kernel 2.6.9. This may mean that I filed the bug partly in the wrong category.

If possible, could you please post the oops the next time it happens?

After deciding that the Core 3 kernel I had was fatally flawed, I updated using yum. I don't know whether this has guaranteed a fix, but my workstation did not crash last night. If I get another kernel oops I will post it here. I will also hunt in my log files for one of the previous oopses.

My client was hung this morning. After 2 x umount -f /mount/point there was a slight delay of about 15 seconds, then the following oops. The machine is completely wedged (e.g. Caps Lock/Num Lock doesn't work, cannot scroll with Shift-PgUp).

kernel BUG at kernel/timer.c:416!
invalid operand: 0000 [#1]
SMP
Modules linked in: mga parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd sunrpc microcode button battery ac uhci_hcd ehci_hcd snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc gameport snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod aic79xx sd_mod scsi_mod
CPU: 1
EIP: 00600:[<02126d31>] Not tainted VLI
EFLAGS: 00010087 (2.6.9-1.681_FC3smp)
EIP is at cascade+0x18/0x37
eax: 03816760 ebx: 038170b0 ecx: 00000028 edx: 39e88d1c
esi: 39e88d1c edi: 03816760 ebp: 00000028 esp: 023b4fb4
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=023b4000 task=03ab1080)
Stack: 00000000 02377b88 03816760 023b4fcc 021271a0 00000246 023b4fcc 023b4fcc
       0000000a 00000001 02377b88 0000000a 00000001 02123eb8 39ed0f70 00000046
       023a4184 00000000 021082a9
Call Trace:
Stack pointer is garbage, not printing trace
Code: e8 51 ff ff ff 85 c0 74 08 0f 0b 89 01 17 f4 2c 02 5b c3 55 89 cd 57 89 c7 56 53 8d 1c ca 8b 33 39 de 74 1a 39 7e 20 89 f2 74 08 <0f> 0b a0 01 17 f4 2c 02 8b 36 89 f8 e8 86 fc ff ff eb e2 89 1b

I've done my best to type it in accurately, but no guarantee.

Created attachment 107762
output of ksymoops-2.4.9
The command used was: ksymoops-2.4.9/ksymoops -m /boot/System.map-2.6.9-1.681_FC3smp < ~/tmp/oops

Hmm... I'm running FC3 on a number of desktops and I'm not seeing this problem, but it does appear that something is seriously wrong. Although it's not clear how much faith we can put in the oops trace you're seeing (the stack seems to be in pretty bad shape), it appears the oops happened in the swapper process, which leads me to wonder: how much memory does this machine have?

It also appears there are two problems:
1) the mount hanging
2) the oops that is caused by doing umount -f twice.

Would it be possible to get an Alt-SysRq-T system trace after the mount hangs and before the umount -f commands are run? The easiest way to get a system trace is to echo t > /proc/sysrq-trigger, then use dmesg to capture the trace into a file.

It has occurred to me (because I had forgotten about my comms room temperature monitor) that I had a cron process writing to the NFS filesystem every 5 minutes. This caused NFS traffic from my workstation whilst the server disks were very busy. I would imagine that a quiescent NFS mount would never cause a problem. Because we now have a different method of monitoring the temperature, I have turned off my cron job. If you wish to repeat the access pattern, append a single line to a 2 GB+ file every 5 minutes whilst you stress your server disk with parallel rsyncs of multiple slowly growing ~600 MB log files. I will see if I can get a system trace if it hangs again, which I now doubt, but if you want me to re-enable my cron job to trigger a hang, let me know.

Just to be clear: disabling the cron job stops the system from hanging, and re-enabling it causes the system to hang?

I can confirm that my client workstation has not crashed since I stopped writing to the share overnight. The cron job is something like "date > filename", every 5 minutes. Even with the cron job going I cannot guarantee a hang overnight, maybe 50% likelihood.

Still no crash.

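For anyone trying to reproduce this, the client-side write pattern and the requested trace capture can be sketched as below. This is only an illustration, not the reporter's actual setup: the function names, the file paths, and the crontab entry are hypothetical, and `>>` is used so that each run appends a line as described for the 2 GB+ file (the reporter's own job was "something like" `date > filename`).

```shell
#!/bin/sh
# Sketch only: function names and paths are hypothetical, not from the report.

# Append one timestamp line to the given file on the NFS mount.
# A crontab entry along these lines would run it every 5 minutes
# (script path is hypothetical):
#   */5 * * * * /usr/local/bin/nfs-touch.sh
append_timestamp() {
    date >> "$1"
}

# Ask the kernel for a task-state dump and save the kernel log to a file.
#   $1: output file for the trace
#   $2: trigger file (defaults to /proc/sysrq-trigger; needs root)
#   $3: command that prints the kernel log (defaults to dmesg)
capture_task_trace() {
    out=$1
    trigger=${2:-/proc/sysrq-trigger}
    log_cmd=${3:-dmesg}
    # Writing "t" requests the same dump as Alt-SysRq-T on the console.
    echo t > "$trigger" || return 1
    $log_cmd > "$out"
}
```

Run as root after the mount hangs and before any umount -f, `capture_task_trace /tmp/sysrq-t.txt` would leave the trace in `/tmp/sysrq-t.txt`, assuming the kernel log buffer is large enough to hold it.
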
I see there is another AS3 kernel available. Should I apply it and re-enable the cron job that keeps the NFS share busy overnight?

I am also seeing "RPC: error 5 connecting to server <_client's_ IP address>" in dmesg on a RHEL AS 3 server (fully updated, 2.4.21-27.0.2.ELsmp). The client is RHEL WS4. The error seems to happen when init on the client is "switching modes", namely:
- On boot, after /etc/rc.d/rc is done, but before /sbin/mingetty is started.
- On shutdown, after "init: Switching to runlevel: 6" is logged, but before the "system is going down for reboot" message appears.

In both cases the following sequence of events happens:
1) The client appears to be completely frozen, and the server is spewing "RPC: error 5 connecting to server <_client's_ IP address>" every 10 seconds or so. This lasts for about a minute or two.
2) The client unfreezes and the server stops spewing messages. Running ps on the client shows that init is stuck in the "D" state. Sometimes that clears after a while.

One of the "non-standard" things we do is that the client is set up with root-over-NFS. We also had to apply the patches discussed in bug 152557 to the client's kernel (before that, the "frozen state" described above would be followed by a kernel panic instead of clearing up, and after that the "error 5" messages would be spewed forever by the server); see bug 152557 comment #4.

This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission-critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.
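
The check for processes stuck in the "D" (uninterruptible sleep) state, as reported for init in the root-over-NFS comment above, could be done along these lines. This is a minimal sketch: the function names are hypothetical, and it assumes a procps-style `ps` that accepts `-eo pid,stat,comm`.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not from the report.

# Read "ps -eo pid,stat,comm" style output on stdin and print only the rows
# whose state column starts with D (uninterruptible sleep).
filter_d_state() {
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# Convenience wrapper that runs ps and filters its output.
list_d_state() {
    ps -eo pid,stat,comm | filter_d_state
}
```

Calling `list_d_state` while the client is frozen, or just after it unfreezes, would show whether init (PID 1) is among the blocked processes.
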