Bug 140631
Summary: | Mounting nfs partition causes modern machines to hang
---|---
Product: | [Fedora] Fedora
Component: | nfs-utils
Version: | 3
Hardware: | All
OS: | Linux
Status: | CLOSED CANTFIX
Severity: | medium
Priority: | medium
Reporter: | Ed Friedman <edfriedmangvs>
Assignee: | Steve Dickson <steved>
QA Contact: | Ben Levenson <benl>
CC: | jeremy, mattdm, mtonn
Doc Type: | Bug Fix
Last Closed: | 2006-10-31 15:53:36 UTC

Description
Ed Friedman
2004-11-23 21:57:51 UTC
Two things:
1) Is autofs in the picture?
2) Could you post an AltSysRq-t system trace by running "echo t > /proc/sysrq-trigger" and then using dmesg to capture the trace (i.e. dmesg > /tmp/systrace)?

1) Autofs is out of the picture.
2) There was no way to generate an AltSysRq-t system trace, because when the system hangs it is impossible to log in. I tried leaving open sessions on the console and via ssh, but was unable to use them after the system hung. I did note that when I rebooted via CTRL-ALT-DEL, the message for unmounting NFS said failed twice in a row.

I did test one system with no crontab writing to the NFS partition and an identical system with a crontab that wrote to it every 5 minutes. The one with no crontab did not crash, but the other one did. Here is the relevant entry from /etc/fstab:

    galton:/var/spool/mail /var/spool/mail nfs rw,bg,actimeo=0 0 0

And here is the relevant entry from the crontab:

    0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/bin/w > /var/spool/mail/rwh/raj

Before the system hangs, turn on AltSysRq processing with "echo 1 > /proc/sys/kernel/sysrq". Then, when the system hangs, type AltSysRq-t on the console keyboard and a trace should appear. Reboot, and the trace *should be* in /var/log/messages.

Created attachment 108129: System trace created when machine hung
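The trace-capture steps requested above can be sketched as a few shell functions. This is only an illustration: the function names are hypothetical, the two /proc paths are the ones quoted in this report, and the environment-variable overrides are an assumption added so the functions can be tried against scratch files without root.

```shell
#!/bin/sh
# Sketch only: paths default to the ones named in this report, but can be
# overridden through the environment for a dry run against scratch files.
SYSRQ_ENABLE="${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Turn on SysRq processing (run as root, before the hang occurs).
enable_sysrq() {
    echo 1 > "$SYSRQ_ENABLE"
}

# Ask the kernel to dump the state of every task into its log buffer.
# This is the same request as typing AltSysRq-t on the console keyboard.
request_task_trace() {
    echo t > "$SYSRQ_TRIGGER"
}

# Save the kernel log buffer to the file given as $1.
save_trace() {
    dmesg > "$1"
}
```

On a machine that still accepts commands, running enable_sysrq, request_task_trace and then save_trace /tmp/systrace as root corresponds to the sequence asked for. Once the machine is hung, only the keyboard combination is available, which is why SysRq has to be enabled beforehand.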
I did further tests to refine the problem.
1. Cron is not associated with this problem (I used a shell script in place of cron to write every 5 minutes and it still hung).
2. Writing to local disks does not have this problem (I used a cron job to write every 5 minutes to a file on a local partition).
3. Reading every 5 minutes from a file on an NFS-mounted partition also causes the machine to hang within 24 hours (I substituted a read for the write in my cron job accessing the NFS-mounted partition).

Looking at the system trace, it appears there are quite a few shells hung while getting permission bits from the server (in nfs3_proc_access(), to be exact). If you remove the actimeo=0 mount option, does the hang still happen?

I removed the actimeo=0 mount option and the hang still happens, just as before.

OK, I'm trying to reproduce this here, but looking at the system trace you posted, it appears the top half is missing. I'm trying to find the first sh process that hung, since the rest of the sh processes are just stuck behind that one. /var/log/messages should have the complete trace. Also, could you please post an AltSysRq-m and a "cat /proc/slabinfo", just to see how you're doing on memory consumption?

I'll try to generate another trace and send you the complete /var/log/messages file when I do. As an experiment, I tried writing to the NFS-mounted partition every 10 minutes instead of every 5 minutes, and it never hung. Is it possible that there is a problem when a disk write is sent at the same instant that the computer is flushing its cache to an NFS-mounted disk? If you want, I can try other intervals to see which ones cause the machine to hang.

I'm not sure what's going on. Over the weekend I was not able to reproduce this. What OS is running on the server side? Linux, Solaris, NetApp?

The server is running Fedora 1. There is nothing fancy going on there, and the patches should be current.

Created attachment 114082: netdump output
This is my netdump output.
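For anyone trying to reproduce the hang, the periodic write and read tests described above might look roughly like the following. The reporter's actual script was not posted, so this is a guess at its shape: the function names are hypothetical, and it writes a timestamp with date where the original cron job wrote the output of /usr/bin/w.

```shell
#!/bin/sh
# access_once MODE FILE: perform a single write or a single read on FILE.
access_once() {
    mode=$1
    file=$2
    case "$mode" in
        write) date > "$file" ;;            # stand-in for "/usr/bin/w > FILE"
        read)  cat "$file" > /dev/null ;;
        *)     echo "usage: access_once write|read FILE" >&2
               return 2 ;;
    esac
}

# run_loop MODE FILE INTERVAL [COUNT]
# Repeat the access every INTERVAL seconds. COUNT=0 (the default) means
# loop until the process is killed or the machine hangs.
run_loop() {
    mode=$1
    file=$2
    interval=$3
    count=${4:-0}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        access_once "$mode" "$file" || return 1
        i=$((i + 1))
        # No need to sleep after the final iteration.
        if [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Calling run_loop write /var/spool/mail/rwh/raj 300 would match the 5-minute interval that hung, and an interval of 600 would match the 10-minute interval that did not. Pointing the same loop at a file on a local partition gives the control case from item 2.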
I am having the same problems with RedHat 3.0 connected to a NetApp filer. Any command that has any association with the NFS mount point will hang.

Wow - I've finally discovered how to make NFS mounts stable. One of my users observed that older versions of RedHat and Fedora were using UDP when doing NFS mounts, but the newer Fedora versions are using TCP. Since I added the flags "notcp,udp" to my mount options, everything has been working perfectly.

So basically you're saying the NFS server in your FC1 does work with NFS mount using TCP?

Sorry for not making the fix clearer. Basically, the server works with both TCP and UDP, but TCP is the only one that occasionally hangs. I didn't change any server settings; on the client machines, I added the flags "notcp,udp" to the NFS mount options in /etc/fstab and /etc/auto.master. This prohibits TCP mounting and forces UDP mounting. With these options in place, there have been no more crashes or hangs for a week now, even when I run programs that used to always cause the machine to hang within 24 hours.

Fedora Core 3 is now maintained by the Fedora Legacy project for security updates only. If this problem is a security issue, please reopen and reassign to the Fedora Legacy product. If it is not a security issue and hasn't been resolved in the current FC5 updates or in the FC6 test release, reopen and change the version to match. Thank you!

Closing per lack of response to previous request for information. This bug was originally filed against a much earlier version of Fedora Core, and significant changes have taken place since the last version for which this bug is confirmed. Note that FC3 and FC4 are supported by Fedora Legacy for security fixes only. Please install a still-supported version and retest. If it still occurs on FC5 or FC6, please reopen and assign to the correct version. Otherwise, if this is a security issue, please change the product to Fedora Legacy.

Thanks, and we are sorry that we did not get to this bug earlier.
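Since the workaround in this thread depends on which transport a client actually uses, a quick check of the mounted NFS filesystems can help confirm that the "notcp,udp" options took effect. The sketch below is an assumption-laden helper, not part of nfs-utils: the function name is hypothetical, and it assumes the mount table lists the transport either as a bare "udp"/"tcp" option or as "proto=udp"/"proto=tcp", which may differ between kernel versions.

```shell
#!/bin/sh
# nfs_transports [MOUNTS_FILE]
# Print "<mount point> <transport>" for each NFS entry in a mounts-format
# file (default: /proc/mounts). The transport is reported as "unknown"
# when neither spelling of the option is present.
nfs_transports() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        proto = "unknown"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] == "udp" || opts[i] == "proto=udp") proto = "udp"
            if (opts[i] == "tcp" || opts[i] == "proto=tcp") proto = "tcp"
        }
        print $2, proto
    }' "${1:-/proc/mounts}"
}
```

Running nfs_transports with no argument inspects the live mount table. Passing /etc/fstab instead shows what is configured rather than what is currently mounted.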