Bug 140631

Summary:

Mounting nfs partition causes modern machines to hang

Product:

[Fedora] Fedora

Reporter:

Ed Friedman <edfriedmangvs>

Component:

nfs-utils

Assignee:

Steve Dickson <steved>

Status:

CLOSED CANTFIX

QA Contact:

Ben Levenson <benl>

Severity:

medium

Priority:

medium

CC:

jeremy, mattdm, mtonn

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2006-10-31 15:53:36 UTC

Attachments:

Description	Flags
System trace created when machine hung	none
netdump output	none

Description Ed Friedman 2004-11-23 21:57:51 UTC

Description of problem:
Having any partitions mounted via NFS causes the machine to hang within 24
hours (inability to log in or, if you can log in, inability to view the
mounted partition).

Version-Release number of selected component (if applicable):
nfs-utils-1.0.6-39

How reproducible:
Almost always.

Steps to Reproduce:
1. On a modern machine (Asus P4P800E-Deluxe or Abit IS7-E motherboard),
mount a partition from a different machine using NFS.
2. Wait 24 hours.
3. Try to log in to the machine.
  
Actual results:
Usually, the login hangs after accepting the name and password.
Sometimes you can log in successfully, but then trying to view the
mounted partition (ls or df) results in the command hanging (and the
load on the machine steadily rises into the hundreds).

Expected results:
You should be able to log in every time and you should be able to view
the mounted partition.

Additional info:
This problem is even worse if you are using autofs as well.  Turning
on nscd does not fix the problem.

Comment 1 Steve Dickson 2004-11-30 14:23:38 UTC

Two things:
1) Is autofs in the picture?

2) Could you post an Alt-SysRq-t system trace by running
"echo t > /proc/sysrq-trigger" and then using dmesg to
capture the trace (i.e. dmesg > /tmp/systrace)?

Comment 2 Ed Friedman 2004-12-06 21:39:04 UTC

1) Autofs is out of the picture

2) There was no way to generate an Alt-SysRq-t system trace, because
when the system hangs it is impossible to log in.  I tried leaving open
sessions on the console and via ssh, but was unable to use them after
the system hung.  I did note that when I rebooted via
CTRL-ALT-DEL, the message for unmounting NFS said "failed" twice in a row.

I did test one system with no crontab writing to the NFS partition and
an identical system with a crontab that wrote to it every 5 minutes. 
The one with no crontab did not crash, but the other one did.

Here is the relevant entry from /etc/fstab:

galton:/var/spool/mail  /var/spool/mail     nfs  rw,bg,actimeo=0 0 0

And here is the relevant entry from the crontab:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/bin/w > /var/spool/mail/rwh/raj

Comment 3 Steve Dickson 2004-12-06 23:12:21 UTC

Before the system hangs, turn on Alt-SysRq processing
with "echo 1 > /proc/sys/kernel/sysrq". Then, when the
system hangs, type Alt-SysRq-t on the console keyboard
and a trace should appear.

Reboot and the trace *should be* in /var/log/messages.
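
The procedure from Comments 1 and 3 can be sketched as a small shell helper.
This is only a sketch, assuming a root shell and a kernel built with magic
SysRq support. The function names and the SYSRQ_ENABLE and SYSRQ_TRIGGER
variables are made up for illustration; the /proc paths and the default
output file are the ones quoted in the comments above.

```shell
#!/bin/sh
# Sketch only: helper names and variables are illustrative, not from the report.
# The /proc paths are the ones quoted in Comments 1 and 3.
SYSRQ_ENABLE=/proc/sys/kernel/sysrq
SYSRQ_TRIGGER=/proc/sysrq-trigger

# Run once, before the hang: allow SysRq requests (needs root).
enable_sysrq() {
    echo 1 > "$SYSRQ_ENABLE"
}

# Run while a shell still responds: dump all task states into the kernel
# log, then save the log to the file given as $1 (default /tmp/systrace).
save_task_trace() {
    echo t > "$SYSRQ_TRIGGER" && dmesg > "${1:-/tmp/systrace}"
}
```

Once no shell responds any more, only the Alt-SysRq-t key combination on the
console keyboard remains, and the trace has to be read from /var/log/messages
after the reboot, as described above.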

Comment 4 Ed Friedman 2004-12-08 17:37:17 UTC

Created attachment 108129 [details]
System trace created when machine hung

Comment 5 Ed Friedman 2004-12-28 19:48:21 UTC

I did further tests to refine the problem.

1. Cron is not associated with this problem (I used a shell script in
place of cron to write every 5 minutes and it still got hung up).

2. Writing to local disks does not have this problem (I used a cron
job to write every 5 minutes to a file on a local partition).

3. Reading every 5 minutes from a file on an NFS-mounted partition also
causes the machine to hang within 24 hours (I substituted a read for
the write in my cron job accessing the NFS-mounted partition).
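
The tests above boil down to touching a file on the NFS mount at a fixed
interval without involving cron. A minimal sketch of such a loop follows. It
is not the script actually used for these tests: the function name and its
arguments are hypothetical, it writes a timestamp with date instead of the
output of /usr/bin/w, and it combines the write and read cases in one pass.

```shell
#!/bin/sh
# Sketch only: touch_loop and its arguments are hypothetical names.
# Usage: touch_loop FILE [INTERVAL_SECONDS] [COUNT]
# Every INTERVAL_SECONDS (default 300), write a timestamp to FILE and read
# it back. COUNT=0 (the default) means loop until interrupted.
touch_loop() {
    file=$1
    interval=${2:-300}
    count=${3:-0}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        date > "$file" || return 1            # the write case
        cat "$file" > /dev/null || return 1   # the read case
        i=$((i + 1))
        sleep "$interval"
    done
}
```

With the file from the crontab entry in Comment 2, a run would look like
"touch_loop /var/spool/mail/rwh/raj 300". Changing the second argument to 600
gives the 10-minute variant.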

Comment 6 Steve Dickson 2005-01-03 16:22:18 UTC

Looking at the system trace, it appears there are
quite a few shells hung in getting permission bits
from the server (in nfs3_proc_access() to be exact).

If you remove the actimeo=0 mount option, does the
hang still happen?

Comment 7 Ed Friedman 2005-01-05 18:21:32 UTC

I removed the actimeo=0 mount option and the hang still happens, just
as before.

Comment 8 Steve Dickson 2005-01-07 11:42:13 UTC

OK... I'm trying to reproduce this here, but looking at the
system trace you posted, it appears the top half is missing.
I'm trying to find the first sh process that hung, since the rest
of the sh processes are just stuck behind that one....

/var/log/messages should have the complete trace.

Also, could you please post an Alt-SysRq-m and a
"cat /proc/slabinfo".... just to see how you're doing on memory
consumption
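
To make the raw slabinfo output easier to scan for memory consumption, the
largest caches can be listed first. The following is a rough sketch, assuming
the slabinfo 2.x column layout (name, active_objs, num_objs, objsize, ...).
The helper name is hypothetical, and reading /proc/slabinfo may require root.

```shell
#!/bin/sh
# Sketch only: top_slabs is a hypothetical helper name.
# Usage: top_slabs [SLABINFO_FILE] [N]
# Assumes the slabinfo 2.x layout: name, active_objs, num_objs, objsize, ...
# Prints the N caches (default 10) with the largest num_objs * objsize,
# as "name bytes", largest first.
top_slabs() {
    awk '$1 == "slabinfo" || $1 == "#" { next }   # skip the two header lines
         { printf "%s %.0f\n", $1, $3 * $4 }' "${1:-/proc/slabinfo}" |
        sort -k2,2nr | head -n "${2:-10}"
}
```

Running it before the hang and again when the load starts to climb would show
whether any single cache keeps growing.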

Comment 9 Ed Friedman 2005-01-10 19:25:31 UTC

I'll try to generate another trace and send you the complete
/var/log/messages file when I do.

As an experiment, I tried writing to the nfs mounted partition every
10 minutes, instead of every 5 minutes and it never hung.  Is it
possible that there is a problem when a disk write is sent at the same
instant that the computer is flushing its cache to an nfs mounted
disk?  If you want, I can try other times to see which intervals cause
the machine to hang.

Comment 10 Steve Dickson 2005-01-11 12:16:22 UTC

I'm not sure what's going on.... Over the weekend I was
not able to reproduce this....

What OS is running on the server side? Linux, Solaris, NetApp?

Comment 11 Ed Friedman 2005-01-11 17:42:54 UTC

The server is running Fedora 1.  There is nothing fancy going on
there, and the patches should be current.

Comment 12 Michael Tonn 2005-05-06 14:37:37 UTC

Created attachment 114082 [details]
netdump output

This is my netdump output.

Comment 13 Michael Tonn 2005-05-06 14:38:45 UTC

I am having the same problems with RedHat 3.0 connected to a NetApp filer.  
Any command that has any association with the NFS mount point will hang.

Comment 14 Ed Friedman 2005-08-12 18:17:38 UTC

Wow - I've finally discovered how to make NFS mounts stable.  One of my users
observed that older versions of RedHat and Fedora were using UDP when doing NFS
mounts, but the newer Fedora versions are using TCP.  Since I added the
flags "notcp,udp" to my mount options, everything has been working perfectly.

Comment 15 Steve Dickson 2005-08-17 03:39:26 UTC

So basically you're saying the NFS server in your FC1 doesn't work with NFS
mounts using TCP?

Comment 16 Ed Friedman 2005-08-18 18:45:04 UTC

Sorry for not making the fix more clear.  Basically, the server works with both
TCP and UDP, but TCP is the only one that occasionally hangs.  I didn't change
any server settings; on the client machines, I added the flags "notcp,udp" to
the NFS mount options in /etc/fstab and /etc/auto.master.  This prohibits TCP
mounting and forces UDP mounting.  With these options in place, there have been
no more crashes or hangups for a week now, even when I run programs that used to
always cause the machine to hang within 24 hours.
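
After changing the mount options, it is worth confirming which transport each
NFS mount actually ended up with. The sketch below parses a mounts table for
that. The helper name is hypothetical, and it assumes the kernel reports the
transport either as a bare "udp"/"tcp" flag or as "proto=udp"/"proto=tcp",
which varies between kernel versions.

```shell
#!/bin/sh
# Sketch only: nfs_transports is a hypothetical helper name.
# Usage: nfs_transports [MOUNTS_FILE]
# Prints "<mount point> <transport>" for every NFS entry in MOUNTS_FILE
# (default /proc/mounts). Assumes the transport appears in the options
# field as "udp"/"tcp" or "proto=udp"/"proto=tcp"; otherwise prints "unknown".
nfs_transports() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        proto = "unknown"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] == "udp" || opts[i] == "proto=udp") proto = "udp"
            if (opts[i] == "tcp" || opts[i] == "proto=tcp") proto = "tcp"
        }
        print $2, proto
    }' "${1:-/proc/mounts}"
}
```

For the mount from Comment 2, a line such as "/var/spool/mail udp" would
indicate that the "notcp,udp" options took effect.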

Comment 17 Matthew Miller 2006-07-10 21:19:46 UTC

Fedora Core 3 is now maintained by the Fedora Legacy project for security
updates only. If this problem is a security issue, please reopen and
reassign to the Fedora Legacy product. If it is not a security issue and
hasn't been resolved in the current FC5 updates or in the FC6 test
release, reopen and change the version to match.

Thank you!

Comment 18 John Thacker 2006-10-31 15:53:36 UTC

Closing per lack of response to previous request for information.
This bug was originally filed against a much earlier version of Fedora
Core, and significant changes have taken place since the last version
for which this bug is confirmed.

Note that FC3 and FC4 are supported by Fedora Legacy for security
fixes only.  Please install a still supported version and retest.  If
it still occurs on FC5 or FC6, please reopen and assign to the correct
version.  Otherwise, if this is a security issue, please change the
product to Fedora Legacy.  Thanks, and we are sorry that we did not
get to this bug earlier.