Red Hat Bugzilla – Bug 75000
NFS Locks up after prolonged file transfers
Last modified: 2007-04-18 12:47:04 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020830
Description of problem:
I am backing up about half a dozen machines over NFS to a backup system.
For about 3 months the backups ran perfectly. Now the cp command hangs, showing
a 'D' status in ps, and shows up as a process waiting for CPU in the output of
The command that it hung on last night was:
cp -aux /eda/cvsroot/CVSROOT /eda/cvsroot/ConlinsExtra /eda/cvsroot/NE-NS
/eda/cvsroot/PSCbackend /eda/cvsroot/PlexQStencils /eda/cvsroot/TrendMLS_Scripts
/eda/cvsroot/eradager-website /eda/cvsroot/etc-mail /eda/cvsroot/extra
The backup server is a Compaq Proliant DL580 with a hardware RAID controller,
and two logical disks, one as a RAID 1 mirror, and one as a three drive RAID 5.
The two machines are both running the same kernel version:
and the same version of nfs-utils
If I try to do a 'df' whilst it's hung, the df hangs also (classic NFS hang).
The NFS server appears to be fine: I can mount it on another client, but if I
try to access the mount, the new client also hangs.
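A process blocked on an unresponsive NFS mount sits in uninterruptible sleep,
which ps reports as a state starting with 'D'. The sketch below is one way to
list such processes on a hung client. The function names are made up for
illustration, and 'stat' is the procps/BSD column name, so other ps
implementations may need a different keyword.

```shell
# Hypothetical helper: keep only lines whose second field (the process
# state) starts with 'D'. Expects "PID STAT COMMAND..." lines on stdin.
filter_d_state() {
    awk '$2 ~ /^D/'
}

# Hypothetical wrapper: list every process currently in uninterruptible
# sleep. The empty "name=" headers suppress the ps header line.
list_d_state() {
    ps -e -o pid= -o stat= -o args= | filter_d_state
}

# Usage (on the hung client):
#   list_d_state
```

In a hang like the one described here, both the 'cp' and the 'df' would be
expected to show up in that list.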
The only way out, seemingly, is to 'kill -9' the 'cp', shut down NFS on the
server (avatar), and then start it again. It takes a minute or two, but the
clients recover after that.
Some of the machines that are backing up have very large files, although the
total for the directory /eda/cvsroot is only about 250MB. I have already
checked to see if there are any files larger than 2 GB, and it doesn't appear
that there are.
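For reference, a check for oversized files can be sketched with find. This is
not necessarily the command that was used here. The function name is
hypothetical, and the 2147483647-byte threshold assumes the "2 GB" limit in
question is the signed 32-bit file offset limit.

```shell
# Hypothetical helper: print regular files larger than LIMIT bytes under DIR.
# -xdev keeps find on one filesystem; "-size +Nc" matches files whose size
# in bytes is strictly greater than N.
find_files_over() {
    limit_bytes=$1
    dir=$2
    find "$dir" -xdev -type f -size +"${limit_bytes}c" -print
}

# Usage, assuming the signed 32-bit offset limit (2^31 - 1 bytes):
#   find_files_over 2147483647 /eda/cvsroot
```

An empty result means no file under the directory exceeds the limit.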
I have now staggered the cron.daily execution to spread it out, but it didn't
help. The smaller machines that went first appeared to be okay, but the largest
machine (the DB) croaked.
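Staggering did not cure the hang, but for anyone wanting to spread client
start times without editing each crontab by hand, one possible approach is to
derive a delay from the hostname. This is only a sketch and not the method
used above: the function name and the 120-minute window are assumptions, and a
hash-based offset does not order machines by size.

```shell
# Hypothetical helper: map a hostname to a stable offset in [0, WINDOW)
# minutes, using the CRC printed by cksum as a cheap hash.
stagger_minutes() {
    host=$1
    window=$2
    sum=$(printf '%s' "$host" | cksum | cut -d ' ' -f 1)
    echo $(( sum % window ))
}

# Usage at the top of a backup script (120-minute window is an assumption):
#   sleep $(( $(stagger_minutes "$(hostname)" 120) * 60 ))
```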
The syslog shows:
Oct 3 06:03:18 manganese kernel: nfs: server avatar not responding, still trying
That message appears once at the beginning of the job, then:
Oct 3 10:55:31 manganese kernel: nfs: task 12614 can't get a request slot
This one appears after I start kicking the NFS server (taking it down and
bringing it back up).
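To pull just these two kinds of message out of a client's log, a filter along
the following lines may help. The function name is hypothetical, the pattern
is built only from the two messages quoted above, and /var/log/messages as the
log location is an assumption.

```shell
# Hypothetical helper: read syslog lines on stdin and keep only the NFS
# client messages of the two kinds quoted in this report.
nfs_hang_messages() {
    grep -E "kernel: nfs: (server [^ ]+ not responding|task [0-9]+ can't get a request slot)"
}

# Usage (log path is an assumption):
#   nfs_hang_messages < /var/log/messages
```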
Version-Release number of selected component (if applicable):
Steps to Reproduce:
I don't know how to reproduce this in a sterile environment; I don't have that
This seems to be fixed in later kernels.