Bug 75000 - NFS Locks up after prolonged file transfers
Status: CLOSED CURRENTRELEASE
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.2
Hardware: i686 Linux
Priority: medium  Severity: medium
Assigned To: Steve Dickson
QA Contact: Brian Brock
Reported: 2002-10-03 11:23 EDT by Alex Turner
Modified: 2007-04-18 12:47 EDT (History)
Doc Type: Bug Fix
Last Closed: 2004-08-11 07:06:42 EDT

Description Alex Turner 2002-10-03 11:23:27 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020830

Description of problem:
I am backing up about half a dozen machines over NFS to a backup system.
For about 3 months the backups ran perfectly.  Now the cp command hangs, showing
a 'D' (uninterruptible sleep) status in ps, and it is counted in the load average
reported by uptime.
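
A minimal sketch of how processes stuck in the 'D' state could be listed on the client, assuming a Linux /proc filesystem. The function names parse_state and uninterruptible_pids are illustrative and are not part of the original report.

```python
import os

def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    rest = stat_line[stat_line.rindex(")") + 1:]
    return rest.split()[0]

def uninterruptible_pids(proc_root="/proc"):
    """Return the PIDs under proc_root that are in state 'D'."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as fh:
                if parse_state(fh.read()) == "D":
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open().
            continue
    return pids

if __name__ == "__main__":
    for pid in uninterruptible_pids():
        print(pid)
```

A hung cp like the one described here would be expected to appear in this list for as long as it is blocked on the NFS server.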

The command that it hung on last night was:
cp -aux /eda/cvsroot/CVSROOT /eda/cvsroot/ConlinsExtra /eda/cvsroot/NE-NS
/eda/cvsroot/PSCbackend /eda/cvsroot/PlexQStencils /eda/cvsroot/TrendMLS_Scripts
/eda/cvsroot/[repository /eda/cvsroot/agent-websites
/eda/cvsroot/autopagegenerator /eda/cvsroot/backup-scripts
/eda/cvsroot/cbpreferred-concierge-parser
/eda/cvsroot/cbpreferred-conciergevendor-parser /eda/cvsroot/cbpreferred-website
/eda/cvsroot/cbpreferred-website-old /eda/cvsroot/cbrealty-website
/eda/cvsroot/clarkmadara-website /eda/cvsroot/code_library
/eda/cvsroot/conlin-extra-app /eda/cvsroot/continental-website
/eda/cvsroot/directory-admin /eda/cvsroot/duffyrealestate-website
/eda/cvsroot/eastcoastsalon-bin /eda/cvsroot/eastcoastsalon-cgi-bin
/eda/cvsroot/eastcoastsalon-sql /eda/cvsroot/eastcoastsalon-website
/eda/cvsroot/eradager-website /eda/cvsroot/etc-mail /eda/cvsroot/extra
/eda/cvsroot/glocker-website /eda/cvsroot/harcum-sql
/eda/cvsroot/harcum-squirrelmail-plugin /eda/cvsroot/harcum-webmail-help
/eda/cvsroot/harcum-website /eda/cvsroot/harcum-website-v2
/eda/cvsroot/herb-website /eda/cvsroot/johnalexanderltd-website
/eda/cvsroot/kodylighting-website /eda/cvsroot/ldap-admin
/eda/cvsroot/neteconomist-ldap-schema /eda/cvsroot/neteconomist-website
/eda/cvsroot/prudentialcommonwealth-cgi-bin
/eda/cvsroot/prudentialcommonwealth-website
/eda/cvsroot/prudentialcommonwealth-website-v1 /eda/cvsroot/psc-website
/eda/cvsroot/quinnwilson-website /eda/cvsroot/realleads-sql
/eda/cvsroot/realtoredge-dynamic-system /eda/cvsroot/realtoredgeleads-bin
/eda/cvsroot/realtoredgeleads-cgi-bin /eda/cvsroot/realtoredgeleads-website
/eda/cvsroot/remaxrealtygrouppa-website /eda/cvsroot/springfordcc-bin
/eda/cvsroot/springfordcc-cgi-bin /eda/cvsroot/springfordcc-website
/eda/cvsroot/trendmls-admin-site /eda/cvsroot/trendmls-conf-live
/eda/cvsroot/trendmls-conf-qa /eda/cvsroot/trendmls-website
/eda/cvsroot/trendpull-old /eda/cvsroot/web-toolkit
/mnt/backup/manganese.neteconomist.com/eda/cvsroot

The backup server is a Compaq Proliant DL580 with a hardware RAID controller
and two logical disks: one RAID 1 mirror and one three-drive RAID 5.

Both machines are running the same kernel version:
kernel-2.4.9-31
and the same version of nfs-utils:
nfs-utils-0.3.1-13.7.2.1

If I run 'df' on the client while cp is hung, df hangs as well (classic NFS hang).
The NFS server itself appears to be fine: I can mount the export from another
client, but as soon as I try to access the mount, that client hangs too.

The only way I have found to recover is to 'kill -9' the cp, shut down NFS on
the server (avatar), and then start it again.  It takes a minute or two, but
the clients recover after that.

Some of the machines being backed up have very large files, although the
directory /eda/cvsroot totals only about 250 MB.  I have already checked for
files larger than 2 GB, and there don't appear to be any.
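
The report does not say how the 2 GB check was done. As a hedged sketch, a walk of the tree along the following lines would perform it; the helper name find_large_files and the exact threshold of 2**31 bytes are assumptions made for illustration.

```python
import os
import stat

TWO_GIB = 2**31  # assumed threshold: 2 GiB in bytes

def find_large_files(root, limit=TWO_GIB):
    """Yield (path, size) for regular files under root with size >= limit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size >= limit:
                yield path, st.st_size

if __name__ == "__main__":
    for path, size in find_large_files("/eda/cvsroot"):
        print(size, path)
```

An empty result for /eda/cvsroot would match what the reporter observed.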

I have now staggered the cron.daily runs to spread out the load, but it didn't
help.  The smaller machines that went first appeared to be okay, but the largest
machine (the DB) croaked.


The syslog shows the following once, at the beginning of the job:

Oct  3 06:03:18 manganese kernel: nfs: server avatar not responding, still trying

Then, after I start kicking the NFS server (taking it down and bringing it back up):

Oct  3 10:55:31 manganese kernel: nfs: task 12614 can't get a request slot
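
To spot these two kernel messages in a longer syslog, a small filter such as the one below could be used. This is only a sketch: the patterns are derived from the two lines quoted in this report, and the function name classify_nfs_line and its labels are hypothetical.

```python
import re

# Patterns taken from the two kernel messages quoted in this report.
NOT_RESPONDING = re.compile(r"nfs: server (\S+) not responding")
NO_SLOT = re.compile(r"nfs: task (\d+) can't get a request slot")

def classify_nfs_line(line):
    """Return (label, detail) for a recognised NFS client message, else None."""
    m = NOT_RESPONDING.search(line)
    if m:
        return ("not_responding", m.group(1))  # detail is the server name
    m = NO_SLOT.search(line)
    if m:
        return ("no_request_slot", m.group(1))  # detail is the RPC task id
    return None
```

Counting the results per server name would show whether the timeouts always involve the same server.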

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
I don't know how to reproduce this in a sterile environment; I don't have that
capability.

Additional info:
Comment 1 Steve Dickson 2004-08-11 07:06:42 EDT
This seems to be fixed in later kernels.
