Bug 75000

Summary: NFS Locks up after prolonged file transfers
Product: [Retired] Red Hat Linux
Reporter: Alex Turner <armtuk>
Component: kernel
Assignee: Steve Dickson <steved>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 7.2
Target Milestone: ---
Target Release: ---
Hardware: i686
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2004-08-11 11:06:42 UTC

Description Alex Turner 2002-10-03 15:23:27 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020830

Description of problem:
I am backing up about half a dozen machines over NFS to a backup system.
For about 3 months the backups ran perfectly.  Now the cp command hangs, showing
a 'D' (uninterruptible sleep) status in ps, and it counts toward the load
average reported by uptime.
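
For reference, processes stuck in uninterruptible sleep can be spotted with
something like:

# List processes whose state starts with 'D' (uninterruptible sleep);
# the hung cp shows up here while it waits on the dead NFS mount
ps axo stat,pid,comm | grep '^D'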

The command that it hung on last night was:
cp -aux /eda/cvsroot/CVSROOT /eda/cvsroot/ConlinsExtra /eda/cvsroot/NE-NS
/eda/cvsroot/PSCbackend /eda/cvsroot/PlexQStencils /eda/cvsroot/TrendMLS_Scripts
/eda/cvsroot/[repository /eda/cvsroot/agent-websites
/eda/cvsroot/autopagegenerator /eda/cvsroot/backup-scripts
/eda/cvsroot/cbpreferred-concierge-parser
/eda/cvsroot/cbpreferred-conciergevendor-parser /eda/cvsroot/cbpreferred-website
/eda/cvsroot/cbpreferred-website-old /eda/cvsroot/cbrealty-website
/eda/cvsroot/clarkmadara-website /eda/cvsroot/code_library
/eda/cvsroot/conlin-extra-app /eda/cvsroot/continental-website
/eda/cvsroot/directory-admin /eda/cvsroot/duffyrealestate-website
/eda/cvsroot/eastcoastsalon-bin /eda/cvsroot/eastcoastsalon-cgi-bin
/eda/cvsroot/eastcoastsalon-sql /eda/cvsroot/eastcoastsalon-website
/eda/cvsroot/eradager-website /eda/cvsroot/etc-mail /eda/cvsroot/extra
/eda/cvsroot/glocker-website /eda/cvsroot/harcum-sql
/eda/cvsroot/harcum-squirrelmail-plugin /eda/cvsroot/harcum-webmail-help
/eda/cvsroot/harcum-website /eda/cvsroot/harcum-website-v2
/eda/cvsroot/herb-website /eda/cvsroot/johnalexanderltd-website
/eda/cvsroot/kodylighting-website /eda/cvsroot/ldap-admin
/eda/cvsroot/neteconomist-ldap-schema /eda/cvsroot/neteconomist-website
/eda/cvsroot/prudentialcommonwealth-cgi-bin
/eda/cvsroot/prudentialcommonwealth-website
/eda/cvsroot/prudentialcommonwealth-website-v1 /eda/cvsroot/psc-website
/eda/cvsroot/quinnwilson-website /eda/cvsroot/realleads-sql
/eda/cvsroot/realtoredge-dynamic-system /eda/cvsroot/realtoredgeleads-bin
/eda/cvsroot/realtoredgeleads-cgi-bin /eda/cvsroot/realtoredgeleads-website
/eda/cvsroot/remaxrealtygrouppa-website /eda/cvsroot/springfordcc-bin
/eda/cvsroot/springfordcc-cgi-bin /eda/cvsroot/springfordcc-website
/eda/cvsroot/trendmls-admin-site /eda/cvsroot/trendmls-conf-live
/eda/cvsroot/trendmls-conf-qa /eda/cvsroot/trendmls-website
/eda/cvsroot/trendpull-old /eda/cvsroot/web-toolkit
/mnt/backup/manganese.neteconomist.com/eda/cvsroot

The backup server is a Compaq ProLiant DL580 with a hardware RAID controller and
two logical disks: one a RAID 1 mirror and one a three-drive RAID 5.

Both machines are running the same kernel version:
kernel-2.4.9-31
and the same version of nfs-utils:
nfs-utils-0.3.1-13.7.2.1

If I try to do a 'df' whilst it's hung, the df hangs as well (a classic NFS hang).
The NFS server appears to be fine: I can mount the export from another client, but
as soon as that client tries to access it, it hangs too.
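
A quick way to check whether the server side is still answering RPCs at all is
something along these lines, run from any other host ('avatar' is the server):

# Ask avatar's portmapper which RPC services are registered
rpcinfo -p avatar
# List the exports; if this hangs too, mountd is wedged as well
showmount -e avatar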

Seemingly the only way out is to 'kill -9' the cp, shut down NFS on the server
(avatar), and then restart it.  It takes a minute or two, but the clients come
back around after that.
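
The recovery boils down to roughly the following; the init-script invocation is
an assumption for Red Hat Linux 7.2:

kill -9 <pid-of-cp>      # on the client; <pid-of-cp> is a placeholder
/sbin/service nfs stop   # on avatar
/sbin/service nfs start  # clients recover a minute or two later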

Some of the machines being backed up have very large files, although the total
for the directory /eda/cvsroot is only about 250 MB.  I have already checked for
files larger than 2 GB, and it doesn't appear that there are any.
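
The large-file check can be done with find; a sketch, using kilobyte units since
this find may not accept a 'G' suffix:

# Report any file larger than 2 GB (2097152 KB) under the CVS tree
find /eda/cvsroot -type f -size +2097152k -print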

I have now staggered the cron.daily execution to spread the backups out, but it
didn't help.  The smaller machines that went first appeared to be okay, but the
largest machine (the database server) still hung.
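
The staggering amounts to giving each client a different run-parts start time in
/etc/crontab; the times below are illustrative, not the actual schedule:

# on one of the smaller clients
02 4 * * * root run-parts /etc/cron.daily
# on the database server, an hour later
02 5 * * * root run-parts /etc/cron.daily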


The syslog shows this once, at the beginning of the job:

Oct  3 06:03:18 manganese kernel: nfs: server avatar not responding, still trying

and then the following, after I start kicking the NFS server (taking it down and
bringing it back up):

Oct  3 10:55:31 manganese kernel: nfs: task 12614 can't get a request slot
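
That second message means the client has run out of RPC request slots because
nothing is coming back from the server; the retransmission counters can be
checked with something like:

# Client-side RPC statistics; a climbing 'retrans' count confirms requests
# are going unanswered rather than being lost outright
nfsstat -rc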

Version-Release number of selected component (if applicable):
kernel-2.4.9-31
nfs-utils-0.3.1-13.7.2.1

How reproducible:
Sometimes

Steps to Reproduce:
I don't know how to reproduce this in a sterile environment; I don't have that
capability.


Additional info:

Comment 1 Steve Dickson 2004-08-11 11:06:42 UTC
This seems to be fixed in later kernels.