75000 – NFS Locks up after prolonged file transfers

Summary: NFS Locks up after prolonged file transfers

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.2
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-10-03 15:23 UTC by Alex Turner
Modified:	2007-04-18 16:47 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-08-11 11:06:42 UTC
Embargoed:

Description Alex Turner 2002-10-03 15:23:27 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020830

Description of problem:
I am backing up about half a dozen machines over NFS to a backup system.
For about 3 months the backups ran perfectly.  Now the cp command hangs: ps
shows it in state 'D' (uninterruptible sleep), and it counts toward the load
average reported by uptime.
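
For anyone trying to confirm the same symptom, here is a minimal sketch that lists processes in state 'D' by reading /proc/<pid>/stat on Linux. The function names (parse_stat, uninterruptible_processes) are hypothetical and not part of the original report; this only approximates what 'ps' already displays.

```python
import os


def parse_stat(stat_line):
    """Return (pid, command, state) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split around the last ')'.
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid), comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as handle:
                pid, comm, state = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

A hung cp like the one described here would be expected to appear in that list for as long as the NFS request stays outstanding.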

The command that it hung on last night was:
cp -aux /eda/cvsroot/CVSROOT /eda/cvsroot/ConlinsExtra /eda/cvsroot/NE-NS
/eda/cvsroot/PSCbackend /eda/cvsroot/PlexQStencils /eda/cvsroot/TrendMLS_Scripts
/eda/cvsroot/[repository /eda/cvsroot/agent-websites
/eda/cvsroot/autopagegenerator /eda/cvsroot/backup-scripts
/eda/cvsroot/cbpreferred-concierge-parser
/eda/cvsroot/cbpreferred-conciergevendor-parser /eda/cvsroot/cbpreferred-website
/eda/cvsroot/cbpreferred-website-old /eda/cvsroot/cbrealty-website
/eda/cvsroot/clarkmadara-website /eda/cvsroot/code_library
/eda/cvsroot/conlin-extra-app /eda/cvsroot/continental-website
/eda/cvsroot/directory-admin /eda/cvsroot/duffyrealestate-website
/eda/cvsroot/eastcoastsalon-bin /eda/cvsroot/eastcoastsalon-cgi-bin
/eda/cvsroot/eastcoastsalon-sql /eda/cvsroot/eastcoastsalon-website
/eda/cvsroot/eradager-website /eda/cvsroot/etc-mail /eda/cvsroot/extra
/eda/cvsroot/glocker-website /eda/cvsroot/harcum-sql
/eda/cvsroot/harcum-squirrelmail-plugin /eda/cvsroot/harcum-webmail-help
/eda/cvsroot/harcum-website /eda/cvsroot/harcum-website-v2
/eda/cvsroot/herb-website /eda/cvsroot/johnalexanderltd-website
/eda/cvsroot/kodylighting-website /eda/cvsroot/ldap-admin
/eda/cvsroot/neteconomist-ldap-schema /eda/cvsroot/neteconomist-website
/eda/cvsroot/prudentialcommonwealth-cgi-bin
/eda/cvsroot/prudentialcommonwealth-website
/eda/cvsroot/prudentialcommonwealth-website-v1 /eda/cvsroot/psc-website
/eda/cvsroot/quinnwilson-website /eda/cvsroot/realleads-sql
/eda/cvsroot/realtoredge-dynamic-system /eda/cvsroot/realtoredgeleads-bin
/eda/cvsroot/realtoredgeleads-cgi-bin /eda/cvsroot/realtoredgeleads-website
/eda/cvsroot/remaxrealtygrouppa-website /eda/cvsroot/springfordcc-bin
/eda/cvsroot/springfordcc-cgi-bin /eda/cvsroot/springfordcc-website
/eda/cvsroot/trendmls-admin-site /eda/cvsroot/trendmls-conf-live
/eda/cvsroot/trendmls-conf-qa /eda/cvsroot/trendmls-website
/eda/cvsroot/trendpull-old /eda/cvsroot/web-toolkit
/mnt/backup/manganese.neteconomist.com/eda/cvsroot

The backup server is a Compaq Proliant DL580 with a hardware RAID controller
and two logical disks: one RAID 1 mirror and one three-drive RAID 5.

The two machines are both running the same kernel and nfs-utils versions:
kernel-2.4.9-31
nfs-utils-0.3.1-13.7.2.1

If I run 'df' while the cp is hung, the df hangs as well (classic NFS hang).
The NFS server itself appears to be fine: I can mount the export on another
client, but as soon as I try to access it, that client hangs too.
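
Since a plain 'df' blocks along with the mount, a probe with a timeout can be handy when checking whether the mount is answering. The following is only a sketch under assumptions: the name mount_responds and the 10-second default are made up for illustration, and the check runs statvfs() in a child process so that only the child blocks.

```python
import subprocess
import sys
import time


def mount_responds(path, timeout=10.0):
    """Return True if a statvfs() on path completes within timeout seconds."""
    # Run the call in a child process so that a hung mount blocks the
    # child, not this script.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = child.poll()
        if code is not None:
            return code == 0  # non-zero means statvfs() raised an error
        time.sleep(0.1)
    # Still blocked: request a kill but do not wait for it, because a
    # process in uninterruptible sleep may not react to the signal until
    # the server answers.
    child.kill()
    return False
```

For example, mount_responds("/mnt/backup") would return False after the timeout instead of hanging the shell the way 'df' does.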

The only way out seems to be to 'kill -9' the cp, shut down NFS on the server
(avatar), and then start it again.  It takes a minute or two, but the clients
recover after that.

Some of the machines being backed up have very large files, although
/eda/cvsroot totals only about 250 MB.  I have already checked for files
larger than 2 GB, and there do not appear to be any.
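
The large-file check can be repeated with a short script. This is a sketch only, assuming the concern is the 2 GB ceiling of signed 32-bit file offsets; the report does not say how the check was actually done, and the name files_over and the LIMIT constant are illustrative.

```python
import os
import stat

# Largest size a signed 32-bit file offset can describe (assumed threshold).
LIMIT = 2**31 - 1


def files_over(root, limit=LIMIT):
    """Yield (path, size) for regular files under root larger than limit bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable
            if stat.S_ISREG(info.st_mode) and info.st_size > limit:
                yield path, info.st_size


if __name__ == "__main__":
    for path, size in files_over("/eda/cvsroot"):
        print(size, path)
```

Run against the source tree on the client, an empty result would match what is reported above.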

I have now staggered the cron.daily execution to spread it out, but it didn't
help.  The smaller machines that went first appeared to be okay, but the largest
machine (the DB) croaked.


The syslog shows the following once, at the beginning of the job:

Oct  3 06:03:18 manganese kernel: nfs: server avatar not responding, still trying

Then, after I start kicking the NFS server (taking it down and bringing it
back up):

Oct  3 10:55:31 manganese kernel: nfs: task 12614 can't get a request slot
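
To pull these kernel NFS messages out of a longer syslog, a small filter along these lines may help. It is a sketch that assumes the classic "Mon DD HH:MM:SS host kernel: nfs: ..." layout seen in the two lines quoted here; the pattern and the name nfs_events are illustrative and not taken from any tool.

```python
import re

# Matches lines such as:
#   Oct  3 06:03:18 manganese kernel: nfs: server avatar not responding, still trying
NFS_EVENT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) "
    r"(?P<host>\S+) kernel: nfs: (?P<message>.*)$"
)


def nfs_events(lines):
    """Return (timestamp, host, message) for each kernel NFS line."""
    events = []
    for line in lines:
        match = NFS_EVENT.match(line.strip())
        if match:
            events.append((match["stamp"], match["host"], match["message"]))
    return events
```

Feeding it the contents of the client's syslog file would list every "not responding" and "can't get a request slot" message with its timestamp, which makes it easier to line the hang up against the backup schedule.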

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
I don't know how to reproduce this in a sterile environment; I don't have that
capability.


Additional info:

Comment 1 Steve Dickson 2004-08-11 11:06:42 UTC

This seems to be fixed in later kernels.
