53645 – nfs hangs when copying files from another RH system

Summary: nfs hangs when copying files from another RH system

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	nfs-server
Sub Component:
Version:	7.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-09-13 19:30 UTC by Bob Lawson
Modified:	2007-04-18 16:37 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-11-27 22:41:16 UTC
Embargoed:

Description Bob Lawson 2001-09-13 19:30:52 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)

Description of problem:
I have two new systems, both running RH 7.1.  I use NFS to mount
file systems between the two systems.  I am trying to copy an entire
filesystem from one machine to the other using cpio.  The copy always
locks up in the same place.

I have tried 2.4.2-2smp as well as 2.4.3-12smp.  It behaves the
same on both kernels.

I have tried the copy using cp -a -v and it still locks up,
although in a different place.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Boot the system and log on.  The mounted file systems are:
nasstor2:staff on /nasstor2/staff
nasstor2:bbx on /nasstor2/bbx
nasstor2:imaging on /nasstor2/imaging

2. cd /nasstor2/staff/prodstaff

3. find . -print | cpio -pdvmu /staff/prodstaff

It copies files for several minutes then locks up.
	

Actual Results:
Sep 13 14:18:55 nasstor3 kernel: nfs: server nasstor2 not responding, still trying
Sep 13 14:18:55 nasstor3 last message repeated 2 times
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29003 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29004 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29005 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29006 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29007 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29008 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29009 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29010 can't get a request slot


Expected Results:  Files would copy until finished.

Additional info:

Once this happens I can get the system back by running the script:
   /etc/rc2.d/K75netfs stop
   sleep 5
   /etc/rc2.d/K75netfs start

When I do this I get:

Sep 13 14:28:37 nasstor3 kernel: nfs_statfs: statfs error = 5
Sep 13 14:28:37 nasstor3 umount: umount2: Device or resource busy
Sep 13 14:28:37 nasstor3 umount: umount: /nasstor2/staff: device is busy
Sep 13 14:28:38 nasstor3 netfs: Unmounting NFS filesystems:  failed
Sep 13 14:28:46 nasstor3 netfs: Unmounting NFS filesystems (retry):  succeeded
Sep 13 14:28:53 nasstor3 netfs: Mounting NFS filesystems:  succeeded
Sep 13 14:28:53 nasstor3 netfs: Mounting other filesystems:  succeeded

Hardware:
00:00.0 Host bridge: ServerWorks CNB20LE (rev 05)
00:00.1 Host bridge: ServerWorks CNB20LE (rev 05)
00:02.0 VGA compatible controller: ATI Technologies Inc 3D Rage IIC 215IIC [Mach64 GT IIC] (rev 7a)
00:03.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
00:0f.0 ISA bridge: ServerWorks OSB4 (rev 4f)
00:0f.1 IDE interface: ServerWorks: Unknown device 0211
01:04.0 SCSI storage controller: Adaptec 7899P
01:04.1 SCSI storage controller: Adaptec 7899P
01:06.0 PCI bridge: Digital Equipment Corporation DECchip 21152 (rev 03)
01:07.0 PCI bridge: Intel Corporation 80960RP [i960 RP Microprocessor/Bridge] (rev 05)
01:07.1 RAID bus controller: Mylex Corporation DAC960PX (rev 05)
01:08.0 PCI bridge: Intel Corporation 80960RP [i960 RP Microprocessor/Bridge] (rev 05)
01:08.1 RAID bus controller: Mylex Corporation DAC960PX (rev 05)
02:04.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 05)
02:05.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 05)

OS: Linux nasstor3 2.4.3-12smp #1 SMP Fri Jun 8 14:38:50 EDT 2001 i686 unknown

This is reproducible and stops in exactly the same spot every time.

Bob Lawson
bobl

Comment 1 Arjan van de Ven 2001-09-14 08:24:22 UTC

While the problem is occurring, can you ping the other host?
(i.e., is networking down?)

Comment 2 Bob Lawson 2001-09-14 11:45:25 UTC

The network is not down as I am accessing both systems using telnet sessions
from a PC and they continue to work.

But I tested it anyway and yes I can ping between the two systems.

Comment 3 Bob Lawson 2001-10-31 18:49:58 UTC

Ok, I have now tried it with version 2.4.9-6smp and I get the
same thing happening.  It runs to the same point and then stops
and starts producing the message:

  nfs: server nasstor2 not responding, still trying

followed later by:

  nfs: task 31401 can't get a request slot
  nfs: task 31402 can't get a request slot
  .....

At this point no data moves between the two systems.

After I interrupt the process I get:

  nfs: server nasstor2 OK

But I still do not get a prompt back.

A df hangs once it hits the mounted filesystems.

Bob
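
The df hang described above can be probed without tying up an interactive shell. The following is a minimal sketch, assuming Python is available on the client; the function name mount_responds and the 5-second default timeout are illustrative choices, not anything from the report. It runs df on a single path in a child process and gives up if the child does not finish in time.

```python
import subprocess


def mount_responds(path, timeout_s=5.0):
    """Return True if `df <path>` exits successfully within timeout_s seconds.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    proc = subprocess.Popen(
        ["df", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # A process blocked on a hard NFS mount may not react to signals,
        # so ask for termination but do not wait for it to exit.
        proc.kill()
        return False
```

Calling mount_responds("/nasstor2/staff") would then return False while the mount is stuck instead of hanging the caller. A child blocked in the kernel may still linger until the server responds again.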

Comment 4 Bob Lawson 2001-10-31 20:41:01 UTC

I have tried the e100 driver instead of the eepro100 driver.

But the problem persists.

Comment 5 Alan Cox 2001-11-01 00:02:23 UTC

You say exactly the same spot. That's an important clue. Do you mean it stops on
the same file each time?

Comment 6 Bob Lawson 2001-11-01 14:40:16 UTC

The transfer ALWAYS stops at exactly the same spot!

Something in the back of my head has been saying... content... content.

Ok... this seems very unlikely but I had to test to see if it was
possible.  I take the two files where the problem happens:
	DSC00036.JPG
	DSC00037.JPG

Cpio prints the first file name but not the second one.  I'm not sure
if cpio prints the name before it copies or after so I took both
files.

1)  I move the files to a different location on the same filesystem but
a location I am not trying to copy.  Then I run my test again and it flies
past the location it stopped at before.

2)  I move the two files to a location which occurs earlier in the copy
process.  In both cases I have moved the file so the inodes and data
allocation should remain untouched and the file is only renamed.

I run the test again and it stops at the same file DSC00036.JPG!
But DSC00036.JPG is now in a totally different directory and very
close to the beginning of the cpio copy.

3)  I once again move the files to an uncopied location.  Now I copy
the files back into a location to be copied.  This should allocate a new
inode and data blocks.

I once again run the test.  And it stops again at exactly the same file!

My thoughts:

It appears to be the contents of the file that are causing the problem.

However I can ftp the file off the system without any problem.  If it
was a problem with the adapter/driver I would expect to see the problem
when I used ftp.  Ok... yes I understand the protocols are different
and the packaging of the information is different.  But I have easily
moved many >1GB files with no problems over these adapters.

But I do have problems with NFS.  I also have problems with SCO<->Linux
NFS.

I still have both the moved and the copied files, so if you
need information from or about the files I can easily get it.
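
One way to narrow down a content-dependent hang like the one described above is to read the suspect file over NFS in fixed-size blocks and record the offset before each read, so the last offset reported shows roughly where the transfer stalls. This is only a sketch under assumptions: the function name read_in_blocks and the 8192-byte block size are illustrative and are not taken from the report, and blocks already cached on the client may be served without touching the network.

```python
def read_in_blocks(path, block_size=8192, report=None):
    """Read a file block by block, reporting each offset before reading it.

    Hypothetical helper: name, block size and callback are assumptions.
    Returns the total number of bytes read.
    """
    if report is None:
        # Flush so the offset is visible even if the next read never returns.
        report = lambda off: print(off, flush=True)
    offset = 0
    # buffering=0 keeps Python from reading ahead of the reported offset.
    with open(path, "rb", buffering=0) as f:
        while True:
            report(offset)
            block = f.read(block_size)
            if not block:
                break
            offset += len(block)
    return offset
```

Run against the NFS path of DSC00036.JPG, the last offset printed before the stall would point at the block whose contents are worth examining.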

Comment 7 Alan Cox 2001-11-01 14:57:25 UTC

Bit-pattern-dependent problems really have to be at the hardware level. They do
happen, oddly quite often, when you get a slightly bad card combined with
bit patterns that are "worst case" for the ethernet clocking algorithm and encoding.

Things worth trying include changing the hub port the card is plugged into (in
case it's a hub problem), and changing the card. Bear in mind it's not clear which
end (or something in the middle) may have problems.

Also check the error counter behaviour on the cards.

Alan
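
The error-counter check suggested above could be done by comparing two snapshots of /proc/net/dev, one taken before the copy and one after it stalls. The sketch below assumes the usual 16-column layout of that file (8 receive fields followed by 8 transmit fields); the function names parse_net_dev and counter_increases are illustrative and not part of any existing tool.

```python
RX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed")
ERROR_KEYS = ("rx_errs", "rx_drop", "rx_fifo", "rx_frame",
              "tx_errs", "tx_drop", "tx_fifo", "tx_colls", "tx_carrier")


def parse_net_dev(text):
    """Map interface name -> error counters from /proc/net/dev contents."""
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, data = line.partition(":")
        values = data.split()
        if len(values) != 16:
            continue  # unexpected layout, skip rather than guess
        numbers = [int(v) for v in values]
        stats = {"rx_" + k: n for k, n in zip(RX_FIELDS, numbers[:8])}
        stats.update({"tx_" + k: n for k, n in zip(TX_FIELDS, numbers[8:])})
        result[name.strip()] = {k: stats[k] for k in ERROR_KEYS}
    return result


def counter_increases(before, after):
    """Return only the error counters that grew between two snapshots."""
    changes = {}
    for iface, now in after.items():
        then = before.get(iface, {})
        grew = {k: v - then.get(k, 0)
                for k, v in now.items() if v > then.get(k, 0)}
        if grew:
            changes[iface] = grew
    return changes
```

Feeding it the file contents captured on both machines before and after a failed copy would show whether receive, frame or carrier errors rise on either end while the transfer is stuck.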
