Red Hat Bugzilla – Bug 53645
nfs hangs when copying files from another RH system
Last modified: 2007-04-18 12:37:06 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)
Description of problem:
I have two new systems, both running RH 7.1. I use NFS to mount
file systems between the two systems. I am trying to copy an entire
filesystem from one machine to the other using cpio. The copy always
locks up in the same place.
I have tried 2.4.2-2smp as well as 2.4.3-12smp. It behaves the
same on both kernels.
I have tried the copy using cp -a -v and it still locks up,
although in a different place.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Boot the system and log on. The mounted file systems are:
nasstor2:staff on /nasstor2/staff
nasstor2:bbx on /nasstor2/bbx
nasstor2:imaging on /nasstor2/imaging
2. cd /nasstor2/staff/prodstaff
3. find . -print | cpio -pdvmu /staff/prodstaff
It copies files for several minutes then locks up.
Sep 13 14:18:55 nasstor3 kernel: nfs: server nasstor2 not responding,
Sep 13 14:18:55 nasstor3 last message repeated 2 times
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29003 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29004 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29005 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29006 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29007 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29008 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29009 can't get a request slot
Sep 13 14:20:44 nasstor3 kernel: nfs: task 29010 can't get a request slo
Expected Results: Files would copy until finished.
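For anyone comparing runs, the kernel messages above can be tallied with a short script. This is only a sketch: the function name `summarize_nfs_messages` is made up for illustration, and the patterns match just the two message forms quoted in this report (including the line that is cut off at "slo").

```python
import re
from collections import Counter

# Patterns for the two NFS client messages quoted in this report.
# The second pattern stops at "slo" so it also matches the truncated line.
NOT_RESPONDING = re.compile(r"nfs: server (\S+) not responding")
NO_SLOT = re.compile(r"nfs: task (\d+) can't get a request slo")


def summarize_nfs_messages(lines):
    """Count NFS client stall messages in an iterable of syslog lines.

    "last message repeated N times" lines are not expanded, so the
    per-server count is a lower bound.
    """
    servers = Counter()
    tasks = set()
    for line in lines:
        match = NOT_RESPONDING.search(line)
        if match:
            servers[match.group(1)] += 1
            continue
        match = NO_SLOT.search(line)
        if match:
            tasks.add(int(match.group(1)))
    return {"not_responding": dict(servers), "blocked_tasks": sorted(tasks)}
```

Feeding it the lines of the system log (for example by opening the log file and passing the file object) gives the servers reported as not responding and the RPC task numbers that were waiting for a request slot.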
Once this happens I can get the system back by running the script:
When I do this I get:
Sep 13 14:28:37 nasstor3 kernel: nfs_statfs: statfs error =
Sep 13 14:28:37 nasstor3 umount: umount2: Device or resource
Sep 13 14:28:37 nasstor3 umount: umount: /nasstor2/staff: device is
Sep 13 14:28:38 nasstor3 netfs: Unmounting NFS filesystems:
Sep 13 14:28:46 nasstor3 netfs: Unmounting NFS filesystems (retry):
Sep 13 14:28:53 nasstor3 netfs: Mounting NFS filesystems:
Sep 13 14:28:53 nasstor3 netfs: Mounting other filesystems: succeeded
00:00.0 Host bridge: ServerWorks CNB20LE (rev 05)
00:00.1 Host bridge: ServerWorks CNB20LE (rev 05)
00:02.0 VGA compatible controller: ATI Technologies Inc 3D Rage IIC 215IIC
64 GT IIC] (rev 7a)
00:03.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100]
00:0f.0 ISA bridge: ServerWorks OSB4 (rev 4f)
00:0f.1 IDE interface: ServerWorks: Unknown device 0211
01:04.0 SCSI storage controller: Adaptec 7899P
01:04.1 SCSI storage controller: Adaptec 7899P
01:06.0 PCI bridge: Digital Equipment Corporation DECchip 21152 (rev 03)
01:07.0 PCI bridge: Intel Corporation 80960RP [i960 RP
01:07.1 RAID bus controller: Mylex Corporation DAC960PX (rev 05)
01:08.0 PCI bridge: Intel Corporation 80960RP [i960 RP
01:08.1 RAID bus controller: Mylex Corporation DAC960PX (rev 05)
02:04.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100]
02:05.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100]
OS: Linux nasstor3 2.4.3-12smp #1 SMP Fri Jun 8 14:38:50 EDT 2001 i686
This is reproducible and stops in exactly the same spot every time.
During the "problem time", can you ping the other host?
(e.g. is networking down?)
The network is not down as I am accessing both systems using telnet sessions
from a PC and they continue to work.
But I tested it anyway and yes I can ping between the two systems.
Ok, I have now tried it with version 2.4.9-6smp and I get the
same thing happening. It runs to the same point and then stops
and starts producing the message:
nfs: server nasstor2 not responding, still trying
followed later by:
nfs: task 31401 can't get a request slot
nfs: task 31402 can't get a request slot
At this point no data moves between the two systems.
After I interrupt the process I get:
nfs: server nasstor2 OK
But I still do not get a prompt back.
A df hangs once it hits the mounted filesystems.
I have tried the e100 driver instead of the eepro100 driver.
But the problem persists.
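Since df hangs on the stalled mounts, it can help to probe a mount point from a throwaway child process so the login shell itself never blocks. The following is a minimal sketch, not a fix: `probe_mount` is a hypothetical helper, and it assumes a Python interpreter is available on the client. A child stuck in an uninterruptible NFS wait may not die when killed, which is why the parent only polls and never waits on it.

```python
import subprocess
import sys
import time


def probe_mount(path, timeout=5.0):
    """Return "ok", "error", or "hung" for a statvfs() call on path.

    The statvfs() runs in a child process, so a stalled NFS mount
    blocks the child rather than the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = child.poll()
        if code is not None:
            return "ok" if code == 0 else "error"
        time.sleep(0.05)
    # Best effort only: a process blocked inside the NFS client may
    # ignore the signal until the server answers again.
    child.kill()
    return "hung"
```

Calling `probe_mount("/nasstor2/staff")` during the problem time would be expected to report "hung" if the mount is stalled the way df suggests, while a local path such as "/" should report "ok".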
You say exactly the same spot. That's an important clue. You mean it stops on the
same file each time?
The transfer ALWAYS stops at exactly the same spot!
Something in the back of my head has been saying... content... content.
Ok... this seems very unlikely but I had to test to see if it was
possible. I take the two files where the problem happens:
cpio prints the first file name but not the second one. I'm not sure
if cpio prints the name before or after it copies, so I took both.
1) I move the files to a different location on the same filesystem, but
one I am not trying to copy. Then I run my test again and it flies
past the location it stopped at before.
2) I move the two files to a location which occurs earlier in the copy
process. In both cases I have moved the file so the inodes and data
allocation should remain untouched and the file is only renamed.
I run the test again and it stops at the same file DSC00036.JPG!
But DSC00036.JPG is now in a totally different directory and very
close to the beginning of the cpio copy.
3) I once again move the files to an uncopied location. Now I copy
the files back into a location to be copied. This should allocate a new
inode and data blocks.
I once again run the test. And it stops again at exactly the same file!
It appears to be the contents of the file that are causing the problem.
However I can ftp the file off the system without any problem. If it
was a problem with the adapter/driver I would expect to see the problem
when I used ftp. Ok... yes I understand the protocols are different
and the packaging of the information is different. But I have easily
moved many >1GB files with no problems over these adapters.
But I do have problems with NFS. I also have problems with SCO<->Linux
I still have the file, both the moved and copied files so if you
need information from or about the files I can easily get it.
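One way to narrow this down further is to read the suspect file over the NFS mount in small pieces and note the last offset reached before the stall. The sketch below is illustrative only: `read_in_chunks` is a hypothetical helper, the 4096-byte chunk size is an arbitrary choice, and the kernel's own read-ahead and caching mean the reported offset is approximate. The digest it returns can be compared against the copy fetched by ftp to confirm both paths deliver the same bytes.

```python
import hashlib


def read_in_chunks(path, chunk_size=4096, report=print):
    """Read path sequentially, reporting the offset before every read.

    Returns (total_bytes, sha256_hex). If a read stalls, the last
    reported offset shows roughly where in the file it happened.
    """
    digest = hashlib.sha256()
    offset = 0
    # buffering=0 gives unbuffered reads, so Python does not fetch
    # more than chunk_size bytes per call on its own.
    with open(path, "rb", buffering=0) as handle:
        while True:
            report(f"reading at offset {offset}")
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            offset += len(chunk)
    return offset, digest.hexdigest()
```

Running it once against the file on the NFS mount and once against the ftp copy on local disk gives two digests to compare, plus a rough position for any stall.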
Bit pattern dependent problems really have to be at the hardware level. They do
happen, obscurely quite often, when you get a slightly bad card combined with
bit patterns that are "worst case" for the Ethernet clocking algorithm and encoding.
Things worth trying include changing the hub port the card is plugged into (in
case it's a hub problem), and changing the card. Bear in mind it's not clear which
end (or something in the middle) may have the problem.
Also check the error counter behaviour on the cards
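A simple way to watch the error counters is to snapshot /proc/net/dev before and after a failing copy and see which counters grew. The sketch below assumes the usual 16-column /proc/net/dev layout (8 receive fields, then 8 transmit fields); `parse_net_dev` and `increased_counters` are names invented for this example.

```python
def parse_net_dev(text):
    """Extract error-related counters per interface from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        # Split on the colon first: the name and the first number can
        # run together when the byte count is wide.
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        counters[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "rx_frame": int(fields[5]),
            "tx_errs": int(fields[10]),
            "tx_colls": int(fields[13]),
            "tx_carrier": int(fields[14]),
        }
    return counters


def increased_counters(before, after):
    """Return {iface: {counter: delta}} for counters that grew."""
    deltas = {}
    for iface, now in after.items():
        then = before.get(iface, {})
        grown = {
            key: value - then.get(key, 0)
            for key, value in now.items()
            if value > then.get(key, 0)
        }
        if grown:
            deltas[iface] = grown
    return deltas
```

Reading /proc/net/dev into a string on each host before starting the cpio run and again after the stall, then passing both parsed snapshots to `increased_counters`, shows whether frame, carrier, or collision errors climb on either end while the problem file is in flight.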