Bug 1100941 - NFSv4 hang after directory copy with "cp -a"
Summary: NFSv4 hang after directory copy with "cp -a"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 19
Hardware: i686
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: nfs-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-05-24 05:30 UTC by Terry Barnaby
Modified: 2014-06-24 05:50 UTC
CC List: 33 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-06-24 05:50:04 UTC
Type: Bug
Embargoed:


Attachments
Network trace (22.21 KB, application/x-gzip)
2014-05-24 12:12 UTC, Terry Barnaby

Description Terry Barnaby 2014-05-24 05:30:03 UTC
On an NFSv4 client system, if you copy a directory containing a few files with "cp -a <dir1> <dir2>" and then, inside <dir2>, copy one of the files to a new name, that second "cp" will hang.
Doing a "cp -r" does not have this problem, so it probably has to do with NFSv4 file attributes. There are obviously other cases of file hangs.
I also see some GUI clients occasionally not being able to write a file saying something like "resource temporarily unavailable".

This is with kernel-3.14.4-100.fc19.i686.PAE on both systems.
It happens all the time and on two different networked environments (home and work).

Comment 1 Steve Dickson 2014-05-24 10:50:31 UTC
(In reply to Terry Barnaby from comment #0)
> On an NFSv4 client system, if you have a directory with a few files and then
> run "cp -a <dir1> <dir2>" and then in <dir2> copy one of the files to a new
> name the file copy "cp" will hang.
> Doing a "cp -r" does not have this problem, so it probably is todo with
> NFSv4 file attributes. There are obviously other cases of file hangs.
> I also see some GUI clients occasionally not being able to write a file
> saying something like "resource temporarily unavailable".
> 
> This is with kernel-3.14.4-100.fc19.i686.PAE on both systems.
> It happens all the time and on two different networked environments (home
> and work).
Who is the server, and would it be possible to get a binary network trace using
tshark -w /tmp/data.pcap ; bzip2 /tmp/data.pcap?
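
A capture of the kind requested above could be scripted roughly as follows. This is only a sketch: the function name capture_nfs, the interface name, the 60-second duration and the output path are placeholders, and it assumes the NFSv4 traffic uses TCP port 2049. It compresses with gzip, which matches the attachment type on this report.

```shell
# Sketch: capture NFS traffic into a binary pcap file, then compress it.
# Assumptions (not from the report): interface name, duration, output path,
# and that the NFSv4 traffic uses TCP port 2049.
capture_nfs() {
    cap_iface="${1:-eth0}"           # interface facing the NFS server
    cap_secs="${2:-60}"              # stop automatically after this many seconds
    cap_out="${3:-/tmp/data.pcap}"   # binary capture file

    # -i: interface, -f: capture filter, -a duration:N: autostop, -w: write raw packets
    tshark -i "$cap_iface" -f "port 2049" -a "duration:$cap_secs" -w "$cap_out" || return 1

    # Compress in place; this produces "$cap_out.gz" and removes the original.
    gzip -f "$cap_out"
}
```

Start it with something like "capture_nfs eth0 60 /tmp/data.pcap", run the failing copy while the capture is active, and attach the resulting .gz file.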

Comment 2 Terry Barnaby 2014-05-24 12:11:58 UTC
The servers and clients are all Fedora 19, updated to current versions.
Kernel versions 3.13.x were fine as far as I was aware.
I know that delegation code was added in 3.14.x ... (See Bug 1082586)

I attach a network trace. This is during:

cp -a tmp tmp2
cd tmp2
ls
cp -a t1.cpp t1.cpp.3
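
The steps above can be wrapped in a small script that reports a hang instead of blocking forever. This is a sketch under assumptions: the function name repro_cp_hang, the 30-second limit and the file contents are placeholders, and it relies on timeout(1) from GNU coreutils. Note that it deletes any existing tmp and tmp2 inside the directory you give it, so point it at a scratch directory on the NFSv4 mount.

```shell
# Sketch: replay the reproduction steps in a scratch directory and report
# a hang instead of blocking forever. Run it inside an NFSv4 mount.
# Assumptions: timeout(1) from GNU coreutils; the limit and the file
# contents are placeholders.
repro_cp_hang() (
    scratch="${1:?usage: repro_cp_hang <dir-on-nfs-mount> [seconds]}"
    limit="${2:-30}"

    cd "$scratch" || exit 2
    rm -rf tmp tmp2                       # start from a clean state
    mkdir tmp || exit 2
    echo 'int main() { return 0; }' > tmp/t1.cpp

    cp -a tmp tmp2 || exit 2
    cd tmp2 || exit 2
    ls

    # Send TERM after $limit seconds, then KILL 5 seconds later if cp is still stuck.
    timeout -k 5 "$limit" cp -a t1.cpp t1.cpp.3
    status=$?
    case "$status" in
        0)       echo "OK: copy completed" ;;
        124|137) echo "HANG: cp did not finish within ${limit}s" ;;
        *)       echo "cp failed with status $status" ;;
    esac
    exit "$status"
)
```

The body runs in a subshell, so the caller's working directory is left unchanged. On a local filesystem the copy should simply complete; the hang is only expected on an affected NFSv4 mount.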

Comment 3 Terry Barnaby 2014-05-24 12:12:52 UTC
Created attachment 898873 [details]
Network trace

Comment 4 Steve Dickson 2014-05-27 10:57:27 UTC
(In reply to Terry Barnaby from comment #3)
> Created attachment 898873 [details]
> Network trace

Something is going on here... The server seems to be returning a "[Malformed Packet]" when the client sends a SETATTR setting an ACL. Then the server starts replying with an NFS4ERR_DELAY on the final open. Interesting...

Comment 5 J. Bruce Fields 2014-05-27 18:08:31 UTC
If I look at frame 188 with the wireshark in F20 (wireshark-gnome-1.10.7-1.fc20.x86_64), I see "[Malformed Packet]".  With a version I built myself from development (wireshark-1.11.11-5455-g58bb472), it parses fine.  Looking at the bytes, I'm pretty sure the latter is right, so this is just a bug in F20's wireshark.

Comment 6 J. Bruce Fields 2014-05-27 18:28:03 UTC
So the hang probably starts with the write OPEN in frame 341, which gets an NFS4ERR_DELAY in frame 342.
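
For anyone wanting to find such replies in a saved trace, a filter like the following may help. It is a sketch: the function name list_delay_replies is made up, and it assumes the display-filter field for the NFSv4 status is called nfs.nfsstat4 in your Wireshark build. NFS4ERR_DELAY is status value 10008 in the NFSv4 protocol.

```shell
# Sketch: list the frames in a saved trace whose NFSv4 status is NFS4ERR_DELAY (10008).
# Assumption: the display-filter field is named nfs.nfsstat4 in your Wireshark build.
list_delay_replies() {
    trace="${1:?usage: list_delay_replies <file.pcap>}"
    # -r: read a saved capture, -Y: display filter (older tshark releases used -R)
    tshark -r "$trace" -Y 'nfs.nfsstat4 == 10008'
}
```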

It may be a delegation problem, in which case you can work around it with "echo 0 >/proc/sys/fs/leases-enable" on the server.  It'd be interesting to know if that helps.

What's odd is it's an OPEN4_CREATE/EXCLUSIVE4 open for directory/filename 0x26c7788f/t1.cpp.3, following just a millisecond or so after a LOOKUP of the same thing in frame 334 which got an NFS4ERR_NOENT reply in frame 335.

So unless there's something else going on at the same time (e.g. a process on the server that just jumped in and created a file under that name), the OPEN that's returning an NFS4ERR_DELAY is an open of a newly created file.

It's not attempting to set any attributes here (an EXCLUSIVE4 open can't).

It could also be interesting to know whether the file does in fact exist or not at this point.  I guess one way to check would be to watch for the hanging create with either wireshark or strace, then check on the server side to see if a file with that name already exists.
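
The check suggested above might look like this. It is a sketch with hypothetical names: trace_copy and server_file_state are made-up helpers, /tmp/cp.strace is a placeholder log path, and the path passed on the server side must be the server-local path of the exported directory.

```shell
# Client side: log the file-related system calls made by the copy, so the
# open/create call on the new file can be located in the log.
trace_copy() {
    # -f: follow children, -tt: timestamps, -e trace=file: calls that take a file name
    strace -f -tt -e trace=file -o "${3:-/tmp/cp.strace}" cp -a "$1" "$2"
}

# Server side: report whether the target name exists in the exported directory.
# The path passed in is the server-local path (hypothetical in this sketch).
server_file_state() {
    if [ -e "$1" ]; then
        echo "exists: $(ls -ld "$1")"
    else
        echo "absent: $1"
    fi
}
```

On the client, run "trace_copy t1.cpp t1.cpp.3"; while it hangs, run server_file_state on the server with the corresponding path under the export.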

Comment 7 Steve Dickson 2014-05-27 18:51:08 UTC
(In reply to J. Bruce Fields from comment #6)
> So the hang probably starts with the write OPEN in frame 341, which gets a
> NFS4ERR_DELAY in frame 342.
> 
> It may be a delegation problem, in which case you can work around it with
> "echo 0 >/proc/sys/fs/leases-enable" on the server.  It'd be interesting to
> know if that helps.
It did not, at least in my testing.

> 
> What's odd is it's an OPEN4_CREATE/EXCLUSIVE4 open for directory/filename
> 0x26c7788f/t1.ccp.3, following just a millisecond or so after a LOOKUP of
> the same thing in frame 334 which got an NFS4ERR_NOENT reply in frame 335.
> 
> So unless there's something else going on at the same time (e.g. a process
> on the server that just jumped in and created a file under that name), the
> OPEN that's returning a NFS4ERR_DELAY is an open of a newly created file.
> 
> It's not attempting to set any attributes here (an EXCLUSIVE4 open can't).
> 
> It could also be interesting to know whether the file does in fact exist or
> not at this point.  I guess one way to check would be to watch for the
> hanging create with either wireshark or strace, then check on the server
> side to see if a file with that name already exists.
This is a bit bizarre. When copying directories with

cp -a tmp1 tmp2

the NFS4ERR_DELAY happens only when tmp2 already exists.

It's just the opposite for files:

cp -a t1.cpp t1.cpp.2

hangs only when t1.cpp.2 does not exist.

Bizarro!! :-)

Comment 8 Terry Barnaby 2014-06-05 05:22:43 UTC
Is anything happening with this bug? It is pretty serious for any NFS network server ...

Comment 9 J. Bruce Fields 2014-06-06 21:14:14 UTC
(In reply to Steve Dickson from comment #7)
> (In reply to J. Bruce Fields from comment #6)
> > So the hang probably starts with the write OPEN in frame 341, which gets a
> > NFS4ERR_DELAY in frame 342.
> > 
> > It may be a delegation problem, in which case you can work around it with
> > "echo 0 >/proc/sys/fs/leases-enable" on the server.  It'd be interesting to
> > know if that helps.
> It did not, at least in my testing.

Note you probably need to turn off leases before starting the NFS server.
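
In other words, something along these lines on the server, as root. This is a sketch: the function name is made up, and "nfs-server" is assumed to be the systemd unit name for the NFS server on this Fedora release. Both the knob path and the unit name can be overridden.

```shell
# Sketch: disable file leases while the NFS server is stopped, then start it again.
# Assumptions: run as root on the server; "nfs-server" is the systemd unit name.
disable_leases_and_restart_nfs() {
    knob="${1:-/proc/sys/fs/leases-enable}"
    unit="${2:-nfs-server}"

    systemctl stop "$unit" || return 1
    echo 0 > "$knob" || return 1          # 0 turns leases off
    systemctl start "$unit" || return 1
    echo "leases-enable is now $(cat "$knob")"
}
```

The setting does not survive a reboot; writing 1 back to the same file turns leases on again.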

> > What's odd is it's an OPEN4_CREATE/EXCLUSIVE4 open for directory/filename
> > 0x26c7788f/t1.ccp.3, following just a millisecond or so after a LOOKUP of
> > the same thing in frame 334 which got an NFS4ERR_NOENT reply in frame 335.
> > 
> > So unless there's something else going on at the same time (e.g. a process
> > on the server that just jumped in and created a file under that name), the
> > OPEN that's returning a NFS4ERR_DELAY is an open of a newly created file.
> > 
> > It's not attempting to set any attributes here (an EXCLUSIVE4 open can't).
> > 
> > It could also be interesting to know whether the file does in fact exist or
> > not at this point.  I guess one way to check would be to watch for the
> > hanging create with either wireshark or strace, then check on the server
> > side to see if a file with that name already exists.
> This is a bit bizarre....  when doing the directories
> 
> cp -a tmp1 tmp2 the NFS4ERR_DELAY will happen only when tmp2 exists. 
> 
> Its just the opposite for the files
> 
> cp -a  t1.cpp t1.cpp.2 will only hang when t1.cpp.2 does not exist. 
> 
> 
> bizarro!! :-)

Yeah, I don't have an explanation yet.

What filesystem are you exporting?  Do you see anything interesting in the logs when this happens?

Comment 10 Terry Barnaby 2014-06-24 05:50:04 UTC
This now appears to have been fixed by an update in the Fedora 19 kernel, 3.14.7-100.fc19.i686.PAE.

