Bug 1075266 - Client gets stuck when accessing a random file on a NFSv4 share
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: nfs-utils
Version: 19
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-03-11 21:54 UTC by Rosario Esposito
Modified: 2015-02-17 20:02 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-02-17 20:02:18 UTC
Type: Bug
Embargoed:


Attachments
nfsv4 client-server loop (915.07 KB, application/x-gzip)
2014-03-11 21:54 UTC, Rosario Esposito

Description Rosario Esposito 2014-03-11 21:54:33 UTC
Created attachment 873243
nfsv4 client-server loop

Description of problem:
I have ~50 Fedora 19 clients mounting /home from a NexentaStor NFSv4 file server.
Processes on the clients frequently get stuck when accessing a random file. When this happens, tcpdump shows a strange NFS client-server communication loop (see the attached dump file), something like this:

...
...
22:13:20.867199 IP 192.168.0.94.1384140474 > 192.168.0.214.nfs: 172 getattr fh 0,0/22
22:13:20.867311 IP 192.168.0.214.nfs > 192.168.0.94.1384140474: reply ok 52 getattr ERROR: unk 10025
22:13:20.867326 IP 192.168.0.94.1400917690 > 192.168.0.214.nfs: 172 getattr fh 0,0/22
22:13:20.867425 IP 192.168.0.214.nfs > 192.168.0.94.1400917690: reply ok 52 getattr ERROR: unk 10025
22:13:20.867439 IP 192.168.0.94.1417694906 > 192.168.0.214.nfs: 172 getattr fh 0,0/22
...
...   


On the stuck client, any command that touches that particular file (cat, more, stat) hangs forever. All other files on the same NFS share remain accessible.

The stuck file is perfectly accessible from other clients.
In fact, if I copy the file to a new file from another client, remove the old file, and rename the new file to the old name, the file becomes accessible again from the stuck client.

I have also seen this happen several times with .history files:
a tcsh process gets stuck on clientX when accessing .history.
When I run the following from clientY:
  cp -a .history .history.new
  rm -f .history
  mv .history.new .history
tcsh on clientX gets unstuck.
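
For anyone who wants to script it, the same three steps can be sketched in Python. This is only an illustration of the commands above, not a fix; the function name replace_with_fresh_copy is made up for this example, and shutil.copy2 preserves mode and timestamps but not ownership, so it only approximates cp -a. It has to be run from a client that can still read the file.

```python
import os
import shutil


def replace_with_fresh_copy(path):
    """Copy path to path + '.new', delete the original, rename the copy back.

    Hypothetical helper mirroring the cp / rm / mv sequence from the report.
    """
    tmp = path + ".new"
    shutil.copy2(path, tmp)  # roughly 'cp -a' (mode and timestamps, not ownership)
    os.remove(path)          # 'rm -f' on the original file
    os.rename(tmp, path)     # 'mv' the copy back to the original name
```

Calling replace_with_fresh_copy on the affected .history file from a working client performs the same sequence. The file that ends up under the old name is a newly created one, which is presumably why the stuck client can open it again.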

Version-Release number of selected component (if applicable):
nfs-utils-1.2.8-6.3.fc19.x86_64
kernel 3.13.5-101.fc19.x86_64

How reproducible:
Not easily reproducible...

Comment 1 Rosario Esposito 2014-03-11 22:04:38 UTC
Additional info:

[root@clueet94 ~]# mount | grep home
eetnex-vol1:/volumes/vol1/home on /home type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.94,local_lock=none,addr=192.168.0.214,_netdev)
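
To compare these options across many clients, the option string can be split into a dictionary. A minimal sketch, assuming the comma-separated format shown above; parse_mount_options is a made-up helper name, not part of nfs-utils.

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into a dict.

    Options without '=' (for example 'hard') are stored as True.
    """
    opts = {}
    for item in option_string.split(","):
        key, sep, value = item.partition("=")
        opts[key] = value if sep else True
    return opts


# Option string copied from the mount output above.
options = parse_mount_options(
    "rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,"
    "proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.94,"
    "local_lock=none,addr=192.168.0.214,_netdev"
)
print(options["vers"], options["hard"], options["timeo"])  # prints: 4.0 True 600
```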

Comment 2 J. Bruce Fields 2014-03-11 22:06:12 UTC
The client is sending PUTFH+READ, the server is responding with NFS4ERR_BAD_STATEID on the READ, then the client is resending the PUTFH+READ with the same stateid.
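
For reference, the "unk 10025" that tcpdump prints in the excerpt from the description is the numeric value of NFS4ERR_BAD_STATEID. A rough sketch of how one might count such replies in a text dump follows; the helper name count_error_replies is invented for this example, the status values are taken from the NFSv4 specification, and the regular expression assumes the "ERROR: unk N" format shown in the excerpt.

```python
import re

# NFSv4 stateid-related status codes (values from the NFSv4 specification).
NFS4_STATUS = {
    10023: "NFS4ERR_STALE_STATEID",
    10024: "NFS4ERR_OLD_STATEID",
    10025: "NFS4ERR_BAD_STATEID",
}

# tcpdump prints statuses it cannot name as "ERROR: unk <number>".
UNKNOWN_STATUS = re.compile(r"ERROR: unk (\d+)")


def count_error_replies(lines):
    """Return {status_name: count} for every 'ERROR: unk N' found in lines."""
    counts = {}
    for line in lines:
        match = UNKNOWN_STATUS.search(line)
        if match is None:
            continue
        code = int(match.group(1))
        name = NFS4_STATUS.get(code, f"unknown({code})")
        counts[name] = counts.get(name, 0) + 1
    return counts


# Lines copied from the tcpdump excerpt in the description.
sample = [
    "22:13:20.867199 IP 192.168.0.94.1384140474 > 192.168.0.214.nfs: 172 getattr fh 0,0/22",
    "22:13:20.867311 IP 192.168.0.214.nfs > 192.168.0.94.1384140474: reply ok 52 getattr ERROR: unk 10025",
    "22:13:20.867326 IP 192.168.0.94.1400917690 > 192.168.0.214.nfs: 172 getattr fh 0,0/22",
    "22:13:20.867425 IP 192.168.0.214.nfs > 192.168.0.94.1400917690: reply ok 52 getattr ERROR: unk 10025",
]
print(count_error_replies(sample))  # prints: {'NFS4ERR_BAD_STATEID': 2}
```

A large and steadily growing count for a single status within a short capture is consistent with the resend loop described here.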

Not sure if there's a known issue.

Worth retrying with the latest F20 if possible, as the client's recovery from state-related errors is something that seems to have been getting a lot of work.

Comment 3 Rosario Esposito 2014-03-11 22:23:35 UTC
I can't upgrade all clients to F20 on short notice.
Is there anything else I could try to mitigate the problem?
A kernel upgrade, or an nfs-utils package upgrade?

Comment 4 J. Bruce Fields 2014-03-12 00:25:54 UTC
It's the kernel that's the relevant part, so, yes, if you have an easy way to upgrade the clients' kernels, the results might be interesting.

Comment 5 Rosario Esposito 2014-03-12 17:47:06 UTC
Hi,
I updated one of the clients to 3.13.5-103.fc19.x86_64, which seems to be the latest available for F19/F20, and the problem is still there.
Any ideas?

Comment 6 Mace Moneta 2014-06-23 19:39:01 UTC
I was having this problem. I installed:

kernel-3.16.0-0.rc2.git0.1.fc21
kernel-core-3.16.0-0.rc2.git0.1.fc21
kernel-modules-3.16.0-0.rc2.git0.1.fc21

from Koji (http://koji.fedoraproject.org/koji/buildinfo?buildID=539593), and I can no longer recreate the issue.

Comment 7 Fedora End Of Life 2015-01-09 21:13:21 UTC
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 8 Fedora End Of Life 2015-02-17 20:02:18 UTC
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

