Bug 901723

Summary: gnfs: E [nfs3.c:1545:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: seen during ltp on 6.3 client.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ben Turner <bturner>
Component: glusterd
Assignee: santosh pradhan <spradhan>
Status: CLOSED ERRATA
QA Contact: Ben Turner <bturner>
Severity: medium
Priority: medium
Version: 2.0
CC: bturner, grajaiya, kkeithle, rhs-bugs, saujain, shaines, vagarwal, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2013-09-23 22:39:24 UTC
Type: Bug
Attachments:
Sosreport from storage1 (flags: none)
Sosreport from storage2 (flags: none)
Sosreport from storage3 (client) (flags: none)

Description Ben Turner 2013-01-18 21:44:56 UTC
Description of problem:

During LTP tests I am seeing the following errors in nfs.log on the server node that the client mounts from:

[2013-01-18 15:28:50.729072] E [nfs3.c:1545:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (10.8.0.15:752) REPLICATED : 5c495aeb-ffde-4b24-bbe3-e48e3c81e144
[2013-01-18 15:28:50.742044] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: f1ee3076, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2013-01-18 15:28:50.744025] E [nfs3.c:1545:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (10.8.0.15:752) REPLICATED : 5c495aeb-ffde-4b24-bbe3-e48e3c81e144
[2013-01-18 15:28:50.744057] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: f2ee3076, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2013-01-18 15:28:50.745529] E [nfs3.c:1545:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (10.8.0.15:752) REPLICATED : 5c495aeb-ffde-4b24-bbe3-e48e3c81e144
[2013-01-18 15:28:50.745557] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: f3ee3076, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2013-01-18 15:28:50.746100] E [nfs3.c:1545:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (10.8.0.15:752) REPLICATED : 5c495aeb-ffde-4b24-bbe3-e48e3c81e144
[2013-01-18 15:28:50.746128] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: f4ee3076, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
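
A minimal sketch of how the repeated errors above could be tallied per file handle, assuming nfs.log lines keep exactly the layout quoted in this report. The regular expression and the function name `count_unresolved_handles` are illustrative assumptions, not part of GlusterFS.

```python
import re
from collections import Counter

# Pattern modelled only on the "Unable to resolve FH" lines quoted in this
# report; other GlusterFS versions may log a different layout (assumption).
FH_ERROR = re.compile(
    r"^\[(?P<ts>[^\]]+)\] E \[nfs3\.c:\d+:(?P<func>\w+)\] \S+: "
    r"Unable to resolve FH: \((?P<client>[^)]+)\) "
    r"(?P<volume>\S+) : (?P<gfid>[0-9a-f-]{36})"
)


def count_unresolved_handles(lines):
    """Count FH resolution errors, keyed by (volume, gfid)."""
    counts = Counter()
    for line in lines:
        match = FH_ERROR.match(line.strip())
        if match:
            counts[(match["volume"], match["gfid"])] += 1
    return counts


# Three lines copied from the log excerpt in this report.
sample = [
    "[2013-01-18 15:28:50.729072] E [nfs3.c:1545:nfs3_access_resume] 0-nfs-nfsv3: "
    "Unable to resolve FH: (10.8.0.15:752) REPLICATED : 5c495aeb-ffde-4b24-bbe3-e48e3c81e144",
    "[2013-01-18 15:28:50.742044] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: "
    "XID: f1ee3076, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)",
    "[2013-01-18 15:28:50.744025] E [nfs3.c:1545:nfs3_access_resume] 0-nfs-nfsv3: "
    "Unable to resolve FH: (10.8.0.15:752) REPLICATED : 5c495aeb-ffde-4b24-bbe3-e48e3c81e144",
]

for (volume, gfid), hits in count_unresolved_handles(sample).items():
    print(f"{volume} {gfid}: {hits} unresolved-FH errors")
```

In the excerpt every error names the same handle, so a tally like this would show a single key with a growing count rather than many distinct handles.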

This is a 6.3 client with the latest EUS kernel mounting a replicated volume over NFS:

Volume Name: REPLICATED
Type: Replicate
Volume ID: 1443a320-90fa-423b-a3e3-54715380ea64
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: storage-qe01.lab.eng.rdu2.redhat.com:/brick1
Brick2: storage-qe02.lab.eng.rdu2.redhat.com:/brick1


Version-Release number of selected component (if applicable):

glusterfs-3.3.0.5rhs-40.el6rhs.x86_64

How reproducible:

Every run.

Steps to Reproduce:
1.  Install ltp and mount replicated volume over NFS
2.  Run ltp
3.  Check nfs.log on node that is getting mounted
  
Actual results:

Error messages in logs when ltp is run.

Expected results:

No errors when test is run.

Additional info:

Will update the BZ with a deeper dive shortly; sosreports attached.

Comment 1 Ben Turner 2013-01-18 21:59:36 UTC
Created attachment 682764 [details]
Sosreport from storage1

Comment 2 Ben Turner 2013-01-18 22:00:11 UTC
Created attachment 682765 [details]
Sosreport from storage2

Comment 3 Ben Turner 2013-01-18 22:00:55 UTC
Created attachment 682766 [details]
Sosreport from storage3(client)

Comment 5 Ben Turner 2013-01-29 22:00:59 UTC
I tested this on every client version from 5.6 to 6.4 and saw this behavior on all of them.  Today I reran ltp to get to the bottom of which test was causing the errors.  I haven't found which test produces the "Unable to resolve FH" error, but I am seeing some really strange behavior with:

time $LTP_DIR/fsstress/fsstress -d /gluster-mount -l 22 -n 22 -p 22

When I run it I see warnings spam the logs:

[2013-01-29 16:40:50.485398] W [client3_1-fops.c:187:client3_1_symlink_cbk] 0-DISTRIBUTED-client-1: remote operation failed: File name too long. Path: /p8/d3/l9 (00000000-0000-0000-0000-000000000000)
[2013-01-29 16:40:50.485436] W [nfs3.c:2939:nfs3svc_symlink_cbk] 0-nfs: 9646c2ba: /p8/d3/l9 => -1 (File name too long)
[2013-01-29 16:40:50.486733] W [client3_1-fops.c:187:client3_1_symlink_cbk] 0-DISTRIBUTED-client-1: remote operation failed: File name too long. Path: /p8/d3/l9 (00000000-0000-0000-0000-000000000000)
[2013-01-29 16:40:50.486765] W [nfs3.c:2939:nfs3svc_symlink_cbk] 0-nfs: 9746c2ba: /p8/d3/l9 => -1 (File name too long)

I picked one example and looked at it:

[2013-01-29 16:40:50.197716] W [nfs3.c:3391:nfs3svc_remove_cbk] 0-nfs: 1f42c2ba: /run1089/p7/d3/f5 => -1 (No such file or directory)

On the /gluster-mount mount point I cd to the directory:

[root@storage-qe04 d3]# pwd
/gluster-mount/run1089/p7/d3

And I try to remove the file:

[root@storage-qe04 d3]# rm f5 
rm: remove regular file `f5'? y
rm: cannot remove `f5': No such file or directory

Now I check ll and I still see the file:

[root@storage-qe04 d3]# ll
total 0
-rw-rw-rw-. 1 root root 579411 Jan 29 16:24 f5

I tried unmounting and remounting the FS and still saw the same thing:

[root@storage-qe04 gluster-mount]# cd /gluster-mount/run1089/p7/d3
[root@storage-qe04 d3]# ls
f5
[root@storage-qe04 d3]# rm f5 
rm: remove regular file `f5'? y
rm: cannot remove `f5': No such file or directory

So I went on the backend bricks and looked:

[root@storage-qe01 d3]# pwd
/brick1/run1089/p7/d3
[root@storage-qe01 d3]# ll
total 0

[root@storage-qe02 d3]# pwd
/brick1/run1089/p7/d3
[root@storage-qe02 d3]# ll
total 0

The file was not on either brick but was still showing on the client.  I went ahead and mounted from a different client:

[root@storage-qe12 ~]# mount -t nfs -o mountproto=tcp,vers=3 storage-qe01.lab.eng.rdu2.redhat.com:/DISTRIBUTED $(mkdir /test-mount; echo /test-mount)
[root@storage-qe12 ~]# cd /test-mount/run1089/p7/d3
[root@storage-qe12 d3]# ll
total 0
-rw-rw-rw-. 1 root root 579411 Jan 29 16:24 f5

The file exists even on a client that is mounting for the first time. 
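
The manual comparison above (client listing versus brick listings) can be sketched as a small helper. This is only an illustration: the name `entries_missing_on_bricks` is hypothetical, and it assumes the client mount and the brick directories are all readable from one host, which was not the case in this report (the listings would otherwise have to be collected per host). The demo uses temporary directories as stand-ins for the real paths.

```python
import os
import tempfile


def entries_missing_on_bricks(client_dir, brick_dirs):
    """Return names listed in client_dir that no brick directory contains."""
    client_names = set(os.listdir(client_dir))
    brick_names = set()
    for brick in brick_dirs:
        brick_names.update(os.listdir(brick))
    return sorted(client_names - brick_names)


# Demo with stand-in directories: the "client" view lists f5,
# while both "bricks" are empty, as observed in this report.
with tempfile.TemporaryDirectory() as root:
    client = os.path.join(root, "client")
    brick1 = os.path.join(root, "brick1")
    brick2 = os.path.join(root, "brick2")
    for path in (client, brick1, brick2):
        os.mkdir(path)
    with open(os.path.join(client, "f5"), "w"):
        pass
    demo_result = entries_missing_on_bricks(client, [brick1, brick2])

print(demo_result)
```

Any name this returns is visible to the NFS client but backed by no brick, which matches the f5 entry that could be listed but not removed.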

I am pretty sure that the LTP test case causing the FH error is the same one I am running, but after executing the whole test suite I don't see the FH error again.  I will try running just fsstress tomorrow and see if I hit the FH error.

Comment 6 santosh pradhan 2013-08-08 07:11:42 UTC
Hi Ben,

1. The "Unable to resolve FH" error was addressed as part of BZ 960835. The fix is available in the latest RHS-2.1 build (bigbend).

2. The "File name too long" message in the log is expected, because the underlying file system (XFS or ext2/3/4) does not support file names longer than 255 bytes. The tool tries to create a symlink of 1024 characters, which the symlink() syscall rejects. That is OK.
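
As a rough local check of the name-length limit mentioned in point 2, the sketch below asks the file system for its maximum file name length and then tries to create a name one character longer. It only covers the limit on a single name component; which limit fsstress actually hit in the logs is not confirmed by this report. The helper names `name_limit` and `try_create` are hypothetical, and the scratch directory is a temporary directory rather than a Gluster brick.

```python
import errno
import os
import tempfile


def name_limit(directory):
    """Ask the file system holding `directory` for its maximum name length."""
    return os.pathconf(directory, "PC_NAME_MAX")


def try_create(directory, length):
    """Create a file with a name of `length` characters; return errno, or 0 on success."""
    path = os.path.join(directory, "a" * length)
    try:
        with open(path, "w"):
            pass
    except OSError as exc:
        return exc.errno
    return 0


with tempfile.TemporaryDirectory() as scratch:
    limit = name_limit(scratch)
    over_limit = try_create(scratch, limit + 1)

print("reported name limit:", limit)
print("over-limit create rejected with ENAMETOOLONG:", over_limit == errno.ENAMETOOLONG)
```

ENAMETOOLONG is the errno that the C library renders as "File name too long", the same text that appears in the symlink warnings.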

I could not reproduce the issue in 3.4.0.13rhs-1 build.

Could you confirm?

Thanks,
Santosh

Comment 7 Ben Turner 2013-08-12 17:01:50 UTC
Verified that the FH issue is resolved on glusterfs-3.4.0.18rhs-1.el6rhs.x86_64.

Comment 8 Vivek Agarwal 2013-08-12 17:07:31 UTC
Thanks Ben

Comment 9 Scott Haines 2013-09-23 22:39:24 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

Comment 10 Scott Haines 2013-09-23 22:43:43 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html