Bug 763456 (GLUSTER-1724)

Summary: kernel untar fails during add-brick
Product: [Community] GlusterFS Reporter: Lakshmipathi G <lakshmipathi>
Component: distribute Assignee: Shehjar Tikoo <shehjart>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: low    
Version: 3.1-alpha CC: gluster-bugs
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: RTP Mount Type: nfs
Attachments:
Description Flags
nfs log
none
nfs-log-adding single brick
none
NFS trace log giles none

Description Amar Tumballi 2010-09-28 05:48:01 UTC
Lakshmi, can I see the NFS server log file?

Also, can you try running tar from outside the target directory? You can specify the target path with the -C flag.

Comment 1 Lakshmipathi G 2010-09-28 06:46:40 UTC
Created attachment 318 [details]
cutting of the /etc/group file, for the groups involved in the problem

Comment 2 Lakshmipathi G 2010-09-28 07:34:55 UTC
On the NFS client mount point, untar the kernel and add 2 more bricks to the existing 2 DHT bricks. The untar fails with the following error:
-------
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/bridge-regs.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/debug-macro.S
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/entry-macro.S
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/gpio.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/hardware.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/io.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/irqs.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/memory.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/mv78xx0.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/system.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/timex.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/uncompress.h
linux-2.6.35/arch/arm/mach-mv78xx0/include/mach/vmalloc.h
linux-2.6.35/arch/arm/mach-mv78xx0/irq.c
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: Too many errors, quitting
tar: Error is not recoverable: exiting now
-------

Initially ls gave a "Stale NFS" message, but after changing directory and returning, it started working.

----------
[root@ip-10-212-117-143 client5]# pwd
/mnt/client5
[root@ip-10-212-117-143 client5]# ls
ls: cannot open directory .: Stale NFS file handle
[root@ip-10-212-117-143 client5]# ls -ltr
ls: cannot open directory .: Stale NFS file handle

[root@ip-10-212-117-143 client5]# cd ..
[root@ip-10-212-117-143 mnt]# cd client1
[root@ip-10-212-117-143 client1]# ls
NFS.SH  run24289
[root@ip-10-212-117-143 client1]# cd ../client5
[root@ip-10-212-117-143 client5]# ls
linux-2.6.35  linux-2.6.35.tar
---------------

Comment 3 Lakshmipathi G 2010-09-30 02:40:51 UTC
Created attachment 320 [details]
blah.  this one should work better; wrong version of diff last time.

Comment 4 Lakshmipathi G 2010-09-30 02:43:45 UTC
(In reply to comment #3)
> Created an attachment (id=320) [details]
> nfs-log-adding single brick

Adding a single brick to the existing 2-DHT setup also has this issue.


---
linux-2.6.35/Documentation/filesystems/ecryptfs.txt
linux-2.6.35/Documentation/filesystems/exofs.txt
linux-2.6.35/Documentation/filesystems/ext2.txt
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: linux-2.6.35.tar: Cannot read: Stale NFS file handle
tar: Too many errors, quitting
tar: Error is not recoverable: exiting now
[root@ip-10-245-210-193 mnt]# ls
--

Comment 5 shishir gowda 2010-10-01 09:11:05 UTC
Created attachment 324

Comment 6 Lakshmipathi G 2010-10-02 07:55:15 UTC
Tested with qa38; untar still fails.

Comment 7 Shehjar Tikoo 2010-10-04 02:27:31 UTC
The NFS server needs to special-case the transport end-point errors because otherwise the server ends up returning an EIO, which may be causing the errors we're seeing here.

[2010-10-01 17:25:13.672120] T [rpc-clnt.c:1216:rpc_clnt_record] : Auth Info: pid: 0, uid: 0, gid: 0, owner: 1287
[2010-10-01 17:25:13.672139] T [rpc-clnt.c:1116:rpc_clnt_record_build_header] rpc-clnt: Request fraglen 156, payload: 28, rpc hdr: 128
[2010-10-01 17:25:13.672361] T [rpc-clnt.c:1388:rpc_clnt_submit] rpc-clnt: submitted request (XID: 0x4c0 Program: GlusterFS 3.1, ProgVers: 310, Proc: 16) to rpc-transport (new-client-1)
[2010-10-01 17:25:13.676783] D [glusterfsd-mgmt.c:650:glusterfs_mgmt_pmap_signout] fsd-mgmt: portmapper signout arguments not given
[2010-10-01 17:25:13.676815] I [glusterfsd.c:668:cleanup_and_exit] glusterfsd: shutting down
[2010-10-01 17:25:13.676831] D [nfs.c:845:fini] nfs: NFS service going down
[2010-10-01 17:25:13.677073] D [rpcsvc.c:2771:nfs_rpcsvc_program_unregister] nfsrpc: Program unregistered: MOUNT3, Num: 100005, Ver: 3, Port: 38465
[2010-10-01 17:25:13.677200] D [rpcsvc.c:2771:nfs_rpcsvc_program_unregister] nfsrpc: Program unregistered: MOUNT1, Num: 100005, Ver: 1, Port: 38466
[2010-10-01 17:25:13.677332] D [rpcsvc.c:2771:nfs_rpcsvc_program_unregister] nfsrpc: Program unregistered: NFS3, Num: 100003, Ver: 3, Port: 38467
[2010-10-01 17:25:13.677356] I [io-stats.c:1680:fini] new: io-stats translator unloaded
[2010-10-01 17:25:13.677883] T [socket.c:2569:fini] new-client-1: transport 0xce2648 destroyed
[2010-10-01 17:25:13.677911] D [rpc-clnt.c:489:rpc_clnt_connection_cleanup] rpc-clnt: cleaning up state in transport object 0xce2648
[2010-10-01 17:25:13.677945] E [rpc-clnt.c:338:saved_frames_unwind] rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(FSYNC(16)) called at 2010-10-01 17:25:13.672359
[2010-10-01 17:25:13.677968] D [dht-common.c:1480:dht_fsync_cbk] new-dht: subvolume new-client-1 returned -1 (Transport endpoint is not connected)
[2010-10-01 17:25:13.677998] T [write-behind.c:442:wb_sync] new-write-behind: no vectors are to besynced
[2010-10-01 17:25:13.678028] D [nfs3-helpers.c:2446:nfs3_log_commit_res] nfs-nfsv3: XID: 5cc1172e, COMMIT: NFS: 10006(Error occurred on the server or IO Error), POSIX: 107(Transport endpoint is not connected), wverf: 1285934079

Comment 8 Shehjar Tikoo 2010-10-04 02:30:27 UTC
NFS receives this error because the error arrives before any child-down event is received for this volume.
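
To make the special-casing idea in comments 7 and 8 concrete, here is a minimal sketch of an errno-to-NFSv3-status mapping that treats ENOTCONN separately. It is not the patch discussed in this bug. The helper name, the enum names, and the choice of NFS3ERR_JUKEBOX (a "retry later" status) for ENOTCONN are all assumptions made for illustration; only the numeric status values come from RFC 1813.

```c
#include <errno.h>

/* Subset of NFSv3 status codes; numeric values are from RFC 1813.
 * The SK_ prefix marks these as names local to this sketch. */
enum sk_nfsstat3 {
    SK_NFS3_OK              = 0,
    SK_NFS3ERR_IO           = 5,
    SK_NFS3ERR_STALE        = 70,
    SK_NFS3ERR_SERVERFAULT  = 10006,
    SK_NFS3ERR_JUKEBOX      = 10008
};

/* Hypothetical helper: translate the POSIX errno returned by a
 * subvolume into the status sent back to the NFS client. */
static enum sk_nfsstat3 sk_errno_to_nfsstat3(int op_errno)
{
    switch (op_errno) {
    case 0:
        return SK_NFS3_OK;
    case ENOTCONN:
        /* Assumption: a transient transport failure is reported as
         * "try again later" instead of a hard I/O or server fault. */
        return SK_NFS3ERR_JUKEBOX;
    case ESTALE:
        return SK_NFS3ERR_STALE;
    case EIO:
        return SK_NFS3ERR_IO;
    default:
        return SK_NFS3ERR_SERVERFAULT;
    }
}
```

With a mapping like this, the COMMIT in the log above would no longer surface as status 10006 for POSIX error 107. Whether a retry status is the right answer for a brick that is being reconfigured is a design question this sketch does not settle.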

Comment 9 Lakshmipathi G 2010-10-04 04:27:54 UTC
Applied this patch to qa39:
http://dev.gluster.com/~shehjart/0001-nfs-nfs3-Disable-subvolume-on-ENOTCONN.patch

Untar fails when adding 2 bricks. The NFS trace log can be found at /share/tickets/1724

Comment 10 Lakshmipathi G 2010-10-04 08:37:10 UTC
> 
> untar fails when adding 2 bricks. nfs trace log can be found at
> /share/tickets/1724

That's the wrong NFS server log. I have now moved the correct NFS server log, along with a tcpdump, to /share/tickets/1724/logs

Comment 11 Shehjar Tikoo 2010-10-04 11:24:48 UTC
Upgrading to critical. Trace inspection through Wireshark shows that a file is receiving two different inode numbers from the NFS server.

Comment 12 Shehjar Tikoo 2010-10-04 11:30:27 UTC
Dump file showing the differing inode numbers for the same file handle is at dev:/share/tickets/1724/logs/dump3.bin

The trace starts with a lookup request at number 11 with the reply at 12. The file handle returned is 0x450a00c4 and the fileid returned is 67141636.

Much later, at the getattr request at number 10481 with the reply at 10483, we receive a fileid of 100712454 for the same file handle.

This is what is causing the read request failure for the Linux tar file on the same mount point.
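
The inconsistency described above can be checked mechanically. This is a minimal sketch, assuming the (file handle, fileid) pairs have already been extracted from the capture by hand. The struct and function names are hypothetical, and the 32-bit handle is just the value quoted in this comment, not a full NFSv3 file handle.

```c
#include <stddef.h>
#include <stdint.h>

/* One (file handle, fileid) pair as seen in a reply in the trace. */
struct fh_obs {
    uint32_t fh;      /* handle value as quoted in the comment */
    uint64_t fileid;  /* fileid attribute returned with it */
};

/* Returns the index of the first observation whose fileid disagrees
 * with an earlier observation of the same handle, or -1 if every
 * handle maps to a single fileid. Quadratic, which is acceptable for
 * a hand-extracted list. */
static long first_fileid_mismatch(const struct fh_obs *obs, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (obs[j].fh == obs[i].fh && obs[j].fileid != obs[i].fileid)
                return (long)i;
        }
    }
    return -1;
}
```

Feeding it the lookup reply (0x450a00c4, 67141636) followed by the getattr reply (0x450a00c4, 100712454) flags the second entry, which is the condition that makes the client treat the handle as stale.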

Comment 13 Shehjar Tikoo 2010-10-04 11:38:39 UTC
It's a bug in nfs3-helpers.c:nfs3_stat_to_fattr3

        fa.fileid = buf->ia_ino;


ia_ino needs to be filled using the gfid, which is not being done right now.
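
One way to derive a stable 64-bit number from a 16-byte gfid is to fold its two halves together. The sketch below assumes XOR folding with big-endian byte order. The function name is hypothetical and this is not necessarily the conversion used by the patches listed in the following comments.

```c
#include <stdint.h>

/* Hypothetical helper: fold a 16-byte gfid into a 64-bit fileid by
 * XOR-ing its upper and lower halves. Bytes are read in big-endian
 * order so the result does not depend on host endianness. */
static uint64_t sk_gfid_to_fileid(const unsigned char gfid[16])
{
    uint64_t hi = 0;
    uint64_t lo = 0;

    for (int i = 0; i < 8; i++) {
        hi = (hi << 8) | gfid[i];
        lo = (lo << 8) | gfid[8 + i];
    }
    return hi ^ lo;
}
```

Because the gfid of a file is the same regardless of which brick serves it, a fileid derived this way stays constant across add-brick, unlike a backend inode number. Folding 128 bits into 64 can in principle produce collisions between distinct files.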

Comment 14 Vijay Bellur 2010-10-04 13:26:03 UTC
PATCH: http://patches.gluster.com/patch/5247 in master (nfs,nfs3: Disable subvolume on ENOTCONN)

Comment 15 Vijay Bellur 2010-10-04 13:26:08 UTC
PATCH: http://patches.gluster.com/patch/5248 in master (nfs3: Convert gfid into inode number)

Comment 16 Vijay Bellur 2010-10-05 07:42:59 UTC
PATCH: http://patches.gluster.com/patch/5275 in master (nfs3: Convert gfid to ino only for non-root)

Comment 17 Vijay Bellur 2010-10-07 09:09:03 UTC
PATCH: http://patches.gluster.com/patch/5337 in master (nfs: Revert downed-subvolume changes)

Comment 18 Vijay Bellur 2010-10-07 09:09:07 UTC
PATCH: http://patches.gluster.com/patch/5338 in master (nfs3: Fix gfid to ino conversion)