Bug 1282763 - Detach tier + NFS : Cannot open: Invalid argument errors
Detach tier + NFS : Cannot open: Invalid argument errors
Status: CLOSED WORKSFORME
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tier (Show other bugs)
3.1
Unspecified Unspecified
urgent Severity urgent
: ---
: RHGS 3.1.2
Assigned To: Soumya Koduri
nchilaka
: ZStream
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-11-17 06:00 EST by Bhaskarakiran
Modified: 2016-11-23 18:11 EST (History)
8 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-03 03:59:24 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Bhaskarakiran 2015-11-17 06:00:29 EST
Description of problem:
=======================

Did a linux untar from nfs mount, while it is in progress, did detach tier of disperse volume, writes are failing with "Cannot open: Invalid argument" errors.

linux-4.1.1/arch/x86/include/uapi/asm/socket.h
linux-4.1.1/arch/x86/include/uapi/asm/sockios.h
linux-4.1.1/arch/x86/include/uapi/asm/stat.h
linux-4.1.1/arch/x86/include/uapi/asm/statfs.h
linux-4.1.1/arch/x86/include/uapi/asm/svm.h
linux-4.1.1/arch/x86/include/uapi/asm/swab.h
linux-4.1.1/arch/x86/include/uapi/asm/termbits.h
linux-4.1.1/arch/x86/include/uapi/asm/termios.h
linux-4.1.1/arch/x86/include/uapi/asm/types.h
linux-4.1.1/arch/x86/include/uapi/asm/ucontext.h
linux-4.1.1/arch/x86/include/uapi/asm/unistd.h
linux-4.1.1/arch/x86/include/uapi/asm/vm86.h
linux-4.1.1/arch/x86/include/uapi/asm/vmx.h
linux-4.1.1/arch/x86/include/uapi/asm/vsyscall.h
linux-4.1.1/arch/x86/kernel/
tar: linux-4.1.1/arch/x86/kernel: Cannot mkdir: Invalid argument
linux-4.1.1/arch/x86/kernel/.gitignore
tar: linux-4.1.1/arch/x86/kernel/.gitignore: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/Makefile
tar: linux-4.1.1/arch/x86/kernel/Makefile: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/acpi/
tar: linux-4.1.1/arch/x86/kernel/acpi: Cannot mkdir: Invalid argument
linux-4.1.1/arch/x86/kernel/acpi/Makefile
tar: linux-4.1.1/arch/x86/kernel/acpi/Makefile: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/acpi/apei.c
tar: linux-4.1.1/arch/x86/kernel/acpi/apei.c: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/acpi/boot.c
tar: linux-4.1.1/arch/x86/kernel/acpi/boot.c: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/acpi/cstate.c
tar: linux-4.1.1/arch/x86/kernel/acpi/cstate.c: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/acpi/sleep.c
tar: linux-4.1.1/arch/x86/kernel/acpi/sleep.c: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/acpi/sleep.h
tar: linux-4.1.1/arch/x86/kernel/acpi/sleep.h: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/acpi/wakeup_32.S
tar: linux-4.1.1/arch/x86/kernel/acpi/wakeup_32.S: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/acpi/wakeup_64.S
tar: linux-4.1.1/arch/x86/kernel/acpi/wakeup_64.S: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/alternative.c
tar: linux-4.1.1/arch/x86/kernel/alternative.c: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/amd_gart_64.c
tar: linux-4.1.1/arch/x86/kernel/amd_gart_64.c: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/amd_nb.c
tar: linux-4.1.1/arch/x86/kernel/amd_nb.c: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/apb_timer.c
tar: linux-4.1.1/arch/x86/kernel/apb_timer.c: Cannot open: Invalid argument
linux-4.1.1/arch/x86/kernel/aperture_64.c

Version-Release number of selected component (if applicable):
=============================================================
3.7.5-6

[root@transformers ~]# gluster --version
glusterfs 3.7.5 built on Nov 10 2015 00:13:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@transformers ~]# 

How reproducible:
=================
100%

Steps to Reproduce:
1. Create an ec volume and attach a 2x2 dist-rep tier volume
2. Run linux untar 2 instances
3. Now detach tier volume and check the untar progress

Actual results:
===============
writes fail with invalid argument errors.

Expected results:
=================
no writes should fail.

Additional info:
================
sosreports will be copied to rhsqe.
Comment 3 Soumya Koduri 2015-11-20 05:40:13 EST
I have looked at the logs & code and below is my analysis -- 


Lets take one of the file which threw EINVAL error -

linux-4.1.1/arch/x86/kernel/Makefile
tar: linux-4.1.1/arch/x86/kernel/Makefile: Cannot open: Invalid argument


I see many of the below messages for many of the files  -
nfs.log:[2015-11-17 12:35:46.902641] W [MSGID: 112199] [nfs3-helpers.c:3404:nfs3_log_common_res] 0-nfs-nfsv3: <gfid:39c55c27-491d-4ded-b622-209d9d2372df>/kernel/Makefile => (XID: 8091a77d, REMOVE: NFS: 22(Invalid argument for operation), POSIX: 22(Invalid argument))

da5db9) [No such file or directory]
nfs.log:[2015-11-17 10:14:07.920627] W [MSGID: 112199] [nfs3-helpers.c:3404:nfs3_log_common_res] 0-nfs-nfsv3: /tarball/linux-4.1.1/arch/x86/kernel/Makefile => (XID: 1aad777d, CREATE: NFS: 17(File exists), POSIX: 0(Success)) [Invalid argument]
nfs.log:[2015-11-17 10:14:07.924004] E [MSGID: 109031] [dht-linkfile.c:306:dht_linkfile_setattr_cbk] 0-vol1-tier-dht: Failed to set attr uid/gid on /tarball/linux-4.1.1/arch/x86/kernel/Makefile :<gfid:00000000-0000-0000-0000-000000000000>  [No such file or directory]



Not sure why REMOVE and dht failures happened (errors which occurred in lower layers(dht) or brick process)

But if we consider the error message regarding CREATE -
nfs.log:[2015-11-17 11:41:42.895450] W [MSGID: 112199] [nfs3-helpers.c:3404:nfs3_log_common_res] 0-nfs-nfsv3: <gfid:d6e018d6-d489-489a-abb2-c3eae33c1029>/linux-4.1.1/arch/x86/kernel/Makefile => (XID: 7db7867d, CREATE: NFS: 17(File exists), POSIX: 0(Success)) [Invalid argument]


From the code, looks like we get this error in "nfs3svc_create_stat_cbk" during NFSv3 Exclusive create.

From NFSv3 RFC,
During Exclusive create, NFS client passes a verifier which are stored as file's mtime & atime. In case if the request is re-transmitted and server finds that the file is already created, it verifies if its re-transmitted request by matching the file's mtime & atime with the verifier passed.

So I am assuming, nfs client tried to exclusively create these files and maybe by the time it got response from the server, detach-tier would have happened causing NFS server to restart and file may have got rebalanced changing file's mtime & ctime. Now when the NFS client re-transmits that request, server's validation if the input verifier matches file's mtime & ctime would fail causing this error.
Comment 4 Soumya Koduri 2015-11-20 06:33:20 EST
But the above mentioned theory doesn't explain why mkdir would fail with EINVAL.

Anyways to confirm EEXIST error, could you please collect packet trace when detach tier is being performed. 

Also we need to check what the other error messages would mean (REMOVE failure especially)
Comment 5 Soumya Koduri 2015-11-24 04:12:44 EST
I have performed detach-tier operations for couple of times now. Haven't seen any errors but ran into hang issue bug1282771. 

As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1282771#c2 , kindle re-test this scenario after disabling throttling and capture the pkt trace if the issue still persist.
Comment 7 Bhaskarakiran 2015-11-30 04:17:15 EST
Disabled throttling and able to reproduce the issue. Provided the packet trace to DEV.
Comment 8 Soumya Koduri 2015-11-30 04:42:03 EST
(gdb) bt
#0  svc_lookup_cbk (frame=frame@entry=0x7fd0204ed64c, cookie=0x7fd0204f95c8, this=0x7fd010025b00, op_ret=op_ret@entry=-1, op_errno=22, inode=0x7fd00d832930, buf=buf@entry=0x7fd009d898ec, xdata=0x7fd01fee4410, 
    postparent=postparent@entry=0x7fd009d89b1c) at snapview-client.c:270
#1  0x00007fd0149f069b in dht_lookup_selfheal_cbk (frame=0x7fd0204f95c8, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, xdata=<optimized out>) at dht-common.c:213
#2  0x00007fd0149e40fb in dht_selfheal_dir_finish (frame=0x7fd0204f95c8, this=<optimized out>, ret=ret@entry=-1) at dht-selfheal.c:120
#3  0x00007fd0149e50b5 in dht_selfheal_layout_lock_cbk (frame=<optimized out>, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, xdata=<optimized out>) at dht-selfheal.c:323
#4  0x00007fd0149d8f1f in dht_inodelk_done (lock_frame=0x7fd0204c3468) at dht-helper.c:1520
#5  0x00007fd0149d8f39 in dht_inodelk_cleanup_cbk (frame=<optimized out>, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, xdata=<optimized out>) at dht-helper.c:1532
#6  0x00007fd0149d8f1f in dht_inodelk_done (lock_frame=0x7fd020518b00) at dht-helper.c:1520
#7  0x00007fd0149d8fad in dht_unlock_inodelk_cbk (frame=0x7fd020518b00, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, xdata=<optimized out>) at dht-helper.c:1601
#8  0x00007fd014a1571f in dht_inodelk_cbk (frame=0x7fd0204c00ac, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=0, xdata=<optimized out>) at dht-inode-read.c:1145
#9  0x00007fd0229ba1e3 in default_inodelk_cbk (frame=0x7fd020508b04, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, xdata=0x0) at defaults.c:1177
#10 0x00007fd014c57a95 in ec_manager_inodelk (fop=0x7fd008a31ef8, state=<optimized out>) at ec-locks.c:653
#11 0x00007fd014c4d47b in __ec_manager (fop=0x7fd008a31ef8, error=0) at ec-common.c:1935
#12 0x00007fd014c4d628 in ec_resume (fop=0x7fd008a31ef8, error=0) at ec-common.c:289
#13 0x00007fd014c4d76c in ec_complete (fop=0x7fd008a31ef8) at ec-common.c:362
#14 0x00007fd014c55399 in ec_inodelk_cbk (frame=<optimized out>, cookie=0xb, this=0x7fd01001b5b0, op_ret=0, op_errno=0, xdata=0x0) at ec-locks.c:552
#15 0x00007fd014eb702d in client3_3_inodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fd0204e075c) at client-rpc-fops.c:1626
#16 0x00007fd022780b20 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fd0103292f0, pollin=pollin@entry=0x7fcfff9b4f30) at rpc-clnt.c:766
#17 0x00007fd022780ddf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd010329320, event=<optimized out>, data=0x7fcfff9b4f30) at rpc-clnt.c:907
#18 0x00007fd02277c913 in rpc_transport_notify (this=this@entry=0x7fd010338fd0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fcfff9b4f30) at rpc-transport.c:545
#19 0x00007fd0175ba4c6 in socket_event_poll_in (this=this@entry=0x7fd010338fd0) at socket.c:2236
#20 0x00007fd0175bd3b4 in socket_event_handler (fd=fd@entry=38, idx=idx@entry=28, data=0x7fd010338fd0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2349
#21 0x00007fd022a1389a in event_dispatch_epoll_handler (event=0x7fd00b788e80, event_pool=0x7fd024e9cd10) at event-epoll.c:575
#22 event_dispatch_epoll_worker (data=0x7fd010095d60) at event-epoll.c:678
#23 0x00007fd02181adc5 in start_thread () from /lib64/libpthread.so.0
#24 0x00007fd0211611cd in clone () from /lib64/libc.so.6
(gdb) f 15
#15 0x00007fd014eb702d in client3_3_inodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fd0204e075c) at client-rpc-fops.c:1626
1626	        CLIENT_STACK_UNWIND (inodelk, frame, rsp.op_ret,
(gdb) p rsp.op_ret
$1 = 0
(gdb) p rsp.op_errno
$2 = 0
(gdb) f 10
#10 0x00007fd014c57a95 in ec_manager_inodelk (fop=0x7fd008a31ef8, state=<optimized out>) at ec-locks.c:653
653	                    fop->cbks.finodelk(fop->req_frame, fop, fop->xl,
(gdb) l
648	            }
649	            else
650	            {
651	                if (fop->cbks.finodelk != NULL)
652	                {
653	                    fop->cbks.finodelk(fop->req_frame, fop, fop->xl,
654	                                       cbk->op_ret, cbk->op_errno, cbk->xdata);
655	                }
656	            }
657	
(gdb) p cbk->op_ret
$7 = 0
(gdb) p cbk->op_errno
$8 = 0
(gdb)
(gdb) f 4
#4  0x00007fd0149d8f1f in dht_inodelk_done (lock_frame=0x7fd0204c3468) at dht-helper.c:1520
1520	        inodelk_cbk (main_frame, NULL, main_frame->this, local->lock.op_ret,
(gdb)

Looks like some where in the dht stack, there is a failure reported. CCin Nithya who is looking at it. Thanks!
Comment 9 Bhaskarakiran 2015-12-03 03:59:24 EST
With 3.7.5-8 build i am not seeing these error messages after detach tier. Closing this for now. Will reopen if its seen again.
Comment 10 nchilaka 2015-12-21 08:23:14 EST
Even though we are not able to reproduce, this bug can be a potential risk for the release

Note You need to log in before you can comment on or make changes to this bug.