Bug 763704 (GLUSTER-1972)
Summary: | xcs get doesn't work with gNFS | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Harshavardhana <fharshav>
Component: | nfs | Assignee: | Shehjar Tikoo <shehjart>
Status: | CLOSED CURRENTRELEASE | QA Contact: |
Severity: | high | Docs Contact: |
Priority: | urgent | |
Version: | 3.1.0 | CC: | allen, cww, gluster-bugs, lakshmipathi, vijay
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | | Type: | ---
Regression: | RTP | Mount Type: | nfs
Documentation: | DA | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Harshavardhana
2010-10-18 22:07:48 UTC
The logs that you mailed have the following error:

[2010-10-13 14:43:25.230265] I [client-handshake.c:535:client_setvolume_cbk] vol1-client-3: Connected to 172.25.123.238:24009, attached to remote volume '/export/glu'.
[2010-10-13 14:43:31.510623] E [nfs3.c:673:nfs3_getattr] nfs-nfsv3: Failed to map FH to vol
[2010-10-13 14:43:31.511398] E [nfs3.c:673:nfs3_getattr] nfs-nfsv3: Failed to map FH to vol
[2010-10-13 14:43:32.870050] E [nfs3.c:673:nfs3_getattr] nfs-nfsv3: Failed to map FH to vol

This is not related to the locks problem I mentioned over IM. Looking into it.

Harsha,
Did the user manually restart the gluster nfs process before running xcs? For example, say a pid of 12345 belongs to a gluster nfs process started by glusterd. Did the user kill this process and then restart the gluster nfs process alone, by copy-pasting the same command on the command line?

(In reply to comment #2)
> Did the user manually restart the gluster nfs process before running xcs? [...]

I restarted it with the TRACE log level myself, since setting the volume option "diagnostics.client-log-level TRACE" through the gluster command didn't work. After restarting, the user was connecting to this server with trace enabled.

(In reply to comment #3)
> I restarted it with the TRACE log level myself [...]

And did this error start occurring after you restarted, or was it occurring even before restarting?
> And this error started occurring after you restarted or was it occurring even
> before restarting?
Even before; I restarted to see if it would go away.
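(For reference, the volume-option command that comment #3 reports as not working at the time would have looked roughly like this; the volume name `vol1` is an assumption taken from the log lines above.)

```
gluster volume set vol1 diagnostics.client-log-level TRACE
```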
(In reply to comment #5)
> Even before; I restarted to see if it would go away.

There is an email against this issue from me 2-3 days back, with dmesg logs from the NFS client taken with the log level set to 65535 in nfs_debug.

(In reply to comment #5)
> Even before; I restarted to see if it would go away.

Try this:
1. Mount the nfs server.
2. Run xcs (will fail).
3. ls <mount-point> (should fail too).

I want to know whether anything else works on the same mount-point on which xcs fails. Log in TRACE if possible.

> Try this: [...]

This has been done 100 times; nothing fails except the application. Right after the app fails we can do mkdir, rmdir, touch, echo "1" > <somefile>, all normal operations.

> I want to know whether anything else works on the same mount-point on which xcs fails.

Everything else works. Did you happen to look at the nfs client dmesg logs that I forwarded you?

(In reply to comment #8)
> This has been done 100 times; nothing fails except the application. [...]
> Everything else works.

Those are not helpful. The only thing left to do now is to run tcpdump. I need to see the difference between the filehandles being sent by xcs and those for the operations that succeed after xcs. Here are the steps:
1. tcpdump -i <interface-name> -s 0 -w /tmp/xcs-fail.dump
2. Mount the nfs server.
3. Run xcs.
4. ls <mount-point>
5. mkdir -p <mountpoint>/dir1/dir2/dir3
6. Unmount.
7. Kill tcpdump.
8. bzip2 /tmp/xcs-fail.dump and attach it here.

PATCH: http://patches.gluster.com/patch/5546 in master (nfs: Fix volume-id option declaration)

http://dev.gluster.com/~support/xcs-fail.dump.bz2 -- find the client-side tcpdump, taken following each of the steps mentioned. Let me know if it helps.

(In reply to comment #11)
> http://dev.gluster.com/~support/xcs-fail.dump.bz2 [...]

Harsha, looking at the trace, I don't think Dean ran into the error this time. The trace does not have the errors seen in the shell output above. Did he run it with the TRACE log level? That could help. Do you think you can get me access to the system?

He can also try running xcs under strace. That will tell me which exact syscall fails, as and when xcs does fail. Do you know if it fails consistently? And send us the output of dmesg after xcs fails.

Harsha, is this a 32-bit machine where xcs is run? The "Invalid argument" error in the xcs output is probably a consequence of the following call in the strace output:

LINE365: readlink("/devl/swtools/xcs", 0x7fff87701600, 4095) = -1 EINVAL (Invalid argument)

It looks like xcs, or whoever is calling readlink, thinks /devl/swtools/xcs is a symlink, but it is not, and /devl/ is not mounted on nfs. Next, I am going to access the test machines to find out why the same app works fine with other nfs servers and how the readlink affects xcs behaviour with those servers.
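For context, readlink(2) fails with EINVAL when the path exists but is not a symbolic link, which matches the analysis above. A minimal sketch, assuming the path exists as a regular file (the path is reused from the strace line purely for illustration):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    /* readlink(2) fails with EINVAL when the target is a regular file
     * or directory rather than a symlink (ENOENT if it doesn't exist). */
    ssize_t n = readlink("/devl/swtools/xcs", buf, sizeof(buf) - 1);
    if (n < 0) {
        printf("readlink: %s\n", strerror(errno));
        return 1;
    }
    buf[n] = '\0';
    printf("symlink target: %s\n", buf);
    return 0;
}
```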
Had a call with Dean; the problem looks similar to bug 1996. Looking into the kernel source for CentOS 5.4 / 2.6.18-164.el5 to confirm whether it is exactly the same.

As in 1996, the problem stems from gnfs returning 64-bit inodes, but this time the problem is slightly different. In 1996, the client system was 32-bit and bonnie was built without large-file support. Here, the client is 64-bit, but the module that calls getdents is built without large-file support. In this case, the readdir reply is handled by compat_filldir() in fs/compat.c, which contains the "kernel compatibililty routines for e.g. 32 bit syscall support on 64 bit kernels":

```c
static int compat_filldir(void *__buf, const char *name, int namlen,
                          loff_t offset, u64 ino, unsigned int d_type)
{
    struct compat_linux_dirent __user *dirent;
    struct compat_getdents_callback *buf = __buf;
    compat_ulong_t d_ino;                       /* <-- 32-bit inode field */
    int reclen = COMPAT_ROUND_UP(NAME_OFFSET(dirent) + namlen + 2);

    buf->error = -EINVAL;   /* only used if we fail.. */
    if (reclen > buf->count)
        return -EINVAL;
    d_ino = ino;
    if (sizeof(d_ino) < sizeof(ino) && d_ino != ino)
        return -EOVERFLOW;                      /* <-- 64-bit inode rejected */
    ...
```

In the kernel, compat_ulong_t is:

    typedef u32 compat_ulong_t;

and so a 32-bit syscall, i.e. getdents, ends up returning -EINVAL to apps. (A userspace sketch of this truncation check appears at the end of this report.)

PATCH: http://patches.gluster.com/patch/5603 in master (nfs: Introduce nfs.enable-ino32 to support legacy 32-bit only apps)

Allen, a patch has been submitted. The fix provides support for legacy apps via a new option in the nfs volume section:

    option nfs.enable-ino32 on

Please let the user know about 3.1.1qa2, which will contain this fix. Thanks.

(In reply to comment #18)
> Allen, a patch has been submitted. [...]

Shehjar, I am curious why these options are not provided through the cli yet; even rpc.auth is not provided by the cli. auth.allow will not make sense for only protocol/server; it should be the same for nfs/server.

Shehjar,

The fix in 3.1.1QA3 does fix the xcs issue Xilinx has. However, we found that since we added the option directly to /etc/glusterd/nfs/nfs-server.vol, a restart of the volume will overwrite the nfs-server.vol file. So I think we need it in the form of a cli option, as Harsha said, to keep it persistent.

The customer has asked me to relay this info as a must-fix.

(In reply to comment #20)
> So I think we need it in the form of a cli option, as Harsha said, to keep it
> persistent.

Fix on the way.

PATCH: http://patches.gluster.com/patch/5632 in master (mgmt/Glusterd: add nfs.enable-ino32 as an option to set from CLI)

PATCH: http://patches.gluster.com/patch/5724 in master (volgen: clean up 0fbf226c (... add nfs.enable-ino32 as an option ...))

Doc available for the enable-ino32 option in nfs.

*** Bug 1996 has been marked as a duplicate of this bug. ***
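As referenced above, a userspace sketch of the compat_filldir truncation check. The inode values are invented for illustration; the point is that any inode that does not survive truncation to 32 bits makes the kernel reject the directory entry, which a legacy 32-bit getdents caller sees as EINVAL:

```c
#include <stdint.h>
#include <stdio.h>

/* Mirrors the check in compat_filldir(): assign the 64-bit inode to a
 * 32-bit field and reject the entry if the value no longer compares
 * equal. The kernel returns -EOVERFLOW internally; the legacy app sees
 * EINVAL from getdents because buf->error was preset to -EINVAL. */
static int fits_in_compat_ino(uint64_t ino)
{
    uint32_t d_ino = (uint32_t)ino;  /* same truncation as compat_ulong_t */
    return (uint64_t)d_ino == ino;
}

int main(void)
{
    /* Invented example inodes: a small local-fs-style inode, and a large
     * 64-bit inode of the kind gnfs returned before nfs.enable-ino32. */
    uint64_t inos[] = { 0x1234ULL, 0x1234567890abcdefULL };
    for (int i = 0; i < 2; i++)
        printf("ino 0x%llx: %s\n", (unsigned long long)inos[i],
               fits_in_compat_ino(inos[i]) ? "ok" : "EOVERFLOW/EINVAL");
    return 0;
}
```

With patch 5632 in place, the option can be set from the CLI instead of hand-editing nfs-server.vol, so it survives volume restarts. A sketch; the volume name is an assumption:

```
# Persistent alternative to editing /etc/glusterd/nfs/nfs-server.vol;
# the volume name "vol1" is assumed.
gluster volume set vol1 nfs.enable-ino32 on
```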