Bug 763704 (GLUSTER-1972)

Summary: xcs get doesn't work with gNFS
Product: [Community] GlusterFS Reporter: Harshavardhana <fharshav>
Component: nfsAssignee: Shehjar Tikoo <shehjart>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: urgent    
Version: 3.1.0CC: allen, cww, gluster-bugs, lakshmipathi, vijay
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: RTP Mount Type: nfs
Documentation: DA CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Harshavardhana 2010-10-18 18:07:48 EDT
"xcs" is a wrapper around CVS for version control. 

This command has been verified to work Against a another NFS box and also local filesystem. 

This issue is present only on Gluster NFS right now. Normal "mkdir" and other commands work. 

<81>senator@xcocfsvm111:/fiogfs/senator/sandbox-fiogfs/lin64/HEAD> xcs get -debug -pub=HEAD env/JDBHelpers/Timing                           

[WRAPPER WARNING]: You are running xcs from a deprecated location

           Use /tools/xint/prod/bin/xcs or add /tools/xint/prod/bin

           to the beginning of your PATH variable

 

[XCS::_SetupServer] (DEBUG) Server Info:

                            Server           =

                            RepoDataType     =

                            ServerDir        =

                            TestStatus       =

                            CodeStatus       = production

                            ServersDataFile  = /tools/xint/XINT201005/products/xcs/data/XcsServers

[XCS::SetHeaderInfo] (VERBOSE) Release XINT201005 - XCS

                               (c) Copyright 1998-2010 Xilinx, Inc. All Rights Reserved.

                               /tools/xint/XINT201005/bin/xcs get -debug -pub=HEAD env/JDBHelpers/Timing

[XCS::SetHeaderInfo] (DEBUG) User Command: /tools/xint/prod/bin/xcs get -debug -pub=HEAD env/JDBHelpers/Timing

[XCSlibmisc::ConnectCheck] (DEBUG) Attempting to "connect" to the repository at

                                     igor:/devl/xcs/repo

[XCSconfig::ValidateOSVersion] (DEBUG) Check the current OS version against tested versions

[XCSlibcvs::CvsRunCmd] (DEBUG) Compression level = 1

[XCSlibcvs::CvsRunCmd] (DEBUG) Ready to call CODE(0xfca4950)()

[XCSlibrun::_xcsRunCmd] (DEBUG) /tmp/xcssenator17514 [17519] "/tools/xint/XINT201005/products/cvs/bin/cvs.lin -z 1 -s XCSID=senator-get-xcocfsvm111-20101015230117-17514 -d :pserver:senator@igor:/devl/xcs/repo -r checkout -n -l XCSAdmin/download"

[XCSlibrun::_xcsRunCmd] (DEBUG) cvs checkout: Updating XCSAdmin/download

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/.xcs

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/admin

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/branches

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/branchstates

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/cvsignore

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/datatype

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/function

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/message

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/modules

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/permtags

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/populate

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/properties

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/roleuser

[XCSlibrun::_xcsRunCmd] (DEBUG) U XCSAdmin/download/xcsversions

[XCSlibrun::_xcsRunCmd] (DEBUG) Initial return code is 0

[XCSlibrun::_xcsRunCmd] (DEBUG) Return code is 0

[XCSlibcvs::CvsRunCmd] (DEBUG) ReturnCode: 1

                               Signal    : 0

                               UnixCode  : 0

[XCS::_xcsSandboxRoot] (DEBUG) SandboxRoot =

[XCSlibcvs::CvsBranchesGet] (DEBUG) gOnExit is "48"

[XCS::_xcsSandboxRoot] (DEBUG) SandboxRoot =

[XCS::_xcsSandboxRoot] (DEBUG) SandboxRoot =

[XCSlibcvs::CvsBranchesGet] (DEBUG) gOnExit is "48"

[XCSget::_xcsgetUseCheckout] (DEBUG) Possible Nicknames:

                                     Env Paths         : env/JDBHelpers/Timing

                                     Files             :

[XCS::_xcsSandboxRoot] (DEBUG) SandboxRoot =

[XCSlibcvs::CvsRunCmd] (DEBUG) Compression level = 6

[XCSlibcvs::CvsRunCmd] (DEBUG) Ready to call CODE(0xfca4950)()

[XCSlibrun::_xcsRunCmd] (DEBUG) /fiogfs/senator/sandbox-fiogfs/lin64/HEAD [17523] "/tools/xint/XINT201005/products/cvs/bin/cvs.lin -z 6 -s XCSID=senator-get-xcocfsvm111-20101015230117-17514 -d :pserver:senator@igor:/devl/xcs/repo -r checkout -r publish+HEAD -P env/JDBHelpers/Timing"

[XCSlibrun::_xcsRunCmd] (DEBUG) cvs [checkout aborted]: cannot open current directory: Invalid argument

[XCSlibcvs::CvsStdOutputHandle] (WARNING) cvs [checkout aborted]: cannot open current directory: Invalid argument

[XCSlibrun::_xcsRunCmd] (DEBUG) Initial return code is 256

[XCSlibrun::_xcsRunCmd] (DEBUG) Return code is 1

[XCSlibcvs::CvsRunCmd] (DEBUG) ReturnCode: 1

                               Signal    : 0

                               UnixCode  : 1

[XCSlibcvs::CvsRunCmd] (DEBUG) Removing CVS...

[XCSget::_xcsgetUseCheckout] (ERROR)   cvs [checkout aborted]: cannot open current directory: Invalid argument

[XCSlibcleanup::CleanUp] (DEBUG) In CleanUp

[XCSlibcleanup::CleanUp] (DEBUG) Passing on signals...

unlink /tmp/xcssenator17514/CVS/Entries

unlink /tmp/xcssenator17514/CVS/Root

unlink /tmp/xcssenator17514/CVS/Repository

unlink /tmp/xcssenator17514/CVS/Entries.Log

rmdir /tmp/xcssenator17514/CVS

unlink /tmp/xcssenator17514/XCSAdmin/CVS/Entries.Static

unlink /tmp/xcssenator17514/XCSAdmin/CVS/Entries

unlink /tmp/xcssenator17514/XCSAdmin/CVS/Root

unlink /tmp/xcssenator17514/XCSAdmin/CVS/Repository

unlink /tmp/xcssenator17514/XCSAdmin/CVS/Entries.Log

rmdir /tmp/xcssenator17514/XCSAdmin/CVS

unlink /tmp/xcssenator17514/XCSAdmin/download/datatype

unlink /tmp/xcssenator17514/XCSAdmin/download/permtags

unlink /tmp/xcssenator17514/XCSAdmin/download/CVS/Entries

unlink /tmp/xcssenator17514/XCSAdmin/download/CVS/Root

unlink /tmp/xcssenator17514/XCSAdmin/download/CVS/Repository

rmdir /tmp/xcssenator17514/XCSAdmin/download/CVS

unlink /tmp/xcssenator17514/XCSAdmin/download/roleuser

unlink /tmp/xcssenator17514/XCSAdmin/download/message

unlink /tmp/xcssenator17514/XCSAdmin/download/.xcs

unlink /tmp/xcssenator17514/XCSAdmin/download/populate

unlink /tmp/xcssenator17514/XCSAdmin/download/properties

unlink /tmp/xcssenator17514/XCSAdmin/download/function

unlink /tmp/xcssenator17514/XCSAdmin/download/modules

unlink /tmp/xcssenator17514/XCSAdmin/download/xcsversions

unlink /tmp/xcssenator17514/XCSAdmin/download/branches

unlink /tmp/xcssenator17514/XCSAdmin/download/branchstates

unlink /tmp/xcssenator17514/XCSAdmin/download/admin

unlink /tmp/xcssenator17514/XCSAdmin/download/cvsignore

rmdir /tmp/xcssenator17514/XCSAdmin/download

rmdir /tmp/xcssenator17514/XCSAdmin

rmdir /tmp/xcssenator17514

[XCSlibcleanup::CleanUp] (DEBUG) Done with CleanUp
Comment 1 Shehjar Tikoo 2010-10-18 23:54:16 EDT
The logs that you mailed have th following error:

[2010-10-13 14:43:25.230265] I [client-handshake.c:535:client_setvolume_cbk] vol1-client-3: Connected to 172.25.123.238:24009, attached to remote volume '/export/glu'.
[2010-10-13 14:43:31.510623] E [nfs3.c:673:nfs3_getattr] nfs-nfsv3: Failed to map FH to vol
[2010-10-13 14:43:31.511398] E [nfs3.c:673:nfs3_getattr] nfs-nfsv3: Failed to map FH to vol
[2010-10-13 14:43:32.870050] E [nfs3.c:673:nfs3_getattr] nfs-nfsv3: Failed to map FH to vol


This is not related to the locks problem i mentioned over IM. Looking into it.
Comment 2 Shehjar Tikoo 2010-10-19 03:28:08 EDT
Harsha,
Did the user manually restart the gluster nfs process before running xcs? For eg, a pid of 12345 is of a gluster nfs process started by glusterd. User killed this process, then restarted the gluster nfs process only, by copy pasting the same command on the command line?
Comment 3 Harshavardhana 2010-10-19 09:35:30 EDT
(In reply to comment #2)
> Harsha,
> Did the user manually restart the gluster nfs process before running xcs? For
> eg, a pid of 12345 is of a gluster nfs process started by glusterd. User killed
> this process, then restarted the gluster nfs process only, by copy pasting the
> same command on the command line?

I restarted with TRACE log level myself since from gluster command setting volume options "diagnostics.client-log-level TRACE" didn't work. 

Now after restarting user was connecting to this server with trace enabled.
Comment 4 Shehjar Tikoo 2010-10-19 09:42:13 EDT
(In reply to comment #3)
> (In reply to comment #2)
> > Harsha,
> > Did the user manually restart the gluster nfs process before running xcs? For
> > eg, a pid of 12345 is of a gluster nfs process started by glusterd. User killed
> > this process, then restarted the gluster nfs process only, by copy pasting the
> > same command on the command line?
> 
> I restarted with TRACE log level myself since from gluster command setting
> volume options "diagnostics.client-log-level TRACE" didn't work. 
> 
> Now after restarting user was connecting to this server with trace enabled.

And this error started occurring after you restarted or was it occurring even before restarting?
Comment 5 Harshavardhana 2010-10-19 09:45:14 EDT
> And this error started occurring after you restarted or was it occurring even
> before restarting?

Even before, i restarted to see if it goes away.
Comment 6 Harshavardhana 2010-10-19 09:46:22 EDT
(In reply to comment #5)
> > And this error started occurring after you restarted or was it occurring even
> > before restarting?
> 
> Even before, i restarted to see if it goes away.

There is an email against this issue from me 2-3days back with dmesg logs from the client NFS with log level set to 65535  in nfs_debug
Comment 7 Shehjar Tikoo 2010-10-19 22:36:12 EDT
(In reply to comment #5)
> > And this error started occurring after you restarted or was it occurring even
> > before restarting?
> 
> Even before, i restarted to see if it goes away.


Try this:

1. Mount nfs server.
2. Run xcs(will fail)
3. ls <mount-point> (should fail too)

I want to know whether anything else works on the same mount-point on which xcs fails.
Log in TRACE if possible.
Comment 8 Harshavardhana 2010-10-19 22:40:35 EDT
> 
> Try this:
> 
> 1. Mount nfs server.
> 2. Run xcs(will fail)
> 3. ls <mount-point> (should fail too)
> 
This has been done 100 times nothing fails except the application. 
Right after the app fails we can do mkdir,rmdir,touch, echo "1" > <somefile> all normal operations. 


> I want to know whether anything else works on the same mount-point on which xcs
> fails.

Everything else works. Did you happen to look at the nfs client dmesg logs that i forwared you?
Comment 9 Shehjar Tikoo 2010-10-19 22:51:03 EDT
(In reply to comment #8)
> > 
> > Try this:
> > 
> > 1. Mount nfs server.
> > 2. Run xcs(will fail)
> > 3. ls <mount-point> (should fail too)
> > 
> This has been done 100 times nothing fails except the application. 
> Right after the app fails we can do mkdir,rmdir,touch, echo "1" > <somefile>
> all normal operations. 
> 
> 
> > I want to know whether anything else works on the same mount-point on which xcs
> > fails.
> 
> Everything else works. Did you happen to look at the nfs client dmesg logs that
> i forwared you?

Those are not helpful.

The only thing left to do now is to run tcpdump. I need to see the difference between the filehandles being sent by xcs and those for the operations that succeed after xcs. Here are the steps:

1. tcpdump -i <interface-name> -s 0 -w /tmp/xcs-fail.dump
2. Mount nfs server.
3. Run xcs
4. ls <mount-point>
5. mkdir -p <mountpoint>/dir1/dir2/dir3
6. Unmount
7. Kill tcpdump
8. bzip /tmp/xcs-fail.dump and attach here.
Comment 10 Anand Avati 2010-10-21 05:11:21 EDT
PATCH: http://patches.gluster.com/patch/5546 in master (nfs: Fix volume-id option declaration)
Comment 11 Harshavardhana 2010-10-22 15:00:29 EDT
http://dev.gluster.com/~support/xcs-fail.dump.bz2  -- Find the client side tcpdump following the each steps mentioned. 

Let me know how if it helps.
Comment 12 Shehjar Tikoo 2010-10-25 00:55:23 EDT
(In reply to comment #11)
> http://dev.gluster.com/~support/xcs-fail.dump.bz2  -- Find the client side
> tcpdump following the each steps mentioned. 
> 
> Let me know how if it helps.

Harsha, looking at the trace, i dont think Dean ran into the error this time. The trace does not have the errors seen in the shell output above.

Did he run it with TRACE log level? That could help.

Do you think you can get me access to the system?

He can also try running xcs in strace. That will tell me which exact syscall fails, as and when xcs does fail. Do you know if it fails consistently?

And send us the output of dmesg after xcs fails.
Comment 13 Shehjar Tikoo 2010-10-25 03:27:54 EDT
Harsha, is this a 32-bit machine where xcs is run?
Comment 14 Shehjar Tikoo 2010-10-28 06:31:27 EDT
The "Invalid argument" error in xcs output is probably a consequence of the following call in the strace output:

LINE365: readlink("/devl/swtools/xcs", 0x7fff87701600, 4095) = -1 EINVAL (Invalid argument)

Looks like xcs or whoever is calling readlink, thinks /devl/swtools/xcs is a symlink, but it is not and /devl/ is not mounted on nfs.


Next, I am going to access the test machines to find out why the same app works fine with other nfs servers and how the readlink affects xcs behaviour with these servers.
Comment 15 Shehjar Tikoo 2010-10-28 23:12:50 EDT
Had a call with Dean, the problem looks similar to 1996. Looking into the kernel source for Centos 5.4/2.6.18-164.el5 to confirm if it is exactly the same.
Comment 16 Shehjar Tikoo 2010-10-29 02:20:56 EDT
As in 1996, the problem stems from gnfs returning 64 bit inodes but this time the problem slighly different. In 1996, the client system was 32 bit and bonnie was built without large file support. Here, the client is 64 bit but the module that calls getdents is built without large file support. In this case, the readdir reply is handled by:

static int compat_filldir(void *__buf, const char *name, int namlen,
                loff_t offset, u64 ino, unsigned int d_type)
{
        struct compat_linux_dirent __user * dirent;
        struct compat_getdents_callback *buf = __buf;
#######        compat_ulong_t d_ino;###########
        int reclen = COMPAT_ROUND_UP(NAME_OFFSET(dirent) + namlen + 2);

        buf->error = -EINVAL;   /* only used if we fail.. */
        if (reclen > buf->count)
                return -EINVAL;
        d_ino = ino;    
        if (sizeof(d_ino) < sizeof(ino) && d_ino != ino)
#########                return -EOVERFLOW; ############

in fs/compat.c which contains,"Kernel compatibililty routines for e.g. 32 bit syscall support on 64 bit kernels."

In the kernel, compat_ulong_t is:
typedef u32             compat_ulong_t;

and so a 32 bit syscall, i.e. getdents ends up returning -EINVAL to apps.
Comment 17 Anand Avati 2010-10-29 04:17:11 EDT
PATCH: http://patches.gluster.com/patch/5603 in master (nfs: Introduce nfs.enable-ino32 to support legacy 32-bit only apps)
Comment 18 Shehjar Tikoo 2010-10-29 04:18:09 EDT
Allen, a patch has been submitted. The fix requires support for legacy apps by setting a new option in nfs volume section:

 option nfs.enable-ino32 on

Please let user know about 3.1.1qa2 which will contain this fix. Thanks.
Comment 19 Harshavardhana 2010-11-01 15:06:59 EDT
(In reply to comment #18)
> Allen, a patch has been submitted. The fix requires support for legacy apps by
> setting a new option in nfs volume section:
> 
>  option nfs.enable-ino32 on
> 
> Please let user know about 3.1.1qa2 which will contain this fix. Thanks.


Shehjar, 

I am curious why are these options not provided with cli yet?, even the rpc.auth is not provided by cli? auth.allow will not make sense for only protocol/server , it should be the same for nfs/server.
Comment 20 Allen Lu 2010-11-01 15:47:17 EDT
Shehjar,

The fix in 3.1.1QA3 does fix the xcs issue Xilinx has. However we found that since we added the option directly to /etc/glusterd/nfs/nfs-server.vol, a restart of the volume will overwrite the nfs-server.vol file. So I think we need it in form of cli as Harsha said to keep it persistent.

Customer has asked me to relay this info as a must fix.
Comment 21 Shehjar Tikoo 2010-11-01 22:23:45 EDT
(In reply to comment #20)
> Shehjar,
> 
> The fix in 3.1.1QA3 does fix the xcs issue Xilinx has. However we found that
> since we added the option directly to /etc/glusterd/nfs/nfs-server.vol, a
> restart of the volume will overwrite the nfs-server.vol file. So I think we
> need it in form of cli as Harsha said to keep it persistent.
> 
> Customer has asked me to relay this info as a must fix.

Fix on the way.
Comment 22 Anand Avati 2010-11-03 04:25:44 EDT
PATCH: http://patches.gluster.com/patch/5632 in master (mgmt/Glusterd: add nfs.enable-ino32 as an option to set from CLI)
Comment 23 Anand Avati 2010-11-24 00:33:53 EST
PATCH: http://patches.gluster.com/patch/5724 in master (volgen: clean up 0fbf226c (... add nfs.enable-ino32 as an option ...))
Comment 24 Shehjar Tikoo 2010-12-23 00:18:55 EST
Doc available for the enable-ino32 option in nfs.
Comment 25 Shehjar Tikoo 2010-12-23 00:20:32 EST
*** Bug 1996 has been marked as a duplicate of this bug. ***