Bug 764028 (GLUSTER-2296)

Summary: svn / subversion fails on gluster volume (replicated and non-replicated)
Product: [Community] GlusterFS
Reporter: Johannes Martin <jmartin>
Component: core
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: low
Version: mainline
CC: aavati, admin, gluster-bugs, jmartin, mateusz-lists, pkarampu, rabhat, raghavendra, saurabh
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix

Description Johannes Martin 2011-01-17 09:02:17 UTC
A similar error message was reported in http://www.gluster.org/interact/log-details/?log=2010-10, where it was suggested that the cause might be a race condition.

Further on in that thread:
jdarcy: Somehow I think it'd work if you only had one half of your replica set.
jdarcy: My best guess is that it's forking the write to the two replicas, then losing some state before either returns, then it's just plain confused after that. 

Is there any way I can modify an existing volume to be non-replicated?

Or should I try to disable self-heal? If so, where in the volume configuration file would I do that?

My current volume configuration looks like:
---
volume home-client-0
    type protocol/client
    option remote-host 10.100.100.160
    option remote-subvolume /media/gluster/brick1/home
    option transport-type tcp
end-volume

volume home-client-1
    type protocol/client
    option remote-host 10.100.100.165
    option remote-subvolume /media/gluster/brick1/home
    option transport-type tcp
end-volume

volume home-replicate-0
    type cluster/replicate
    subvolumes home-client-0 home-client-1
end-volume

volume home-write-behind
    type performance/write-behind
    subvolumes home-replicate-0
end-volume
---

Comment 1 Johannes Martin 2011-01-17 09:13:26 UTC
I just created a new non-replicated volume with an identical copy of the subversion repository and tried the same commit.

It failed with a slightly different error message:
---
[2011-01-17 13:11:40.350006] D [client3_1-fops.c:4308:client3_1_lk] sources-client-0: (4998605): failed to get fd ctx. EBADFD
[2011-01-17 13:11:40.350076] W [fuse-bridge.c:2715:fuse_setlk_cbk] glusterfs-fuse: 119: ERR => -1 (File descriptor in bad state)
[2011-01-17 13:11:40.356393] D [client3_1-fops.c:629:client3_1_flush_cbk] sources-client-0: Attempting to delete locks of owner=2892826873185556891
---

Comment 2 Johannes Martin 2011-01-17 11:35:49 UTC
I'm running glusterfs 3.1.2 (compiled from source) on two Proxmox hosts (mostly Debian, 64-bit kernel).

I've set up a replicated volume and mounted it using glusterfs.

The volume hosts several subversion repositories. When I try to commit something into one of the repositories, subversion reports
---
Adding         testfile
Transmitting file data .svn: Commit failed (details follow):
svn: Can't get exclusive lock on file '/home/sources/svn/Idefix/db/txn-current-lock': Transport endpoint is not connected
----

The last lines in the client log file read:
---
[2011-01-17 12:27:14.565057] D [client3_1-fops.c:4308:client3_1_lk] home-client-0: (34361880): failed to get fd ctx. EBADFD
[2011-01-17 12:27:14.565077] D [client3_1-fops.c:4308:client3_1_lk] home-client-1: (34361880): failed to get fd ctx. EBADFD
[2011-01-17 12:27:14.565098] W [fuse-bridge.c:2715:fuse_setlk_cbk] glusterfs-fuse: 1103: ERR => -1 (Transport endpoint is not connected)
---

Even when I remount the volume, the error persists.

I have also tried restarting the server processes (/etc/init.d/glusterd stop, killall glusterfsd).

Access also fails when I mount the volume using nfs:
---
Adding         testfile
Transmitting file data .svn: Commit failed (details follow):
svn: Can't get exclusive lock on file '/home/sources/svn/Idefix/db/txn-current-lock': Resource unavaliable
---

Comment 3 Pranith Kumar K 2011-01-18 03:32:22 UTC
(In reply to comment #2)
> I just created a new non-replicated volume with an identical copy of the
> subversion repository and tried the same commit.
> 
> It failed with a slightly different error message:
> ---
> [2011-01-17 13:11:40.350006] D [client3_1-fops.c:4308:client3_1_lk]
> sources-client-0: (4998605): failed to get fd ctx. EBADFD
> [2011-01-17 13:11:40.350076] W [fuse-bridge.c:2715:fuse_setlk_cbk]
> glusterfs-fuse: 119: ERR => -1 (File descriptor in bad state)
> [2011-01-17 13:11:40.356393] D [client3_1-fops.c:629:client3_1_flush_cbk]
> sources-client-0: Attempting to delete locks of owner=2892826873185556891
> ---

It's working fine for me. Could you please let us know if we are missing anything?

pranith @ /mnt/client/repo/trunk
10:15:25 :) $ mount | grep pranith
gvfs-fuse-daemon on /home/pranith/.gvfs type fuse.gvfs-fuse-daemon (rw,nosuid,nodev,user=pranith)
pranith-laptop:/pranith on /mnt/client type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)
pranith @ /mnt/client/repo/trunk
10:14:43 :) $ sudo gluster volume info

Volume Name: pranith
Type: Distribute
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: pranith-laptop:/tmp/1
Brick2: pranith-laptop:/tmp/2

SVN COMMANDS: <<------
pranith @ /mnt/client/repo/trunk
10:11:19 :) $ sudo svn add /mnt/client/repo/trunk/testfile.txt
A         /mnt/client/repo/trunk/testfile.txt

pranith @ /mnt/client/repo/trunk
10:12:19 :) $ sudo svn commit -m "this is a test commit"
Adding         trunk/testfile.txt
Transmitting file data .
Committed revision 2.

Thanks
Pranith.

Comment 4 Johannes Martin 2011-01-18 04:18:04 UTC
Sorry, I should have made myself a little clearer...

The problem does not occur when I have my working copy on the gluster share but when the repository itself is on the gluster share.

So:
----
subversion-server$ mount
gluster-server:/sources /sources
----
my-client$ svn info
URL: svn+ssh://subversion-server/sources/my-repository
----

The problem also occurs when I use the file protocol to connect to the repository on my client:
---
sudo mount -t glusterfs gluster-server:/sources /sources
svn switch --relocate svn+ssh://subversion-server/sources/my-repository file:///sources/my-repository
svn commit -m "testing" 
---

Comment 5 Pranith Kumar K 2011-01-18 04:26:39 UTC
(In reply to comment #4)
> Sorry, I should have made myself a little clearer...
> 
> The problem does not occur when I have my working copy on the gluster share but
> when the repository itself is on the gluster share.
> 
> So:
> ----
> subserversion-server$ mount
> gluster-server:/sources /sources
> ----
> my-client$ svn info
> URL: svn+ssh://subversion-server/sources/my-repository
> ----
> 
> The problem also occurs when I use file the file-protocol to connect to the
> repository on my client:
> ---
> sudo mount -t glusterfs gluster-server:/sources /sources
> svn switch --relocate svn+ssh://subversion-server/sources/my-repository
> file:///sources/my-repository
> svn commit -m "testing" 
> ---

That's strange. The repo was also on the gluster share in my test case. It's got to be something else.

pranith @ /mnt/client/repo
12:55:24 :) $ svn info
Path: .
URL: file:///mnt/client/svnrepo/commonsproj
Repository Root: file:///mnt/client/svnrepo
Repository UUID: 29580e7f-60f1-414f-93e1-9a130f9ba5e0
Revision: 1
Node Kind: directory
Schedule: normal
Last Changed Author: root
Last Changed Rev: 1
Last Changed Date: 2011-01-18 10:07:02 +0530 (Tue, 18 Jan 2011)

Pranith.

Comment 6 Johannes Martin 2011-01-18 05:01:09 UTC
I should have mentioned that it does not occur with all the repositories that I host on this share, but with this one consistently.

Since the error message always named the same lock file, I renamed that file and tried again. The commit then worked.

I then compared the old and the new file:
$ ls -l /sources/my-repository/db/txn-current-lock*
-rw-r--r-- 1 martinj sources 0 2011-01-18 08:36 /sources/my-repository/db/txn-current-lock
-rw-rw-r-- 1 sources sources 0 2010-10-07 14:38 /sources/my-repository/db/txn-current-lock.bad

The only visible difference is the ownership, though it shouldn't really matter in this case since I'm a member of group sources.

I renamed the original file back to its original name, and the commit failed again. I changed the file ownership to martinj.sources, and the commit worked. I changed it back to sources.sources, and the commit failed. I changed it to some other owner, and the commit still failed.

If I try the commit by directly accessing the ext4 filesystem that's running below the gluster share, the commit works (same owner/group).

Also, when I modify some different repo's txn-current-lock to belong to sources.sources, the error occurs there, too.

When I strace svn, the last lines look like this:
open("/sources/my-repository/db/txn-current-lock", O_RDWR) = 4
fcntl(4, F_GETFD)                       = 0
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
fcntl(4, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 ENOTCONN (Transport endpoint is not connected)
close(4)                                = 0


Does this make sense?
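
The failing sequence in the strace output (open the lock file read-write, then take a blocking write lock over the whole file) can be tried outside Subversion. The following is only a minimal sketch in Python, not what svn itself runs; the helper name try_exclusive_lock is made up for illustration. It relies on fcntl.lockf with LOCK_EX and no LOCK_NB, which issues a blocking whole-file write lock through fcntl().

```python
import errno
import fcntl
import os
import sys


def try_exclusive_lock(path):
    """Open path read-write and take a blocking whole-file write lock.

    Returns None on success, or the symbolic errno name (for example
    "ENOTCONN" or "EPERM") if the open or the lock request fails.
    try_exclusive_lock is a hypothetical helper, not part of svn or glusterfs.
    """
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        # LOCK_EX without LOCK_NB requests a blocking write lock
        # starting at offset 0 with length 0 (the whole file).
        fcntl.lockf(fd, fcntl.LOCK_EX)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Pass one or more lock file paths on the command line.
    for lock_path in sys.argv[1:]:
        result = try_exclusive_lock(lock_path)
        print(lock_path, "OK" if result is None else result)
```

Running it once against the file on the gluster mount and once against the same file on the underlying ext4 brick should show whether the failure depends on the mount rather than on Subversion.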

Comment 7 Pranith Kumar K 2011-01-18 08:38:02 UTC
(In reply to comment #6)
> I should have mentioned that it does not occur with all the repositories that I
> host on this share, but with this one consistently.
> 
> As the file that occured in the error message was the same lock file all the
> time, I now renamed it and tried again. The commit then worked.
> 
> I then compared the old and the new file:
> $ ls -l /sources/my-repository/db/txn-current-lock*
> -rw-r--r-- 1 martinj sources 0 2011-01-18 08:36
> /sources/my-repository/db/txn-current-lock
> -rw-rw-r-- 1 sources sources 0 2010-10-07 14:38
> /sources/my-repository/db/txn-current-lock.bad
> 
> The only visible difference is the ownership, though it shouldn't really matter
> in this case since I'm member of group sources.
> 
> I renamed the orginal file to its original name again, commit failed again.
> Changed file ownership to martinj.sources, commit worked. Changed it back to
> sources.sources, commit failed. Changed it to some other owner, commit still
> failed.
> 
> If I try the commit by directly accessing the ext4 filesystem that's running
> below the gluster share, the commit works (same owner/group).
> 
> Also, when I modify some different repo's txn-current-lock to belong to
> sources.sources, the error occurs there, too.
> 
> When I strace svn, the last lines look like this:
> open("/sources/my-repository/db/txn-current-lock", O_RDWR) = 4
> fcntl(4, F_GETFD)                       = 0
> fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
> fcntl(4, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1
> ENOTCONN (Transport endpoint is not connected)
> close(4)                                = 0
> 
> 
> Does this make sense?

Thanks a lot for the detailed explanation. From the strace output and the logfile output you have provided, we suspect that the problem could be in the quick-read translator. Could you please disable quick-read ("gluster volume set <volname> quick-read disable"), run the svn operation again, and see whether it still fails? If it succeeds, we can confirm that quick-read is causing this trouble.

Pranith

Comment 8 Johannes Martin 2011-01-18 09:29:57 UTC
(In reply to comment #7)
> Could you please disable quick-read "gluster volume set
> <volname> quick-read disable" and do the svn operation again and see if it does
> not fail this time. Then we can confirm its quick-read that is causing this
> trouble.

I did this, but it looks like the problem gets worse. subversion can't even open the file any more:
---
open("/home/sources/svn/testing/db/txn-current-lock", O_RDWR) = -1 EPERM (Operation not permitted)
---
(That same call succeeds when quick-read is enabled).

I tried one more thing:
---
strace /bin/sh -c "echo foo >> txn-current-lock"
...
open("txn-current-lock", O_WRONLY|O_CREAT|O_APPEND, 0666) = -1 EPERM (Operation not permitted)
---

The error message in this last trace is the same no matter whether quick-read is enabled or disabled.

Comment 9 Pranith Kumar K 2011-01-19 02:36:26 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Could you please disable quick-read "gluster volume set
> > <volname> quick-read disable" and do the svn operation again and see if it does
> > not fail this time. Then we can confirm its quick-read that is causing this
> > trouble.
> 
> I did this, but it looks like the problem gets worse. subversion can't even
> open the file any more:
> ---
> open("/home/sources/svn/testing/db/txn-current-lock", O_RDWR) = -1 EPERM
> (Operation not permitted)
> ---
> (That same call succeeds when quick-read is enabled).
> 
> I tried one more thing:
> ---
> strace /bin/sh -c "echo foo >> txn-current-lock"
> ...
> open("txn-current-lock", O_WRONLY|O_CREAT|O_APPEND, 0666) = -1 EPERM (Operation
> not permitted)
> ---
> 
> The error message in this last trace is the same no matter whether quick-read
> is enabled or disabled.

The error says that neither the user nor the user's group has write permission for that file. Could you please post the "ls -l" output for that file?

Comment 10 Johannes Martin 2011-01-19 03:03:39 UTC
(In reply to comment #9)
> The error says user/user-group doesn't have write permission for that file.
> Could you please give "ls -l" output for that file.

$ ls -l txn-current-lock 
-rw-rw-r-- 1 sources sources 4 19. Jan 06:53 txn-current-lock

I am a member of group sources.

echo foo >> /media/gluster/brick1/home/sources/svn/testing/db/txn-current-lock 
(where /media/gluster/brick1 is the mount point of the ext4 partition) works fine.

Group sources is not a local group on the machine but imported via NIS. 

More tests: 
- changed ownership to sources.<my-primary-group>: 
--> echo foo >> txn-current-lock succeeded.
- changed ownership to sources.<one-of-my-other-secondary-groups(non-NIS)>
--> echo foo >> txn-current-lock failed

So I guess the problem is not related to NIS.

Comment 11 Pranith Kumar K 2011-01-20 09:15:45 UTC
(In reply to comment #10)
> $ ls -l txn-current-lock 
> -rw-rw-r-- 1 sources sources 4 19. Jan 06:53 txn-current-lock
> 
> (In reply to comment #9)
> > The error says user/user-group doesn't have write permission for that file.
> > Could you please give "ls -l" output for that file.
> 
> $ ls -l txn-current-lock 
> -rw-rw-r-- 1 sources sources 4 19. Jan 06:53 txn-current-lock
> 
> I am member of group sources.
> 
> echo foo >> /media/gluster/brick1/home/sources/svn/testing/db/txn-current-lock 
> (where /media/gluster/brick1 is the mount point of the ext4 partition) works
> fine.
> 
> Group sources is not a local group on the machine but imported via NIS. 
> 
> More tests: 
> - changed ownership to sources.<my-primary-group>: 
> --> echo foo >> txn-current-lock succeeded.
> - changed ownership to sources.<one-of-my-other-secondary-groups(non-NIS)>
> --> echo foo >> txn-current-lock failed
> 
> So I guess the problem is not related to NIS.

All the symptoms point to the following:
FUSE has a limitation: it does not send auxiliary gids to glusterfs. As a result, glusterfs cannot check whether the user belongs to any auxiliary group, and the access-control translator on the server side treats you as "other". That explains all of the behaviour above.
gNFS does not support locking yet, which is why the svn commit also failed over gNFS.
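
To illustrate why missing auxiliary gids lead to exactly the pattern reported in comment #10, here is a simplified model of a classic Unix mode-bit check. It is only a sketch and is not the access-control translator's code; the function names and all uid/gid numbers are invented for the example, and special cases such as root are ignored.

```python
def permission_class(uid, gid, aux_gids, file_uid, file_gid):
    """Pick which mode-bit triplet applies to the caller."""
    if uid == file_uid:
        return "owner"
    if gid == file_gid or file_gid in aux_gids:
        return "group"
    return "other"


def may_write(mode, uid, gid, aux_gids, file_uid, file_gid):
    """Return True if the applicable triplet has the write bit set."""
    shift = {"owner": 6, "group": 3, "other": 0}[
        permission_class(uid, gid, aux_gids, file_uid, file_gid)
    ]
    return bool((mode >> shift) & 0o2)


# Hypothetical ids: caller uid 1000 with primary gid 100,
# file owned by uid 501 and gid 500 with mode rw-rw-r--.
MODE = 0o664

# Auxiliary gids known: the caller is matched as "group" and may write.
with_aux = may_write(MODE, 1000, 100, {500}, 501, 500)

# Auxiliary gids missing: the caller falls through to "other".
without_aux = may_write(MODE, 1000, 100, set(), 501, 500)

# File group equals the caller's primary gid: works even without aux gids.
primary_group = may_write(MODE, 1000, 100, set(), 501, 100)

print(with_aux, without_aux, primary_group)
```

In this model the write succeeds when the file's group is the caller's primary group and fails when it is only a secondary group whose gid never reached the server, which matches the tests in comment #10.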

Comment 12 Johannes Martin 2011-01-20 17:17:11 UTC
> Fuse has a limitation of not sending auxilary gids to glusterfs.

Is this a known limitation or do you only suspect this to be the cause? If it's only a suspicion, how can we prove it? 

In any case, do you have any idea about how to work around this limitation?

I would assume that the client side of glusterfs knows the uid of the user trying to access the file. Can't it then use whatever /usr/bin/id does to find out what other groups that user belongs to?
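
As a rough illustration of that idea, a user-space process can resolve a uid to its full group list in much the same way /usr/bin/id does. The sketch below uses Python's standard pwd and os modules; groups_for_uid is a made-up helper name, and whether glusterfs could or should do such a lookup on the client side is not established here.

```python
import os
import pwd


def groups_for_uid(uid):
    """Return the gids a uid belongs to: the primary gid plus supplementary ones.

    groups_for_uid is a hypothetical helper. It looks the uid up in the
    password database and then asks the system group databases (local
    files, NIS, and so on, depending on configuration) for its groups.
    """
    entry = pwd.getpwuid(uid)
    return os.getgrouplist(entry.pw_name, entry.pw_gid)


if __name__ == "__main__":
    # Print the group list of the user running the script.
    print(groups_for_uid(os.getuid()))
```

Note that such a lookup reflects the group databases at the time of the call, not necessarily the groups of an already running process.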

Comment 13 Anand Avati 2011-01-20 18:28:57 UTC
(In reply to comment #12)
> > Fuse has a limitation of not sending auxilary gids to glusterfs.
> 
> Is this a known limitation or do you only suspect this to be the cause? If it's
> only a suspicion, how can we prove it? 
> 
> In any case, do you have any idea about how to work around this limitation?
> 
> I would assume that the client side of glusterfs knows the uid of the user
> trying to access the file. Can't it then use whatever /usr/bin/id does to find
> out what other groups that user belongs to?

We are working on fixing this in 3.1.3.

Avati

Comment 14 Pranith Kumar K 2011-01-21 02:08:28 UTC
(In reply to comment #12)
> > Fuse has a limitation of not sending auxilary gids to glusterfs.
> 
> Is this a known limitation or do you only suspect this to be the cause? If it's
> only a suspicion, how can we prove it? 
> 
> In any case, do you have any idea about how to work around this limitation?
> 
> I would assume that the client side of glusterfs knows the uid of the user
> trying to access the file. Can't it then use whatever /usr/bin/id does to find
> out what other groups that user belongs to?

Yes, it is a known limitation. You can confirm it by manually editing the brick server volfile and removing the access-control translator. Bug 2304 has the same problem.

Comment 15 Pranith Kumar K 2011-01-24 02:58:31 UTC
*** Bug 2312 has been marked as a duplicate of this bug. ***

Comment 16 Pranith Kumar K 2011-01-24 03:00:11 UTC
*** Bug 2304 has been marked as a duplicate of this bug. ***

Comment 17 Anand Avati 2011-01-27 04:42:25 UTC
PATCH: http://patches.gluster.com/patch/6033 in master (features/access-control: skip access-tests if the call is from fuse)

Comment 18 Vijay Bellur 2011-01-28 06:47:28 UTC
*** Bug 1647 has been marked as a duplicate of this bug. ***

Comment 19 Amar Tumballi 2011-02-01 08:29:30 UTC
*** Bug 2183 has been marked as a duplicate of this bug. ***

Comment 20 Saurabh 2011-03-10 05:47:00 UTC
[root@centos-qa-3 ~]# id u1
uid=1056(u1) gid=505(g5) groups=505(g5),502(g2),503(g3),504(g4)


[root@centos-qa-3 ~]# mount 
glusterfs#10.1.12.109:/dist1 on /mnt/gluster-test type fuse (rw,allow_other,default_permissions,max_read=131072)




[root@centos-qa-3 ~]# cd /mnt/gluster-test/
[root@centos-qa-3 gluster-test]# touch file1
[root@centos-qa-3 gluster-test]# ls -lia file1
9240582 -rw-r--r-- 1 root root 0 Mar  9 15:19 file1
[root@centos-qa-3 gluster-test]# chown u1 file1
[root@centos-qa-3 gluster-test]# chgrp g1 file1
[root@centos-qa-3 gluster-test]# ls -lia
total 20
      1 drwxr-xr-x 2 root root 4096 Mar  9  2011 .
4620289 drwxr-xr-x 8 root root 4096 Mar  9 15:17 ..
9240582 -rw-r--r-- 1 u1   g1      0 Mar  9 15:19 file1


[root@centos-qa-3 gluster-test]# cat file1
[root@centos-qa-3 gluster-test]# echo text1 >  file1
[root@centos-qa-3 gluster-test]# cat file1
text1
[root@centos-qa-3 gluster-test]# vi file1 
[root@centos-qa-3 gluster-test]# cat file1
text1

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
ssssssssssssssssssssssssssssssssss
d

dddddddddddddddddddddddddddddddddddd
f



gggggggggggggggggggggggggggggggggggggggggggggg
ttttttttttttttttttttttttttttttttttt
[root@centos-qa-3 gluster-test]# ls
file1
[root@centos-qa-3 gluster-test]#