Bug 1775461 - protocol/client : After "Remove lock recovery logic from client and server protocol translators" patch commit, causing add fcntl lock on the file twice. [NEEDINFO]
Summary: protocol/client : After "Remove lock recovery logic from client and server pr...
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: protocol
Version: 6
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-22 02:54 UTC by yinkui
Modified: 2020-03-12 12:18 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-12 12:18:19 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
anoopcs: needinfo? (13965432176)


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Gluster.org Gerrit 1272030 None None None 2019-11-22 02:54:36 UTC
Red Hat Bugzilla 1272030 'medium' 'CLOSED' 'Remove lock recovery logic from client and server protocol translators' 2019-12-06 11:12:21 UTC

Description yinkui 2019-11-22 02:54:36 UTC
Description of problem:
After "Remove lock recovery logic from client and server protocol translators" patch commit, causing add fcntl lock on the file twice.

Version-Release number of selected component (if applicable):
gluster-6.0

How reproducible:


Steps to Reproduce:
1.afr 2*3 volume
2.add fcntl F_WRLCK lock on the file1
3.make sure which afr group store the file1 and kill the group's glusterfsd
4.add fcntl F_WRLCK lock on the file1 again and success
5.gluster v start afr force
6.unlock fcntl lock success

Actual results:
can add fcntl lock twice

Expected results:
can not add fcntl lock twice

Additional info:

Comment 1 Ravishankar N 2019-12-04 12:33:06 UTC
Leaving a need-info on Anoop who authored the patch to comment on the bug.

Comment 2 Anoop C S 2019-12-05 06:46:11 UTC
(In reply to yinkui from comment #0)
> Steps to Reproduce:
> 1.afr 2*3 volume
> 2.add fcntl F_WRLCK lock on the file1
> 3.make sure which afr group store the file1 and kill the group's glusterfsd
> 4.add fcntl F_WRLCK lock on the file1 again and success
> 5.gluster v start afr force
> 6.unlock fcntl lock success

How are you trying to acquire F_WRLCK(Is it a hand-written program or some other existing utility)?
What do you mean by "afr group"? Did you mean replica brick processes hosting the file?
Can you paste the output of `gluster v info`?

Comment 3 yinkui 2019-12-09 06:36:32 UTC
1、the test file is a hand-written program.for example:
    struct flock _lock;
    char d[10] = {0};
    char a;

    _lock.l_type =  F_WRLCK;
    _lock.l_whence = SEEK_SET;
    _lock.l_start = 0;
    _lock.l_len = 0;

    int file = open( "/mnt/fourth_file5.txt",O_CREAT|O_RDWR,S_IRWXU|S_IRGRP|S_IWGRP|S_IRWXO );
    if ( file<0 )
    {
        puts( "open error" );
        return 0;
    }

    int ret = fcntl( file,F_SETLK,&_lock );
    if ( ret < 0 )
    {
        printf(" errno: %s\n", strerror(errno));
        close( file );
        return 0;
    }

    puts( "getchar ..." );

    a = getchar();
    printf(" char: %c\n", a);
    _lock.l_type =  F_UNLCK;
    _lock.l_whence = SEEK_SET;
    _lock.l_start = 0;
    _lock.l_len = 0;

    ret = fcntl( file,F_SETLK,&_lock );
    if ( ret < 0 )
    {
        puts( "unlock error" );
        printf(" errno: %s\n", strerror(errno));
    }
    close( file );


2、"afr group" means the lock file fourth_file5.txt which bricks store it. And kill all glusterfsd which bricks store the fourth_file5.txt. Then start all glusterfsd force and run the hand-written program again. It's also can add lock on the fourth_file5.txt. It's not true add the F_SETLK lock twice on the same file.

Comment 4 Anoop C S 2019-12-09 10:00:41 UTC
(In reply to yinkui from comment #3)
> 2、"afr group" means the lock file fourth_file5.txt which bricks store it.
> And kill all glusterfsd which bricks store the fourth_file5.txt. Then start
> all glusterfsd force and run the hand-written program again. It's also can
> add lock on the fourth_file5.txt. It's not true add the F_SETLK lock twice
> on the same file.

Can you paste the output of `gluster v info`?

Comment 5 yinkui 2019-12-09 10:39:23 UTC
(In reply to Anoop C S from comment #4)
> (In reply to yinkui from comment #3)
> > 2、"afr group" means the lock file fourth_file5.txt which bricks store it.
> > And kill all glusterfsd which bricks store the fourth_file5.txt. Then start
> > all glusterfsd force and run the hand-written program again. It's also can
> > add lock on the fourth_file5.txt. It's not true add the F_SETLK lock twice
> > on the same file.
> 
> Can you paste the output of `gluster v info`?

Sorry!The testing environment has been not in. You can reproduce according to my steps.

Comment 6 Anoop C S 2019-12-09 12:18:35 UTC
(In reply to yinkui from comment #5)
> (In reply to Anoop C S from comment #4)
> > (In reply to yinkui from comment #3)
> > > 2、"afr group" means the lock file fourth_file5.txt which bricks store it.
> > > And kill all glusterfsd which bricks store the fourth_file5.txt. Then start
> > > all glusterfsd force and run the hand-written program again. It's also can
> > > add lock on the fourth_file5.txt. It's not true add the F_SETLK lock twice
> > > on the same file.
> > 
> > Can you paste the output of `gluster v info`?
> 
> Sorry!The testing environment has been not in. You can reproduce according
> to my steps.

I believe this change in behaviour is due to the following volume set options. Did you happen to set those options on the volume?

features.lock-heal
features.grace-timeout

Comment 7 Xavi Hernandez 2019-12-09 14:16:57 UTC
Anoop, is this a regression ? bug #1272030 says that the lock recovery functionality is removed from client/server xlators because AFR will have better knowledge about the real state of locks. Is this feature already implemented in AFR ? if it's, then the bug should be moved to replicate and fixed there.

Yinkui, I don't fully understand what do you mean when you say "kill all glusterfsd which bricks store the fourth_file5.txt". Does this mean that you kill all 3 bricks from the replica that contains that file ? if so, I'm not sure if this is a bug because you have completely lost access to the file. Maybe the bug is that the next access to the file with the fd opened before killing and restarting all bricks should have failed ? can you clarify the exact scenario ?

Comment 8 yinkui 2019-12-10 02:08:51 UTC
I reproduce this bug. 
1、gluster v info test
[root@server6 ~]# gluster v info test
Volume Name: test
Type: Distributed-Replicate
Volume ID: 8c07f99a-85fe-4d62-b2e0-3deb480a371a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: server6:/bricks/brick1/brick0
Brick2: server6:/bricks/brick2/brick0
Brick3: server6:/bricks/brick3/brick0
Brick4: server6:/bricks/brick4/brick0
Brick5: server6:/bricks/brick5/brick0
Brick6: server6:/bricks/brick6/brick0
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
2、Steps by reproduce:
2.1 lock file on file1
[root@server6 ~]# ./test 
 lock: Success

2.2 
[root@server6 ~]# gluster v status
Status of volume: test
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick server6:/bricks/brick1/brick0         49152     0          Y       3403 
Brick server6:/bricks/brick2/brick0         49153     0          Y       3423 
Brick server6:/bricks/brick3/brick0         49154     0          Y       3443 
Brick server6:/bricks/brick4/brick0         49155     0          Y       4167 
Brick server6:/bricks/brick5/brick0         49156     0          Y       4187 
Brick server6:/bricks/brick6/brick0         49157     0          Y       4207 
Self-heal Daemon on localhost               N/A       N/A        Y       4235 
 
Task Status of Volume test
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@server6 ~]# ls /bricks/brick4/brick0
file1  file2  file5  file6  file8
[root@server6 ~]# ls /bricks/brick5/brick0
file1  file2  file5  file6  file8
[root@server6 ~]# ls /bricks/brick6/brick0 
file1  file2  file5  file6  file8
[root@server6 ~]# kill -9 4167 4187 4207
[root@server6 ~]# gluster v start test force
volume start: test: success
[root@server6 ~]# gluster v status
Status of volume: test
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick server6:/bricks/brick1/brick0         49152     0          Y       3403 
Brick server6:/bricks/brick2/brick0         49153     0          Y       3423 
Brick server6:/bricks/brick3/brick0         49154     0          Y       3443 
Brick server6:/bricks/brick4/brick0         49155     0          Y       4410 
Brick server6:/bricks/brick5/brick0         49156     0          Y       4430 
Brick server6:/bricks/brick6/brick0         49157     0          Y       4450 
Self-heal Daemon on localhost               N/A       N/A        Y       4471 
 
Task Status of Volume test
------------------------------------------------------------------------------
There are no active volume tasks

2.3 Another terminal to run executable test file again and lock file1 success.
[root@server6 ~]# ./test 
 lock: Success

2.4 However, in gluster-3.7.12 version after glusterfsd restart glusterfs process will reopen and relock on file1 so executable test file can't lock file1.

Comment 9 yinkui 2019-12-10 02:16:40 UTC
1、“kill all glusterfsd which bricks store the fourth_file5.txt” means that kill all 3 bricks from the replica that contains that file. 
2、In fact, after kill all 3 bricks we can't access to the file. But, all 3 bricks restarting that glusterfs process to reopen the fourth_file5.txt and not relock the fourth_file5.txt after 'Remove lock recovery logic from client and server protocol translators' patch committed.

Comment 10 Anoop C S 2019-12-10 05:59:06 UTC
(In reply to Xavi Hernandez from comment #7)
> Anoop, is this a regression ? bug #1272030 says that the lock recovery
> functionality is removed from client/server xlators because AFR will have
> better knowledge about the real state of locks. Is this feature already
> implemented in AFR ?

It is not.

Comment 11 Anoop C S 2019-12-10 06:09:55 UTC
(In reply to yinkui from comment #8)
> I reproduce this bug. 
> 1、gluster v info test
> [root@server6 ~]# gluster v info test
> Volume Name: test
> Type: Distributed-Replicate
> Volume ID: 8c07f99a-85fe-4d62-b2e0-3deb480a371a
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: server6:/bricks/brick1/brick0
> Brick2: server6:/bricks/brick2/brick0
> Brick3: server6:/bricks/brick3/brick0
> Brick4: server6:/bricks/brick4/brick0
> Brick5: server6:/bricks/brick5/brick0
> Brick6: server6:/bricks/brick6/brick0
> Options Reconfigured:
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: off
> 2、Steps by reproduce:
> 2.1 lock file on file1
> [root@server6 ~]# ./test 
>  lock: Success

Do you maintain the fd in opened state through the next steps?
If yes, then it should be reopened after brick processes are back up after force starting the volume. Check for entry "1 fds open - Delaying child_up until they are re-opened" in mount log.
If not, then there is no point in trying to lock again as the previous lock would have been released as part of closing the fd or exiting the application.

Comment 12 yinkui 2019-12-10 07:11:20 UTC
Yes, the fd in opening state through the next steps. The mount_point'log has entry "1 fds open - Delaying child_up until they are re-opened".
mount log:
[2019-12-10 09:49:56.878253] I [MSGID: 114046] [client-handshake.c:1112:client_setvolume_cbk] 0-test-client-3: Connected to test-client-3, attached to remote volume '/bricks/brick4/brick0'.
[2019-12-10 09:49:56.878320] I [MSGID: 114042] [client-handshake.c:936:client_post_handshake] 0-test-client-3: 1 fds open - Delaying child_up until they are re-opened
[2019-12-10 09:49:56.878927] I [MSGID: 114041] [client-handshake.c:324:client_child_up_reopen_done] 0-test-client-3: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-10 09:49:56.879013] I [MSGID: 108005] [afr-common.c:5250:__afr_handle_child_up_event] 0-test-replicate-1: Subvolume 'test-client-3' came back up; going online.
[2019-12-10 09:49:56.882082] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-test-client-4: changing port to 49156 (from 0)
[2019-12-10 09:49:56.885800] I [MSGID: 114046] [client-handshake.c:1112:client_setvolume_cbk] 0-test-client-5: Connected to test-client-5, attached to remote volume '/bricks/brick6/brick0'.
[2019-12-10 09:49:56.885832] I [MSGID: 114042] [client-handshake.c:936:client_post_handshake] 0-test-client-5: 1 fds open - Delaying child_up until they are re-opened
[2019-12-10 09:49:56.886449] I [MSGID: 114041] [client-handshake.c:324:client_child_up_reopen_done] 0-test-client-5: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-10 09:49:56.886505] I [MSGID: 108002] [afr-common.c:5825:afr_notify] 0-test-replicate-1: Client-quorum is met
[2019-12-10 09:49:56.890239] I [MSGID: 114046] [client-handshake.c:1112:client_setvolume_cbk] 0-test-client-4: Connected to test-client-4, attached to remote volume '/bricks/brick5/brick0'.
[2019-12-10 09:49:56.890273] I [MSGID: 114042] [client-handshake.c:936:client_post_handshake] 0-test-client-4: 1 fds open - Delaying child_up until they are re-opened
[2019-12-10 09:49:56.890855] I [MSGID: 114041] [client-handshake.c:324:client_child_up_reopen_done] 0-test-client-4: last fd open'd/lock-self-heal'd - notifying CHILD-UP

Comment 13 Anoop C S 2019-12-10 09:56:58 UTC
(In reply to yinkui from comment #12)
> Yes, the fd in opening state through the next steps. The mount_point'log has
> entry "1 fds open - Delaying child_up until they are re-opened".
> mount log:

Cool. So that part is clear.

Now it is my understanding that prior to the commit mentioned in summary, re-acquiring of locks will only be attempted with lock-heal volume set option. As per current code base the behaviour you are observing is correct i.e, locks held before brick disconnect are cleaned up without registering grace timer(even volume set option also got removed). But are you pretty sure that you didn't have that option enabled with v3.7.X ?

Comment 14 yinkui 2019-12-10 10:10:53 UTC
Options enabled with v3.7.X.

[root@node-1 ~]# gluster v get test all | grep features.lock-heal
features.lock-heal                      on                                                                        
[root@node-1 ~]# gluster v get test all | grep features.grace-timeout
features.grace-timeout                  10

Comment 15 Anoop C S 2019-12-10 10:32:31 UTC
(In reply to yinkui from comment #14)
> Options enabled with v3.7.X.
> 
> [root@node-1 ~]# gluster v get test all | grep features.lock-heal
> features.lock-heal                      on                                  
> 
> [root@node-1 ~]# gluster v get test all | grep features.grace-timeout
> features.grace-timeout                  10

Ah.. then this behaviour is expected as these options(and corresponding implementation) got removed with v4.0 due to its incomplete and/or incorrect logic in various disconnect scenarios. It was decided to place lock-healing logic in replication layer rather than on protocol-client. In case you are interested refer the following doc for some notes around the internals of lock-healing implementation and issues:

https://anoopcs.fedorapeople.org/Lock%20recovery%20in%20GlusterFS.txt

Comment 16 Anoop C S 2019-12-19 10:40:06 UTC
I prefer to close the bug report as NOTABUG.

Comment 17 yinkui 2019-12-20 01:45:06 UTC
Can you give me some reason? thanks. My understanding that avoids different process setting the lock on the same file success so re-acquiring of locks should be attempted. You can close the bug If you have a good reason.

Comment 18 Worker Ant 2020-03-12 12:18:19 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/864, and will be tracked there from now on. Visit GitHub issues URL for further details


Note You need to log in before you can comment on or make changes to this bug.