Description of problem:
After the "Remove lock recovery logic from client and server protocol translators" patch was committed, an fcntl lock can be acquired twice on the same file.

Version-Release number of selected component (if applicable):
gluster-6.0

How reproducible:

Steps to Reproduce:
1. Create an afr 2x3 volume.
2. Take an fcntl F_WRLCK lock on file1.
3. Find out which afr group stores file1 and kill that group's glusterfsd processes.
4. Take an fcntl F_WRLCK lock on file1 again; it succeeds.
5. gluster v start afr force
6. Unlock the fcntl lock; it succeeds.

Actual results:
The fcntl lock can be acquired twice.

Expected results:
The fcntl lock can not be acquired twice.

Additional info:
Leaving a needinfo on Anoop, who authored the patch, to comment on the bug.
(In reply to yinkui from comment #0)
> Steps to Reproduce:
> 1.afr 2*3 volume
> 2.add fcntl F_WRLCK lock on the file1
> 3.make sure which afr group store the file1 and kill the group's glusterfsd
> 4.add fcntl F_WRLCK lock on the file1 again and success
> 5.gluster v start afr force
> 6.unlock fcntl lock success

How are you trying to acquire F_WRLCK? Is it a hand-written program or some other existing utility?

What do you mean by "afr group"? Did you mean the replica brick processes hosting the file?

Can you paste the output of `gluster v info`?
1. The test program is hand-written. For example:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock _lock;
        char a;

        /* take a write lock over the whole file */
        _lock.l_type = F_WRLCK;
        _lock.l_whence = SEEK_SET;
        _lock.l_start = 0;
        _lock.l_len = 0;

        int file = open("/mnt/fourth_file5.txt", O_CREAT | O_RDWR,
                        S_IRWXU | S_IRGRP | S_IWGRP | S_IRWXO);
        if (file < 0) {
            puts("open error");
            return 0;
        }

        int ret = fcntl(file, F_SETLK, &_lock);
        if (ret < 0) {
            printf(" errno: %s\n", strerror(errno));
            close(file);
            return 0;
        }

        /* hold the lock until a key is pressed */
        puts("getchar ...");
        a = getchar();
        printf(" char: %c\n", a);

        /* release the lock */
        _lock.l_type = F_UNLCK;
        _lock.l_whence = SEEK_SET;
        _lock.l_start = 0;
        _lock.l_len = 0;
        ret = fcntl(file, F_SETLK, &_lock);
        if (ret < 0) {
            puts("unlock error");
            printf(" errno: %s\n", strerror(errno));
        }
        close(file);
        return 0;
    }

2. "afr group" means the set of replica bricks that stores the lock file fourth_file5.txt. I kill all glusterfsd processes for the bricks that store fourth_file5.txt, then force-start the volume and run the hand-written program again. It can still take the lock on fourth_file5.txt. It should not be possible to take the F_SETLK lock twice on the same file.
(In reply to yinkui from comment #3)
> 2、"afr group" means the lock file fourth_file5.txt which bricks store it.
> And kill all glusterfsd which bricks store the fourth_file5.txt. Then start
> all glusterfsd force and run the hand-written program again. It's also can
> add lock on the fourth_file5.txt. It's not true add the F_SETLK lock twice
> on the same file.

Can you paste the output of `gluster v info`?
(In reply to Anoop C S from comment #4)
> (In reply to yinkui from comment #3)
> > 2、"afr group" means the lock file fourth_file5.txt which bricks store it.
> > And kill all glusterfsd which bricks store the fourth_file5.txt. Then start
> > all glusterfsd force and run the hand-written program again. It's also can
> > add lock on the fourth_file5.txt. It's not true add the F_SETLK lock twice
> > on the same file.
>
> Can you paste the output of `gluster v info`?

Sorry! The testing environment is no longer available. You can reproduce it by following my steps.
(In reply to yinkui from comment #5)
> (In reply to Anoop C S from comment #4)
> > (In reply to yinkui from comment #3)
> > > 2、"afr group" means the lock file fourth_file5.txt which bricks store it.
> > > And kill all glusterfsd which bricks store the fourth_file5.txt. Then start
> > > all glusterfsd force and run the hand-written program again. It's also can
> > > add lock on the fourth_file5.txt. It's not true add the F_SETLK lock twice
> > > on the same file.
> >
> > Can you paste the output of `gluster v info`?
>
> Sorry! The testing environment has been not in. You can reproduce according
> to my steps.

I believe this change in behaviour is due to the following volume set options. Did you happen to set these options on the volume?

features.lock-heal
features.grace-timeout
Anoop, is this a regression? Bug #1272030 says that the lock recovery functionality was removed from the client/server xlators because AFR will have better knowledge about the real state of locks. Is this feature already implemented in AFR? If it is, then the bug should be moved to replicate and fixed there.

Yinkui, I don't fully understand what you mean when you say "kill all glusterfsd which bricks store the fourth_file5.txt". Does this mean that you kill all 3 bricks of the replica that contains that file? If so, I'm not sure this is a bug, because you have completely lost access to the file. Maybe the bug is that the next access to the file, through the fd opened before killing and restarting all bricks, should have failed? Can you clarify the exact scenario?
I reproduced this bug.

1. gluster v info test

[root@server6 ~]# gluster v info test

Volume Name: test
Type: Distributed-Replicate
Volume ID: 8c07f99a-85fe-4d62-b2e0-3deb480a371a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: server6:/bricks/brick1/brick0
Brick2: server6:/bricks/brick2/brick0
Brick3: server6:/bricks/brick3/brick0
Brick4: server6:/bricks/brick4/brick0
Brick5: server6:/bricks/brick5/brick0
Brick6: server6:/bricks/brick6/brick0
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

2. Steps to reproduce:

2.1 Lock file1:

[root@server6 ~]# ./test
lock: Success

2.2 Kill the bricks that store file1 and force-start the volume:

[root@server6 ~]# gluster v status
Status of volume: test
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick server6:/bricks/brick1/brick0         49152     0          Y       3403
Brick server6:/bricks/brick2/brick0         49153     0          Y       3423
Brick server6:/bricks/brick3/brick0         49154     0          Y       3443
Brick server6:/bricks/brick4/brick0         49155     0          Y       4167
Brick server6:/bricks/brick5/brick0         49156     0          Y       4187
Brick server6:/bricks/brick6/brick0         49157     0          Y       4207
Self-heal Daemon on localhost               N/A       N/A        Y       4235

Task Status of Volume test
------------------------------------------------------------------------------
There are no active volume tasks

[root@server6 ~]# ls /bricks/brick4/brick0
file1  file2  file5  file6  file8
[root@server6 ~]# ls /bricks/brick5/brick0
file1  file2  file5  file6  file8
[root@server6 ~]# ls /bricks/brick6/brick0
file1  file2  file5  file6  file8
[root@server6 ~]# kill -9 4167 4187 4207
[root@server6 ~]# gluster v start test force
volume start: test: success
[root@server6 ~]# gluster v status
Status of volume: test
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick server6:/bricks/brick1/brick0         49152     0          Y       3403
Brick server6:/bricks/brick2/brick0         49153     0          Y       3423
Brick server6:/bricks/brick3/brick0         49154     0          Y       3443
Brick server6:/bricks/brick4/brick0         49155     0          Y       4410
Brick server6:/bricks/brick5/brick0         49156     0          Y       4430
Brick server6:/bricks/brick6/brick0         49157     0          Y       4450
Self-heal Daemon on localhost               N/A       N/A        Y       4471

Task Status of Volume test
------------------------------------------------------------------------------
There are no active volume tasks

2.3 In another terminal, run the test executable again; locking file1 succeeds:

[root@server6 ~]# ./test
lock: Success

2.4 However, in gluster 3.7.12, after glusterfsd restarted, the glusterfs client process would reopen and re-lock file1, so the test executable could not lock file1 again.
1. "kill all glusterfsd which bricks store the fourth_file5.txt" means killing all 3 bricks of the replica that contains that file.

2. In fact, after killing all 3 bricks we cannot access the file. But after all 3 bricks restart, the glusterfs client process reopens fourth_file5.txt without re-acquiring the lock on it, now that the 'Remove lock recovery logic from client and server protocol translators' patch has been committed.
(In reply to Xavi Hernandez from comment #7)
> Anoop, is this a regression ? bug #1272030 says that the lock recovery
> functionality is removed from client/server xlators because AFR will have
> better knowledge about the real state of locks. Is this feature already
> implemented in AFR ?

It is not.
(In reply to yinkui from comment #8)
> I reproduce this bug.
> 1、gluster v info test
> [root@server6 ~]# gluster v info test
> Volume Name: test
> Type: Distributed-Replicate
> Volume ID: 8c07f99a-85fe-4d62-b2e0-3deb480a371a
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: server6:/bricks/brick1/brick0
> Brick2: server6:/bricks/brick2/brick0
> Brick3: server6:/bricks/brick3/brick0
> Brick4: server6:/bricks/brick4/brick0
> Brick5: server6:/bricks/brick5/brick0
> Brick6: server6:/bricks/brick6/brick0
> Options Reconfigured:
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: off
> 2、Steps by reproduce:
> 2.1 lock file on file1
> [root@server6 ~]# ./test
> lock: Success

Do you maintain the fd in an opened state through the next steps? If yes, then it should be reopened after the brick processes come back up on force-starting the volume; check for the entry "1 fds open - Delaying child_up until they are re-opened" in the mount log. If not, then there is no point in trying to lock again, as the previous lock would have been released as part of closing the fd or exiting the application.
Yes, the fd remains open through the next steps. The mount point's log has the entry "1 fds open - Delaying child_up until they are re-opened".

mount log:

[2019-12-10 09:49:56.878253] I [MSGID: 114046] [client-handshake.c:1112:client_setvolume_cbk] 0-test-client-3: Connected to test-client-3, attached to remote volume '/bricks/brick4/brick0'.
[2019-12-10 09:49:56.878320] I [MSGID: 114042] [client-handshake.c:936:client_post_handshake] 0-test-client-3: 1 fds open - Delaying child_up until they are re-opened
[2019-12-10 09:49:56.878927] I [MSGID: 114041] [client-handshake.c:324:client_child_up_reopen_done] 0-test-client-3: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-10 09:49:56.879013] I [MSGID: 108005] [afr-common.c:5250:__afr_handle_child_up_event] 0-test-replicate-1: Subvolume 'test-client-3' came back up; going online.
[2019-12-10 09:49:56.882082] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-test-client-4: changing port to 49156 (from 0)
[2019-12-10 09:49:56.885800] I [MSGID: 114046] [client-handshake.c:1112:client_setvolume_cbk] 0-test-client-5: Connected to test-client-5, attached to remote volume '/bricks/brick6/brick0'.
[2019-12-10 09:49:56.885832] I [MSGID: 114042] [client-handshake.c:936:client_post_handshake] 0-test-client-5: 1 fds open - Delaying child_up until they are re-opened
[2019-12-10 09:49:56.886449] I [MSGID: 114041] [client-handshake.c:324:client_child_up_reopen_done] 0-test-client-5: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-10 09:49:56.886505] I [MSGID: 108002] [afr-common.c:5825:afr_notify] 0-test-replicate-1: Client-quorum is met
[2019-12-10 09:49:56.890239] I [MSGID: 114046] [client-handshake.c:1112:client_setvolume_cbk] 0-test-client-4: Connected to test-client-4, attached to remote volume '/bricks/brick5/brick0'.
[2019-12-10 09:49:56.890273] I [MSGID: 114042] [client-handshake.c:936:client_post_handshake] 0-test-client-4: 1 fds open - Delaying child_up until they are re-opened
[2019-12-10 09:49:56.890855] I [MSGID: 114041] [client-handshake.c:324:client_child_up_reopen_done] 0-test-client-4: last fd open'd/lock-self-heal'd - notifying CHILD-UP
(In reply to yinkui from comment #12)
> Yes, the fd in opening state through the next steps. The mount_point'log has
> entry "1 fds open - Delaying child_up until they are re-opened".
> mount log:

Cool. So that part is clear.

Now, it is my understanding that, prior to the commit mentioned in the summary, re-acquiring of locks would only be attempted with the lock-heal volume set option enabled. As per the current code base the behaviour you are observing is correct, i.e., locks held before a brick disconnect are cleaned up without registering a grace timer (the volume set option was also removed). But are you sure that you didn't have that option enabled with v3.7.X?
The options were enabled with v3.7.X:

[root@node-1 ~]# gluster v get test all | grep features.lock-heal
features.lock-heal                      on

[root@node-1 ~]# gluster v get test all | grep features.grace-timeout
features.grace-timeout                  10
(In reply to yinkui from comment #14)
> Options enabled with v3.7.X.
>
> [root@node-1 ~]# gluster v get test all | grep features.lock-heal
> features.lock-heal on
>
> [root@node-1 ~]# gluster v get test all | grep features.grace-timeout
> features.grace-timeout 10

Ah.. then this behaviour is expected, as these options (and the corresponding implementation) were removed with v4.0 due to their incomplete and/or incorrect logic in various disconnect scenarios. It was decided to place the lock-healing logic in the replication layer rather than in protocol/client.

In case you are interested, refer to the following doc for some notes on the internals of the lock-healing implementation and its issues:

https://anoopcs.fedorapeople.org/Lock%20recovery%20in%20GlusterFS.txt
I prefer to close the bug report as NOTABUG.
Can you give me a reason? Thanks. My understanding is that re-acquiring of locks should be attempted so that a different process cannot successfully set a lock on the same file. You can close the bug if you have a good reason.
This bug is moved to https://github.com/gluster/glusterfs/issues/864, and will be tracked there from now on. Visit the GitHub issue URL for further details.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days