Bug 1390914
Created attachment 1216826 [details]
sn-0_mnt-bricks-services-brick.1062.dump.1477882903 is dumped from the glusterfs server on sn-0.
sn-1mnt-bricks-services-brick.1066.dump.1477882904 is dumped from the glusterfs server on sn-1.
Analysis of the attachments:

1. The logs below are copied from sn-1mnt-bricks-services-brick.1066.dump.1477882904. The granted flock was never released on sn-1, although it was released on sn-0.

[xlator.features.locks.services-locks.inode]
path=/lightcm/locks/nodes.all
mandatory=0
posixlk-count=2
posixlk.posixlk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 17537, owner=18a27aa2d1d2944f, client=0x19e6b60, connection-id=(null), granted at 2016-10-31 02:58:07
posixlk.posixlk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 17559, owner=18764b24e15b0520, client=0x19e6b60, connection-id=(null), blocked at 2016-10-31 02:58:07

2. This flock comes from the lmn VM. The logs below are copied from lmn_mnt-services.log.

// The application opened the "nodes.all" file. The OPEN request was sent to sn-0, but not to sn-1, because 0-services-client-1 was not UP yet.
[2016-10-31 02:57:50.816359] T [rpc-clnt.c:1381:rpc_clnt_record] 0-services-client-0: Auth Info: pid: 17536, uid: 0, gid: 0, owner: 0000000000000000
[2016-10-31 02:57:50.816371] T [rpc-clnt.c:1238:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 92, payload: 24, rpc hdr: 68
[2016-10-31 02:57:50.816390] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0xc7b7aa Program: GlusterFS 3.3, ProgVers: 330, Proc: 11) to rpc-transport (services-client-0)

// The application flocked the file. The FLOCK request was sent to sn-0, and the lock was granted by sn-0 successfully.
[2016-10-31 02:57:50.817424] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0xc7b7ab Program: GlusterFS 3.3, ProgVers: 330, Proc: 26) to rpc-transport (services-client-0)
[2016-10-31 02:57:51.277349] T [rpc-clnt.c:660:rpc_clnt_reply_init] 0-services-client-0: received rpc message (RPC XID: 0xc7b7ab Program: GlusterFS 3.3, ProgVers: 330, Proc: 26) from rpc-transport (services-client-0)

// The AFR xlator then sent the FLOCK fop to sn-1, even though 0-services-client-1 was still not UP.
Because AFR had not opened "nodes.all" on sn-1, the client xlator sent the fop to the server with an anonymous fd (GF_ANON_FD_NO, -2), and the lock was granted by that server successfully too. This lock was never released.

[2016-10-31 02:57:51.277397] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0xbe5479 Program: GlusterFS 3.3, ProgVers: 330, Proc: 26) to rpc-transport (services-client-1)
[2016-10-31 02:57:51.324258] T [rpc-clnt.c:660:rpc_clnt_reply_init] 0-services-client-1: received rpc message (RPC XID: 0xbe5479 Program: GlusterFS 3.3, ProgVers: 330, Proc: 26) from rpc-transport (services-client-1)

// The application released the flock. The RELEASE fop was sent to sn-0 and the lock was released on sn-0, but the RELEASE fop was not sent to sn-1.
[2016-10-31 02:57:51.404366] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0xc7b7ec Program: GlusterFS 3.3, ProgVers: 330, Proc: 41) to rpc-transport (services-client-0)
[2016-10-31 02:57:51.406273] T [rpc-clnt.c:660:rpc_clnt_reply_init] 0-services-client-0: received rpc message (RPC XID: 0xc7b7ec Program: GlusterFS 3.3, ProgVers: 330, Proc: 41) from rpc-transport (services-client-0)

// Note the 0-services-client-0 UP time: it was before "nodes.all" was opened.
[2016-10-31 02:57:49.802345] I [client-handshake.c:1052:client_post_handshake] 0-services-client-0: 22 fds open - Delaying child_up until they are re-opened
[2016-10-31 02:57:49.983331] I [client-handshake.c:674:client_child_up_reopen_done] 0-services-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP

// Note the 0-services-client-1 UP time: it was after the flock release.
[2016-10-31 02:57:51.244367] I [client-handshake.c:1052:client_post_handshake] 0-services-client-1: 21 fds open - Delaying child_up until they are re-opened
[2016-10-31 02:57:51.731500] I [client-handshake.c:674:client_child_up_reopen_done] 0-services-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP

3. Code: the FLOCK fop is sent to the server with an anonymous fd if the file being flocked was never opened there.

client3_3_lk (call_frame_t *frame, xlator_t *this, void *data)
{
}

The RELEASE fop is sent to the server only if the file was opened there.

client3_3_release (call_frame_t *frame, xlator_t *this, void *data)
{
}

4. Is there any way to fix this issue? We added a patch for it; please check whether it is OK.

--- a/old/client-rpc-fops.c
+++ b/new/client-rpc-fops.c
@@ -5260,6 +5260,14 @@ client3_3_lk (call_frame_t *frame, xlator_t *this,
         CLIENT_GET_REMOTE_FD (this, args->fd, DEFAULT_REMOTE_FD,
                               remote_fd, op_errno, unwind);

+        if (remote_fd < 0)
+        {
+                gf_log (this->name, GF_LOG_INFO, "Didn't open remote fd(%ld), but return EBADFD, AFR shall ignore such error. pid: %u ", remote_fd, frame->root->pid);
+                op_errno = EBADFD;
+                CLIENT_STACK_UNWIND (lk, frame, -1, op_errno, NULL, NULL);
+                return 0;
+        }
+
         ret = client_cmd_to_gf_cmd (args->cmd, &gf_cmd);
         if (ret) {
                 op_errno = EINVAL;

CCing Pranith (AFR and posix-locks code maintainer).

REVIEW: http://review.gluster.org/15804 (protocol/client: Do not fallback to anon-fd if fd is not open) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

(In reply to xiaopwu from comment #3)
> We added a patch for this issue, please check if it is ok.
>
> --- a/old/client-rpc-fops.c
> +++ b/new/client-rpc-fops.c
> @@ -5260,6 +5260,14 @@ client3_3_lk (call_frame_t *frame, xlator_t *this,
>          CLIENT_GET_REMOTE_FD (this, args->fd, DEFAULT_REMOTE_FD,
>                                remote_fd, op_errno, unwind);
>
> +        if (remote_fd < 0)
> +        {
> +                gf_log (this->name, GF_LOG_INFO, "Didn't open remote fd(%ld), but return EBADFD, AFR shall ignore such error. pid: %u ", remote_fd, frame->root->pid);
> +                op_errno = EBADFD;
> +                CLIENT_STACK_UNWIND (lk, frame, -1, op_errno, NULL, NULL);
> +                return 0;
> +        }
> +
>          ret = client_cmd_to_gf_cmd (args->cmd, &gf_cmd);
>          if (ret) {
>                  op_errno = EINVAL;

Awesome debugging!! I think the fix can be generic. I posted the fix to the bug I introduced a while back!
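The patch boils down to one decision: if the remote fd was never opened on this brick, fail the LK fop with EBADFD instead of silently falling back to the anonymous fd. A minimal standalone sketch of that decision follows; the function name `lk_check_remote_fd` and the simplified return convention are ours for illustration, not GlusterFS API.

```c
#include <errno.h>

/* GF_ANON_FD_NO is -2 in the report; any negative remote fd means the
 * file was never opened on this brick. */
#define GF_ANON_FD_NO (-2)

/* Sketch of the check the patch adds to client3_3_lk(): return 0 to
 * proceed with the LK request, or an errno to unwind the fop instead
 * of retrying on the anonymous fd. */
static int
lk_check_remote_fd (long remote_fd)
{
        if (remote_fd < 0)
                return EBADFD; /* unwind; AFR treats this child as failed */
        return 0;              /* fd is open here; send the LK request */
}
```

Failing the fop this way means a lock is only ever granted through a real fd, so the later RELEASE (which is fd-based) can always reach the same brick. Note EBADFD is Linux-specific.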
Thanks for raising this bug! I will port this patch to lower branches once this one passes regressions.

Pranith

Regressions are failing on the change I submitted; will look into it a bit more and update.

REVIEW: https://review.gluster.org/15804 (protocol/client: Do not fallback to anon-fd if fd is not open) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

REVIEW: https://review.gluster.org/15804 (protocol/client: Do not fallback to anon-fd if fd is not open) posted (#3) for review on master by Pranith Kumar Karampuri

REVIEW: https://review.gluster.org/15804 (protocol/client: Do not fallback to anon-fd if fd is not open) merged (#8) on master by Raghavendra G
Created attachment 1216401 [details]
sn-0_mnt-bricks-services-brick.log is the glusterfs server log from sn-0. sn-1_mnt-bricks-services-brick.log is the glusterfs server log from sn-1. lmn_mnt-services.log is the glusterfs client log from lmn-0.

Description of problem:
- When the TCP connection between server and client is established but AFR has not yet received the CHILD_UP event, and an application opens a file for a flock operation, AFR does not send the OPEN fop to that child client.
- The flock fop can nonetheless be sent to that server with an anonymous fd (-2), and the server grants the flock successfully.
- When the application releases the flock, the client cannot send the RELEASE request for the never-opened fd to that server, so the server never releases the flock.

Version-Release number of selected component (if applicable): 3.6.9

How reproducible:
- In my environment there is a replicate volume: one brick is located on the sn-0 VM, the other on the sn-1 VM.
- glusterfs clients are located on the lmn-0, sn-0 and sn-1 VMs.
- This issue is not easy to reproduce.

Steps to Reproduce:
1. Restart sn-0 and sn-1 at the same time.
2. The glusterfs servers on sn-0/sn-1 start up.
3. The application on the lmn-0 VM opens a file and flocks it.

Actual results:
1. AFR gets the flock from sn-0 successfully through an actual fd.
2. AFR gets the flock from sn-1 successfully through an anonymous fd (-2).
3. When the application releases the flock, the lock on sn-1 is never released, because the RELEASE request is not sent to that server.

Additional info: