+++ This bug was initially created as a clone of Bug #1717824 +++

Description of problem:

Glusterfs now has support for the fencing feature, which allows LIO/TCMU to support ALUA.

The fencing doc: https://review.gluster.org/#/c/glusterfs-specs/+/21925/6/accepted/fencing.md
The fencing test example: https://review.gluster.org/#/c/glusterfs/+/21496/12/tests/basic/fencing/fence-basic.c
The LIO/tcmu-runner PR adding ALUA support: https://github.com/open-iscsi/tcmu-runner/pull/554

However, when testing tcmu-runner with the above PR by shutting down one of the HA nodes and starting it again after 2-3 minutes, glfs_file_lock() gets stuck on all the HA nodes. The following is from /var/log/tcmu-runner.log:

====
2019-06-06 13:50:15.755 1316 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block3: lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 13:50:15.757 1316 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block3: lock call done. lock state 1
2019-06-06 13:50:55.845 1316 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block4: lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 13:50:55.847 1316 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block4: lock call done. lock state 1
2019-06-06 13:57:50.102 1315 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block3: lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 13:57:50.103 1315 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block3: lock call done. lock state 1
2019-06-06 13:57:50.121 1315 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block4: lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 13:57:50.132 1315 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block4: lock call done. lock state 1
2019-06-06 14:09:03.654 1328 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block3: lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 14:09:03.662 1328 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block3: lock call done. lock state 1
2019-06-06 14:09:06.700 1328 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block4: lock call state 2 retries 0. tag 65535 reopen 0
====

The last lock operation never returns.

I am using the following glusterfs packages built by myself:

# rpm -qa | grep glusterfs
glusterfs-extra-xlators-7dev-0.0.el7.x86_64
glusterfs-api-devel-7dev-0.0.el7.x86_64
glusterfs-7dev-0.0.el7.x86_64
glusterfs-server-7dev-0.0.el7.x86_64
glusterfs-cloudsync-plugins-7dev-0.0.el7.x86_64
glusterfs-resource-agents-7dev-0.0.el7.noarch
glusterfs-api-7dev-0.0.el7.x86_64
glusterfs-devel-7dev-0.0.el7.x86_64
glusterfs-regression-tests-7dev-0.0.el7.x86_64
glusterfs-gnfs-7dev-0.0.el7.x86_64
glusterfs-client-xlators-7dev-0.0.el7.x86_64
glusterfs-geo-replication-7dev-0.0.el7.x86_64
glusterfs-debuginfo-7dev-0.0.el7.x86_64
glusterfs-fuse-7dev-0.0.el7.x86_64
glusterfs-events-7dev-0.0.el7.x86_64
glusterfs-libs-7dev-0.0.el7.x86_64
glusterfs-cli-7dev-0.0.el7.x86_64
glusterfs-rdma-7dev-0.0.el7.x86_64

How reproducible:
30%.

Steps to Reproduce:
1. Create one replicated volume (HA >= 2) with mandatory locking enabled.
2. Create one gluster-blockd target.
3. Log in and run fio on the client node.
4. Shut down one of the HA nodes, wait 2-3 minutes, and start it again.

Actual results:
fio never recovers and the read/write bandwidth stays at 0 KB/s, and /var/log/tcmu-runner.log fills with:

2019-06-06 15:01:06.641 1328 [DEBUG] alua_implicit_transition:561 glfs/block4: Lock acquisition operation is already in process.
2019-06-06 15:01:06.648 1328 [DEBUG_SCSI_CMD] tcmu_cdb_print_info:353 glfs/block4: 28 0 0 3 1f 80 0 0 8 0
2019-06-06 15:01:06.648 1328 [DEBUG] alua_implicit_transition:561 glfs/block4: Lock acquisition operation is already in process.
2019-06-06 15:01:06.655 1328 [DEBUG_SCSI_CMD] tcmu_cdb_print_info:353 glfs/block4: 28 0 0 3 1f 80 0 0 8 0
2019-06-06 15:01:06.655 1328 [DEBUG] alua_implicit_transition:561 glfs/block4: Lock acquisition operation is already in process.
2019-06-06 15:01:06.661 1328 [DEBUG_SCSI_CMD] tcmu_cdb_print_info:353 glfs/block4: 28 0 0 3 1f 80 0 0 8 0
2019-06-06 15:01:06.662 1328 [DEBUG] alua_implicit_transition:561 glfs/block4: Lock acquisition operation is already in process.

Expected results:
fio should recover shortly after the shut-down node comes back up.

--- Additional comment from Xiubo Li on 2019-06-06 14:39:50 MVT ---

--- Additional comment from Xiubo Li on 2019-06-06 14:40:16 MVT ---

--- Additional comment from Xiubo Li on 2019-06-06 14:42:10 MVT ---

The bt output from gdb:

[root@rhel1 ~]# gdb -p 1325
(gdb) bt
#0  0x00007fc7761baf47 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fc7773de468 in event_dispatch_epoll (event_pool=0x559f03d4b560) at event-epoll.c:847
#2  0x0000559f02419658 in main (argc=21, argv=0x7fff9c6722c8) at glusterfsd.c:2871
(gdb)

[root@rhel3 ~]# gdb -p 7669
(gdb) bt
#0  0x00007fac80bd9f47 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x00007fac81dfd468 in event_dispatch_epoll (event_pool=0x55de6f845560) at event-epoll.c:847
#2  0x000055de6f143658 in main (argc=21, argv=0x7ffcafc3eff8) at glusterfsd.c:2871
(gdb)

The pl_inode->fop_wind_count is:

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fc742184700 (LWP 1829))]
#0  0x00007fc7761bd965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) frame 2
#2  0x00007fc76379c13b in pl_lk (frame=frame@entry=0x7fc750001128, this=this@entry=0x7fc75c0128f0, fd=fd@entry=0x7fc73c0977d8, cmd=cmd@entry=6, flock=flock@entry=0x7fc73c076938, xdata=xdata@entry=0x7fc73c071828) at posix.c:2637
2637            ret = pl_lock_preempt(pl_inode, reqlock);
(gdb) p pl_inode->fop_wind_count
$1 = -30
(gdb)

For the pstack logs, please see the attachments.

Thanks.
BRs

--- Additional comment from Susant Kumar Palai on 2019-06-10 12:10:33 MVT ---

Just a small update: there are cases where fop_wind_count can go negative. A basic fix would be to never decrement it further once it is zero.
I will update more on this later as I am busy with a few other issues at the moment.

Susant

--- Additional comment from Xiubo Li on 2019-07-17 13:15:51 MVT ---

Hi Susant,

Is there any new update about this?

Thanks.

--- Additional comment from Susant Kumar Palai on 2019-07-17 13:20:56 MVT ---

(In reply to Xiubo Li from comment #5)
> Hi Susant,
>
> Is there any new update about this ?
>
> Thanks.

Hey Xiubo, I will most likely be sending a patch by the end of the day today.

--- Additional comment from Xiubo Li on 2019-07-17 13:22:28 MVT ---

(In reply to Susant Kumar Palai from comment #6)
> (In reply to Xiubo Li from comment #5)
> > Hi Susant,
> >
> > Is there any new update about this ?
> >
> > Thanks.
>
> Hey Xiubo, most likely will be sending a patch by end of day today.

Sure, please take your time Susant :-)

Thanks very much.
BRs

--- Additional comment from Susant Kumar Palai on 2019-07-17 14:13:59 MVT ---

Moved to POST by mistake. Resetting.

--- Additional comment from Worker Ant on 2019-07-22 14:59:10 MVT ---

REVIEW: https://review.gluster.org/23088 (locks/fencing: Address while lock preemption) posted (#1) for review on master by Susant Palai

--- Additional comment from Susant Kumar Palai on 2019-07-22 15:02:07 MVT ---

Xiubo, if you could test it out, it would be great. (Make sure you enable fencing before you create any client.)

--- Additional comment from Xiubo Li on 2019-07-22 15:42:38 MVT ---

(In reply to Susant Kumar Palai from comment #10)
> Xiubo, if you could test it out, it would be great. (Make sure you enable
> fencing before you create any client)

@Susant,

Yeah, thanks very much for your work on this. I will test it later today or tomorrow.

BRs
Xiubo

--- Additional comment from Xiubo Li on 2019-07-23 07:44:00 MVT ---

@Susant,

Two issues were found:

1. From my test, glfs_file_lock() sometimes takes around 43 seconds. Is that normal? And why?
90456 2019-07-23 10:08:57.444 31411 [INFO] tcmu_glfs_lock:901 glfs/block0: lxb--------------glfs_file_lock start
...
319959 2019-07-23 10:09:40.183 31411 [INFO] tcmu_glfs_lock:905 glfs/block0: lxb--------------glfs_file_lock end

2. After the lock is broken, all the fio callbacks return -1 with -EPERM, not the -EBUSY we discussed before. Has the return value changed? I am only checking for -EBUSY and -ENOTCONN, and only on those is the lock state in the local tcmu node changed. Otherwise the local state in the tcmu-runner service stays LOCKED even though the lock has actually been lost and should be UNLOCKED, so all the IOs fail.

Thanks,
BRs

--- Additional comment from Susant Kumar Palai on 2019-07-23 10:45:01 MVT ---

(In reply to Xiubo Li from comment #12)
> @Susant,
>
> There 2 issues are found:
>
>
> 1, From my test the glfs_file_lock() sometimes it will takes around 43
> seconds, is it normal ? And why ?
>
> 90456 2019-07-23 10:08:57.444 31411 [INFO] tcmu_glfs_lock:901 glfs/block0:
> lxb--------------glfs_file_lock start
> ...
> 319959 2019-07-23 10:09:40.183 31411 [INFO] tcmu_glfs_lock:905 glfs/block0:
> lxb--------------glfs_file_lock end

I wonder if it is related to the draining of fops. Let me do some testing around this.

> 2, After the lock is broke and all the FIOs callback will return -1, the
> -EPERM, not the -EBUSY as we discussed before. Is there any change about the
> return value ? I am only checking the -EBUSY and -ENOTCONN then only after
> that the lock state in local tcmu node will be changed. Or the local state
> in tcmu-runner service will always in LOCKED state, but it actually already
> lost the lock and should be in UNLOCKED state, so all the IOs will fail.

This is interesting. Will get back after some code checking.
>
>
>
> Thanks,
> BRs

--- Additional comment from Xiubo Li on 2019-07-23 10:47:15 MVT ---

(In reply to Susant Kumar Palai from comment #13)
> (In reply to Xiubo Li from comment #12)
> > @Susant,
> >
> > There 2 issues are found:
> >
> >
> > 1, From my test the glfs_file_lock() sometimes it will takes around 43
> > seconds, is it normal ? And why ?
> >
> > 90456 2019-07-23 10:08:57.444 31411 [INFO] tcmu_glfs_lock:901 glfs/block0:
> > lxb--------------glfs_file_lock start
> > ...
> > 319959 2019-07-23 10:09:40.183 31411 [INFO] tcmu_glfs_lock:905 glfs/block0:
> > lxb--------------glfs_file_lock end
>
> I wonder if it is related to draining of fops. Let me do some testing around
> this.

Sure.

> > 2, After the lock is broke and all the FIOs callback will return -1, the
> > -EPERM, not the -EBUSY as we discussed before. Is there any change about the
> > return value ? I am only checking the -EBUSY and -ENOTCONN then only after
> > that the lock state in local tcmu node will be changed. Or the local state
> > in tcmu-runner service will always in LOCKED state, but it actually already
> > lost the lock and should be in UNLOCKED state, so all the IOs will fail.
>
> This is interesting. Will get back after some code checking.

Please take your time @Susant.

Thanks,
BRs

--- Additional comment from Susant Kumar Palai on 2019-07-23 12:44:22 MVT ---

(In reply to Xiubo Li from comment #12)
> @Susant,
>
> There 2 issues are found:
>
>
> 1, From my test the glfs_file_lock() sometimes it will takes around 43
> seconds, is it normal ? And why ?
>
> 90456 2019-07-23 10:08:57.444 31411 [INFO] tcmu_glfs_lock:901 glfs/block0:
> lxb--------------glfs_file_lock start
> ...
> 319959 2019-07-23 10:09:40.183 31411 [INFO] tcmu_glfs_lock:905 glfs/block0:
> lxb--------------glfs_file_lock end

Checked the time taken for file_lock and it completes immediately for me.
    ret = glfs_fsetxattr(fd1, GF_ENFORCE_MANDATORY_LOCK, "set", 8, 0);
    if (ret < 0) {
        LOG_ERR("glfs_fsetxattr", errno);
        ret = -1;
        goto out;
    }

    time(&before);
    /* take a write mandatory lock */
    ret = glfs_file_lock(fd1, F_SETLKW, &lock, GLFS_LK_MANDATORY);
    if (ret) {
        LOG_ERR("glfs_file_lock", errno);
        goto out;
    }
    time(&after);
    diff = (unsigned long)after - before;
    fprintf(fp, "time %lu %lu %lu\n", diff, before, after);

time 0 1563867824 1563867824

Can you attach the brick log here when you run the test next time?

> 2, After the lock is broke and all the FIOs callback will return -1, the
> -EPERM, not the -EBUSY as we discussed before. Is there any change about the
> return value ? I am only checking the -EBUSY and -ENOTCONN then only after
> that the lock state in local tcmu node will be changed. Or the local state
> in tcmu-runner service will always in LOCKED state, but it actually already
> lost the lock and should be in UNLOCKED state, so all the IOs will fail.

Please attach the brick log after enabling trace logging: <gluster v set <vol> brick-log-level TRACE>

--- Additional comment from Xiubo Li on 2019-07-23 12:48:58 MVT ---

(In reply to Susant Kumar Palai from comment #15)
> (In reply to Xiubo Li from comment #12)
> > @Susant,
> >
> > There 2 issues are found:
> >
> >
> > 1, From my test the glfs_file_lock() sometimes it will takes around 43
> > seconds, is it normal ? And why ?
> >
> > 90456 2019-07-23 10:08:57.444 31411 [INFO] tcmu_glfs_lock:901 glfs/block0:
> > lxb--------------glfs_file_lock start
> > ...
> > 319959 2019-07-23 10:09:40.183 31411 [INFO] tcmu_glfs_lock:905 glfs/block0:
> > lxb--------------glfs_file_lock end
>
> Checked the time taken for file_lock and it completes immediately for me.
> ret = glfs_fsetxattr(fd1, GF_ENFORCE_MANDATORY_LOCK, "set", 8, 0);
> if (ret < 0) {
>     LOG_ERR("glfs_fsetxattr", errno);
>     ret = -1;
>     goto out;
> }
>
> time(&before);
> /* take a write mandatory lock */
> ret = glfs_file_lock(fd1, F_SETLKW, &lock, GLFS_LK_MANDATORY);
> if (ret) {
>     LOG_ERR("glfs_file_lock", errno);
>     goto out;
> }
> time(&after);
> diff = (unsigned long)after - before;
> fprintf(fp, "time %lu %lu %lu\n", diff, before, after);
>
> time 0 1563867824 1563867824
>
> Can you attach the brick log here when you run the test next time?
>
> > 2, After the lock is broke and all the FIOs callback will return -1, the
> > -EPERM, not the -EBUSY as we discussed before. Is there any change about the
> > return value ? I am only checking the -EBUSY and -ENOTCONN then only after
> > that the lock state in local tcmu node will be changed. Or the local state
> > in tcmu-runner service will always in LOCKED state, but it actually already
> > lost the lock and should be in UNLOCKED state, so all the IOs will fail.
>
> Please attach the brick log after enabling trace logging. <gluster v set
> <vol> brick-log-level TRACE>

Sure, I will do that after my current work in hand, possibly late today or tomorrow morning.

Thanks,
BRs

--- Additional comment from Susant Kumar Palai on 2019-07-25 15:05:47 MVT ---

On the permission denied issue: I did not see any error related to EPERM, but I did see EBUSY in the brick logs.
[2019-07-24 08:15:22.236283] E [MSGID: 101191] [event-epoll.c:765:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2019-07-24 08:15:46.083306] E [MSGID: 115068] [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 29: READV 0 (7db899f8-bf56-4b89-a4c6-90235e8c720a), client: CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9cc9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0, error-xlator: repvol3-locks [Resource temporarily unavailable]
[2019-07-24 08:15:46.088292] E [MSGID: 115068] [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 31: READV 0 (7db899f8-bf56-4b89-a4c6-90235e8c720a), client: CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9cc9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0, error-xlator: repvol3-locks [Resource temporarily unavailable]
[2019-07-24 08:15:46.119463] E [MSGID: 115068] [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 33: READV 0 (7db899f8-bf56-4b89-a4c6-90235e8c720a), client: CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9cc9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0, error-xlator: repvol3-locks [Resource temporarily unavailable]
[2019-07-24 08:15:46.124067] E [MSGID: 115068] [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 35: READV 0 (7db899f8-bf56-4b89-a4c6-90235e8c720a), client: CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9cc9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0, error-xlator: repvol3-locks [Resource temporarily unavailable]
[2019-07-24 08:15:46.294554] E [MSGID: 115068] [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 37: READV 0 (7db899f8-bf56-4b89-a4c6-90235e8c720a), client: CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9cc9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0, error-xlator: repvol3-locks [Resource temporarily unavailable]
[2019-07-24 08:15:46.298672] E [MSGID: 115068] [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 39: READV 0 (7db899f8-bf56-4b

Is it possible that the lower layer is converting the errnos to EPERM? Can you check the gfapi logs and tcmu logs for corresponding error messages and confirm?

--- Additional comment from Xiubo Li on 2019-07-25 15:39:23 MVT ---

(In reply to Susant Kumar Palai from comment #17)
> On the permission denied:
>
> I did not see any error related to EPERM but saw EBUSY in the brick logs.
>
>
> [2019-07-24 08:15:22.236283] E [MSGID: 101191]
> [event-epoll.c:765:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch
> handler
> [2019-07-24 08:15:46.083306] E [MSGID: 115068]
> [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 29: READV 0
> (7db899f8-bf56-4b89-a4c6-90235e8c720a), client:
> CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9c
> c9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0,
> error-xlator: repvol3-locks [Resource temporarily unavailable]
> [2019-07-24 08:15:46.088292] E [MSGID: 115068]
> [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 31: READV 0
> (7db899f8-bf56-4b89-a4c6-90235e8c720a), client:
> CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9c
> c9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0,
> error-xlator: repvol3-locks [Resource temporarily unavailable]
> [2019-07-24 08:15:46.119463] E [MSGID: 115068]
> [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 33: READV 0
> (7db899f8-bf56-4b89-a4c6-90235e8c720a), client:
> CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9c
> c9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0,
> error-xlator: repvol3-locks [Resource temporarily unavailable]
> [2019-07-24 08:15:46.124067] E [MSGID: 115068]
> [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 35: READV 0
> (7db899f8-bf56-4b89-a4c6-90235e8c720a), client:
> CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9c
> c9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0,
> error-xlator: repvol3-locks [Resource temporarily unavailable]
> [2019-07-24 08:15:46.294554] E [MSGID: 115068]
> [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 37: READV 0
> (7db899f8-bf56-4b89-a4c6-90235e8c720a), client:
> CTX_ID:024a059c-7be1-4a19-ba27-8624c6e9c
> c9c-GRAPH_ID:0-PID:9399-HOST:rhel3-PC_NAME:repvol3-client-2-RECON_NO:-0,
> error-xlator: repvol3-locks [Resource temporarily unavailable]
> [2019-07-24 08:15:46.298672] E [MSGID: 115068]
> [server-rpc-fops_v2.c:1425:server4_readv_cbk] 0-repvol3-server: 39: READV 0
> (7db899f8-bf56-4b
>
>
> Is it possible that the lower layer is converting the errnos to EPERM? Can
> you check gfapi logs and tcmu logs for corresponding error messages and
> confirm?

If so, maybe gfapi is doing this. The EPERM value comes from gfapi directly and tcmu-runner does nothing to it; I will send you the gfapi logs.

Checking the gfapi log, it is also full of:

[2019-07-24 08:23:41.042339] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-repvol3-client-1: remote operation failed [Device or resource busy]
[2019-07-24 08:23:41.042381] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-repvol3-client-0: remote operation failed [Device or resource busy]
[2019-07-24 08:23:41.042556] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-repvol3-client-1: remote operation failed [Device or resource busy]
[2019-07-24 08:23:41.042574] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-repvol3-client-0: remote operation failed [Device or resource busy]
[2019-07-24 08:23:41.042655] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-repvol3-client-1: remote operation failed [Device or resource busy]
[2019-07-24 08:23:41.042671] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-repvol3-client-0: remote operation failed [Device or resource busy]
[2019-07-24 08:23:41.042709] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-repvol3-client-1: remote operation failed [Device or resource busy]
[2019-07-24 08:23:41.042722] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-repvol3-client-0: remote operation failed [Device or resource busy]
[2019-07-24 08:23:41.042784] W [MSGID: 114031] [client-rpc-fops_v2.c:680:client4_0_writev_cbk] 0-repvol3-client-1: remote operation failed [Device or resource busy]

Checked the gfapi source code:

677 out:
678     if (rsp.op_ret == -1) {
679         gf_msg(this->name, GF_LOG_WARNING, gf_error_to_errno(rsp.op_errno),
680                PC_MSG_REMOTE_OP_FAILED, "remote operation failed");
681     } else if (rsp.op_ret >= 0) {
682         if (local->attempt_reopen)
683             client_attempt_reopen(local->fd, this);
684     }
685     CLIENT_STACK_UNWIND(writev, frame, rsp.op_ret,
686                         gf_error_to_errno(rsp.op_errno), &prestat, &poststat,
687                         xdata);
688
689     if (xdata)
690         dict_unref(xdata);

It seems the return value is converted.

Thanks,
BRs

--- Additional comment from Xiubo Li on 2019-07-25 15:42:32 MVT ---

(In reply to Xiubo Li from comment #18)
> (In reply to Susant Kumar Palai from comment #17)
[...]
>
> Checked the gfapi source code:
>
> 677 out:
> 678     if (rsp.op_ret == -1) {

It seems rsp.op_ret is what is returned here to the callback:

    static void glfs_async_cbk(glfs_fd_t *fd, ssize_t ret, void *data)

not rsp.op_errno.

> 679         gf_msg(this->name, GF_LOG_WARNING, gf_error_to_errno(rsp.op_errno),
> 680                PC_MSG_REMOTE_OP_FAILED, "remote operation failed");
> 681     } else if (rsp.op_ret >= 0) {
> 682         if (local->attempt_reopen)
> 683             client_attempt_reopen(local->fd, this);
> 684     }
> 685     CLIENT_STACK_UNWIND(writev, frame, rsp.op_ret,
> 686                         gf_error_to_errno(rsp.op_errno), &prestat, &poststat,
> 687                         xdata);
> 688
> 689     if (xdata)
> 690         dict_unref(xdata);
>
>
> It seems the return value is converted.
>
> Thanks,
> BRs

--- Additional comment from Xiubo Li on 2019-08-01 06:04:52 MVT ---

Checking errno directly when ret == -1 works for me now. But I can get both -EAGAIN and -EBUSY, while only -EBUSY is expected.
Then the question is why there is always an -EAGAIN every time before the lock is acquired?

Thanks
BRs

--- Additional comment from Worker Ant on 2019-08-02 19:27:15 MVT ---

REVIEW: https://review.gluster.org/23088 (locks/fencing: Address hang while lock preemption) merged (#4) on master by Amar Tumballi

--- Additional comment from Xiubo Li on 2019-08-04 18:03:30 MVT ---

@Susant,

Since the fencing patch has gone into release 6, this follow-up fix should be backported too, right?

Thanks.
BRs

--- Additional comment from Susant Kumar Palai on 2019-08-12 14:29:58 MVT ---

(In reply to Xiubo Li from comment #22)
> @Susant,
>
> Since the Fencing patch has been into the release 6, so this fixing followed
> should be backported, right ?
>
> Thanks.
> BRs

Will backport to release 6 and 7.
REVIEW: https://review.gluster.org/23217 (locks/fencing: Address hang while lock preemption) posted (#2) for review on release-6 by Susant Palai
This bug has been moved to https://github.com/gluster/glusterfs/issues/1045 and will be tracked there from now on. Visit the GitHub issue URL for further details.
Merging in release-6 is blocked, as the smoke test fails while this bug is in the closed state.
REVIEW: https://review.gluster.org/23217 (locks/fencing: Address hang while lock preemption) merged (#6) on release-6 by Rinku Kothiya