Description of problem:
=====================
I am trying to fetch the status of geo-rep without specifying the master-slave details. The command times out:

[root@rhs-gp-srv5 ~]# gluster v geo-rep status
Error : Request timed out
geo-replication command failed

With the master-slave details, status works:

[root@rhs-gp-srv5 ~]# gluster volume geo-replication nftvol rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave status

MASTER NODE                           MASTER VOL    MASTER BRICK                  SLAVE USER    SLAVE                                                SLAVE NODE                             STATUS     CRAWL STATUS    LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhs-gp-srv5.lab.eng.blr.redhat.com    nftvol        /gluster/brick1/nftvol-sv1    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv5.lab.eng.blr.redhat.com    nftvol        /gluster/brick2/nftvol-sv2    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv5.lab.eng.blr.redhat.com    nftvol        /gluster/brick3/nftvol-sv3    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv14.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv9.lab.eng.blr.redhat.com    nftvol        /gluster/brick1/nftvol-sv2    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv9.lab.eng.blr.redhat.com    nftvol        /gluster/brick2/nftvol-sv3    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv9.lab.eng.blr.redhat.com    nftvol        /gluster/brick3/nftvol-sv4    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv14.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nftvol        /gluster/brick1/nftvol-sv1    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nftvol        /gluster/brick2/nftvol-sv2    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv14.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nftvol        /gluster/brick3/nftvol-sv4    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nftvol        /gluster/brick1/nftvol-sv1    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv14.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nftvol        /gluster/brick2/nftvol-sv3    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nftvol        /gluster/brick3/nftvol-sv4    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nftvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 8070d7a5-e2be-46c6-b607-56d9199371b8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv7.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: rhs-gp-srv8.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick3: rhs-gp-srv5.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable

Volume Name: nftvol
Type: Distributed-Replicate
Volume ID: cb9567c5-051c-4cf9-bf8e-1cc7f5bfc129
Status: Started
Snapshot Count: 1
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nftvol-sv1
Brick2: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nftvol-sv1
Brick3: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nftvol-sv1
Brick4: rhs-gp-srv9.lab.eng.blr.redhat.com:/gluster/brick1/nftvol-sv2
Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nftvol-sv2
Brick6: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nftvol-sv2
Brick7: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nftvol-sv3
Brick8: rhs-gp-srv9.lab.eng.blr.redhat.com:/gluster/brick2/nftvol-sv3
Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nftvol-sv3
Brick10: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nftvol-sv4
Brick11: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nftvol-sv4
Brick12: rhs-gp-srv9.lab.eng.blr.redhat.com:/gluster/brick3/nftvol-sv4
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.barrier: disable
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
features.uss: enable
cluster.shd-max-threads: 24
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
client.event-threads: 8
server.event-threads: 8
cluster.enable-shared-storage: enable

Version-Release number of selected component (if applicable):
======================
6.0.7

How reproducible:
=================
consistent

Steps to Reproduce:
1. Created a geo-rep session between a 4x3 volume and a 4x(6+2) ec volume.
2. Started the session.
3. Status works fine if we give the master-slave details as part of the command.

Actual results:
===========
Times out if we don't give the master-slave details.

Expected results:
================
Shouldn't time out.
glusterd log output on the node where the command was triggered:

[2019-07-08 08:38:08.229980] I [MSGID: 106061] [glusterd-geo-rep.c:308:__glusterd_handle_gsync_set] 0-management: master not found, while handling geo-replication options
[2019-07-08 08:38:08.230020] I [MSGID: 106061] [glusterd-geo-rep.c:315:__glusterd_handle_gsync_set] 0-management: slave not found, while handling geo-replication options
[2019-07-08 08:38:08.230079] W [MSGID: 106061] [glusterd-geo-rep.c:2388:glusterd_op_gsync_args_get] 0-management: master not found
[2019-07-08 08:38:08.344099] I [MSGID: 106327] [glusterd-geo-rep.c:4653:glusterd_read_status_file] 0-management: Using passed config template(/var/lib/glusterd/geo-replication/nftvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nftvol-slave/gsyncd.conf).
The message "I [MSGID: 106488] [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 2 times between [2019-07-08 08:37:39.774894] and [2019-07-08 08:37:39.782975]

glusterd.log output on the other nodes:

[2019-07-08 08:38:08.907384] E [glusterd-op-sm.c:8204:glusterd_op_sm] (-->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x24d5e) [0x7fc36af1cd5e] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x1d12e) [0x7fc36af1512e] -->/usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so(+0x44604) [0x7fc36af3c604] ) 0-management: Unable to get transaction opinfo for transaction ID :276653aa-a2b4-404a-8a69-e93c6576f20a
Hi,

Do you have a machine available where you are observing this behavior? I tried on a local setup and it is not reproducible.

-Sunny
(In reply to Sunny Kumar from comment #3)
> Hi,
>
> Do you have machine available where you are observing this behavior, I tried
> on local setup its not reproducible.
>
> -Sunny

Yes, I pinged you the details. Sosreports and logs @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1727785/
Update:
This looks like a glusterd bug; adding Sanju to look at it. Thanks to Sanju for helping with the analysis; please take this forward.

[Switching to Thread 0x7fc36cbdc700 (LWP 22324)]

Breakpoint 1, 0x00007fc36af32670 in glusterd_set_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0  0x00007fc36af32670 in glusterd_set_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1  0x00007fc36af14cb1 in __glusterd_handle_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007fc36af1cd5e in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3  0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#4  0x00007fc375452180 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) c
Continuing.

Breakpoint 2, 0x00007fc36af329f0 in glusterd_clear_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0  0x00007fc36af329f0 in glusterd_clear_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1  0x00007fc36af34f29 in glusterd_op_ac_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007fc36af3c514 in glusterd_op_sm () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3  0x00007fc36af14d63 in __glusterd_handle_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4  0x00007fc36af1cd5e in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#5  0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#6  0x00007fc375452180 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) c
Continuing.
bt
^C
Program received signal SIGINT, Interrupt.
[Switching to Thread 0x7fc3773054c0 (LWP 22320)]
0x00007fc375c44017 in pthread_join () from /lib64/libpthread.so.0
(gdb) i b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00007fc36af32670 <glusterd_set_txn_opinfo>
        breakpoint already hit 1 time
2       breakpoint     keep y   0x00007fc36af329f0 <glusterd_clear_txn_opinfo>
        breakpoint already hit 1 time
(gdb) b glusterd_get_txn_opinfo
Breakpoint 3 at 0x7fc36af323c0
(gdb) c
Continuing.
[Switching to Thread 0x7fc36cbdc700 (LWP 22324)]

Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0  0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1  0x00007fc36af14c2a in __glusterd_handle_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007fc36af1cd5e in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3  0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#4  0x00007fc375452180 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) c
Continuing.

Breakpoint 1, 0x00007fc36af32670 in glusterd_set_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0  0x00007fc36af32670 in glusterd_set_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1  0x00007fc36af14cb1 in __glusterd_handle_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007fc36af1cd5e in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3  0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#4  0x00007fc375452180 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0  0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1  0x00007fc36af3c434 in glusterd_op_sm () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007fc36af14d63 in __glusterd_handle_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3  0x00007fc36af1cd5e in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4  0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#5  0x00007fc375452180 in ?? () from /lib64/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) c
Continuing.

Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0  0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1  0x00007fc36af34e02 in glusterd_op_ac_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007fc36af3c514 in glusterd_op_sm () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3  0x00007fc36af14d63 in __glusterd_handle_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4  0x00007fc36af1cd5e in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#5  0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#6  0x00007fc375452180 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Breakpoint 2, 0x00007fc36af329f0 in glusterd_clear_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0  0x00007fc36af329f0 in glusterd_clear_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1  0x00007fc36af34f29 in glusterd_op_ac_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007fc36af3c514 in glusterd_op_sm () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3  0x00007fc36af14d63 in __glusterd_handle_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4  0x00007fc36af1cd5e in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#5  0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#6  0x00007fc375452180 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) c
Continuing.

Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0  0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1  0x00007fc36af32a58 in glusterd_clear_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007fc36af34f29 in glusterd_op_ac_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3  0x00007fc36af3c514 in glusterd_op_sm () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4  0x00007fc36af14d63 in __glusterd_handle_stage_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#5  0x00007fc36af1cd5e in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#6  0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#7  0x00007fc375452180 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) c
Continuing.
[Switching to Thread 0x7fc36c3db700 (LWP 22325)]

Breakpoint 3, 0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
(gdb) bt
#0  0x00007fc36af323c0 in glusterd_get_txn_opinfo () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#1  0x00007fc36af3c434 in glusterd_op_sm () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#2  0x00007fc36af1512e in __glusterd_handle_commit_op () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#3  0x00007fc36af1cd5e in glusterd_big_locked_handler () from /usr/lib64/glusterfs/6.0/xlator/mgmt/glusterd.so
#4  0x00007fc376e42c60 in synctask_wrap () from /lib64/libglusterfs.so.0
#5  0x00007fc375452180 in ?? () from /lib64/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) c
Continuing.

-Sunny
Update:

For no-volname operations, no unlock event is triggered, which is where we normally clean up the txn_opinfo. For such transactions we set the skip_locking flag to true in the staging phase and clean up the txn_opinfo once staging is completed.

This works as expected for "gluster v status", where the opcode is GF_CLI_STATUS_ALL, because we do not send a commit req to the peers, so the txn_opinfo is not needed in the commit phase. In this transaction the peers do nothing in the commit phase; the originator node alone passes the volume list to the CLI, and the CLI then sends a volume status req for each volume separately to glusterd.

But with "gluster v geo-replication status" without a volume name, the txn_opinfo is cleaned up after the staging phase because skip_locking is set to true, and glusterd then tries to get the already-cleared txn_opinfo in the commit phase.

Below is the bt for no-volname operations:

(gdb) bt
#0  glusterd_get_gsync_status_all (rsp_dict=0x7f927c016018, node=0x7f9284225c50 "server2") at glusterd-geo-rep.c:5084
#1  0x00007f928bd4d6a8 in glusterd_get_gsync_status (dict=0x7f927c019df8, op_errstr=0x7f9284226f20, rsp_dict=0x7f927c016018) at glusterd-geo-rep.c:5135
#2  0x00007f928bd4fa5f in glusterd_op_gsync_set (dict=0x7f927c019df8, op_errstr=0x7f9284226f20, rsp_dict=0x7f927c016018) at glusterd-geo-rep.c:5691
#3  0x00007f928bcc9240 in glusterd_op_commit_perform (op=GD_OP_GSYNC_SET, dict=0x7f927c019df8, op_errstr=0x7f9284226f20, rsp_dict=0x7f927c016018) at glusterd-op-sm.c:5924
#4  0x00007f928bcc86fc in glusterd_op_ac_commit_op (event=0x7f927c0159a0, ctx=0x7f927c013210) at glusterd-op-sm.c:5625
#5  0x00007f928bccdc37 in glusterd_op_sm () at glusterd-op-sm.c:7857
#6  0x00007f928bca6b56 in __glusterd_handle_commit_op (req=req@entry=0x7f928401ae18) at glusterd-handler.c:999
#7  0x00007f928bcaa2ed in glusterd_big_locked_handler (req=0x7f928401ae18, actor_fn=0x7f928bca6960 <__glusterd_handle_commit_op>) at glusterd-handler.c:79
#8  0x00007f929833a46a in synctask_wrap () at syncop.c:272
#9  0x00007f92968a3bf0 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()

glusterd_get_gsync_status_all() in turn calls glusterd_get_gsync_status_mst() for all the volumes. A snippet from glusterd_get_gsync_status_all():

    cds_list_for_each_entry(volinfo, &priv->volumes, vol_list)
    {
        ret = glusterd_get_gsync_status_mst(volinfo, rsp_dict, node);
        if (ret)
            goto out;
    }

So we need this txn_opinfo in the commit phase for geo-rep status. The fix would be to not clear the txn_opinfo after the staging phase but to clean it up after the commit phase for geo-rep operations.

On the originator, it will be cleared in gd_sync_task_begin():

    if (txn_id) {
        if (global)
            (void)gd_unlock_op_phase(conf, op, &op_ret, req, op_ctx, op_errstr,
                                     global, is_acquired, *txn_id, &txn_opinfo,
                                     cluster_lock);
        else
            (void)gd_unlock_op_phase(conf, op, &op_ret, req, op_ctx, op_errstr,
                                     volname, is_acquired, *txn_id, &txn_opinfo,
                                     cluster_lock);

        /* Clearing the transaction opinfo */
        ret = glusterd_clear_txn_opinfo(txn_id);
        if (ret)
            gf_msg(this->name, GF_LOG_ERROR, 0, GD_MSG_TRANS_OPINFO_CLEAR_FAIL,
                   "Unable to clear transaction's "
                   "opinfo for transaction ID : %s",
                   uuid_utoa(*txn_id));
    }

On the receivers, it will be cleared in glusterd_op_ac_commit_op():

    /* for no volname transactions, the txn_opinfo needs to be cleaned up
     * as there's no unlock event triggered
     */
    if (need_cleanup && txn_id && txn_op_info.skip_locking)
        ret = glusterd_clear_txn_opinfo(txn_id);

As per the above details, I expected the below change to make things work, but with no luck.
The command still hangs, and I don't see any error messages in glusterd.log.

[root@server2 glusterfs]# git diff
diff --git a/xlators/mgmt/glusterd/src/glusterd-op-sm.c b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
index 91533489d..8b0b5901d 100644
--- a/xlators/mgmt/glusterd/src/glusterd-op-sm.c
+++ b/xlators/mgmt/glusterd/src/glusterd-op-sm.c
@@ -5537,7 +5537,8 @@ out:
      * txn_opinfo can't be cleared as that'll lead to a race of referring op_ctx
      * after it's being freed.
      */
-    if (txn_op_info.skip_locking && priv->op_version >= GD_OP_VERSION_6_0)
+    if (req_ctx->op != GD_OP_GSYNC_SET && txn_op_info.skip_locking &&
+        priv->op_version >= GD_OP_VERSION_6_0)
         ret = glusterd_clear_txn_opinfo(txn_id);

     if (rsp_dict)
[root@server2 glusterfs]#

I will continue to investigate this and update the BZ.

Thanks,
Sanju
RCA:
Please read comment 15 to understand this comment well.

In the op-sm framework, every stage goes through the glusterd_op_sm() function and sets the txn_opinfo based on the txn_id, except for the unlock phase. If the state of the op machine is GD_OP_STATE_STAGED and skip_locking is true, then we don't set the txn_opinfo, because:
1. skip_locking being true says that it is a no-volname operation.
2. The op machine being in GD_OP_STATE_STAGED says that the staging phase is completed.

No-volname operations such as "gluster v status" do nothing at the peers, so we don't send the commit req to the peers and we clear the txn_opinfo after the staging phase; the txn_opinfo doesn't need to be set in later phases.

A snippet from glusterd_op_sm() which implements the functionality explained above:

    if ((state[event_type].next_state == GD_OP_STATE_DEFAULT) &&
        (event_type == GD_OP_EVENT_UNLOCK)) {
        /* Clearing the transaction opinfo */
        ret = glusterd_clear_txn_opinfo(&event->txn_id);
        if (ret)
            gf_msg(this->name, GF_LOG_ERROR, 0,
                   GD_MSG_TRANS_OPINFO_CLEAR_FAIL,
                   "Unable to clear "
                   "transaction's opinfo");
    } else {
        if ((priv->op_version < GD_OP_VERSION_6_0) ||
            !(event_type == GD_OP_EVENT_STAGE_OP &&
              opinfo.state.state == GD_OP_STATE_STAGED &&
              opinfo.skip_locking)) {
            ret = glusterd_set_txn_opinfo(&event->txn_id, &opinfo);
            if (ret)
                gf_msg(this->name, GF_LOG_ERROR, 0,
                       GD_MSG_TRANS_OPINFO_SET_FAIL,
                       "Unable to set "
                       "transaction's opinfo");
        }
    }

But for a geo-rep operation without a volname we still have to send the commit req to the peers, and we also need to set the txn_opinfo in the commit stage. To fix this issue, we can mark skip_locking as true only when it is a no-volname operation and it is not a geo-rep operation, so that the txn_opinfo is also set in the commit stage.

Thanks,
Sanju
upstream patch: https://review.gluster.org/#/c/glusterfs/+/23034
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3249