Description of problem:
-----------------------
Unable to set a new volume option on the volume; the operation fails with an "Another transaction in progress" error. This volume is managed by RHV 4.1 (RC), which also periodically runs 'volume status' and similar commands, obtaining the information from an arbitrarily chosen node in the cluster.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2.0 (glusterfs-3.8.4-18.el7rhgs)
RHV-H 4.1
RHV 4.1 (RC)

How reproducible:
-----------------
Hit it once; have not tried to reproduce.

Steps to Reproduce:
-------------------
0. Turn on SSL/TLS encryption on the RHGS data and management paths.
1. Create a volume from the RHV UI and enable encryption.
2. Create a storage domain with this volume and create VMs.

Actual results:
---------------
Observed 'Another transaction in progress' when trying to set an option on the volume (an illustrative CLI failure is sketched after this comment).

Expected results:
-----------------
No stale lock should remain held that prevents a volume set operation.

Additional info:
----------------
From the logs, the lock has been held for the last 4 days.
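For illustration, the failing operation at the CLI looks along these lines. The option name here is only an assumed example; the actual option being set is not recorded in this report, and the error text is the one seen later in cmd_history.

# Example of the failing operation (option name is only an illustration):
gluster volume set data network.ping-timeout 30
# Fails with: "Another transaction is in progress for data. Please try again after sometime."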
This is the exact error message in the glusterd logs:

<snip>
[2017-04-15 22:14:52.186079] W [glusterd-locks.c:572:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xcfb30) [0x7ff4ccf23b30] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xcfa60) [0x7ff4ccf23a60] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd4d6f) [0x7ff4ccf28d6f] ) 0-management: Lock for data held by e45f76c0-89e4-4601-bb41-ba3110a15681
[2017-04-15 22:14:52.186111] E [MSGID: 106119] [glusterd-syncop.c:1851:gd_sync_task_begin] 0-management: Unable to acquire lock for data
</snip>

The volume name is 'data' and it is of type 'replica'.
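To map the lock-holder UUID from the log line to a peer, something like the following can be used; this is a sketch, assuming the gluster CLI is available on any node of the trusted pool:

# Show UUID, hostname and state of every peer, and pick out the lock holder.
gluster pool list | grep e45f76c0-89e4-4601-bb41-ba3110a15681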
Created attachment 1272257 [details]
gluster logs from one of the nodes
Created attachment 1272258 [details]
glusterd statedump from one of the nodes
glusterd.mgmt_v3_lock=
debug.last-success-bt-data-vol:(--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd496c)[0x7ff4ccf2896c] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2e195)[0x7ff4cce82195] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x3cd1f)[0x7ff4cce90d1f] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xf801d)[0x7ff4ccf4c01d] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x20540)[0x7ff4cce74540] )))))
data_vol:e45f76c0-89e4-4601-bb41-ba3110a15681

The stale lock is on volume "data". From the backtrace of the lock:

(gdb) info symbol 0x7ff4ccf2896c
glusterd_mgmt_v3_lock + 492 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4cce82195
glusterd_op_ac_lock + 149 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4cce90d1f
glusterd_op_sm + 671 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4ccf4c01d
glusterd_handle_mgmt_v3_lock_fn + 1245 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4cce74540
glusterd_big_locked_handler + 48 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
Consecutive 'gluster volume profile' and 'gluster volume status' transactions collided on one node, resulting in two op-sm transactions running through the same state machine, which can end up leaving a stale lock. This is explained in detail at https://bugzilla.redhat.com/show_bug.cgi?id=1425681#c4 . A rough sketch of the colliding commands is shown below.
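This is only an illustration of the concurrent transactions described above, not the exact reproducer; it assumes a replica volume named 'data' and two nodes from which the commands are issued at the same time:

# Node A: periodic status queries (what RHV does on the managed volume).
while true; do gluster volume status data; sleep 2; done

# Node B, concurrently: profile transactions against the same volume.
gluster volume profile data start
gluster volume profile data info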
The only way to fix this is to port the volume profile command to mgmt_v3. Whether that is worth the effort at this stage, with GD2 under active development, is something we would need to assess.
I have tried the workaround suggested by Atin and the stale lock was released.

1. Reset server quorum on all the volumes (a shell loop covering all volumes is sketched after these steps):
   # gluster volume set <vol> server-quorum-type none

2. Restart glusterd on all the nodes using gdeploy:

   [hosts]
   host1
   host2
   host3
   host4
   host5
   host6

   [service]
   action=restart
   service=glusterd

   Note: Restarting glusterd on all the nodes is required.

3. Set server quorum back on all the volumes:
   # gluster volume set <vol> server-quorum-type server
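For clusters with many volumes, the quorum changes in steps 1 and 3 can be applied with a small loop. This is a minimal sketch, assuming the gluster CLI is run from any one node of the trusted pool; replace 'none' with 'server' when re-enabling quorum in step 3.

# Apply the server-quorum change to every volume in the pool.
for vol in $(gluster volume list); do
    gluster volume set "$vol" server-quorum-type none
done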
@atin, I have a case, 01874385, which seems to be presenting very similar errors.

glusterd.log:
[2017-06-28 20:56:22.549828] E [MSGID: 106119] [glusterd-syncop.c:1851:gd_sync_task_begin] 0-management: Unable to acquire lock for ACL_VEEAM_BCK_VOL1

and the associated cmd_history.log:
[2017-06-28 20:56:22.549842] : volume status all tasks : FAILED : Another transaction is in progress for ACL_VEEAM_BCK_VOL1. Please try again after sometime.

These occur at almost exactly a 1:1 ratio. Can I get an opinion on whether this is the same issue? What additional information can I provide to help make that determination?
Get me the cmd_history and glusterd logs from all the nodes, along with a glusterd statedump taken on the node where the locking has failed.
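The requested data can be collected along these lines; a sketch assuming the default log directory /var/log/glusterfs and that glusterd writes its statedump to /var/run/gluster when signalled:

# On every node: bundle the glusterd and command-history logs.
tar czf gluster-logs-$(hostname).tar.gz \
    /var/log/glusterfs/glusterd.log /var/log/glusterfs/cmd_history.log

# On the node where the lock acquisition failed: trigger a glusterd statedump.
kill -SIGUSR1 $(pidof glusterd)
ls -lt /var/run/gluster/        # the newest glusterdump.*.dump.* file is the statedump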
@Atin, I've requested new logs and the statedump info from the customer. I will attach them when they come in.
Hi, I have a similar problem with glusterfs 3.4.5 on Red Hat 6. If I should provide some logs, please tell me.

gluster volume status all
Another transaction could be in progress. Please try again after sometime.

[2017-08-08 11:35:57.139221] E [glusterd-utils.c:332:glusterd_lock] 0-management: Unable to get lock for uuid: a813ad42-bf64-4b3b-ae24-59883671a8e8, lock held by: a813ad42-bf64-4b3b-ae24-59883671a8e8
[2017-08-08 11:35:57.139272] E [glusterd-op-sm.c:5445:glusterd_op_sm] 0-management: handler returned: -1
[2017-08-08 11:35:57.139920] E [glusterd-syncop.c:715:gd_lock_op_phase] 0-management: Failed to acquire lock
[2017-08-08 11:35:57.140762] E [glusterd-utils.c:365:glusterd_unlock] 0-management: Cluster lock not held!
On all the servers in the cluster I had the server itself in its own peers file; this was the problem in my system. A simple mistake... it took me quite a while to figure out. A quick check for this misconfiguration is sketched below.
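This is a sketch of how to verify that a node is not listed as its own peer, assuming the default glusterd working directory /var/lib/glusterd:

# The node's own UUID is in glusterd.info and must NOT appear under peers/.
MY_UUID=$(awk -F= '/^UUID/ {print $2}' /var/lib/glusterd/glusterd.info)
grep -rl "$MY_UUID" /var/lib/glusterd/peers/ && \
    echo "WARNING: this node is listed as one of its own peers"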
Tested with RHV 4.2 and glusterfs-3.12.

1. Added RHGS nodes to the cluster.
2. Ran repeated 'gluster volume status' queries (along the lines of the sketch below).

There are no 'Another transaction in progress' errors.
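The repeated queries were driven along these lines; this is only a sketch of the kind of polling used, with an assumed interval:

# Poll volume status repeatedly while other volume operations run.
while true; do
    gluster volume status all
    sleep 5
done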
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607