Description of problem:
-----------------------
Unable to set a new volume option on the volume; the operation fails with an "Another transaction in progress" error. This volume is managed by RHV 4.1 (RC), which also periodically runs 'volume status' and similar commands, obtaining the information from an arbitrarily chosen node in the cluster.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2.0 (glusterfs-3.8.4-18.el7rhgs)
RHV-H 4.1
RHV 4.1 (RC)

How reproducible:
-----------------
Hit it once; have not tried to reproduce.

Steps to Reproduce:
-------------------
0. Turn on SSL/TLS encryption on the RHGS data and management paths.
1. Create a volume from the RHV UI and enable encryption.
2. Create a storage domain with this volume and create VMs.

Actual results:
---------------
Observed 'Another transaction in progress' when trying to set an option on the volume (an illustrative CLI failure is sketched after this comment).

Expected results:
-----------------
No stale lock should remain held that prevents a volume set operation.

Additional info:
----------------
From the logs, the lock has been held for the last 4 days.
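For illustration, the failing operation at the CLI looks along these lines. The option name here is only an assumed example; the actual option being set is not recorded in this report, and the error text is the one seen later in cmd_history.

# Example of the failing operation (option name is only an illustration):
gluster volume set data network.ping-timeout 30
# Fails with: "Another transaction is in progress for data. Please try again after sometime."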
This is the exact error message in the glusterd logs:

<snip>
[2017-04-15 22:14:52.186079] W [glusterd-locks.c:572:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xcfb30) [0x7ff4ccf23b30] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xcfa60) [0x7ff4ccf23a60] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd4d6f) [0x7ff4ccf28d6f] ) 0-management: Lock for data held by e45f76c0-89e4-4601-bb41-ba3110a15681
[2017-04-15 22:14:52.186111] E [MSGID: 106119] [glusterd-syncop.c:1851:gd_sync_task_begin] 0-management: Unable to acquire lock for data
</snip>

The volume name is 'data' and it is of type 'replica'.
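To map the lock-holder UUID from the log line to a peer, something like the following can be used; this is a sketch, assuming the gluster CLI is available on any node of the trusted pool:

# Show UUID, hostname and state of every peer, and pick out the lock holder.
gluster pool list | grep e45f76c0-89e4-4601-bb41-ba3110a15681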
Created attachment 1272257 [details]
gluster logs from one of the nodes
Created attachment 1272258 [details]
glusterd statedump from one of the nodes
glusterd.mgmt_v3_lock=
debug.last-success-bt-data-vol:(--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd496c)[0x7ff4ccf2896c] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2e195)[0x7ff4cce82195] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x3cd1f)[0x7ff4cce90d1f] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xf801d)[0x7ff4ccf4c01d] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x20540)[0x7ff4cce74540] )))))
data_vol:e45f76c0-89e4-4601-bb41-ba3110a15681

The stale lock is on volume "data". From the backtrace of the lock:

(gdb) info symbol 0x7ff4ccf2896c
glusterd_mgmt_v3_lock + 492 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4cce82195
glusterd_op_ac_lock + 149 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4cce90d1f
glusterd_op_sm + 671 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4ccf4c01d
glusterd_handle_mgmt_v3_lock_fn + 1245 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4cce74540
glusterd_big_locked_handler + 48 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
Consecutive 'gluster volume profile' and 'gluster volume status' transactions collided on one node, resulting in two op-sm transactions running through the same state machine, which can end up leaving a stale lock. This is explained in detail at https://bugzilla.redhat.com/show_bug.cgi?id=1425681#c4 . A rough sketch of the colliding commands is shown below.
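This is only an illustration of the concurrent transactions described above, not the exact reproducer; it assumes a replica volume named 'data' and two nodes from which the commands are issued at the same time:

# Node A: periodic status queries (what RHV does on the managed volume).
while true; do gluster volume status data; sleep 2; done

# Node B, concurrently: profile transactions against the same volume.
gluster volume profile data start
gluster volume profile data info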
The only way to fix this is to port the volume profile command to mgmt_v3. Whether that is worth the effort at this stage, with GD2 under active development, is something we would need to assess.
I have tried the workaround suggested by Atin and the stale lock was released.

1. Reset server quorum on all the volumes (a shell loop covering all volumes is sketched after these steps):
   # gluster volume set <vol> server-quorum-type none

2. Restart glusterd on all the nodes using gdeploy:

   [hosts]
   host1
   host2
   host3
   host4
   host5
   host6

   [service]
   action=restart
   service=glusterd

   Note: Restarting glusterd on all the nodes is required.

3. Set server quorum back on all the volumes:
   # gluster volume set <vol> server-quorum-type server
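For clusters with many volumes, the quorum changes in steps 1 and 3 can be applied with a small loop. This is a minimal sketch, assuming the gluster CLI is run from any one node of the trusted pool; replace 'none' with 'server' when re-enabling quorum in step 3.

# Apply the server-quorum change to every volume in the pool.
for vol in $(gluster volume list); do
    gluster volume set "$vol" server-quorum-type none
done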
@atin, I have a case, 01874385, which seems to be presenting very similar errors.

glusterd.log:
[2017-06-28 20:56:22.549828] E [MSGID: 106119] [glusterd-syncop.c:1851:gd_sync_task_begin] 0-management: Unable to acquire lock for ACL_VEEAM_BCK_VOL1

and the associated cmd_history.log:
[2017-06-28 20:56:22.549842] : volume status all tasks : FAILED : Another transaction is in progress for ACL_VEEAM_BCK_VOL1. Please try again after sometime.

These occur at almost exactly a 1:1 ratio. Can I get an opinion on whether this is the same issue? What additional information can I provide to help make that determination?
Get me the cmd_history and glusterd logs from all the nodes, along with a glusterd statedump taken on the node where the locking has failed.
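The requested data can be collected along these lines; a sketch assuming the default log directory /var/log/glusterfs and that glusterd writes its statedump to /var/run/gluster when signalled:

# On every node: bundle the glusterd and command-history logs.
tar czf gluster-logs-$(hostname).tar.gz \
    /var/log/glusterfs/glusterd.log /var/log/glusterfs/cmd_history.log

# On the node where the lock acquisition failed: trigger a glusterd statedump.
kill -SIGUSR1 $(pidof glusterd)
ls -lt /var/run/gluster/        # the newest glusterdump.*.dump.* file is the statedump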
@Atin, I've requested new logs and the statedump info from the customer. I will attach them when they come in.
Hi, I have a similar problem with glusterfs 3.4.5 on Red Hat 6. If I should provide some logs, please tell me.

gluster volume status all
Another transaction could be in progress. Please try again after sometime.

[2017-08-08 11:35:57.139221] E [glusterd-utils.c:332:glusterd_lock] 0-management: Unable to get lock for uuid: a813ad42-bf64-4b3b-ae24-59883671a8e8, lock held by: a813ad42-bf64-4b3b-ae24-59883671a8e8
[2017-08-08 11:35:57.139272] E [glusterd-op-sm.c:5445:glusterd_op_sm] 0-management: handler returned: -1
[2017-08-08 11:35:57.139920] E [glusterd-syncop.c:715:gd_lock_op_phase] 0-management: Failed to acquire lock
[2017-08-08 11:35:57.140762] E [glusterd-utils.c:365:glusterd_unlock] 0-management: Cluster lock not held!
On all the servers in the cluster I had the server itself in its own peers file; this was the problem in my system. A simple mistake... it took me quite a while to figure out. A quick check for this misconfiguration is sketched below.
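This is a sketch of how to verify that a node is not listed as its own peer, assuming the default glusterd working directory /var/lib/glusterd:

# The node's own UUID is in glusterd.info and must NOT appear under peers/.
MY_UUID=$(awk -F= '/^UUID/ {print $2}' /var/lib/glusterd/glusterd.info)
grep -rl "$MY_UUID" /var/lib/glusterd/peers/ && \
    echo "WARNING: this node is listed as one of its own peers"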
Tested with RHV 4.2 and glusterfs-3.12.

1. Added RHGS nodes to the cluster.
2. Ran repeated 'gluster volume status' queries (along the lines of the sketch below).

There are no 'Another transaction in progress' errors.
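The repeated queries were driven along these lines; this is only a sketch of the kind of polling used, with an assumed interval:

# Poll volume status repeatedly while other volume operations run.
while true; do
    gluster volume status all
    sleep 5
done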
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607