Bug 1442983

Summary: Unable to acquire lock for gluster volume leading to 'another transaction in progress' error
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: medium
Version: rhgs-3.2
CC: amukherj, bkunal, ccalhoun, rhinduja, rhs-bugs, sasundar, sheggodu, storage-qa-internal, timo.kramer_ext, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Cause: TBD
Consequence:
Workaround (if any):
Result:
Story Points: ---
Clone Of:
: 1526372 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:32:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1503134, 1526372

Attachments:
gluster logs from one of the node (flags: none)
glusterd statedump from one of the node (flags: none)

Description SATHEESARAN 2017-04-18 08:44:47 UTC
Description of problem:
-----------------------
Unable to set a new volume option on the volume; the operation fails with the error "Another transaction is in progress".

This volume is managed by RHV 4.1 (RC), which also periodically triggers 'volume status' and similar queries, obtaining the information from an arbitrarily chosen node in the cluster.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2.0 ( glusterfs-3.8.4-18.el7rhgs )
RHV-H 4.1
RHV 4.1 ( RC )

How reproducible:
-----------------
Hit it once; have not tried to reproduce it.

Steps to Reproduce:
-------------------
0. Turned on SSL/TLS encryption for both the RHGS data and management paths.
1. Created a volume from the RHV UI and enabled encryption.
2. Created a storage domain on this volume and created VMs.

Actual results:
----------------
Observed the 'Another transaction is in progress' error when trying to set an option on the volume.
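
For context, the failing operation is an ordinary volume set. A hedged illustration of what this looks like on the CLI (the option shown is only an example, not necessarily the one that was attempted, and the failure text is approximate):

# gluster volume set data network.ping-timeout 30
volume set: failed: Another transaction is in progress for data. Please try again after sometime.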

Expected results:
-----------------
No stale lock should be left held that blocks volume set operations.

Additional info:
----------------
From the logs, the lock had already been held for the last 4 days.

Comment 1 SATHEESARAN 2017-04-18 08:45:54 UTC
This is the exact error message in the glusterd logs:

<snip>
[2017-04-15 22:14:52.186079] W [glusterd-locks.c:572:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xcfb30) [0x7ff4ccf23b30] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xcfa60) [0x7ff4ccf23a60] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd4d6f) [0x7ff4ccf28d6f] ) 0-management: Lock for data held by e45f76c0-89e4-4601-bb41-ba3110a15681
[2017-04-15 22:14:52.186111] E [MSGID: 106119] [glusterd-syncop.c:1851:gd_sync_task_begin] 0-management: Unable to acquire lock for data
</snip>

The volume name is 'data' and it is of type 'replica'.

Comment 2 SATHEESARAN 2017-04-18 08:56:24 UTC
Created attachment 1272257 [details]
gluster logs from one of the node

Comment 3 SATHEESARAN 2017-04-18 08:57:45 UTC
Created attachment 1272258 [details]
glusterd statedump from one of the node

Comment 4 Atin Mukherjee 2017-04-18 09:23:14 UTC
glusterd.mgmt_v3_lock=
        debug.last-success-bt-data-vol:(--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd496c)[0x7ff4ccf2896c] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2e195)[0x7ff4cce82195] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x3cd1f)[0x7ff4cce90d1f] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xf801d)[0x7ff4ccf4c01d] (--> /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x20540)[0x7ff4cce74540] )))))
        data_vol:e45f76c0-89e4-4601-bb41-ba3110a15681

The stale lock is on volume "data".

From the backtrace of the lock:

(gdb) info symbol 0x7ff4ccf2896c
glusterd_mgmt_v3_lock + 492 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4cce82195
glusterd_op_ac_lock + 149 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4cce90d1f
glusterd_op_sm + 671 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4ccf4c01d
glusterd_handle_mgmt_v3_lock_fn + 1245 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
(gdb) info symbol 0x7ff4cce74540
glusterd_big_locked_handler + 48 in section .text of /usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
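
For reference, one hedged way to reproduce this kind of symbol resolution is to attach gdb briefly to the running glusterd (so the shared objects are mapped at the same addresses as in the statedump backtrace) and query each address; assuming a single glusterd process:

# gdb --batch -p $(pidof glusterd) -ex 'info symbol 0x7ff4ccf2896c'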

Comment 5 Atin Mukherjee 2017-04-18 10:10:26 UTC
So consecutive 'gluster volume profile' and 'gluster volume status' transactions collided on one node, resulting in two op-sm transactions running in the same state machine, which can leave a stale lock behind. This is explained in detail at https://bugzilla.redhat.com/show_bug.cgi?id=1425681#c4 .

Comment 6 Atin Mukherjee 2017-04-18 10:12:44 UTC
The only way to fix this is to port the volume profile command to the mgmt_v3 framework. Whether that effort is worthwhile at this stage, with GD2 under active development, is what we would need to assess.

Comment 7 SATHEESARAN 2017-04-19 03:10:35 UTC
I have tried the workaround suggested by Atin and the stale lock was released. The steps are below; a consolidated sketch of all three steps follows after step 3.

1. Disabled server quorum on all the volumes

# gluster volume set <vol> server-quorum-type none

2. Restarted glusterd on all the nodes using gdeploy, with the following configuration:
[hosts]
host1
host2
host3
host4
host5
host6

[service]
action=restart
service=glusterd

Note: Restarting glusterd on all the nodes is required.

3. Set server-quorum back to 'server' on all the volumes
# gluster volume set <vol> server-quorum-type server
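
As mentioned above, here is a consolidated, hedged sketch of the same workaround run from a single node (volume names are taken from 'gluster volume list'; 'restart-glusterd.conf' is a hypothetical filename for the gdeploy configuration shown above, and glusterd still has to be restarted on every node):

# for vol in $(gluster volume list); do gluster volume set $vol server-quorum-type none; done
# gdeploy -c restart-glusterd.conf
# for vol in $(gluster volume list); do gluster volume set $vol server-quorum-type server; done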

Comment 10 Cal Calhoun 2017-07-04 15:35:31 UTC
@atin,

I have a case, 01874385, which seems to be presenting with very similar errors.

glusterd.log:
[2017-06-28 20:56:22.549828] E [MSGID: 106119] [glusterd-syncop.c:1851:gd_sync_task_begin] 0-management: Unable to acquire lock for ACL_VEEAM_BCK_VOL1

and associated:

cmd_history.log:
[2017-06-28 20:56:22.549842]  : volume status all tasks : FAILED : Another transaction is in progress for ACL_VEEAM_BCK_VOL1. Please try again after sometime.

These occur at almost exactly a 1:1 ratio. Can I get an opinion on whether this is the same issue? What additional information can I provide to help make that determination?
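
For reference, a rough way to check the correlation is simply to count the matching lines in both logs (default log locations assumed; older builds may name the glusterd log etc-glusterfs-glusterd.vol.log):

# grep -c 'Unable to acquire lock for ACL_VEEAM_BCK_VOL1' /var/log/glusterfs/glusterd.log
# grep -c 'Another transaction is in progress for ACL_VEEAM_BCK_VOL1' /var/log/glusterfs/cmd_history.log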

Comment 11 Atin Mukherjee 2017-07-05 04:23:30 UTC
Please get me the cmd_history and glusterd logs from all the nodes, along with a glusterd statedump taken on the node where the locking failed.
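
For reference, a hedged sketch of how these are typically collected (default paths assumed; they may differ per installation): the logs live under /var/log/glusterfs/ as cmd_history.log and glusterd.log (older builds may name the latter etc-glusterfs-glusterd.vol.log), and a glusterd statedump can be triggered by sending SIGUSR1 to the glusterd process, after which the dump should appear under /var/run/gluster/:

# kill -SIGUSR1 $(pidof glusterd)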

Comment 12 Cal Calhoun 2017-07-05 15:28:17 UTC
@Atin, I've requested new logs and the statedump info from the customer.  I will attach them when they come in.

Comment 21 Timo 2017-08-08 11:37:32 UTC
Hi,

I have a similar problem with glusterfs 3.4.5 on Red Hat 6. If I should provide some logs, please tell me.

gluster volume status all
Another transaction could be in progress. Please try again after sometime.

[2017-08-08 11:35:57.139221] E [glusterd-utils.c:332:glusterd_lock] 0-management: Unable to get lock for uuid: a813ad42-bf64-4b3b-ae24-59883671a8e8, lock held by: a813ad42-bf64-4b3b-ae24-59883671a8e8
[2017-08-08 11:35:57.139272] E [glusterd-op-sm.c:5445:glusterd_op_sm] 0-management: handler returned: -1
[2017-08-08 11:35:57.139920] E [glusterd-syncop.c:715:gd_lock_op_phase] 0-management: Failed to acquire lock
[2017-08-08 11:35:57.140762] E [glusterd-utils.c:365:glusterd_unlock] 0-management: Cluster lock not held!

Comment 22 Timo 2017-08-16 12:36:19 UTC
On all the servers in the cluster, the server itself was listed in its own peers file. This was the problem in my system; a simple mistake that took me quite a while to figure out.
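
In case it helps anyone hitting the same self-peering mistake, a hedged way to check for it (default glusterd paths assumed):

# grep UUID /var/lib/glusterd/glusterd.info
# ls /var/lib/glusterd/peers/

Each file under peers/ is named after a peer's UUID; none of them should match the local UUID from glusterd.info, and 'gluster peer status' should not list the node itself.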

Comment 31 SATHEESARAN 2018-05-16 11:30:44 UTC
Tested with RHV 4.2 and glusterfs-3.12.

1. Added RHGS nodes to the cluster.
2. Ran repeated 'gluster volume status' queries

There were no 'Another transaction in progress' errors.

Comment 32 errata-xmlrpc 2018-09-04 06:32:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607