Bug 1425681 - [Glusterd] Volume operations fail on a (tiered) volume because of a stale lock held by one of the nodes
Summary: [Glusterd] Volume operations fail on a (tiered) volume because of a stale loc...
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Atin Mukherjee
QA Contact: Bala Konda Reddy M
Duplicates: 1412907 (view as bug list)
Depends On: 1499004 1503239 1557304
Blocks: 1351530 RHGS-3.4-GSS-proposed-tracker 1503135
TreeView+ depends on / blocked
Reported: 2017-02-22 05:40 UTC by Sweta Anandpara
Modified: 2018-09-04 06:33 UTC (History)
13 users

Fixed In Version: glusterfs-3.12.2-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1499004 (view as bug list)
Last Closed: 2018-09-04 06:32:03 UTC
Target Upstream Version:


System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1442983 None CLOSED Unable to acquire lock for gluster volume leading to 'another transaction in progress' error 2019-10-22 14:18:41 UTC
Red Hat Knowledge Base (Solution) 3208541 None None None 2017-11-27 15:37:59 UTC
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:33:46 UTC

Internal Links: 1442983

Description Sweta Anandpara 2017-02-22 05:40:44 UTC
Description of problem:
Had a tiered volume with a 2*(4+2) cold tier and a plain distribute 1*4 hot tier in a 6-node cluster. Had I/O taking place from fuse as well as nfs mounts, (of course) in different directories. During watermark testing on the volume, I reduced the values of the low and high watermarks, which resulted in the data percentage of the hot tier exceeding the high watermark - which should result in demotions (only).

Was monitoring the demotions taking place via the command 'gluster volume tier <volname> status'. After a while, the said command started failing with 'Another transaction is in progress for <volname>. Please try again after sometime', and it has been stuck in that state for a day now. The glusterd logs complain of 'another lock being held by <uuid>'.

(Do not think it is related, but FYI:)
While monitoring the demotions with 'gluster volume tier <volname> status' and waiting for them to complete, I created a new 2*2 dist-rep volume and set 'nfs.disable' to 'off'.
Soon after that, when I repeated the 'tier status' command, it started failing with '...another transaction is in progress...'.

Restarting glusterd (as advised by Atin) on the node that held the lock seems to have brought the volume back to normal.
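The workaround above hinges on identifying which node holds the stale lock. A minimal sketch of pulling the lock holder's UUID out of a glusterd.log error line (the log line below is illustrative only; the exact wording varies by version):

```shell
# Hypothetical glusterd.log line (assumption: real format/wording may differ):
logline='[glusterd-locks.c] E ... Lock for vol ozone held by 838465bf-1fd8-4f85-8599-dbc8367539aa'
# Extract the lock holder's UUID; it can then be matched against the Uuid
# fields in 'gluster peer status' to find which node's glusterd to restart.
uuid=$(printf '%s\n' "$logline" | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}')
echo "$uuid"
```

Once the holding node is known, restarting glusterd on that node (as done here) releases the stale lock.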

Sosreports at : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):

How reproducible:

Additional info:

[root@dhcp46-221 ~]# 
[root@dhcp46-221 ~]# gluster peer stauts
unrecognized word: stauts (position 1)
[root@dhcp46-221 ~]# gluster peer status
Number of Peers: 5

Hostname: dhcp46-242.lab.eng.blr.redhat.com
Uuid: 838465bf-1fd8-4f85-8599-dbc8367539aa
State: Peer in Cluster (Connected)
Other names:

Uuid: b9af0965-ffe7-4827-b610-2380a8fa810b
State: Peer in Cluster (Connected)

Uuid: 5bff39d7-cd9c-4dbb-86eb-2a7ba6dfea3d
State: Peer in Cluster (Connected)

Uuid: c2fbc432-b7a9-4db1-9b9d-a8d82e998923
State: Peer in Cluster (Connected)

Uuid: 81184471-cbf7-47aa-ba41-21f32bb644b0
State: Peer in Cluster (Connected)
[root@dhcp46-221 ~]# vim /var/log/glusterfs/glusterd.log
[root@dhcp46-221 ~]# gluster v status
Another transaction is in progress for ozone. Please try again after sometime.
Status of volume: vola
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick    49152     0          Y       5259 
Brick    49152     0          Y       20012
Brick    49153     0          Y       21512
Brick    49155     0          Y       28705
NFS Server on localhost                     2049      0          Y       31911
Self-heal Daemon on localhost               N/A       N/A        Y       31743
NFS Server on dhcp46-242.lab.eng.blr.redhat
.com                                        2049      0          Y       21788
Self-heal Daemon on dhcp46-242.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       21563
NFS Server on                  2049      0          Y       5699 
Self-heal Daemon on            N/A       N/A        Y       5291 
NFS Server on                  2049      0          Y       28899
Self-heal Daemon on            N/A       N/A        Y       28759
NFS Server on                  2049      0          Y       20201
Self-heal Daemon on            N/A       N/A        Y       20061
NFS Server on                  2049      0          Y       1784 
Self-heal Daemon on            N/A       N/A        Y       1588 
Task Status of Volume vola
There are no active volume tasks
[root@dhcp46-221 ~]# 
[root@dhcp46-221 ~]# rpm -qa | grep gluster

[root@dhcp46-221 ~]# 
[root@dhcp46-221 ~]# 

########## after glusterd restart   ################

[root@dhcp46-221 ~]# 
[root@dhcp46-221 ~]# gluster v status ozone
Status of volume: ozone
Gluster process                             TCP Port  RDMA Port  Online  Pid
Hot Bricks:
r3                                          49152     0          Y       18409
r2                                          49152     0          Y       16208
r1                                          49152     0          Y       1655 
r0                                          49152     0          Y       23869
Cold Bricks:
Brick   49156     0          Y       18567
Brick   49156     0          Y       21626
Brick   49156     0          Y       10841
Brick   49153     0          Y       27354
Brick   49154     0          Y       2139 
Brick   49154     0          Y       4378 
Brick   49157     0          Y       18587
Brick   49157     0          Y       21646
Brick   49157     0          Y       10861
Brick   49154     0          Y       27353
Brick  49155     0          Y       2159 
Brick  49155     0          Y       4398 
NFS Server on localhost                     2049      0          Y       5622 
Self-heal Daemon on localhost               N/A       N/A        Y       5630 
Quota Daemon on localhost                   N/A       N/A        Y       5639 
NFS Server on                  2049      0          Y       15129
Self-heal Daemon on            N/A       N/A        Y       15152
Quota Daemon on                N/A       N/A        Y       15189
NFS Server on                  2049      0          Y       25626
Self-heal Daemon on            N/A       N/A        Y       25647
Quota Daemon on                N/A       N/A        Y       25657
NFS Server on dhcp46-242.lab.eng.blr.redhat
.com                                        2049      0          Y       20513
Self-heal Daemon on dhcp46-242.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       20540
Quota Daemon on dhcp46-242.lab.eng.blr.redh
at.com                                      N/A       N/A        Y       20565
NFS Server on                  2049      0          Y       6509 
Self-heal Daemon on            N/A       N/A        Y       6532 
Quota Daemon on                N/A       N/A        Y       6549 
NFS Server on                  2049      0          Y       11094
Self-heal Daemon on            N/A       N/A        Y       11120
Quota Daemon on                N/A       N/A        Y       11143
Task Status of Volume ozone
Task                 : Tier migration      
ID                   : 19fb4787-d9de-4436-8f15-86ff39fbc7bb
Status               : in progress         
[root@dhcp46-221 ~]# gluster v tier ozone status
Node                 Promoted files       Demoted files        Status              
---------            ---------            ---------            ---------           
localhost            0                    2033                 in progress         
dhcp46-242.lab.eng.blr.redhat.com 0                    2025                 in progress
                     14                   0                    in progress
                     0                    0                    in progress
                     0                    2238                 in progress
                     0                    2167                 in progress
Tiering Migration Functionality: ozone: success
[root@dhcp46-221 ~]#

Comment 2 Sweta Anandpara 2017-02-22 05:46:42 UTC
[qe@rhsqe-repo 1425681]$ 
[qe@rhsqe-repo 1425681]$ pwd
[qe@rhsqe-repo 1425681]$ 
[qe@rhsqe-repo 1425681]$ 
[qe@rhsqe-repo 1425681]$ hostname
[qe@rhsqe-repo 1425681]$ 
[qe@rhsqe-repo 1425681]$ ll
total 335724
-rwxr-xr-x. 1 qe qe 57683896 Feb 22 11:07 sosreport-dhcp46-218.lab.eng.blr.redhat.com-20170222102430.tar.xz
-rwxr-xr-x. 1 qe qe 49431280 Feb 22 11:07 sosreport-dhcp46-221.lab.eng.blr.redhat.com-20170222102436.tar.xz
-rwxr-xr-x. 1 qe qe 46801048 Feb 22 11:07 sosreport-dhcp46-222.lab.eng.blr.redhat.com-20170222102441.tar.xz
-rwxr-xr-x. 1 qe qe 59767652 Feb 22 11:07 sosreport-dhcp46-239.lab.eng.blr.redhat.com-20170222102415.tar.xz
-rwxr-xr-x. 1 qe qe 60357960 Feb 22 11:07 sosreport-dhcp46-240.lab.eng.blr.redhat.com-20170222102420.tar.xz
-rwxr-xr-x. 1 qe qe 69726508 Feb 22 11:07 sosreport-dhcp46-242.lab.eng.blr.redhat.com-20170222102425.tar.xz
[qe@rhsqe-repo 1425681]$ 
[qe@rhsqe-repo 1425681]$

Comment 3 Sweta Anandpara 2017-02-23 05:40:55 UTC
Hit this again on another volume on the same cluster. CLI logs pasted below:

[root@dhcp46-239 ~]# gluster v info fds
Volume Name: fds
Type: Distribute
Volume ID: 52b53d2f-02a9-4a54-858b-168fcce82b4b
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off
[root@dhcp46-239 ~]# gluster v tier 
[root@dhcp46-239 ~]# gluster v tier attach
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 40, in <module>
    from libcxattr import Xattr
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 16, in <module>
    class Xattr(object):
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 28, in Xattr
    libc = CDLL(find_library("libc"), use_errno=True)
  File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
    return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
  File "/usr/lib64/python2.7/ctypes/util.py", line 93, in _findLib_gcc
    fdout, ccout = tempfile.mkstemp()
  File "/usr/lib64/python2.7/tempfile.py", line 297, in mkstemp
Usage: volume tier <VOLNAME> status
volume tier <VOLNAME> start [force]
volume tier <VOLNAME> attach [<replica COUNT>] <NEW-BRICK>... [force]
volume tier <VOLNAME> detach <start|stop|status|commit|force>

Tier command failed
[root@dhcp46-239 ~]#     dir = gettempdir()
  File "/usr/lib64/python2.7/tempfile.py", line 265, in gettempdir
    tempdir = _get_default_tempdir()
  File "/usr/lib64/python2.7/tempfile.py", line 197, in _get_default_tempdir
    fd = _os.open(filename, flags, 0o600)

[root@dhcp46-239 ~]# gluster v tier fds attach
volume attach-tier: failed: Another transaction is in progress Please try again after sometime.
Tier command failed
[root@dhcp46-239 ~]#

Comment 4 Atin Mukherjee 2017-02-24 06:38:50 UTC
I have an RCA for this issue now which will be a bit longer than usual, so read it with patience ;)

GlusterD maintains three different transaction frameworks:

(a) The op-sm framework, wherein both the originator glusterd (the one where the CLI command was initiated) and the receiver glusterd instances follow the same event-driven state transition framework.

(b) The synctask framework, where the originator glusterd follows synctask and the receiver glusterd instances follow op-sm.

(c) The most effective framework as it stands today, an enhanced version of synctask (mgmt_v3) where the phases of the transaction can be customized; here too the same framework is followed on both originator and receiver nodes.

Ideally, "Another transaction is in progress" or "locking failed" messages turn up when multiple transactions contend for the same volume lock. Commands can fail on such a collision, but the lock should never be held forever, which was not the case on this setup. The glusterd statedump showed that the UUID of the same glusterd node had acquired a lock on volume <XXX>. This made me first dig into the cmd_history log on all the nodes to figure out the pattern, and I did see a stream of consecutive commands run across all the nodes at the same time. With that information I went ahead and tried to simulate the same test case, and here is what I did:

1. Formed a 4 node cluster having a 2X2 volume
2. Wrote a script which executes a set of commands in a loop:

while true; do
    gluster volume status
    gluster volume geo-replication test-vol status
    gluster snapshot list test-vol
    gluster volume quota test-vol list
    gluster snapshot info test-vol
    gluster volume rebalance test-vol status
    gluster snapshot status test-vol
    gluster v status
done

3. Ran this script on all the nodes, and after a while the script started to hang after the snapshot info command.

This made me believe that there is something wrong in the rebalance status command flow.

[Now the next part of the update would be a bit more technical, so read it carefully]

The rebalance status command follows the op-sm framework, whereas the other commands are a mix of synctask and mgmt_v3. Given four concurrent scripts running on four nodes, the commands contend for the same volume lock very frequently. There is a high possibility that on Node 1 (N1 from now on), while a rebalance status is initiated (which follows op-sm), the same node is also processing another command as a receiver; if that command follows the synctask framework at the originator, the receiver will follow op-sm. In short, on N1 we could end up in a state where multiple commands are processed through op-sm at the same point in time.

Now, coming to how the op-sm framework works: as mentioned earlier, it is primarily an event-based state transition model, and it maintains a single global opinfo object to store the current transaction's event and its current and next states. So although multiple commands can be processed concurrently through op-sm, there is a design flaw: two commands cannot safely operate on the same global opinfo, and that is exactly what happened here. For example, while command 1 was deciding its next state to transition to, command 2 updated the opinfo, which made command 1 lose its flow and go down a completely different state transition. In this particular case, N1 took a lock and never went into the unlock phase to release it, which eventually got us into the stale lock issue. This becomes especially error prone when originator and receiver transactions collide with each other.
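The opinfo clobbering described above can be reduced to a toy shell model (purely illustrative, not glusterd code): two "transactions" share one global state file, and the second overwrites the first's state mid-flight.

```shell
# Toy model of the op-sm design flaw: a single global opinfo shared by all
# in-flight transactions (the file stands in for glusterd's global opinfo).
opinfo=$(mktemp)
echo "txn1:LOCK_ACQUIRED" > "$opinfo"    # transaction 1 records its current state
echo "txn2:STAGE_OP" > "$opinfo"         # transaction 2 clobbers the same global
state=$(cat "$opinfo")                   # txn1 reads back txn2's state...
echo "$state"                            # ...and transitions wrongly, never reaching its UNLOCK phase
rm -f "$opinfo"
```

With per-transaction state (as in mgmt_v3), txn1 would still read `txn1:LOCK_ACQUIRED` here and proceed to its unlock phase.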

IMO, solving this problem within op-sm is not at all easy, as it would require a design change. Instead we should retire the op-sm framework (at least on the originator side) and follow mgmt_v3 as much as possible.

I recollect that Avra had attempted to port rebalance from op-sm to mgmt_v3 through patch [1], but it was abandoned. We need to revive it and solve the problem that way instead of making changes to the existing op-sm framework.

[1] https://review.gluster.org/#/c/9043/

Comment 5 Atin Mukherjee 2017-02-24 06:39:51 UTC
*** Bug 1412907 has been marked as a duplicate of this bug. ***

Comment 7 Bhavana 2017-03-13 16:25:53 UTC
Updated the doc text slightly for the release notes

Comment 24 Oonkwee Lim_ 2017-10-02 15:13:23 UTC

Would someone be able to answer the following customer questions:

* What about implementing 'cluster-wide' locking for monitoring scripts, using gluster shared storage? Can this help avoid these problems?

* Can you give me a list of gluster commands that can safely be run in parallel from multiple nodes at the same time (and which cannot)?

* Locking (node or cluster wide) should be handled by gluster itself, not in 'upper layers'. It really doesn't make sense to write additional logic to check whether a given gluster command is running on another node in the cluster.
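On the first question, one possible sketch of cluster-wide serialization for monitoring scripts is flock(1) on a lock file placed on shared storage. The path below is a placeholder for this example; on a real cluster the lock file would need to live under the gluster shared-storage mount so that every node contends on the same file.

```shell
# Sketch: serialize monitoring commands across nodes with a shared lock file.
# LOCKFILE is an assumed path; in practice it should sit on gluster shared
# storage so all cluster nodes see the same lock.
LOCKFILE=/tmp/gluster-monitor.lock
# Wait up to 30s for the lock, then run the monitoring command under it.
# (The gluster command is echoed rather than executed in this sketch.)
out=$(flock --timeout 30 "$LOCKFILE" -c 'echo would run: gluster volume status')
echo "$out"
rm -f "$LOCKFILE"
```

Note this only serializes the monitoring scripts among themselves; as the customer points out, it is no substitute for glusterd handling its own volume locks correctly.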

Comment 32 Bala Konda Reddy M 2018-04-25 08:47:11 UTC
Build: 3.12.2-8

All volume operations succeed even after running the commands below in a loop across the whole cluster. No stale lock is held.

1. Created 30 volumes on a 3-node setup.
2. On all three nodes, started a set of volume commands in a loop:
while true; do
    gluster volume status
    gluster snapshot list test_10
    gluster volume quota test_10 list
    gluster snapshot info test_10
    gluster volume rebalance test_10 status
    gluster snapshot status test_10
    gluster v status
done

3. One node acquires the lock, and the other nodes report another transaction in progress.
4. The lock is held for 3 minutes.
5. After 3 minutes the lock is released, and the output of the volume commands is visible.

Hence marking it as verified.

Comment 33 errata-xmlrpc 2018-09-04 06:32:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

