Bug 1425681
| Summary: | [Glusterd] Volume operations fail on a (tiered) volume because of a stale lock held by one of the nodes | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Sweta Anandpara <sanandpa> |
| Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
| Status: | CLOSED ERRATA | QA Contact: | Bala Konda Reddy M <bmekala> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | amukherj, asriram, bkunal, bmohanra, bturner, nchilaka, olim, rcyriac, rhinduja, rhs-bugs, sheggodu, storage-qa-internal, vbellur |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.12.2-1 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1499004 (view as bug list) | Environment: | |
| Last Closed: | 2018-09-04 06:32:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1499004, 1503239, 1557304 | | |
| Bug Blocks: | 1351530, 1472361, 1503135 | | |
Description    Sweta Anandpara    2017-02-22 05:40:44 UTC
Sosreports are available here:

    [qe@rhsqe-repo 1425681]$ pwd
    /home/repo/sosreports/1425681
    [qe@rhsqe-repo 1425681]$ hostname
    rhsqe-repo.lab.eng.blr.redhat.com
    [qe@rhsqe-repo 1425681]$ ll
    total 335724
    -rwxr-xr-x. 1 qe qe 57683896 Feb 22 11:07 sosreport-dhcp46-218.lab.eng.blr.redhat.com-20170222102430.tar.xz
    -rwxr-xr-x. 1 qe qe 49431280 Feb 22 11:07 sosreport-dhcp46-221.lab.eng.blr.redhat.com-20170222102436.tar.xz
    -rwxr-xr-x. 1 qe qe 46801048 Feb 22 11:07 sosreport-dhcp46-222.lab.eng.blr.redhat.com-20170222102441.tar.xz
    -rwxr-xr-x. 1 qe qe 59767652 Feb 22 11:07 sosreport-dhcp46-239.lab.eng.blr.redhat.com-20170222102415.tar.xz
    -rwxr-xr-x. 1 qe qe 60357960 Feb 22 11:07 sosreport-dhcp46-240.lab.eng.blr.redhat.com-20170222102420.tar.xz
    -rwxr-xr-x. 1 qe qe 69726508 Feb 22 11:07 sosreport-dhcp46-242.lab.eng.blr.redhat.com-20170222102425.tar.xz

Hit this again on another volume on the same cluster. CLI logs pasted below:

    [root@dhcp46-239 ~]# gluster v info fds

    Volume Name: fds
    Type: Distribute
    Volume ID: 52b53d2f-02a9-4a54-858b-168fcce82b4b
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 2
    Transport-type: tcp
    Bricks:
    Brick1: 10.70.46.239:/bricks/brick0/fds_0
    Brick2: 10.70.46.240:/bricks/brick0/fds_1
    Options Reconfigured:
    transport.address-family: inet
    performance.readdir-ahead: on
    nfs.disable: off

    [root@dhcp46-239 ~]# gluster v tier
    ^C
    [root@dhcp46-239 ~]# gluster v tier attach
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 40, in <module>
        from libcxattr import Xattr
      File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 16, in <module>
        class Xattr(object):
      File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 28, in Xattr
        libc = CDLL(find_library("libc"), use_errno=True)
      File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
        return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
      File "/usr/lib64/python2.7/ctypes/util.py", line 93, in _findLib_gcc
        fdout, ccout = tempfile.mkstemp()
      File "/usr/lib64/python2.7/tempfile.py", line 297, in mkstemp
        dir = gettempdir()
      File "/usr/lib64/python2.7/tempfile.py", line 265, in gettempdir
        tempdir = _get_default_tempdir()
      File "/usr/lib64/python2.7/tempfile.py", line 197, in _get_default_tempdir
        fd = _os.open(filename, flags, 0o600)
    KeyboardInterrupt

    Usage:
    volume tier <VOLNAME> status
    volume tier <VOLNAME> start [force]
    volume tier <VOLNAME> attach [<replica COUNT>] <NEW-BRICK>... [force]
    volume tier <VOLNAME> detach <start|stop|status|commit|force>

    Tier command failed
    [root@dhcp46-239 ~]# gluster v tier fds attach 10.70.46.221:/bricks/brick3/fds_tier0 10.70.46.222:/bricks/brick3/fds_tier1
    volume attach-tier: failed: Another transaction is in progress Please try again after sometime.
    Tier command failed
    [root@dhcp46-239 ~]#

I have an RCA for this issue now, which will be a bit longer than usual, so read it with patience ;)

GlusterD maintains three different transaction frameworks:

(a) The op-sm framework, in which both the originator glusterd (the node where the CLI command was initiated) and the receiver glusterd instances follow the same event-driven state-transition model.
(b) The synctask framework, in which the originator glusterd follows synctask while the receiver glusterd instances still follow op-sm.
(c) mgmt_v3, the most effective framework as it stands today: an enhanced version of synctask in which the phases of the transaction can be customised, and the same framework is followed on both originator and receiver nodes.

"Another transaction is in progress" or "locking failed" messages normally turn up when multiple transactions contend for the same volume lock. Commands can fail on such a collision, but the lock should never be held forever, and that is exactly what went wrong on this setup. The glusterd statedump showed that the UUID of the same glusterd node had acquired, and was still holding, the lock on volume <XXX>.
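As a side note, here is a minimal sketch of how such a held lock can be spotted from a glusterd statedump. It is purely illustrative and assumes a default layout (statedump written under /var/run/gluster/ on SIGUSR1, node UUID in /var/lib/glusterd/glusterd.info); the exact dump file name and the format of the lock/owner lines vary across glusterfs versions, so treat the filtering below as an assumption.

```python
# Illustrative helper, assuming a default glusterd statedump location
# (/var/run/gluster/) and that lock/owner details appear on matching lines;
# the exact file name and dump layout vary across glusterfs versions.
import glob
import os
import signal
import subprocess
import time

pid = int(subprocess.check_output(["pidof", "glusterd"]).split()[0])
os.kill(pid, signal.SIGUSR1)       # glusterd writes a statedump on SIGUSR1
time.sleep(2)                      # give it a moment to finish writing

dumps = sorted(glob.glob("/var/run/gluster/glusterdump.*"), key=os.path.getmtime)
with open(dumps[-1]) as f:         # newest dump
    for line in f:
        if "lock" in line.lower() or "owner" in line.lower():
            print(line.rstrip())

# The owner UUID printed above can be matched against each peer's
# /var/lib/glusterd/glusterd.info to identify the node holding the stale lock.
```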
This made me first jump into the cmd_history logs on all the nodes to figure out the pattern, and I did see a stream of consecutive commands being run across all the nodes at the same time. With that information I went ahead and tried to simulate the same test case. Here is what I did:

1. Formed a 4-node cluster with a 2x2 volume.
2. Wrote a script which executes the following set of commands in a loop:

        while true
        do
            gluster volume status
            gluster volume geo-replication test-vol status
            gluster snapshot list test-vol
            gluster volume quota test-vol list
            gluster snapshot info test-vol
            gluster volume rebalance test-vol status
            gluster snapshot status test-vol
            gluster v status
        done

3. Ran this script on all the nodes. After a while the script started to hang after the snapshot info command, which made me believe that something is wrong in the rebalance status command flow.

[Now the next part of the update is a bit more technical, so read it carefully.]

The rebalance status command follows the op-sm framework, whereas the other commands are a mix of synctask and mgmt_v3. With four copies of the script running concurrently on four nodes, the commands contend for the same volume lock very frequently, and there can be a case where node 1 (N1 from now on) initiates a rebalance status (which follows op-sm) while it is also processing another command as a receiver; if that other command follows the synctask framework at its originator, the receiver side still follows op-sm. In short, on N1 we can end up in a state where multiple commands are being processed through op-sm at the same point in time.

Now, coming to how the op-sm framework works: as mentioned earlier, it is primarily an event-based state-transition model, and it maintains a single global opinfo object to store the current transaction, its subsequent event, and its current and next states. So although multiple commands can be processed concurrently through op-sm, there is a design flaw: two commands cannot safely operate on the same global opinfo, and that is exactly what happened here. For example, while command 1 was deciding its next state to transition to, command 2 updated the opinfo, which made command 1 lose its flow and go off into a completely different state transition. In this particular case, N1 took the lock and never reached the unlock phase to release it, which eventually left us with the stale lock. This becomes especially error-prone when originator and receiver transactions collide with each other.

IMO, solving this problem inside op-sm is not at all easy, as it would require a design change of sorts. Instead we should retire the op-sm framework (at least on the originator side) and follow mgmt_v3 as much as possible.
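To make the opinfo collision concrete, here is a minimal illustrative model (plain Python, not glusterd code; names such as run_txn and the state list are made up) of two transactions driving their state transitions off a single global opinfo, the way op-sm does. One of them loses its context mid-flight and never reaches its unlock phase:

```python
# Illustrative model only, not glusterd code: two "transactions" share one
# global opinfo the way the op-sm framework does. All names are made up.
import threading
import time

opinfo = {"txn": None, "state": "DEFAULT"}       # single global, shared by all


def run_txn(name, states):
    for next_state in states:
        # Record which transaction owns opinfo and what its next state is...
        opinfo["txn"] = name
        opinfo["state"] = next_state
        time.sleep(0.001)                        # window for the other txn to run
        # ...but by the time we act on it, another transaction may have
        # overwritten opinfo, so this transaction falls out of its own flow.
        if opinfo["txn"] != name:
            print("%s lost opinfo to %s at state %s; unlock never reached"
                  % (name, opinfo["txn"], next_state))
            return                               # the lock taken earlier stays held
    print("%s completed LOCK -> STAGE -> COMMIT -> UNLOCK" % name)


t1 = threading.Thread(target=run_txn,
                      args=("rebalance-status", ["LOCK", "STAGE", "COMMIT", "UNLOCK"]))
t2 = threading.Thread(target=run_txn,
                      args=("volume-status", ["LOCK", "STAGE", "COMMIT", "UNLOCK"]))
t1.start(); t2.start()
t1.join(); t2.join()
```

With per-transaction context of the kind mgmt_v3 keeps, each transaction would own its opinfo, and the unlock phase could not be skipped this way.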
I recollect that Avra had attempted to port rebalance from op-sm to mgmt_v3 through the patch [1], but it got abandoned. We need to revive it and solve this problem that way instead of making any changes in the existing op-sm framework.

[1] https://review.gluster.org/#/c/9043/

*** Bug 1412907 has been marked as a duplicate of this bug. ***

Updated the doc text slightly for the release notes.

Hello, would someone be able to answer the following customer questions:

* What about implementing cluster-wide locking for monitoring scripts, using gluster shared storage? Can this help with avoiding these problems?
* Can you give me a list of gluster commands that can safely be run in parallel from multiple nodes at the same time (and which cannot)?
* Locking (node or cluster wide) should be handled by gluster itself, not in upper layers. It really doesn't make sense to write additional logic to check whether a gluster command is running on another node in the cluster.

Build: 3.12.2-8

All volume operations succeed even after running the commands below in a loop across the whole cluster. No stale lock is held.

1. Created 30 volumes on a 3-node setup.
2. On all three nodes started the following set of volume commands:

        while true
        do
            gluster volume status
            gluster snapshot list test_10
            gluster volume quota test_10 list
            gluster snapshot info test_10
            gluster volume rebalance test_10 status
            gluster snapshot status test_10
            gluster v status
        done

3. One node acquired the lock, and on the other nodes the commands report that another transaction is in progress.
4. The lock is held for 3 minutes.
5. After 3 minutes the lock is released and the output of the volume commands is visible again.

Hence marking it as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607