Bug 1425681
| Summary: | [Glusterd] Volume operations fail on a (tiered) volume because of a stale lock held by one of the nodes | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Sweta Anandpara <sanandpa> |
| Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
| Status: | CLOSED ERRATA | QA Contact: | Bala Konda Reddy M <bmekala> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | amukherj, asriram, bkunal, bmohanra, bturner, nchilaka, olim, rcyriac, rhinduja, rhs-bugs, sheggodu, storage-qa-internal, vbellur |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.12.2-1 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1499004 (view as bug list) | Environment: | |
| Last Closed: | 2018-09-04 06:32:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1499004, 1503239, 1557304 | | |
| Bug Blocks: | 1351530, 1472361, 1503135 | | |
Description    Sweta Anandpara    2017-02-22 05:40:44 UTC
Sosreports are available here:

    [qe@rhsqe-repo 1425681]$ pwd
    /home/repo/sosreports/1425681
    [qe@rhsqe-repo 1425681]$ hostname
    rhsqe-repo.lab.eng.blr.redhat.com
    [qe@rhsqe-repo 1425681]$ ll
    total 335724
    -rwxr-xr-x. 1 qe qe 57683896 Feb 22 11:07 sosreport-dhcp46-218.lab.eng.blr.redhat.com-20170222102430.tar.xz
    -rwxr-xr-x. 1 qe qe 49431280 Feb 22 11:07 sosreport-dhcp46-221.lab.eng.blr.redhat.com-20170222102436.tar.xz
    -rwxr-xr-x. 1 qe qe 46801048 Feb 22 11:07 sosreport-dhcp46-222.lab.eng.blr.redhat.com-20170222102441.tar.xz
    -rwxr-xr-x. 1 qe qe 59767652 Feb 22 11:07 sosreport-dhcp46-239.lab.eng.blr.redhat.com-20170222102415.tar.xz
    -rwxr-xr-x. 1 qe qe 60357960 Feb 22 11:07 sosreport-dhcp46-240.lab.eng.blr.redhat.com-20170222102420.tar.xz
    -rwxr-xr-x. 1 qe qe 69726508 Feb 22 11:07 sosreport-dhcp46-242.lab.eng.blr.redhat.com-20170222102425.tar.xz

Hit this again on another volume on the same cluster. CLI logs pasted below:

    [root@dhcp46-239 ~]# gluster v info fds

    Volume Name: fds
    Type: Distribute
    Volume ID: 52b53d2f-02a9-4a54-858b-168fcce82b4b
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 2
    Transport-type: tcp
    Bricks:
    Brick1: 10.70.46.239:/bricks/brick0/fds_0
    Brick2: 10.70.46.240:/bricks/brick0/fds_1
    Options Reconfigured:
    transport.address-family: inet
    performance.readdir-ahead: on
    nfs.disable: off

    [root@dhcp46-239 ~]# gluster v tier
    ^C
    [root@dhcp46-239 ~]# gluster v tier attach
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 40, in <module>
        from libcxattr import Xattr
      File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 16, in <module>
        class Xattr(object):
      File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 28, in Xattr
        libc = CDLL(find_library("libc"), use_errno=True)
      File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
        return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
      File "/usr/lib64/python2.7/ctypes/util.py", line 93, in _findLib_gcc
        fdout, ccout = tempfile.mkstemp()
      File "/usr/lib64/python2.7/tempfile.py", line 297, in mkstemp
        dir = gettempdir()
      File "/usr/lib64/python2.7/tempfile.py", line 265, in gettempdir
        tempdir = _get_default_tempdir()
      File "/usr/lib64/python2.7/tempfile.py", line 197, in _get_default_tempdir
        fd = _os.open(filename, flags, 0o600)
    KeyboardInterrupt

    Usage:
    volume tier <VOLNAME> status
    volume tier <VOLNAME> start [force]
    volume tier <VOLNAME> attach [<replica COUNT>] <NEW-BRICK>... [force]
    volume tier <VOLNAME> detach <start|stop|status|commit|force>

    Tier command failed
    [root@dhcp46-239 ~]# gluster v tier fds attach 10.70.46.221:/bricks/brick3/fds_tier0 10.70.46.222:/bricks/brick3/fds_tier1
    volume attach-tier: failed: Another transaction is in progress Please try again after sometime.
    Tier command failed
    [root@dhcp46-239 ~]#

I have an RCA for this issue now, which will be a bit longer than usual, so read it with patience ;)

GlusterD maintains three different transaction frameworks:

(a) The op-sm framework, in which both the originator glusterd (the node where the CLI command was initiated) and the receiver glusterd instances follow the same event-driven state-transition model.
(b) The synctask framework, in which the originator glusterd follows synctask while the receiver glusterd instances still follow op-sm.
(c) mgmt_v3, the most effective framework as it stands today: an enhanced version of synctask in which the phases of the transaction can be customised, and the same framework is followed on both originator and receiver nodes.

"Another transaction is in progress" or "locking failed" messages normally turn up when multiple transactions contend for the same volume lock. Commands can fail on such a collision, but the lock should never be held forever, and that is exactly what went wrong on this setup. The glusterd statedump showed that the UUID of the same glusterd node had acquired, and was still holding, the lock on volume <XXX>.
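As a side note, here is a minimal sketch of how such a held lock can be spotted from a glusterd statedump. It is purely illustrative and assumes a default layout (statedump written under /var/run/gluster/ on SIGUSR1, node UUID in /var/lib/glusterd/glusterd.info); the exact dump file name and the format of the lock/owner lines vary across glusterfs versions, so treat the filtering below as an assumption.

```python
# Illustrative helper, assuming a default glusterd statedump location
# (/var/run/gluster/) and that lock/owner details appear on matching lines;
# the exact file name and dump layout vary across glusterfs versions.
import glob
import os
import signal
import subprocess
import time

pid = int(subprocess.check_output(["pidof", "glusterd"]).split()[0])
os.kill(pid, signal.SIGUSR1)       # glusterd writes a statedump on SIGUSR1
time.sleep(2)                      # give it a moment to finish writing

dumps = sorted(glob.glob("/var/run/gluster/glusterdump.*"), key=os.path.getmtime)
with open(dumps[-1]) as f:         # newest dump
    for line in f:
        if "lock" in line.lower() or "owner" in line.lower():
            print(line.rstrip())

# The owner UUID printed above can be matched against each peer's
# /var/lib/glusterd/glusterd.info to identify the node holding the stale lock.
```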
This made me first jump into the cmd_history logs on all the nodes to figure out the pattern, and I did see a stream of consecutive commands being run across all the nodes at the same time. With that information I went ahead and tried to simulate the same test case. Here is what I did:

1. Formed a 4-node cluster with a 2x2 volume.
2. Wrote a script which executes the following set of commands in a loop:

        while true
        do
            gluster volume status
            gluster volume geo-replication test-vol status
            gluster snapshot list test-vol
            gluster volume quota test-vol list
            gluster snapshot info test-vol
            gluster volume rebalance test-vol status
            gluster snapshot status test-vol
            gluster v status
        done

3. Ran this script on all the nodes. After a while the script started to hang after the snapshot info command, which made me believe that something is wrong in the rebalance status command flow.

[Now the next part of the update is a bit more technical, so read it carefully.]

The rebalance status command follows the op-sm framework, whereas the other commands are a mix of synctask and mgmt_v3. With four copies of the script running concurrently on four nodes, the commands contend for the same volume lock very frequently, and there can be a case where node 1 (N1 from now on) initiates a rebalance status (which follows op-sm) while it is also processing another command as a receiver; if that other command follows the synctask framework at its originator, the receiver side still follows op-sm. In short, on N1 we can end up in a state where multiple commands are being processed through op-sm at the same point in time.

Now, coming to how the op-sm framework works: as mentioned earlier, it is primarily an event-based state-transition model, and it maintains a single global opinfo object to store the current transaction, its subsequent event, and its current and next states. So although multiple commands can be processed concurrently through op-sm, there is a design flaw: two commands cannot safely operate on the same global opinfo, and that is exactly what happened here. For example, while command 1 was deciding its next state to transition to, command 2 updated the opinfo, which made command 1 lose its flow and go off into a completely different state transition. In this particular case, N1 took the lock and never reached the unlock phase to release it, which eventually left us with the stale lock. This becomes especially error-prone when originator and receiver transactions collide with each other.

IMO, solving this problem inside op-sm is not at all easy, as it would require a design change of sorts. Instead we should retire the op-sm framework (at least on the originator side) and follow mgmt_v3 as much as possible.
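To make the opinfo collision concrete, here is a minimal illustrative model (plain Python, not glusterd code; names such as run_txn and the state list are made up) of two transactions driving their state transitions off a single global opinfo, the way op-sm does. One of them loses its context mid-flight and never reaches its unlock phase:

```python
# Illustrative model only, not glusterd code: two "transactions" share one
# global opinfo the way the op-sm framework does. All names are made up.
import threading
import time

opinfo = {"txn": None, "state": "DEFAULT"}       # single global, shared by all


def run_txn(name, states):
    for next_state in states:
        # Record which transaction owns opinfo and what its next state is...
        opinfo["txn"] = name
        opinfo["state"] = next_state
        time.sleep(0.001)                        # window for the other txn to run
        # ...but by the time we act on it, another transaction may have
        # overwritten opinfo, so this transaction falls out of its own flow.
        if opinfo["txn"] != name:
            print("%s lost opinfo to %s at state %s; unlock never reached"
                  % (name, opinfo["txn"], next_state))
            return                               # the lock taken earlier stays held
    print("%s completed LOCK -> STAGE -> COMMIT -> UNLOCK" % name)


t1 = threading.Thread(target=run_txn,
                      args=("rebalance-status", ["LOCK", "STAGE", "COMMIT", "UNLOCK"]))
t2 = threading.Thread(target=run_txn,
                      args=("volume-status", ["LOCK", "STAGE", "COMMIT", "UNLOCK"]))
t1.start(); t2.start()
t1.join(); t2.join()
```

With per-transaction context of the kind mgmt_v3 keeps, each transaction would own its opinfo, and the unlock phase could not be skipped this way.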
I recollect that Avra had attempted to port rebalance from op-sm to mgmt_v3 through the patch [1], but it got abandoned. We need to revive it and solve this problem that way instead of making any changes in the existing op-sm framework.

[1] https://review.gluster.org/#/c/9043/

*** Bug 1412907 has been marked as a duplicate of this bug. ***

Updated the doc text slightly for the release notes.

Hello, would someone be able to answer the following customer questions:

* What about implementing cluster-wide locking for monitoring scripts, using gluster shared storage? Can this help with avoiding these problems?
* Can you give me a list of gluster commands that can safely be run in parallel from multiple nodes at the same time (and which cannot)?
* Locking (node or cluster wide) should be handled by gluster itself, not in upper layers. It really doesn't make sense to write additional logic to check whether a gluster command is running on another node in the cluster.

Build: 3.12.2-8

All volume operations succeed even after running the commands below in a loop across the whole cluster. No stale lock is held.

1. Created 30 volumes on a 3-node setup.
2. On all three nodes started the following set of volume commands:

        while true
        do
            gluster volume status
            gluster snapshot list test_10
            gluster volume quota test_10 list
            gluster snapshot info test_10
            gluster volume rebalance test_10 status
            gluster snapshot status test_10
            gluster v status
        done

3. One node acquired the lock, and on the other nodes the commands report that another transaction is in progress.
4. The lock is held for 3 minutes.
5. After 3 minutes the lock is released and the output of the volume commands is visible again.

Hence marking it as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607