Description of problem:

After a small number of flock rounds the lock remains behind indefinitely until cleared with "gluster volume clear-locks", after which normal operation resumes. I suspect this happens when there is contention on the lock. I've got a setup where these locks are used as a synchronization mechanism: a process on host A takes the lock and releases it on shutdown, at which point another host is most likely already trying to obtain the lock, and never manages to do so (clearing the granted lock allows things to proceed, but randomly clearing locks is a high-risk operation).

Version-Release number of selected component (if applicable):

glusterfs 6.1 (confirmed working correctly on 3.12.3 and 4.0.2, suspected correct on 4.1.5, but I no longer have a setup with 4.1.5 around).

How reproducible:

Trivial. In the mentioned application it happens on almost every single lock attempt as far as I can determine.

Steps to Reproduce:

morpheus ~ # gluster volume info shared

Volume Name: shared
Type: Replicate
Volume ID: a4410662-b6e0-4ed0-b1e0-a1cbf168029c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: morpheus:/mnt/gluster/shared
Brick2: r2d2:/mnt/gluster/shared
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

morpheus ~ # mkdir /mnt/t
morpheus ~ # mount -t glusterfs localhost:shared /mnt/t
morpheus ~ #

r2d2 ~ # mkdir /mnt/t
r2d2 ~ # mount -t glusterfs localhost:shared /mnt/t
r2d2 ~ #

morpheus ~ # cd /mnt/t/
morpheus ~ # ls -l
total 0
morpheus /mnt/t # exec 3>lockfile; c=0; while flock -w 10 -x 3; do (( c++ )); echo "Iteration $c passed"; exec 3<&-; exec 3>lockfile; done; echo "Failed after $c iterations"; exec 3<&-
Iteration 1 passed
Iteration 2 passed
Iteration 3 passed
...

r2d2 /mnt/t # exec 3>lockfile; c=0; while flock -w 10 -x 3; do (( c++ )); echo "Iteration $c passed"; exec 3<&-; exec 3>lockfile; done; echo "Failed after $c iterations"; exec 3<&-
Iteration 1 passed
Iteration 2 passed
Failed after 2 iterations
r2d2 /mnt/t #

Iteration 100 passed
Iteration 101 passed
Iteration 102 passed
Failed after 102 iterations
morpheus /mnt/t #

The two mounts failed at the same time; morpheus just passed more iterations because it was started first. Iterating on only one host, I've had to stop it with ^C at around 10k iterations, which to me is sufficient indication that the problem is contention related.

After the above failure I need to either rm the file, after which it works again, or issue:

gluster volume clear-locks shared /lockfile kind granted posix

On /tmp on my local machine (ext4 filesystem) I can run as many invocations of the loop above as I want without issues. On glusterfs 3.12.3 and 4.0.2 I tried the above too, and stopped them after 10k iterations. I have not observed the behaviour on glusterfs 4.1.5, which we used for a very long time.

I either need a fix for this, a way (preferably with little to no downtime; around 1.8TB of data in total) to downgrade glusterfs back to 4.1.x, or a way to work around this reliably from within my application code (mostly control scripts written in bash).
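For convenience, here is the same reproducer packaged as a standalone script, to be started on both hosts against the same mount (mount point /mnt/t as used above):

#!/bin/bash
# Same loop as the one-liner above: keep re-opening and flock()ing the
# shared lock file until a 10 second flock attempt times out.
cd /mnt/t || exit 1
c=0
exec 3>lockfile
while flock -w 10 -x 3; do
    (( c++ ))
    echo "Iteration $c passed"
    exec 3<&-
    exec 3>lockfile
done
echo "Failed after $c iterations"
exec 3<&-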
Hi Jaco, thanks for the report. Will update on this soon.
Looks like a simple test worth adding to our CI. We can do with 1000 iterations or so.
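Something along these lines, roughly (mount point and iteration count are placeholders only, not an actual test from the suite):

#!/bin/bash
# Sketch of a regression test: loop open/flock/close against a file on a
# glusterfs mount and fail if any single flock attempt takes longer than 10s.
MOUNT=/mnt/t    # assumed glusterfs mount point set up by the test harness
ITER=1000

cd "$MOUNT" || exit 1
for (( i = 1; i <= ITER; i++ )); do
    exec 3>lockfile
    if ! flock -w 10 -x 3; then
        echo "FAIL: flock timed out on iteration $i"
        exit 1
    fi
    exec 3<&-
done
echo "PASS: $ITER iterations"
exit 0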
The issue is always reproducible. Will update once I find the RCA.

Susant
I've managed to implement a workaround for this in php/bash (C/C++ will be similar). This "workaround" is perhaps how locking should have been implemented on our end in the first place (lock files get removed after use). The code uses a small(ish, 1s) timeout per flock() call due to the bug; a single overall timeout would be better, but given the bug it doesn't work as well as it could. Recursion can (and should) be eliminated, but I haven't spent a lot of time on this (getting it out the door was more urgent than making it optimal). This code does have the single advantage that lock files get removed after use again (it's based on discussions with other parties).

The other option for folks running into this is to look at dotlockfile(1), which doesn't rely on flock() but has other major timing-gap issues (individual attempts are atomic, but waiting is a simple sleep + retry, so if other processes grab the lock at the wrong times the invoking process could starve/fail unnecessarily).

Bash:

#!/bin/bash

# Obtain an exclusive lock on ${lockfile} using file descriptor ${fd},
# waiting up to ${waittime} seconds. Retries in 1s slices so that we can
# detect the lock file being removed/replaced (inode changed) underneath us.
function getlock() {
    local fd="$1"
    local lockfile="$2"
    local waittime="$3"

    # (Re)open the lock file on the requested descriptor.
    eval "exec $fd>\"\${lockfile}\"" || return $?
    local inum=$(stat -c%i - <&"$fd")

    local lwait="-w1"
    [ "${waittime}" -le 0 ] && lwait="-n"

    while ! flock ${lwait} -x "$fd"; do
        # If the lock file was removed/replaced, re-open it and try again.
        if [ "$(stat -c%i "${lockfile}" 2>/dev/null)" != "${inum}" ]; then
            eval "exec $fd>\"\${lockfile}\"" || return $?
            inum=$(stat -c%i - <&"$fd")
            continue
        fi
        (( waittime-- ))
        if [ "$waittime" -le 0 ]; then
            eval "exec $fd<&-"
            return 1
        fi
    done

    # Lock obtained; make sure it is still held on the current lock file.
    if [ "$(stat -c%i "${lockfile}" 2>/dev/null)" != "${inum}" ]; then
        eval "exec $fd<&-"
        getlock "$fd" "$lockfile" "${waittime}"
        return $?
    fi
    return 0
}

# Release the lock: remove the lock file, then close the descriptor.
function releaselock() {
    local fd="$1"
    local lockfile="$2"
    rm "${lockfile}"
    eval "exec $fd<&-"
}

PHP:

<?php

// Obtain an exclusive lock on $filename, waiting up to $lockwait seconds.
// Returns a lock object on success, NULL on failure.
function getlock($filename, $lockwait = -1 /* seconds */)
{
    $lock = new stdClass;
    $lock->filename = $filename;
    $lock->fp = fopen($filename, "w");
    if (!$lock->fp)
        return NULL;
    $lstat = fstat($lock->fp);
    if (!$lstat) {
        fclose($lock->fp);
        return NULL;
    }
    // Use SIGALRM to interrupt flock() every second so we can re-check
    // whether the lock file was removed/replaced underneath us.
    pcntl_signal(SIGALRM, function() {}, false);
    pcntl_alarm(1);
    while (!flock($lock->fp, LOCK_EX)) {
        pcntl_alarm(0);
        clearstatcache(true, $filename);
        $nstat = stat($filename);
        if (!$nstat || $nstat['ino'] != $lstat['ino']) {
            // Lock file changed; re-open it and retry.
            fclose($lock->fp);
            $lock->fp = fopen($filename, "w");
            if (!$lock->fp)
                return NULL;
            $lstat = fstat($lock->fp);
            if (!$lstat) {
                fclose($lock->fp);
                return NULL;
            }
        }
        if (--$lockwait < 0) {
            fclose($lock->fp);
            return NULL;
        }
        pcntl_alarm(1);
    }
    pcntl_alarm(0);
    // Lock obtained; make sure it is still held on the current lock file.
    clearstatcache(true, $filename);
    $nstat = stat($filename);
    if (!$nstat || $nstat['ino'] != $lstat['ino']) {
        fclose($lock->fp);
        return getlock($filename, $lockwait);
    }
    return $lock;
}

// Release the lock: remove the lock file, then close the descriptor.
function releaselock($lock)
{
    unlink($lock->filename);
    fclose($lock->fp);
}
?>
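Usage of the bash variant from a control script would look something like this (the descriptor number, lock file path and sourced file name are arbitrary choices for illustration):

#!/bin/bash
. /usr/local/lib/locklib.sh   # hypothetical file containing getlock/releaselock

if getlock 9 /mnt/t/app.lock 30; then
    # critical section: the lock is held on fd 9
    echo "lock held, doing work"
    releaselock 9 /mnt/t/app.lock
else
    echo "could not obtain lock within ~30s" >&2
    exit 1
fi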
Any progress here? Whilst the workaround is working, it ends up leaving a lot of garbage around since the original file never really gets unlocked, and ends up consuming an inode on at least one of the bricks, eventually (on smaller systems) resulting in out-of-inode situations on the filesystem.
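For what it's worth, the inode consumption is easy to watch on the bricks themselves, e.g. (brick path as per the volume info above):

# run on each brick host
df -i /mnt/gluster/shared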
@Jaco Kroon, I will assign this to someone to resolve this issue. @Susant, can you look into this? It's been stale for a long time now.
Moving this to the lock translator maintainer.
Kruthika, could you help with this issue?
This bug has been moved to https://github.com/gluster/glusterfs/issues/982 and will be tracked there from now on. Visit the GitHub issue URL for further details.
The GitHub tracker is not receiving any attention either, and I can't re-open the issue on that side. Very unhappy about this. GlusterFS 7.8 is still affected.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days