Bug 1718562
Summary: | flock failure (regression) | | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Jaco Kroon <jaco> |
Component: | locks | Assignee: | bugs <bugs> |
Status: | CLOSED UPSTREAM | QA Contact: | |
Severity: | urgent | Docs Contact: | |
Priority: | high | | |
Version: | 6 | CC: | bugs, jthottan, kdhananj, pasik, pumice_unproven968, sheggodu, spalai |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-11-17 05:53:25 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description Jaco Kroon 2019-06-08 16:34:10 UTC
Hi Jaco, thanks for the report. Will update on this soon.

Looks like a simple test worth adding to our CI. We can do with 1000 iterations or so.

The issue is reproducible always. Will update once I find the RCA.

Susant, I've managed to implement a workaround for this in php/bash (C/C++ will be similar; see the C sketch after this comment thread). This "work around" is perhaps how locking should have been implemented on our end in the first place (lock files get removed after use). The code uses a small-ish (1 s) timeout per flock() call due to the bug; a single, more global timeout would be better, but given this bug that doesn't work as well as it could. Recursion can (and should) be eliminated, but I haven't spent a lot of time on this (getting it out the door was more urgent than making it optimal). This code does have the single advantage that lock files get removed after use again (it's based on discussions with other parties).

The other option for folks running into this is to look at dotlockfile(1), which doesn't rely on flock() but has other major timing-gap issues (retries are atomic, but waiting is a simple sleep + retry, so if other processes grab locks at the wrong time, the invoking process could starve/fail without needing to).

Bash:

```bash
#!/bin/bash
# Workaround helpers: take an exclusive lock on a lock file, but re-check the
# file's inode so a lock held on an unlinked/replaced file is never trusted.
# The lock file is removed again on release.

function getlock() {
	local fd="$1"
	local lockfile="$2"
	local waittime="$3"

	# Open (create/truncate) the lock file on the caller-supplied descriptor.
	eval "exec $fd>\"\${lockfile}\"" || return $?
	local inum=$(stat -c%i - <&$fd)

	local lwait="-w1"
	[ "${waittime}" -le 0 ] && lwait=-n

	# Short (1s) timeout per flock() attempt to work around the bug.
	while ! flock ${lwait} -x $fd; do
		# Lock file unlinked/replaced while we waited?  Re-open and retry.
		if [ "$(stat -c%i "${lockfile}" 2>/dev/null)" != "${inum}" ]; then
			eval "exec $fd>\"\${lockfile}\"" || return $?
			inum=$(stat -c%i - <&$fd)
			continue
		fi
		(( waittime-- ))
		if [ $waittime -le 0 ]; then
			eval "exec $fd<&-"
			return 1
		fi
	done

	# Only trust the lock if the path still refers to the inode we locked.
	if [ "$(stat -c%i "${lockfile}" 2>/dev/null)" != "${inum}" ]; then
		eval "exec $fd<&-"
		getlock "$fd" "$lockfile" "${waittime}"
		return $?
	fi
	return 0
}

function releaselock() {
	local fd="$1"
	local lockfile="$2"
	# Unlink first so waiters see the inode change, then close the descriptor.
	rm "${lockfile}"
	eval "exec $fd<&-"
}
```

PHP:

```php
<?php
function getlock($filename, $lockwait = -1 /* seconds */)
{
	$lock = new stdClass;
	$lock->filename = $filename;
	$lock->fp = fopen($filename, "w");
	if (!$lock->fp)
		return NULL;
	$lstat = fstat($lock->fp);
	if (!$lstat) {
		fclose($lock->fp);
		return NULL;
	}
	// Empty SIGALRM handler without syscall restart, so the alarm interrupts
	// flock() and caps each attempt at roughly one second.
	pcntl_signal(SIGALRM, function() {}, false);
	pcntl_alarm(1);
	while (!flock($lock->fp, LOCK_EX)) {
		pcntl_alarm(0);
		clearstatcache(true, $filename);
		$nstat = stat($filename);
		// Lock file unlinked/replaced while we waited?  Re-open and retry.
		if (!$nstat || $nstat['ino'] != $lstat['ino']) {
			fclose($lock->fp);
			$lock->fp = fopen($filename, "w");
			if (!$lock->fp)
				return NULL;
			$lstat = fstat($lock->fp);
			if (!$lstat) {
				fclose($lock->fp);
				return NULL;
			}
		}
		if (--$lockwait < 0) {
			fclose($lock->fp);
			return NULL;
		}
		pcntl_alarm(1);
	}
	pcntl_alarm(0);
	clearstatcache(true, $filename);
	$nstat = stat($filename);
	// Only trust the lock if the path still refers to the inode we locked.
	if (!$nstat || $nstat['ino'] != $lstat['ino']) {
		fclose($lock->fp);
		return getlock($filename, $lockwait);
	}
	return $lock;
}

function releaselock($lock)
{
	// Unlink first so waiters see the inode change, then drop the lock.
	unlink($lock->filename);
	fclose($lock->fp);
}
?>
```

Any progress here? Whilst the workaround is working, it ends up leaving a lot of garbage around, since the original file never really gets unlocked and ends up consuming an inode on at least one of the bricks, eventually (on smaller systems) resulting in out-of-inode situations on the filesystem.

@Jaco Kroon, I will assign this to someone to resolve this issue.

@Susant, can you look into this issue? It's been stale for a long time now.

Moving this to the locks translator maintainer. Kruthika, could you help with this issue?
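Since the report says a C/C++ version of the workaround would look much the same, here is a minimal sketch of that pattern in C. The getlock/releaselock names, the 1-second alarm interval and the 0644 lock-file mode are assumptions mirroring the bash/PHP snippets above; this is an illustrative sketch, not code taken from the report.

```c
/*
 * Sketch of the same workaround in C (assumed names/constants; mirrors the
 * bash/PHP snippets): take flock() in short, alarm-bounded attempts and only
 * trust the lock if the locked inode is still the one the path names.
 */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; }    /* empty handler: flock() returns EINTR */

/* Returns an fd holding the exclusive lock, or -1 on error/timeout. */
int getlock(const char *lockfile, int waittime)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;                   /* deliberately no SA_RESTART */
    sigaction(SIGALRM, &sa, NULL);

    int fd = open(lockfile, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    struct stat fdst, pathst;
    if (fstat(fd, &fdst) < 0) {
        close(fd);
        return -1;
    }

    for (;;) {
        alarm(1);                               /* cap each flock() attempt at ~1s */
        int rc = flock(fd, LOCK_EX);
        int err = errno;
        alarm(0);
        if (rc == 0)
            break;
        if (err != EINTR) {                     /* real failure, not just the alarm */
            close(fd);
            return -1;
        }
        /* Lock file unlinked/replaced while we waited?  Re-open and retry. */
        if (stat(lockfile, &pathst) < 0 || pathst.st_ino != fdst.st_ino) {
            close(fd);
            fd = open(lockfile, O_WRONLY | O_CREAT, 0644);
            if (fd < 0 || fstat(fd, &fdst) < 0) {
                if (fd >= 0)
                    close(fd);
                return -1;
            }
            continue;
        }
        if (--waittime < 0) {                   /* overall wait budget exhausted */
            close(fd);
            return -1;
        }
    }

    /* Only trust the lock if the path still refers to the inode we locked. */
    if (stat(lockfile, &pathst) < 0 || pathst.st_ino != fdst.st_ino) {
        close(fd);
        return getlock(lockfile, waittime);
    }
    return fd;
}

/* Unlink first so waiters see the inode change, then drop the lock. */
void releaselock(int fd, const char *lockfile)
{
    unlink(lockfile);
    close(fd);
}
```

As with the shell version, the caller releases via releaselock() so the lock file disappears and any waiter detects the inode change and re-opens.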
This bug has been moved to https://github.com/gluster/glusterfs/issues/982 and will be tracked there from now on. Visit the GitHub issue for further details.

The GitHub tracker is not receiving any attention either, and I can't re-open the issue on that side. Very unhappy about this issue. GlusterFS 7.8 is still affected.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.