Bug 1718562
Summary: | flock failure (regression) | | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Jaco Kroon <jaco> |
Component: | locks | Assignee: | bugs <bugs> |
Status: | CLOSED UPSTREAM | QA Contact: | |
Severity: | urgent | Docs Contact: | |
Priority: | high | | |
Version: | 6 | CC: | bugs, jthottan, kdhananj, pasik, pumice_unproven968, sheggodu, spalai |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-11-17 05:53:25 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description Jaco Kroon 2019-06-08 16:34:10 UTC
Hi Jaco, thanks for the report. Will update on this soon.

Looks like a simple test worth adding to our CI. We can do with 1000 iterations or so.

The issue is reproducible always. Will update once I find the RCA.

Susant, I've managed to implement a workaround for this in php/bash (C/C++ will be similar; see the C sketch after this comment thread). This "work around" is perhaps how locking should have been implemented on our end in the first place (lock files get removed after use). The code uses a small-ish (1 s) timeout per flock() call due to the bug; a single, more global timeout would be better, but given this bug that doesn't work as well as it could. Recursion can (and should) be eliminated, but I haven't spent a lot of time on this (getting it out the door was more urgent than making it optimal). This code does have the single advantage that lock files get removed after use again (it's based on discussions with other parties).

The other option for folks running into this is to look at dotlockfile(1), which doesn't rely on flock() but has other major timing-gap issues (retries are atomic, but waiting is a simple sleep + retry, so if other processes grab locks at the wrong time, the invoking process could starve/fail without needing to).

Bash:

```bash
#!/bin/bash
# Workaround helpers: take an exclusive lock on a lock file, but re-check the
# file's inode so a lock held on an unlinked/replaced file is never trusted.
# The lock file is removed again on release.

function getlock() {
	local fd="$1"
	local lockfile="$2"
	local waittime="$3"

	# Open (create/truncate) the lock file on the caller-supplied descriptor.
	eval "exec $fd>\"\${lockfile}\"" || return $?
	local inum=$(stat -c%i - <&$fd)

	local lwait="-w1"
	[ "${waittime}" -le 0 ] && lwait=-n

	# Short (1s) timeout per flock() attempt to work around the bug.
	while ! flock ${lwait} -x $fd; do
		# Lock file unlinked/replaced while we waited?  Re-open and retry.
		if [ "$(stat -c%i "${lockfile}" 2>/dev/null)" != "${inum}" ]; then
			eval "exec $fd>\"\${lockfile}\"" || return $?
			inum=$(stat -c%i - <&$fd)
			continue
		fi
		(( waittime-- ))
		if [ $waittime -le 0 ]; then
			eval "exec $fd<&-"
			return 1
		fi
	done

	# Only trust the lock if the path still refers to the inode we locked.
	if [ "$(stat -c%i "${lockfile}" 2>/dev/null)" != "${inum}" ]; then
		eval "exec $fd<&-"
		getlock "$fd" "$lockfile" "${waittime}"
		return $?
	fi
	return 0
}

function releaselock() {
	local fd="$1"
	local lockfile="$2"
	# Unlink first so waiters see the inode change, then close the descriptor.
	rm "${lockfile}"
	eval "exec $fd<&-"
}
```

PHP:

```php
<?php
function getlock($filename, $lockwait = -1 /* seconds */)
{
	$lock = new stdClass;
	$lock->filename = $filename;
	$lock->fp = fopen($filename, "w");
	if (!$lock->fp)
		return NULL;
	$lstat = fstat($lock->fp);
	if (!$lstat) {
		fclose($lock->fp);
		return NULL;
	}
	// Empty SIGALRM handler without syscall restart, so the alarm interrupts
	// flock() and caps each attempt at roughly one second.
	pcntl_signal(SIGALRM, function() {}, false);
	pcntl_alarm(1);
	while (!flock($lock->fp, LOCK_EX)) {
		pcntl_alarm(0);
		clearstatcache(true, $filename);
		$nstat = stat($filename);
		// Lock file unlinked/replaced while we waited?  Re-open and retry.
		if (!$nstat || $nstat['ino'] != $lstat['ino']) {
			fclose($lock->fp);
			$lock->fp = fopen($filename, "w");
			if (!$lock->fp)
				return NULL;
			$lstat = fstat($lock->fp);
			if (!$lstat) {
				fclose($lock->fp);
				return NULL;
			}
		}
		if (--$lockwait < 0) {
			fclose($lock->fp);
			return NULL;
		}
		pcntl_alarm(1);
	}
	pcntl_alarm(0);
	clearstatcache(true, $filename);
	$nstat = stat($filename);
	// Only trust the lock if the path still refers to the inode we locked.
	if (!$nstat || $nstat['ino'] != $lstat['ino']) {
		fclose($lock->fp);
		return getlock($filename, $lockwait);
	}
	return $lock;
}

function releaselock($lock)
{
	// Unlink first so waiters see the inode change, then drop the lock.
	unlink($lock->filename);
	fclose($lock->fp);
}
?>
```

Any progress here? Whilst the workaround is working, it ends up leaving a lot of garbage around, since the original file never really gets unlocked and ends up consuming an inode on at least one of the bricks, eventually (on smaller systems) resulting in out-of-inode situations on the filesystem.

@Jaco Kroon, I will assign this to someone to resolve this issue.

@Susant, can you look into this issue? It's been stale for a long time now.

Moving this to the locks translator maintainer. Kruthika, could you help with this issue?
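Since the report says a C/C++ version of the workaround would look much the same, here is a minimal sketch of that pattern in C. The getlock/releaselock names, the 1-second alarm interval and the 0644 lock-file mode are assumptions mirroring the bash/PHP snippets above; this is an illustrative sketch, not code taken from the report.

```c
/*
 * Sketch of the same workaround in C (assumed names/constants; mirrors the
 * bash/PHP snippets): take flock() in short, alarm-bounded attempts and only
 * trust the lock if the locked inode is still the one the path names.
 */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; }    /* empty handler: flock() returns EINTR */

/* Returns an fd holding the exclusive lock, or -1 on error/timeout. */
int getlock(const char *lockfile, int waittime)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;                   /* deliberately no SA_RESTART */
    sigaction(SIGALRM, &sa, NULL);

    int fd = open(lockfile, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    struct stat fdst, pathst;
    if (fstat(fd, &fdst) < 0) {
        close(fd);
        return -1;
    }

    for (;;) {
        alarm(1);                               /* cap each flock() attempt at ~1s */
        int rc = flock(fd, LOCK_EX);
        int err = errno;
        alarm(0);
        if (rc == 0)
            break;
        if (err != EINTR) {                     /* real failure, not just the alarm */
            close(fd);
            return -1;
        }
        /* Lock file unlinked/replaced while we waited?  Re-open and retry. */
        if (stat(lockfile, &pathst) < 0 || pathst.st_ino != fdst.st_ino) {
            close(fd);
            fd = open(lockfile, O_WRONLY | O_CREAT, 0644);
            if (fd < 0 || fstat(fd, &fdst) < 0) {
                if (fd >= 0)
                    close(fd);
                return -1;
            }
            continue;
        }
        if (--waittime < 0) {                   /* overall wait budget exhausted */
            close(fd);
            return -1;
        }
    }

    /* Only trust the lock if the path still refers to the inode we locked. */
    if (stat(lockfile, &pathst) < 0 || pathst.st_ino != fdst.st_ino) {
        close(fd);
        return getlock(lockfile, waittime);
    }
    return fd;
}

/* Unlink first so waiters see the inode change, then drop the lock. */
void releaselock(int fd, const char *lockfile)
{
    unlink(lockfile);
    close(fd);
}
```

As with the shell version, the caller releases via releaselock() so the lock file disappears and any waiter detects the inode change and re-opens.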
This bug has been moved to https://github.com/gluster/glusterfs/issues/982 and will be tracked there from now on. Visit the GitHub issue for further details.

The GitHub tracker is not receiving any attention either, and I can't re-open the issue on that side. Very unhappy about this issue. GlusterFS 7.8 is still affected.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.