Bug 762749 (GLUSTER-1017) - Locking deadlock when upgrading lock
Summary: Locking deadlock when upgrading lock
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-1017
Product: GlusterFS
Classification: Community
Component: locks
Version: 3.2.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact: Raghavendra Bhat
URL:
Whiteboard:
Depends On:
Blocks: 817967
 
Reported: 2010-06-22 13:55 UTC by Alex/AT
Modified: 2013-07-24 17:26 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-24 17:26:00 UTC
Regression: ---
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions: glusterfs-3.3.0
Embargoed:


Attachments (Terms of Use)
ctdb setup configuration files (892 bytes, application/x-gzip)
2010-08-24 09:31 UTC, Balamurugan Arumugam

Description Alex/AT 2010-06-22 10:56:59 UTC
The script in the description has the following lines commented out with #:

----------------------------------------------------------------------
#echo('Shared lock attempt.'."\n");
#flock($fh, LOCK_SH);
#echo('Locked as shared.'."\n");
#sleep(10);
----------------------------------------------------------------------

Uncomment them before running, or use this version:

----------------------------------------------------------------------
#!/usr/bin/php
<?php

$fh = fopen('gluster.test', 'ab+');
echo('Opened.'."\n");
sleep(2);
#echo('Shared lock attempt.'."\n");
#flock($fh, LOCK_SH);
#echo('Locked as shared.'."\n");
#sleep(10);
echo('Exclusive lock attempt.'."\n");
flock($fh, LOCK_EX);
echo('Locked exclusively.'."\n");
sleep(10);
flock($fh, LOCK_UN);
echo('Unlocked.'."\n");
sleep(2);
fclose($fh);
echo('Closed.'."\n");
sleep(1);

?>
----------------------------------------------------------------------

Comment 1 Alex/AT 2010-06-22 10:58:06 UTC
Oops, I posted the same script again. Here is the correct one:

----------------------------------------------------------------------
#!/usr/bin/php
<?php

$fh = fopen('gluster.test', 'ab+');
echo('Opened.'."\n");
sleep(2);
echo('Shared lock attempt.'."\n");
flock($fh, LOCK_SH);
echo('Locked as shared.'."\n");
sleep(10);
echo('Exclusive lock attempt.'."\n");
flock($fh, LOCK_EX);
echo('Locked exclusively.'."\n");
sleep(10);
flock($fh, LOCK_UN);
echo('Unlocked.'."\n");
sleep(2);
fclose($fh);
echo('Closed.'."\n");
sleep(1);

?>
----------------------------------------------------------------------

Now it's right for sure. Sorry for the mess above; it's the end of the working day here.

Comment 2 Alex/AT 2010-06-22 13:55:16 UTC
I am trying to put 3.0.4 into production, but I have hit a showstopper bug.

My config is:
* CentOS 5.5 x86_64 (vanilla up-to-date 2.6.18 kernel), custom FUSE module and userspace built from fuse-2.7.4glfs11-1.tar.gz, vanilla GlusterFS from the glusterfs-*-3.0.4-1.x86_64 RPMs.

The bug is: when a file lock is upgraded from LOCK_SH to LOCK_EX while another node still holds LOCK_SH, we get a deadlock on the client, after which the file can never be locked with LOCK_EX again (any further LOCK_EX attempt deadlocks).

This also happens sometimes when we upgrade the lock on just one node. The script deadlocks and never exits.

Side effect: after the deadlock, only umount --force and fusermount -u can unmount the file system. A regular umount says the file system is busy even if I manage to kill the test scripts.

How to repeat: use this script

----------------------------------------------------------------------
#!/usr/bin/php
<?php

$fh = fopen('gluster.test', 'ab+');
echo('Opened.'."\n");
sleep(2);
#echo('Shared lock attempt.'."\n");
#flock($fh, LOCK_SH);
#echo('Locked as shared.'."\n");
#sleep(10);
echo('Exclusive lock attempt.'."\n");
flock($fh, LOCK_EX);
echo('Locked exclusively.'."\n");
sleep(10);
flock($fh, LOCK_UN);
echo('Unlocked.'."\n");
sleep(2);
fclose($fh);
echo('Closed.'."\n");
sleep(1);

?>
----------------------------------------------------------------------

Run it on two nodes with a slight delay (1-2 sec), and you'll get the first deadlock once both nodes reach the "Exclusive lock attempt" state. CTRL-C can break the scripts, but the damage is already done: run the script again on any one node (or on both), and after the next "Exclusive lock attempt" you'll get a complete deadlock, with the script going defunct. umount --force followed by fusermount -u helps, though.

----------------------------------------------------------------------

This is the config I use (passwords removed, IPs changed, volume names changed):

First server:
----------------------------------------------------------------------
volume a1_posix
  type storage/posix
  option directory /glusterfs/a0
  option background-unlink yes
end-volume

volume a1
    type features/locks
    subvolumes a1_posix
end-volume

volume a1_server
  type protocol/server
  option transport-type tcp
  option transport.socket.listen-port 6996
  option auth.addr.bigweb_ttknw.allow 10.1.1.*
  subvolumes a1
end-volume
----------------------------------------------------------------------

The second server's config is identical; only the names are "a2" instead of "a1".

Client:
----------------------------------------------------------------------
volume a1
  type protocol/client
  option transport-type tcp
  option remote-host 10.1.1.2
  option remote-port 6996
  option remote-subvolume a1
end-volume

volume a2
  type protocol/client
  option transport-type tcp
  option remote-host 10.1.1.4
  option remote-port 6996
  option remote-subvolume a2
end-volume

volume a0
  type cluster/replicate
  subvolumes a1 a2
end-volume
----------------------------------------------------------------------

Comment 3 Alex/AT 2010-07-21 08:28:09 UTC
Reporting in: GlusterFS 3.0.5 still has that problem.

Comment 4 Balamurugan Arumugam 2010-08-24 09:31:32 UTC
Created attachment 295


The tarball contains sample ctdb setup configuration files. These files need to be modified to match the test environment.

Comment 5 Amar Tumballi 2010-10-05 06:01:19 UTC
Most of the self-heal (replicate-related) bugs are now fixed in the 3.1.0 branch. As we are just a week away from the GA release, we would like you to test this particular bug against the 3.1.0 RC releases and let us know whether it is fixed.

Comment 6 Vijay Bellur 2010-10-05 09:34:59 UTC
PATCH: http://patches.gluster.com/patch/5285 in master (features/locks: Handle lock upgrade and downgrade properly in locks.)

Comment 7 Vijay Bellur 2010-10-18 07:25:57 UTC
PATCH: http://patches.gluster.com/patch/5507 in release-3.0 (features/locks: Handle upgrade/downgrade of locks properly.)

Comment 8 Alex/AT 2010-10-20 16:20:38 UTC
The GlusterFS 3.1 release still fails on this, even patched.

How to repeat:

1. Create one replicated volume (2 replicas) according to the manual
2. Put the test script on the volume
3. Run the script simultaneously on both nodes (with a 1-2 second interval)
4. Watch the script hang (even SIGKILL fails)

Comment 9 Alex/AT 2010-10-20 16:25:19 UTC
It's better than before: the FS can be unmounted with -f after a few tries, and the scripts terminate when one of the affected nodes unmounts the FS. But it still deadlocks on a shared-to-exclusive upgrade if the second node tries to do the same at the same time.

Comment 10 Vijay Bellur 2010-10-21 14:00:43 UTC
PATCH: http://patches.gluster.com/patch/5552 in release-3.0 (cluster/afr: Do a broadcast unlock in replicate to eliminate deadlock during upgrade/downgrade.)

Comment 11 Alex/AT 2010-10-21 16:22:16 UTC
Applied patch 5552 to 3.1 with a small filename change, ignoring whitespace differences.

The test passes now, but... it really fails. Look:

1. First node locks file as SHARED
2. Second node locks file as SHARED, this is correct
3. First node attempts to lock file as EXCLUSIVE and it waits for the shared lock to be removed.
4. Second node attempts to lock file as EXCLUSIVE, and then everything goes wrong...

Second node is allowed to take EXCLUSIVE lock then. It must not be, because first node still holds the SHARED lock.

Why this is wrong: the first node expects the file to remain unmodified between its SHARED and EXCLUSIVE locks.

Yes, there may be a deadlock between the applications, but it must not be in the filesystem core. The filesystem must not hang, and it must allow the processes to be terminated, at least with SIGKILL.

In real life, processes fall back if a lock upgrade is ungrantable. The filesystem must not hang; it should just report that such a lock is ungrantable at the time of the request.
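For illustration, here is a minimal sketch of that fallback pattern (this is mine, not from the original report), assuming the application requests the upgrade with LOCK_NB so that flock() returns instead of blocking:

----------------------------------------------------------------------
#!/usr/bin/php
<?php

$fh = fopen('gluster.test', 'ab+');
flock($fh, LOCK_SH);                    // take the shared lock first
echo('Locked as shared.'."\n");

if (flock($fh, LOCK_EX | LOCK_NB)) {    // non-blocking upgrade attempt
    echo('Upgraded to exclusive.'."\n");
    // ... modify the file safely here ...
} else {
    // Upgrade not grantable right now. Per flock(2), a failed
    // conversion may already have dropped the original shared lock,
    // so release everything and retry from scratch later instead of
    // blocking inside the filesystem.
    echo('Upgrade ungrantable, falling back.'."\n");
}
flock($fh, LOCK_UN);
fclose($fh);

?>
----------------------------------------------------------------------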

---

Okay. We now have a semi-working configuration, which is good enough, but it can lead to corrupted files when some lock upgrades are wrongly confirmed as granted. I will also post a second test in a while.

Comment 12 Alex/AT 2010-10-21 16:29:20 UTC
Hmm. Here is another test suite:

--------------------------------------------------------------------------

test2-1.php

--------------------------------------------------------------------------

#!/usr/bin/php
<?php

@unlink('gluster.test');
$fh = fopen('gluster.test', 'ab+');
echo('Opened.'."\n");
sleep(2);
echo('Shared lock attempt.'."\n");
flock($fh, LOCK_SH);
echo('Locked as shared.'."\n");
sleep(10);
echo('Exclusive lock attempt.'."\n");
flock($fh, LOCK_EX);
echo('Locked exclusively.'."\n");
sleep(10);
flock($fh, LOCK_UN);
echo('Unlocked.'."\n");
sleep(2);
fclose($fh);
echo('Closed.'."\n");
sleep(1);
echo('Result: '.file_get_contents('gluster.test')."\n");

?>

--------------------------------------------------------------------------

test2-2.php

--------------------------------------------------------------------------

#!/usr/bin/php
<?php

$fh = fopen('gluster.test', 'ab+');
echo('Opened.'."\n");
sleep(2);
echo('Exclusive lock attempt.'."\n");
flock($fh, LOCK_EX);
echo('Locked exclusively.'."\n");
sleep(10);
ftruncate($fh, 0);
fwrite($fh, 'WRONG!');
flock($fh, LOCK_UN);
echo('Unlocked.'."\n");
sleep(2);
fclose($fh);
echo('Closed.'."\n");
sleep(1);

?>

Comment 13 Alex/AT 2010-10-21 16:36:20 UTC
One mistake.

Please place
echo('Result: '.file_get_contents('gluster.test')."\n");

just after
sleep(10);

in the code.
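
Assuming the correction applies to test2-1.php and to the sleep(10) that follows the exclusive lock (which would match step 4 of the expected sequence below), the modified script would read:

----------------------------------------------------------------------
#!/usr/bin/php
<?php

@unlink('gluster.test');
$fh = fopen('gluster.test', 'ab+');
echo('Opened.'."\n");
sleep(2);
echo('Shared lock attempt.'."\n");
flock($fh, LOCK_SH);
echo('Locked as shared.'."\n");
sleep(10);
echo('Exclusive lock attempt.'."\n");
flock($fh, LOCK_EX);
echo('Locked exclusively.'."\n");
sleep(10);
echo('Result: '.file_get_contents('gluster.test')."\n");  // moved here
flock($fh, LOCK_UN);
echo('Unlocked.'."\n");
sleep(2);
fclose($fh);
echo('Closed.'."\n");
sleep(1);

?>
----------------------------------------------------------------------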

----------------------------------------------------------------------

How to use:

1. Place test2-1.php and test2-2.php onto GlusterFS volume.
2. Run test2-1.php on one node.
3. Wait about 1-2 seconds.
4. Run test2-2.php on another node.
5. Wait for the result.

----

What must be:

1. Node 1 locks the file as SHARED.
2. Node 2 attempts to lock the file as EXCLUSIVE. It must wait, because there is a shared lock on the file.
3. Node 1 upgrades its lock to EXCLUSIVE. It is permitted to do so, because no other locks are held on the file.
4. Node 1 prints the file contents.
5. Node 1 unlocks the file.
6. Node 2 is allowed to do its job.

What happens currently:

1. Node 1 locks the file as SHARED.
2. Node 2 attempts to lock the file as EXCLUSIVE. It must wait, because there is a shared lock on the file.
3. Node 1 tries to upgrade its lock to EXCLUSIVE.
4. Oops: Node 2 is allowed to take the lock and do as it wishes... this is bad; it corrupts the file with "WRONG!"
5. Node 2 unlocks the file.
6. Node 1 is allowed to take the lock.
7. Node 1 prints the WRONG file contents.
8. Node 1 unlocks the file.

Comment 14 Alex/AT 2010-10-21 16:40:52 UTC
In real life, such lock upgrades are rare, because they pose a deadlock danger. I still think this patch is a must as a start, so we can take our time debugging the locking. Alas, I'm not too well versed in GlusterFS internals, but I will try to understand them in the meantime so I can (maybe) be of help.

Comment 15 Anand Avati 2010-11-18 10:56:13 UTC
PATCH: http://patches.gluster.com/patch/5742 in master (features/locks: Send prelock unlock only if it is not grantable and is a blocking lock call.)

Comment 16 Alex/AT 2010-11-30 07:14:54 UTC
Reporting in:

The test from Comment 2 still fails in GlusterFS 3.1.1. It hangs on exclusive locking.

Comment 17 Amar Tumballi 2011-04-25 09:33:05 UTC
Please update the status of this bug, as it has been more than 6 months since it was filed (bug id < 2000).

Please close it with the proper resolution if it is no longer valid. If it is still valid but not critical, move it to 'enhancement' severity.

Comment 18 Dave Garnett 2011-09-26 12:01:42 UTC
A Pivotal Tracker story has been created for this Bug: http://www.pivotaltracker.com/story/show/18852795

Comment 19 Dave Garnett 2011-09-26 12:53:22 UTC
Dave Garnett deleted the linked story in Pivotal Tracker

Comment 20 Alex/AT 2011-10-10 08:04:13 UTC
GlusterFS 3.2.4 (FUSE) fails the script in Comment 2.

Comment 21 Alex/AT 2011-10-10 08:09:51 UTC
3.2.4 also hangs on that script, leaving the script process defunct.
Killing the glusterfs processes on one of the nodes resolves the hang on the other node.
Something is still dead wrong in the locking code.

Comment 22 Pranith Kumar K 2012-05-28 11:24:43 UTC
hi Alex,
     I tested the deadlock from comment 2 on the release-3.3 branch, and the deadlock did not happen.
I also want to test the use case in comments 12 and 13. Could you tell me whether your suggestion in comment 13 is for test2-1.php or test2-2.php?

Could you attach the files instead of copy/pasting them?

If you want to play with the new changes in locks xlator yourself, use:

http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz

Pranith.

Comment 23 Alex/AT 2012-05-28 12:52:02 UTC
Yes, I would like to check it again.

I will check in a short while (it may take one to two days to build the test case), then I will post the test results along with the test scripts used.

Thanks for your assistance.

Comment 24 Pranith Kumar K 2012-05-28 13:31:15 UTC
Thanks Alex.
Could you give me the scripts you used in comments 12 and 13, so that I can test that case while you build the other test cases?

Pranith.

Comment 25 Pranith Kumar K 2012-06-11 10:52:48 UTC
Alex,
      I am moving this bug to MODIFIED state, as the deadlock from the first test case no longer happens on 3.3. Please feel free to open new bugs if you come up with new cases that cause problems.

Comment 26 Raghavendra Bhat 2012-06-12 10:10:56 UTC
Checked with glusterfs-3.3.0. Executed the PHP script mentioned above from 2 clients, and the tests completed on both sides.

Comment 27 Alex/AT 2012-09-13 09:16:31 UTC
Sorry for the long absence; I had some health issues to cope with.

Tested script #1 (at the end of this message) on 3.3.0, running two copies of the script on different hosts in the same mount directory, started within a 1-3 second interval.

Result: locking succeeds, but the workflow is impaired.

When both scripts race for the exclusive lock, the glusterfs FUSE mount blocks completely until one of the processes is SIGKILL'ed.

Normal behavior would be for SIGHUP/SIGTERM to reach the process during lock waits as well.

Otherwise, locking works as desired. Non-blocking locks work. Shared/exclusive mechanics work.

Verdict: usable, but the signal handling during lock waits needs some love.

--------------------------------------------------------------------------

Script #1

#!/usr/bin/php
<?php

$fh = fopen('gluster.test', 'ab+');
echo('Opened.'."\n");
sleep(2);
echo('Shared lock attempt.'."\n");
flock($fh, LOCK_SH);
echo('Locked as shared.'."\n");
sleep(10);
echo('Exclusive lock attempt.'."\n");
flock($fh, LOCK_EX);
echo('Locked exclusively.'."\n");
sleep(10);
flock($fh, LOCK_UN);
echo('Unlocked.'."\n");
sleep(2);
fclose($fh);
echo('Closed.'."\n");
sleep(1);

?>

--------------------------------------------------------------------------

Script #2

Comment 28 Pranith Kumar K 2012-09-13 09:32:03 UTC
Welcome back, and thanks for the verification.
Could you explain a bit more about the signal handling part?

Comment 29 Alex/AT 2012-09-14 04:01:29 UTC
Yes.

When waiting for a lock (or racing for one) on regular filesystems, SIGINT/SIGHUP/SIGTERM and other signals reach the process waiting for the lock.

When waiting for a lock on GlusterFS, these signals do not reach the waiting process, so the process can never be signalled by external means. I found that SIGKILL now works, but SIGKILL is not exactly a regular signal.

It's not fatal for everyday ops, but it prevents terminating the process, e.g. from the console (SIGINT) when debugging.
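
A minimal sketch of the kind of check I mean (mine, and it assumes PHP's pcntl extension is available): on a regular filesystem, pressing CTRL-C while the second flock() blocks runs the handler; on the GlusterFS mount it never does.

----------------------------------------------------------------------
#!/usr/bin/php
<?php

declare(ticks=1);                 // let pending signal handlers run

// Install a SIGINT handler; restart_syscalls=false so the blocked
// flock() call is interrupted instead of being silently restarted.
pcntl_signal(SIGINT, function ($signo) {
    echo('Got SIGINT while waiting for the lock.'."\n");
    exit(1);
}, false);

$fh = fopen('gluster.test', 'ab+');
flock($fh, LOCK_SH);
echo('Holding LOCK_SH; blocking on LOCK_EX now, try CTRL-C.'."\n");
flock($fh, LOCK_EX);              // run a second copy on another node to force a wait
echo('Got LOCK_EX.'."\n");
flock($fh, LOCK_UN);
fclose($fh);

?>
----------------------------------------------------------------------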

