On 3/14 at 19:09 the customer disconnected the active node from the network/storage. You can see the secondary node attempt to take the files for writing but receive "Try again" messages until the files become available. After a few tries the secondary is able to acquire a lock and begin writing to the files. Fast forward 5 minutes to 19:14 and you can see the following messages when they brought the original primary server back on the network.

[2019-03-14 19:14:26.377957] I [addr.c:55:compare_addr_and_update] 0-/bricks/brick1/brick: allowed = "*", received addr = "xyz"
[2019-03-14 19:14:26.378010] I [MSGID: 115029] [server-handshake.c:564:server_setvolume] 0-dis-rep-server: accepted client from xyz (version: 3.12.2) with subvol /bricks/brick1/brick

After the above messages confirming the client's reconnection, you will see in the logs that both clients are now able to write to the files. The expected result is that the secondary now owns the locks and the original primary receives the "Try again" messages until it is either taken out of an active state, or the secondary goes down or back into standby and gives up the locks.

They do see the desired behavior when only the network between the two clients is interrupted but the primary is not disconnected from storage. In that case the secondary goes into an active state, but it just scrolls "Try again" in its log file forever and is never able to acquire the lock while the other client is still active.

My understanding of the sequence of events (confirmed by the customer):

1. Client 1 takes a lock on a file on volume dis-rep, which is locally fuse mounted on the client.
2. Client 1 goes offline.
3. Client 2 takes the lock from the now-defunct Client 1.
4. Client 1 comes back online and is able to acquire a lock while the file is already locked by Client 2.
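For context, the locking pattern described above boils down to a non-blocking fcntl() write lock with a retry loop; a minimal sketch follows. The file path, retry interval and log message are assumptions for illustration, not taken from the customer's application.

/* Minimal sketch of the locking pattern described above: each client tries
 * to take an exclusive fcntl() lock non-blockingly and retries while the
 * other holder causes EAGAIN ("Try again"). Path and sleep interval are
 * illustrative assumptions. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/dis-rep/datafile", O_RDWR);   /* assumed mount point */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {0};
    fl.l_type = F_WRLCK;        /* exclusive write lock        */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 = lock the whole file     */

    /* Non-blocking attempt; loop while the peer still holds the lock. */
    while (fcntl(fd, F_SETLK, &fl) == -1) {
        if (errno != EAGAIN && errno != EACCES) {
            perror("fcntl(F_SETLK)");
            close(fd);
            return 1;
        }
        fprintf(stderr, "Try again\n");   /* matches the message in the logs */
        sleep(1);
    }

    /* Lock acquired: this client is now the active writer. */
    printf("lock acquired, writing...\n");
    /* ... write to the file ... */

    close(fd);
    return 0;
}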
--- Additional comment from Krutika Dhananjay on 2019-03-20 07:07:24 UTC ---

Adding protocol/client maintainer to thread.

Raghavendra,

First, some context on the case -

<context>
The customer runs an application which has two glusterfs fuse clients, only one of which is supposed to be writing to the files at a given time. This is ensured by both clients attempting to acquire fcntl() locks on the files; whichever of the two clients wins the locks gets to write to them.

Client-1 initially acquires the locks and continues to do IO. During this time, any attempt by client-2 to acquire fcntl locks on these same files is met with EAGAIN.

Client-1 disconnects at some point. (The process is still alive; only the network is disconnected.) Its locks are cleaned up by the locks translator on the server. Client-2 now succeeds at acquiring the fcntl locks and starts writing to the files.

At some point the network on client-1's end is restored and it reconnects with the bricks. But it still thinks that it owns the fcntl locks and writes to the files. Now we have a case where both client-1 and client-2 are writing to the same files in alternation. This is undesirable, as the expectation is that client-1 reacquires the fcntl locks before writing.
</context>

Here, client-1, which was initially holding the fcntl locks, marks the fds bad upon disconnect. So far so good. But upon reconnect, these fds are reopened post-handshake and marked "good" again, as per the logs here and the code that is executed (client_reopen_done) -

client-0:
=========
[2019-03-14 19:14:26.379756] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 2-dis-rep-client-0: Connected to dis-rep-client-0, attached to remote volume '/bricks/brick1/brick'.
[2019-03-14 19:14:26.379786] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 2-dis-rep-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-03-14 19:14:26.379810] I [MSGID: 114042] [client-handshake.c:1047:client_post_handshake] 2-dis-rep-client-0: 3 fds open - Delaying child_up until they are re-opened
[2019-03-14 19:14:26.382260] I [MSGID: 114041] [client-handshake.c:678:client_child_up_reopen_done] 2-dis-rep-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP

client-1:
=========
[2019-03-14 19:14:26.403372] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 2-dis-rep-client-1: Connected to dis-rep-client-1, attached to remote volume '/bricks/brick2/brick'.
[2019-03-14 19:14:26.403416] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 2-dis-rep-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-03-14 19:14:26.403436] I [MSGID: 114042] [client-handshake.c:1047:client_post_handshake] 2-dis-rep-client-1: 3 fds open - Delaying child_up until they are re-opened
[2019-03-14 19:14:26.405933] I [MSGID: 114041] [client-handshake.c:678:client_child_up_reopen_done] 2-dis-rep-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP

client-2:
=========
[2019-03-14 19:14:28.419255] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 2-dis-rep-client-2: Connected to dis-rep-client-2, attached to remote volume '/bricks/brick5/brick'.
[2019-03-14 19:14:28.419284] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 2-dis-rep-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2019-03-14 19:14:28.419303] I [MSGID: 114042] [client-handshake.c:1047:client_post_handshake] 2-dis-rep-client-2: 3 fds open - Delaying child_up until they are re-opened
[2019-03-14 19:14:28.421753] I [MSGID: 114041] [client-handshake.c:678:client_child_up_reopen_done] 2-dis-rep-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP

Knowing there were posix locks associated with these fds, shouldn't protocol/client be leaving the fds marked as bad, and perhaps force reacquisition of the posix locks, once back online?

-Krutika
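For clarity, here is a deliberately hypothetical sketch of the decision being questioned above. The type and function names (fd_state_t, reopen_fd_after_reconnect) are invented for illustration and do not correspond to the real protocol/client data structures or functions; the point is only that an fd which held POSIX locks before the disconnect would stay marked bad rather than being silently reopened.

/* Hypothetical sketch only: fd_state_t and reopen_fd_after_reconnect are
 * invented names used to illustrate the decision under discussion; they do
 * not match the actual protocol/client code. */
#include <stdio.h>

typedef struct {
    int remote_fd;         /* fd number on the brick                    */
    int is_bad;            /* marked bad when the connection drops      */
    int had_posix_locks;   /* POSIX locks were held at disconnect time  */
} fd_state_t;

/* Called for each tracked fd once the post-reconnect handshake completes. */
static void reopen_fd_after_reconnect(fd_state_t *fd)
{
    if (fd->had_posix_locks) {
        /* Proposed behaviour: leave the fd bad, so the application gets an
         * error and knows it must reopen the file and reacquire its locks,
         * which the brick has already cleaned up. */
        fd->is_bad = 1;
        return;
    }
    /* No locks were lost, so the fd can be reopened and marked good again. */
    fd->is_bad = 0;
}

int main(void)
{
    fd_state_t locked_fd = { .remote_fd = 3, .is_bad = 1, .had_posix_locks = 1 };
    fd_state_t plain_fd  = { .remote_fd = 4, .is_bad = 1, .had_posix_locks = 0 };

    reopen_fd_after_reconnect(&locked_fd);
    reopen_fd_after_reconnect(&plain_fd);

    printf("locked fd stays bad: %d, plain fd reopened: %d\n",
           locked_fd.is_bad, !plain_fd.is_bad);
    return 0;
}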
--- Additional comment from Raghavendra G on 2019-03-21 11:30:29 UTC ---

(In reply to Krutika Dhananjay from comment #6)
> Adding protocol/client maintainer to thread.
>
> Raghavendra,
....
> Knowing there were posix locks associated with these fds, shouldn't
> protocol/client be leaving the fds marked as bad, and perhaps force
> reacquisition of the posix locks, once back online?

You are right. Re-opening the fds on bricks (especially the ones which had POSIX locks acquired on them) leaves the application unaware of the fact that it has lost the locks. This is a bug.

On discussion with Anoop and Pranith we agreed that protocol/client, AFR and EC shouldn't reopen the fds if the locks are not guaranteed to be present on bricks. For afr and EC this means that if a quorum number of bricks was down at any single point in time, the fds that were open with locks on them at that point shouldn't be re-opened even after the bricks come back up. I'll be sending a fix to protocol/client to do this and would request the afr and EC teams to send fixes to these components as well.

Note that this fix will make sure applications receive an error (EBADFD) after disconnects have cleaned up locks on bricks. The fds have to be re-opened by the application and the locks re-acquired after such an event. This, I believe, will be cumbersome for applications (as disconnect events are asynchronous and are supposed to be transparent to the application as long as a quorum number of bricks is online at any point in time). To provide such transparent behaviour we might require a lock-healing feature in AFR/EC. I see some discussion to revive that topic on gluster-devel, and I guess efforts on that will be tracked in a different bz, github issue or email thread. The AFR/EC team can post relevant links to this thread for the benefit of others interested in this use-case.

regards,
Raghavendra
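To illustrate what this means for applications, below is a minimal sketch (reusing the locking loop from the earlier example; the path, retry policy and exact errno values checked are assumptions) of how a writer could react once a write fails because the fd went bad after a disconnect: close the fd, reopen the file and reacquire the fcntl lock before resuming IO.

/* Sketch of the application-side recovery described above: when a write on
 * a bad fd fails after a disconnect, reopen the file and reacquire the lock.
 * Path, retry policy and errno values checked are illustrative assumptions. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int open_and_lock(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct flock fl = {0};
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;     /* l_start = l_len = 0: lock the whole file */

    /* Retry while the other client still holds the lock (EAGAIN). */
    while (fcntl(fd, F_SETLK, &fl) == -1) {
        if (errno != EAGAIN && errno != EACCES) {
            close(fd);
            return -1;
        }
        sleep(1);
    }
    return fd;
}

int main(void)
{
    const char *path = "/mnt/dis-rep/datafile";   /* assumed mount point */
    int fd = open_and_lock(path);
    if (fd < 0)
        return 1;

    const char buf[] = "payload\n";
    for (;;) {
        if (write(fd, buf, sizeof(buf) - 1) == -1) {
            if (errno == EBADFD || errno == EBADF) {
                /* fd went bad after a disconnect: reopen and re-lock
                 * before resuming IO, instead of writing without locks. */
                close(fd);
                fd = open_and_lock(path);
                if (fd < 0)
                    return 1;
                continue;
            }
            perror("write");
            break;
        }
        sleep(1);
    }
    close(fd);
    return 0;
}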
REVIEW: https://review.gluster.org/22712 (protocol/client: don't reopen fds on which POSIX locks are held after a reconnect) merged (#5) on master by Raghavendra G
Patch is merged, closing this bug now.