Bug 1341429 - successful mkdir from "bad" subvolume should be ignored while propagating result to higher layer
Summary: successful mkdir from "bad" subvolume should be ignored while propagating res...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1341435
TreeView+ depends on / blocked
 
Reported: 2016-06-01 04:31 UTC by Raghavendra G
Modified: 2019-05-09 10:16 UTC (History)
2 users (show)

Fixed In Version: glusterfs-6.x
Clone Of:
: 1341435 (view as bug list)
Environment:
Last Closed: 2019-05-09 10:16:43 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Raghavendra G 2016-06-01 04:31:28 UTC
Description of problem:

For Preop checks like [1], dentry operations like mkdir etc would rely on xattrs on parent (like dht layout). However, a "bad" subvolume of afr cannot make correct decision during preop check as the xattrs are not guaranteed to be correct. Imagine the following scenario:
a. mkdir succeeds on non-readable subvolume
b. fails on readable subvolume (may be because of layout xattrs didn't match).

afr here would report mkdir as success to parent xlators (here dht). However, since non-readable subvolume is not guaranteed to have correct xattrs, mkdir shouldn't have succeeded. Worse still, if mkdir on readable subvolume had failed because preop check failed (client in memory layout xattrs and layout xattrs persisted on disk didn't match), we would be ignoring a genuine failure and instead transforming it as a success.

dht is planning to introduce pre-operation checks on all dentry and lock (posix-locks, inodelk, entrylk) operations. So, without some sort of support from lower layers on xattr consistency (as these preops primarily rely on xattrs) it is impossible for a correct implementation.

[1] http://review.gluster.org/13885


Version-Release number of selected component (if applicable):
mainline, 3.7.11, 3.8

How reproducible:
bug was filed as a result of design discussion.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Raghavendra G 2016-06-06 10:43:31 UTC
Following is the transcript of discussion I had with Xavi discussing whether this problem is present in EC. The conclusion was, EC doesn't suffer from this problem.

15:07 < raghug> xavih: my concern with patch #13885 is that, pre-op checks (xattr comparision) is done only at storage/posix
15:07 < raghug> but in clustered scenario like afr/ec, the brick (by itself) doesn't know the "correctness" of xattrs present on it
15:08 < raghug> So, whatever the decision the brick makes, can't be trusted
15:08 < xavih> raghug: yes, it only knows its own state, and it should act depending on it
15:09 < raghug> xavih: yes. 
15:09 < raghug> xavih: consider this scenario
15:09 < xavih> raghug: in this case afr/ec will combine the answers as appropriate and give a "good" answer to the upper layer. Bricks not matching the "good" answer will be marked as bad
15:12 < raghug> xavih: just for clarification, here is an example I gave out in the bug report of bz 1341429
15:12 < raghug> [#gluster] xavih: what about rollback? If the posix has already created the directory, but ec/afr considers the answer as bad, we need to delete the directory
15:12 < raghug> raghug: sorry ignore last statement
15:12 < raghug> For Preop checks like [1], dentry operations like mkdir etc would rely on xattrs on parent (like dht layout). However, a "bad" subvolume of afr cannot make correct decision during preop check as the xattrs are not guaranteed to be correct. Imagine the following scenario:
15:12 < raghug> a. mkdir succeeds on non-readable subvolume
15:12 < raghug> b. fails on readable subvolume (may be because of layout xattrs didn't match).
15:12 < raghug> afr here would report mkdir as success to parent xlators (here dht). However, since non-readable subvolume is not guaranteed to have correct xattrs, mkdir shouldn't have succeeded. Worse still, if mkdir on readable subvolume had failed because preop check failed (client in memory layout xattrs and layout xattrs persisted on disk didn't match), we would be ignoring a genuine failure and instead transforming it as a success.
15:13 < raghug> you can use readable/good interchanged
15:13 < raghug> same with non-readable/bad subvols
15:13 < xavih> raghug: as I understand the changes in storage/posix, mkdir *cannot* succeed on a bad brick
15:15 < xavih> raghug: if the xattr check fails, it returns an error without creating the directory
15:15 < raghug> xavih: what if a "bad" brick has the "stale" xattrs coincedentally cached by client also?
15:16 < raghug> it can happen with frequent changes of xattrs I suppose. thinking for a concrete example..
15:17 -!- atinm [atinm@nat/redhat/x-iofeywgqrejhtcuh] has joined #gluster
15:17 < xavih> raghug: do you mean that directories with the same xattr can be good and bad ? shouldn't that xattr value be unique in some way ?
15:17 < xavih> raghug: otherwise this preop check seems weak...
15:19 < raghug> For simplicity, Lets consider the case of 3 way replica b1, b2 and b3
15:19 < xavih> raghug: ok
15:19 < raghug> Lets say a client c1 has cached xattr value x1
15:20 < xavih> raghug: when you say "cached" do you mean that it has stored that value on the three bricks ?
15:20 < raghug> xavih: no. The client has stored the value in its memory
15:20 < raghug> in the inode-ctx of directory/file lets say
15:21 < xavih> raghug: without sending the request to the bricks and check the answer ?
15:21 < xavih> raghug: I think this is not possible
15:22 < raghug> xavih: it would have sent the request (say a getxattr/lookup). But lets say there is a future fop (say mkdir) that relies on this xattr. In this time window, there is a possibility that this value can change on brick
15:22 < raghug> that is the "staleness" 13855 is addressing at
15:23 < xavih> raghug: yes, this is possible, but I still don't see the problem...
15:23 < raghug> yes.. here is the scenario
15:24 < raghug> 15:19 < raghug> For simplicity, Lets consider the case of 3 way replica b1, b2 and b3
15:24 < raghug> 15:19 < xavih> raghug: ok
15:24 < raghug> 15:19 < raghug> Lets say a client c1 has cached xattr value x1
15:24 < raghug> are you ok till now?
15:25 < raghug> so, we've three bricks b1, b2 and b3
15:25 < raghug> all three part of a replica
15:25 < raghug> Now imagine a client c1 has read an xattr with value x1 and it has cached that value in its memory
15:25 < xavih> raghug: I don't understand the xattr being cached. AFAIK netither afr nor ec cache regular xattrs beyong those needed by its own internal use...
15:26 < xavih> raghug: anyway, let's assume it's cached...
15:26 < raghug> xavih: dht is caching them
15:26 < xavih> raghug: because dht needs it...
15:26 < raghug> the xattr being cached is layout of the directory
15:26 < raghug> xavih: yes
15:26 < raghug> xavih: any doubts till now?
15:26 < xavih> raghug: afr and ec don't need it, so they don't cache it
15:27 < xavih> raghug: no, it's ok
15:27 < raghug> xavih: ok. So, now another client c2 changes the same xattr value to x2. But it succeeds only on b2 and b3
15:27 < xavih> raghug: ok
15:27 < raghug> so, b1 has x1. b2 and b3 have x2 
15:28 < raghug> also dht on client c1 has cached value x1 
15:28 < xavih> raghug: ok
15:29 < raghug> so now when dht (from client c1) tries to do a mkdir using the same xattr, mkdir succeeds on "bad" brick b1 but fails on b2 and b3
15:29 < xavih> raghug: in that case (I will speak for ec, not so sure for afr), when dht on c1 tries to create a directory it will receive an EIO error with the xattr indicating that preop failed
15:30 < xavih> raghug: from the ec side, it will see b2 and b3 failing, and b1 succeeding
15:30 < xavih> raghug: b1 will be marked as bad, and self-heal will take care of it
15:31 < raghug> xavih: yes. With Quorum this problem is mitigated
15:31 < raghug> But with two way replica, this would be a problem
15:31 < raghug> I was also worried about "rollback"
15:31 < raghug> (here deleting directory from b1)
15:32 < xavih> raghug: I can't talk for afr. Pranith should know better
15:32 < xavih> raghug: the removal of directory will happen, but it could be delayed
15:32 < raghug> xavih: thanks. Let me think over what we discussed. Will get back to you if I find any issues. Thanks :)
15:33 < xavih> raghug: anyway, since the parent directory has been marked as bad, and future readdir won't read contents from that brick
15:34 < xavih> raghug: so the "bad" newly created directory won't be visible to the user (nor dht)
15:34 < raghug> xavih: ok. got it
15:35 < raghug> xavih: what if there is a quorum of "bad" bricks?
15:35 < xavih> raghug: this self-heal will also heal the "outdated" xattr that caused the incorrect creation of the directory in the first place
15:35 < xavih> raghug: there cannot be a quorum of "bad" bricks
15:35 < raghug> in this case lets say for some reason b2 also had the same "stale" xattr and mkdir succeeded on it
15:36 < raghug> xavih: hmm..
15:36 < xavih> raghug: if there were a quorum of bad bricks, it means that another client has failed on more than a half of the bricks
15:36 < raghug> then, I suppose that op itself is a failure
15:36 < raghug> (from another client)
15:36 < xavih> raghug: bad in that case, the "bad" bricks will be the ones that have succeeded the mkdir on the other client
15:37 < xavih> raghug: following your example, to try to have a majority of "bad" bricks, c2 should have failed on b1 and b2, and succeded only on b3, right ?
15:37 < raghug> xavih: yes
15:38 < xavih> raghug: in that case, c2 will receive an error, indicating that the mkdir has failed
15:38 < raghug> ok
15:38 < xavih> raghug: this means that the good bricks are really b1 and b2
15:38 < raghug> xavih: yes. Got it
15:39 < xavih> raghug: so c1 won't see the as bad, and it will succeed on them without problems
15:39 < xavih> raghug: everything is consistent on both clients
15:39 < raghug> xavih: on a slightly unrelated issue, I am planning to move the locking to bricks as future work. So, EC wouldn't be witnessing lock that synchronizes modification of the xattr (this lock is again taken by dht). Do you think there would be any problem with that?
15:39 < raghug> (ref: patch 13885)
15:41 < raghug> with this lock, no dht will be able to modify xattr for the duration of lock
15:41 < raghug> (no dht from other clients)
15:41 < xavih> raghug: that might be problematic
15:41 < raghug> xavih: ...
15:44 < xavih> raghug: ec takes the necessary inodelks to guarantee integrity, but if you move the locking from dht to the bricks, it means that two clients could try to change the same xattr. EC guarantees that all bricks will be updated in the same order, but it cannot guarantee that a lock taken after it (in the brick) will be honored
15:44 < xavih> raghug: how do you plan to move locking to the bricks ? maybe this could be used be ec as well.
15:44 < xavih> raghug: I doubt this could be transparent to ec
15:45 < raghug> xavih: dht's directory self-heal takes lock on _all_ its subvols
15:46 < raghug> so, a lock on one subvol would prevent self-heal
15:46 < raghug> but the catch is whether an EC subvol would choose the same brick as lock server
15:47 < raghug> In the context of #13885..
15:47 < raghug> mkdir would've wound to b1, b2 and b3
15:48 < xavih> raghug: from the point of view of DHT, an EC subvolume should be seen as a single brick, however if a feature crosses EC "hidden" in an xdata, special care must be taken...
15:48 < raghug> and all three bricks would 1. acquire inodelk in the same brick 2. do pre-op check (xattr comparision). 3. release inodelk
15:49 < raghug> note that each brick is aquiring an inodelk only on itself
15:49 < xavih> raghug: in this special case for mkdir, it shouldn't be a problem, but the locking issue should be analyzed again
15:50 < xavih> raghug: oh, I see
15:50 < raghug> xavih: its not just mkdir. We've plans to expand this logic to all dentry operations that depend on layout (create, unlink, rename, rmdir etc)
15:50 < xavih> raghug: I should analyze it deeper. At first glance, I think there could be some deadlocks
15:50 < raghug> xavih: no problem. We can meet later to discuss
15:51 < raghug> Its sufficient for now that you understand the scope of the problem/solution we are trying to implement in dht
15:51 < xavih> raghug: when you have something more detailed, I can look at it.
15:52 < raghug> xavih: roughly you expect patches similar to #13885 for other dentry operations
15:52 < raghug> create, link, unlink, rmdir, symlink, mknod etc
15:53 < xavih> raghug: good. I don't foresee any problem with them, but you can add me as a reviewer so that I'll be able to check each of them
15:53 < raghug> there is also special case of lk, which I think we can take up once you are clear about dentry ops
15:54 < raghug>  when lk is issued on a directory, dht has to choose a "lock-server", which all the clients agree upon
15:54 < raghug> (as directory is present on all subvols)
15:54 < raghug> we are planning to have an "hashed-subvol" as the lock-server
15:54 < raghug> and this hashed-subvol is dependent on layout xattr
15:55 < xavih> raghug: wouldn't this cause problems if that subvol dies ?
15:55 < raghug> xavih: yes. its a problem
15:55 < raghug> xavih: ideally we should acquire lock on _all_ subvols
15:55 < xavih> raghug: yes
15:55 < raghug> but its complicated and poses scalability issues
15:55 < raghug> so, I assume that's a tradeoff we've to live with
15:56 < xavih> raghug: and using a subset of the subvolumes ? (2 or more)
15:56 < raghug> xavih: yes, that's also a variant we are thinking of
15:56 < raghug> but bare minimum would be to agree on at-least one subvol as "lock-server"
15:57 < xavih> ok
15:57 < raghug> the current code is broken there too
15:57 < raghug> and can sometimes lead to different "lock-server(s)" even when all subvols are up
15:58 < raghug> note that these are corner cases

Comment 2 Amar Tumballi 2019-05-09 10:16:43 UTC
DHT now has entrylk and hence this can be treated as done. (glusterfs-6.x)


Note You need to log in before you can comment on or make changes to this bug.