Description of problem:

For pre-op checks like [1], dentry operations such as mkdir rely on xattrs on the parent (e.g. the dht layout). However, a "bad" subvolume of afr cannot make a correct decision during the pre-op check, as its xattrs are not guaranteed to be correct. Imagine the following scenario:

a. mkdir succeeds on the non-readable subvolume.
b. mkdir fails on the readable subvolume (for example because the layout xattrs didn't match).

afr here would report the mkdir as a success to its parent xlators (here dht). However, since the non-readable subvolume is not guaranteed to have correct xattrs, the mkdir shouldn't have succeeded. Worse still, if the mkdir on the readable subvolume failed because the pre-op check failed (the layout xattrs cached in client memory and the layout xattrs persisted on disk didn't match), we would be ignoring a genuine failure and transforming it into a success.

dht is planning to introduce pre-operation checks on all dentry and lock (posix-locks, inodelk, entrylk) operations. So, without some sort of support from the lower layers on xattr consistency (as these pre-ops primarily rely on xattrs), a correct implementation is impossible.

[1] http://review.gluster.org/13885

Version-Release number of selected component (if applicable):
mainline, 3.7.11, 3.8

How reproducible:
The bug was filed as a result of a design discussion.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
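For illustration only, below is a minimal sketch of the kind of server-side pre-op check being discussed. It is not the code from [1]; the function name, the xattr key handling and the error codes are assumptions made for the sketch.

/*
 * Minimal sketch (not the actual #13885 implementation) of a server-side
 * pre-op check: the client sends its in-memory copy of the parent
 * directory's layout xattr along with the dentry operation, and the brick
 * compares it against the value persisted on disk before doing the mkdir.
 */
#include <errno.h>
#include <string.h>
#include <sys/xattr.h>

#define LAYOUT_XATTR_KEY "trusted.glusterfs.dht"
#define MAX_XATTR_LEN    4096

/* Return 0 if the client's cached layout matches the layout stored on the
 * parent directory of this brick; -1 (with an errno-style code in
 * *op_errno) otherwise, so that the dentry op is never attempted here. */
static int
preop_check_parent_layout(const char *parent_path, const void *client_layout,
                          size_t client_len, int *op_errno)
{
    char    disk_layout[MAX_XATTR_LEN];
    ssize_t disk_len;

    disk_len = lgetxattr(parent_path, LAYOUT_XATTR_KEY, disk_layout,
                         sizeof(disk_layout));
    if (disk_len < 0) {
        *op_errno = errno;
        return -1;
    }

    if ((size_t)disk_len != client_len ||
        memcmp(disk_layout, client_layout, client_len) != 0) {
        /* The client is operating on a stale layout: fail the pre-op so
         * mkdir/rmdir/rename etc. are rejected instead of being attempted
         * with an out-of-date layout. */
        *op_errno = ESTALE;
        return -1;
    }

    return 0;
}

The point of this bug is that a "bad" afr subvolume can pass exactly this kind of check with stale xattrs, so the decision made by a single brick cannot be trusted in isolation.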
Following is the transcript of a discussion I had with Xavi about whether this problem is present in EC. The conclusion was that EC doesn't suffer from this problem.

15:07 < raghug> xavih: my concern with patch #13885 is that pre-op checks (xattr comparison) are done only at storage/posix
15:07 < raghug> but in a clustered scenario like afr/ec, the brick (by itself) doesn't know the "correctness" of xattrs present on it
15:08 < raghug> So, whatever decision the brick makes can't be trusted
15:08 < xavih> raghug: yes, it only knows its own state, and it should act depending on it
15:09 < raghug> xavih: yes.
15:09 < raghug> xavih: consider this scenario
15:09 < xavih> raghug: in this case afr/ec will combine the answers as appropriate and give a "good" answer to the upper layer. Bricks not matching the "good" answer will be marked as bad
15:12 < raghug> xavih: just for clarification, here is an example I gave out in the bug report of bz 1341429
15:12 < raghug> [#gluster] xavih: what about rollback? If the posix has already created the directory, but ec/afr considers the answer as bad, we need to delete the directory
15:12 < raghug> raghug: sorry ignore last statement
15:12 < raghug> For pre-op checks like [1], dentry operations like mkdir would rely on xattrs on the parent (like the dht layout). However, a "bad" subvolume of afr cannot make a correct decision during the pre-op check as the xattrs are not guaranteed to be correct. Imagine the following scenario:
15:12 < raghug> a. mkdir succeeds on the non-readable subvolume
15:12 < raghug> b. fails on the readable subvolume (maybe because the layout xattrs didn't match).
15:12 < raghug> afr here would report mkdir as a success to its parent xlators (here dht). However, since the non-readable subvolume is not guaranteed to have correct xattrs, mkdir shouldn't have succeeded. Worse still, if mkdir on the readable subvolume had failed because the pre-op check failed (layout xattrs in client memory and layout xattrs persisted on disk didn't match), we would be ignoring a genuine failure and instead transforming it into a success.
15:13 < raghug> you can use readable/good interchangeably
15:13 < raghug> same with non-readable/bad subvols
15:13 < xavih> raghug: as I understand the changes in storage/posix, mkdir *cannot* succeed on a bad brick
15:15 < xavih> raghug: if the xattr check fails, it returns an error without creating the directory
15:15 < raghug> xavih: what if a "bad" brick has the "stale" xattrs coincidentally cached by the client also?
15:16 < raghug> it can happen with frequent changes of xattrs I suppose. thinking of a concrete example..
15:17 -!- atinm [atinm@nat/redhat/x-iofeywgqrejhtcuh] has joined #gluster
15:17 < xavih> raghug: do you mean that directories with the same xattr can be good and bad ? shouldn't that xattr value be unique in some way ?
15:17 < xavih> raghug: otherwise this preop check seems weak...
15:19 < raghug> For simplicity, let's consider the case of a 3-way replica b1, b2 and b3
15:19 < xavih> raghug: ok
15:19 < raghug> Let's say a client c1 has cached xattr value x1
15:20 < xavih> raghug: when you say "cached" do you mean that it has stored that value on the three bricks ?
15:20 < raghug> xavih: no. The client has stored the value in its memory
15:20 < raghug> in the inode-ctx of the directory/file, let's say
15:21 < xavih> raghug: without sending the request to the bricks and checking the answer ?
15:21 < xavih> raghug: I think this is not possible
15:22 < raghug> xavih: it would have sent the request (say a getxattr/lookup).
But let's say there is a future fop (say mkdir) that relies on this xattr. In this time window, there is a possibility that this value can change on the brick
15:22 < raghug> that is the "staleness" #13885 is addressing
15:23 < xavih> raghug: yes, this is possible, but I still don't see the problem...
15:23 < raghug> yes.. here is the scenario
15:24 < raghug> 15:19 < raghug> For simplicity, let's consider the case of a 3-way replica b1, b2 and b3
15:24 < raghug> 15:19 < xavih> raghug: ok
15:24 < raghug> 15:19 < raghug> Let's say a client c1 has cached xattr value x1
15:24 < raghug> are you ok till now?
15:25 < raghug> so, we've three bricks b1, b2 and b3
15:25 < raghug> all three part of a replica
15:25 < raghug> Now imagine a client c1 has read an xattr with value x1 and it has cached that value in its memory
15:25 < xavih> raghug: I don't understand the xattr being cached. AFAIK neither afr nor ec cache regular xattrs beyond those needed for their own internal use...
15:26 < xavih> raghug: anyway, let's assume it's cached...
15:26 < raghug> xavih: dht is caching them
15:26 < xavih> raghug: because dht needs it...
15:26 < raghug> the xattr being cached is the layout of the directory
15:26 < raghug> xavih: yes
15:26 < raghug> xavih: any doubts till now?
15:26 < xavih> raghug: afr and ec don't need it, so they don't cache it
15:27 < xavih> raghug: no, it's ok
15:27 < raghug> xavih: ok. So, now another client c2 changes the same xattr value to x2. But it succeeds only on b2 and b3
15:27 < xavih> raghug: ok
15:27 < raghug> so, b1 has x1. b2 and b3 have x2
15:28 < raghug> also dht on client c1 has cached value x1
15:28 < xavih> raghug: ok
15:29 < raghug> so now when dht (from client c1) tries to do a mkdir using the same xattr, mkdir succeeds on "bad" brick b1 but fails on b2 and b3
15:29 < xavih> raghug: in that case (I will speak for ec, not so sure for afr), when dht on c1 tries to create a directory it will receive an EIO error with the xattr indicating that preop failed
15:30 < xavih> raghug: from the ec side, it will see b2 and b3 failing, and b1 succeeding
15:30 < xavih> raghug: b1 will be marked as bad, and self-heal will take care of it
15:31 < raghug> xavih: yes. With quorum this problem is mitigated
15:31 < raghug> But with a two-way replica, this would be a problem
15:31 < raghug> I was also worried about "rollback"
15:31 < raghug> (here deleting the directory from b1)
15:32 < xavih> raghug: I can't talk for afr. Pranith should know better
15:32 < xavih> raghug: the removal of the directory will happen, but it could be delayed
15:32 < raghug> xavih: thanks. Let me think over what we discussed. Will get back to you if I find any issues. Thanks :)
15:33 < xavih> raghug: anyway, since the parent directory has been marked as bad, future readdirs won't read contents from that brick
15:34 < xavih> raghug: so the "bad" newly created directory won't be visible to the user (nor dht)
15:34 < raghug> xavih: ok. got it
15:35 < raghug> xavih: what if there is a quorum of "bad" bricks?
15:35 < xavih> raghug: this self-heal will also heal the "outdated" xattr that caused the incorrect creation of the directory in the first place
15:35 < xavih> raghug: there cannot be a quorum of "bad" bricks
15:35 < raghug> in this case let's say for some reason b2 also had the same "stale" xattr and mkdir succeeded on it
15:36 < raghug> xavih: hmm..
15:36 < xavih> raghug: if there were a quorum of bad bricks, it means that another client has failed on more than half of the bricks
15:36 < raghug> then, I suppose that op itself is a failure
15:36 < raghug> (from another client)
15:36 < xavih> raghug: but in that case, the "bad" bricks will be the ones that have succeeded the mkdir on the other client
15:37 < xavih> raghug: following your example, to try to have a majority of "bad" bricks, c2 should have failed on b1 and b2, and succeeded only on b3, right ?
15:37 < raghug> xavih: yes
15:38 < xavih> raghug: in that case, c2 will receive an error, indicating that the mkdir has failed
15:38 < raghug> ok
15:38 < xavih> raghug: this means that the good bricks are really b1 and b2
15:38 < raghug> xavih: yes. Got it
15:39 < xavih> raghug: so c1 won't see them as bad, and it will succeed on them without problems
15:39 < xavih> raghug: everything is consistent on both clients
15:39 < raghug> xavih: on a slightly unrelated issue, I am planning to move the locking to the bricks as future work. So, EC wouldn't be witnessing the lock that synchronizes modification of the xattr (this lock is again taken by dht). Do you think there would be any problem with that?
15:39 < raghug> (ref: patch 13885)
15:41 < raghug> with this lock, no dht will be able to modify the xattr for the duration of the lock
15:41 < raghug> (no dht from other clients)
15:41 < xavih> raghug: that might be problematic
15:41 < raghug> xavih: ...
15:44 < xavih> raghug: ec takes the necessary inodelks to guarantee integrity, but if you move the locking from dht to the bricks, it means that two clients could try to change the same xattr. EC guarantees that all bricks will be updated in the same order, but it cannot guarantee that a lock taken after it (in the brick) will be honored
15:44 < xavih> raghug: how do you plan to move locking to the bricks ? maybe this could be used by ec as well.
15:44 < xavih> raghug: I doubt this could be transparent to ec
15:45 < raghug> xavih: dht's directory self-heal takes a lock on _all_ its subvols
15:46 < raghug> so, a lock on one subvol would prevent self-heal
15:46 < raghug> but the catch is whether an EC subvol would choose the same brick as lock server
15:47 < raghug> In the context of #13885..
15:47 < raghug> mkdir would've wound to b1, b2 and b3
15:48 < xavih> raghug: from the point of view of DHT, an EC subvolume should be seen as a single brick, however if a feature crosses EC "hidden" in an xdata, special care must be taken...
15:48 < raghug> and all three bricks would 1. acquire inodelk in the same brick 2. do the pre-op check (xattr comparison) 3. release inodelk
15:49 < raghug> note that each brick is acquiring an inodelk only on itself
15:49 < xavih> raghug: in this special case for mkdir, it shouldn't be a problem, but the locking issue should be analyzed again
15:50 < xavih> raghug: oh, I see
15:50 < raghug> xavih: it's not just mkdir. We've plans to expand this logic to all dentry operations that depend on layout (create, unlink, rename, rmdir etc)
15:50 < xavih> raghug: I should analyze it deeper. At first glance, I think there could be some deadlocks
15:50 < raghug> xavih: no problem. We can meet later to discuss
15:51 < raghug> It's sufficient for now that you understand the scope of the problem/solution we are trying to implement in dht
15:51 < xavih> raghug: when you have something more detailed, I can look at it.
15:52 < raghug> xavih: roughly you can expect patches similar to #13885 for other dentry operations
15:52 < raghug> create, link, unlink, rmdir, symlink, mknod etc
15:53 < xavih> raghug: good. I don't foresee any problem with them, but you can add me as a reviewer so that I'll be able to check each of them
15:53 < raghug> there is also the special case of lk, which I think we can take up once you are clear about dentry ops
15:54 < raghug> when lk is issued on a directory, dht has to choose a "lock-server", which all the clients agree upon
15:54 < raghug> (as the directory is present on all subvols)
15:54 < raghug> we are planning to have the "hashed-subvol" as the lock-server
15:54 < raghug> and this hashed-subvol is dependent on the layout xattr
15:55 < xavih> raghug: wouldn't this cause problems if that subvol dies ?
15:55 < raghug> xavih: yes. it's a problem
15:55 < raghug> xavih: ideally we should acquire the lock on _all_ subvols
15:55 < xavih> raghug: yes
15:55 < raghug> but it's complicated and poses scalability issues
15:55 < raghug> so, I assume that's a tradeoff we've to live with
15:56 < xavih> raghug: and using a subset of the subvolumes ? (2 or more)
15:56 < raghug> xavih: yes, that's also a variant we are thinking of
15:56 < raghug> but the bare minimum would be to agree on at least one subvol as the "lock-server"
15:57 < xavih> ok
15:57 < raghug> the current code is broken there too
15:57 < raghug> and can sometimes lead to different "lock-server(s)" even when all subvols are up
15:58 < raghug> note that these are corner cases
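For reference, here is a rough sketch of the quorum-based combining behaviour Xavi describes for ec above. The types and names are hypothetical (this is not AFR/EC source): each brick independently runs the pre-op check plus the mkdir, and the clustered translator takes the answer shared by a majority of bricks, marking the disagreeing minority as bad so that self-heal can later remove the stray directory and refresh the stale xattr.

#include <stdbool.h>

struct brick_result {
    int  op_ret;   /* 0 on success, -1 on failure        */
    int  op_errno; /* errno-style code when op_ret == -1 */
    bool bad;      /* set when the brick disagrees with the quorum answer */
};

/* Combine per-brick mkdir results: return the quorum answer (0 or -1) and
 * flag the minority bricks as bad. */
static int
combine_mkdir_results(struct brick_result *res, int nbricks, int *op_errno)
{
    int ok = 0;

    for (int i = 0; i < nbricks; i++) {
        if (res[i].op_ret == 0)
            ok++;
    }

    bool quorum_ok = (ok > nbricks / 2);

    for (int i = 0; i < nbricks; i++) {
        /* A brick whose answer differs from the quorum answer is marked
         * bad; its contents are ignored until self-heal fixes it. */
        res[i].bad = ((res[i].op_ret == 0) != quorum_ok);
    }

    if (quorum_ok)
        return 0;

    /* Report the first failure's errno as the combined answer. */
    for (int i = 0; i < nbricks; i++) {
        if (res[i].op_ret != 0) {
            *op_errno = res[i].op_errno;
            break;
        }
    }
    return -1;
}

As noted in the transcript, with a 2-way replica there is no clear majority, so a scheme like this alone does not resolve the problem for afr.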
DHT now has entrylk and hence this can be treated as done. (glusterfs-6.x)