Description of problem:
AFR uses changelogs stored in extended attributes (updated via the xattrop() FOP) to keep track of the state of replicas. The algorithm assumes that the changelog xattrs, and their ordering with respect to the data operations, are durable. In reality, after a system crash or hard reset the backend filesystem (XFS) will almost certainly re-order the operations and may lose the data operations, because it does not journal data. This can leave the system in an inconsistent and irrecoverable state at best, and corrupt data at worst.

Version-Release number of selected component (if applicable):
All releases

How reproducible:
Almost always

Steps to Reproduce:
1. Create a file, write data and close it.
2. Abruptly power down one server and reboot it (within 30 secs).
3. The file size of one of the copies will be 0 bytes, while the AFR xattr changelog and index are clean.

Actual results:
All files written in the last 30 seconds before the abrupt power down lose their data (which is expected), but AFR's changelogging is vulnerable to this because it does not flush the xattrs and the data operation to persistent storage. In the "best" case files stay silently unhealed; in the "worst" case a file reverts to an old changelog value, so healing runs in the wrong direction and corrupts data.

Expected results:
AFR should be resilient to system reset/crash. After a system reset/crash, AFR should reliably heal the files.

Additional info:
REVIEW: http://review.gluster.org/4719 (storage/posix: make xattrop barrier operations on the inode) posted (#1) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4721 (cluster/afr: ensure DATA operations are made durable before POST-OP) posted (#1) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4721 (cluster/afr: ensure DATA operations are made durable before POST-OP) posted (#2) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4721 (cluster/afr: ensure DATA operations are made durable before POST-OP) posted (#3) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4721 (cluster/afr: ensure DATA operations are made durable before POST-OP) posted (#4) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4721 (cluster/afr: ensure DATA operations are made durable before POST-OP) posted (#5) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4721 (cluster/afr: ensure DATA operations are made durable before POST-OP) posted (#6) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4721 (cluster/afr: ensure DATA operations are made durable before POST-OP) posted (#7) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4721 (cluster/afr: ensure DATA operations are made durable before POST-OP) posted (#8) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4737 (cluster/afr: fsync() guarantees POST-OP completion) posted (#2) for review on master by Anand Avati (avati)
COMMIT: http://review.gluster.org/4721 committed in master by Anand Avati (avati)
------
commit ca10fdc81a72a71ac67ac9fc8c5ad5b92febd875
Author: Anand Avati <avati>
Date: Sun Mar 24 12:19:56 2013 -0700

cluster/afr: ensure DATA operations are made durable before POST-OP

The changelogging scheme of AFR stores information about the state of all replicas in all replicas (in the extended attributes of the respective files on each server) in the form of 'pending counts' of operations (effectively "dirty flags"). These xattrs are blindly trusted while performing self-heal, and therefore utmost care has to be taken while updating and maintaining them.

The most critical update is the clearing of the pending counts corresponding to the *other* server in the changelog of a given server. Before clearing the pending count, we need a durability guarantee for the write which was performed on the other server. To obtain such a guarantee, it may be necessary to explicitly introduce an fsync() phase (if the file itself wasn't already opened with O_SYNC).

This patch introduces the detection of unstable writes on a file and issues an explicit fsync() on the servers before performing the POST-OP clearing of pending flags.

Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7
BUG: 927146
Signed-off-by: Anand Avati <avati>
Reviewed-on: http://review.gluster.org/4721
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Reviewed-by: Krishnan Parthasarathi <kparthas>
Tested-by: Gluster Build System <jenkins.com>
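The ordering this commit describes (PRE-OP, DATA, fsync where needed, POST-OP) can be illustrated with a small Python model. This is only a sketch under assumptions: `Replica`, `transaction` and `demo` are hypothetical names, the changelog is an in-memory dict rather than real xattrs, and none of it is AFR code.

```python
import os
import tempfile


class Replica:
    """Toy stand-in for one brick: a backing file plus an in-memory changelog."""

    def __init__(self, path, names, o_sync=False):
        flags = os.O_RDWR | os.O_CREAT | (os.O_SYNC if o_sync else 0)
        self.fd = os.open(path, flags, 0o600)
        self.o_sync = o_sync
        # pending[n] > 0 means "replica n may be missing operations".
        self.pending = {n: 0 for n in names}
        self.fsync_calls = 0

    def write(self, offset, data):
        os.pwrite(self.fd, data, offset)

    def fsync(self):
        os.fsync(self.fd)
        self.fsync_calls += 1

    def close(self):
        os.close(self.fd)


def transaction(replicas, offset, data):
    """Run PRE-OP, DATA, fsync (if needed) and POST-OP, in that order."""
    # PRE-OP: each replica records a pending operation against every replica.
    for rep in replicas.values():
        for name in rep.pending:
            rep.pending[name] += 1

    # DATA: perform the write, then make it durable where it is not already.
    durable = set()
    for name, rep in replicas.items():
        try:
            rep.write(offset, data)
            if not rep.o_sync:
                rep.fsync()  # the write is "unstable" until this returns
            durable.add(name)
        except OSError:
            pass  # leave this replica's pending counts set so it gets healed

    # POST-OP: clear pending counts only for replicas whose data is durable.
    for rep in replicas.values():
        for name in durable:
            rep.pending[name] -= 1
    return durable


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        names = ["brick0", "brick1"]
        replicas = {n: Replica(os.path.join(tmp, n), names) for n in names}
        durable = transaction(replicas, 0, b"hello")
        pending = {n: dict(r.pending) for n, r in replicas.items()}
        fsyncs = {n: r.fsync_calls for n, r in replicas.items()}
        for rep in replicas.values():
            rep.close()
    return durable, pending, fsyncs
```

The point of the model is the placement of `fsync()`: a replica's pending count is cleared only after its write has been flushed, so a crash between DATA and POST-OP leaves the "dirty flag" set instead of leaving a clean changelog over lost data.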
COMMIT: http://review.gluster.org/4737 committed in master by Anand Avati (avati)
------
commit 8909c28c1173e10fd2f10706bd8a0f2ca5b5d685
Author: Anand Avati <avati>
Date: Wed Mar 27 19:55:58 2013 -0700

cluster/afr: fsync() guarantees POST-OP completion

AFR now provides a stronger guarantee: fsync() returns only after all the deferred/delayed POST-OPs on that open file have completely finished.

To achieve this we make a stub out of the returning fsync and register it with the "delayed" frame in afr_changelog_wake_resume(). The delayed frame, after getting woken up and finishing the POST-OP, will call_resume() the registered stub (which UNWINDs the fsync) at the time of frame destruction.

This provides a guarantee that an application's (or FUSE's) fsync() returns only after finishing up all the previous transactions, including delayed POST-OPs and UNLOCK.

Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49
BUG: 927146
Signed-off-by: Anand Avati <avati>
Reviewed-on: http://review.gluster.org/4737
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
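The stub mechanism described in this commit can be modelled in a few lines of Python. The sketch below is an assumption-laden illustration, not AFR code: `DelayedPostOp`, `OpenFile` and `fsync_demo` are hypothetical names, and a plain callback plays the role of the call stub that UNWINDs the fsync.

```python
class DelayedPostOp:
    """A POST-OP that was postponed in the hope of being reused by a later write."""

    def __init__(self, log):
        self.log = log
        self.stubs = []  # fsync replies parked until this POST-OP is done

    def wake(self):
        self.log.append("post-op")
        self.log.append("unlock")
        # Models "frame destruction": resume every parked fsync last.
        for resume in self.stubs:
            resume()
        self.stubs = []


class OpenFile:
    """Per-fd state holding at most one delayed POST-OP."""

    def __init__(self):
        self.log = []
        self.delayed = None

    def write_done(self):
        # The write finished; its POST-OP is delayed rather than run now.
        if self.delayed is None:
            self.delayed = DelayedPostOp(self.log)

    def fsync(self, unwind):
        delayed, self.delayed = self.delayed, None
        if delayed is None:
            unwind()  # nothing outstanding, reply immediately
            return
        delayed.stubs.append(unwind)  # park the reply on the delayed frame
        delayed.wake()  # force the POST-OP to run now


def fsync_demo():
    f = OpenFile()
    f.write_done()
    f.fsync(lambda: f.log.append("fsync returns"))
    return f.log
```

In this model `fsync_demo()` records the POST-OP and UNLOCK before the fsync reply, which is the ordering guarantee the commit message states.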
REVIEW: http://review.gluster.org/4741 (cluster/afr: piggyback and fsync resume changes) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/4741 committed in master by Anand Avati (avati)
------
commit ca6a3d1e396a65d25e54d331bef966178cd55375
Author: Pranith Kumar K <pkarampu>
Date: Thu Mar 28 11:29:41 2013 +0530

cluster/afr: piggyback and fsync resume changes

1) pre_op_piggyback should always be decremented.
2) Move fsync resume to just after post_op.
3) fsync stub should be created from afr's local not from the final response.

Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/4741
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Anand Avati <avati>
REVIEW: http://review.gluster.org/4744 (cluster/afr: fsync before erase xattrs in data self-heal) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/4744 committed in master by Anand Avati (avati)
------
commit 6ae6f3db02ec374448e9286b03651849ae30dff0
Author: Pranith Kumar K <pkarampu>
Date: Thu Mar 28 22:26:24 2013 +0530

cluster/afr: fsync before erase xattrs in data self-heal

Added an extra fsync to the data self-heal code to make sure the data has reached disk before erasing the changelogs.

Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/4744
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Anand Avati <avati>
REVIEW: http://review.gluster.org/4745 (cluster/afr: fix fd leak with unsafe call_resume()) posted (#1) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#1) for review on master by Anand Avati (avati)
COMMIT: http://review.gluster.org/4745 committed in master by Anand Avati (avati)
------
commit 0b81f2801b7a72130d86c88da938f288430cd3e5
Author: Anand Avati <avati>
Date: Mon Mar 25 20:34:43 2013 -0700

cluster/afr: fix fd leak with unsafe call_resume()

Introduce the AFR_CALL_RESUME macro, which cleans up frame->local the way AFR_STACK_UNWIND etc. do. This fixes the leak in the afr_fsync() path.

Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8
BUG: 927146
Signed-off-by: Anand Avati <avati>
Reviewed-on: http://review.gluster.org/4745
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Tested-by: Gluster Build System <jenkins.com>
REVIEW: http://review.gluster.org/4752 (cluster/afr: prevent piggyback on stale pre_op) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/4752 (cluster/afr: prevent piggyback on stale pre_op) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/4752 committed in master by Anand Avati (avati)
------
commit 864ac6b7b3d69b5f2cc0fafe4b12d861da3a633c
Author: Pranith Kumar K <pkarampu>
Date: Tue Apr 2 00:24:45 2013 +0530

cluster/afr: prevent piggyback on stale pre_op

Here are the logs of a file on which we saw EIO because of size mismatch:

[root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 0, len: 7680
Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4: offset 0 length 7680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 7680, len: 71680
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for offset: 79360, len: 15716
fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for offset 0 length 7680 with changelog status: -1 -1

According to these logs, fsync did not happen after the writev with offset: 79360, len: 15716, which is the reason for this problem.

In total 3 writes came. Let's call them w1, w2 and w3.

w1 does pre_op, so the pre_op_done[0], pre_op_done[1] counts become 1 and 1. Then is_piggyback_post_op() is called for w1 and it returns *false*, and w1's fsync is fired.

Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1, so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice, once by w2 and once more by w3, and become 2, 2. ------- Step-A

Now the fsync of w1 is complete, and it goes ahead with post op and decrements pre_op_done[0], pre_op_done[1] to 0, 0.

Now the w2 and w3 writevs complete and is_piggyback_post_op will return *true* for both w2 and w3. So fsync is not fired for either w2 or w3.

This patch prevents Step-A from happening.

Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/4752
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Jeff Darcy <jdarcy>
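The w1/w2/w3 sequence from this commit message can be replayed with a single-replica toy model in Python. Everything here is an assumption made for illustration: the counter names follow the commit message, but the piggyback rule inside `is_piggyback_post_op` is a guess at the logic, and the `stale` flag is a hypothetical guard that merely shows one way Step-A could be avoided. It is not the mechanism of the actual patch.

```python
class FdCtx:
    """Single-replica toy version of the per-fd counters named in the commit."""

    def __init__(self, guard):
        self.guard = guard          # enable the hypothetical stale-pre_op guard
        self.pre_op_done = 0
        self.pre_op_piggyback = 0
        self.stale = False          # hypothetical flag, not taken from the patch
        self.fsynced_by = []        # which writes issued fsync + POST-OP

    def pre_op(self):
        blocked = self.guard and self.stale
        if self.pre_op_done and not blocked:
            self.pre_op_piggyback += 1   # reuse the earlier PRE-OP
        else:
            self.pre_op_done += 1        # perform our own PRE-OP
            self.stale = False

    def is_piggyback_post_op(self):
        # Assumed rule: a write piggybacks if someone registered a piggyback.
        if self.pre_op_piggyback > 0:
            self.pre_op_piggyback -= 1
            return True
        self.stale = True  # this write now heads for fsync + POST-OP
        return False

    def fsync_and_post_op(self, write):
        self.fsynced_by.append(write)
        self.pre_op_done -= 1


def replay(guard):
    ctx = FdCtx(guard)
    ctx.pre_op()                     # w1: pre_op_done becomes 1
    ctx.is_piggyback_post_op()       # w1: returns False, so w1's fsync is fired
    ctx.pre_op()                     # w2 arrives while w1's fsync is in flight
    ctx.pre_op()                     # w3 arrives (Step-A when guard is off)
    ctx.fsync_and_post_op("w1")      # w1's fsync completes, then POST-OP
    for write in ("w2", "w3"):       # the w2 and w3 writevs complete
        if not ctx.is_piggyback_post_op():
            ctx.fsync_and_post_op(write)
    return ctx.fsynced_by
```

With the guard off, `replay(False)` yields an fsync for w1 only, matching the logs above. With the guard on, w2 performs its own PRE-OP, w3 piggybacks on it, and the fsync issued by w3 runs after both writes have completed.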
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#2) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#3) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#4) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#5) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#6) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#7) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#8) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#9) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4746 (storage/posix: implement batched fsync in a single thread) posted (#10) for review on master by Anand Avati (avati)
COMMIT: http://review.gluster.org/4746 committed in master by Vijay Bellur (vbellur)
------
commit 37ac6bdca826046cbcb0d50727af29baf9407950
Author: Anand Avati <avati>
Date: Fri Jul 19 08:31:41 2013 -0700

storage/posix: implement batched fsync in a single thread

Because of the extra fsync()s issued by AFR transactions, they could potentially "clog" all the io-threads, denying unrelated operations from making progress. This patch assigns a dedicated thread to issue fsyncs, as an experimental feature to understand the performance characteristics of the approach.

As a basis, incoming individual fsync requests are grouped into batches, falling in the same @batch-fsync-delay-usec window of time. These windows can extend in practice, as processing of the previous batch can take longer than @batch-fsync-delay-usec while new requests are getting batched.

The feature supports four modes (similar to the -S modes of fs_mark):

- syncfs: In this mode one syncfs() is issued per batch, instead of N fsync()s (one per file).

- syncfs-single-fsync: In this mode one syncfs() is issued per batch (which, on Linux, guarantees the completion of write-out of dirty pages in the filesystem up to that point), followed by one single fsync() to synchronize or flush the controller/drive cache. This corresponds to -S 2 of fs_mark.

- syncfs-reverse-fsync: In this mode one syncfs() is issued per batch, and all the open files in that batch are fsync()'ed in the reverse order of the queue. This corresponds to -S 4 of fs_mark.

- reverse-fsync: In this mode no syncfs() is issued and all the files in the batch are fsync()'ed in the reverse order. This corresponds to -S 3 of fs_mark.

Change-Id: Ia1e170a810c780c8d80e02cf910accc4170c4cd4
BUG: 927146
Signed-off-by: Anand Avati <avati>
Reviewed-on: http://review.gluster.org/4746
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Vijay Bellur <vbellur>
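The batching idea can be sketched in Python with one worker thread and a queue. This is a rough illustration under assumptions, not the storage/posix implementation: `BatchFsyncer` and `batch_demo` are hypothetical names, only the reverse-fsync mode is modelled (Python's standard library has no syncfs() wrapper), and the window is approximated by a single sleep after the first request of a batch arrives.

```python
import os
import queue
import tempfile
import threading
import time


class BatchFsyncer:
    """Single worker thread that fsyncs queued fds one batch at a time."""

    def __init__(self, delay_usec):
        self.delay = delay_usec / 1_000_000
        self.requests = queue.Queue()
        self.batches = []  # size of each processed batch, for inspection
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def fsync(self, fd):
        """Queue an fsync; the returned Event is set once fd has been flushed."""
        done = threading.Event()
        self.requests.put((fd, done))
        return done

    def stop(self):
        self.requests.put(None)  # sentinel: finish the current batch and exit
        self.worker.join()

    def _run(self):
        while True:
            first = self.requests.get()  # block until there is work
            if first is None:
                return
            time.sleep(self.delay)  # let the batching window fill up
            batch, stop = [first], False
            while True:
                try:
                    item = self.requests.get_nowait()
                except queue.Empty:
                    break
                if item is None:
                    stop = True
                    break
                batch.append(item)
            # reverse-fsync mode: newest request first, no syncfs().
            for fd, done in reversed(batch):
                os.fsync(fd)
                done.set()
            self.batches.append(len(batch))
            if stop:
                return


def batch_demo():
    fsyncer = BatchFsyncer(delay_usec=50_000)
    with tempfile.TemporaryDirectory() as tmp:
        fds = []
        for i in range(3):
            path = os.path.join(tmp, f"f{i}")
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
            os.write(fd, b"data")
            fds.append(fd)
        events = [fsyncer.fsync(fd) for fd in fds]
        flushed = [e.wait(timeout=5) for e in events]
        fsyncer.stop()
        for fd in fds:
            os.close(fd)
    return flushed, sum(fsyncer.batches)
```

Callers never block an I/O thread on `os.fsync()` themselves; they wait on the returned event, and all flushing happens on the one dedicated worker. With a delay of 0 each batch contains whatever happened to be queued at that moment.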
REVIEW: http://review.gluster.org/5381 (storage/posix: Fix conditional compiling for syncfs) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/5381 committed in master by Vijay Bellur (vbellur)
------
commit a496f0fd94276822169ff8ea9f961ac2dba7984a
Author: Pranith Kumar K <pkarampu>
Date: Wed Jul 24 11:25:07 2013 +0530

storage/posix: Fix conditional compiling for syncfs

Change-Id: Ief22e1c0f2b5074060752d70da41ae93f1028d62
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/5381
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Vijay Bellur <vbellur>
REVIEW: http://review.gluster.org/5406 (tests: fix test script to turn on write-behind) posted (#1) for review on master by Anand Avati (avati)
COMMIT: http://review.gluster.org/5406 committed in master by Vijay Bellur (vbellur)
------
commit 412940c56e203b16ebf027fe5b9cbf58cd3a144e
Author: Anand Avati <avati>
Date: Sat Jul 20 15:04:48 2013 -0700

tests: fix test script to turn on write-behind

Change-Id: I8a3ddc8183355236ff7725229441e27bbf8188e3
BUG: 927146
Signed-off-by: Anand Avati <avati>
Reviewed-on: http://review.gluster.org/5406
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Vijay Bellur <vbellur>
REVIEW: http://review.gluster.org/5408 (cluster/afr: Print self-heal log when self-heal succeeds) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/5408 (cluster/afr: Print self-heal log when self-heal succeeds) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/5408 (cluster/afr: Print self-heal log when self-heal succeeds) posted (#3) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/5408 (cluster/afr: Print self-heal log when self-heal succeeds) posted (#4) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/5408 committed in master by Vijay Bellur (vbellur)
------
commit 177f32e5b0d73336b2d5bde08bafff186b65e211
Author: Pranith Kumar K <pkarampu>
Date: Mon Jul 29 14:44:40 2013 +0530

cluster/afr: Print self-heal log when self-heal succeeds

Change-Id: I95e47e589419dc6a032cbd8ba01964b6c176c2d5
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/5408
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Vijay Bellur <vbellur>
REVIEW: http://review.gluster.org/5501 (afr: treat appending writes as stable writes.) posted (#1) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/5501 (afr: treat appending writes as stable writes.) posted (#2) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/5620 (posix: Default value for `batch-fsync-delay-usec` should be '0') posted (#1) for review on master by Harshavardhana (harsha)
COMMIT: http://review.gluster.org/5620 committed in master by Anand Avati (avati)
------
commit 0d756dc618c1a4b659a3531aec449506ce577f50
Author: Harshavardhana <harsha>
Date: Tue Aug 13 17:35:20 2013 -0700

posix: Default value for `batch-fsync-delay-usec` should be '0'

Also fixes the failing testcase `./tests/bugs/bug-888174.t`, which has been failing sporadically for many patches.

Change-Id: Ic7d2c95da5d3126623cec403207afadd449bf950
BUG: 927146
Signed-off-by: Harshavardhana <harsha>
Reviewed-on: http://review.gluster.org/5620
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Anand Avati <avati>
COMMIT: http://review.gluster.org/5501 committed in master by Anand Avati (avati)
------
commit 8360037701788d49471cc0228fa873aa18382023
Author: Anand Avati <avati>
Date: Wed Jul 24 03:53:16 2013 -0700

afr: treat appending writes as stable writes.

Durability of appending writes is implicit in the file size. Therefore performing an explicit fsync() is unnecessary in such cases, as self-heal can check the size of the file when the pending changelog is ambiguous.

Change-Id: I05446180a91d20e0dbee5de5a7085b87d57f178a
BUG: 927146
Signed-off-by: Anand Avati <avati>
Reviewed-on: http://review.gluster.org/5501
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
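A minimal Python sketch of the append-write reasoning follows. The helper names are hypothetical, and two details are assumptions rather than facts from the commit: that a write counts as "appending" when its offset equals the file size before the write, and that self-heal prefers the larger copy when the changelog is ambiguous.

```python
def is_append(offset, size_before):
    """Assumed definition: the write starts exactly at the old end of file."""
    return offset == size_before


def needs_fsync(offset, size_before, o_sync=False):
    """Return True if the write should be followed by an explicit fsync()."""
    if o_sync:
        return False  # already durable when the write returns
    return not is_append(offset, size_before)


def pick_source(sizes):
    """Given {replica: size}, prefer the largest copy (assumed heal policy)."""
    return max(sizes, key=sizes.get)
```

For example, a write at offset 7680 to a 7680-byte file is an append and skips the fsync, while an overwrite at offset 0 of the same file does not. If one copy lost the appended data in a crash, the size difference between the copies (say 0 bytes versus 7680 bytes) is what lets the heal direction be chosen.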
REVIEW: http://review.gluster.org/5694 (bug-979365.t: fix wrong expectation of encountering fsync) posted (#1) for review on master by Anand Avati (avati)
COMMIT: http://review.gluster.org/5694 committed in master by Anand Avati (avati)
------
commit 094b06c520498477804ef9ff8166ec0541d98c21
Author: Anand Avati <avati>
Date: Thu Aug 22 12:34:26 2013 -0700

bug-979365.t: fix wrong expectation of encountering fsync

After the append-write detection patch, FSYNCs may or may not be issued depending on the order in which writes reach the server (in the presence of write-behind). Fix the test case to account for this non-deterministic behavior.

Change-Id: I1dc3453a6dd4a12a66551948eb8311d789ac2ecf
BUG: 927146
Signed-off-by: Anand Avati <avati>
Reviewed-on: http://review.gluster.org/5694
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>