Bug 1552425 - Make afr_fsync a transaction
Summary: Make afr_fsync a transaction
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Karthik U S
QA Contact: Vijay Avuthu
Depends On: 1548361
Blocks: 1503137
Reported: 2018-03-07 06:19 UTC by Karthik U S
Modified: 2018-09-19 05:34 UTC
CC: 5 users

Fixed In Version: glusterfs-3.12.2-6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-09-04 06:44:11 UTC
Target Upstream Version:


System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:45:12 UTC

Description Karthik U S 2018-03-07 06:19:01 UTC
Description of problem:
Currently afr_fsync() is not a transaction, which can lead to problems. If data has not yet been synced to the bricks, the application issues an fsync, and some bricks crash, the data on those bricks must later be healed from the surviving bricks.
Because fsync is not a transaction, it does not set any pending markers to indicate that data needs to be copied to the bricks on which the fsync failed. Making it a transaction takes care of setting the pending markers, so that when the crashed bricks come back up, self-heal syncs them, guaranteeing the data persists on disk.
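The pending-marker mechanism described above can be sketched as follows. This is a simplified illustrative model in Python, not the actual afr C code; the names `fsync_transaction` and the brick/pending data structures are made up for the example:

```python
# Simplified model of AFR transaction pending markers around fsync.
# Illustrative only, not the actual glusterfs C implementation.

def fsync_transaction(bricks):
    """bricks: dict name -> {"up": bool, "pending": {peer: count}}.

    pre-op:  every live brick raises a pending count against every
             peer, recording "that peer may not have this data".
    op:      fsync runs on each brick; a down brick fails.
    post-op: each live brick lowers the pending count for peers whose
             fsync succeeded; counts against failed peers remain.
    """
    live = [n for n, b in bricks.items() if b["up"]]
    # pre-op: raise pending counts on all live bricks
    for n in live:
        for peer in bricks:
            if peer != n:
                bricks[n]["pending"][peer] = bricks[n]["pending"].get(peer, 0) + 1
    # op: a brick that is down fails its fsync
    ok = {n: bricks[n]["up"] for n in bricks}
    # post-op: lower counts only for peers whose fsync succeeded
    for n in live:
        for peer in bricks:
            if peer != n and ok[peer]:
                bricks[n]["pending"][peer] -= 1

bricks = {
    "brick0": {"up": True, "pending": {}},
    "brick1": {"up": False, "pending": {}},  # killed before fsync
}
fsync_transaction(bricks)
# brick0 now holds pending["brick1"] == 1: self-heal knows brick1
# must be healed from brick0 once it comes back. Without the
# transaction (no pre-op), nothing would be marked and brick1
# would silently diverge.
```

If both bricks are up, the pre-op count is fully undone by the post-op, leaving no pending markers, which matches the expectation that a clean fsync leaves the xattrs untouched.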

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 2 Karthik U S 2018-03-07 08:46:09 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/131942/

Comment 3 Karthik U S 2018-03-09 07:19:26 UTC
Upstream patch: https://review.gluster.org/#/c/19621/

Comment 8 Karthik U S 2018-05-21 09:45:34 UTC
Steps to validate this bug:

- Create a replica volume, start it, and mount it
- Create a file and write some data to it
- Kill one of the bricks and do an fsync on the file
- Check the xattrs on the file to see whether the data-pending marker is set (it will be set only if fsync is a transaction)
- The entry should be present in the heal info output
- Bring the brick back up and wait for the heal to complete
- Now the pending marker should be reset, and the heal info entry count should be 0
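The whole validation flow above (marker set while a brick is down, entry visible in heal info, then reset after heal) can be walked through with a toy model. Illustrative Python only; `heal_info` merely mimics the entry count printed by `gluster vol heal <vol> info`:

```python
# Toy walk-through of the validation steps: pending marker raised
# while one brick is down, then cleared after heal. Not gluster
# internals; the functions and dicts here are invented for the sketch.

def fsync_while_down(pending, up):
    # transaction post-op: live bricks blame peers whose fsync failed
    for n in up:
        if up[n]:
            for peer in up:
                if peer != n and not up[peer]:
                    pending[n][peer] = pending[n].get(peer, 0) + 1

def heal(pending, up):
    # brick back up: data is copied over and all blame counters reset
    for n in up:
        up[n] = True
        pending[n].clear()

def heal_info(pending):
    # mimics `gluster vol heal <vol> info`: count of blamed entries
    return sum(len(p) for p in pending.values())

up = {"brick0": True, "brick1": False}       # brick1 killed
pending = {"brick0": {}, "brick1": {}}
fsync_while_down(pending, up)
assert heal_info(pending) == 1   # entry shows up in heal info
heal(pending, up)
assert heal_info(pending) == 0   # heal info back to 0 after heal
```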

Comment 9 Vijay Avuthu 2018-05-22 08:51:52 UTC

1) Create a 2 x 3 distributed-replicate volume and start it
2) Create files with data from the mount point
3) Kill a brick (b0) from a replica set
4) Do an fsync on a file
5) Check the heal info: it should show a pending heal for the file on which fsync was done in step 4
6) Check the xattrs on the file: the data-pending counter should be set

# gluster vol heal testvol_distributed-replicated info
Status: Transport endpoint is not connected
Number of entries: -

Status: Connected
Number of entries: 1

Status: Connected
Number of entries: 1

Status: Connected
Number of entries: 0

Status: Connected
Number of entries: 0

Status: Connected
Number of entries: 0


# getfattr -d -m . -e hex /bricks/brick3/testvol_distributed-replicated_brick1/file_1
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick3/testvol_distributed-replicated_brick1/file_1
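The xattr list in the getfattr output above was truncated in this record. For reference, the data-pending check in step 6 can also be done programmatically: a `trusted.afr.<vol>-client-N` value is three big-endian 32-bit counters (data, metadata, entry), and a nonzero first counter means a data heal is pending. A sketch, using a made-up hex value since the real one is not preserved above:

```python
import struct

def decode_afr_xattr(hex_value):
    """Decode a trusted.afr.* xattr value (as printed by
    getfattr -e hex) into (data, metadata, entry) pending counters.
    AFR stores three big-endian 32-bit counts."""
    raw = bytes.fromhex(hex_value.removeprefix("0x"))
    return struct.unpack(">III", raw[:12])

# Example value in getfattr -e hex form (invented for illustration,
# since the real output above is truncated):
data, metadata, entry = decode_afr_xattr("0x000000010000000000000000")
assert data == 1                  # data-pending counter set
assert (metadata, entry) == (0, 0)
```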


Changing status to Verified.

Comment 11 errata-xmlrpc 2018-09-04 06:44:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

