Description of problem:
Have a 4-node cluster with a 1 x (4+2) disperse volume named 'ozone'. Enable bitrot and set the scrubber frequency to hourly. Create files/directories via FUSE/NFS and create a couple of hardlinks as well. Corrupt one of the hardlinks from the backend brick path and wait for the scrubber to mark it as corrupted. Now follow the standard procedure for recovering a corrupted file: delete it on the backend and access it from the mountpoint. After recovery, we see that the recovered file has the same contents as it had when it was corrupted.
Version-Release number of selected component (if applicable):
Hit it in my setup. Recreated the issue on another setup shared by the development team. That setup is still in the same state, in case it needs to be looked at.
Steps to Reproduce:
1. Have a 4-node cluster. Create a 4+2 disperse volume on node2, node3 and node4, using 2 bricks from each node.
2. Enable bitrot and mount the volume via FUSE. Create 5 files and 2 hardlinks.
3. Go to the backend brick path on node2 and append a line to one of the hardlinks.
4. Verify using 'cat' that both the hardlink and the original file show the corruption at the backend (hardlinks share the same inode, so both names see the modified data).
5. Wait for the scrubber to finish its run, and verify that /var/log/glusterfs/scrub.log detects the corruption.
6. Delete the hardlink (and the original file) from the backend brick path of node2 and access the file from the mountpoint, expecting self-heal to recover the file on node2.
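The steps above can be sketched as a shell session. This is a hedged sketch, not a transcript from the setup: hostnames, brick paths, mount point, and file names are illustrative assumptions, and each part must run on the node indicated in the comments.

```shell
# On any peer: create a 4+2 disperse volume from 2 bricks each on
# node2..node4 (brick paths are illustrative), then enable bitrot.
gluster volume create ozone disperse 6 redundancy 2 \
  node2:/bricks/b1 node2:/bricks/b2 \
  node3:/bricks/b1 node3:/bricks/b2 \
  node4:/bricks/b1 node4:/bricks/b2
gluster volume start ozone
gluster volume bitrot ozone enable
gluster volume bitrot ozone scrub-frequency hourly

# On a client: mount via FUSE, create 5 files and 2 hardlinks.
mount -t glusterfs node2:/ozone /mnt/ozone
for i in 1 2 3 4 5; do echo "data $i" > "/mnt/ozone/file$i"; done
ln /mnt/ozone/file1 /mnt/ozone/hlink1
ln /mnt/ozone/file2 /mnt/ozone/hlink2

# On node2: corrupt one hardlink directly on the brick backend.
echo "corruption" >> /bricks/b1/hlink1

# After the next scrub run: confirm detection, delete the bad copy on
# the backend, and access the file from the mountpoint to trigger heal.
grep -i corrupt /var/log/glusterfs/scrub.log
rm -f /bricks/b1/hlink1 /bricks/b1/file1
cat /mnt/ozone/hlink1
```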
After step 6, the file and the hardlink do get recovered, but they continue to have the corrupted data.
A good copy of the file should get recovered.
Proposing this as a blocker, since data, once corrupted, remains corrupted.
This reduces the redundancy that comes along with a disperse volume, without the user's knowledge.
A few updates on what happened during the day while trying to debug this issue:
1. Tried the same steps without bitrot, on a plain disperse volume. If there is no scrubber involved to mark the file as bad, then recovery of the file works as expected at the outset. (However, further testing would be required to confidently claim this.)
2. In the setup that was shared by Kotresh, this behaviour was consistently reproduced not just for hardlinks/softlinks but even for regular files.
3. Had missed deleting the file entry from the .glusterfs folder. Redid the steps mentioned in the description. This time the file gets recovered not with the corrupted data, but with NO data: it is an empty file, which continues to remain empty. Multiple attempts to manually heal the file using 'gluster volume heal <volname>' have no effect.
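For completeness, deleting a file "fully" on the backend means removing both the named entry and its gfid hardlink under .glusterfs. A minimal sketch, assuming an illustrative brick path and file name (the gfid comes from the file's trusted.gfid xattr, and the backend keeps a hardlink at .glusterfs/<first 2 hex>/<next 2 hex>/<gfid-as-uuid>):

```shell
# Run on the node holding the bad copy; paths are assumptions.
BRICK=/bricks/b1
F="$BRICK/hlink1"

# Read the gfid from the trusted.gfid xattr (hex-encoded).
g=$(getfattr -n trusted.gfid -e hex "$F" 2>/dev/null \
      | awk -F'0x' '/trusted.gfid/ {print $2}')

# Reformat the 32 hex chars as the uuid used in the .glusterfs path.
uuid="${g:0:8}-${g:8:4}-${g:12:4}-${g:16:4}-${g:20:12}"

# Remove both the named entry and the gfid hardlink, then stat/cat the
# file from the mountpoint to trigger a heal.
rm -f "$F" "$BRICK/.glusterfs/${g:0:2}/${g:2:2}/$uuid"
```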
To sum it up, recovery of a corrupted file is not working as expected on a disperse volume. Data corruption, with no way to recover, silently leaves the system in a -1 redundancy state.
The lookup on the deleted file should have cleaned the inode context in which bitrot had marked the file bad in memory. For some reason this is not happening with EC volumes; it needs further investigation.
Well, we have a workaround here: if the brick is restarted, healing happens successfully.
Recreated the issue on build 3.7.9-7 and tested the workaround of restarting the brick process. The file does get healed successfully.
Will execute a few cases in and around this workaround, to ensure there is no unexpected impact on the rest of the functionality.
Follow-up to comment 10, validating the workaround:
Killing the brick process with 'kill -15' and restarting it with 'gluster volume start <volname> force' does help in recovering the file. We no longer see the recovered file empty.
Impact of the workaround:
The only known/recommended way of restarting a brick process is to start the volume with force, which in turn restarts the scrubber as well. All of the volume's scrub status (number of files scrubbed, number of files skipped, duration, last completed scrub time) is reset.
Had the information about (other) corruptions also been lost, it would have been a concern, as the user would have had to wait for another scrub run. But that information remains, and shows up correctly in the scrub status output.
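The scrub bookkeeping discussed above can be compared before and after the restart with the scrub status command (the volume name 'ozone' is the illustrative one from this report):

```shell
# Shows per-node scrub counters (files scrubbed/skipped, duration,
# last completed scrub time) and the corrupted objects detected so far.
gluster volume bitrot ozone scrub status
```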
To sum it up, (1) 'kill -15 <brick pid>' followed by (2) 'gluster volume start <volname> force' can be accepted as a workaround to recover a corrupted file on a disperse volume.
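The two workaround steps as a shell sketch; the pid lookup pattern and volume name are assumptions for illustration, since the actual brick path determines the glusterfsd command line to match:

```shell
# (1) Find and gracefully stop the brick process serving the bad brick
# (the 'bricks/b1' pattern is illustrative).
BRICK_PID=$(pgrep -f 'glusterfsd.*bricks/b1')
kill -15 "$BRICK_PID"

# (2) Respawn the dead brick process; 'start force' starts only the
# bricks that are not already running. Note this restarts the scrubber
# and resets the scrub counters, as described above.
gluster volume start ozone force

# Trigger and verify the heal.
gluster volume heal ozone
gluster volume heal ozone info
```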
Atin/Kotresh, please do write back if you see any other concern with 'volume start force'.
Fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4
Tested and verified this on the build glusterfs-3.8.4-3.el7rhgs.x86_64
Followed the steps mentioned in the description multiple times, with hardlinks created at various directory levels, and was able to recover successfully every time.
Did see an issue with the scrubbed/skipped file counts, but that is not related to the issue for which this BZ was raised.
Moving this BZ to verified in 3.2. The console logs are attached.
Created attachment 1217863 [details]
Server and client logs
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.