+++ This bug was initially created as a clone of Bug #1374565 +++
+++ This bug was initially created as a clone of Bug #1374564 +++
+++ This bug was initially created as a clone of Bug #1373520 +++
+++ This bug was initially created as a clone of Bug #1341934 +++
Description of problem:
Have a 4-node cluster with a 1 x (4+2) disperse volume named ozone. Enable bitrot and set the scrubber frequency to hourly. Create files/directories via fuse/nfs and create a couple of hardlinks as well. Corrupt one of the hardlinks from the backend brick path and wait for the scrubber to mark it as corrupted. Now follow the standard procedure for recovering a corrupted file: delete it on the backend and access it from the mountpoint. After recovery, the recovered file has the same contents it had when it was corrupted.
Version-Release number of selected component (if applicable):

How reproducible:
Hit multiple times
Steps to Reproduce:
1. Have a 4-node cluster. Create a 4+2 disperse volume on node2, node3 and node4, using 2 bricks from each node.
2. Enable bitrot and mount it via fuse. Create 5 files and 2 hardlinks.
3. Go to the brick backend path of node2, and append a line to one of the hardlinks.
4. Verify using 'cat' that the hardlink as well as the original file show the corruption at the backend (they share the same inode).
5. Wait for the scrubber to finish its run, and verify that /var/log/glusterfs/scrub.log detects the corruption.
6. Delete the hardlink (and the original file) from the backend brick path of node2 and access the file from the mountpoint, expecting afr to recover the file on node2.
Actual results:
After step 6, the file and the hardlink do get recovered, but they still contain the corrupted data.
Expected results:
A good copy of the file should be recovered.
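The corruption step (steps 3-4 above) can be simulated locally without a cluster; the brick path below is a stand-in created with mktemp, not a real gluster brick:

```shell
# Simulated backend brick directory (hypothetical path, not a real brick)
BRICK=$(mktemp -d)
echo "original data" > "$BRICK/file-1"
ln "$BRICK/file-1" "$BRICK/hlink-1"          # hardlink: same inode, two names
# Step 3: corrupt by appending directly at the backend, bypassing gluster
echo "stray corruption" >> "$BRICK/hlink-1"
# Step 4: both names show the corruption, since they share one inode
cat "$BRICK/file-1"
```

On a real volume the scrubber would later mark this file with the trusted.bit-rot.bad-file xattr on the affected brick.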
A few updates on what happened during the day while trying to debug this issue.
1. Tried the same steps without bitrot, with a plain disperse volume. If there is no scrubber involved to mark the file as bad, recovery of the file works as expected at the outset. (However, further testing would be required to claim this confidently.)
2. In the setup that was shared by Kotresh, this behaviour was consistently reproduced not just for hardlinks/softlinks but even for regular files.
3. Had missed deleting the file entry from the .glusterfs folder. Redid the steps mentioned in the description. This time the file gets recovered not with the corrupted data, but with NO data. It is an empty file, which continues to remain empty. Multiple attempts to manually heal the file using 'gluster volume heal <volname>' have no effect.
To sum it up, recovery of a (corrupted) file is not working as expected in a disperse volume. Data corruption (with no way to recover) silently leaves the system in a -1 redundancy state.
EC Team Update:
I was able to reproduce the issue:
1 - Without bitrot: corrupting the file from the backend and deleting it from its path as well as from .glusterfs. Accessing the file from the mount point successfully heals it. No data loss and no data corruption.
2 - With bitrot: corrupting the file from the backend and deleting it from its path as well as from .glusterfs. Accessing the file from the mount point DOES NOT heal it.
I tried to debug the issue, and it looks like bit-rot is maintaining the trusted.bit-rot.bad-file=0x3100 xattr in memory.
Entry heal and metadata heal have been happening successfully. However, data heal is not happening.
When data heal starts, shd tries to open this file from the bad as well as the good copies, but the open on the bad copy fails. I checked the brick logs and found the following error messages:
[2016-06-02 13:23:13.678342] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: b6cbec17-d66f-42b3-b088-b9c917139bc6 is a bad object. Returning
[2016-06-02 13:23:13.678472] E [MSGID: 115070] [server-rpc-fops.c:1472:server_open_cbk] 0-nash-server: 2411: OPEN /file-3 (b6cbec17-d66f-42b3-b088-b9c917139bc6) ==> (Input/output error) [Input/output error]
[2016-06-02 13:23:14.565096] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: 24b01cf8-eb2a-4896-ac1d-1bf085bd2623 is a bad object. Returning
[2016-06-02 13:23:14.565308] E [MSGID: 115070] [server-rpc-fops.c:1472:server_open_cbk] 0-nash-server: 2486: OPEN /file-6 (24b01cf8-eb2a-4896-ac1d-1bf085bd2623) ==> (Input/output error) [Input/output error]
[2016-06-02 13:23:14.893098] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: 65faad93-5bf6-47c5-9b7c-7db281c88882 is a bad object. Returning
[2016-06-02 13:23:14.893202] E [MSGID: 115070] [server-rpc-fops.c:1472:server_open_cbk] 0-nash-server: 2515: OPEN /file-7 (65faad93-5bf6-47c5-9b7c-7db281c88882) ==> (Input/output error) [Input/output error]
[2016-06-02 13:23:15.619885] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: b6cbec17-d66f-42b3-b088-b9c917139bc6 is a bad object. Returning
As per the comment on the br_stub_check_bad_object function:
* The possible return values from br_stub_is_bad_object () are:
* 1) 0 => as per the inode context object is not bad
* 2) -1 => Failed to get the inode context itself
* 3) -2 => As per the inode context object is bad
* Both -ve values means the fop which called this function is failed
* and error is returned upwards.
In our case it is returning -2 => as per the inode context, the object is bad.
It seems that even after deletion of the files from the backend, the inode context still exists in memory, contains trusted.bit-rot.bad-file=0x3100, and returns an error.
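The negative-return convention quoted above can be illustrated with a small shell analogue (hedged: the real check is C code inside bit-rot-stub; the function name is reused for illustration, and -1/-2 become exit statuses 1/2 since shell statuses are unsigned):

```shell
# Shell stand-in for the return convention of br_stub_check_bad_object:
check_bad_object() {
    case "$1" in
        good)   return 0 ;;  # inode context says the object is not bad
        no-ctx) return 1 ;;  # failed to get the inode context at all
        bad)    return 2 ;;  # inode context says the object is bad
    esac
}
# Any non-zero status fails the calling fop; a stale "bad" mark in the
# in-memory context surfaces as EIO on open, as in the brick logs above.
check_bad_object bad || echo "open fails with EIO"
```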
I tried killing the brick process on which the file was deleted and restarting it. Heal then happened immediately and successfully.
Without restart -
[root@kotresh-3 nash]# getfattr -d -m . -e hex file-7
# file: file-7
[root@kotresh-4 nash]# getfattr -d -m . -e hex file-7
# file: file-7
REVIEW: http://review.gluster.org/15434 (feature/bitrot: Fix recovery of corrupted hardlink) posted (#1) for review on release-3.9 by Kotresh HR (email@example.com)
COMMIT: http://review.gluster.org/15434 committed in release-3.9 by Aravinda VK (firstname.lastname@example.org)
Author: Kotresh HR <email@example.com>
Date: Tue Sep 6 18:28:42 2016 +0530
feature/bitrot: Fix recovery of corrupted hardlink
When a file with hardlinks is corrupted in an ec volume,
the recovery steps mentioned were not working.
Only name and metadata were healing, but not the data.
The bad-file marker in the inode context was not removed.
Hence when self-heal tried to open the file for data
healing, it failed with EIO.
Bitrot deletes the inode context during forget.
Briefly, the recovery involves the following steps.
1. Delete the entry marked with the bad-file xattr
from the backend. Delete all the hardlinks, including
the .glusterfs hardlink as well.
2. Access each hardlink of the file, including the
original, from the mount.
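Step 1 above can be sketched against a simulated brick layout (hedged: the directory is a mktemp stand-in; the gfid is taken from the logs above, and the .glusterfs/<aa>/<bb>/<gfid> path follows gluster's usual backend convention):

```shell
# Simulated brick with a user file, a hardlink, and the gfid hardlink
B=$(mktemp -d)
GFID=b6cbec17-d66f-42b3-b088-b9c917139bc6
mkdir -p "$B/.glusterfs/b6/cb"
echo data > "$B/file-3"
ln "$B/file-3" "$B/hlink-3"                   # user-visible hardlink
ln "$B/file-3" "$B/.glusterfs/b6/cb/$GFID"    # internal gfid hardlink
# Step 1: delete every name, including the .glusterfs gfid link
rm "$B/file-3" "$B/hlink-3" "$B/.glusterfs/b6/cb/$GFID"
# Step 2 is done on the real volume: stat each name from the mount point
# (e.g. 'stat /mnt/ozone/file-3') so self-heal recreates the file.
```

Missing the gfid link in step 1 is exactly the mistake noted earlier in this bug, which left the file unrecoverable.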
Step 2 sends a lookup to the brick where the files
were deleted from the backend, which returns ENOENT. On
ENOENT, the server xlator forgets the inode if there are
no dentries associated with it. But in the case of hardlinks,
forget won't be called, as dentries (the other hardlink
files) are still associated with the inode. Hence bitrot-stub
won't delete its context, failing the data self-heal.
Bitrot-stub should delete the inode context on getting
ENOENT during lookup.
>Signed-off-by: Kotresh HR <firstname.lastname@example.org>
>Smoke: Gluster Build System <email@example.com>
>NetBSD-regression: NetBSD Build System <firstname.lastname@example.org>
>CentOS-regression: Gluster Build System <email@example.com>
>Reviewed-by: Raghavendra Bhat <firstname.lastname@example.org>
(cherry picked from commit b86a7de9b5ea9dcd0a630dbe09fce6d9ad0d8944)
Signed-off-by: Kotresh HR <email@example.com>
Smoke: Gluster Build System <firstname.lastname@example.org>
CentOS-regression: Gluster Build System <email@example.com>
NetBSD-regression: NetBSD Build System <firstname.lastname@example.org>
Reviewed-by: Atin Mukherjee <email@example.com>
Reviewed-by: Aravinda VK <firstname.lastname@example.org>