Bug 1374565 - [Bitrot]: Recovery fails of a corrupted hardlink (and the corresponding parent file) in a disperse volume
Alias: None
Product: GlusterFS
Classification: Community
Component: bitrot
Version: 3.8
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
Depends On: 1341934 1373520 1374564
Blocks: 1374567
Reported: 2016-09-09 04:57 UTC by Kotresh HR
Modified: 2016-09-20 05:05 UTC (History)
11 users

Fixed In Version: glusterfs-3.8.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1374564
: 1374567 (view as bug list)
Last Closed: 2016-09-16 18:28:44 UTC
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:


Description Kotresh HR 2016-09-09 04:57:38 UTC
+++ This bug was initially created as a clone of Bug #1374564 +++

+++ This bug was initially created as a clone of Bug #1373520 +++

+++ This bug was initially created as a clone of Bug #1341934 +++

Description of problem:
Have a 4-node cluster with a 1 x (4+2) disperse volume named ozone. Enable bitrot and set the scrubber frequency to hourly. Create files/directories via fuse/nfs and create a couple of hardlinks as well. Corrupt one of the hardlinks from the backend brick path and wait for the scrubber to mark it as corrupted. Now follow the standard procedure for recovering a corrupted file: delete it on the backend and access it from the mountpoint. After recovery, the recovered file still has the same contents it had when it was corrupted.

Version-Release number of selected component (if applicable):

How reproducible:
Hit multiple times

Steps to Reproduce:

1. Have a 4-node cluster. Create a 4+2 disperse volume on node2, node3 and node4 by using 2 bricks each from every node.
2. Enable bitrot and mount it via fuse. Create 5 files and 2 hardlinks.
3. Go to the brick backend path of node2, and append a line to one of the hardlinks.
4. Verify using 'cat' that the hardlink as well as the parent file get corrupted at the backend. 
5. Wait for the scrubber to finish its run, and verify that /var/log/glusterfs/scrub.log detects the corruption.
6. Delete the hardlink (and the parent file) from the backend brick path of node2 and access the file from the mountpoint, hoping that self-heal will recover the file on node2.
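Steps 3-4 above hold because a hardlink is just a second directory entry for the same inode; there is only one copy of the data, so appending through either name "corrupts" both. A minimal sketch on any local filesystem (throwaway temp paths, nothing gluster-specific):

```shell
# Hardlinks share one inode: writing through one name changes what
# every other name sees. Paths here are temp files, not brick paths.
tmp=$(mktemp -d)
echo "original data" > "$tmp/file-1"
ln "$tmp/file-1" "$tmp/hlink-1"        # second name, link count becomes 2
echo "corruption" >> "$tmp/hlink-1"    # "corrupt" via the hardlink name
cat "$tmp/file-1"                      # the original name shows the appended line too
rm -rf "$tmp"
```

This is why step 4 verifies corruption through both the hardlink and the parent file with a single append.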

Actual results:
After step 6, the file and the hardlink do get recovered, but they continue to have the corrupted data.

Expected results:
A good copy of the file should get recovered.

A few updates about what happened during the day while trying to debug this issue.

1. Tried the same steps without bitrot, with a plain disperse volume. If there is no scrubber involved which marks the file as bad, then the recovery of the file works as expected at the outset. (However further testing would be required to confidently claim the same)

2. In the setup that was shared by Kotresh, this behaviour was consistently reproduced not just for hardlinks/softlinks but even for regular files.

3. Had missed deleting the file entry from the .glusterfs folder. Redid the steps mentioned in the description. THIS time, the file gets recovered not with the corrupted data, but with NO data. It is an empty file, which continues to remain empty. Multiple attempts to manually heal the file using 'gluster volume heal <volname>' have no effect.

To sum it up, recovery of (corrupted) file is not working as expected in a disperse volume. Data corruption (and no way to recover) silently leaves the system in a -1 redundancy state.

EC Team Update:

I was able to reproduce the issue 

1 - Without Bit-rot - Corrupting the file on the backend and deleting it from its path and also from .glusterfs. Accessing the file from the mount point successfully heals it. No data loss and no data corruption.
2 - With Bit-rot - Corrupting the file on the backend and deleting it from its path and also from .glusterfs. Accessing the file from the mount point DOES NOT heal it.

I tried to debug the [2] and it looks like bit-rot is maintaining the trusted.bit-rot.bad-file=0x3100 xattr in memory.

Entry heal and metadata heal have been happening successfully. However, data heal is not happening.
When data heal starts, shd tries to open this file from the bad as well as the good copy, but the open on the bad copy fails. I checked the brick logs and found the following error messages:

[2016-06-02 13:23:13.678342] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: b6cbec17-d66f-42b3-b088-b9c917139bc6 is a bad object. Returning
[2016-06-02 13:23:13.678472] E [MSGID: 115070] [server-rpc-fops.c:1472:server_open_cbk] 0-nash-server: 2411: OPEN /file-3 (b6cbec17-d66f-42b3-b088-b9c917139bc6) ==> (Input/output error) [Input/output error]
[2016-06-02 13:23:14.565096] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: 24b01cf8-eb2a-4896-ac1d-1bf085bd2623 is a bad object. Returning
[2016-06-02 13:23:14.565308] E [MSGID: 115070] [server-rpc-fops.c:1472:server_open_cbk] 0-nash-server: 2486: OPEN /file-6 (24b01cf8-eb2a-4896-ac1d-1bf085bd2623) ==> (Input/output error) [Input/output error]
[2016-06-02 13:23:14.893098] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: 65faad93-5bf6-47c5-9b7c-7db281c88882 is a bad object. Returning
[2016-06-02 13:23:14.893202] E [MSGID: 115070] [server-rpc-fops.c:1472:server_open_cbk] 0-nash-server: 2515: OPEN /file-7 (65faad93-5bf6-47c5-9b7c-7db281c88882) ==> (Input/output error) [Input/output error]
[2016-06-02 13:23:15.619885] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: b6cbec17-d66f-42b3-b088-b9c917139bc6 is a bad object. Returning

As per the comment on br_stub_check_bad_object function -

 * The possible return values from br_stub_is_bad_object () are:
 * 1) 0  => as per the inode context object is not bad
 * 2) -1 => Failed to get the inode context itself
 * 3) -2 => As per the inode context object is bad
 * Both -ve values means the fop which called this function is failed
 * and error is returned upwards.
In our case it is returning -2 => as per the inode context, the object is bad.
It seems that even after deletion of the files from the backend, the inode context still exists in memory, contains trusted.bit-rot.bad-file=0x3100, and returns an error.

I tried killing the brick process on which the file was deleted and restarting it. The heal then succeeded immediately.
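The restart observation fits the diagnosis: the bad-object flag lives in process memory, not on disk, so deleting the file on the backend does not invalidate it. Ordinary POSIX semantics show the same pattern in miniature: once a process holds a reference to an inode, removing the name does not release the inode or the state reachable through it. A local-filesystem sketch (temp paths, not gluster-specific):

```shell
# A held reference outlives deletion of the name, just as the brick
# process's in-memory inode context outlived the backend delete.
tmp=$(mktemp -d)
echo "good data" > "$tmp/file-7"
exec 3< "$tmp/file-7"   # the process takes a reference to the inode
rm "$tmp/file-7"        # the name is gone from the "backend"
cat <&3                 # the process still reads the old inode's data
exec 3<&-               # only dropping the reference releases the inode
rm -rf "$tmp"
```

Killing the brick process drops all such references at once, which is why the heal succeeded immediately after the restart.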

Without restart - 
[root@kotresh-3 nash]# getfattr -d -m . -e hex file-7
# file: file-7

[root@kotresh-4 nash]# getfattr -d -m . -e hex file-7
# file: file-7

Comment 1 Worker Ant 2016-09-09 04:59:54 UTC
REVIEW: http://review.gluster.org/15433 (feature/bitrot: Fix recovery of corrupted hardlink) posted (#1) for review on release-3.8 by Kotresh HR (khiremat@redhat.com)

Comment 2 Worker Ant 2016-09-09 14:05:10 UTC
COMMIT: http://review.gluster.org/15433 committed in release-3.8 by Raghavendra Bhat (raghavendra@redhat.com) 
commit 22ea98a31f147bcd1e4643c2b77f503c63b03a4e
Author: Kotresh HR <khiremat@redhat.com>
Date:   Tue Sep 6 18:28:42 2016 +0530

    feature/bitrot: Fix recovery of corrupted hardlink
    When a file with hardlinks is corrupted in an ec volume,
    the recovery steps mentioned were not working.
    Only the name and metadata were healed, but not the data.
    The bad file marker in the inode context is not removed.
    Hence when self heal tries to open the file for data
    healing, it fails with EIO.
    Bitrot deletes the inode context during forget.
    Briefly, the recovery involves the following steps.
       1. Delete the entry marked with the bad file xattr
          from the backend. Delete all the hardlinks including
          the .glusterfs hardlink as well.
       2. Access each hardlink of the file, including the
          original, from the mount.
    Step 2 sends a lookup to the brick where the files
    were deleted from the backend, which returns ENOENT. On
    ENOENT, the server xlator forgets the inode if there are
    no dentries associated with it. But in case of hardlinks,
    forget won't be called, as dentries (other hardlink
    files) are associated with the inode. Hence bitrot-stub
    won't delete its context, failing the data self heal.
    Bitrot-stub should delete the inode context on getting
    ENOENT during lookup.
    >Change-Id: Ice6adc18625799e7afd842ab33b3517c2be264c1
    >BUG: 1373520
    >Signed-off-by: Kotresh HR <khiremat@redhat.com>
    >Reviewed-on: http://review.gluster.org/15408
    >Smoke: Gluster Build System <jenkins@build.gluster.org>
    >NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    >CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    >Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
    (cherry picked from commit b86a7de9b5ea9dcd0a630dbe09fce6d9ad0d8944)
    Change-Id: Ice6adc18625799e7afd842ab33b3517c2be264c1
    BUG: 1374565
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: http://review.gluster.org/15433
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Raghavendra Bhat <raghavendra@redhat.com>
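The root cause described in the commit comes down to ordinary link-count semantics: removing one name of a multiply-linked file does not release the inode, so per-inode state (here, the in-memory bad-object flag) survives until the last dentry is gone. A local-filesystem sketch of that behaviour (temp paths, not gluster-specific):

```shell
# Unlinking one name of a hardlinked file leaves the inode alive via the
# remaining dentry, mirroring why forget (and the context cleanup it
# triggers) was never called for hardlinked files.
tmp=$(mktemp -d)
echo "data" > "$tmp/file"
ln "$tmp/file" "$tmp/hlink"
rm "$tmp/file"              # delete one name, as in recovery step 1
cat "$tmp/hlink"            # the inode is still alive via the other dentry
rm "$tmp/hlink"             # only now does the link count reach zero
rm -rf "$tmp"
```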

Comment 3 Niels de Vos 2016-09-12 05:37:21 UTC
All 3.8.x bugs are now reported against version 3.8 (without .x). For more information, see http://www.gluster.org/pipermail/gluster-devel/2016-September/050859.html

Comment 4 Niels de Vos 2016-09-16 18:28:44 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.4, please open a new bug report.

glusterfs-3.8.4 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/announce/2016-September/000060.html
[2] https://www.gluster.org/pipermail/gluster-users/
