Bug 1374567

Summary: [Bitrot]: Recovery of a corrupted hardlink (and the corresponding parent file) fails in a disperse volume
Product: [Community] GlusterFS Reporter: Kotresh HR <khiremat>
Component: bitrot Assignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: medium Docs Contact: bugs <bugs>
Priority: unspecified    
Version: 3.9 CC: amukherj, aspandey, bmohanra, bugs, khiremat, pkarampu, rcyriac, rhinduja, rhs-bugs, rmekala, sanandpa
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: v3.9.1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1374565 Environment:
Last Closed: 2017-03-08 09:33:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1341934, 1373520, 1374564, 1374565    
Bug Blocks:    

Description Kotresh HR 2016-09-09 05:09:39 UTC
+++ This bug was initially created as a clone of Bug #1374565 +++

+++ This bug was initially created as a clone of Bug #1374564 +++

+++ This bug was initially created as a clone of Bug #1373520 +++

+++ This bug was initially created as a clone of Bug #1341934 +++

Description of problem:
=======================
Have a 4-node cluster with a 1 x (4+2) disperse volume named ozone. Enable bitrot and set the scrub frequency to hourly. Create files/directories via fuse/nfs and create a couple of hardlinks as well. Corrupt one of the hardlinks from the backend brick path and wait for the scrubber to mark it as corrupted. Now follow the standard procedure for recovering a corrupted file: delete it on the backend and access it from the mountpoint. After recovery, the recovered file still has the same contents it had when it was corrupted.


Version-Release number of selected component (if applicable):
=============================================================


How reproducible:
================
Hit multiple times


Steps to Reproduce:
==================

1. Have a 4-node cluster. Create a 4+2 disperse volume on node2, node3 and node4, using 2 bricks from each of those nodes.
2. Enable bitrot and mount it via fuse. Create 5 files and 2 hardlinks.
3. Go to the brick backend path of node2, and append a line to one of the hardlinks.
4. Verify using 'cat' that the hardlink as well as the parent file get corrupted at the backend. 
5. Wait for the scrubber to finish its run, and verify that /var/log/glusterfs/scrub.log detects the corruption.
6. Delete the hardlink (and the parent file) from the backend brick path of node2 and access the file from the mountpoint, expecting self-heal to recover the file on node2. (A command-level sketch of these steps follows.)
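
For reference, the steps above map roughly to the commands below. The volume name ozone comes from the description; node names, brick paths, file names and sizes are illustrative assumptions, not an exact transcript of the original setup.

# 1. Create and start the 4+2 disperse volume (2 bricks each from node2/3/4)
gluster volume create ozone disperse 6 redundancy 2 \
    node2:/bricks/b1/ozone node2:/bricks/b2/ozone \
    node3:/bricks/b1/ozone node3:/bricks/b2/ozone \
    node4:/bricks/b1/ozone node4:/bricks/b2/ozone
gluster volume start ozone

# 2. Enable bitrot, set hourly scrubbing, mount via fuse, create files and hardlinks
gluster volume bitrot ozone enable
gluster volume bitrot ozone scrub-frequency hourly
mkdir -p /mnt/ozone
mount -t glusterfs node1:/ozone /mnt/ozone
for i in 1 2 3 4 5; do dd if=/dev/urandom of=/mnt/ozone/file-$i bs=1k count=9; done
ln /mnt/ozone/file-3 /mnt/ozone/hlink-3
ln /mnt/ozone/file-5 /mnt/ozone/hlink-5

# 3/4. On node2, append to one hardlink directly on the brick and verify with cat
echo "garbage" >> /bricks/b1/ozone/hlink-3
cat /bricks/b1/ozone/hlink-3 /bricks/b1/ozone/file-3

# 5. After the scrubber run, look for the corruption report
grep -i corrupt /var/log/glusterfs/scrub.log

# 6. Delete the bad copies from node2's brick and access them from the mount
rm -f /bricks/b1/ozone/file-3 /bricks/b1/ozone/hlink-3
stat /mnt/ozone/file-3 /mnt/ozone/hlink-3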

Actual results:
================
After step 6, the file and the hardlink do get recovered, but they continue to have the corrupted data.


Expected results:
=================
A good copy of the file should be recovered.


A few updates from debugging this issue during the day:

1. Tried the same steps without bitrot, on a plain disperse volume. If there is no scrubber involved to mark the file as bad, then recovery of the file works as expected at the outset. (However, further testing would be required to claim the same confidently.)

2. In the setup that was shared by Kotresh, this behaviour was consistently reproduced not just for hardlinks/softlinks but even for regular files.

3. I had missed deleting the file entry from the .glusterfs folder. Redid the steps mentioned in the description, this time removing that entry as well (a sketch of this follows). THIS time, the file gets recovered not with the corrupted data, but with NO data. It is an empty file, and it remains empty. Multiple attempts to manually heal the file using 'gluster volume heal <volname>' have had no effect.
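
For completeness, removing the backend copy including its .glusterfs entry looks roughly like the following on the affected brick. The brick path and file names are illustrative; the gfid value is the one shown for file-7 in the getfattr output later in this report, used purely to illustrate the .glusterfs path layout.

# Find the file's gfid on the affected brick
getfattr -n trusted.gfid -e hex /bricks/b1/ozone/file-7
# e.g. trusted.gfid=0x65faad935bf647c59b7c7db281c88882
#  -> the gfid hardlink lives at <brick>/.glusterfs/65/fa/65faad93-5bf6-47c5-9b7c-7db281c88882

# Remove the file, every hardlink of it, and that gfid hardlink
rm -f /bricks/b1/ozone/file-7 /bricks/b1/ozone/hlink-7
rm -f /bricks/b1/ozone/.glusterfs/65/fa/65faad93-5bf6-47c5-9b7c-7db281c88882

# Then access each hardlink of the file from the mountpoint to trigger heal
stat /mnt/ozone/file-7 /mnt/ozone/hlink-7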

To sum it up, recovery of (corrupted) file is not working as expected in a disperse volume. Data corruption (and no way to recover) silently leaves the system in a -1 redundancy state.


EC Team Update:

I was able to reproduce the issue:

1 - Without Bit-rot - Corrupting the file from the backend and deleting it from its path and also from .glusterfs. Accessing the file from the mount point successfully heals the file. No data loss and no data corruption.
2 - With Bit-rot - Corrupting the file from the backend and deleting it from its path and also from .glusterfs. Accessing the file from the mount point DOES NOT heal the file.

I tried to debug [2], and it looks like bit-rot is maintaining the trusted.bit-rot.bad-file=0x3100 xattr in memory.

Entry heal and metadata heal have been happening successfully; however, data heal is not happening.
When data heal starts, shd tries to open the file on both the bad and the good copies, but the open on the bad copy fails. I checked the brick logs and found the following error messages -

[2016-06-02 13:23:13.678342] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: b6cbec17-d66f-42b3-b088-b9c917139bc6 is a bad object. Returning
[2016-06-02 13:23:13.678472] E [MSGID: 115070] [server-rpc-fops.c:1472:server_open_cbk] 0-nash-server: 2411: OPEN /file-3 (b6cbec17-d66f-42b3-b088-b9c917139bc6) ==> (Input/output error) [Input/output error]
[2016-06-02 13:23:14.565096] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: 24b01cf8-eb2a-4896-ac1d-1bf085bd2623 is a bad object. Returning
[2016-06-02 13:23:14.565308] E [MSGID: 115070] [server-rpc-fops.c:1472:server_open_cbk] 0-nash-server: 2486: OPEN /file-6 (24b01cf8-eb2a-4896-ac1d-1bf085bd2623) ==> (Input/output error) [Input/output error]
[2016-06-02 13:23:14.893098] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: 65faad93-5bf6-47c5-9b7c-7db281c88882 is a bad object. Returning
[2016-06-02 13:23:14.893202] E [MSGID: 115070] [server-rpc-fops.c:1472:server_open_cbk] 0-nash-server: 2515: OPEN /file-7 (65faad93-5bf6-47c5-9b7c-7db281c88882) ==> (Input/output error) [Input/output error]
[2016-06-02 13:23:15.619885] E [MSGID: 116020] [bit-rot-stub.c:566:br_stub_check_bad_object] 0-nash-bitrot-stub: b6cbec17-d66f-42b3-b088-b9c917139bc6 is a bad object. Returning

As per the comment on br_stub_check_bad_object function -

/**
 * The possible return values from br_stub_is_bad_object () are:
 * 1) 0  => as per the inode context object is not bad
 * 2) -1 => Failed to get the inode context itself
 * 3) -2 => As per the inode context object is bad
 * Both -ve values means the fop which called this function is failed
 * and error is returned upwards.
 */
In our case it is returning -2 => as per the inode context, the object is bad.
It seems that even after deletion of the files from the backend, the inode context still exists in memory, still contains trusted.bit-rot.bad-file=0x3100, and therefore returns an error.
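
A quick way to see the on-disk vs. in-memory mismatch is sketched below (the brick path is illustrative; the xattr name and log message are the ones quoted in this report):

# On the brick where the entry was re-created by heal, the bad-file
# xattr is no longer present on disk:
getfattr -n trusted.bit-rot.bad-file -e hex /bricks/b1/nash/file-7
# file-7: trusted.bit-rot.bad-file: No such attribute

# ...yet the brick log keeps reporting the object as bad, because the stale
# flag lives only in bitrot-stub's in-memory inode context:
grep "is a bad object" /var/log/glusterfs/bricks/*.log | tail -n 2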

I killed the brick process on the node where the file was deleted and restarted it. The heal then happened immediately and successfully.
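
As a workaround sketch (standard gluster administration commands, assumed rather than copied from this setup), restarting the affected brick drops the stale in-memory inode context, after which heal proceeds:

# Note the PID of the affected brick and kill it
gluster volume status nash
kill <pid-of-affected-brick>

# Restart the brick process and trigger heal
gluster volume start nash force
gluster volume heal nash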


Without restart - 
[root@kotresh-3 nash]# getfattr -d -m . -e hex file-7
# file: file-7
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.bit-rot.signature=0x010300000000000000096c09038638126e90a32e0c4f7322ebf2db4fc213a09407240994596102f832
trusted.bit-rot.version=0x03000000000000005750188300019e6f
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000020000000000000000
trusted.ec.size=0x000000000000232c
trusted.ec.version=0x00000000000003e900000000000003e9
trusted.gfid=0x65faad935bf647c59b7c7db281c88882


[root@kotresh-4 nash]# getfattr -d -m . -e hex file-7
# file: file-7
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.version=0x000000000000000000000000000003e9
trusted.gfid=0x65faad935bf647c59b7c7db281c88882

Comment 1 Worker Ant 2016-09-09 05:12:13 UTC
REVIEW: http://review.gluster.org/15434 (feature/bitrot: Fix recovery of corrupted hardlink) posted (#1) for review on release-3.9 by Kotresh HR (khiremat)

Comment 2 Worker Ant 2016-09-19 05:37:58 UTC
COMMIT: http://review.gluster.org/15434 committed in release-3.9 by Aravinda VK (avishwan) 
------
commit 8781ea7ddacc602b2f23689fa5f3415aec7253f1
Author: Kotresh HR <khiremat>
Date:   Tue Sep 6 18:28:42 2016 +0530

    feature/bitrot: Fix recovery of corrupted hardlink
    
    Problem:
    When a file with hardlinks is corrupted in an ec volume,
    the recovery steps mentioned were not working.
    Only name and metadata were healing, but not the data.
    
    Cause:
    The bad file marker in the inode context is not removed.
    Hence when self heal tries to open the file for data
    healing, it fails with EIO.
    
    Background:
    The bitrot stub deletes the inode context during forget.
    
    Briefly, the recovery involves the following steps:
       1. Delete the entry marked with the bad file xattr
          from the backend. Delete all the hardlinks,
          including the .glusterfs hardlink, as well.
       2. Access each hardlink of the file, including the
          original, from the mount.
    
    Step 2 sends a lookup to the brick where the files
    were deleted from the backend, which returns ENOENT. On
    ENOENT, the server xlator forgets the inode if there are
    no dentries associated with it. But in the case of hardlinks,
    the forget won't be called as dentries (other hardlink
    files) are still associated with the inode. Hence bitrot-stub
    won't delete its context, failing the data self heal.
    
    Fix:
    Bitrot-stub should delete the inode context on getting
    ENOENT during lookup.
    
    >Change-Id: Ice6adc18625799e7afd842ab33b3517c2be264c1
    >BUG: 1373520
    >Signed-off-by: Kotresh HR <khiremat>
    >Reviewed-on: http://review.gluster.org/15408
    >Smoke: Gluster Build System <jenkins.org>
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Reviewed-by: Raghavendra Bhat <raghavendra>
    (cherry picked from commit b86a7de9b5ea9dcd0a630dbe09fce6d9ad0d8944)
    
    Change-Id: Ice6adc18625799e7afd842ab33b3517c2be264c1
    BUG: 1374567
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/15434
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
    Reviewed-by: Aravinda VK <avishwan>