Bug 1362129

Summary: rename of a file can cause data loss in a replica/arbiter volume configuration
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Ravishankar N <ravishankar>
Component: replicate Assignee: Ravishankar N <ravishankar>
Status: CLOSED ERRATA QA Contact: Anees Patel <anepatel>
Severity: urgent Docs Contact:
Priority: high    
Version: rhgs-3.1 CC: amukherj, anepatel, asriram, bugs, nchilaka, pkarampu, ravishankar, rcyriac, rhinduja, rhs-bugs, sheggodu, srmukher, storage-qa-internal
Target Milestone: --- Keywords: Triaged, ZStream
Target Release: RHGS 3.4.z Batch Update 3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.12.2-33 Doc Type: Bug Fix
Doc Text:
In a replica 3 volume, renaming a file while the brick holding the 'good copy' of the file was down could result in the file being removed during self-heal, leading to data loss. With this release, lookup of a file fails if no good copy (as determined by AFR xattrs) is found, so the rename cannot proceed and the data loss is limited.
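
The rule in the doc text can be illustrated with a toy model. This is a minimal sketch and not the actual AFR code: it assumes a brick counts as a good copy when it is up, is not the arbiter, and no up brick holds a non-zero trusted.afr.* value blaming it. The function names, the brick indices, and the use of EIO as the failure code are assumptions made for illustration.

```python
import errno


def is_nonzero(hex_value):
    """True if a trusted.afr.* value has any pending counter set."""
    return int(hex_value, 16) != 0


def good_copies(xattrs_by_brick, arbiter=None):
    """Return the indices of up bricks that no up brick blames.

    xattrs_by_brick maps brick index -> {blamed brick index: hex value}
    and lists only the bricks that are currently up. The arbiter stores
    no file data, so it is never counted as a good copy.
    """
    up = set(xattrs_by_brick)
    blamed = {
        target
        for accusations in xattrs_by_brick.values()
        for target, value in accusations.items()
        if is_nonzero(value)
    }
    return sorted(up - blamed - {arbiter})


def lookup(xattrs_by_brick, arbiter=None):
    """Fail the lookup when no good copy is reachable (EIO is assumed here)."""
    if not good_copies(xattrs_by_brick, arbiter):
        raise OSError(errno.EIO, "no good copy found")
    return "ok"


# Hypothetical state: brick 1 is down, the arbiter (brick 2) blames brick 0.
state = {0: {}, 2: {0: "0x000000010000000000000000"}}
print(good_copies(state, arbiter=2))
```

In this model, a rename issued from the mount would never start, because the lookup that precedes it raises an error instead of returning the stale copy.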
Story Points: ---
Clone Of: 1357000 Environment:
Last Closed: 2019-02-04 07:41:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1357000, 1366818    
Bug Blocks: 1645480    

Description Ravishankar N 2016-08-01 10:47:59 UTC
+++ This bug was initially created as a clone of Bug #1357000 +++

Description of problem:
=========================
There is a case where renaming a file leads to data loss.



Steps to Reproduce:
===================
1. Create a 1x(2+1) volume with bricks named, say, db1, db2 and ab1.
2. Mount the volume using FUSE.
3. Create a directory, say dir1.
4. Bring down the first data brick (db1).
5. Create a file, say f1, under dir1 with some contents.
6. Note down the getfattr details from both db2 and ab1.
7. Bring down db2 and bring up db1.
8. Trigger a heal.
9. Rename f1 to f2.
10. Bring up db2 and trigger a heal.
11. From the mount, do a cat of f2.

We get EIO
[root@dhcp42-93 db1_Down]# cat renamdatafile 
cat: renamdatafile: Input/output error

client logs:
[2016-07-15 12:25:40.299090] W [MSGID: 108008] [afr-read-txn.c:244:afr_read_txn] 0-arbit-replicate-0: Unreadable subvolume -1 found with event generation 7 for gfid 091d29dd-f4e1-49da-8353-1686e59818de. (Possible split-brain)
[2016-07-15 12:25:40.301196] E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-arbit-replicate-0: Failing FGETXATTR on gfid 091d29dd-f4e1-49da-8353-1686e59818de: split-brain observed. [Input/output error]
[2016-07-15 12:25:40.302017] W [MSGID: 108027] [afr-common.c:2245:afr_discover_done] 0-arbit-replicate-0: no read subvols for (null)
[2016-07-15 12:25:40.305693] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 796: READ => -1 gfid=091d29dd-f4e1-49da-8353-1686e59818de fd=0x7fcbf801579c (Input/output error)
[2016-07-15 12:25:40.303768] W [MSGID: 108008] [afr-read-txn.c:244:afr_read_txn] 0-arbit-replicate-0: Unreadable subvolume -1 found with event generation 7 for gfid 091d29dd-f4e1-49da-8353-1686e59818de. (Possible split-brain)
[2016-07-15 12:25:40.305666] E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-arbit-replicate-0: Failing READ on gfid 091d29dd-f4e1-49da-8353-1686e59818de: split-brain observed. [Input/output error]
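
The client log lines above are not in strict timestamp order, so it can help to parse and sort them before reading. The sketch below is a hedged example: the line layout (timestamp, level letter, optional MSGID, source location, message) is inferred only from the lines quoted in this report, and the function names are hypothetical.

```python
import re

# Layout inferred from the quoted log lines; other gluster logs may differ.
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<source>[^\]]+)\] "
    r"(?P<rest>.*)$"
)
GFID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    gfid = GFID.search(record["rest"])
    record["gfid"] = gfid.group(0) if gfid else None
    return record


def sorted_records(lines):
    """Parse all matching lines and order them by timestamp string."""
    records = [r for r in map(parse_line, lines) if r is not None]
    return sorted(records, key=lambda r: r["ts"])
```

Sorting on the timestamp string works here because every quoted timestamp has the same width; filtering the result on the gfid field collects all messages about one file.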



db1 getfattr:
[root@dhcp43-157 ~]# getfattr -d -m . -e hex /bricks/brick2/arbit/db1_Down/renamdatafile
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick2/arbit/db1_Down/renamdatafile
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.arbit-client-0=0x000000030000000000000000
trusted.afr.arbit-client-1=0x000000010000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x02000000000000005788cb3000085f14
trusted.gfid=0x091d29ddf4e149da83531686e59818de
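
To read the trusted.afr.* values above, it helps to split each one into its counters. The sketch below assumes the usual AFR changelog layout of three 32-bit big-endian counters (data, metadata, entry pending operations); the function name is hypothetical.

```python
import struct


def decode_afr_pending(hex_value):
    """Split a trusted.afr.* hex value into its three pending counters."""
    text = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    # Assumed layout: three unsigned 32-bit integers in network byte order.
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


for name, value in [
    ("trusted.afr.arbit-client-0", "0x000000030000000000000000"),
    ("trusted.afr.arbit-client-1", "0x000000010000000000000000"),
    ("trusted.afr.dirty", "0x000000000000000000000000"),
]:
    print(name, decode_afr_pending(value))
```

Under that layout, the values quoted above carry pending data operations only: three against client-0 and one against client-1, with no metadata or entry operations pending.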


db2:
[root@dhcp43-153 ~]# getfattr -d -m . -e hex /bricks/brick1/arbit/db1_Down/renamdatafile
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/arbit/db1_Down/renamdatafile
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x091d29ddf4e149da83531686e59818de

ab1:

[root@dhcp43-157 ~]#  getfattr -d -m . -e hex /bricks/brick0/arbit/db1_Down/renamdatafile 
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick0/arbit/db1_Down/renamdatafile
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x091d29ddf4e149da83531686e59818de
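
The hex values in the getfattr output can be tied back to the client logs by converting them to readable form. This is a small illustrative sketch with hypothetical helper names: it formats trusted.gfid as the dashed UUID used in the log messages and decodes security.selinux as NUL-terminated ASCII.

```python
import uuid


def strip_0x(hex_value):
    """Drop a leading 0x prefix if present."""
    return hex_value[2:] if hex_value.lower().startswith("0x") else hex_value


def gfid_to_uuid(hex_value):
    """Format a trusted.gfid hex value as a dashed UUID string."""
    return str(uuid.UUID(hex=strip_0x(hex_value)))


def hex_to_text(hex_value):
    """Decode a hex-dumped xattr as ASCII, dropping trailing NUL bytes."""
    return bytes.fromhex(strip_0x(hex_value)).rstrip(b"\x00").decode("ascii")


print(gfid_to_uuid("0x091d29ddf4e149da83531686e59818de"))
print(hex_to_text(
    "0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000"
))
```

The first value matches the gfid reported in the split-brain log messages, which confirms that all three bricks hold the same file identity even though only one of them carries AFR changelog xattrs.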


Volume Name: arbit
Type: Replicate
Volume ID: 0069b5a7-bfdf-4f59-86ec-851f500ed902
Status: Started
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.43.129:/bricks/brick0/arbit
Brick2: 10.70.43.153:/bricks/brick1/arbit
Brick3: 10.70.43.129:/bricks/brick2/arbit (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
[root@dhcp43-157 ~]# 





Expected results:


Additional info:

--- Additional comment from Vijay Bellur on 2016-07-27 00:20:31 EDT ---

REVIEW: http://review.gluster.org/15017 (afr: some coverity fixes) posted (#1) for review on release-3.7 by Ravishankar N (ravishankar)

--- Additional comment from Ravishankar N on 2016-07-27 00:22:42 EDT ---

Ignore comment #1, that patch is for a different bug.

Comment 3 Atin Mukherjee 2016-08-30 10:55:23 UTC
http://review.gluster.org/15226 posted upstream for review.

Comment 23 Anees Patel 2019-01-08 08:15:09 UTC
Verified the fix per the above test plan for arbiter and replica 3 volumes on the latest BU3 build:
# rpm -qa | grep gluster
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-api-3.12.2-36.el7rhgs.x86_64
glusterfs-server-3.12.2-36.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-36.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-36.el7rhgs.x86_64


Setting this to verified.

Comment 24 Srijita Mukherjee 2019-01-20 19:27:09 UTC
The doc text has been updated. Kindly review the technical accuracy.

Comment 25 Ravishankar N 2019-01-21 04:13:39 UTC
Made a minor change. Looks good to me otherwise.

Comment 27 errata-xmlrpc 2019-02-04 07:41:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263