Bug 1413525

Summary: resolving split-brain using "bigger-file" option fails
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: replicate
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED WONTFIX
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high
Docs Contact:
Priority: low
Version: rhgs-3.2
CC: amukherj, nchilaka, pkarampu, ravishankar, rhs-bugs, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-20 10:08:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Nag Pavan Chilakam 2017-01-16 09:38:17 UTC
Description of problem:
========================
While verifying bug 1403840 ("[GSS] xattr 'replica.split-brain-status' shows the file is in data split-brain but 'heal split-brain latest-mtime' fails"), I came up with a test case to sanity-check resolving a split-brain with the bigger-file option.
When I tried it, the resolution failed.
Although the data heal itself completed successfully, the CLI reports the split-brain resolution as failed, and after that the file shows up as a pending heal.

Ravi (dev) debugged this and found that it is caused by pending metadata heals.


Following is the information:
[root@dhcp35-37 rep2]# gluster v heal rep2  split-brain bigger-file /testbig
Healing /testbig failed: File not in split-brain.
Volume heal failed.


[root@dhcp35-37 rep2]# gluster v heal rep2 info
Brick 10.70.35.116:/rhs/brick1/rep2
/testbig - Is in split-brain

/bigfile 
Status: Connected
Number of entries: 2

Brick 10.70.35.239:/rhs/brick1/rep2
/testbig - Is in split-brain

Status: Connected
Number of entries: 1

[root@dhcp35-37 rep2]# gluster v heal rep2  split-brain bigger-file /testbig
Healing /testbig failed: File not in split-brain.
Volume heal failed.

===> Because of this, the file is left showing as heal pending. I tested two files and both ended up with pending heals.

[root@dhcp35-37 rep2]# gluster v heal rep2 info
Brick 10.70.35.116:/rhs/brick1/rep2
/bigfile 
/testbig 
Status: Connected
Number of entries: 2

Brick 10.70.35.239:/rhs/brick1/rep2
Status: Connected
Number of entries: 0

[root@dhcp35-37 rep2]# 

Backend bricks ==> the data heal was successful (md5sums match on both bricks):
[root@dhcp35-116 ~]# md5sum /rhs/brick1/rep2/testbig
031bf15433a0c324c3c36b03b4ea384c  /rhs/brick1/rep2/testbig

[root@dhcp35-239 ~]# md5sum /rhs/brick1/rep2/testbig
031bf15433a0c324c3c36b03b4ea384c  /rhs/brick1/rep2/testbig
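
Since the failure was attributed to pending metadata heals, a hedged way to confirm that from the bricks (not part of the original report) is to dump the AFR changelog xattrs on each brick's copy of the file. On this volume they would typically be named trusted.afr.rep2-client-0/1, and each value packs three 4-byte pending counters (data, metadata, entry), so a non-zero middle field points to a pending metadata heal:

# Hedged example, not taken from the original report. Run on each brick node.
# Dump all xattrs of the brick copy in hex and look at the
# trusted.afr.rep2-client-<N> values: bytes 0-3 = pending data operations,
# bytes 4-7 = pending metadata operations, bytes 8-11 = pending entry operations.
getfattr -d -m . -e hex /rhs/brick1/rep2/testbig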




 
Volume Name: rep2
Type: Replicate
Volume ID: 778d60b1-981b-4a33-9ed7-a7c09a389fa4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.35.116:/rhs/brick1/rep2
Brick2: 10.70.35.239:/rhs/brick1/rep2
Options Reconfigured:
cluster.self-heal-daemon: disable
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@dhcp35-37 rep2]# gluster v status rep2
Status of volume: rep2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.116:/rhs/brick1/rep2         49160     0          Y       18035
Brick 10.70.35.239:/rhs/brick1/rep2         49160     0          Y       15909
 
Task Status of Volume rep2
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp35-37 rep2]#





Version-Release number of selected component (if applicable):
=================================
3.8.4-11

How reproducible:
====================
Mostly

Steps to Reproduce:
1. Create a 1x2 replicate volume.
2. Disable client-side and server-side healing.
3. Write a file f1 with, say, one line.
4. Bring down brick b1 and append 10 lines to f1.
5. Bring b1 back up, then bring down brick b2.
6. Delete the data of f1 using "> f1".
7. Bring b2 back online.
8. The file can now be seen in split-brain.
9. Try to resolve it using the bigger-file option.

The resolution fails (a hedged shell sketch of these steps follows below).
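
A hedged shell sketch of the reproduction steps, for reference. The hostnames (server1/server2), brick paths, and mount point are assumptions, not taken from this bug; the gluster commands and volume options mirror the ones shown in the outputs above, and the brick_pid helper is hypothetical (it only parses the PID column of 'gluster volume status').

#!/bin/bash
# Hedged reproduction sketch. Assumes it runs on a node that can reach both
# brick processes (in a real multi-node setup, run the 'kill' on the node
# hosting that brick). Interactive confirmations from 'gluster volume create'
# may need to be answered.

VOL=rep2
B1=server1:/rhs/brick1/rep2
B2=server2:/rhs/brick1/rep2
MNT=/mnt/rep2

brick_pid() {   # hypothetical helper: PID is the last column of 'gluster volume status'
    gluster volume status "$VOL" | awk -v b="$1" '$0 ~ b {print $NF}'
}

# Steps 1-2: 1x2 replicate volume with client- and server-side healing disabled
gluster volume create "$VOL" replica 2 "$B1" "$B2"
gluster volume start "$VOL"
gluster volume set "$VOL" cluster.self-heal-daemon disable
gluster volume set "$VOL" cluster.data-self-heal off
gluster volume set "$VOL" cluster.metadata-self-heal off
gluster volume set "$VOL" cluster.entry-self-heal off
mkdir -p "$MNT"
mount -t glusterfs server1:/"$VOL" "$MNT"

# Step 3: write f1 with one line
echo "line 1" > "$MNT/f1"

# Step 4: bring down b1, then append 10 lines to f1
kill "$(brick_pid "$B1")"
for i in $(seq 1 10); do echo "extra line $i" >> "$MNT/f1"; done

# Step 5: bring b1 back up, then bring down b2
gluster volume start "$VOL" force
kill "$(brick_pid "$B2")"

# Step 6: delete the data of f1
> "$MNT/f1"

# Step 7: bring b2 back online
gluster volume start "$VOL" force

# Steps 8-9: the file shows up in split-brain; try the bigger-file resolution
gluster volume heal "$VOL" info
gluster volume heal "$VOL" split-brain bigger-file /f1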

Comment 2 Ravishankar N 2017-01-16 10:01:02 UTC
The RCA is given in https://bugzilla.redhat.com/show_bug.cgi?id=1403840#c13

Basically, when a file is in data split-brain *and* also has pending metadata heals (but is not in metadata split-brain), and the bigger-file option of the CLI is used, then after healing the data split-brain it also tries to heal the metadata, which fails because the file is not in metadata split-brain. Hence the CLI reports that the file is not in split-brain.
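
To make the sequence easier to follow, here is a minimal, self-contained shell simulation of the behaviour described above. It only illustrates the reported CLI behaviour, not the actual glusterd/AFR code; the state variables and the resolve_bigger_file function are hypothetical.

#!/bin/bash
# Simplified simulation (an assumption, not the real implementation) of the
# flow in comment 2. The file is in data split-brain and also has pending
# metadata heals, but it is NOT in metadata split-brain.
data_split_brain=1
metadata_split_brain=0

resolve_bigger_file() {
    # Step 1: the data split-brain is resolved successfully, picking the
    # bigger file as the source (this part works; the md5sums match afterwards).
    if [ "$data_split_brain" -eq 1 ]; then
        data_split_brain=0
        echo "data split-brain on $1 healed using the bigger file as source"
    fi

    # Step 2: the CLI then also attempts a metadata split-brain heal.
    # The file only has pending (non-split-brain) metadata heals, so this
    # step bails out, and the whole command is reported as failed even
    # though step 1 succeeded.
    if [ "$metadata_split_brain" -eq 0 ]; then
        echo "Healing $1 failed: File not in split-brain."
        return 1
    fi
}

resolve_bigger_file /testbig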

Comment 6 Atin Mukherjee 2018-11-12 02:31:27 UTC
Based on comment 2, my understanding is that this BZ is already addressed and there's no work required on this bug? Any specific reason why this BZ is hanging around for so many months now?

Comment 7 Pranith Kumar K 2018-11-12 11:29:29 UTC
It looks like Ravi has context about this bug. Moving it to him.

Comment 8 Ravishankar N 2018-11-13 02:31:29 UTC
(In reply to Atin Mukherjee from comment #6)
> Based on comment 2, my understanding is that this BZ is already addressed
> and there's no work required on this bug? Any specific reason why this BZ is
> hanging around for so many months now?

BZ 1403840 is for a different issue and does not fix this bug.