Bug 1716360

Summary: Arbiter becoming source of heal when bricks are brought down continuously
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Anees Patel <anepatel>
Component: arbiter
Assignee: Karthik U S <ksubrahm>
Status: CLOSED DEFERRED
QA Contact: Prasanth <pprakash>
Severity: high
Priority: unspecified
Version: rhgs-3.4
CC: amukherj, ksubrahm, nchilaka, pkarampu, rhs-bugs, storage-qa-internal, tmuthami, vdas
Hardware: Unspecified   
OS: Linux   
Last Closed: 2020-03-31 05:14:42 UTC
Type: Bug
Attachments: Script to continuously append to a file

Description Anees Patel 2019-06-03 10:22:36 UTC
Created attachment 1576564 [details]
Script to continuously append to a file

Description of problem:

When two bricks in an arbiter volume are brought down repeatedly, the AFR extended attributes on the two data bricks end up blaming each other and the arbiter becomes the source of heal. A similar issue was found earlier in BZ#1401969 and was fixed in 3.4.0.

Version-Release number of selected component (if applicable):

Discovered while testing a hotfix build:
# rpm -qa | grep gluster
python2-gluster-3.12.2-40.el7rhgs.1.HOTFIX.sfdc02320997.bz1708121.x86_64
glusterfs-3.12.2-40.el7rhgs.1.HOTFIX.sfdc02320997.bz1708121.x86_64

How reproducible:
Once

Steps to Reproduce:
1. Run a script that collects all the bricks in the volume, kills two bricks (b0, b1) within milliseconds of each other, and then brings the bricks back up by restarting glusterd (see the sketch after this list).
2. Next kill b1 and b2, and repeat the cycle in a loop.
3. In parallel, run the attached perl script on the fuse client as the I/O workload; the script opens a file and appends to it in an infinite loop.
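
The exact script used in step 1 is not attached, so the following is only a minimal sketch of the kill/restart loop it describes; the brick hosts/paths are taken from the volume info below, while the timings and the way the brick PIDs are found are assumptions.

#!/bin/bash
# Sketch of the reproduction loop, NOT the original script. Hosts, paths and
# timings are assumptions based on the volume layout shown in this bug.

kill_brick() {          # $1 = host, $2 = brick path
    # Assumes the brick path appears on the glusterfsd command line (it does
    # via --brick-name on RHGS 3.x). The [g] trick stops pkill from matching
    # its own invoking shell on the remote side.
    ssh "$1" "pkill -9 -f '[g]lusterfsd.*$2'"
}

restart_glusterd() {    # $1 = host; restarting glusterd respawns killed bricks
    ssh "$1" "systemctl restart glusterd"
}

B0_HOST=10.70.36.49; B0_PATH=/bricks/brick1/master1vol-2
B1_HOST=10.70.36.62; B1_PATH=/bricks/brick3/master1vol-2repl
B2_HOST=10.70.36.56; B2_PATH=/bricks/brick1/master1vol-2    # arbiter

while true; do
    # Kill b0 and b1 nearly back to back, then bring them up again.
    kill_brick "$B0_HOST" "$B0_PATH"; kill_brick "$B1_HOST" "$B1_PATH"
    sleep 10
    restart_glusterd "$B0_HOST"; restart_glusterd "$B1_HOST"
    sleep 10
    # Now kill b1 and b2 and repeat the cycle.
    kill_brick "$B1_HOST" "$B1_PATH"; kill_brick "$B2_HOST" "$B2_PATH"
    sleep 10
    restart_glusterd "$B1_HOST"; restart_glusterd "$B2_HOST"
    sleep 10
done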

Actual results:

The file is pending heal and cannot be accessed from the mount point; the arbiter has become the source of heal.
# ls 1
ls: cannot access 1: Transport endpoint is not connected
# stat 1
stat: cannot stat ‘1’: Transport endpoint is not connected


# gluster v heal master2vol-2 info
Brick 10.70.36.49:/bricks/brick1/master1vol-2
<gfid:90959a41-63dc-4fe0-b6d9-f1223b1ab40f>
Status: Connected
Number of entries: 1

Brick 10.70.36.62:/bricks/brick3/master1vol-2repl
<gfid:90959a41-63dc-4fe0-b6d9-f1223b1ab40f>
Status: Connected
Number of entries: 1

Brick 10.70.36.56:/bricks/brick1/master1vol-2
<gfid:90959a41-63dc-4fe0-b6d9-f1223b1ab40f>
Status: Connected
Number of entries: 1
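
Not captured in the original output, but the standard follow-up check here would be to also ask gluster whether the file is flagged as being in split-brain:

# Follow-up check (not part of the original report)
gluster volume heal master2vol-2 info split-brain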

Expected results:

The arbiter brick should not become the source of heal, and all files should heal.

Additional info:
================
# gluster v info master2vol-2
 
Volume Name: master2vol-2
Type: Replicate
Volume ID: 0f62e637-15ae-4c64-828b-f7d83e08baf4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.49:/bricks/brick1/master1vol-2
Brick2: 10.70.36.62:/bricks/brick3/master1vol-2repl
Brick3: 10.70.36.56:/bricks/brick1/master1vol-2 (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
cluster.shd-max-threads: 30
cluster.enable-shared-storage: enable

=============================================================================
The extended attributes on the file show the two data bricks (client-0 and client-1) blaming each other, and the dirty attribute is set on all bricks.
Data-brick 1
# getfattr -m . -d -e hex /bricks/brick1/master1vol-2/replace-brick/1
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/master1vol-2/replace-brick/1
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x0000002d0000000000000000
trusted.afr.master2vol-2-client-1=0x000000020000000000000000
trusted.gfid=0x90959a4163dc4fe0b6d9f1223b1ab40f
trusted.gfid2path.d6e66232a352f62e=0x31363365353336322d393862312d343836652d393061392d3437313437633165306662302f31
trusted.glusterfs.0f62e637-15ae-4c64-828b-f7d83e08baf4.xtime=0x5cf343a9000264bd
==
Data-brick 2
# getfattr -m . -d -e hex /bricks/brick3/master1vol-2repl/replace-brick/1
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick3/master1vol-2repl/replace-brick/1
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x0000002c0000000000000000
trusted.afr.master2vol-2-client-0=0x000000010000000000000000
trusted.gfid=0x90959a4163dc4fe0b6d9f1223b1ab40f
trusted.gfid2path.d6e66232a352f62e=0x31363365353336322d393862312d343836652d393061392d3437313437633165306662302f31
trusted.glusterfs.0f62e637-15ae-4c64-828b-f7d83e08baf4.xtime=0x5cf343ae0005e2f7
==
Arbiter brick
# getfattr -m . -d -e hex /bricks/brick1/master1vol-2/replace-brick/1
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/master1vol-2/replace-brick/1
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x0000002c0000000000000000
trusted.afr.master2vol-2-client-0=0x000000010000000000000000
trusted.afr.master2vol-2-client-1=0x000000020000000000000000
trusted.gfid=0x90959a4163dc4fe0b6d9f1223b1ab40f
trusted.gfid2path.d6e66232a352f62e=0x31363365353336322d393862312d343836652d393061392d3437313437633165306662302f31
trusted.glusterfs.0f62e637-15ae-4c64-828b-f7d83e08baf4.xtime=0x5cf343ac000e9301
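
The pending counters above can be decoded with a small sketch like the one below, assuming the usual AFR xattr layout (three big-endian 32-bit counters: data, metadata, entry); the file path is the one from this report.

#!/bin/bash
# Decode trusted.afr.* xattrs into pending-operation counts. Assumes the
# standard AFR layout: bytes 0-3 = data, 4-7 = metadata, 8-11 = entry.
FILE=/bricks/brick1/master1vol-2/replace-brick/1

getfattr -m 'trusted.afr' -d -e hex "$FILE" 2>/dev/null |
while IFS='=' read -r name val; do
    [[ $name == trusted.afr.* ]] || continue
    hex=${val#0x}
    printf '%-40s data=%d metadata=%d entry=%d\n' "$name" \
        $((16#${hex:0:8})) $((16#${hex:8:8})) $((16#${hex:16:8}))
done

Read this way, data brick 1 blames client-1 (brick 2) for two pending data operations, brick 2 blames client-0 (brick 1) for one, and the arbiter blames both data bricks, while trusted.afr.dirty carries a large count of transactions that were never unwound. With both data bricks accusing each other, only the arbiter is left looking like a valid source, which is consistent with the behaviour reported above.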
=============================================================================
System details and the sos-report will be provided in a following comment.