Bug 1712225 - Healing is not completed on Arbiter bricks
Summary: Healing is not completed on Arbiter bricks
Keywords:
Status: CLOSED DUPLICATE of bug 1640148
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Karthik U S
QA Contact: Anees Patel
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-21 05:28 UTC by Anees Patel
Modified: 2019-05-23 09:35 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-23 09:35:31 UTC
Embargoed:



Description Anees Patel 2019-05-21 05:28:54 UTC
Description of problem:

Healing does not complete on the arbiter bricks; the automated testcase test_entry_self_heal_heal_command fails with files pending heal.

Version-Release number of selected component (if applicable):

Discovered in 3.4 BU4 Async
# rpm -qa | grep gluster
glusterfs-libs-3.12.2-47.2.el7rhgs.x86_64
python2-gluster-3.12.2-47.2.el7rhgs.x86_64
glusterfs-devel-3.12.2-47.2.el7rhgs.x86_64
glusterfs-events-3.12.2-47.2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.12.2-47.2.el7rhgs.x86_64

Also reproduced in 3.5
# rpm -qa | grep gluster
glusterfs-libs-6.0-2.el7rhgs.x86_64
glusterfs-geo-replication-6.0-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-6.0-2.el7rhgs.x86_64
glusterfs-client-xlators-6.0-2.el7rhgs. 


How reproducible:
2 out of 5 runs.


Steps to Reproduce:
The TC is automated, with the following steps (a hedged manual CLI sketch of these steps is given after the list):
1. Create a distributed-arbiter 2 x (2 + 1) volume
2. Set client-side heal options to "off"
        "metadata-self-heal": "off"
        "entry-self-heal": "off"
        "data-self-heal": "off"
3. Write IO's
4. Set volume option
         "self-heal-daemon": "off"
5. Bring down arbiter brick from each subvol
6. Modify data written in step 3
7. List files from mount pt
8. Bring bricks online
9. Set option
         "self-heal-daemon": "on"
10. Trigger heal once all volume processes and self-heal daemons are online
         gluster volume heal <vol-name>
11. Wait for heal completion 
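
A rough manual CLI sketch of the above steps follows. The volume name "testvol", host names, brick paths and mount point are placeholders; the actual TC performs the equivalent through automation.

# Step 1: create and start a 2 x (2 + 1) distributed-arbiter volume
gluster volume create testvol replica 3 arbiter 1 \
    host1:/bricks/b0 host2:/bricks/b1 host3:/bricks/b2 \
    host4:/bricks/b3 host5:/bricks/b4 host6:/bricks/b5
gluster volume start testvol

# Step 2: turn off client-side heals
gluster volume set testvol cluster.metadata-self-heal off
gluster volume set testvol cluster.entry-self-heal off
gluster volume set testvol cluster.data-self-heal off

# Step 3: mount the volume and write IO from a client (workload elided)
mount -t glusterfs host1:/testvol /mnt/testvol

# Step 4: turn off the self-heal daemon
gluster volume set testvol cluster.self-heal-daemon off

# Step 5: bring down the arbiter brick of each subvol, e.g. by killing the
#         brick PIDs reported by `gluster volume status testvol`

# Steps 6-7: modify the data written in step 3 and list files from the mount point

# Step 8: bring the downed bricks back online
gluster volume start testvol force

# Steps 9-11: re-enable shd, trigger heal and wait for completion
gluster volume set testvol cluster.self-heal-daemon on
gluster volume heal testvol
gluster volume heal testvol info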

Actual results:

At step 11, heal is unable to complete, and files are pending heal

Expected results:

Heal should be completed with no files pending heal.

Additional info:

# gluster v info testvol_distributed-replicated
 
Volume Name: testvol_distributed-replicated
Type: Distributed-Replicate
Volume ID: 9f335738-25fc-4e2e-ad69-3d3b25212491
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.50:/bricks/brick2/testvol_distributed-replicated_brick0
Brick2: 10.70.46.132:/bricks/brick1/testvol_distributed-replicated_brick1
Brick3: 10.70.46.216:/bricks/brick2/testvol_distributed-replicated_brick2 (arbiter)
Brick4: 10.70.46.42:/bricks/brick1/testvol_distributed-replicated_brick3
Brick5: 10.70.47.41:/bricks/brick1/testvol_distributed-replicated_brick4
Brick6: 10.70.46.231:/bricks/brick2/testvol_distributed-replicated_brick5 (arbiter)
Options Reconfigured:
cluster.self-heal-daemon: on
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off



Files pending heal

# gluster v heal testvol_distributed-replicated info
Brick 10.70.35.50:/bricks/brick2/testvol_distributed-replicated_brick0
Status: Connected
Number of entries: 0

Brick 10.70.46.132:/bricks/brick1/testvol_distributed-replicated_brick1
Status: Connected
Number of entries: 0

Brick 10.70.46.216:/bricks/brick2/testvol_distributed-replicated_brick2
Status: Connected
Number of entries: 0

Brick 10.70.46.42:/bricks/brick1/testvol_distributed-replicated_brick3
/files/user2_a/dir0_a/dir0_a 
/files/user2_a/dir0_a 
Status: Connected
Number of entries: 2

Brick 10.70.47.41:/bricks/brick1/testvol_distributed-replicated_brick4
/files/user2_a/dir0_a/dir0_a 
/files/user2_a/dir0_a 
Status: Connected
Number of entries: 2

Brick 10.70.46.231:/bricks/brick2/testvol_distributed-replicated_brick5
<gfid:05893f69-b962-48c3-8838-523857703ce3>/user2_a/dir0_a/dir0_a 
<gfid:05893f69-b962-48c3-8838-523857703ce3>/user2_a/dir0_a 
Status: Connected
Number of entries: 2


It is noticed that every time the issue is hit, the same set of entries is pending heal (always exactly 2), viz. <gfid>/user2_a/dir0_a/dir0_a and <gfid>/user2_a/dir0_a.

Since it is always the same set of files created by the automated workload, reproducing this issue manually can be difficult.

Data brick of subvol-2:
[root@dhcp47-41 dir0_a]#  getfattr -m. -d -e hex /bricks/brick1/testvol_distributed-replicated_brick4/files/user2_a/dir0_a/dir0_a
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testvol_distributed-replicated_brick4/files/user2_a/dir0_a/dir0_a
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol_distributed-replicated-client-5=0x000000000000000e00000001
trusted.gfid=0x71d4d0b120944636ad95d802e15be1f2
trusted.glusterfs.dht=0x0000000000000000000000007ffffffe

The other data brick of subvol-2:
[root@dhcp46-42 x86_64]# getfattr -m. -d -e hex /bricks/brick1/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/dir0_a/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/dir0_a/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol_distributed-replicated-client-5=0x000000000000000e00000001
trusted.gfid=0x71d4d0b120944636ad95d802e15be1f2
trusted.glusterfs.dht=0x0000000000000000000000007ffffffe
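
As a reading aid, and assuming the standard AFR changelog xattr layout (three 4-byte counters for pending data, metadata and entry operations), the value above decodes as:

# trusted.afr.<volname>-client-N = <data:4B><metadata:4B><entry:4B>
# 0x00000000 0000000e 00000001  ->  data: 0, metadata: 0xe, entry: 0x1

client-5 should correspond to Brick6 (10.70.46.231, the arbiter of the second subvol), i.e. both data bricks of subvol-2 blame the arbiter brick, which is consistent with the entries never appearing there.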

No entries are present on the arbiter brick:

[root@dhcp46-231 dir0_a]# getfattr -m. -d -e hex /bricks/brick2/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/dir0_a/
getfattr: /bricks/brick2/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/dir0_a/: No such file or directory

[root@dhcp46-231 dir0_a]# ls /bricks/brick2/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/

<empty dir>

whereas entries are present on the data bricks:
[root@dhcp46-42 ~]# ls /bricks/brick1/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/dir0_a/
testfile0_a.txt   testfile11_a.txt  testfile13_a.txt  testfile15_a.txt  testfile17_a.txt  testfile2_a.txt  testfile4_a.txt  testfile6_a.txt  testfile9_a.txt
testfile10_a.txt  testfile12_a.txt  testfile14_a.txt  testfile16_a.txt  testfile19_a.txt  testfile3_a.txt  testfile5_a.txt  testfile7_a.txt
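
Not captured in this report, but as a possible additional data point, the per-brick heal index on the data bricks could be listed to confirm the pending entries (paths below are from this setup; the gfid-named entries under indices/xattrop are what shd crawls to find work):

# list the pending-heal index on a data brick of subvol-2
ls /bricks/brick1/testvol_distributed-replicated_brick3/.glusterfs/indices/xattrop/
# and read the gfid of the affected directory for cross-reference
getfattr -n trusted.gfid -e hex \
    /bricks/brick1/testvol_distributed-replicated_brick3/files/user2_a/dir0_a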

Comment 5 Anees Patel 2019-05-23 09:35:31 UTC

*** This bug has been marked as a duplicate of bug 1640148 ***

