Bug 1234884

Summary: Selfheal on a volume stops at a particular point and does not resume for a long time
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Apeksha <akhakhar>
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED ERRATA
QA Contact: Vijay Avuthu <vavuthu>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.1
CC: nchilaka, ravishankar, rhinduja, rhs-bugs, sheggodu, ssampat
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: rebase
Fixed In Version: glusterfs-3.12.2-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:26:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1503134

Attachments:
  sosreports 1 (flags: none)
  sosreports 2 (flags: none)

Description Apeksha 2015-06-23 12:51:08 UTC
Description of problem:
Selfheal on a volume stops at a particular point and does not resume for a long time

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-3.el6rhs.x86_64
nfs-ganesha-2.2.0-3.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 1x2 dist-rep volume and mount it over nfs-ganesha with vers=3
2. Create directories and files
3. Bring down 1 brick of the replica pair
4. Rename all the files and directories
5. Force start the volume
6. The self-heal process starts and then appears to hang (see the command sketch after these steps)
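
A minimal command sketch of these steps, reusing the testvol name, the nfs1/nfs2 hosts and the /rhs/brick1/brick1/testvol_brick* paths that appear in the heal info output below; the ganesha VIP and the brick PID are placeholders:

# step 1: create and start a 1x2 replicate volume, then mount it over NFSv3 via ganesha
gluster volume create testvol replica 2 \
    nfs1:/rhs/brick1/brick1/testvol_brick0 \
    nfs2:/rhs/brick1/brick1/testvol_brick1
gluster volume start testvol
mount -t nfs -o vers=3 <ganesha-vip>:/testvol /mnt/testvol

# step 3: bring down one brick of the replica pair (kill its PID from volume status)
gluster volume status testvol        # note the PID of one brick
kill -9 <brick-pid>

# step 5: force start so the downed brick comes back and self-heal is triggered
gluster volume start testvol force
gluster volume heal testvol info     # entry counts should drain to 0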

Actual results: The number of entries in heal info drops to a certain value and then stays there for a long time.

Expected results: The number of entries in self-heal info must eventually come down to 0.


Additional info:
[root@nfs2 ~]# gluster v heal testvol info
Brick nfs1:/rhs/brick1/brick1/testvol_brick0/
/x1/b1 
/x1/b2 
/x1/b3 
/x1/b4 
'
'
'
/x15/b19 
/x15/b20 
Number of entries: 300

Brick nfs2:/rhs/brick1/brick1/testvol_brick1/
Number of entries: 0
 
Self-heal eventually completes after a few hours
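
As a convenience (not part of the original report), a small polling loop can show whether the entry count is genuinely stuck or just draining slowly; it assumes the testvol name used above:

while true; do
    date
    gluster volume heal testvol info | grep 'Number of entries'
    sleep 60
done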

Comment 2 Shruti Sampat 2015-06-23 12:56:30 UTC
Created attachment 1042308 [details]
sosreports 1

Comment 3 Shruti Sampat 2015-06-23 12:57:50 UTC
Created attachment 1042309 [details]
sosreports 2

Comment 5 Apeksha 2015-06-23 13:10:46 UTC
The following scripts were used on the client.

1. To create files/directories:

for i in {1..15}; do mkdir /mnt/testvol/a$i; mkdir /mnt/testvol/x$i; for j in {1..20}; do mkdir /mnt/testvol/a$i/b$j; mkdir /mnt/testvol/x$i/y$j; for k in {1..30}; do touch /mnt/testvol/a$i/b$j/c$k; done; done; done

2. To rename files/directories:

for i in {1..15}; do for j in {1..20}; do mv /mnt/testvol/a$i/b$j /mnt/testvol/x$i/b$j; for k in {1..30}; do mv /mnt/testvol/x$i/b$j/c$k /mnt/testvol/x$i/y$j/c$k; done; done; done
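
For readability, the same two loops in multi-line form; this is functionally identical to the one-liners above and uses the same /mnt/testvol mount point:

# create 15 a$i/x$i directory pairs, 20 subdirs under each, and 30 files under each a$i/b$j
for i in {1..15}; do
    mkdir /mnt/testvol/a$i /mnt/testvol/x$i
    for j in {1..20}; do
        mkdir /mnt/testvol/a$i/b$j /mnt/testvol/x$i/y$j
        for k in {1..30}; do
            touch /mnt/testvol/a$i/b$j/c$k
        done
    done
done

# move every b$j under the matching x$i, then every file into the matching y$j
for i in {1..15}; do
    for j in {1..20}; do
        mv /mnt/testvol/a$i/b$j /mnt/testvol/x$i/b$j
        for k in {1..30}; do
            mv /mnt/testvol/x$i/b$j/c$k /mnt/testvol/x$i/y$j/c$k
        done
    done
done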

Comment 11 Vijay Avuthu 2018-03-28 14:37:53 UTC
Update:
==============

Build used: glusterfs-3.12.2-6.el7rhgs.x86_64

Verified the scenario below for both 1x2 and 2x3 volumes:

1. Create a volume and mount it
2. Create directories and files using the script below

for i in {1..15}; do mkdir /mnt/testvol/a$i; mkdir /mnt/testvol/x$i; for j in {1..20}; do mkdir /mnt/testvol/a$i/b$j; mkdir /mnt/testvol/x$i/y$j; for k in {1..30}; do touch /mnt/testvol/a$i/b$j/c$k; done; done; done


3. Bring down 1 brick of the replica pair (for 2x3, bring down 1 brick in each replica set; see the sketch after step 5)
4. Rename all the files and directories using the script below

for i in {1..15}; do for j in {1..20}; do mv /mnt/testvol/a$i/b$j /mnt/testvol/x$i/b$j; for k in {1..30}; do mv /mnt/testvol/x$i/b$j/c$k /mnt/testvol/x$i/y$j/c$k; done; done; done

5. Force start the volume
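
A short sketch of steps 3 and 5 for the 2x3 (distributed-replicate) case; the volume name and brick PIDs are placeholders, and which brick is taken down in each replica set is arbitrary:

# kill one brick process per 3-way replica set (PIDs come from gluster volume status)
gluster volume status <volname>        # <volname> is a placeholder for the volume under test
kill -9 <pid-of-a-brick-in-replica-set-1> <pid-of-a-brick-in-replica-set-2>

# after the renames, bring the bricks back and let self-heal drain to 0 entries
gluster volume start <volname> force
gluster volume heal <volname> info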

Healing completed without any issues:

[root@dhcp35-163 ~]# gluster vol heal 23 info 
Brick 10.70.35.61:/bricks/brick0/testvol_distributed-replicated_brick0
Status: Connected
Number of entries: 0

Brick 10.70.35.174:/bricks/brick0/testvol_distributed-replicated_brick1
Status: Connected
Number of entries: 0

Brick 10.70.35.17:/bricks/brick0/testvol_distributed-replicated_brick2
Status: Connected
Number of entries: 0

Brick 10.70.35.163:/bricks/brick0/testvol_distributed-replicated_brick3
Status: Connected
Number of entries: 0

Brick 10.70.35.136:/bricks/brick0/testvol_distributed-replicated_brick4
Status: Connected
Number of entries: 0

Brick 10.70.35.214:/bricks/brick0/testvol_distributed-replicated_brick5
Status: Connected
Number of entries: 0

[root@dhcp35-163 ~]# 

> Also verified with the steps provided in comment 7

Changing status to Verified.

Comment 13 errata-xmlrpc 2018-09-04 06:26:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607