Bug 1234884 - Selfheal on a volume stops at a particular point and does not resume for a long time
Summary: Selfheal on a volume stops at a particular point and does not resume for a long time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Ravishankar N
QA Contact: Vijay Avuthu
URL:
Whiteboard: rebase
Depends On:
Blocks: 1503134
 
Reported: 2015-06-23 12:51 UTC by Apeksha
Modified: 2018-09-17 14:09 UTC
CC List: 6 users

Fixed In Version: glusterfs-3.12.2-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:26:56 UTC
Embargoed:


Attachments
sosreports 1 (11.86 MB, application/x-xz), attached 2015-06-23 12:56 UTC by Shruti Sampat
sosreports 2 (10.10 MB, application/x-xz), attached 2015-06-23 12:57 UTC by Shruti Sampat


Links
Red Hat Product Errata RHSA-2018:2607, last updated 2018-09-04 06:28:39 UTC

Description Apeksha 2015-06-23 12:51:08 UTC
Description of problem:
Selfheal on a volume stops at a particular point and does not resume for a long time

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-3.el6rhs.x86_64
nfs-ganesha-2.2.0-3.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. create a 1X2 dist-rep volume and mount it using nfs-ganesha with vers=3 (a command sketch follows after step 6)
2. create directories and files
3. bring down 1 brick of the replica pair
4. rename all the files and directories
5. force start the volume
6. Self-heal process starts and then seems to hang
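
For reference, a minimal command sketch of steps 1, 3, 5 and 6 above (not part of the original report; the ganesha virtual IP and export path are assumptions, brick paths are taken from the heal info output below, and steps 2 and 4 use the scripts shown in comment 5):

# 1. Create and start a 1X2 replicate volume
gluster volume create testvol replica 2 nfs1:/rhs/brick1/brick1/testvol_brick0 nfs2:/rhs/brick1/brick1/testvol_brick1
gluster volume start testvol
# 1. Mount it over NFSv3 via nfs-ganesha (VIP and export path are assumptions)
mount -t nfs -o vers=3 <ganesha-vip>:/testvol /mnt/testvol
# 3. Bring down one brick of the replica pair: find its PID via volume status and kill it
gluster volume status testvol
kill <pid-of-testvol_brick1-process>
# 5. Force start the volume so the killed brick comes back and self-heal starts
gluster volume start testvol force
# 6. Check self-heal progress
gluster volume heal testvol info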

Actual results: The number of entries in self-heal info reaches a specific value and then stays there

Expected results: The number of entries in self-heal info must eventually drop to 0


Additional info:
[root@nfs2 ~]# gluster v heal testvol info
Brick nfs1:/rhs/brick1/brick1/testvol_brick0/
/x1/b1 
/x1/b2 
/x1/b3 
/x1/b4 
...
/x15/b19 
/x15/b20 
Number of entries: 300

Brick nfs2:/rhs/brick1/brick1/testvol_brick1/
Number of entries: 0
 
Self-heal eventually completes after a few hours
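
One way to quantify "stops at a particular point" (a monitoring sketch, not part of the original report) is to sample the pending-entry count periodically; during the hang the count stays parked at the same value (300 above) instead of dropping toward 0:

# Print the pending-entry count for testvol once a minute
while true; do date; gluster volume heal testvol info | grep 'Number of entries'; sleep 60; done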

Comment 2 Shruti Sampat 2015-06-23 12:56:30 UTC
Created attachment 1042308 [details]
sosreports 1

Comment 3 Shruti Sampat 2015-06-23 12:57:50 UTC
Created attachment 1042309 [details]
sosreports 2

Comment 5 Apeksha 2015-06-23 13:10:46 UTC
The following scripts were used on the client:

1. to create files/directories:

for i in {1..15}; do mkdir /mnt/testvol/a$i; mkdir /mnt/testvol/x$i; for j in {1..20}; do mkdir /mnt/testvol/a$i/b$j; mkdir /mnt/testvol/x$i/y$j; for k in {1..30}; do touch /mnt/testvol/a$i/b$j/c$k; done; done; done

2. to rename files/directories:

for i in {1..15}; do for j in {1..20}; do mv /mnt/testvol/a$i/b$j /mnt/testvol/x$i/b$j; for k in {1..30}; do mv /mnt/testvol/x$i/b$j/c$k /mnt/testvol/x$i/y$j/c$k; done; done; done
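
For readability, the same two loops with line breaks (functionally identical to the one-liners above); they create 15 x 20 x 30 = 9000 files under the a* tree plus empty y* directories under x*, then rename every b* directory and every file into the x* tree:

# create
for i in {1..15}; do
    mkdir /mnt/testvol/a$i /mnt/testvol/x$i
    for j in {1..20}; do
        mkdir /mnt/testvol/a$i/b$j /mnt/testvol/x$i/y$j
        for k in {1..30}; do
            touch /mnt/testvol/a$i/b$j/c$k
        done
    done
done

# rename
for i in {1..15}; do
    for j in {1..20}; do
        mv /mnt/testvol/a$i/b$j /mnt/testvol/x$i/b$j
        for k in {1..30}; do
            mv /mnt/testvol/x$i/b$j/c$k /mnt/testvol/x$i/y$j/c$k
        done
    done
done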

Comment 11 Vijay Avuthu 2018-03-28 14:37:53 UTC
Update:
==============

Build used : glusterfs-3.12.2-6.el7rhgs.x86_64

Verified the scenario below for both 1 * 2 and 2 * 3 volumes:

1. create a volume and mount it
2. create directories and files using the script below

for i in {1..15}; do mkdir /mnt/testvol/a$i; mkdir /mnt/testvol/x$i; for j in {1..20}; do mkdir /mnt/testvol/a$i/b$j; mkdir /mnt/testvol/x$i/y$j; for k in {1..30}; do touch /mnt/testvol/a$i/b$j/c$k; done; done; done


3. bring down 1 brick of the replica pair (for 2 * 3, bring down 1 brick of each replica set; a command sketch follows after step 5)
4. rename all the files and directories using the script below

for i in {1..15}; do for j in {1..20}; do mv /mnt/testvol/a$i/b$j /mnt/testvol/x$i/b$j; for k in {1..30}; do mv /mnt/testvol/x$i/b$j/c$k /mnt/testvol/x$i/y$j/c$k; done; done; done

5. force start the volume
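
A command sketch for steps 3 and 5 in the 2 * 3 case (the volume name and PIDs are placeholders; consecutive bricks form a replica set, so bricks 0-2 and 3-5 are the two sets in the heal info output below):

# 3. Find the brick process PIDs and kill one brick from each replica set
gluster volume status <volname>
kill <pid-of-brick0>   # e.g. 10.70.35.61:/bricks/brick0/testvol_distributed-replicated_brick0
kill <pid-of-brick3>   # e.g. 10.70.35.163:/bricks/brick0/testvol_distributed-replicated_brick3
# 5. After the renames, force start the volume to restart the killed bricks
gluster volume start <volname> force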

Healing is completed without any issues

[root@dhcp35-163 ~]# gluster vol heal 23 info 
Brick 10.70.35.61:/bricks/brick0/testvol_distributed-replicated_brick0
Status: Connected
Number of entries: 0

Brick 10.70.35.174:/bricks/brick0/testvol_distributed-replicated_brick1
Status: Connected
Number of entries: 0

Brick 10.70.35.17:/bricks/brick0/testvol_distributed-replicated_brick2
Status: Connected
Number of entries: 0

Brick 10.70.35.163:/bricks/brick0/testvol_distributed-replicated_brick3
Status: Connected
Number of entries: 0

Brick 10.70.35.136:/bricks/brick0/testvol_distributed-replicated_brick4
Status: Connected
Number of entries: 0

Brick 10.70.35.214:/bricks/brick0/testvol_distributed-replicated_brick5
Status: Connected
Number of entries: 0

[root@dhcp35-163 ~]# 
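
A quick aggregate check (an assumption, not part of the original verification) that every brick reports zero pending entries:

# Sum the per-brick entry counts from heal info; 0 means healing is complete
gluster volume heal <volname> info | awk '/Number of entries/ {sum += $NF} END {print sum+0}'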

> Also verified with the steps provided in comment 7

Changing status to Verified.

Comment 13 errata-xmlrpc 2018-09-04 06:26:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

