Description of problem:
======================
Heal takes a very long time when we bring up a brick that was down and then, while heal is in progress, bring down one of the source bricks.

I ran the case in two scenarios with the same workload:
1) created a dir dir1 on the root of the volume and created files under this dir
2) no IO was going on while heal was happening

The volume settings are as follows:

[root@dhcp35-180 ~]# gluster v info

Volume Name: dist-ec
Type: Distributed-Disperse
Volume ID: 3bcd582c-f0cd-446c-afce-bfa3b0b8e316
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.179:/rhs/brick1/dist-ec
Brick2: 10.70.35.180:/rhs/brick1/dist-ec
Brick3: 10.70.35.86:/rhs/brick1/dist-ec
Brick4: 10.70.35.9:/rhs/brick1/dist-ec
Brick5: 10.70.35.153:/rhs/brick1/dist-ec
Brick6: 10.70.35.79:/rhs/brick1/dist-ec
Brick7: 10.70.35.179:/rhs/brick2/dist-ec
Brick8: 10.70.35.180:/rhs/brick2/dist-ec
Brick9: 10.70.35.86:/rhs/brick2/dist-ec
Brick10: 10.70.35.9:/rhs/brick2/dist-ec
Brick11: 10.70.35.153:/rhs/brick2/dist-ec
Brick12: 10.70.35.79:/rhs/brick2/dist-ec
Options Reconfigured:
disperse.shd-max-threads: 3
disperse.heal-wait-qlength: 3
cluster.shd-max-threads: 30
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

Scenario 1:
===========
Step 1: Started creating 1GB files from the fuse mount in a loop 11 times, for a total of 11GB of data in 11 files, using dd:
"for i in {1..11};do date;echo "loop $i" ;dd if=/dev/urandom of=file.$i bs=1024 count=1000000;done"
Thu Nov 3 13:10:51 IST 2016
Step 2: Killed brick1 of dht-subvol-1 while the first file create was in progress, at about 500MB in size.
Step 3: Waited for all the files to get created and then brought the brick back up using "gluster volume start force". The healing of all files completed in about 1.5 minutes once healing started.

Scenario 2:
===========
Step 1: Started creating 1GB files from the fuse mount in a loop 11 times, for a total of 11GB of data in 11 files, using dd:
"for i in {1..11};do date;echo "loop $i" ;dd if=/dev/urandom of=file.$i bs=1024 count=1000000;done"
Thu Nov 3 13:10:51 IST 2016
Step 2: Killed brick1 of dht-subvol-1 while the first file create was in progress, at about 500MB in size.
Step 3: Waited for all the files to get created and then brought the brick back up using "gluster volume start force".
Step 4: Once healing started, killed brick 3 of the same dht-subvol-1.
From this point on, the heal takes a very long time to complete for the same set of data. The files were healing at a fast pace until this new brick was brought down; after that, heal info shows the following:

[root@dhcp35-179 ~]# for i in {1..100};do date;gluster v heal dist-ec info|grep ntries;sleep 30;done
Thu Nov 3 14:29:38 IST 2016
Number of entries: 0
Number of entries: 2
Number of entries: -
Number of entries: 2
Number of entries: 2
Number of entries: 2
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Thu Nov 3 14:45:44 IST 2016
Number of entries: 0
Number of entries: 2
Number of entries: -
Number of entries: 2
Number of entries: 2
Number of entries: 2
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0

Even after 15 minutes, heal is still not complete.

Version-Release number of selected component (if applicable):
==============================================================
[root@dhcp35-86 ~]# rpm -qa|grep gluster
glusterfs-libs-3.8.4-3.el7rhgs.x86_64
glusterfs-cli-3.8.4-3.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-3.el7rhgs.x86_64
glusterfs-server-3.8.4-3.el7rhgs.x86_64
glusterfs-fuse-3.8.4-3.el7rhgs.x86_64
glusterfs-api-3.8.4-3.el7rhgs.x86_64
glusterfs-3.8.4-3.el7rhgs.x86_64
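For reference, a rough shell sketch of the Scenario 2 flow described above (not the exact commands originally used; the awk extraction assumes the PID is the last column of the "gluster volume status" brick line):

# Kill brick1 of dht-subvol-1 on its host; PID is read from volume status
BRICK_PID=$(gluster volume status dist-ec | awk '/\/rhs\/brick1\/dist-ec/ {print $NF; exit}')
kill -9 "$BRICK_PID"

# Workload from the fuse mount: 11 x ~1GB files
for i in {1..11}; do date; echo "loop $i"; dd if=/dev/urandom of=file.$i bs=1024 count=1000000; done

# Bring the killed brick back up and let the self-heal daemon start healing
gluster volume start dist-ec force

# While heal is in progress, kill brick3 of the same subvolume on its host,
# then monitor the pending-heal counts every 30 seconds
for i in {1..100}; do date; gluster v heal dist-ec info | grep ntries; sleep 30; done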
I think this could be expected behavior. I have a few questions to make sure:

1 - Did you check on the back end whether the files were created on brick1 (the first killed brick) after heal, say after 5-10 minutes?
2 - Did you try to heal the files using an index or FULL heal?
3 - Why have you set disperse.shd-max-threads: 3 and cluster.shd-max-threads: 30?

What I think is happening: while heal was in progress and some files were being healed, you killed brick3, and from that point, even after the current files were healed, the index entries were not removed from all the bricks. Those entries keep getting listed in heal info. This is what is happening on my laptop as well: even if you wait for a long time, these heal entries keep showing up in heal info. I tried it with a plain EC volume and am observing the same thing.
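For questions 1 and 2, a sketch of how this could be checked (paths assume the brick layout from the volume info and that the files landed under dir1 as in the description):

# On the server hosting the healed brick: confirm the files exist on the
# backend and compare their sizes/xattrs with a healthy brick
ls -l /rhs/brick1/dist-ec/dir1/
getfattr -d -m . -e hex /rhs/brick1/dist-ec/dir1/file.1

# Pending-heal index entries live under the xattrop directory of each brick;
# stale entries here are what "heal info" keeps listing
ls /rhs/brick1/dist-ec/.glusterfs/indices/xattrop/

# Trigger an index heal (default) or a full heal explicitly
gluster volume heal dist-ec
gluster volume heal dist-ec full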
That shouldn't be happening, and that is the problem.
Recreated the issue as mentioned in the description.

We observed that the files were getting healed on the node that came back up, but the index entries were not cleared. This is expected behavior: with one brick still down, we cannot be sure which bricks are the sinks for which the index entries were created. Once the brick that was killed while heal was in progress comes back up, heal clears the index entries, as there is no longer any ambiguity about the sink bricks.

Nag,

Can we close this issue? Please confirm.
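In other words (a minimal sketch of the verification, assuming the same dist-ec volume), once the second killed brick is back online the stale entries should drain to zero:

# Bring the brick that was killed during heal back online; the self-heal
# daemon can then resolve the sinks and clear the stale index entries
gluster volume start dist-ec force
gluster volume heal dist-ec info | grep 'Number of entries'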
(In reply to Sunil Kumar Acharya from comment #5)
> Recreated the issue as mentioned in the description.
>
> We observed that the files were getting healed on the node that came back
> up, but the index entries were not cleared. This is expected behavior:
> with one brick still down, we cannot be sure which bricks are the sinks
> for which the index entries were created. Once the brick that was killed
> while heal was in progress comes back up, heal clears the index entries,
> as there is no longer any ambiguity about the sink bricks.
>
> Nag,
>
> Can we close this issue? Please confirm.

The explanation is good; we can go ahead and close the BZ.