Bug 1367779
Summary: | with performance.client-io-threads enabled healing is seemingly taking significantly more time | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nag Pavan Chilakam <nchilaka>
Component: | disperse | Assignee: | Pranith Kumar K <pkarampu>
Status: | CLOSED DUPLICATE | QA Contact: | Nag Pavan Chilakam <nchilaka>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.1 | CC: | amukherj, asoman, nchilaka, rcyriac, rhs-bugs, storage-qa-internal
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-08-31 21:36:37 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Nag Pavan Chilakam
2016-08-17 12:54:00 UTC
Nag,

There are two things I would like to know before looking closely at this bug.

In EC there is an issue where the directory in which the files need to be created is listed for heal after the files in that directory. So what I would like to make sure is whether this is the same issue or not. Could you modify the test to do it this way?

1) Create the empty files before you bring the bricks down.
2) Do dd on the files while the bricks are down.
3) Measure the time it takes to heal now.

IMO, we see two problems which need to be fixed in EC when it comes to healing:

1) Making sure the parent directory of the files is healed before the files in that directory.
2) Making sure two heals on the same file don't lead to waiting for heal to complete.

Both of these should be addressed.

Pranith

Pranith,

Retried the steps mentioned by you; I don't see any (or only minimal) performance difference between toggling performance.client-io-threads on/off.

However, one question: then why was I seeing it in my previous case, with the same kind of workload?

(In reply to nchilaka from comment #3)
> Pranith,
>
> Retried the steps mentioned by you; I don't see any (or only minimal)
> performance difference between toggling performance.client-io-threads on/off.
>
> However, one question: then why was I seeing it in my previous case, with
> the same kind of workload?

Nag,

There is no consistency in heal times at the moment because of the entry-heal marking code. What happens is that the directory is marked for healing well after the files that need heal are marked, and a heal is triggered every 10 minutes. So the brick could have been brought up just 1-2 minutes before the timer expired (where it took 7 minutes) in the first case, and 7-8 minutes before the timer expired (where it took 15 minutes) in the second case of your bug description. If you run the same test with client-io-threads ON multiple times, you will see different times because of this timing issue. Until this marking problem is addressed, we will see this problem, IMO.

Pranith

Guys, any specific reason why the comments need to be private here?

Nag - with this hypothesis I believe you'd be able to close this BZ and file a new one with the existing issues highlighted by Pranith.

(In reply to Atin Mukherjee from comment #5)
> Guys, any specific reason why the comments need to be private here?
>
> Nag - with this hypothesis I believe you'd be able to close this BZ and file
> a new one with the existing issues highlighted by Pranith.

I have the habit of adding downstream comments as private. Removed them now. Below is the last comment from Nag, sent by mail, which confirms that this is caused by the same bug we are tracking in bz 1361513.

Mail from Nag: "I have retried the case by creating 400 GB of data on two different volumes (one volume has client-io-threads enabled and the other does not). This time, as soon as I brought the brick up, I triggered the heal manually. I don't see any difference in the time heal completion takes between the two volumes."

Closing this as a duplicate of bug 1361513.

*** This bug has been marked as a duplicate of bug 1361513 ***
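
For readers reproducing the modified test Pranith describes above (empty files first, brick down, dd, then time the heal), here is a minimal shell sketch. The volume name `testvol`, mount point `/mnt/testvol`, file count, dd sizes, and the `BRICK_PID` variable are hypothetical placeholders, not values taken from this bug.

```sh
# Minimal sketch, assuming a disperse volume "testvol" mounted at /mnt/testvol
# and one brick taken offline for the duration of the writes.

# 1) Create the empty files before bringing the brick down.
touch /mnt/testvol/file{1..100}

# 2) Take one brick down, then dd into the files while it is offline.
gluster volume status testvol        # note the PID of the brick process to stop
kill "$BRICK_PID"                    # BRICK_PID: brick PID copied by hand from the status output (placeholder)
for f in /mnt/testvol/file{1..100}; do
    dd if=/dev/zero of="$f" bs=1M count=100 conv=fsync
done

# 3) Bring the brick back and measure how long the heal takes.
#    Triggering the heal by hand avoids waiting on the periodic self-heal timer.
date
gluster volume start testvol force   # restarts the killed brick process
gluster volume heal testvol
watch -n 30 'gluster volume heal testvol info | grep "Number of entries"'
```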
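
Similarly, the A/B comparison from Nag's mail (identical workload on two volumes that differ only in the client-io-threads setting, with the heal triggered manually as soon as the bricks are back) could be sketched as below. The volume names `vol-iothreads-on` and `vol-iothreads-off` are made up for illustration.

```sh
# Hedged sketch of the comparison: two otherwise identical volumes, only the
# performance.client-io-threads setting differs.
gluster volume set vol-iothreads-on  performance.client-io-threads on
gluster volume set vol-iothreads-off performance.client-io-threads off

# Write the same workload to both mounts while one brick of each volume is down,
# then restart the killed bricks and immediately trigger the heal by hand.
for vol in vol-iothreads-on vol-iothreads-off; do
    gluster volume start "$vol" force
    gluster volume heal "$vol"
done

# Compare how quickly each volume drains its heal queue.
for vol in vol-iothreads-on vol-iothreads-off; do
    echo "== $vol =="
    gluster volume heal "$vol" info | grep "Number of entries"
done
```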