Bug 1367779 - with performance.client-io-threads enabled healing is seemingly taking significantly more time
Keywords:
Status: CLOSED DUPLICATE of bug 1361513
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-17 12:54 UTC by Nag Pavan Chilakam
Modified: 2016-08-31 21:36 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-31 21:36:37 UTC
Embargoed:



Description Nag Pavan Chilakam 2016-08-17 12:54:00 UTC
If performance.client-io-threads is enabled, heal completion takes about double the time compared to the same workload with the option disabled.
What I did:

    Settings: Default; no specific options set

    Volume type: Disperse, 1 x (4+2)

    Nodes: 6 VMs, each hosting one 8GB brick (RHGS 3.1.3, glusterfs 3.7.9-10, RHEL 7.2)

    Other Cluster Info: about 5 other volumes exist on the cluster, but all of them were taken offline

    Test setup: mounted the volume on a single client over FUSE (RHEL 7.2, glusterfs 3.7.9-10)

    Opened two terminals on the same client and started the I/O below:

        Term1: created directory dir1 under the volume root and started the I/O below:

            for i in {1..50};do dd if=/dev/urandom of=bgfile.$i count=997 bs=1000000;done

        Term2: created directory dir2 under the volume root and started the I/O below:

            for i in {1..50};do dd if=/dev/urandom of=bgfile.$i count=997 bs=1000000;done

    After about 4 files (2 from each terminal), i.e. about 4GB in total, had been created, brought down brick1 and left it down until 50GB of data had been created in total (so ~46GB to be healed)

    Brought back brick1 using volume start force while the I/O was still going on

    Result: took about 7 minutes to heal the data

Re-ran the same scenario with performance.client-io-threads enabled; it took about 15 minutes (a rough command sketch of the repro follows).
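
Roughly, the repro above as commands; this is only a sketch, and the volume name (ecvol), server name (node1), and mount point below are placeholders, not taken from this report:

    # Sketch only; volume/server/mount names are placeholders.
    gluster volume set ecvol performance.client-io-threads on   # only for the second run
    mkdir -p /mnt/ecvol
    mount -t glusterfs node1:/ecvol /mnt/ecvol
    mkdir /mnt/ecvol/dir1 /mnt/ecvol/dir2
    # Term1 (in dir1) and Term2 (in dir2) each run:
    #   for i in {1..50}; do dd if=/dev/urandom of=bgfile.$i count=997 bs=1000000; done
    # After ~4GB is written, kill the brick1 process (pid from `gluster volume status ecvol`),
    # let the I/O continue until ~50GB total, then bring the brick back and watch the heal:
    gluster volume start ecvol force
    gluster volume heal ecvol info   # poll until no entries are pending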

Comment 2 Pranith Kumar K 2016-08-18 13:44:45 UTC
Nag,
     There are two things I would like to know before looking closely at this bug.
1) In ec there is an issue where the directory under which the files need to be created is listed for heal after the files inside it. So what I would like to make sure is whether this is the same issue or not. Could you modify the test to do it this way?

1) Create the empty files before you bring the bricks down.
2) Do dd on the files while the bricks are down.
3) Measure the time it takes to heal now (a rough command sketch follows).
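
The steps above, roughly as commands; this is only a sketch, and the volume name (ecvol) and mount point below are placeholders, not from this comment:

    # Sketch only; names are placeholders.
    mkdir /mnt/ecvol/dir1
    for i in {1..50}; do touch /mnt/ecvol/dir1/bgfile.$i; done   # 1) create the empty files first
    # 2) kill the brick process (pid from `gluster volume status ecvol`), then write the data:
    for i in {1..50}; do dd if=/dev/urandom of=/mnt/ecvol/dir1/bgfile.$i count=997 bs=1000000; done
    # 3) bring the brick back and note how long the heal takes:
    gluster volume start ecvol force
    gluster volume heal ecvol info   # poll until no entries are pending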

IMO, there are 2 problems which need to be fixed in ec when it comes to healing.
1) Making sure the parent directory of the files is healed before the files in that directory.
2) Making sure two heals on the same file don't lead to one waiting for the other heal to complete. Both of these should be addressed.

Pranith

Comment 3 Nag Pavan Chilakam 2016-08-22 14:32:06 UTC
Pranith,
Retried the steps you mentioned; I see no (or only minimal) performance difference between performance.client-io-threads on and off.

However, one question: why was I seeing the difference in my previous case, with the same kind of workload?

Comment 4 Pranith Kumar K 2016-08-22 14:53:21 UTC
(In reply to nchilaka from comment #3)
> Pranith,
> Retried the steps you mentioned; I see no (or only minimal) performance
> difference between performance.client-io-threads on and off.
> 
> However, one question: why was I seeing the difference in my previous case,
> with the same kind of workload?

Nag,
    There is no consistency in heal times at the moment because of the entry-heal marking code. What happens is that the directory is marked for healing well after the files that need heal are marked, and a heal is triggered only every 10 minutes. So in the first case of your bug description the brick could have been brought up just 1-2 minutes before the timer expired (which is why it took 7 minutes), and in the second case 7-8 minutes before the timer expired (which is why it took 15 minutes).

If you run the same test with client-io-threads ON multiple times, you will see different heal times because of this timing issue. Until this marking problem is addressed, we will keep seeing this problem, IMO.
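
For example (a sketch only; the volume name ecvol below is a placeholder, not from this bug), the periodic timer can be taken out of the measurement by triggering the heal manually as soon as the brick is back:

    gluster volume start ecvol force
    gluster volume heal ecvol                        # trigger an index heal immediately
    gluster volume heal ecvol info                   # poll until no entries are pending
    # The periodic crawl interval itself is controlled by cluster.heal-timeout
    # (600 seconds by default):
    gluster volume get ecvol cluster.heal-timeout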

Pranith

Comment 5 Atin Mukherjee 2016-08-22 16:16:55 UTC
Guys, any specific reason why the comments need to be private here? 

Nag - with this hypothesis I believe you'd be able to close this BZ and file a new one for the existing issues highlighted by Pranith.

Comment 6 Pranith Kumar K 2016-08-22 16:20:37 UTC
(In reply to Atin Mukherjee from comment #5)
> Guys, any specific reason why the comments need to be private here? 
> 
> Nag - With this hypothesis I believe you'd be able to close this BZ and file
> a new one with existing issues highlighted by Pranith.

I have the habit of adding downstream comments as private. Removed them.

Comment 7 Pranith Kumar K 2016-08-31 21:36:37 UTC
This is the last comment from Nag over mail, and it confirms that this is the same bug we are tracking as bz 1361513.

Mail from Nag:
I have retried the case by creating 400GB of data on two different volumes (one with client-io-threads enabled and one with it disabled).
This time, as soon as I brought the brick up, I triggered the heal manually.
I don't see any difference in heal completion time between the two volumes.

Closing this as a duplicate of bug 1361513.

*** This bug has been marked as a duplicate of bug 1361513 ***

