If I enable performance.client-io-threads, then with the same workload, heal completion takes about double the time compared to leaving it disabled.

What I did:
Settings: default; no specific options set
Volume type: 1 x (4+2) disperse
Nodes: 6 VMs, each hosting one brick of 8GB (RHGS 3.1.3, glusterfs 3.7.9-10, RHEL 7.2)
Other cluster info: there are about another 5 volumes, but all were offlined

Test setup: mounted the volume on only one client via FUSE (RHEL 7.2, glusterfs 3.7.9-10).
Opened two terminals on the same client and started the I/O below:
Term1: created a directory dir1 under root and started this I/O:
for i in {1..50}; do dd if=/dev/urandom of=bgfile.$i count=997 bs=1000000; done
Term2: created a directory dir2 under root and started this I/O:
for i in {1..50}; do dd if=/dev/urandom of=bgfile.$i count=997 bs=1000000; done

After about 4 files (2 from each terminal), i.e. about 4GB created in total, I brought down brick1 and left it dead until 50GB of data had been created in total (so ~46GB left to heal).
I brought brick1 back using volume start force while the I/O was still going on.

Result: it took about 7 minutes to heal the data.
Re-ran the same scenario with performance.client-io-threads enabled; it took about 15 minutes.
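For reference, a rough shell sketch of this reproduction, assuming a hypothetical volume name (testvol), mount point (/mnt/testvol), and that the brick1 glusterfsd process on its node can be matched by volume name (all placeholders, not from the actual setup):

# on the client, one loop per terminal (dir1 shown; dir2 is identical)
MNT=/mnt/testvol
mkdir -p $MNT/dir1 && cd $MNT/dir1
for i in {1..50}; do dd if=/dev/urandom of=bgfile.$i count=997 bs=1000000; done

# on the brick1 node, once ~4GB has been written: take the brick down
VOL=testvol
kill $(pgrep -f "glusterfsd.*$VOL")

# once ~50GB has been written in total, bring the brick back while I/O continues
gluster volume start $VOL force

# watch heal progress until all bricks report zero pending entries
watch -n 10 gluster volume heal $VOL info

The two runs differ only in whether the option under test was set beforehand, e.g.:
gluster volume set testvol performance.client-io-threads on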
Nag,
There are two things I would like to know before looking closely at this bug. In EC there is an issue where the directory containing the files that need heal is marked for heal after the files themselves, so what I would like to make sure of is whether this is the same issue or not. Could you modify the test to do it this way?
1) Create the empty files before you bring the brick down.
2) Do the dd on those files while the brick is down.
3) Measure the time it takes to heal now.

IMO we see two problems which need to be fixed in EC when it comes to healing:
1) Making sure the parent directory of the files is healed before the files in that directory.
2) Making sure two heals on the same file don't lead to one waiting for the other to complete.
Both of these should be addressed.

Pranith
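A minimal sketch of this modified test, again assuming a hypothetical volume name (testvol) and mount point (/mnt/testvol), and assuming "heal info" reports one "Number of entries: N" line per brick:

VOL=testvol
MNT=/mnt/testvol

# 1) pre-create the empty files while all bricks are up
mkdir -p $MNT/dir1
for i in {1..50}; do touch $MNT/dir1/bgfile.$i; done

# take one brick down (e.g. by killing its glusterfsd process on that node)
kill $(pgrep -f "glusterfsd.*$VOL")

# 2) dd into the pre-created files while the brick is down
for i in {1..50}; do dd if=/dev/urandom of=$MNT/dir1/bgfile.$i count=997 bs=1000000; done

# 3) bring the brick back and time how long the heal takes to finish
start=$(date +%s)
gluster volume start $VOL force
while gluster volume heal $VOL info | grep 'Number of entries' | grep -vq ': 0'; do sleep 30; done
echo "heal took $(( $(date +%s) - start ))s"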
Pranith,
I retried the steps you mentioned; I don't see any (only a minimal) performance difference when toggling performance.client-io-threads between on and off.

However, one question: why then was I seeing the difference in my previous case, with the same kind of workload?
(In reply to nchilaka from comment #3)
> Pranith,
> I retried the steps you mentioned; I don't see any (only a minimal)
> performance difference when toggling performance.client-io-threads between
> on and off.
>
> However, one question: why then was I seeing the difference in my previous
> case, with the same kind of workload?

Nag,
There is no consistency in heal times at the moment because of the entry heal marking code. What happens is that the directory is marked for healing well after the files inside it that need heal are marked. A heal is triggered every 10 minutes, so in your bug description the brick could have been brought up just 1-2 minutes before the timer expired in the first case (where it took 7 minutes) and 7-8 minutes before the timer expired in the second case (where it took 15 minutes). If you run the same test with client-io-threads on multiple times, you will see different times because of this timing issue. Until this marking problem is addressed we will keep seeing this problem, IMO.

Pranith
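One way to take this 10-minute timer out of the comparison, as a rough sketch (testvol is a placeholder volume name), is to kick off an index heal manually as soon as the brick is back, instead of waiting for the periodic self-heal-daemon crawl:

gluster volume start testvol force
gluster volume heal testvol      # trigger index heal immediately
gluster volume heal testvol info # then watch pending entries drain

With the heal started at a known point, the completion times of the two runs (client-io-threads on vs. off) become directly comparable, without the timer-dependent variability.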
Guys, any specific reason why the comments need to be private here?

Nag - With this hypothesis I believe you'd be able to close this BZ and file a new one with existing issues highlighted by Pranith.
(In reply to Atin Mukherjee from comment #5)
> Guys, any specific reason why the comments need to be private here?
>
> Nag - With this hypothesis I believe you'd be able to close this BZ and file
> a new one with existing issues highlighted by Pranith.

I have the habit of adding downstream comments as private. Removed them.
This is the last comment from Nag by mail, which confirms that this is caused by the same bug we are tracking in bz 1361513.

Mail from Nag:
I have retried the case by creating 400GB of data on two different volumes (one volume with client-io-threads enabled and the other with it disabled). This time, as soon as I brought the brick up, I triggered the heal manually. I don't see any difference in the time heal completion takes between the two volumes.

Closing this as a duplicate of 1361513.

*** This bug has been marked as a duplicate of bug 1361513 ***