Bug 1200457

Summary: gluster performance degrades as data is added to a volume. tar extraction goes from 1 minute on an empty volume to 20 minutes on a volume holding 40 TB.
Product: [Community] GlusterFS          Reporter: drobinson
Component: core                         Assignee: Poornima G <pgurusid>
Status: CLOSED UPSTREAM                 QA Contact:
Severity: high                          Docs Contact:
Priority: high
Version: mainline                       CC: atumball, bfoster, bturner, bugs, drobinson, glush, jahernan, kaushal, kdhananj, mpillai, ndevos, pgurusid, pkarampu, spalai, srangana, waldon-redhat
Target Milestone: ---                   Keywords: Performance, Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:                       Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:                               Environment:
Last Closed: 2018-11-19 04:54:46 UTC    Type: Bug
Regression: ---                         Mount Type: ---
Documentation: ---                      CRM:
Verified Versions:                      Category: ---
oVirt Team: ---                         RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---                    Target Upstream Version:
Embargoed:

Attachments:
  - Testing without gluster in the stack
  - Further system information
  - small-file performance over multiple iterations

Description drobinson 2015-03-10 15:30:24 UTC
Description of problem:
The storage system slows down dramatically as data is added to the volume.  This problem was discovered while downloading and un-tarring the boost libraries source code.  On an empty volume the extraction takes 1 minute; on a volume holding 40 TB, the extraction takes 20 minutes.

Version-Release number of selected component (if applicable):
gluster 3.6.2

How reproducible:
This problem is reproducible on my systems and has been reproduced by Ben Turner on the Red Hat systems (email traffic from 2/17/15).  Ben was working with Shyam to resolve this issue.  Debugging info was sent to Shyam on/around 2/10/2015 (strace, io-stats, etc.).


Steps to Reproduce:
1.  Create a volume and add 20 TB of data.
2.  time tar -xf boost.tar
3.  Create a new volume with no data.
4.  time tar -xf boost.tar  (a consolidated sketch of this comparison follows below)
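For illustration, a rough shell sketch of this comparison (server, volume, and mount names are hypothetical; the timings in the comments are the observed results reported below):

    mount -t glusterfs server:/homegfs_40tb  /mnt/slow    # volume already holding tens of TB of data
    mount -t glusterfs server:/homegfs_empty /mnt/fast    # freshly created, empty volume
    ( cd /mnt/slow && time tar -xf /root/boost_1_57_0.tar )    # observed: ~20 minutes
    ( cd /mnt/fast && time tar -xf /root/boost_1_57_0.tar )    # observed: ~1 minute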

Actual results:
volume with 20 TB: 20 minutes
empty volume: 1 minute

Expected results:


Additional info:

Comment 1 Niels de Vos 2015-03-17 12:39:17 UTC
Shyam, if you passed this on to someone else to work on it, please adjust the status/assignee. Thanks!

Comment 2 Shyamsundar 2015-03-17 16:56:07 UTC
Assigned to self.

Comment 3 Shyamsundar 2015-03-18 18:44:52 UTC
Updating the analysis done to date:
See mail thread: http://www.gluster.org/pipermail/gluster-devel/2015-February/043792.html
--------------------------------------

1) Volumes are classified as slow (i.e. with a lot of pre-existing data) and fast (new volumes carved from the same backend filesystem that the slow bricks are on, with little or no data).

2) We ran an strace of tar and also collected io-stats outputs from these volumes; both show that create and mkdir are slower on the slow volume than on the fast volume. This appears to be the overall reason for the slowness (see the collection sketch after this list).

3) The tarball extraction is to a new directory on the gluster mount, so all lookups etc. happen within this new namespace on the volume.

4) Checked memory footprints of the slow bricks and fast bricks; nothing untoward noticed there.

5) Restarted the slow volume, just as a test case to start from scratch; no improvement in performance (this was on David's setup).
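For illustration, a rough sketch of this kind of data collection (the volume name is the one mentioned in this report; output paths are illustrative):

    # enable per-FOP profiling on the bricks, then trace the client-side tar
    gluster volume profile homegfs_backup start
    strace -f -T -o /tmp/tar.strace tar -xf boost.tar                 # per-syscall latencies on the client
    gluster volume profile homegfs_backup info > /tmp/io-stats.txt    # per-FOP latencies from the bricks
    gluster volume profile homegfs_backup stop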

---------------

My base filesystem has 40 TB and the tar takes 19 minutes. Copying over 10 TB took the tar extraction from 1 minute to 7 minutes.

My suspicion is that it is related to the number of files and not necessarily the file size. Shyam is looking into reproducing this behavior on a Red Hat system.

-------------

On 02/12/2015 11:18 AM, David F. Robinson wrote:
> Shyam,
>
> You asked me to stop/start the slow volume to see if it fixed the timing
> issue.  I stopped/started homegfs_backup (the production volume with 40+
> TB) and it didn't make it faster.  I didn't stop/start the fast volume
> to see if it made it slower.  I just did  that and sent out an email.  I
> saw a similar result as Pranith. 

-----------

Comment 4 Shyamsundar 2015-03-18 18:46:31 UTC
Current dev update:
-------------------

A patch (yet to be posted upstream) to make pruning of the inode lists asynchronous to the FOP call is in progress.

Testing it is pending on a setup with defined characteristics that can demonstrate the problem and hence the fix.

Comment 5 Anand Avati 2015-03-18 20:19:29 UTC
REVIEW: http://review.gluster.org/9936 (core: prune inode table asynchronously) posted (#1) for review on master by Shyamsundar Ranganathan (srangana)

Comment 6 Anand Avati 2015-03-23 14:10:56 UTC
REVIEW: http://review.gluster.org/9936 (core: prune inode table asynchronously) posted (#2) for review on master by Shyamsundar Ranganathan (srangana)

Comment 7 Shyamsundar 2015-03-25 13:51:13 UTC
Update (dev):
-------------

With the change in comment #5, I ran a series of tests to see if it improves the performance. Here are the current observations.

1) Tried reproducing the issue on a couple of setups in order to test the fix, but the old volume is not always observed to be slow, so the ability to reproduce the issue reliably is uncertain.

2) In cases where the issue was reproduced, the behavior was as follows:
   - Have a large amount of data on the volume
   - Fill up various internal caches, using a listing of all files in the volume and also cat'ing their contents
   - The first untar run after this is slow; from the next run onward things catch up and are fine (these tests were on a pure distribute volume)
   - When the untar is slow, mkdir and create are the FOPs that show the slowness, as in the investigation on the bug report
   - The slowness is almost completely in the brick processes; the client stack does not seem to be contributing anything (based on io-stats output)
   - With the prune fix the situation does not improve

3) Currently continuing to debug possible causes on this setup for the first run slowing down, as the reason for the slowness seems to align with what was initially reported and analyzed.

Comment 8 Shyamsundar 2015-03-26 19:19:03 UTC
Test that reproduces the issue:

1) Create ~5TB of 320KB files spread across several directories
2) Run "find . -exec cat {} \; > /dev/null" so that metadata and data are read from the gluster volume through a FUSE mount
3) After about 1-2 hours of (2) running, stop the command
4) Now extract the said tarball on the FUSE mount; this takes approximately 2x the time it would take if done right after step (1), and also 2x the time it takes after restarting the volume and remounting it on the clients (a consolidated sketch of these steps follows below)
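A rough consolidated shell sketch of these steps (the FUSE mount point, directory layout, file counts, and tarball path are illustrative):

    # 1) create ~5TB of 320KB files spread across several directories
    for d in $(seq 1 1000); do
        mkdir -p /mnt/vol/fill/$d
        for f in $(seq 1 16000); do
            dd if=/dev/zero of=/mnt/vol/fill/$d/$f bs=320K count=1 2>/dev/null
        done
    done
    # 2) read metadata and data back through the FUSE mount to fill internal caches
    cd /mnt/vol
    find . -exec cat {} \; > /dev/null &
    FIND_PID=$!
    sleep 2h && kill $FIND_PID              # 3) stop the scan after ~1-2 hours
    time tar -xf /root/boost_1_57_0.tar     # 4) roughly 2x slower than right after step 1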

Analysis:
1) The slowness demonstrated in step (4) is only for the first 1-2 tar extractions, after which performance comes back to the good numbers
2) In these slow runs, mkdir and create are the FOPs that slow down the entire process
3) Placing io-stats on top of POSIX on the bricks and checking the time deltas for each FOP across POSIX -> server xlator -> FUSE, it is seen that the time is consumed in POSIX, and that layer needs further investigation.

@David, would it be possible to run a similar io-stats capture at your end on your volumes, so that I can corroborate that this is a POSIX xlator layer issue only?

Comment 9 Shyamsundar 2015-03-30 17:09:27 UTC
Further observations and info
-----------------------------

A) Replicated the test on David's (the reporter's) setup with the io-stats xlator on top of the posix xlator and below the server xlator (as usual), and observed that almost all of the latency comes from the posix xlator's mkdir and create FOPs.

This points to the slowness in POSIX xlator or below it.

David had also run a tar extraction directly on the brick, and that was fast (10 seconds). So there is no real slowness on XFS directly.

The current suspicion is on link creation in the flat .glusterfs namespace, as that would have accumulated a lot of files (the customer setup holds about 50 TB of data).
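For context, a rough illustration of that flat namespace (the brick path is the one shown later in this bug; the file path and gfid are hypothetical): every regular file on a brick carries a trusted.gfid xattr, and a hard link named after that gfid lives two levels deep under .glusterfs, keyed by the first two bytes of the gfid; directories get a symlink there instead.

    # read the gfid of a file directly on the brick
    getfattr -n trusted.gfid -e hex /data/brick01bkp/homegfs_bkp/some/dir/file
    # trusted.gfid=0x8293...  ->  hard link lives under .glusterfs/82/93/8293....
    ls -li /data/brick01bkp/homegfs_bkp/.glusterfs/82/93/ | head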

B) Attempting a pure XFS test with a workload that behaves like gluster, i.e. create a file or directory and hard/soft link it within a flat .glusterfs namespace, to see if a relatively filled-up directory tree causes slowness.

C) Also requested an strace from the brick processes on David's setup, to compare latency deviations between the fast and slow cases.

Comment 10 Shyamsundar 2015-03-31 13:35:27 UTC
C) Strace from David points to the following,

linkat and symlink show a high average latency on the bricks that belong to the slow volume and are fast on the bricks that belong to the fast volume.

The difference is orders of magnitude, as follows:

Fast brick:
Avg. linkat latency from strace: 0.000012 seconds
#calls: 22025
Time taken for linkat: 0.000012 * #calls = 0.26914 seconds

Slow brick:
Avg. linkat latency from strace: 0.0103351812 seconds
#calls: 22025
Time taken for linkat: 227 seconds

Similar data is seen for symlink into the .glusterfs directories.

Overall the latency here is for a single brick; extrapolating (i.e. multiplying by 2) shows that we would save ~450 seconds on linkat alone if this were fast, which points to the problem being linkat and symlink into the .glusterfs directories.
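For illustration, a rough sketch of how per-syscall averages and totals of this kind can be pulled from strace -T output (the trace file name is hypothetical):

    # sum the per-call latency that strace -T appends as <0.000123> to each line
    awk '/linkat\(/ && /<[0-9.]+>$/ {
             t = $NF; gsub(/[<>]/, "", t); sum += t; n++
         }
         END { if (n) printf "linkat: calls=%d  avg=%.9f s  total=%.3f s\n", n, sum/n, sum }' brick.strace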

Next, looking at option B (in comment #9), possibly getting an xfs_metadump from David, and also gathering entry counts for the 5 best and 5 worst linkat directories, to see how to proceed with looking at this from an XFS/VFS perspective.

Comment 11 Shyamsundar 2015-03-31 16:44:18 UTC
Data from David's setup on the 5 fast linkat directories and 5 slowest linkat directories, in terms of entry counts:

Fast:
/data/brick01bkp/homegfs_bkp/.glusterfs/82/93/
/data/brick01bkp/homegfs_bkp/.glusterfs/f1/34/
/data/brick01bkp/homegfs_bkp/.glusterfs/df/b0/
/data/brick01bkp/homegfs_bkp/.glusterfs/67/f7/
/data/brick01bkp/homegfs_bkp/.glusterfs/96/e0/

Slow:
/data/brick01bkp/homegfs_bkp/.glusterfs/e0/24/
/data/brick01bkp/homegfs_bkp/.glusterfs/80/e1/
/data/brick01bkp/homegfs_bkp/.glusterfs/e3/f4/
/data/brick01bkp/homegfs_bkp/.glusterfs/da/80/
/data/brick01bkp/homegfs_bkp/.glusterfs/a9/ca/

Fast:
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/82/93/ | wc -l
81
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/f1/34/ | wc -l
72
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/df/b0/ | wc -l
77
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/67/f7/ | wc -l
58
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/96/e0/ | wc -l
57

Slow:
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/e0/24/ | wc -l
69
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/80/e1/ | wc -l
56
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/e3/f4/ | wc -l
61
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/da/80/ | wc -l
64
[root@gfs01bkp ~]# ls -l /data/brick01bkp/homegfs_bkp/.glusterfs/a9/ca/ | wc -l
79
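For illustration, a rough way to gather such entry counts across the whole .glusterfs hash tree at once (brick path taken from above; output file is illustrative):

    cd /data/brick01bkp/homegfs_bkp/.glusterfs
    for d in [0-9a-f][0-9a-f]/[0-9a-f][0-9a-f]; do
        printf '%6d %s\n' "$(ls -1 "$d" | wc -l)" "$d"
    done | sort -n > /tmp/gfs-entry-counts
    head -5 /tmp/gfs-entry-counts    # least populated hash directories
    tail -5 /tmp/gfs-entry-counts    # most populated hash directories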

Comment 12 Shyamsundar 2015-04-14 13:11:15 UTC
Information requested has been provided and uploaded to this bug in the comments above.

Comment 13 Shyamsundar 2015-06-04 18:14:41 UTC
Created attachment 1034827 [details]
Testing without gluster in the stack

Update for point (B) in comment #9.

Attempted to create a simple program that runs on a local XFS mount, to see if the problem can be replicated without gluster in the picture.

The attached program was run for a few hours both on a local setup and on David's setup, but did not show any linkat/symlink creation slowness as observed with Gluster.
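The attached program itself is not reproduced here; a rough shell approximation of such a test (mount point and file count are placeholders) could look like:

    # create small files on a plain XFS mount and hard-link each one into a flat,
    # two-level hash tree, mimicking gluster's .glusterfs layout
    top=/mnt/xfstest                       # local XFS mount, no gluster involved
    mkdir -p "$top/data" "$top/.glusterfs"
    for i in $(seq 1 1000000); do
        f="$top/data/file.$i"
        dd if=/dev/zero of="$f" bs=64K count=1 2>/dev/null
        id=$(uuidgen)                      # stand-in for a gfid
        d="$top/.glusterfs/${id:0:2}/${id:2:2}"
        mkdir -p "$d"
        ln "$f" "$d/$id"                   # hard link, as gluster does for regular files
    done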

This means of detecting or narrowing the problem does not seem to be bearing fruit.

Comment 14 Shyamsundar 2015-06-05 19:49:42 UTC
Created attachment 1035401 [details]
Further system information

Some further information from David's setup:

1) [root@gfs01bkp homegfs]# cat /etc/redhat-release
Scientific Linux release 6.6 (Carbon)

2) ext4 also possibly shows the same increased untar times (in a more controlled test)

3) After a reboot, a rerun of the tests still shows the same slowness:

    [root@gfs01bkp homegfs]# date; time tar -xPf boost_1_57_0.tar ; date; time rm -rf boost_1_57_0; date
    Fri Jun  5 12:43:39 EDT 2015
    real    26m47.391s
    user    0m0.642s
    sys     0m3.873s
    Fri Jun  5 13:10:26 EDT 2015
    real    0m33.857s
    user    0m0.130s
    sys     0m1.048s
    Fri Jun  5 13:11:00 EDT 2015

The attached file contains data from the following commands:
     - kernel version (uname -a)
     - xfsprogs version (xfs_repair -V)
     - number of CPUs
     - contents of /proc/mounts
     - contents of /proc/partitions
     - xfs_info output on the filesystem in question

Comment 15 Shyamsundar 2015-11-06 14:50:44 UTC
This bug is still ASSIGNED, as the inode table pruning code does not help the situation.

Comment 16 Manoj Pillai 2016-08-11 17:13:27 UTC
I see a few items related to small-file performance planned for the near future: compound FOPs, tiering improvements for small files. IMO, we need to fix this for all those enhancements to really make sense. And in any case, we really need to understand what is going on here. I'm planning to spend some time on this bug to see if I can help figure out what is happening.

Comment 17 Manoj Pillai 2016-08-11 17:36:31 UTC
Created attachment 1190117 [details]
small-file performance over multiple iterations

Each iteration here creates and writes the same number of files and reports the aggregate throughput for that iteration. So iteration 1 writes 128K files; iteration 2 adds another 128K files on top of that, and so on. A sync is performed between iterations. Ideally, the throughput reported for each iteration should be the same.

1. Gluster-HDD is for a single-brick gluster volume on a hard disk, fuse mounted on clients.
2. kNFS-HDD is the same brick filesystem as in 1., but this time kernel-NFS exported and NFS mounted on the clients.
3. Gluster-NVMeSSD is a single-brick gluster volume, like in 1., but this time created on an NVMe SSD (note that the throughput is much higher in this case and is shown on the secondary Y axis).

The Gluster-HDD line shows the severe performance degradation that has been reported in this bz -- throughput in iteration 8 is less than 1/4th the throughput in iteration 1.

The kNFS-HDD and Gluster-NVMeSSD cases do not show this degradation, though there is a slight drop in the case of kNFS.

To me these results suggest that:
(1) The performance degradation we are seeing in gluster is due to increased seeks, likely caused by fragmentation or sub-optimal allocation. The seek hypothesis would explain why the degradation is not seen in the SSD case (a sketch of how this could be checked follows below).

(2) The fragmentation or sub-optimal allocation is primarily due to the extra metadata that gluster maintains. This would explain why the degradation is not seen in the kNFS case.
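A rough sketch of how the fragmentation/allocation hypothesis could be checked on the brick filesystem (device and paths are placeholders):

    xfs_db -r -c frag /dev/mapper/brick_lv          # overall XFS fragmentation factor
    filefrag -v /bricks/perfvol/smf/1/some/file     # extent layout of an individual small file
    iostat -x 1                                     # watch disk await/%util on the brick device during an iteration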

Comment 18 Manoj Pillai 2016-08-11 18:00:27 UTC
(In reply to Manoj Pillai from comment #17)

More details on the test:

Versions:
glusterfs*-3.8.2-1.el7.x86_64
RHEL 7.2 (server): kernel-3.10.0-327.el7.x86_64
RHEL 7.1 (clients): kernel-3.10.0-229.el7.x86_64

Benchmarking Tool: smallfile version 3.0 [https://github.com/bengland2/smallfile]


smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 4 --files 8192 --file-size 64 --record-size 64 --fsync Y --response-times N --operation create

There are 4 clients, so no. of files in each iteration is: (4 clients) * (4 threads per client) * 8192 = 128K files. file size is 64KB.

Each iteration creates a new "top directory" under the mountpoint and all files created in that iteration are created inside that sub-directory in a hierarchy with 100 files per directory (default).

Gluster volume info:
Volume Name: perfvol
Type: Distribute
<...>
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
<...>
Options Reconfigured:
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 8
performance.client-io-threads: on
transport.address-family: inet
performance.readdir-ahead: on

Comment 19 Pranith Kumar K 2016-08-12 03:19:10 UTC
Manoj,
      This is interesting data. I wonder if you got a chance to gather strace output from the brick using "strace -ff -T -p <pid-of-brick> -o <path-to-the-file-where-you-want-the-output-saved>". This gives the time spent in each of the syscalls. It would be interesting to know what kind of syscalls are leading to this problem. What do you think?

Pranith

Comment 20 Manoj Pillai 2016-08-12 05:11:26 UTC
(In reply to Pranith Kumar K from comment #19)

There is strace analysis in Shyam's comment #10. Please take a look. We can collect again if needed.

In the meantime, I'd very much like to get a primer on the metadata gluster maintains on the brick: what, why, and how it figures in the I/O path. If you can point to a link where this is documented, or post here, or discuss offline, that would be very useful.

Comment 21 Kaushal 2016-08-23 12:39:20 UTC
GlusterFS 3.6 is nearing its End-Of-Life; only important security bugs still have a chance of getting fixed. Moving this to the mainline 'version'. If this needs to get fixed in 3.7 or 3.8, this bug should be cloned.

Comment 22 Pranith Kumar K 2016-08-25 16:23:34 UTC
Some findings from the tests I ran based on Brian's inputs:
We removed all the xlators between fuse and posix, so that the volfile looks like this:
volume test-posix-0
    type storage/posix
    option directory /home/brick/test20
end-volume

Mounted this volfile using:
glusterfs --volfile=/root/volfile.vol /d

Small file CLI we used is: 
for i in {1..80}; do mkdir /mnt/smf/$i ; python smallfile_cli.py --top /mnt/smf/$i --threads 4 --files 8192 --file-size 64 --record-size 64 --fsync Y --response-times N --operation create >> /root/smf.out; done

1) In this test we ran the small-file workload using the above CLI 25 times. We observed that the elapsed time kept increasing.
These are the times:
[root@gprfs039 ~]# grep 'sec elapsed' just-posix-smf.out 
45.026515 sec elapsed time
45.630780 sec elapsed time
47.106841 sec elapsed time
48.994986 sec elapsed time
49.056766 sec elapsed time
47.511098 sec elapsed time
57.042012 sec elapsed time
57.128427 sec elapsed time
63.952732 sec elapsed time
77.634293 sec elapsed time
69.540255 sec elapsed time
84.495499 sec elapsed time
83.755443 sec elapsed time
76.225897 sec elapsed time
86.382487 sec elapsed time
113.626848 sec elapsed time
109.282297 sec elapsed time
95.230611 sec elapsed time
110.965802 sec elapsed time
90.876091 sec elapsed time
98.172049 sec elapsed time
106.055803 sec elapsed time
133.958628 sec elapsed time
104.454683 sec elapsed time
119.641330 sec elapsed time

2) In this test we did a remount and resumed the test from where we had stopped. The first run took 97 seconds, which was in the same range as before.

3) In this test we did a drop-caches. What we observed is that the elapsed time started at around 440 seconds, came down to about 95 seconds over time, and then hovered around 95-120 seconds:
442.857238 sec elapsed time
314.401744 sec elapsed time
260.829069 sec elapsed time
219.472399 sec elapsed time
201.450154 sec elapsed time
178.373704 sec elapsed time
158.367833 sec elapsed time
151.658923 sec elapsed time
144.989891 sec elapsed time
126.476212 sec elapsed time
133.021995 sec elapsed time
106.850061 sec elapsed time
105.401238 sec elapsed time
118.126849 sec elapsed time
115.904676 sec elapsed time
94.786972 sec elapsed time
119.076134 sec elapsed time

4) In this test we kept creating new bricks on the same XFS partitions by generating new volfiles and running small file create using the following CLI:
for i in {2..26}; do echo "volume test-posix-0" > /root/volfile.vol ; echo "    type storage/posix" >> /root/volfile.vol; echo "    option directory /home/brick/test$i"  >>  /root/volfile.vol; echo "end-volume" >>  /root/volfile.vol; mkdir /home/brick/test$i; glusterfs --volfile=/root/volfile.vol /d; mkdir -p /d/smf/$i ; python smallfile_cli.py --top /d/smf/$i --threads 4 --files 8192 --file-size 64 --record-size 64 --fsync Y --response-times N --operation create >> /root/separate-bricks.out; umount /d; done

Run time didn't degrade in this case!!
Here are the times:
[root@gprfs039 ~]# grep 'sec elapsed' /root/separate-bricks.out 
48.763393 sec elapsed time
48.581379 sec elapsed time
48.777564 sec elapsed time
49.469651 sec elapsed time
48.915736 sec elapsed time
50.647200 sec elapsed time
48.117454 sec elapsed time
49.835503 sec elapsed time
48.549795 sec elapsed time
50.297137 sec elapsed time
51.053001 sec elapsed time
49.238791 sec elapsed time
49.562103 sec elapsed time
46.240161 sec elapsed time
50.360794 sec elapsed time
48.729912 sec elapsed time
51.763603 sec elapsed time
49.370971 sec elapsed time
48.930266 sec elapsed time
48.070540 sec elapsed time
48.472625 sec elapsed time
48.350298 sec elapsed time
47.316538 sec elapsed time
46.672402 sec elapsed time
47.304169 sec elapsed time

Will update with more info once we find out more about why this is the case.

Pranith

Comment 23 Pranith Kumar K 2016-08-25 21:47:33 UTC
Ran strace on the bricks for the fast and slow case; these are the 10 maximum syscall latencies for all of the syscalls. I am seeing high latencies for fsync and stats in the slow case:
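For illustration, a rough sketch of how such a top-10 list can be extracted per syscall from strace -ff -T output files (the file name pattern is hypothetical):

    # ten largest per-call latencies for one syscall, e.g. fsync
    grep -h 'fsync(' brick-strace.* \
        | sed -n 's/.*<\([0-9.]*\)>$/\1/p' \
        | sort -rn | head -10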

Normal case:

chown
0.000067
0.000070
0.000071
0.000079
0.000086
0.000099
0.000141
0.000147
0.000172
0.000249
close
0.000202
0.000202
0.000203
0.000204
0.000204
0.000204
0.000206
0.000207
0.000220
0.000505
fgetxattr
0.000143
0.000147
0.000151
0.000152
0.000155
0.000158
0.000160
0.000162
0.000162
0.000193
fstat
0.000130
0.000131
0.000132
0.000133
0.000133
0.000134
0.000134
0.000151
0.000156
0.000158
fsync
0.018422
0.018479
0.018490
0.018572
0.018582
0.033627
0.036283
0.047154
0.078602
0.695221
getdents
0.000108
0.000109
0.000109
0.000110
0.000112
0.000114
0.000114
0.000116
0.000118
0.000164
lgetxattr
0.000081
0.000081
0.000082
0.000082
0.000083
0.000084
0.000085
0.000089
0.000091
0.000094
linkat
0.000097
0.000099
0.000099
0.000100
0.000105
0.000113
0.000138
0.000152
0.000154
0.000182
lseek
0.000039
0.000074
0.000087
0.000103
0.000106
0.000113
0.000113
0.000114
0.000115
0.000134
lsetxattr
0.000095
0.000097
0.000098
0.000099
0.000100
0.000103
0.000106
0.000139
0.000167
0.000171
lstat
0.002063
0.002094
0.003154
0.003503
0.003998
0.004202
0.004760
0.005640
0.008268
0.015741
mkdir
0.000258
0.000259
0.000265
0.000268
0.000269
0.000269
0.000276
0.000279
0.000293
0.001713
openat
0.000118
0.000120
0.000123
0.000123
0.000123
0.000124
0.000124
0.000126
0.000126
0.000130
open
0.002263
0.003762
0.003774
0.005012
0.005164
0.005207
0.005640
0.005902
0.006392
0.193450
pwrite
0.000224
0.000224
0.000228
0.000248
0.000249
0.000250
0.000259
0.000266
0.000398
0.000636
read
0.000085
0.000086
0.000087
0.000088
0.000098
0.000124
0.000130
0.000145
0.000162
0.000238
readlink
0.000020
0.000021
0.000021
0.000023
0.000024
0.000028
readv
0.000956
0.006536
0.047662
0.061173
0.224371
0.332077
2.278480
48.556456
85.082750
127.490240
stat
0.000085
0.000089
0.000099
0.000116
0.000120
0.000128
0.000131
0.000132
0.000135
0.000145
symlink
0.000180
0.000183
0.000183
0.000183
0.000190
0.000195
0.000196
0.000202
0.000208
0.000222
write
0.000089
0.000127
0.000130
0.000135
0.000136
0.000136
0.000137
0.000140
0.000140
0.000258
writev
0.000148
0.000157
0.000157
0.000159
0.000160
0.000163
0.000170
0.000904
0.004772
0.088366

Slow case:
chown
0.000169
0.000169
0.000170
0.000174
0.000175
0.000177
0.000179
0.000179
0.000193
0.000265
close
0.000191
0.000192
0.000205
0.000210
0.000218
0.000224
0.000315
0.000321
0.000416
0.000454
fgetxattr
0.000209
0.000232
0.000250
0.000279
0.000286
0.000361
0.000443
0.000519
0.000824
0.000877
fstat
0.000231
0.000264
0.000264
0.000275
0.000314
0.000363
0.000440
0.000640
0.001134
0.002430
fsync
1.500296
1.503076
1.513502
1.516628
1.522290
1.523135
1.535945
1.546670
1.548597
1.590475
getdents
0.000117
0.000118
0.000118
0.000119
0.000120
0.000120
0.000125
0.000125
0.000125
0.000126
lgetxattr
0.000155
0.000157
0.000157
0.000158
0.000164
0.000164
0.000171
0.000184
0.000205
0.000351
linkat
0.000214
0.000224
0.000228
0.000228
0.000231
0.000231
0.000234
0.000252
0.010239
0.070905
lremovexattr
0.000028
0.000028
0.000028
0.000029
0.000029
0.000029
0.000030
0.000031
0.000031
0.000032
lseek
0.000148
0.000151
0.000152
0.000158
0.000158
0.000162
0.000164
0.000178
0.000178
0.000188
lsetxattr
0.000161
0.000162
0.000164
0.000165
0.000165
0.000166
0.000167
0.000177
0.000292
0.000488
lstat
1.529056
1.529124
1.530052
1.531486
1.531578
1.531743
1.533177
1.722707
1.931263
2.287300
mkdir
0.000249
0.000259
0.000260
0.000268
0.000272
0.000306
0.000358
0.000415
0.000477
0.002294
openat
0.000121
0.000124
0.000126
0.000126
0.000127
0.000129
0.000131
0.000131
0.000135
0.000190
open
0.022604
0.024123
0.025090
0.030248
0.033538
0.034225
0.036417
0.037107
0.043395
0.047846
pwrite
0.000279
0.000285
0.000285
0.000290
0.000292
0.000294
0.000302
0.000336
0.000374
0.000374
read
0.000165
0.000165
0.000168
0.000169
0.000187
0.000194
0.000207
0.000212
0.000550
0.000552
readv
0.201163
0.242594
0.247809
0.256123
0.263686
0.271234
0.273045
0.274275
0.280773
0.286999
stat
0.000169
0.000170
0.000181
0.000230
0.000233
0.000245
0.000279
0.000321
0.000323
0.000775
symlink
0.000248
0.000251
0.000252
0.000253
0.000259
0.000263
0.000267
0.000274
0.000286
0.000298
truncate
0.000028
0.000028
0.000029
0.000031
0.000032
0.000032
0.000032
0.000033
0.000033
0.000034
write
0.000184
0.000184
0.000187
0.000193
0.000194
0.000204
0.000218
0.000221
0.000227
0.000229
writev
0.000497
0.000580
0.000603
0.000726
0.000804
0.001008
0.014676
0.021616
0.027423
0.030991

Comment 24 Pranith Kumar K 2016-08-25 21:48:59 UTC
Oops, big comment :-). Just check the fsync/lstat headings in both cases.

Comment 25 Brian Foster 2016-08-26 12:16:04 UTC
(In reply to Pranith Kumar K from comment #23)
> Ran strace on the bricks for the fast and slow case, these are the 10
> maximum syscall latencies for all of the syscalls. I am seeing high
> latencies for fsync and stats in the slow case:
> 
> Normal case:
> 
...
> fsync
> 0.018422
> 0.018479
> 0.018490
> 0.018572
> 0.018582
> 0.033627
> 0.036283
> 0.047154
> 0.078602
> 0.695221
...
> lstat
> 0.002063
> 0.002094
> 0.003154
> 0.003503
> 0.003998
> 0.004202
> 0.004760
> 0.005640
> 0.008268
> 0.015741
...
> 
> Slow case:
...
> fsync
> 1.500296
> 1.503076
> 1.513502
> 1.516628
> 1.522290
> 1.523135
> 1.535945
> 1.546670
> 1.548597
> 1.590475
...
> lstat
> 1.529056
> 1.529124
> 1.530052
> 1.531486
> 1.531578
> 1.531743
> 1.533177
> 1.722707
> 1.931263
> 2.287300
...

Ok, did we learn anything about whether the number of calls is roughly similar between the fast and slow runs? If so, the next best thing is probably to try and dig into where the latency originates via the function graph tracer.
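For reference, a rough sketch of one way to run the function graph tracer against a brick process via tracefs (the PID variable and capture window are placeholders):

    cd /sys/kernel/debug/tracing
    echo 0 > tracing_on
    echo function_graph > current_tracer
    echo $BRICK_PID > set_ftrace_pid       # limit tracing to the brick process
    echo funcgraph-proc > trace_options    # tag entries with the task name
    echo 1 > tracing_on
    sleep 30                               # capture while the slow workload runs
    echo 0 > tracing_on
    cat trace > /tmp/brick-funcgraph.txt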

Comment 26 Pranith Kumar K 2016-08-26 14:58:50 UTC
I will need to rerun this test to get this information, because I couldn't get the granularity of the tests right in the last run. What I will do this time is run the test 25 times creating new bricks and 25 times on the same brick, and capture the outputs.

Comment 27 Amar Tumballi 2018-09-18 08:35:23 UTC
It has been exactly 2 years since the last update! Not sure if the behavior is the same on the latest versions; we need to check and evaluate.

Comment 28 Poornima G 2018-11-19 04:54:46 UTC
A lot has changed with respect to performance improvements in gluster. Closing this bug for now; please retest with the latest release and reopen if it's still the case.