Bug 1419807

Summary: [Perf]: 25% regression on sequential reads on EC over SMB3
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Ambarish <asoman>
Component: open-behind    Assignee: Milind Changire <mchangir>
Status: CLOSED ERRATA QA Contact: Karan Sandha <ksandha>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.2    CC: amukherj, aspandey, bturner, ksandha, mchangir, pgurusid, pkarampu, rcyriac, rgowdapp, rhinduja, rhs-bugs, sheggodu, smali, storage-qa-internal
Target Milestone: ---    Keywords: Regression, Reopened, ZStream
Target Release: RHGS 3.4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1455872 (view as bug list) Environment:
Last Closed: 2018-09-04 06:32:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1455872    
Bug Blocks: 1503134    

Description Ambarish 2017-02-07 06:00:03 UTC
Description of problem:
-----------------------

Testbed: 12 x (4+2), 6 servers, 6 workload-generating clients.

Benchmark: 3.1.3 with io-threads enabled.

3.2 testing was done with io-threads enabled and the mdcache parameters set.

It looks like we have regressed with 3.2 on large-file writes and reads over SMB3:


******************
Sequential Reads
******************

3.1.3 : 1499580.48 kB/sec
3.2 :   1106730.49 kB/sec

Regression : 26%
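
(For reference, the figure above is just the drop relative to the 3.1.3 baseline:)

# awk 'BEGIN { printf "%.1f%%\n", (1499580.48 - 1106730.49) / 1499580.48 * 100 }'
26.2%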

AIO was disabled at all times.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------


glusterfs-3.8.4-10.el7rhgs.x86_64

How reproducible:
-----------------

every which way I try.

Actual results:
----------------

Sequential workloads on SMB 3 mounts have regressed by 25%.


Expected results:
-----------------

Regression, if any, should be within the 10% threshold.

Additional info:
------------------


> Client and Server OS : RHEL 7.3

> *Vol config* :

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 4a377ad9-0c87-4553-b45f-95ab0590c055
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas001.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas016.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick8: gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick9: gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick10: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick11: gqas001.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick12: gqas016.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick13: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick14: gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick15: gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick16: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick17: gqas001.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick18: gqas016.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick19: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick20: gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick21: gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick22: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick23: gqas001.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick24: gqas016.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick25: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick26: gqas009.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick27: gqas010.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick28: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick29: gqas001.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick30: gqas016.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick31: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick32: gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick33: gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick34: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick35: gqas001.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick36: gqas016.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick37: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick38: gqas009.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick39: gqas010.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick40: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick41: gqas001.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick42: gqas016.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick43: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick44: gqas009.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick45: gqas010.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick46: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick47: gqas001.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick48: gqas016.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick49: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick50: gqas009.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick51: gqas010.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick52: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick53: gqas001.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick54: gqas016.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick55: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick56: gqas009.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick57: gqas010.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick58: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick59: gqas001.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick60: gqas016.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick61: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick62: gqas009.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick63: gqas010.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas001.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas016.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick68: gqas009.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick69: gqas010.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas001.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas016.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.lookup-optimize: on
performance.client-io-threads: on
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
server.event-threads: 4
client.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-samba-metadata: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600

Comment 4 Ambarish 2017-02-07 06:04:30 UTC
Adding Regression keyword since it's a regression from the last released version of RHGS.

Comment 6 Ambarish 2017-02-08 06:30:52 UTC
It has come to my attention that one of the servers, gqas010, has gone bad. I'll recheck this on another setup and share the results.

Comment 7 Ambarish 2017-02-11 14:39:15 UTC
This has been reproduced on another setup (the one Ben and Karan were using).

3.1.3 : 1499580.48 kB/sec

3.8.4-14 : 1112863.16 kB/sec

Comment 9 Ambarish 2017-02-13 10:33:19 UTC
Seeing extremely high latencies for READDIR (3500+ us) on all bricks with the 3.2 bits, compared to 300-600 us in 3.1.3.
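
(For reference, a minimal way to capture these per-brick fop latencies, assuming the same volume; the output file name is just an example:)

# gluster volume profile butcher start
# <run the read workload>
# gluster volume profile butcher info > profile-reads.txt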

Comment 10 Ashish Pandey 2017-02-16 06:03:59 UTC
Ambarish,

I talked to Poornima and she suggested gathering a bit more information.
We need to see the profile info after disabling md-cache and after disabling io-threads.
So, please provide these two.
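
(A sketch of how those two could be switched off for the re-run, based on the options already listed in the volume info above:)

# gluster volume set butcher performance.client-io-threads off
# gluster volume set butcher performance.stat-prefetch off
# gluster volume set butcher performance.cache-samba-metadata off
# gluster volume set butcher performance.cache-invalidation off
# gluster volume set butcher features.cache-invalidation off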

Comment 11 Ambarish 2017-02-17 03:41:40 UTC
Ashish/Poornima,

I have the data with io-threads disabled. I see no impact with or without io-threads.

3.1.3 : 1499580.48 kB/sec

3.2 with io-threads : 1106730.49 kB/sec

3.2 without io-threads : 1111601.68 kB/sec

The regression is still there, even without io-threads enabled.

ICYMI, 3.1.3 benchmarking was also done with io-threads on, as we encouraged our EC customers on 3.1.3 to enable io-threads for better writes.


Tests without mdcache are something I can fetch by EOD or tomorrow, most likely, once the Beaker downtime is over.

Comment 12 Ambarish 2017-02-18 13:37:19 UTC
I could reproduce the perf regression, even after switching off mdcache.

So, in a nutshell:

3.1.3 : 1499580.48 kB/sec

3.2 with io-threads : 1106730.49 kB/sec

3.2 without io-threads : 1111601.68 kB/sec

3.2 without mdcache tunables : 1146567 kB/sec

Comment 14 Pranith Kumar K 2017-02-21 12:00:43 UTC
17:19:30 :) ⚡ grep -w READ 3.2_SMB3-read.txt | awk '{sum+=$8}END{print sum/2}'
6061757

17:19:31 :) ⚡ grep -w READ 3.1.3_EC_SMB3-read.txt | awk '{sum+=$8}END{print sum/2}'
5663296

%-increase:
>>> (6061757.0 - 5663296.0)/5663296
0.0703584979488976

So there is a ~7% increase in the number of READ calls.

17:24:21 :) ⚡ grep -w READ 3.2_SMB3-read.txt | awk '{sum+=($2*$8);calls+=$8}END{print sum/calls}'
278.358

17:20:23 :) ⚡  grep -w READ 3.1.3_EC_SMB3-read.txt | awk '{sum+=($2*$8);calls+=$8}END{print sum/calls}'
261.703

>>> (278.358-261.703)/261.703
0.06364084477441997

So there is a ~6% increase in the average READ fop latency.

We need to find out two things here.
1) Considering this issue is observed only on SMB, can we have someone from Samba take a look and update us with any information about why the number of calls could have increased?

2) We need to know exactly which fops are driving the increase in latency. On-disk latency should be helpful here as well.

Will leave needinfo on Poornima for 1) above and Ambarish for 2) above.

Comment 15 Poornima G 2017-02-22 10:04:18 UTC
For 1), is it possible to test with the 3.1.3 Samba packages and the latest Gluster, to check whether the 6% increase is because of the Samba update?

Also, is it possible to get the number of reads from 3.2 Samba on a distributed-replicate volume and compare it against EC?

Comment 21 Pranith Kumar K 2017-02-28 05:23:42 UTC
Ambarish,
   Is this not the command you used for the strace capture?
strace -ff -T -p <pid-of-brick> -o <path-to-the-file-where-you-want-the-output-saved>

I don't see any time taken for syscalls in the strace files.
I see strace output like:
fgetxattr(31, "trusted.gfid", "\x9a\xfb\x8aZ$\xaaC\x9e\x87\x88\x8c\x07\x17_\x1f\xcd", 16) = 16

There is no <time-taken-for-syscall> at the end in '<>'.

Will this require a re-run?
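
(For reference, with -T each syscall line in the trace should end with the elapsed time in angle brackets; the value below is only illustrative:)

fgetxattr(31, "trusted.gfid", "\x9a\xfb\x8aZ$\xaaC\x9e\x87\x88\x8c\x07\x17_\x1f\xcd", 16) = 16 <0.000021>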

Comment 22 Pranith Kumar K 2017-02-28 05:30:07 UTC
Ambarish,
   I found an 11.6% increase in the number of preads, which could cause the regression. Let me look into why that is the case while you provide the latency numbers.

Pranith

Comment 23 Pranith Kumar K 2017-02-28 06:14:21 UTC
In 3.1.3 each brick has the file opened:
      0.00     489.25 us     406.00 us     610.00 us             12        OPEN

In 3.2.0, the file is not opened.

Here is the difference between open calls in 3.1.3 vs 3.2.0.
In 3.1.3:
   7245 open calls, for the respective files:

[qe@rhsqe-repo gqas005]$ cut -f2 -d'(' open-latency.txt | cut -f1 -d',' | sort | uniq -c
      1 "/bricks1/brick/.glusterfs/10/59/105999e1-0085-4066-886a-5c5b4cb1d301"
      1 "/bricks1/brick/.glusterfs/1f/1b/1f1bba3f-d316-476c-89af-4660a608986c"
      1 "/bricks1/brick/.glusterfs/22/16/2216f451-6cfb-4aff-80f8-05224e51266b"
   7194 "/bricks1/brick/.glusterfs/28/ae/28ae65bb-5df5-4371-bcc9-41b0db4b89bb"
      1 "/bricks1/brick/.glusterfs/29/2c/292ca564-b669-4b19-aa0d-7162ade9db80"
      1 "/bricks1/brick/.glusterfs/2b/75/2b75dd52-7f63-42fc-bb80-a33cfd965b53"
      1 "/bricks1/brick/.glusterfs/50/f4/50f4072e-1f4a-4d3e-9f53-1535c293278f"
      1 "/bricks1/brick/.glusterfs/52/4c/524c472f-8f95-403a-a6cb-5815666ee1f6"
      1 "/bricks1/brick/.glusterfs/54/4c/544cd41f-962d-4da5-96ee-daa4490f5fda"
      1 "/bricks1/brick/.glusterfs/76/5b/765b3ca6-caca-4654-8813-336cd0761fee"
      1 "/bricks1/brick/.glusterfs/c5/11/c5115144-f7d6-4689-a62a-fe4e6ed69cad"
      1 "/bricks1/brick/.glusterfs/f2/e2/f2e2c259-0194-4846-afc1-e27004c1ac67"
      1 "/bricks1/brick/.glusterfs/f4/8b/f48bee5d-6b58-41db-8d5f-ccad35c2ffc5"
     11 "/bricks1/brick/.glusterfs/health_check"
      1 "/bricks2/brick/.glusterfs/18/03/18030c9e-2430-4031-a655-910d04b7c5bb"
      1 "/bricks2/brick/.glusterfs/36/4b/364b32c0-a46e-4274-b365-d4b7017a2d37"
      1 "/bricks2/brick/.glusterfs/3d/d1/3dd112c7-12a7-4ae1-ac92-fa9ddbcd745d"
      1 "/bricks2/brick/.glusterfs/6e/d3/6ed3d53a-d1a5-4de4-a0d3-b6615836f635"
      1 "/bricks2/brick/.glusterfs/7f/39/7f39122b-0f37-43fe-84f7-828e126d11ec"
      1 "/bricks2/brick/.glusterfs/b8/9e/b89e0324-47a3-44c2-ab49-703f4bcbfbcc"
      1 "/bricks2/brick/.glusterfs/c4/74/c47429d7-7652-48bc-bf37-fe0652f8f2fc"
      1 "/bricks2/brick/.glusterfs/df/0b/df0bea9d-683d-4098-ad90-342203b8c436"
      1 "/bricks2/brick/.glusterfs/df/24/df24870a-fd2a-4e54-9fe3-969656e09614"
      1 "/bricks2/brick/.glusterfs/ee/37/ee371c2e-a212-439f-b674-b4bad3bc7029"
      1 "/bricks2/brick/.glusterfs/fc/35/fc3563dc-c462-45ef-afeb-37ea99fdd86a"
     13 "/bricks2/brick/.glusterfs/health_check"

In 3.2.0:
 188785 open calls, for the respective files:

[qe@rhsqe-repo gqas005]$ cut -f2 -d'(' open-latency.txt | cut -f1 -d',' | sort | uniq -c
   5030 "/bricks1/brick/.glusterfs/0e/e1/0ee1dfc8-df1e-4c24-a3fc-56062cc8d112"
   6626 "/bricks1/brick/.glusterfs/60/1c/601c1553-25d4-4bb3-aa9b-52eca1968a2e"
   9378 "/bricks1/brick/.glusterfs/69/8a/698a81a2-a32f-47ab-9934-f4e9f378fbd5"
   6827 "/bricks1/brick/.glusterfs/6d/ea/6dea381b-5446-42e9-94b4-cc6517450ad9"
   4987 "/bricks1/brick/.glusterfs/79/67/7967182f-23e5-480c-ade2-93d460ee7542"
   9688 "/bricks1/brick/.glusterfs/82/55/82555aa2-74f7-4f86-a89d-e556cd3f834a"
   6904 "/bricks1/brick/.glusterfs/85/a5/85a54dcd-0a22-41d9-ac1f-b03266ef343a"
   9415 "/bricks1/brick/.glusterfs/9a/fb/9afb8a5a-24aa-439e-8788-8c07175f1fcd"
   9529 "/bricks1/brick/.glusterfs/a0/b1/a0b18584-b602-4713-8a3a-10f1730493b6"
   6541 "/bricks1/brick/.glusterfs/a4/bf/a4bfd317-4ba4-4e8d-bef4-7d032a688746"
   5008 "/bricks1/brick/.glusterfs/d1/79/d179ba5b-ba0d-4403-a377-74b82cdffd91"
   5019 "/bricks1/brick/.glusterfs/ea/1a/ea1a76eb-7dac-4841-b960-cb0a4e317dd4"
   9452 "/bricks1/brick/.glusterfs/f1/21/f12127d8-9f0c-4ca0-a466-0c661c4ac3d9"
     15 "/bricks1/brick/.glusterfs/health_check"
   6643 "/bricks2/brick/.glusterfs/15/d5/15d5a555-4ae9-4fd6-a4b9-b446e8ae2423"
   6673 "/bricks2/brick/.glusterfs/2b/91/2b914713-f5ed-4af2-a218-538803605ca6"
   9733 "/bricks2/brick/.glusterfs/53/63/53632d26-c480-4e36-9108-084aa1926c19"
   9444 "/bricks2/brick/.glusterfs/55/ed/55edde38-2ad6-428d-8088-4afb1e7569fc"
   9675 "/bricks2/brick/.glusterfs/87/a0/87a08afc-1c79-49fb-a5d2-424530a840ce"
   9573 "/bricks2/brick/.glusterfs/8d/47/8d470f86-b059-42b5-9a76-48119d7fe766"
   9291 "/bricks2/brick/.glusterfs/9d/9d/9d9da0ca-0989-4bf3-8f67-aed9c4afa950"
   6871 "/bricks2/brick/.glusterfs/b6/df/b6dfb606-90e4-46c8-8fa3-2168ed844ec6"
   6975 "/bricks2/brick/.glusterfs/c8/ef/c8ef678d-0555-42cf-a453-3ca3bcfc633f"
   9682 "/bricks2/brick/.glusterfs/dc/cb/dccb9a5d-0bfa-4b42-8a1c-008ad386e287"
   9786 "/bricks2/brick/.glusterfs/e2/1c/e21cb5e1-e832-41bb-965f-a45bef67c72c"
     16 "/bricks2/brick/.glusterfs/health_check"

Total number of preadv calls are 853641 in 3.2.0

>>> 188785.0/853641 # (extra opens/pread)
0.2211526859651774

So I think ~22% of the READ calls are served through an extra anonymous fd, which means that to serve the read the brick has to open the file and then read it, which leads to delay.

Maybe we need a run on 3.2 with open-behind disabled, to test this theory and see if the performance matches.
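
(For the record, disabling it on this volume would just be:)

# gluster volume set butcher performance.open-behind off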

Comment 24 Ambarish 2017-03-01 13:54:24 UTC
Pranith,

With open-behind disabled, I am getting faster reads than 3.1.3, under the same workload and on the same systems :)


3.1.3 (benchmark) : 723391.85 kB/sec
3.2 defaults, with open-behind on : 592149 kB/sec
3.2, with open-behind off : 788382 kB/sec

Comment 27 Pranith Kumar K 2017-03-01 16:48:20 UTC
(In reply to Ambarish from comment #24)
> Pranith,
> 
> With open-behind disabled, I am getting faster reads than 3.1.3, under the
> same workload and on the same systems :)
> 
> 
> 3.1.3 (benchmark) : 723391.85 kB/sec
> 3.2 defaults, with open-behind on : 592149 kB/sec
> 3.2, with open-behind off : 788382 kB/sec

This confirms my theory.

If we do a "cat" of a file, it roughly sends OPEN, then READ, then CLOSE (ignoring lookups for now). A "cp -r" of a directory with a lot of files will likewise require OPEN, READ, CLOSE for every file. With open-behind on, OPENs don't go over the wire; instead the READ is sent on an anonymous fd, which leads to an OPEN+READ directly on the disk. So we save an extra network operation, i.e. the OPEN. This is very important for small-file workloads. But what Ambarish found in the test above is that when the same idea is used for large-file reads, each READ still goes over the wire, and if an anonymous fd used for a previous READ is not available on the server, it leads to an OPEN+READ on the disk. In this bz, around 22% of the READ fops hit this case where no anonymous fd was available by the time the READ fop came in, so the brick had to serve them by doing an OPEN+READ on the disk.

I think a proper fix for this issue is an enhancement in open-behind: do READs on a properly opened fd for big files, and keep skipping OPENs for small files, so that it works well for both cases. So we need a heuristic where, after witnessing X reads on an fd, open-behind switches to READs on a normal fd rather than the anonymous fd. We need to come up with a default for this 'X' based on the point where anon-fd reads become costlier than reads on an already opened fd.

I think for this release we should defer the fix and document that for large-file workloads it makes sense to disable open-behind, until this heuristic is implemented.

Ambarish, could you also add the data you mentioned showing that FUSE performance was also better?

Comment 28 Ambarish 2017-03-01 17:23:18 UTC
Reads were slightly better on EC over FUSE:

* 3.2, seq reads with open-behind on  - 2636985 kB/sec

* 3.2, seq reads with open-behind off - 2812877 kB/sec


The story on DR (2*2) was different, though:

* 3.2, seq reads with open-behind on  - 2637843.08 kB/sec

* 3.2, seq reads with open-behind off - 1671888.96 kB/sec

This just got confusing.

Comment 29 Pranith Kumar K 2017-03-01 18:04:12 UTC
(In reply to Ambarish from comment #28)
> Reads were slightly better on EC over FUSE:
> 
> * 3.2, seq reads with open-behind on  - 2636985 kB/sec
> 
> * 3.2, seq reads with open-behind off - 2812877 kB/sec
> 
> 
> The story on DR (2*2) was different, though:
> 
> * 3.2, seq reads with open-behind on  - 2637843.08 kB/sec
> 
> * 3.2, seq reads with open-behind off - 1671888.96 kB/sec
> 
> This just got confusing.

Strace and profile info will clear up the confusion :-). Please provide brick straces for the volume so we can debug further.
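
(A minimal sketch of that capture, using the command from comment #21 with -T so the per-syscall times show up; the output path is just an example:)

# strace -ff -T -p <pid-of-brick> -o /tmp/brick.strace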

Comment 38 Raghavendra G 2017-04-19 05:55:34 UTC
(In reply to Pranith Kumar K from comment #37)
> (In reply to Milind Changire from comment #36)
> > Can we use stat info (size) of a file to decide upon whether a file is large
> > or small i.e. if file can be read in one READ fop of say 128KB (or whatever
> > the current setting for default read buffer size is) then it is a small file
> > if not then it is a large file
> > 
> > would this heuristic hold good ?
> > 
> > I'm presuming that a LOOKUP fop would go through before an OPEN or READ is
> > attempted, and that the stat info would be returned as a result of a
> > successful LOOKUP. Please correct me otherwise.
> 
> That would work. Doing this based on number of READs on the file is another
> way. You guys figure out what is best to implement. I think the best option
> would be to compound open with the next fop so that we don't have any
> heuristics to play with.

That's a good suggestion. However, there is a FLUSH operation which needs to be wound to the backend if we are using non-anonymous fds. So, in the extreme case of just one read, we would still pay the penalty of one extra fop round trip (the FLUSH), though whether that is substantial enough is an open question. So we might still have to do some heuristic based on size.

Also, is the existing compound fop framework complete enough to do that (say, a compound OPEN and READ)?

regards,
Raghavendra

Comment 39 Pranith Kumar K 2017-05-17 05:51:51 UTC
(In reply to Raghavendra G from comment #38)
> (In reply to Pranith Kumar K from comment #37)
> > (In reply to Milind Changire from comment #36)
> > > Can we use stat info (size) of a file to decide upon whether a file is large
> > > or small i.e. if file can be read in one READ fop of say 128KB (or whatever
> > > the current setting for default read buffer size is) then it is a small file
> > > if not then it is a large file
> > > 
> > > would this heuristic hold good ?
> > > 
> > > I'm presuming that a LOOKUP fop would go through before an OPEN or READ is
> > > attempted, and that the stat info would be returned as a result of a
> > > successful LOOKUP. Please correct me otherwise.
> > 
> > That would work. Doing this based on number of READs on the file is another
> > way. You guys figure out what is best to implement. I think the best option
> > would be to compound open with the next fop so that we don't have any
> > heuristics to play with.
> 
> That's a good suggestion. However, there is a FLUSH operation which needs to
> be wound to the backend if we are using non-anonymous fds. So, in the extreme
> case of just one read, we would still pay the penalty of one extra fop round
> trip (the FLUSH), though whether that is substantial enough is an open
> question. So we might still have to do some heuristic based on size.

Okay. In that case let us go with the heuristic based on size for now.

> 
> Also, is the existing compound fop framework complete enough to do that
> (say, a compound OPEN and READ)?

I think we need more work to allow a compound FOP that includes opening an fd.

> 
> regards,
> Raghavendra

Comment 45 Milind Changire 2017-10-12 06:49:17 UTC
upstream patch: https://review.gluster.org/17377

Comment 62 Raghavendra G 2018-03-01 18:05:09 UTC
Ambarish/Milind,

It looks like the bug is no longer present in 3.4.0. Reasons are:

* On default configuration (with open-behind on)
  Avg read on 3.4.0 - approx 8500 KB/sec (comment #59)
  Avg read on 3.3.1 - approx 7950 KB/sec (comment #54)
  Avg read on 3.1.3 - approx 7400 KB/sec (comment #61)

  So, performance in 3.4.0 is better than in 3.3.1, which in turn is better than in 3.1.3. Also note that in all these tests, open-behind on performs better than open-behind off.

* Note that the initial RCA of the bug was that open-behind is causing the problem. IOW, on 3.2.0, open-behind when on gives lower performance compared with open-behind off. However, if we look at the numbers in 3.4.0:
  open-behind on - approx 8500 KB/sec (comment #59)
  open-behind off - approx 8100 KB/sec (comment #51)

  So, open-behind is actually adding a marginal performance benefit, and what we had observed in 3.2.0 is no longer true.

This makes me wonder whether these performance numbers were taken on EC/disperse. I have a feeling that these numbers correspond to distributed-replicate, where open-behind definitely had helped as given in earlier comments (only EC had issues).

@Milind, can you clarify whether these numbers were taken on a disperse/EC setup?

Comment 69 errata-xmlrpc 2018-09-04 06:32:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607