Bug 1419807
| Field | Value |
| --- | --- |
| Summary | [Perf]: 25% regression on sequential reads on EC over SMB3 |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | Ambarish <asoman> |
| Component | open-behind |
| Assignee | Milind Changire <mchangir> |
| Status | CLOSED ERRATA |
| QA Contact | Karan Sandha <ksandha> |
| Severity | high |
| Priority | unspecified |
| Version | rhgs-3.2 |
| CC | amukherj, aspandey, bturner, ksandha, mchangir, pgurusid, pkarampu, rcyriac, rgowdapp, rhinduja, rhs-bugs, sheggodu, smali, storage-qa-internal |
| Keywords | Regression, Reopened, ZStream |
| Target Release | RHGS 3.4.0 |
| Hardware | x86_64 |
| OS | Linux |
| Doc Type | If docs needed, set a value |
| Cloned As | 1455872 (view as bug list) |
| Last Closed | 2018-09-04 06:32:03 UTC |
| Type | Bug |
| Bug Depends On | 1455872 |
| Bug Blocks | 1503134 |
Description (Ambarish, 2017-02-07 06:00:03 UTC)
Adding the Regression keyword since this is a regression from the last released version of RHGS.

It has come to my attention that one of the servers, gqas010, has gone bad. I'll recheck this on another setup and share the results.

This has been reproduced on another setup (the one Ben and Karan were using):
3.1.3    : 1499580.48 kB/sec
3.8.4-14 : 1112863.16 kB/sec

Seeing extremely high latencies for READDIR (3500 us+) on all bricks on the 3.2 bits, compared to 300-600 us in 3.1.3.

Ambarish, I talked to Poornima and she suggested gathering some more information. We need to see the profile info after disabling md-cache and io-threads, so please provide these two.

Ashish/Poornima, I have the data with io-threads disabled. I see no impact with or without io-threads:
3.1.3                  : 1499580.48 kB/sec
3.2 with io-threads    : 1106730.49 kB/sec
3.2 without io-threads : 1111601.68 kB/sec
The regression is still there, even without io-threads enabled. ICYMI, the 3.1.3 benchmarking was also done with io-threads on, as we encouraged our EC customers on 3.1.3 to enable io-threads for better writes. Tests without md-cache are something I can fetch by EOD or tomorrow, once the Beaker downtime is over.

I could reproduce the perf regression even after switching off md-cache. So, in a nutshell:
3.1.3                        : 1499580.48 kB/sec
3.2 with io-threads          : 1106730.49 kB/sec
3.2 without io-threads       : 1111601.68 kB/sec
3.2 without mdcache tunables : 1146567 kB/sec

17:19:30 :) ⚡ grep -w READ 3.2_SMB3-read.txt | awk '{sum+=$8}END{print sum/2}'
6061757
17:19:31 :) ⚡ grep -w READ 3.1.3_EC_SMB3-read.txt | awk '{sum+=$8}END{print sum/2}'
5663296
%-increase:
>>> (6061757.0 - 5663296.0)/5663296
0.0703584979488976
So there is a 7% increase in READ calls.
17:24:21 :) ⚡ grep -w READ 3.2_SMB3-read.txt | awk '{sum+=($2*$8);calls+=$8}END{print sum/calls}'
278.358
17:20:23 :) ⚡ grep -w READ 3.1.3_EC_SMB3-read.txt | awk '{sum+=($2*$8);calls+=$8}END{print sum/calls}'
261.703
>>> (278.358-261.703)/261.703
0.06364084477441997
So there is a 6% increase in READ fop average latency.

We need to find two things here:
1) Considering this issue is observed only on SMB, can we have someone from Samba take a look and update us with any information about why the number of calls could have increased?
2) We need to know which exact fops are leading to the increase in latency, so on-disk latency should be helpful as well.
Will leave needinfo on Poornima for 1) above and Ambarish for 2) above.

For 1), is it possible to check with the 3.1.3 Samba packages and the latest Gluster, and see whether the 6% increase is because of the Samba update? Also, is it possible to get the number of reads from 3.2 Samba on a dist-rep volume and compare it against EC?

Ambarish, is this not the command you used for the strace capture?
strace -ff -T -p <pid-of-brick> -o <path-to-the-file-where-you-want-the-output-saved>
I don't see any time taken for syscalls in the strace files. I see entries like:
fgetxattr(31, "trusted.gfid", "\x9a\xfb\x8aZ$\xaaC\x9e\x87\x88\x8c\x07\x17_\x1f\xcd", 16) = 16
There is no <time-taken-for-syscall> at the end in '<>'. Will this require a re-run? (A capture sketch is given below, after the profile excerpt.)

Ambarish, I found an 11.6% increase in the number of preads, so that can cause the regression. Let me look at why that is the case while you provide the latency numbers.
Pranith

In 3.1.3 each brick has the file opened:
0.00 489.25 us 406.00 us 610.00 us 12 OPEN
in 3.2.0 the file is not opened.
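A minimal sketch of the strace capture and latency summary asked for above, assuming the brick PID is known; the `/tmp/brick.strace` prefix and the awk aggregation are illustrative additions, not the exact commands used in this bug:

```sh
# Attach to the brick process; -T appends the time spent in each syscall
# as <seconds> at the end of the line, -ff writes one output file per thread.
strace -ff -T -p <pid-of-brick> -o /tmp/brick.strace
# ...run the sequential-read workload, then stop strace with Ctrl-C...

# Rough per-syscall call count and average latency from the -T timings
# (unfinished/resumed pairs are skipped for simplicity).
awk -F'[<>]' '!/resumed|unfinished/ && NF >= 3 {
    split($1, call, "("); name = call[1];
    lat[name] += $(NF-1); cnt[name]++
}
END {
    for (s in cnt)
        printf "%-12s %9d calls  avg %.6f s\n", s, cnt[s], lat[s] / cnt[s]
}' /tmp/brick.strace.*
```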
Here is the difference between open calls in 3.1.3 vs 3.2.0
in 3.1.3:
7245 open calls and respective files
[qe@rhsqe-repo gqas005]$ cut -f2 -d'(' open-latency.txt | cut -f1 -d',' | sort | uniq -c
1 "/bricks1/brick/.glusterfs/10/59/105999e1-0085-4066-886a-5c5b4cb1d301"
1 "/bricks1/brick/.glusterfs/1f/1b/1f1bba3f-d316-476c-89af-4660a608986c"
1 "/bricks1/brick/.glusterfs/22/16/2216f451-6cfb-4aff-80f8-05224e51266b"
7194 "/bricks1/brick/.glusterfs/28/ae/28ae65bb-5df5-4371-bcc9-41b0db4b89bb"
1 "/bricks1/brick/.glusterfs/29/2c/292ca564-b669-4b19-aa0d-7162ade9db80"
1 "/bricks1/brick/.glusterfs/2b/75/2b75dd52-7f63-42fc-bb80-a33cfd965b53"
1 "/bricks1/brick/.glusterfs/50/f4/50f4072e-1f4a-4d3e-9f53-1535c293278f"
1 "/bricks1/brick/.glusterfs/52/4c/524c472f-8f95-403a-a6cb-5815666ee1f6"
1 "/bricks1/brick/.glusterfs/54/4c/544cd41f-962d-4da5-96ee-daa4490f5fda"
1 "/bricks1/brick/.glusterfs/76/5b/765b3ca6-caca-4654-8813-336cd0761fee"
1 "/bricks1/brick/.glusterfs/c5/11/c5115144-f7d6-4689-a62a-fe4e6ed69cad"
1 "/bricks1/brick/.glusterfs/f2/e2/f2e2c259-0194-4846-afc1-e27004c1ac67"
1 "/bricks1/brick/.glusterfs/f4/8b/f48bee5d-6b58-41db-8d5f-ccad35c2ffc5"
11 "/bricks1/brick/.glusterfs/health_check"
1 "/bricks2/brick/.glusterfs/18/03/18030c9e-2430-4031-a655-910d04b7c5bb"
1 "/bricks2/brick/.glusterfs/36/4b/364b32c0-a46e-4274-b365-d4b7017a2d37"
1 "/bricks2/brick/.glusterfs/3d/d1/3dd112c7-12a7-4ae1-ac92-fa9ddbcd745d"
1 "/bricks2/brick/.glusterfs/6e/d3/6ed3d53a-d1a5-4de4-a0d3-b6615836f635"
1 "/bricks2/brick/.glusterfs/7f/39/7f39122b-0f37-43fe-84f7-828e126d11ec"
1 "/bricks2/brick/.glusterfs/b8/9e/b89e0324-47a3-44c2-ab49-703f4bcbfbcc"
1 "/bricks2/brick/.glusterfs/c4/74/c47429d7-7652-48bc-bf37-fe0652f8f2fc"
1 "/bricks2/brick/.glusterfs/df/0b/df0bea9d-683d-4098-ad90-342203b8c436"
1 "/bricks2/brick/.glusterfs/df/24/df24870a-fd2a-4e54-9fe3-969656e09614"
1 "/bricks2/brick/.glusterfs/ee/37/ee371c2e-a212-439f-b674-b4bad3bc7029"
1 "/bricks2/brick/.glusterfs/fc/35/fc3563dc-c462-45ef-afeb-37ea99fdd86a"
13 "/bricks2/brick/.glusterfs/health_check"
in 3.2.0:
188785 open calls and respective files
[qe@rhsqe-repo gqas005]$ cut -f2 -d'(' open-latency.txt | cut -f1 -d',' | sort | uniq -c
5030 "/bricks1/brick/.glusterfs/0e/e1/0ee1dfc8-df1e-4c24-a3fc-56062cc8d112"
6626 "/bricks1/brick/.glusterfs/60/1c/601c1553-25d4-4bb3-aa9b-52eca1968a2e"
9378 "/bricks1/brick/.glusterfs/69/8a/698a81a2-a32f-47ab-9934-f4e9f378fbd5"
6827 "/bricks1/brick/.glusterfs/6d/ea/6dea381b-5446-42e9-94b4-cc6517450ad9"
4987 "/bricks1/brick/.glusterfs/79/67/7967182f-23e5-480c-ade2-93d460ee7542"
9688 "/bricks1/brick/.glusterfs/82/55/82555aa2-74f7-4f86-a89d-e556cd3f834a"
6904 "/bricks1/brick/.glusterfs/85/a5/85a54dcd-0a22-41d9-ac1f-b03266ef343a"
9415 "/bricks1/brick/.glusterfs/9a/fb/9afb8a5a-24aa-439e-8788-8c07175f1fcd"
9529 "/bricks1/brick/.glusterfs/a0/b1/a0b18584-b602-4713-8a3a-10f1730493b6"
6541 "/bricks1/brick/.glusterfs/a4/bf/a4bfd317-4ba4-4e8d-bef4-7d032a688746"
5008 "/bricks1/brick/.glusterfs/d1/79/d179ba5b-ba0d-4403-a377-74b82cdffd91"
5019 "/bricks1/brick/.glusterfs/ea/1a/ea1a76eb-7dac-4841-b960-cb0a4e317dd4"
9452 "/bricks1/brick/.glusterfs/f1/21/f12127d8-9f0c-4ca0-a466-0c661c4ac3d9"
15 "/bricks1/brick/.glusterfs/health_check"
6643 "/bricks2/brick/.glusterfs/15/d5/15d5a555-4ae9-4fd6-a4b9-b446e8ae2423"
6673 "/bricks2/brick/.glusterfs/2b/91/2b914713-f5ed-4af2-a218-538803605ca6"
9733 "/bricks2/brick/.glusterfs/53/63/53632d26-c480-4e36-9108-084aa1926c19"
9444 "/bricks2/brick/.glusterfs/55/ed/55edde38-2ad6-428d-8088-4afb1e7569fc"
9675 "/bricks2/brick/.glusterfs/87/a0/87a08afc-1c79-49fb-a5d2-424530a840ce"
9573 "/bricks2/brick/.glusterfs/8d/47/8d470f86-b059-42b5-9a76-48119d7fe766"
9291 "/bricks2/brick/.glusterfs/9d/9d/9d9da0ca-0989-4bf3-8f67-aed9c4afa950"
6871 "/bricks2/brick/.glusterfs/b6/df/b6dfb606-90e4-46c8-8fa3-2168ed844ec6"
6975 "/bricks2/brick/.glusterfs/c8/ef/c8ef678d-0555-42cf-a453-3ca3bcfc633f"
9682 "/bricks2/brick/.glusterfs/dc/cb/dccb9a5d-0bfa-4b42-8a1c-008ad386e287"
9786 "/bricks2/brick/.glusterfs/e2/1c/e21cb5e1-e832-41bb-965f-a45bef67c72c"
16 "/bricks2/brick/.glusterfs/health_check"
The total number of preadv calls in 3.2.0 is 853641.
>>> 188785.0/853641 # (extra opens/pread)
0.2211526859651774
I think 22% of the read calls are served through an extra anon-fd, which means that to serve the read the brick has to open the file and then read it, and that leads to delay.
Maybe we need this run on 3.2 with open-behind disabled to test the theory and see if performance matches.
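A minimal sketch of that verification run, assuming a volume named `testvol` (hypothetical) and using the volume profile to capture the fop counts compared above:

```sh
# Baseline: defaults (open-behind is on by default)
gluster volume profile testvol start
# ...run the sequential-read workload over the SMB3 mount...
gluster volume profile testvol info > profile-open-behind-on.txt
gluster volume profile testvol stop

# Repeat with open-behind disabled to test the anon-fd theory
gluster volume set testvol performance.open-behind off
gluster volume profile testvol start
# ...rerun the same workload...
gluster volume profile testvol info > profile-open-behind-off.txt
gluster volume profile testvol stop

# Restore the default afterwards
gluster volume set testvol performance.open-behind on
```

The READ/OPEN counts and latencies in the two profile dumps can then be compared with the same grep/awk one-liners used earlier in this bug.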
Pranith,

With open-behind disabled, I am getting faster reads than 3.1.3, under the same workload and on the same systems :)

Benchmark:
3.1.3                              : 723391.85 kB/sec
3.2 defaults (with open-behind on) : 592149 kB/sec
3.2 with open-behind off           : 788382 kB/sec

(In reply to Ambarish from comment #24)
> With open behind disabled, I am getting faster reads than 3.1.3, under the
> same workload and the same systems :)
>
> Benchmark : 3.1.3 : 723391.85 kB/sec
> 3.2 Defaults : with open behind on : 592149 kB/sec
> 3.2 : with open behind off : 788382 kB/sec

This confirms my theory. If we do a "cat" of a file, it actually sends OPEN, then READ, then CLOSE, roughly (ignoring lookups for now). A cp -r of a directory with a lot of files will require OPEN, READ, CLOSE for each file. With open-behind on, OPENs don't happen over the wire; instead the READ is sent on an anonymous fd, which leads to OPEN+READ directly on the disk. So we save an extra network operation, i.e. the OPEN. This is very important for small-file workloads. But what Ambarish found in the test above is that when the same idea is used for large-file reading, each read still goes over the wire, and if an anonymous fd used for the previous READ is not available on the server, it leads to OPEN+READ on the disk. In this bz, around 22% of READ fops hit this case where no other anonymous fd was available by the time the READ fop came, so it had to be served by doing OPEN+READ on the disk.

I think a proper fix for this issue is an enhancement in open-behind to do READs after OPEN for big files and not do OPENs for small files, so that it works well for both cases. So we need to implement a heuristic where, after witnessing X reads on an fd, it switches to READs on a normal fd rather than the anonymous fd. We need to come up with a default for this 'X' based on the boundary where doing anon-fd reads is costlier than reads on an already opened fd. I think for this release we should defer it by documenting that for large-file workloads it makes sense to disable open-behind until this heuristic is implemented.

Ambarish, could you also add the data you have where you mentioned that FUSE performance was also better?

Reads were slightly better on EC over FUSE:

* 3.2, Seq Reads with open behind on  - 2636985 kB/sec
* 3.2, Seq Reads with open behind off - 2812877 kB/sec

The story on DR (2x2) was different though:

* 3.2, Seq Reads with open behind on  - 2637843.08 kB/sec
* 3.2, Seq Reads with open behind off - 1671888.96 kB/sec

This just got confusing.

(In reply to Ambarish from comment #28)
> Reads were slightly better on EC over FUSE :
> * 3.2,Seq Reads with open behind on - 2636985 kB/sec
> * 3.2,Seq Reads with open behind off - 2812877 kB/sec
> The story on DR (2*2) was different though :
> * 3.2,Seq Reads with open behind on - 2637843.08 kB/sec
> * 3.2,Seq Reads with open behind off - 1671888.96 kB/sec
> This just got confusing.

Strace and profile info will clear up the confusion :-). Please provide brick straces for the volume so we can debug further.

(In reply to Pranith Kumar K from comment #37)
> (In reply to Milind Changire from comment #36)
> > Can we use the stat info (size) of a file to decide whether a file is large
> > or small, i.e. if the file can be read in one READ fop of say 128KB (or
> > whatever the current setting for the default read buffer size is) then it is
> > a small file, otherwise it is a large file?
> >
> > Would this heuristic hold good?
> > I'm presuming that a LOOKUP fop would go through before an OPEN or READ is
> > attempted, and that the stat info would be returned as a result of a
> > successful LOOKUP. Please correct me otherwise.
>
> That would work. Doing this based on the number of READs on the file is
> another way. You guys figure out what is best to implement. I think the best
> option would be to compound the open with the next fop so that we don't have
> any heuristics to play with.

That's a good suggestion. However, there is a flush operation which needs to be wound to the backend if we are using non-anonymous fds. So, in the extreme case of just one read, we would still suffer the penalty of one extra fop (flush) round trip (though whether that is substantial enough is an open question). So we might still have to do some heuristic based on size. Also, is the existing compound fop framework complete enough to do that (say, compound open and read)?

regards,
Raghavendra

(In reply to Raghavendra G from comment #38)
> That's a good suggestion. However, there is a flush operation which needs to
> be wound to the backend if we are using non-anonymous fds. So, in the extreme
> case of just one read, we would still suffer the penalty of one extra fop
> (flush) round trip (though whether that is substantial enough is an open
> question). So we might still have to do some heuristic based on size.

Okay. In that case let us go with the heuristic based on size for now.

> Also, is the existing compound fop framework complete enough to do that
> (say, compound open and read)?

I think we need more work to allow a compound FOP which includes opening an fd.

upstream patch: https://review.gluster.org/17377

Ambarish/Milind,

It looks like the bug is no longer present in 3.4.0. Reasons are:

* On the default configuration (with open-behind on):
  Avg read on 3.4.0 - approx 8500 KB/sec (comment #59)
  Avg read on 3.3.1 - approx 7950 KB/sec (comment #54)
  Avg read on 3.1.3 - approx 7400 KB/sec (comment #61)
  So performance in 3.4.0 is better than 3.3.1, which is better than 3.1.3. Also note that in all these tests open-behind on has better performance than open-behind off.

* Note that the initial RCA of the bug was that open-behind is causing problems. IOW, open-behind on 3.2.0, when on, gives less performance compared with open-behind off. However, if we look at the numbers in 3.4.0:
  open-behind on  - approx 8500 KB/sec (comment #59)
  open-behind off - approx 8100 KB/sec (comment #51)
  So open-behind is actually adding marginal performance benefits, and what we had observed in 3.2.0 is no longer true.

This makes me wonder whether these performance numbers were taken on EC/disperse.
I have a feeling that these numbers correspond to distributed-replicate, where open-behind definitely had helped as noted in earlier comments (only EC had issues). @Milind, can you clarify whether these numbers were taken on a disperse/EC setup?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607