Description of problem: To get the baseline Random write IOPS of SSD gluster volume, that is added as a storage domain in RHHI so as to provide block storage to VM provisioned, is mounted on of the hosts and a fio random write run is executed over it. Later, the same volume was added as a storage domain and added as a block device to the Virtual Machine(VM) running on top of RHHI and the same fio random write test is executed within the VM. Following are the results: 1. Baseline Random Write IOPS: 6456 2. Virtual Machine Random Write IOPS: 602 Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
Ravi, The results are indicating that RHHI-libgfapi is performing well, whereas RHHI-fuse is not; also the gluster volume profiles are showing that in the RHHI-fuse case there is a high proportion of FINODELK and FXATTROP calls (not seen in RHHI-libgfapi). Is this indicating that the recent eager-lock enhancements are less effective over fuse, and more work is needed there? (Both RHHI-libgfapi and the libgfapi-based gluster-block have shown good performance with the eager-lock fixes). Requesting your comments on the reported results and collected data, and what might be the way forward.
Hi Manoj, As far as eager lock modification https://code.engineering.redhat.com/gerrit/#/c/140577/ is concerned, I'm not able to find any potential issues. The changes are independent of which xlators sits on the top. i.e. if the order and sequence of FOPS reaching AFR are same irrespective of whether it is gfapi or fuse, the no. of FINODELK + WRITE + FXATTROP generated by AFR will be the same. The fact that it isn't seems to indicate there is a difference in I/O pattern reaching AFR using gfapi vs fuse. We would need to figure out why that is.
Nikhil, just wanted to check if 1) you got a chance to run the tests with eager-lock disabled and compare the results of fio on gfapi vs fuse and you are still seeing the performance drop. 2)Whether having a fully preallocated vm-disk (comment #9) makes any difference to the results.
(In reply to Ravishankar N from comment #13) > Nikhil, just wanted to check if > > 1) you got a chance to run the tests with eager-lock disabled and compare > the results of fio on gfapi vs fuse and you are still seeing the > performance drop. Yes, ran the test with eager-lock off on RHHI-FUSE and RHHI-Libgfapi. The difference is still persisting. RHHI-libgfapi - 1216 RHHI-Fuse - 392 > 2)Whether having a fully preallocated vm-disk (comment #9) makes any > difference to the results. Since, before executing the Random write test, a sequential write test is executed first, which allocates the shards and on top of the same file, Fio random IO is executed. Hence, this is already a preallocated test.
Just some points that I observed during discussion with Sahina and Nikhil : 1. Comparing RHHI-libgfapi vs RHHI-Fuse: Fuse is expected to be slower because of the context switches between kernel and fuse-bridge. 2. Running fio inside a VM backed by gluster is not identical to running fio inside a gluster fuse mount. Fio inside VM can operate on multiple files (that it creates) using multiple fds, whereas the fio job running on a VM will always come on a single FD on gluster (irrespective of gfapi or fuse based access)
(In reply to Ravishankar N from comment #17) > Fio inside VM can operate on multiple files > (that it creates) using multiple fds, typo. I meant " Fio inside fuse mount can operate on multiple files (that it creates) using multiple fds" Too many context switches :(
(In reply to Ravishankar N from comment #21) > Manoj, What is the expectation from this BZ? > > It is to see why the profile info is different for a VM on RHHI using fuse > vs libgfapi (and not focus on *why* performance numbers are better on > libgfapi?) This is a performance bug on RHHI on a random write workload. RHHI, in the supported configuration, is providing ~600 IOPS on a random write test, with 4k block size. Why so low? What is holding it back? * The underlying devices are NVMe, probably capable of 200+K IOPS. So that's not the bottleneck. * Running the test on the volume mounting it directly on the host gives about 3K IOPS; with some tuning we are able to bump that up to 6+K IOPS. [Nikhil has posted data for single-file (to simulate the RHHI case) and multiple files; in each case gluster is able to hit ~6K IOPS]. So, the 600K IOPS that RHHI is giving is not because of inherent gluster limits. * Running the test with RHHI-libgfapi path, the same test run on the VM again gives approx. 3K IOPS, and with tuning it gives 6K IOPS. So it is not that running in the VM is the cause of the low IOPS. So what is holding back RHHI performance on this workload in the supported RHHI-fuse configuration? The task here is to find the root cause and a fix. The gluster volume profiles posted in comment #3 and others, are showing that the RHHI-fuse case is showing much higher proportion of FINODELK and FXATTROP calls. According to my conversation with you, that suggests that the eager locking is not effective. That's the biggest clue that we have so far, IMO. But feel free to propose hypotheses for the reason for poor performance, and validate them. Setting a needinfo on the assignee to take this forward.
(In reply to Manoj Pillai from comment #22) > multiple files; in each case gluster is able to hit ~6K IOPS]. So, the 600K > IOPS that RHHI is giving is not because of inherent gluster limits. That should read 600 IOPS, obviously. Otherwise, we would have a bug of a very different nature :-).
(In reply to Manoj Pillai from comment #22) > > The gluster volume profiles posted in comment #3 and others, are showing > that the RHHI-fuse case is showing much higher proportion of FINODELK and > FXATTROP calls. According to my conversation with you, that suggests that > the eager locking is not effective. That's the biggest clue that we have so > far, IMO. But feel free to propose hypotheses for the reason for poor > performance, and validate them. > I will explore if it is possible to load 2 io-stats xlators manually before and after AFR in the client volfile, and run the VM on gfapi (6000 IOPS) vs VM on fuse (600 IOPS) fio test and capture the profile info for each run. I think it is doable for fuse mount, but let me check for gfapi and confirm. If yes, I will sit with Nikhil and do this on the same setup.
I ran the same random write fio workload on 2 3-way replicated volumes - one with shard and another without it to keep the test simple. I did this mostly to eliminate the possibility of any bugs in shard that could be causing it to send FXATTROPS even with already created files. I also had io-stats loaded at multiple points in the client stack of the replicated-sharded volume. (This was done entirely on FUSE mount - no gfapi). Eager-lock was enabled on both volumes. Results/Observations: ==================== 1. With the unsharded volume: From brick profile: 0.00 30.28 us 18.25 us 54.72 us 8 FINODELK 0.00 82.79 us 46.91 us 118.46 us 4 OPEN 0.00 307.96 us 222.14 us 418.21 us 4 FSYNC 0.00 214.29 us 64.62 us 599.05 us 12 FXATTROP 100.00 168.84 us 73.69 us 434220.60 us 1071435 WRITE This makes perfect sense. The fio job operates on 4 files. So 4 * 2 (lock + unlock) = 8 finodelks. And there are 4 fsyncs - perhaps one per file. And I see that afr's fsync is also converted into a data transaction, so it would have piggybacked on the existing eager-locks on the files, leading to no extra finodelks. All good so far. 2. With sharded volume: Shard's parent received the following fops: WRITE 721941 875.17 us 237.37 us 3574429.83 us FLUSH 4 183756.46 us 3950.93 us 700065.74 us FSYNC 4 2639002.96 us 2502096.61 us 3022192.69 us LOOKUP 7 4612.40 us 385.51 us 24819.05 us Not very different from (1) above. Layers above shard see only 4 files. So 4 fsyncs, 4 flushes etc. Shard's child received the following fops: WRITE 721941 858.33 us 226.74 us 3574409.67 us FLUSH 4 183750.80 us 3946.39 us 700060.42 us FSYNC 1024 1761200.73 us 248645.32 us 3019774.01 us LOOKUP 1049 814.11 us 305.65 us 24783.29 us The numbers still make sense. 16GB*4 = 64GB. And with shard-block-size of 64MB, there are 1024 shards. So 1024 fsyncs, approximately 1024 lookups as well (wi. Also note that there are no fxattrops originating from shard itself. So at least there are no bugs in shard's write fop. So far so good. Each child of replicate-0 received the following fops: WRITE 721941 544.15 us 152.05 us 3260626.44 us FSYNC 1024 1222615.59 us 247642.73 us 1957657.64 us LOOKUP 1053 560.92 us 198.14 us 10163.24 us FINODELK 11538 153684.12 us 65.11 us 3260720.01 us FXATTROP 11538 698127.39 us 177.48 us 3469537.85 us Since there are 1024 shards, I'd imagine that AFR would be taking eager-locks 1024 * 2 (lock + unlock) = 2048 times. But why is afr then sending 11538 (>> 2048) inodelks? *Even* if I assume fsync uses its own eager-lock, that would still account for 2048 * 2 = 4096 inodelks. What are the rest of the finodelks and fxattrops being wound for? Couple of reasons I can think of: 1. I don't know entirely how the new eager-lock implementation works. Does it still release the lock after 1 sec of inactivity on the fd like it used to in the older scheme? ... Ravi would you know? ^^ ... and the degree of randomness of the fio write was possibly such that a given shard saw no writes for > 1 sec causing the eager-lock to be released and re-acquired multiple times over the course of the run? 2. Another question that crops up in my mind - does eager-lock optimize the locking well for writes on anonymous fds too, given that shard uses anon fds to perform its IO? For 2, let me write a quick and dirty patch in shard to make it OPEN the shards explicitly before performing writes and check whether that makes a difference. -Krutika
(In reply to Krutika Dhananjay from comment #26) > Couple of reasons I can think of: > > 1. I don't know entirely how the new eager-lock implementation works. Does > it still release the lock after 1 sec of inactivity on the fd like it used > to in the older scheme? ... > > Ravi would you know? ^^ The new eager lock changes as I understand it (not going to lie, I'm not yet fully familiar with the no. of fop queues it maintains and how it moves fops from one queue to another) is no longer fd based but is tied to the inode. Fops on the same inode are put into different queues based on various conditions and processed.During post-op if there is only one fd, there is a wait for post_op_delay_secs (1 sec by default) before doing on-wire post-op and unlock. So yes, the old behaviour does seem to be in place. > > ... and the degree of randomness of the fio write was possibly such that a > given shard saw no writes for > 1 sec causing the eager-lock to be released > and re-acquired multiple times over the course of the run? > > > 2. Another question that crops up in my mind - does eager-lock optimize the > locking well for writes on anonymous fds too, given that shard uses anon fds > to perform its IO? > AFR checks for GLUSTERFS_ACTIVE_FD_COUNT in every write cbk and disables eager lock if multiple fds are opened. If anon-fds contribute towards increasing this count, then I'm guessing eager lock benefits would be lost. But this is something that has been there even before the new eager lock changes.
One more thing that could be contributing to the wide gap in performance between fuse and libgfapi is the presence of FUSE event-history, which as the name suggests is FUSE-specific and causes mutex lock contention between FUSE AND epoll threads in both req and rsp path of fops over a global circular buf. This was fixed in RHGS-3.4.0, but is not part of the RHGS-3.3.1 async branch. Here's some numbers with and without event-history from a run I'd done long time back - https://bugzilla.redhat.com/show_bug.cgi?id=1501013#c0 (This is also reflected in the cpu usage of the fuse reader thread where with event-history disabledit hits ~99% util). -Krutika
(In reply to Ravishankar N from comment #21) > Manoj, What is the expectation from this BZ? > (In reply to Manoj Pillai from comment #22) <...> > RHHI, in the supported configuration, is providing ~600 IOPS on a random > write test, with 4k block size. Why so low? What is holding it back? > <...> > * Running the test with RHHI-libgfapi path, the same test run on the VM > again gives approx. 3K IOPS, and with tuning it gives 6K IOPS. So it is not > that running in the VM is the cause of the low IOPS. > > So what is holding back RHHI performance on this workload in the supported > RHHI-fuse configuration? The task here is to find the root cause and a fix. > Expressing the expectation as numbers, IMO the RHHI-fuse IOPS results should be close to the RHHI-libgfapi results for this test i.e. should be able to hit close to 6K IOPS with tuning. If not, we should be able to give an explanation for why that is not happening. Note that the test itself is not a latency-sensitive test -- it has a large no. of concurrent requests outstanding (4 fio jobs with iodepth=16 each) and we are focusing on IOPS. Even though it is expected that the fuse path will have a somewhat higher latency, with so many concurrent requests in flight, I'd expect it to be able to match the libgfapi path on IOPS, if latency is the only difference in the two paths.
What does the fio option "end_fsync" do? man fio(1) says this - "Sync file contents when job exits." But does it have any bearing on the IOPs fio would report? Or can all FSYNC (gluster) transactions be safely ignored for the purpose of this debugging exercise?
On a related note, I was going through eager-lock implementation and one thing that caught my eye was this: https://github.com/gluster/glusterfs/blob/d6d729b0609957c0382749c30da507dda77561b7/xlators/cluster/afr/src/afr-transaction.c#L2023 IIUC, delayed post-op is ONLY applicable to GF_FOP_WRITE and GF_FOP_FXATTROP. Given that afr's FSYNC is also a transaction now, can we NOT change this piece of code to also make FSYNC delay its post-op and unlock, like this: <snippet> diff --git a/xlators/cluster/afr/src/afr-transaction.c b/xlators/cluster/afr/src/afr-transaction.c index 9c587db05..6b0bfcb6c 100644 --- a/xlators/cluster/afr/src/afr-transaction.c +++ b/xlators/cluster/afr/src/afr-transaction.c @@ -2020,7 +2020,8 @@ afr_is_delayed_changelog_post_op_needed (call_frame_t *frame, xlator_t *this, } if ((local->op != GF_FOP_WRITE) && - (local->op != GF_FOP_FXATTROP)) { + (local->op != GF_FOP_FXATTROP) && + (local->op != GF_FOP_FSYNC)) { /*Only allow writes but shard does [f]xattrops on writes, so * they are fine too*/ goto out; </snippet> and that should hopefully give us better performance? This may not yield much benefit if the I/O pattern is such that FSYNC is sent after the last WRITE (like the fio job with end_fsync flag that Nikhil's using in his test). But, if there is a possibility that there are lot more fsyncs that are sent between writes, then this should help. For example, in Sahina's tests, we saw this in the vol profile: 7.57 30.93 us 9.00 us 627.00 us 27442 FINODELK 20.14 172.02 us 28.00 us 772.00 us 13126 FSYNC 26.41 110.26 us 54.00 us 2185.00 us 26855 FXATTROP 45.11 151.55 us 78.00 us 1189.00 us 33373 WRITE 33K writes, spread across ~960 shards and lot of fsyncs. Ravi, could you confirm if this piece of code can be optimized and doesn't lead to any kind of data loss (I think not because anyway fsync would have done a pre-op, so no information is lost in the face of failure, but I'll let you confirm)? If yes, I have a patch ready that I'll send out for review. -Krutika
(In reply to Krutika Dhananjay from comment #41) I think it is better not to optimise fysncs using eager locking because fsync gives guarantees to the application regarding previous writes. So if an fsync fails on a few children of AFR, it is better to capture that information on a per FOP basis by doing 'dirty' (pre-op) and 'undirty + pending xattr in case of failure' (post-op) for every fsync call. Pranith, 1. Does that make sense? 2. Is there any reason that you added GF_FOP_FXATTROP, but not GF_FOP_FTRUNCATE, GF_FOP_FALLOCATE. GF_FOP_ZEROFILL etc to the afr_is_delayed_changelog_post_op_needed() function?
(In reply to Ravishankar N from comment #42) > (In reply to Krutika Dhananjay from comment #41) > > I think it is better not to optimise fysncs using eager locking because > fsync gives guarantees to the application regarding previous writes. So if > an fsync fails on a few children of AFR, it is better to capture that > information on a per FOP basis by doing 'dirty' (pre-op) and 'undirty + > pending xattr in case of failure' (post-op) for every fsync call. When a failure happens post-op-delay anyway doesn't happen right? Otherwise it is a bug? Since fsync happens with a full lock, no other operations should be in progress, so it can use the eager-lock and do delayed-post-op like the other fops. I don't foresee any issues at the moment at least. > > Pranith, > 1. Does that make sense? > 2. Is there any reason that you added GF_FOP_FXATTROP, but not > GF_FOP_FTRUNCATE, GF_FOP_FALLOCATE. GF_FOP_ZEROFILL etc to the > afr_is_delayed_changelog_post_op_needed() function? I wanted to keep the implementation close to the earlier implementation except for shard workloads which does fxattrops. We could add the others too but it would increase test scope at the time. We can add them for future releases and get enough test cycles to confirm that the solution holds true.
(In reply to Ravishankar N from comment #17) > Just some points that I observed during discussion with Sahina and Nikhil : > > 1. Comparing RHHI-libgfapi vs RHHI-Fuse: Fuse is expected to be slower > because of the context switches between kernel and fuse-bridge. > > 2. Running fio inside a VM backed by gluster is not identical to running > fio inside a gluster fuse mount. Fio inside VM can operate on multiple files > (that it creates) using multiple fds, whereas the fio job running on a VM > will always come on a single FD on gluster (irrespective of gfapi or fuse > based access) First of all, a big fat thanks to Nikhil for patiently running all the tests requested by devs. :) I was expecting much better performance from 128GB block-size because there would only be 2 shards - one of these being the base shard opened by qemu and the other created by shard under .shard (this is what we saw on Nikhil's setup), so all of the IO and the eager-locking would be confined within these two files. But the profile tells a different story: 10.01 35.64 us 14.00 us 1525.00 us 207417 FINODELK 14.79 105.38 us 61.00 us 4848.00 us 103679 WRITE 32.24 114.85 us 51.00 us 4945.00 us 207388 FXATTROP 42.90 305.71 us 97.00 us 5484.00 us 103670 FSYNC FINODELK:FXATTROP:WRITE = 2:2:1. Nikhil ran the same fio job from within the vm in both gfapi and fuse. What we saw was that in the fuse case, qemu had had 2 open fds on the same vdisk file (we'd previously assumed that there would be only 1 open fd). This would have caused the benefits of eager-lock to be lost with FUSE: <from-qemu-kvm's-proc/pid/fd> ... ... lrwx------. 1 qemu qemu 64 Sep 5 07:45 42 -> /rhev/data-center/mnt/glusterSD/172.18.70.138:_ssd/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/628cbfee-b259-4393-ae1f-c9513da313bd/3d622030-bf26-4a44-ba6f-d9953fcdbebc lrwx------. 1 qemu qemu 64 Sep 5 07:45 43 -> /rhev/data-center/mnt/glusterSD/172.18.70.138:_ssd/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/628cbfee-b259-4393-ae1f-c9513da313bd/3d622030-bf26-4a44-ba6f-d9953fcdbebc ... ... </from-qemu-kvm's-proc/pid/fd> From brick statedump: conn.0.bound_xl./gluster_bricks/ssd/ssd.active.1] gfid=4eb75879-539d-4508-a92c-73a76b7a9035 nlookup=57 fd-count=2 active-fd-count=2 <------------ active fd count is 2 ref=3 ia_type=1 [xlator.features.locks.ssd-locks.inode] path=/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/3b77754e-cccc-4ee7-a7d4-2124b39c926b/e052530e-fceb-4e74-b8ac-e34360ca09d3 Furthermore, the same test was run with gfapi as well (128GB shards). And the brick statedump repored this: [conn.0.bound_xl./gluster_bricks/ssd/ssd.active.1] gfid=4eb75879-539d-4508-a92c-73a76b7a9035 nlookup=35 fd-count=1 active-fd-count=1 <----------------- active fd count is 1 ref=1 ia_type=1 [xlator.features.locks.ssd-locks.inode] path=/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/3b77754e-cccc-4ee7-a7d4-2124b39c926b/e052530e-fceb-4e74-b8ac-e34360ca09d3 And here's what the volume-profile reported: 0.00 35.55 us 17.00 us 79.00 us 56 FINODELK 0.00 221.94 us 193.00 us 360.00 us 17 FSYNC 0.00 45.19 us 24.00 us 103.00 us 105 STATFS 0.01 138.21 us 77.00 us 401.00 us 325 LOOKUP 0.04 1172.75 us 60.00 us 4932.00 us 228 FXATTROP 99.95 353.96 us 50.00 us 14868.00 us 1901491 WRITE The FINODELK:WRITE ratio looks much better. This is also one of the reasons (probably the biggest thing coming in the way of eager-lock perf with fuse identified so far) for the big gap between fuse and gfapi. Going back to the case where shard-block-size is 64MB, the same problem seen in 128Gb of two open-fds is applicable in this case as well as far as base shard is concerned. But the remaining shards (~64GB/64MB - 1 = 1024 - 1 = ~1023) would be written through anon-fds and the lifetime of each anon fd depends on when it was first created (the first write that was received by shard on this shard number) and when that write returns. But if until post-op-delay-secs, more writes are wound on this anon-fd, then they get to leverage the benefits of eager-lock on this anon-fd. Trouble comes only when 1s has passed and this fd is destroyed *after* which a new write comes on this shard causing it to create a new anon-fd. So maybe we're still ok as far as the remaining 1023 shards are concerned but I'll add some counters in the code, test and confirm if that's what happens. For now, I'm trying to see in qemu code what this extra open fd is for and whether it can be eliminated with some qemu-option. More on that later. -Krutika
Ok so I checked the qemu code base and it seems like qemu open()s 2 fds on the vm disk file (in the fuse case): 1. https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/file-posix.c#L482 - fd gotten from this open() is where all the IO is sent 2. https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/file-posix.c#L495 - further down in the same function, this open() is done only for some file locking purposes. But because there are two open fds, AFR would end up releasing the eager-lock from the write fd in order to make way for the "other client" to do IO, so as to prevent any kind of starvation on this second client. As a result, the benefits of eager-lock are lost in RHHI-fuse. (confirmed that these are indeed the same two fds opened by qemu on Nikhil's setup by attaching qemu-kvm process to gdb and simultaneously watching the open fds under /proc go up from 0->1->2) In contrast, with libgfapi, there is only 1 fd that is opened and used for IO on the vdisk - https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/gluster.c#L822 (I don't see any "lock_fd" in gfapi driver code) Will check if there's a way to disable this "locking" feature in qemu, so that the fuse numbers can be regenerated with this feature turned off. -Krutika
(In reply to Krutika Dhananjay from comment #47) > Ok so I checked the qemu code base and it seems like qemu open()s 2 fds on > the vm disk file (in the fuse case): > > 1. > https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/ > block/file-posix.c#L482 - fd gotten from this open() is where all the IO > is sent > > 2. > https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/ > block/file-posix.c#L495 - further down in the same function, this open() > is done only for some file locking purposes. > > But because there are two open fds, AFR would end up releasing the > eager-lock from the write fd in order to make way for the "other client" to > do IO, so as to prevent any kind of starvation on this second client. As a > result, the benefits of eager-lock are lost in RHHI-fuse. > > (confirmed that these are indeed the same two fds opened by qemu on Nikhil's > setup by attaching qemu-kvm process to gdb and simultaneously watching the > open fds under /proc go up from 0->1->2) > > In contrast, with libgfapi, there is only 1 fd that is opened and used for > IO on the vdisk - > https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/ > block/gluster.c#L822 > > (I don't see any "lock_fd" in gfapi driver code) > > Will check if there's a way to disable this "locking" feature in qemu, so > that the fuse numbers can be regenerated with this feature turned off. > > -Krutika Excellent investigative work! This explains so much of what we have been seeing in this bz. Another approach to the problem would be to rethink the eager-locking strategy within AFR -- that's something we have more control over. Is there a way to make it more robust so that it can better handle cases like this with multiple open fds?
(In reply to Manoj Pillai from comment #48) > (In reply to Krutika Dhananjay from comment #47) > > Ok so I checked the qemu code base and it seems like qemu open()s 2 fds on > > the vm disk file (in the fuse case): > > > > 1. > > https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/ > > block/file-posix.c#L482 - fd gotten from this open() is where all the IO > > is sent > > > > 2. > > https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/ > > block/file-posix.c#L495 - further down in the same function, this open() > > is done only for some file locking purposes. > > > > But because there are two open fds, AFR would end up releasing the > > eager-lock from the write fd in order to make way for the "other client" to > > do IO, so as to prevent any kind of starvation on this second client. As a > > result, the benefits of eager-lock are lost in RHHI-fuse. > > > > (confirmed that these are indeed the same two fds opened by qemu on Nikhil's > > setup by attaching qemu-kvm process to gdb and simultaneously watching the > > open fds under /proc go up from 0->1->2) > > > > In contrast, with libgfapi, there is only 1 fd that is opened and used for > > IO on the vdisk - > > https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/ > > block/gluster.c#L822 > > > > (I don't see any "lock_fd" in gfapi driver code) > > > > Will check if there's a way to disable this "locking" feature in qemu, so > > that the fuse numbers can be regenerated with this feature turned off. > > > > -Krutika > > Excellent investigative work! This explains so much of what we have been > seeing in this bz. > > Another approach to the problem would be to rethink the eager-locking > strategy within AFR -- that's something we have more control over. Yeah, I agree. The disabling-of-locking thing was more to see if we can quickly test this out for ourselves in the existing setup as it *seems* like an optional feature. -Krutika Is there > a way to make it more robust so that it can better handle cases like this > with multiple open fds?
(In reply to Krutika Dhananjay from comment #50) > > > > Excellent investigative work! This explains so much of what we have been > > seeing in this bz. Nice work indeed! > > > > Another approach to the problem would be to rethink the eager-locking > > strategy within AFR -- that's something we have more control over. There are two things we can do 1 short-term and another long-term. Longterm: There is a plan to integrate AFR's eager-locking with lock-contention infra by Xavi. commit 7ba7a4b27d124f4ee16fe4776a4670cd5b0160c4 Author: Xavier Hernandez <xhernandez> Date: Wed Jun 15 14:42:19 2016 +0200 locks: added inodelk/entrylk contention upcall notifications The locks xlator now is able to send a contention notification to the current owner of the lock. This is only a notification that can be used to improve performance of some client side operations that might benefit from extended duration of lock ownership. Nothing is done if the lock owner decides to ignore the message and to not release the lock. For forced release of acquired resources, leases must be used. Change-Id: I7f1ad32a0b4b445505b09908a050080ad848f8e0 Signed-off-by: Xavier Hernandez <xhernandez> Short-term: There is one change that can be done to make things a bit better compared to what it is now. At the moment, if at the time of receiving a write if AFR already knows that there are multiple fds opened it will disable eager-lock right away and won't use eager-lock for that write. Instead of that we can disable delayed-post-op/eager-lock after the fop is done. With this model multiple writes will be using the same eager-lock and writes will be sent in batches if they come in quickly. The following link has the patch: https://bugzilla.redhat.com/show_bug.cgi?id=1591208#c8 Let me know if you want to experiment with it. It will eventually flow downstream as we have to address the bug I mentioned above. But if we find the numbers with that change are better, we can prioritize it. With that said, it won't be as good as what you guys will find by disabling the second open Krutika is talking about in the comment. > > Yeah, I agree. The disabling-of-locking thing was more to see if we can > quickly test this out for ourselves in the existing setup as it *seems* like > an optional feature. > > -Krutika > > Is there > > a way to make it more robust so that it can better handle cases like this > > with multiple open fds?
Redirecting NEEDINFO to Fam.
(In reply to Pranith Kumar K from comment #51) > (In reply to Krutika Dhananjay from comment #50) > > > > > > Excellent investigative work! This explains so much of what we have been > > > seeing in this bz. > > Nice work indeed! > > > > > > > Another approach to the problem would be to rethink the eager-locking > > > strategy within AFR -- that's something we have more control over. > > There are two things we can do 1 short-term and another long-term. > > Longterm: > There is a plan to integrate AFR's eager-locking with lock-contention infra > by Xavi. > > commit 7ba7a4b27d124f4ee16fe4776a4670cd5b0160c4 > Author: Xavier Hernandez <xhernandez> > Date: Wed Jun 15 14:42:19 2016 +0200 > > locks: added inodelk/entrylk contention upcall notifications > > The locks xlator now is able to send a contention notification to > the current owner of the lock. > > This is only a notification that can be used to improve performance > of some client side operations that might benefit from extended > duration of lock ownership. Nothing is done if the lock owner decides > to ignore the message and to not release the lock. For forced > release of acquired resources, leases must be used. > > Change-Id: I7f1ad32a0b4b445505b09908a050080ad848f8e0 > Signed-off-by: Xavier Hernandez <xhernandez> > > Short-term: > There is one change that can be done to make things a bit better compared to > what it is now. At the moment, if at the time of receiving a write if AFR > already knows that there are multiple fds opened it will disable eager-lock > right away and won't use eager-lock for that write. Instead of that we can > disable delayed-post-op/eager-lock after the fop is done. With this model > multiple writes will be using the same eager-lock and writes will be sent in > batches if they come in quickly. > > The following link has the patch: > https://bugzilla.redhat.com/show_bug.cgi?id=1591208#c8 > > Let me know if you want to experiment with it. It will eventually flow > downstream as we have to address the bug I mentioned above. But if we find > the numbers with that change are better, we can prioritize it. > > With that said, it won't be as good as what you guys will find by disabling > the second open Krutika is talking about in the comment. > > > > > Yeah, I agree. The disabling-of-locking thing was more to see if we can > > quickly test this out for ourselves in the existing setup as it *seems* like > > an optional feature. > > > > -Krutika > > > > Is there > > > a way to make it more robust so that it can better handle cases like this > > > with multiple open fds? Posted this patch upstream: https://review.gluster.org/c/glusterfs/+/21107
Nikhil/Manoj, Do you want to give this a shot?
Current QEMU allows disabling lock with the file driver's 'locking' option. The command line option looks like this: -drive \ file.locking=off,file.filename=test.qcow2,file.driver=file,driver=qcow2 There is also a WIP on upstream to reduce the lock fd (reuse the I/O fd to achieve the same): https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg04155.html
(In reply to Fam Zheng from comment #55) > Current QEMU allows disabling lock with the file driver's 'locking' option. > The command line option looks like this: > > -drive \ > file.locking=off,file.filename=test.qcow2,file.driver=file,driver=qcow2 > > There is also a WIP on upstream to reduce the lock fd (reuse the I/O fd to > achieve the same): > > https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg04155.html Hi Fam, Thanks for that quick reply! Just wondering - is it possible to achieve the same by perhaps editing the vm's xml file (`virsh edit $VM_NAME`) and restarting it? For example, this is what I see currently in the xml file for the disk of our interest: <disk type='file' device='disk' snapshot='no'> <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/> <source file='/rhev/data-center/mnt/glusterSD/172.18.70.138:_ssd/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/3b77754e-cccc-4ee7-a7d4-2124b39c926b/e052530e-fceb-4e74-b8ac-e34360ca09d3'/> <target dev='sda' bus='scsi'/> <serial>3b77754e-cccc-4ee7-a7d4-2124b39c926b</serial> <alias name='ua-3b77754e-cccc-4ee7-a7d4-2124b39c926b'/> <address type='drive' controller='0' bus='0' target='0' unit='0'/> </disk> Would simply adding "locking='off'" in the driver section also have the same effect: <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' locking='off'/> -Krutika
I'm afraid this option is not available in libvirt xml at the moment.
The ratio of finodelks to writes looks much better at least as far as 128GB shard block size is concerned. And the brick profiles both in gfapi case and comment #62 look similar except that there are fewer writes in the latter case sent in 120s run. In that case, the bottleneck should be above afr in terms of the layer responsible not pumping enough IO down to AFR layer. So the problem must be somewhere between qemu and fuse. But if gluster-fuse itself had a problem, then iops on the host runs should have been bad as well (pointed out by Manoj) and that is not the case. In light of this, I compared code again between gfapi qemu driver and file-posix.c. I single-stepped through the write operation from qemu in fuse and here is the life cycle of every write in file-posix.c: raw_co_prw -> paio_submit_co -> offloaded to an aio worker thread starting at aio_worker -> handle_aiocb_rw -> handle_aiocb_rw_linear -> https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/file-posix.c#L1136 which does a synchronous pwrite() system call. So the thread is blocked until the call returns. In contrast, in gfapi the write goes through qemu_gluster_co_rw -> glfs_pwritev_async() and each of these calls is passed a callback function such as gluster_finish_aiocb() in this case. glfs_pwritev_async() simply winds the call and returns immediately ready to process more parallel writes, without waiting for the response to the write. Whenever the response is received Yaniv, Is there an option in qemu to send truly asynchronous IO where the thread doesn't have to wait for the response and continue sending more parallel requests, similar to the way it is done in gfapi driver? -Krutika
Kevin might know better than me...
So had linux aio been enabled here - https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/file-posix.c#L1556 - the calls would have eventually led to linux aio syscall io_prep_pwrite() being called. So in command line args of the vm where these tests are being run, I see "aio=threads". As per man-page of qemu-kvm: aio=aio aio is "threads", or "native" and selects between pthread based disk I/O and native Linux AIO . I'm guessing we need to set aio=native to have async writes. I'll wait till we hear back from Kevin in any case. -Krutika
That's an easy change - we have this logic of aio=threads on file-based storage and aio=native on block-based storage, but it's really easy to change. I'd start by hacking VDSM, but if needed, I'm sure we can find where to properly change it.
The idea with the thread pool is that there isn't just a single worker thread, but many. So if one thread is blocked while doing a pwrite() syscall, another thread will process the concurrent next request. This does get you real asychronicity, even if each worker thread runs a blocking syscall. You can also switch to Linux AIO, which might be closer to what you meant by "truly asynchronous" because it's not just synchronous calls from multiple threads, but an API that is async by design. This is enabled with aio=native in -drive (and requires cache=none, because Linux AIO just doesn't work without O_DIRECT). In normal cases, both ways perform similarly, with only slight advantages for one or the other configuration (depending on the exact setup). Note that I'm not sure what a fuse driver must do (if anything) to support O_DIRECT and Linux AIO. If the driver doesn't support this, it might fall back to synchronous I/O, which would perform a lot worse than the thread pool. I think you'll see pretty quickly if that's the case.
(In reply to Kevin Wolf from comment #71) > The idea with the thread pool is that there isn't just a single worker > thread, but many. So if one thread is blocked while doing a pwrite() > syscall, another thread will process the concurrent next request. This does > get you real asychronicity, even if each worker thread runs a blocking > syscall. > How many threads in the thread pool by default? How to bump that number up? Would be good to do that experiment and see what that gets us. This test has 4 fio jobs each issuing 16 requests concurrently, so there is a high degree of concurrency in the workload.
(In reply to Manoj Pillai from comment #72) > How many threads in the thread pool by default? How to bump that number up? > Would be good to do that experiment and see what that gets us. It is currently hardcoded to a maximum of 64 worker threads.
(In reply to Kevin Wolf from comment #73) > (In reply to Manoj Pillai from comment #72) > > How many threads in the thread pool by default? How to bump that number up? > > Would be good to do that experiment and see what that gets us. > > It is currently hardcoded to a maximum of 64 worker threads. is there any loss if we increase the thread count ?
(In reply to Kevin Wolf from comment #73) > (In reply to Manoj Pillai from comment #72) > > How many threads in the thread pool by default? How to bump that number up? > > Would be good to do that experiment and see what that gets us. > > It is currently hardcoded to a maximum of 64 worker threads. 64 should be plenty. I guess it makes sense to try with aio=native.
(In reply to Nikhil Chawla from comment #74) > is there any loss if we increase the thread count ? I won't try to predict performance numbers. The only way is to try it out and measure.
(In reply to Yaniv Kaul from comment #70) > That's an easy change - we have this logic of aio=threads on file-based > storage and aio=native on block-based storage, but it's really easy to > change. I'd start by hacking VDSM, but if needed, I'm sure we can find where > to properly change it. I checked the VDSM code and found out the hack were we need to make change to deploy VM with aio=native: https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L948 I tried to change the code in /usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/storage.cp , and cleared all the python byte codes and object codes followed by vdsmd and supervdsmd restart. Then I deployed the VM again on the host on which the changes were executed, but it didn't make any difference. I am still looking forward to find the missing piece of the puzzle.
(In reply to Nikhil Chawla from comment #79) > (In reply to Yaniv Kaul from comment #70) > > That's an easy change - we have this logic of aio=threads on file-based > > storage and aio=native on block-based storage, but it's really easy to > > change. I'd start by hacking VDSM, but if needed, I'm sure we can find where > > to properly change it. > > I checked the VDSM code and found out the hack were we need to make change > to deploy VM with aio=native: > > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage. > py#L948 > > I tried to change the code in > /usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/storage.cp , and > cleared all the python byte codes and object codes followed by vdsmd and > supervdsmd restart. > Then I deployed the VM again on the host on which the changes were executed, > but it didn't make any difference. > > I am still looking forward to find the missing piece of the puzzle. Arik, how do we change the io=threads/native in 4.2?
(In reply to Yaniv Kaul from comment #80) > (In reply to Nikhil Chawla from comment #79) > > (In reply to Yaniv Kaul from comment #70) > > > That's an easy change - we have this logic of aio=threads on file-based > > > storage and aio=native on block-based storage, but it's really easy to > > > change. I'd start by hacking VDSM, but if needed, I'm sure we can find where > > > to properly change it. > > > > I checked the VDSM code and found out the hack were we need to make change > > to deploy VM with aio=native: > > > > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage. > > py#L948 > > > > I tried to change the code in > > /usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/storage.cp , and > > cleared all the python byte codes and object codes followed by vdsmd and > > supervdsmd restart. > > Then I deployed the VM again on the host on which the changes were executed, > > but it didn't make any difference. > > > > I am still looking forward to find the missing piece of the puzzle. > > Arik, how do we change the io=threads/native in 4.2? Since 4.2 it is set by ovirt-engine [1] so it is a bit more difficult to change. Specifically, the logic for Gluster resides in [2] - if libgfapi is used, the disk's type is set to 'network' and so aio would be set to 'native', otherwise to 'threads'. So one way to change this is by modifying ovirt-engine. Alternatively, it might be easier to change this using the before_vm_start hook in VDSM if it's just for testing. [1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.2/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/builder/vminfo/LibvirtVmXmlBuilder.java#L1783 [2] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.2/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/builder/vminfo/VmInfoBuildUtils.java#L938
Created attachment 1483155 [details] Qemu process output
Summarising based on Krutika's inputs and the work done so far - for the performance improvements, we need the following changes 1. Fix in the eager-lock implementation to account for inodelk count rather than fd count 2. Fix in oVirt to use aio=native for VM disks on gluster storage domain 3. Tuning in virt profile to use post-op delay to 5s and performance options - client-io-threads, event-threads Krutika, Manoj - is this accurate? If so, we will raise separate bugs to track these. Thanks!
(In reply to Sahina Bose from comment #90) > Summarising based on Krutika's inputs and the work done so far - for the > performance improvements, we need the following changes > 1. Fix in the eager-lock implementation to account for inodelk count rather > than fd count Yes, eager-lock enhancement that will deal with the RHHI-fuse case better. Will leave implementation details to the AFR team. > 2. Fix in oVirt to use aio=native for VM disks on gluster storage domain Yes, aio=native is better, based on this ONE workload. We need to cover a suite of workloads in perf testing to make sure it works well everywhere. What is the best way? Make it an undocumented option that RHHI can set? > 3. Tuning in virt profile to use post-op delay to 5s and performance options > - client-io-threads, event-threads I'm not adding post-op-delay of 5 secs as an ask, since we got the high IOPS without it. Need AFR team to weigh in on the wider implications, whether raising its value is prudent. > > Krutika, Manoj - is this accurate? If so, we will raise separate bugs to > track these. Thanks!
+1 to everything that is said in comment #91. Bumping up post-op-delay-secs is not necessary as RHHI-FUSE performance is surpassing that of RHHI-gfapi by ~1-2K iops even with the default value of 1 sec.
(In reply to Kevin Wolf from comment #71) The logic in this comment #71 makes sense. But looking at the huge difference in performance for the RHHI-fuse case, I'm wondering if aio=threads is causing some serialization related to file system locking? Does this depend on FS-specific handling of concurrent writes? Does anyone have an AI to investigate this?
(In reply to Manoj Pillai from comment #93) > But looking at the huge difference in performance for the RHHI-fuse case, > I'm wondering if aio=threads is causing some serialization related to file > system locking? I don't think that's the case inside QEMU at least. The aio=threads code path is pretty uneventful from the point where it starts to differ from the aio=native one. It just creates a request structure that it hands to a worker thread, which then calls pwritev(). And that's pretty much it. Is the QEMU command line exactly the same for the two cases apart from the aio=native/threads option? In particular, do they use the same cache mode? (In reply to Kevin Wolf from comment #87) > If both paths were implemented optimally, shouldn't libgfapi always be > slightly faster, or at least at the same performance, as the fuse path? If > you see fuse being faster, I wonder if this hints at a problem in the > libgfapi path. By the way, I had a look at the code used from the libgfapi-based block driver in QEMU. Do I understand correctly that glfs_pwritev_async() always copies the data of the whole request into a temporarily allocated linear buffer with iobuf_copy()? I'm not even sure why a linear buffer would be needed for subvol->fops->writev (the function name clearly suggests it can deal with real iovecs), but if it were indeed required, couldn't that be skipped at least if the input is already linear? I wouldn't be surprised that an additional memcpy() for all data has an impact.
(In reply to Kevin Wolf from comment #94) > (In reply to Manoj Pillai from comment #93) > > But looking at the huge difference in performance for the RHHI-fuse case, > > I'm wondering if aio=threads is causing some serialization related to file > > system locking? > > I don't think that's the case inside QEMU at least. The aio=threads code > path is pretty uneventful from the point where it starts to differ from the > aio=native one. It just creates a request structure that it hands to a > worker thread, which then calls pwritev(). And that's pretty much it. > > Is the QEMU command line exactly the same for the two cases apart from the > aio=native/threads option? In particular, do they use the same cache mode? > > (In reply to Kevin Wolf from comment #87) > > If both paths were implemented optimally, shouldn't libgfapi always be > > slightly faster, or at least at the same performance, as the fuse path? If > > you see fuse being faster, I wonder if this hints at a problem in the > > libgfapi path. > > By the way, I had a look at the code used from the libgfapi-based block > driver in QEMU. Do I understand correctly that glfs_pwritev_async() always > copies the data of the whole request into a temporarily allocated linear > buffer with iobuf_copy()? > > I'm not even sure why a linear buffer would be needed for > subvol->fops->writev (the function name clearly suggests it can deal with > real iovecs), but if it were indeed required, couldn't that be skipped at > least if the input is already linear? I wouldn't be surprised that an > additional memcpy() for all data has an impact. So there was a patch that was sent more than 2 years ago that would avoid this additional copy - https://review.gluster.org/c/glusterfs/+/14784 - which was auto-abandoned due to inactivity. Let me try and revive it. -Krutika
(In reply to Krutika Dhananjay from comment #95) > So there was a patch that was sent more than 2 years ago that would avoid > this additional copy - https://review.gluster.org/c/glusterfs/+/14784 - > which was auto-abandoned due to inactivity. Let me try and revive it. This patch would not actually make the whole operation zero-copy because the API would still require the caller to get the buffer from libgfapi and to copy the data into it. So it would only move the additional copy from the library to the application. What would really be needed is a way to use arbitrary addresses, so that glfs_pwritev_async_common() can directly use the passed iovec (and no API changes would be necessary). subvol->fops->writev already takes an arbitrary vector, but it also requires an iobref. Maybe actually just passing an empty iobref would work? Its only purpose seems to be managing the lifecycle of buffers, and if the buffer comes from somewhere external, it doesn't have to be managed.
(In reply to Manoj Pillai from comment #91) > (In reply to Sahina Bose from comment #90) > > Summarising based on Krutika's inputs and the work done so far - for the > > performance improvements, we need the following changes > > 1. Fix in the eager-lock implementation to account for inodelk count rather > > than fd count > > Yes, eager-lock enhancement that will deal with the RHHI-fuse case better. > Will leave implementation details to the AFR team. > > > 2. Fix in oVirt to use aio=native for VM disks on gluster storage domain > > Yes, aio=native is better, based on this ONE workload. We need to cover a > suite of workloads in perf testing to make sure it works well everywhere. > > What is the best way? Make it an undocumented option that RHHI can set? undocumented option? Or did you mean configurable? I've raised a bug 1630744 - we will explore options there > > > 3. Tuning in virt profile to use post-op delay to 5s and performance options > > - client-io-threads, event-threads > > I'm not adding post-op-delay of 5 secs as an ask, since we got the high IOPS > without it. Need AFR team to weigh in on the wider implications, whether > raising its value is prudent. > > > > > Krutika, Manoj - is this accurate? If so, we will raise separate bugs to > > track these. Thanks!
(In reply to Sahina Bose from comment #97) [...] > > > > > 2. Fix in oVirt to use aio=native for VM disks on gluster storage domain > > > > Yes, aio=native is better, based on this ONE workload. We need to cover a > > suite of workloads in perf testing to make sure it works well everywhere. > > > > What is the best way? Make it an undocumented option that RHHI can set? > > undocumented option? Or did you mean configurable? > I've raised a bug 1630744 - we will explore options there That works. Thanks!
Verified that the following changes are available. 1. client-io-threads on the volume is enabled 2. cluster.choose-local is set to false 3. aio=native is turned on RHV side
(In reply to SATHEESARAN from comment #102) > Verified that the following changes are available. > > 1. client-io-threads on the volume is enabled > 2. cluster.choose-local is set to false > 3. aio=native is turned on RHV side Moving to verified based on the above comments