Bug 1616270

Summary: [Tracker] Low Random write IOPS in VM in RHHI 2.0
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nikhil Chawla <nichawla>
Component: rhhi
Assignee: Sahina Bose <sabose>
Status: CLOSED CURRENTRELEASE
QA Contact: bipin <bshetty>
Severity: urgent
Docs Contact:
Priority: high
Version: rhhiv-1.5
CC: ahadas, bshetty, godas, guillaume.pavese, kdhananj, kwolf, michal.skrivanek, mkarg, mpillai, nichawla, pkarampu, psuriset, ravishankar, rcyriac, rhs-bugs, sabose, sankarshan, seamurph, shberry, vbellur, ykaul
Target Milestone: ---
Keywords: Tracking
Target Release: RHHI-V 1.5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1630368 1630688 (view as bug list)
Environment:
Last Closed: 2019-05-20 04:54:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1619627, 1630688, 1630744
Bug Blocks: 1520836
Attachments:
  Qemu process output (flags: none)

Description Nikhil Chawla 2018-08-15 13:07:15 UTC
Description of problem:
To get the baseline random write IOPS of the SSD gluster volume (the volume that is added as a storage domain in RHHI to provide block storage to provisioned VMs), the volume is mounted on one of the hosts and a fio random write run is executed on it. Then the same volume is added as a storage domain, a disk on it is attached as a block device to a Virtual Machine (VM) running on top of RHHI, and the same fio random write test is executed within the VM (a sketch of such a job is shown after the results). Following are the results:

1. Baseline Random Write IOPS: 6456
2. Virtual Machine Random Write IOPS: 602
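
For reference, a fio job of the kind described above would look roughly like the sketch below. This is only an illustration: the exact parameters and target directory are assumptions pieced together from details mentioned later in this bug (4 jobs with iodepth=16, 4k block size, 16GB per job, end_fsync), not the exact job file used.

  fio --name=randwrite --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
      --iodepth=16 --numjobs=4 --size=16g --end_fsync=1 \
      --directory=/path/to/target

The same job is run once against the fuse-mounted volume on a host (baseline) and once inside the VM whose disk lives on that volume.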

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 11 Manoj Pillai 2018-08-17 06:12:54 UTC
Ravi,

The results are indicating that RHHI-libgfapi is performing well, whereas RHHI-fuse is not; also the gluster volume profiles are showing that in the RHHI-fuse case there is a high proportion of FINODELK and FXATTROP calls (not seen in RHHI-libgfapi). 

Is this indicating that the recent eager-lock enhancements are less effective over fuse, and more work is needed there? (Both RHHI-libgfapi and the libgfapi-based gluster-block have shown good performance with the eager-lock fixes). Requesting your comments on the reported results and collected data, and what might be the way forward.
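
For reference, the per-fop breakdowns being compared here are typically collected with gluster's volume profiling; a minimal sketch, with the volume name as a placeholder:

  gluster volume profile <volname> start
  # ... run the fio workload ...
  gluster volume profile <volname> info
  gluster volume profile <volname> stop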

Comment 12 Ravishankar N 2018-08-20 07:03:06 UTC
Hi Manoj,

As far as the eager lock modification https://code.engineering.redhat.com/gerrit/#/c/140577/ is concerned, I'm not able to find any potential issues. The changes are independent of which xlator sits on top, i.e. if the order and sequence of FOPS reaching AFR are the same irrespective of whether it is gfapi or fuse, the no. of FINODELK + WRITE + FXATTROP generated by AFR will be the same. The fact that it isn't seems to indicate there is a difference in the I/O pattern reaching AFR with gfapi vs fuse. We would need to figure out why that is.

Comment 13 Ravishankar N 2018-08-20 07:07:19 UTC
Nikhil, just wanted to check if 

1) you got a chance to run the tests with eager-lock disabled and compare the results of fio on gfapi vs fuse and you are still seeing  the performance drop.

2)Whether having a fully preallocated vm-disk (comment #9) makes any difference to the results.

Comment 15 Nikhil Chawla 2018-08-21 07:55:02 UTC
(In reply to Ravishankar N from comment #13)
> Nikhil, just wanted to check if 
> 
> 1) you got a chance to run the tests with eager-lock disabled and compare
> the results of fio on gfapi vs fuse and you are still seeing  the
> performance drop.

Yes, I ran the test with eager-lock off on RHHI-FUSE and RHHI-Libgfapi. The difference still persists.

RHHI-libgfapi - 1216
RHHI-Fuse - 392
 
> 2)Whether having a fully preallocated vm-disk (comment #9) makes any
> difference to the results.

Before executing the random write test, a sequential write test is executed first, which allocates the shards; the fio random IO is then executed on top of the same file. Hence, this is effectively already a preallocated test.

Comment 17 Ravishankar N 2018-08-21 09:40:27 UTC
Just some points that I observed during discussion with Sahina and Nikhil :

1. Comparing RHHI-libgfapi vs  RHHI-Fuse: Fuse is expected to be slower because of the context switches between kernel and fuse-bridge.

2. Running fio inside a VM  backed by gluster is not identical to running fio inside a gluster fuse mount. Fio inside VM can operate on multiple files (that it creates) using multiple fds, whereas the fio job running on a VM will always come on a single FD on gluster (irrespective of gfapi or fuse based access)

Comment 18 Ravishankar N 2018-08-21 09:42:24 UTC
(In reply to Ravishankar N from comment #17)

> Fio inside VM can operate on multiple files
> (that it creates) using multiple fds,

typo. I meant " Fio inside fuse mount can operate on multiple files (that it creates) using multiple fds"

Too many context switches :(

Comment 22 Manoj Pillai 2018-08-21 12:25:50 UTC
(In reply to Ravishankar N from comment #21)
> Manoj, What is the expectation from this BZ? 
> 
> It is to see why the profile info is different for a VM on RHHI using fuse
> vs libgfapi (and not focus on *why* performance numbers are better on
> libgfapi?)

This is a performance bug on RHHI on a random write workload.

RHHI, in the supported configuration, is providing ~600 IOPS on a random write test, with 4k block size. Why so low? What is holding it back?

* The underlying devices are NVMe, probably capable of 200+K IOPS. So that's not the bottleneck.

* Running the test on the volume mounting it directly on the host gives about 3K IOPS; with some tuning we are able to bump that up to 6+K IOPS. [Nikhil has posted data for single-file (to simulate the RHHI case) and multiple files; in each case gluster is able to hit ~6K IOPS]. So, the 600K IOPS that RHHI is giving is not because of inherent gluster limits.

* Running the test with RHHI-libgfapi path, the same test run on the VM again gives approx. 3K IOPS, and with tuning it gives 6K IOPS. So it is not that running in the VM is the cause of the low IOPS.

So what is holding back RHHI performance on this workload in the supported RHHI-fuse configuration? The task here is to find the root cause and a fix.

The gluster volume profiles posted in comment #3 and others, are showing that the RHHI-fuse case is showing much higher proportion of FINODELK and FXATTROP calls. According to my conversation with you, that suggests that the eager locking is not effective. That's the biggest clue that we have so far, IMO. But feel free to propose hypotheses for the reason for poor performance, and validate them.

Setting a needinfo on the assignee to take this forward.

Comment 23 Manoj Pillai 2018-08-21 12:47:20 UTC
(In reply to Manoj Pillai from comment #22)

> multiple files; in each case gluster is able to hit ~6K IOPS]. So, the 600K
> IOPS that RHHI is giving is not because of inherent gluster limits.

That should read 600 IOPS, obviously. Otherwise, we would have a bug of a very different nature :-).

Comment 24 Ravishankar N 2018-08-21 12:53:30 UTC
(In reply to Manoj Pillai from comment #22)

> 
> The gluster volume profiles posted in comment #3 and others, are showing
> that the RHHI-fuse case is showing much higher proportion of FINODELK and
> FXATTROP calls. According to my conversation with you, that suggests that
> the eager locking is not effective. That's the biggest clue that we have so
> far, IMO. But feel free to propose hypotheses for the reason for poor
> performance, and validate them.
> 

I will explore if it is possible to load 2 io-stats xlators manually before and after AFR in the client volfile, and run the VM on gfapi (6000 IOPS) vs VM on fuse (600 IOPS) fio test and capture the profile info for each run. I think it is doable for fuse mount, but let me check for gfapi and confirm. If yes, I will sit with Nikhil and do this on the same setup.
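
For illustration, an extra io-stats instance in the client volfile would be a stanza roughly like the one below, inserted above the replicate subvolume and wired in by adjusting the surrounding subvolumes lines; the names here are placeholders and the real stanza depends on the generated volfile:

  volume ssd-io-stats-above-afr
      type debug/io-stats
      subvolumes ssd-replicate-0
  end-volume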

Comment 26 Krutika Dhananjay 2018-08-21 13:28:18 UTC
I ran the same random write fio workload on 2 3-way replicated volumes - one with shard and another without it to keep the test simple. I did this mostly to eliminate the possibility of any bugs in shard that could be causing it to send FXATTROPS even with already created files.
I also had io-stats loaded at multiple points in the client stack of the replicated-sharded volume. (This was done entirely on FUSE mount - no gfapi).
Eager-lock was enabled on both volumes.

Results/Observations:
====================

1. With the unsharded volume:
From brick profile:
      0.00      30.28 us      18.25 us      54.72 us              8    FINODELK
      0.00      82.79 us      46.91 us     118.46 us              4        OPEN
      0.00     307.96 us     222.14 us     418.21 us              4       FSYNC
      0.00     214.29 us      64.62 us     599.05 us             12    FXATTROP
    100.00     168.84 us      73.69 us  434220.60 us        1071435       WRITE

This makes perfect sense.
The fio job operates on 4 files. So 4 * 2 (lock + unlock) = 8 finodelks.
And there are 4 fsyncs - perhaps one per file. And I see that afr's fsync is also converted into a data transaction, so it would have piggybacked on the existing eager-locks on the files, leading to no extra finodelks. All good so far.

2. With sharded volume:

Shard's parent received the following fops:
WRITE             721941      875.17 us      237.37 us  3574429.83 us
FLUSH                  4   183756.46 us     3950.93 us   700065.74 us
FSYNC                  4  2639002.96 us  2502096.61 us  3022192.69 us
LOOKUP                 7     4612.40 us      385.51 us    24819.05 us

Not very different from (1) above. Layers above shard see only 4 files. So 4 fsyncs, 4 flushes etc.

Shard's child received the following fops:
WRITE             721941      858.33 us      226.74 us  3574409.67 us
FLUSH                  4   183750.80 us     3946.39 us   700060.42 us
FSYNC               1024  1761200.73 us   248645.32 us  3019774.01 us
LOOKUP              1049      814.11 us      305.65 us    24783.29 us

The numbers still make sense. 16GB*4 = 64GB. And with shard-block-size of 64MB, there are 1024 shards. So 1024 fsyncs, and approximately 1024 lookups as well.
Also note that there are no fxattrops originating from shard itself. So at least there are no bugs in shard's write fop. So far so good.

Each child of replicate-0 received the following fops:

WRITE             721941      544.15 us      152.05 us  3260626.44 us
FSYNC               1024  1222615.59 us   247642.73 us  1957657.64 us
LOOKUP              1053      560.92 us      198.14 us    10163.24 us
FINODELK           11538   153684.12 us       65.11 us  3260720.01 us
FXATTROP           11538   698127.39 us      177.48 us  3469537.85 us

Since there are 1024 shards, I'd imagine that AFR would be taking eager-locks 1024 * 2 (lock + unlock)  = 2048 times.
But why is afr then sending 11538 (>> 2048) inodelks? *Even* if I assume fsync uses its own eager-lock, that would still account for 2048  * 2 = 4096 inodelks.  What are the rest of the finodelks and fxattrops being wound for?

Couple of reasons I can think of:

1. I don't know entirely how the new eager-lock implementation works. Does it still release the lock after 1 sec of inactivity on the fd like it used to in the older scheme? ...

Ravi would you know? ^^

... and the degree of randomness of the fio write was possibly such that a given shard saw no writes for > 1 sec causing the eager-lock to be released and re-acquired multiple times over the course of the run?


2. Another question that crops up in my mind - does eager-lock optimize the locking well for writes on anonymous fds too, given that shard uses anon fds to perform its IO?

For 2, let me write a quick and dirty patch in shard to make it OPEN the shards explicitly before performing writes and check whether that makes a difference.

-Krutika

Comment 27 Ravishankar N 2018-08-23 01:26:11 UTC
(In reply to Krutika Dhananjay from comment #26)

> Couple of reasons I can think of:
> 
> 1. I don't know entirely how the new eager-lock implementation works. Does
> it still release the lock after 1 sec of inactivity on the fd like it used
> to in the older scheme? ...
> 
> Ravi would you know? ^^

The new eager-lock implementation, as I understand it (not going to lie, I'm not yet fully familiar with the number of fop queues it maintains and how it moves fops from one queue to another), is no longer fd-based but is tied to the inode. Fops on the same inode are put into different queues based on various conditions and processed. During post-op, if there is only one fd, there is a wait of post-op-delay-secs (1 sec by default) before doing the on-wire post-op and unlock. So yes, the old behaviour does seem to be in place.
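
For reference, the delay being referred to here is the cluster.post-op-delay-secs volume option; a quick way to check or change it (volume name is a placeholder):

  gluster volume get <volname> cluster.post-op-delay-secs
  gluster volume set <volname> cluster.post-op-delay-secs 1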

> 
> ... and the degree of randomness of the fio write was possibly such that a
> given shard saw no writes for > 1 sec causing the eager-lock to be released
> and re-acquired multiple times over the course of the run?
> 
> 
> 2. Another question that crops up in my mind - does eager-lock optimize the
> locking well for writes on anonymous fds too, given that shard uses anon fds
> to perform its IO?
> 
AFR checks for GLUSTERFS_ACTIVE_FD_COUNT in every write cbk and disables eager lock if multiple fds are opened. If anon-fds contribute towards increasing this count, then I'm guessing eager lock benefits would be lost. But this is something that has been there even before the new eager lock changes.

Comment 28 Krutika Dhananjay 2018-08-23 05:12:19 UTC
One more thing that could be contributing to the wide gap in performance between fuse and libgfapi is FUSE event-history, which, as the name suggests, is FUSE-specific and causes mutex contention between the FUSE and epoll threads, in both the request and response paths of fops, over a global circular buffer. This was fixed in RHGS-3.4.0, but is not part of the RHGS-3.3.1 async branch.

Here's some numbers with and without event-history from a run I'd done long time back - https://bugzilla.redhat.com/show_bug.cgi?id=1501013#c0

(This is also reflected in the cpu usage of the fuse reader thread, which hits ~99% utilization with event-history disabled).

-Krutika

Comment 30 Manoj Pillai 2018-08-23 08:59:31 UTC
(In reply to Ravishankar N from comment #21)
> Manoj, What is the expectation from this BZ? 
> 

(In reply to Manoj Pillai from comment #22)
<...>
> RHHI, in the supported configuration, is providing ~600 IOPS on a random
> write test, with 4k block size. Why so low? What is holding it back?
> 
<...>
> * Running the test with RHHI-libgfapi path, the same test run on the VM
> again gives approx. 3K IOPS, and with tuning it gives 6K IOPS. So it is not
> that running in the VM is the cause of the low IOPS.
> 
> So what is holding back RHHI performance on this workload in the supported
> RHHI-fuse configuration? The task here is to find the root cause and a fix.
> 

Expressing the expectation as numbers, IMO the RHHI-fuse IOPS results should be close to the RHHI-libgfapi results for this test i.e. should be able to hit close to 6K IOPS with tuning. If not, we should be able to give an explanation for why that is not happening.

Note that the test itself is not a latency-sensitive test -- it has a large no. of concurrent requests outstanding (4 fio jobs with iodepth=16 each) and we are focusing on IOPS. Even though it is expected that the fuse path will have a somewhat higher latency, with so many concurrent requests in flight, I'd expect it to be able to match the libgfapi path on IOPS, if latency is the only difference in the two paths.

Comment 38 Krutika Dhananjay 2018-08-28 06:58:51 UTC
What does the fio option "end_fsync" do? 

man fio(1) says this - "Sync file contents when job exits."
But does it have any bearing on the IOPs fio would report? Or can all FSYNC (gluster) transactions be safely ignored for the purpose of this debugging exercise?

Comment 41 Krutika Dhananjay 2018-08-31 05:54:26 UTC
On a related note, I was going through eager-lock implementation and one thing that caught my eye was this:

https://github.com/gluster/glusterfs/blob/d6d729b0609957c0382749c30da507dda77561b7/xlators/cluster/afr/src/afr-transaction.c#L2023

IIUC, delayed post-op is ONLY applicable to GF_FOP_WRITE and GF_FOP_FXATTROP.
Given that afr's FSYNC is also a transaction now, can we NOT change this piece of code to also make FSYNC delay its post-op and unlock, like this:

<snippet>

diff --git a/xlators/cluster/afr/src/afr-transaction.c b/xlators/cluster/afr/src/afr-transaction.c
index 9c587db05..6b0bfcb6c 100644
--- a/xlators/cluster/afr/src/afr-transaction.c
+++ b/xlators/cluster/afr/src/afr-transaction.c
@@ -2020,7 +2020,8 @@ afr_is_delayed_changelog_post_op_needed (call_frame_t *frame, xlator_t *this,
         }
 
         if ((local->op != GF_FOP_WRITE) &&
-            (local->op != GF_FOP_FXATTROP)) {
+            (local->op != GF_FOP_FXATTROP) &&
+            (local->op != GF_FOP_FSYNC)) {
                 /*Only allow writes but shard does [f]xattrops on writes, so
                  * they are fine too*/
                 goto out;

</snippet>

and that should hopefully give us better performance?

This may not yield much benefit if the I/O pattern is such that FSYNC is sent after the last WRITE (like the fio job with end_fsync flag that Nikhil's using in his test).

But, if there is a possibility that there are lot more fsyncs that are sent between writes, then this should help.

For example, in Sahina's tests, we saw this in the vol profile:

7.57      30.93 us       9.00 us     627.00 us          27442    FINODELK
20.14     172.02 us      28.00 us     772.00 us          13126       FSYNC
26.41     110.26 us      54.00 us    2185.00 us          26855    FXATTROP
45.11     151.55 us      78.00 us    1189.00 us          33373       WRITE


33K writes spread across ~960 shards, and a lot of fsyncs.

Ravi, could you confirm if this piece of code can be optimized and doesn't lead to any kind of data loss (I think not because anyway fsync would have done a pre-op, so no information is lost in the face of failure, but I'll let you confirm)? If yes, I have a patch ready that I'll send out for review.

-Krutika

Comment 42 Ravishankar N 2018-09-04 06:08:58 UTC
(In reply to Krutika Dhananjay from comment #41)

I think it is better not to optimise fsyncs using eager locking, because fsync gives guarantees to the application regarding previous writes. So if an fsync fails on a few children of AFR, it is better to capture that information on a per-FOP basis by doing 'dirty' (pre-op) and 'undirty + pending xattr in case of failure' (post-op) for every fsync call.

Pranith,
1. Does that make sense? 
2. Is there any reason that you added GF_FOP_FXATTROP, but not GF_FOP_FTRUNCATE, GF_FOP_FALLOCATE, GF_FOP_ZEROFILL, etc. to the afr_is_delayed_changelog_post_op_needed() function?

Comment 43 Pranith Kumar K 2018-09-04 08:27:42 UTC
(In reply to Ravishankar N from comment #42)
> (In reply to Krutika Dhananjay from comment #41)
> 
> I think it is better not to optimise fysncs using eager locking because
> fsync gives guarantees to the application regarding previous writes. So if
> an fsync fails on a few children of AFR, it is better to capture that
> information on a per FOP basis by doing 'dirty' (pre-op) and 'undirty +
> pending xattr in case of failure' (post-op) for every fsync call.

When a failure happens, the delayed post-op doesn't happen anyway, right? Otherwise it is a bug. Since fsync happens with a full lock, no other operations should be in progress, so it can use the eager-lock and do delayed post-op like the other fops. I don't foresee any issues at the moment, at least.

> 
> Pranith,
> 1. Does that make sense? 
> 2. Is there any reason that you added GF_FOP_FXATTROP, but not
> GF_FOP_FTRUNCATE, GF_FOP_FALLOCATE. GF_FOP_ZEROFILL  etc to the
> afr_is_delayed_changelog_post_op_needed() function?

I wanted to keep the implementation close to the earlier implementation, except for shard workloads, which do fxattrops. We could add the others too, but it would have increased the test scope at the time. We can add them in future releases and get enough test cycles to confirm that the solution holds true.

Comment 46 Krutika Dhananjay 2018-09-05 12:07:40 UTC
(In reply to Ravishankar N from comment #17)
> Just some points that I observed during discussion with Sahina and Nikhil :
> 
> 1. Comparing RHHI-libgfapi vs  RHHI-Fuse: Fuse is expected to be slower
> because of the context switches between kernel and fuse-bridge.
> 
> 2. Running fio inside a VM  backed by gluster is not identical to running
> fio inside a gluster fuse mount. Fio inside VM can operate on multiple files
> (that it creates) using multiple fds, whereas the fio job running on a VM
> will always come on a single FD on gluster (irrespective of gfapi or fuse
> based access)


First of all, a big fat thanks to Nikhil for patiently running all the tests requested by devs. :)

I was expecting much better performance from the 128GB shard-block-size because there would only be 2 shards - one of them being the base shard opened by qemu and the other created by shard under .shard (this is what we saw on Nikhil's setup) - so all of the IO and the eager-locking would be confined to these two files. But the profile tells a different story:

     10.01      35.64 us      14.00 us    1525.00 us         207417    FINODELK
     14.79     105.38 us      61.00 us    4848.00 us         103679       WRITE
     32.24     114.85 us      51.00 us    4945.00 us         207388    FXATTROP
     42.90     305.71 us      97.00 us    5484.00 us         103670       FSYNC

FINODELK:FXATTROP:WRITE = 2:2:1.

Nikhil ran the same fio job from within the VM in both the gfapi and fuse cases. What we saw was that in the fuse case, qemu had 2 open fds on the same vdisk file (we'd previously assumed that there would be only 1 open fd). This would have caused the benefits of eager-lock to be lost with FUSE:

<from-qemu-kvm's-proc/pid/fd>
...
...
lrwx------. 1 qemu qemu 64 Sep  5 07:45 42 -> /rhev/data-center/mnt/glusterSD/172.18.70.138:_ssd/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/628cbfee-b259-4393-ae1f-c9513da313bd/3d622030-bf26-4a44-ba6f-d9953fcdbebc
lrwx------. 1 qemu qemu 64 Sep  5 07:45 43 -> /rhev/data-center/mnt/glusterSD/172.18.70.138:_ssd/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/628cbfee-b259-4393-ae1f-c9513da313bd/3d622030-bf26-4a44-ba6f-d9953fcdbebc
...
...
</from-qemu-kvm's-proc/pid/fd>
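
The listing above can be reproduced with something along these lines (the pidof lookup is illustrative and assumes a single qemu-kvm process on the host):

  ls -l /proc/$(pidof qemu-kvm)/fd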

From brick statedump:

[conn.0.bound_xl./gluster_bricks/ssd/ssd.active.1]
gfid=4eb75879-539d-4508-a92c-73a76b7a9035
nlookup=57
fd-count=2
active-fd-count=2 <------------ active fd count is 2
ref=3
ia_type=1

[xlator.features.locks.ssd-locks.inode]
path=/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/3b77754e-cccc-4ee7-a7d4-2124b39c926b/e052530e-fceb-4e74-b8ac-e34360ca09d3

Furthermore, the same test was run with gfapi as well (128GB shards), and the brick statedump reported this:

[conn.0.bound_xl./gluster_bricks/ssd/ssd.active.1]
gfid=4eb75879-539d-4508-a92c-73a76b7a9035
nlookup=35
fd-count=1
active-fd-count=1 <----------------- active fd count is 1
ref=1
ia_type=1

[xlator.features.locks.ssd-locks.inode]
path=/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/3b77754e-cccc-4ee7-a7d4-2124b39c926b/e052530e-fceb-4e74-b8ac-e34360ca09d3

And here's what the volume-profile reported:

      0.00      35.55 us      17.00 us      79.00 us             56    FINODELK
      0.00     221.94 us     193.00 us     360.00 us             17       FSYNC
      0.00      45.19 us      24.00 us     103.00 us            105      STATFS
      0.01     138.21 us      77.00 us     401.00 us            325      LOOKUP
      0.04    1172.75 us      60.00 us    4932.00 us            228    FXATTROP
     99.95     353.96 us      50.00 us   14868.00 us        1901491       WRITE

The FINODELK:WRITE ratio looks much better.

This is also one of the reasons for the big gap between fuse and gfapi (and probably the biggest impediment to eager-lock performance with fuse identified so far).

Going back to the case where shard-block-size is 64MB: the same two-open-fds problem seen with 128GB applies here as well, as far as the base shard is concerned. The remaining shards (~64GB/64MB - 1 = 1024 - 1 = ~1023) would be written through anon-fds, and the lifetime of each anon-fd depends on when it was first created (the first write that was received by shard on this shard number) and when that write returns. If more writes are wound on this anon-fd within post-op-delay-secs, they get to leverage the benefits of eager-lock on it. Trouble comes only when 1s has passed and this fd is destroyed, *after* which a new write comes on this shard, causing it to create a new anon-fd. So maybe we're still OK as far as the remaining 1023 shards are concerned, but I'll add some counters in the code, test, and confirm whether that's what happens.

For now, I'm trying to see in qemu code what this extra open fd is for and whether it can be eliminated with some qemu-option. More on that later.

-Krutika

Comment 47 Krutika Dhananjay 2018-09-06 06:21:07 UTC
Ok so I checked the qemu code base and it seems like qemu open()s 2 fds on the vm disk file (in the fuse case):

1. https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/file-posix.c#L482   - fd gotten from this open() is where all the IO is sent

2. https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/file-posix.c#L495   - further down in the same function, this open() is done only for some file locking purposes.

But because there are two open fds, AFR would end up releasing the eager-lock from the write fd in order to make way for the "other client" to do IO, so as to prevent any kind of starvation on this second client. As a result, the benefits of eager-lock are lost in RHHI-fuse.

(confirmed that these are indeed the same two fds opened by qemu on Nikhil's setup by attaching qemu-kvm process to gdb and simultaneously watching the open fds under /proc go up from 0->1->2)

In contrast, with libgfapi, there is only 1 fd that is opened and used for IO on the vdisk - https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/gluster.c#L822

(I don't see any "lock_fd" in gfapi driver code)

Will check if there's a way to disable this "locking" feature in qemu, so that the fuse numbers can be regenerated with this feature turned off.

-Krutika

Comment 48 Manoj Pillai 2018-09-06 07:06:42 UTC
(In reply to Krutika Dhananjay from comment #47)
> Ok so I checked the qemu code base and it seems like qemu open()s 2 fds on
> the vm disk file (in the fuse case):
> 
> 1.
> https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/
> block/file-posix.c#L482   - fd gotten from this open() is where all the IO
> is sent
> 
> 2.
> https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/
> block/file-posix.c#L495   - further down in the same function, this open()
> is done only for some file locking purposes.
> 
> But because there are two open fds, AFR would end up releasing the
> eager-lock from the write fd in order to make way for the "other client" to
> do IO, so as to prevent any kind of starvation on this second client. As a
> result, the benefits of eager-lock are lost in RHHI-fuse.
> 
> (confirmed that these are indeed the same two fds opened by qemu on Nikhil's
> setup by attaching qemu-kvm process to gdb and simultaneously watching the
> open fds under /proc go up from 0->1->2)
> 
> In contrast, with libgfapi, there is only 1 fd that is opened and used for
> IO on the vdisk -
> https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/
> block/gluster.c#L822
> 
> (I don't see any "lock_fd" in gfapi driver code)
> 
> Will check if there's a way to disable this "locking" feature in qemu, so
> that the fuse numbers can be regenerated with this feature turned off.
> 
> -Krutika

Excellent investigative work! This explains so much of what we have been seeing in this bz.

Another approach to the problem would be to rethink the eager-locking strategy within AFR -- that's something we have more control over. Is there a way to make it more robust so that it can better handle cases like this with multiple open fds?

Comment 50 Krutika Dhananjay 2018-09-06 07:17:15 UTC
(In reply to Manoj Pillai from comment #48)
> (In reply to Krutika Dhananjay from comment #47)
> > Ok so I checked the qemu code base and it seems like qemu open()s 2 fds on
> > the vm disk file (in the fuse case):
> > 
> > 1.
> > https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/
> > block/file-posix.c#L482   - fd gotten from this open() is where all the IO
> > is sent
> > 
> > 2.
> > https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/
> > block/file-posix.c#L495   - further down in the same function, this open()
> > is done only for some file locking purposes.
> > 
> > But because there are two open fds, AFR would end up releasing the
> > eager-lock from the write fd in order to make way for the "other client" to
> > do IO, so as to prevent any kind of starvation on this second client. As a
> > result, the benefits of eager-lock are lost in RHHI-fuse.
> > 
> > (confirmed that these are indeed the same two fds opened by qemu on Nikhil's
> > setup by attaching qemu-kvm process to gdb and simultaneously watching the
> > open fds under /proc go up from 0->1->2)
> > 
> > In contrast, with libgfapi, there is only 1 fd that is opened and used for
> > IO on the vdisk -
> > https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/
> > block/gluster.c#L822
> > 
> > (I don't see any "lock_fd" in gfapi driver code)
> > 
> > Will check if there's a way to disable this "locking" feature in qemu, so
> > that the fuse numbers can be regenerated with this feature turned off.
> > 
> > -Krutika
> 
> Excellent investigative work! This explains so much of what we have been
> seeing in this bz.
> 
> Another approach to the problem would be to rethink the eager-locking
> strategy within AFR -- that's something we have more control over.

Yeah, I agree. The disabling-of-locking thing was more to see if we can quickly test this out for ourselves in the existing setup as it *seems* like an optional feature.

-Krutika

 Is there
> a way to make it more robust so that it can better handle cases like this
> with multiple open fds?

Comment 51 Pranith Kumar K 2018-09-06 09:26:09 UTC
(In reply to Krutika Dhananjay from comment #50)
> > 
> > Excellent investigative work! This explains so much of what we have been
> > seeing in this bz.

Nice work indeed!

> > 
> > Another approach to the problem would be to rethink the eager-locking
> > strategy within AFR -- that's something we have more control over.

There are two things we can do 1 short-term and another long-term.

Longterm:
There is a plan to integrate AFR's eager-locking with lock-contention infra by Xavi.

commit 7ba7a4b27d124f4ee16fe4776a4670cd5b0160c4
Author: Xavier Hernandez <xhernandez>
Date:   Wed Jun 15 14:42:19 2016 +0200

    locks: added inodelk/entrylk contention upcall notifications
    
    The locks xlator now is able to send a contention notification to
    the current owner of the lock.
    
    This is only a notification that can be used to improve performance
    of some client side operations that might benefit from extended
    duration of lock ownership. Nothing is done if the lock owner decides
    to ignore the message and to not release the lock. For forced
    release of acquired resources, leases must be used.
    
    Change-Id: I7f1ad32a0b4b445505b09908a050080ad848f8e0
    Signed-off-by: Xavier Hernandez <xhernandez>

Short-term:
There is one change that can be done to make things a bit better compared to what we have now. At the moment, if at the time of receiving a write AFR already knows that there are multiple fds opened, it will disable eager-lock right away and won't use eager-lock for that write. Instead of that, we can disable delayed-post-op/eager-lock after the fop is done. With this model, multiple writes will use the same eager-lock, and writes will be sent in batches if they come in quickly.

The following link has the patch:
https://bugzilla.redhat.com/show_bug.cgi?id=1591208#c8

Let me know if you want to experiment with it. It will eventually flow downstream as we have to address the bug I mentioned above. But if we find the numbers with that change are better, we can prioritize it.

With that said, it won't be as good as what you guys will find by disabling the second open Krutika is talking about in the comment.

> 
> Yeah, I agree. The disabling-of-locking thing was more to see if we can
> quickly test this out for ourselves in the existing setup as it *seems* like
> an optional feature.
> 
> -Krutika
> 
>  Is there
> > a way to make it more robust so that it can better handle cases like this
> > with multiple open fds?

Comment 52 Yaniv Kaul 2018-09-06 13:11:44 UTC
Redirecting NEEDINFO to Fam.

Comment 53 Pranith Kumar K 2018-09-06 13:19:10 UTC
(In reply to Pranith Kumar K from comment #51)
> (In reply to Krutika Dhananjay from comment #50)
> > > 
> > > Excellent investigative work! This explains so much of what we have been
> > > seeing in this bz.
> 
> Nice work indeed!
> 
> > > 
> > > Another approach to the problem would be to rethink the eager-locking
> > > strategy within AFR -- that's something we have more control over.
> 
> There are two things we can do 1 short-term and another long-term.
> 
> Longterm:
> There is a plan to integrate AFR's eager-locking with lock-contention infra
> by Xavi.
> 
> commit 7ba7a4b27d124f4ee16fe4776a4670cd5b0160c4
> Author: Xavier Hernandez <xhernandez>
> Date:   Wed Jun 15 14:42:19 2016 +0200
> 
>     locks: added inodelk/entrylk contention upcall notifications
>     
>     The locks xlator now is able to send a contention notification to
>     the current owner of the lock.
>     
>     This is only a notification that can be used to improve performance
>     of some client side operations that might benefit from extended
>     duration of lock ownership. Nothing is done if the lock owner decides
>     to ignore the message and to not release the lock. For forced
>     release of acquired resources, leases must be used.
>     
>     Change-Id: I7f1ad32a0b4b445505b09908a050080ad848f8e0
>     Signed-off-by: Xavier Hernandez <xhernandez>
> 
> Short-term:
> There is one change that can be done to make things a bit better compared to
> what it is now. At the moment, if at the time of receiving a write if AFR
> already knows that there are multiple fds opened it will disable eager-lock
> right away and won't use eager-lock for that write. Instead of that we can
> disable delayed-post-op/eager-lock after the fop is done. With this model
> multiple writes will be using the same eager-lock and writes will be sent in
> batches if they come in quickly.
> 
> The following link has the patch:
> https://bugzilla.redhat.com/show_bug.cgi?id=1591208#c8
> 
> Let me know if you want to experiment with it. It will eventually flow
> downstream as we have to address the bug I mentioned above. But if we find
> the numbers with that change are better, we can prioritize it.
> 
> With that said, it won't be as good as what you guys will find by disabling
> the second open Krutika is talking about in the comment.
> 
> > 
> > Yeah, I agree. The disabling-of-locking thing was more to see if we can
> > quickly test this out for ourselves in the existing setup as it *seems* like
> > an optional feature.
> > 
> > -Krutika
> > 
> >  Is there
> > > a way to make it more robust so that it can better handle cases like this
> > > with multiple open fds?

Posted this patch upstream: https://review.gluster.org/c/glusterfs/+/21107

Comment 54 Pranith Kumar K 2018-09-06 14:13:44 UTC
Nikhil/Manoj,
    Do you want to give this a shot?

Comment 55 Fam Zheng 2018-09-06 14:21:30 UTC
Current QEMU allows disabling lock with the file driver's 'locking' option. The command line option looks like this:

  -drive \
 file.locking=off,file.filename=test.qcow2,file.driver=file,driver=qcow2

There is also a WIP on upstream to reduce the lock fd (reuse the I/O fd to achieve the same):

https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg04155.html

Comment 56 Krutika Dhananjay 2018-09-06 15:45:24 UTC
(In reply to Fam Zheng from comment #55)
> Current QEMU allows disabling lock with the file driver's 'locking' option.
> The command line option looks like this:
> 
>   -drive \
>  file.locking=off,file.filename=test.qcow2,file.driver=file,driver=qcow2
> 
> There is also a WIP on upstream to reduce the lock fd (reuse the I/O fd to
> achieve the same):
> 
> https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg04155.html

Hi Fam,

Thanks for that quick reply!

Just wondering - is it possible to achieve the same by perhaps editing the vm's xml file (`virsh edit $VM_NAME`) and restarting it?

For example, this is what I see currently in the xml file for the disk of our interest:

    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
      <source file='/rhev/data-center/mnt/glusterSD/172.18.70.138:_ssd/b9340184-d923-4f88-8fd9-e3a5b61cd49c/images/3b77754e-cccc-4ee7-a7d4-2124b39c926b/e052530e-fceb-4e74-b8ac-e34360ca09d3'/>
      <target dev='sda' bus='scsi'/>
      <serial>3b77754e-cccc-4ee7-a7d4-2124b39c926b</serial>
      <alias name='ua-3b77754e-cccc-4ee7-a7d4-2124b39c926b'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>


Would simply adding "locking='off'" in the driver section also have the same effect?

       <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' locking='off'/>

-Krutika

Comment 57 Fam Zheng 2018-09-07 05:42:03 UTC
I'm afraid this option is not available in libvirt xml at the moment.

Comment 67 Krutika Dhananjay 2018-09-10 14:31:39 UTC
The ratio of finodelks to writes looks much better, at least as far as the 128GB shard-block-size is concerned. The brick profiles in the gfapi case and in comment #62 look similar, except that fewer writes were sent in the 120s run in the latter case. In that case, the bottleneck should be above AFR, i.e. some layer is not pumping enough IO down to the AFR layer. So the problem must be somewhere between qemu and fuse. But if gluster-fuse itself had a problem, then IOPS on the host runs should have been bad as well (as pointed out by Manoj), and that is not the case.

In light of this, I compared code again between gfapi qemu driver and file-posix.c.

I single-stepped through the write operation from qemu in fuse and here is the life cycle of every write in file-posix.c:

raw_co_prw -> paio_submit_co -> offloaded to an aio worker thread starting at aio_worker -> handle_aiocb_rw -> handle_aiocb_rw_linear -> https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/file-posix.c#L1136 which does a synchronous pwrite() system call. So the thread is blocked until the call returns.

In contrast, in gfapi the write goes through qemu_gluster_co_rw -> glfs_pwritev_async(), and each of these calls is passed a callback function, such as gluster_finish_aiocb() in this case. glfs_pwritev_async() simply winds the call and returns immediately, ready to process more parallel writes, without waiting for the response to the write. Whenever the response is received, the callback is invoked.

Yaniv,

Is there an option in qemu to send truly asynchronous IO, where the thread doesn't have to wait for the response and can continue sending more parallel requests, similar to the way it is done in the gfapi driver?

-Krutika

Comment 68 Yaniv Kaul 2018-09-10 14:33:28 UTC
Kevin might know better than me...

Comment 69 Krutika Dhananjay 2018-09-10 14:44:47 UTC
So had linux aio been enabled here - https://github.com/qemu/qemu/blob/ba87166e14ffd7299c35badc4c11f3fa3c129ec6/block/file-posix.c#L1556 - the calls would have eventually led to linux aio syscall io_prep_pwrite() being called.

So in command line args of the vm where these tests are being run, I see "aio=threads".

As per man-page of qemu-kvm:
aio=aio
aio is "threads" or "native", and selects between pthread-based disk I/O and native Linux AIO.

I'm guessing we need to set aio=native to have async writes. I'll wait till we hear back from Kevin in any case.

-Krutika

Comment 70 Yaniv Kaul 2018-09-10 14:48:38 UTC
That's an easy change - we have this logic of aio=threads on file-based storage and aio=native on block-based storage, but it's really easy to change. I'd start by hacking VDSM, but if needed, I'm sure we can find where to properly change it.

Comment 71 Kevin Wolf 2018-09-10 15:06:21 UTC
The idea with the thread pool is that there isn't just a single worker thread, but many. So if one thread is blocked while doing a pwrite() syscall, another thread will process the next concurrent request. This does get you real asynchronicity, even if each worker thread runs a blocking syscall.

You can also switch to Linux AIO, which might be closer to what you meant by "truly asynchronous" because it's not just synchronous calls from multiple threads, but an API that is async by design. This is enabled with aio=native in -drive (and requires cache=none, because Linux AIO just doesn't work without O_DIRECT).
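
For illustration, an aio=native configuration on the QEMU command line would look roughly like this; the filename and format are placeholders, and cache=none is included because of the O_DIRECT requirement noted above:

  -drive file=vmdisk.img,format=raw,cache=none,aio=native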

In normal cases, both ways perform similarly, with only slight advantages for one or the other configuration (depending on the exact setup).

Note that I'm not sure what a fuse driver must do (if anything) to support O_DIRECT and Linux AIO. If the driver doesn't support this, it might fall back to synchronous I/O, which would perform a lot worse than the thread pool. I think you'll see pretty quickly if that's the case.

Comment 72 Manoj Pillai 2018-09-10 15:28:32 UTC
(In reply to Kevin Wolf from comment #71)
> The idea with the thread pool is that there isn't just a single worker
> thread, but many. So if one thread is blocked while doing a pwrite()
> syscall, another thread will process the concurrent next request. This does
> get you real asychronicity, even if each worker thread runs a blocking
> syscall.
> 

How many threads in the thread pool by default? How to bump that number up? Would be good to do that experiment and see what that gets us.

This test has 4 fio jobs each issuing 16 requests concurrently, so there is a high degree of concurrency in the workload.

Comment 73 Kevin Wolf 2018-09-10 15:35:55 UTC
(In reply to Manoj Pillai from comment #72)
> How many threads in the thread pool by default? How to bump that number up?
> Would be good to do that experiment and see what that gets us.

It is currently hardcoded to a maximum of 64 worker threads.

Comment 74 Nikhil Chawla 2018-09-10 15:56:22 UTC
(In reply to Kevin Wolf from comment #73)
> (In reply to Manoj Pillai from comment #72)
> > How many threads in the thread pool by default? How to bump that number up?
> > Would be good to do that experiment and see what that gets us.
> 
> It is currently hardcoded to a maximum of 64 worker threads.

is there any loss if we increase the thread count ?

Comment 75 Manoj Pillai 2018-09-10 16:47:18 UTC
(In reply to Kevin Wolf from comment #73)
> (In reply to Manoj Pillai from comment #72)
> > How many threads in the thread pool by default? How to bump that number up?
> > Would be good to do that experiment and see what that gets us.
> 
> It is currently hardcoded to a maximum of 64 worker threads.

64 should be plenty.

I guess it makes sense to try with aio=native.

Comment 78 Kevin Wolf 2018-09-11 08:47:52 UTC
(In reply to Nikhil Chawla from comment #74)
> is there any loss if we increase the thread count ?

I won't try to predict performance numbers. The only way is to try it out and measure.

Comment 79 Nikhil Chawla 2018-09-12 04:58:10 UTC
(In reply to Yaniv Kaul from comment #70)
> That's an easy change - we have this logic of aio=threads on file-based
> storage and aio=native on block-based storage, but it's really easy to
> change. I'd start by hacking VDSM, but if needed, I'm sure we can find where
> to properly change it.

I checked the VDSM code and found the place where we need to make the change to deploy the VM with aio=native:

https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L948

I tried to change the code in /usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/storage.py, and cleared all the Python bytecode files, followed by a vdsmd and supervdsmd restart.
Then I deployed the VM again on the host on which the changes were executed, but it didn't make any difference.

I am still looking for the missing piece of the puzzle.
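
For reference, the clear-and-restart steps described above amount to something like the following (an assumed reconstruction, not the exact commands used):

  find /usr/lib/python2.7/site-packages/vdsm -name '*.py[co]' -delete
  systemctl restart supervdsmd vdsmd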

Comment 80 Yaniv Kaul 2018-09-12 05:49:13 UTC
(In reply to Nikhil Chawla from comment #79)
> (In reply to Yaniv Kaul from comment #70)
> > That's an easy change - we have this logic of aio=threads on file-based
> > storage and aio=native on block-based storage, but it's really easy to
> > change. I'd start by hacking VDSM, but if needed, I'm sure we can find where
> > to properly change it.
> 
> I checked the VDSM code and found out the hack were we need to make change
> to deploy VM with aio=native:
> 
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.
> py#L948
> 
> I tried to change the code in
> /usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/storage.cp , and
> cleared all the python byte codes and object codes followed by vdsmd and
> supervdsmd restart. 
> Then I deployed the VM again on the host on which the changes were executed,
> but it didn't make any difference.
> 
> I am still looking forward to find the missing piece of the puzzle.

Arik, how do we change the io=threads/native in 4.2?

Comment 81 Arik 2018-09-12 07:51:07 UTC
(In reply to Yaniv Kaul from comment #80)
> (In reply to Nikhil Chawla from comment #79)
> > (In reply to Yaniv Kaul from comment #70)
> > > That's an easy change - we have this logic of aio=threads on file-based
> > > storage and aio=native on block-based storage, but it's really easy to
> > > change. I'd start by hacking VDSM, but if needed, I'm sure we can find where
> > > to properly change it.
> > 
> > I checked the VDSM code and found out the hack were we need to make change
> > to deploy VM with aio=native:
> > 
> > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.
> > py#L948
> > 
> > I tried to change the code in
> > /usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/storage.cp , and
> > cleared all the python byte codes and object codes followed by vdsmd and
> > supervdsmd restart. 
> > Then I deployed the VM again on the host on which the changes were executed,
> > but it didn't make any difference.
> > 
> > I am still looking forward to find the missing piece of the puzzle.
> 
> Arik, how do we change the io=threads/native in 4.2?

Since 4.2 it is set by ovirt-engine [1] so it is a bit more difficult to change. Specifically, the logic for Gluster resides in [2] - if libgfapi is used, the disk's type is set to 'network' and so aio would be set to 'native', otherwise to 'threads'. So one way to change this is by modifying ovirt-engine.

Alternatively, it might be easier to change this using the before_vm_start hook in VDSM if it's just for testing.

[1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.2/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/builder/vminfo/LibvirtVmXmlBuilder.java#L1783
[2] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.2/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/builder/vminfo/VmInfoBuildUtils.java#L938
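
For a quick test via the hook route, a minimal before_vm_start hook could look like the sketch below. This is an assumption-laden illustration, not a supported change: it relies on VDSM passing the path of the domain XML in the _hook_domxml environment variable, and it simply rewrites the io attribute seen in the disk XML earlier in this bug.

  #!/bin/bash
  # Hypothetical hook: /usr/libexec/vdsm/hooks/before_vm_start/50_aio_native
  # Switch the disk I/O mode from threads to native in the domain XML VDSM is about to use.
  sed -i "s/io='threads'/io='native'/g" "$_hook_domxml"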

Comment 83 Nikhil Chawla 2018-09-13 19:54:18 UTC
Created attachment 1483155 [details]
Qemu process output

Comment 90 Sahina Bose 2018-09-17 11:59:11 UTC
Summarising based on Krutika's inputs and the work done so far - for the performance improvements, we need the following changes
1. Fix in the eager-lock implementation to account for inodelk count rather than fd count
2. Fix in oVirt to use aio=native for VM disks on gluster storage domain
3. Tuning in the virt profile to set post-op-delay to 5s and performance options - client-io-threads, event-threads

Krutika, Manoj - is this accurate? If so, we will raise separate bugs to track these. Thanks!
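
For reference, the tuning in item 3 above would map to gluster volume options along these lines; the specific values shown are illustrative placeholders, not agreed-upon settings:

  gluster volume set <volname> performance.client-io-threads on
  gluster volume set <volname> client.event-threads 4
  gluster volume set <volname> server.event-threads 4
  # post-op delay, only if it were to be raised (see the discussion that follows before doing this)
  gluster volume set <volname> cluster.post-op-delay-secs 5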

Comment 91 Manoj Pillai 2018-09-17 12:25:52 UTC
(In reply to Sahina Bose from comment #90)
> Summarising based on Krutika's inputs and the work done so far - for the
> performance improvements, we need the following changes
> 1. Fix in the eager-lock implementation to account for inodelk count rather
> than fd count

Yes, eager-lock enhancement that will deal with the RHHI-fuse case better. Will leave implementation details to the AFR team.

> 2. Fix in oVirt to use aio=native for VM disks on gluster storage domain

Yes, aio=native is better, based on this ONE workload. We need to cover a suite of workloads in perf testing to make sure it works well everywhere. 

What is the best way? Make it an undocumented option that RHHI can set?

> 3. Tuning in virt profile to use post-op delay to 5s and performance options
> - client-io-threads, event-threads

I'm not adding post-op-delay of 5 secs as an ask, since we got the high IOPS without it. Need AFR team to weigh in on the wider implications, whether raising its value is prudent.

> 
> Krutika, Manoj - is this accurate? If so, we will raise separate bugs to
> track these. Thanks!

Comment 92 Krutika Dhananjay 2018-09-17 12:31:44 UTC
+1 to everything that is said in comment #91. Bumping up post-op-delay-secs is not necessary as RHHI-FUSE performance is surpassing that of RHHI-gfapi by ~1-2K iops even with the default value of 1 sec.

Comment 93 Manoj Pillai 2018-09-17 13:11:07 UTC
(In reply to Kevin Wolf from comment #71)

The logic in comment #71 makes sense. But looking at the huge difference in performance for the RHHI-fuse case, I'm wondering if aio=threads is causing some serialization related to file system locking. Does this depend on FS-specific handling of concurrent writes? Does anyone have an action item to investigate this?

Comment 94 Kevin Wolf 2018-09-18 10:08:07 UTC
(In reply to Manoj Pillai from comment #93)
> But looking at the huge difference in performance for the RHHI-fuse case,
> I'm wondering if aio=threads is causing some serialization related to file
> system locking?

I don't think that's the case inside QEMU at least. The aio=threads code path is pretty uneventful from the point where it starts to differ from the aio=native one. It just creates a request structure that it hands to a worker thread, which then calls pwritev(). And that's pretty much it.

Is the QEMU command line exactly the same for the two cases apart from the aio=native/threads option? In particular, do they use the same cache mode?

(In reply to Kevin Wolf from comment #87)
> If both paths were implemented optimally, shouldn't libgfapi always be
> slightly faster, or at least at the same performance, as the fuse path? If
> you see fuse being faster, I wonder if this hints at a problem in the
> libgfapi path.

By the way, I had a look at the code used by the libgfapi-based block driver in QEMU. Do I understand correctly that glfs_pwritev_async() always copies the data of the whole request into a temporarily allocated linear buffer with iobuf_copy()?

I'm not even sure why a linear buffer would be needed for subvol->fops->writev (the function name clearly suggests it can deal with real iovecs), but if it were indeed required, couldn't that be skipped at least if the input is already linear? I wouldn't be surprised that an additional memcpy() for all data has an impact.

Comment 95 Krutika Dhananjay 2018-09-19 06:43:39 UTC
(In reply to Kevin Wolf from comment #94)
> (In reply to Manoj Pillai from comment #93)
> > But looking at the huge difference in performance for the RHHI-fuse case,
> > I'm wondering if aio=threads is causing some serialization related to file
> > system locking?
> 
> I don't think that's the case inside QEMU at least. The aio=threads code
> path is pretty uneventful from the point where it starts to differ from the
> aio=native one. It just creates a request structure that it hands to a
> worker thread, which then calls pwritev(). And that's pretty much it.
> 
> Is the QEMU command line exactly the same for the two cases apart from the
> aio=native/threads option? In particular, do they use the same cache mode?
> 
> (In reply to Kevin Wolf from comment #87)
> > If both paths were implemented optimally, shouldn't libgfapi always be
> > slightly faster, or at least at the same performance, as the fuse path? If
> > you see fuse being faster, I wonder if this hints at a problem in the
> > libgfapi path.
> 
> By the way, I had a look at the code used from the libgfapi-based block
> driver in QEMU. Do I understand correctly that glfs_pwritev_async() always
> copies the data of the whole request into a temporarily allocated linear
> buffer with iobuf_copy()?
> 
> I'm not even sure why a linear buffer would be needed for
> subvol->fops->writev (the function name clearly suggests it can deal with
> real iovecs), but if it were indeed required, couldn't that be skipped at
> least if the input is already linear? I wouldn't be surprised that an
> additional memcpy() for all data has an impact.

So there was a patch that was sent more than 2 years ago that would avoid this additional copy - https://review.gluster.org/c/glusterfs/+/14784 - which was auto-abandoned due to inactivity. Let me try and revive it.

-Krutika

Comment 96 Kevin Wolf 2018-09-19 08:35:55 UTC
(In reply to Krutika Dhananjay from comment #95)
> So there was a patch that was sent more than 2 years ago that would avoid
> this additional copy - https://review.gluster.org/c/glusterfs/+/14784 -
> which was auto-abandoned due to inactivity. Let me try and revive it.

This patch would not actually make the whole operation zero-copy because the API would still require the caller to get the buffer from libgfapi and to copy the data into it. So it would only move the additional copy from the library to the application.

What would really be needed is a way to use arbitrary addresses, so that glfs_pwritev_async_common() can directly use the passed iovec (and no API changes would be necessary). subvol->fops->writev already takes an arbitrary vector, but it also requires an iobref. Maybe actually just passing an empty iobref would work? Its only purpose seems to be managing the lifecycle of buffers, and if the buffer comes from somewhere external, it doesn't have to be managed.

Comment 97 Sahina Bose 2018-09-19 13:30:08 UTC
(In reply to Manoj Pillai from comment #91)
> (In reply to Sahina Bose from comment #90)
> > Summarising based on Krutika's inputs and the work done so far - for the
> > performance improvements, we need the following changes
> > 1. Fix in the eager-lock implementation to account for inodelk count rather
> > than fd count
> 
> Yes, eager-lock enhancement that will deal with the RHHI-fuse case better.
> Will leave implementation details to the AFR team.
> 
> > 2. Fix in oVirt to use aio=native for VM disks on gluster storage domain
> 
> Yes, aio=native is better, based on this ONE workload. We need to cover a
> suite of workloads in perf testing to make sure it works well everywhere. 
> 
> What is the best way? Make it an undocumented option that RHHI can set?

undocumented option? Or did you mean configurable?
I've raised a bug 1630744 - we will explore options there

> 
> > 3. Tuning in virt profile to use post-op delay to 5s and performance options
> > - client-io-threads, event-threads
> 
> I'm not adding post-op-delay of 5 secs as an ask, since we got the high IOPS
> without it. Need AFR team to weigh in on the wider implications, whether
> raising its value is prudent.
> 
> > 
> > Krutika, Manoj - is this accurate? If so, we will raise separate bugs to
> > track these. Thanks!

Comment 98 Manoj Pillai 2018-09-19 13:52:55 UTC
(In reply to Sahina Bose from comment #97)
[...]
> > 
> > > 2. Fix in oVirt to use aio=native for VM disks on gluster storage domain
> > 
> > Yes, aio=native is better, based on this ONE workload. We need to cover a
> > suite of workloads in perf testing to make sure it works well everywhere. 
> > 
> > What is the best way? Make it an undocumented option that RHHI can set?
> 
> undocumented option? Or did you mean configurable?
> I've raised a bug 1630744 - we will explore options there

That works. Thanks!

Comment 102 SATHEESARAN 2018-10-18 18:38:46 UTC
Verified that the following changes are available (a sketch of the verification commands follows the list):

1. client-io-threads on the volume is enabled
2. cluster.choose-local is set to false
3. aio=native is turned on from the RHV side
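
A sketch of how these can be checked (volume name is a placeholder; the aio check just greps the running qemu-kvm command line):

  gluster volume get <volname> performance.client-io-threads
  gluster volume get <volname> cluster.choose-local
  ps -ef | grep [q]emu-kvm | grep -o 'aio=[a-z]*'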

Comment 103 bipin 2018-11-04 07:24:05 UTC
(In reply to SATHEESARAN from comment #102)
> Verified that the following changes are available.
> 
> 1. client-io-threads on the volume is enabled
> 2. cluster.choose-local is set to false
> 3. aio=native is turned on RHV side

Moving to verified based on the above comments