Bug 762656 - (GLUSTER-924) Slow NFS synchronous writes
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: nfs
Version: nfs-alpha
Platform: All Linux
Priority: low
Severity: low
Assigned To: Shehjar Tikoo
Reported: 2010-05-13 02:13 EDT by Shehjar Tikoo
Modified: 2015-12-01 11:45 EST
Doc Type: Bug Fix
Regression: RTP
Mount Type: nfs
Description Shehjar Tikoo 2010-05-12 23:16:10 EDT
Note to self: before finalizing this option, I must test the usefulness of write-behind for NFS loads, especially the blocking/wait semantics required by O_DIRECT file opens on the NFS client.

Test write-behind (wb) with io-threads (iot) as well, and also test with the GF_OPEN_WB flag set.
Comment 1 Shehjar Tikoo 2010-05-13 02:13:03 EDT
Mail from user:
The problem we've always had with Gluster NFS is synchronous writes are very slow. Each write operation seems to take 14ms regardless of block size. Because virtual machines tend to perform small synchronous writes, this tends to impact VM performance significantly.

Is there a way to make writes complete quicker, for example, as soon as data is replicated to two hosts' memory buffers?

[admin@domU-12-31-39-03-7C-14 mnt]$ dd if=/dev/zero of=zero bs=1K count=1000 oflag=direct
1000+0 records in
1000+0 records out
1024000 bytes (1.0 MB) copied, 14.0952 s, 72.6 kB/s
[admin@domU-12-31-39-03-7C-14 mnt]$ dd if=/dev/zero of=zero bs=8K count=1000 oflag=direct
1000+0 records in
1000+0 records out
8192000 bytes (8.2 MB) copied, 14.2794 s, 574 kB/s
[admin@domU-12-31-39-03-7C-14 mnt]$ dd if=/dev/zero of=zero bs=64K count=1000 oflag=direct
1000+0 records in
1000+0 records out
65536000 bytes (66 MB) copied, 18.9333 s, 3.5 MB/s
[admin@domU-12-31-39-03-7C-14 mnt]$

My reply:

Without going into much detail: when you say "synchronous", do you mean the VM is sending one NFS request at a time and blocking for a reply before sending the next request, or do you mean it in the sense of the NFS protocol, which also defines its own sync and async requests?

An NFS sync request means that the server must not reply before ensuring the data is on disk; with an async request, the server can reply while leaving file data in server memory, to be synced later.

> > 
> > Is there a way to make writes complete quicker, for example, as soon
> > as data is replicated to 2 host's memory buffer?

By opening the files with O_DIRECT, as done for the dd tests above, we get both behaviours at once, i.e. one request at a time and sync-to-disk semantics, hence the low perf.

Going by AB's suggestion that we can ignore NFS sync semantics because replication will be used, I made some changes in the NFS server. The perf we got is much better:

[root@domU-12-31-39-03-7C-14 mnt]# dd if=/dev/zero of=zero bs=64K count=1000 oflag=direct
1000+0 records in
1000+0 records out
65536000 bytes (66 MB) copied, 4.65845 s, 14.1 MB/s


User reply:
This looks pretty good. Performance is dramatically improved!

=====================

My concern is that the perf improvement depends on fooling the NFS client into thinking the data is on disk so that it does not send the COMMIT request. A COMMIT request translates to an fsync, resulting in lower performance.

I'm not sure how safe this is. Perhaps we can introduce this as an option. Since replication will be used, we may be able to rely on the self-heal there in case one server goes down without the data having been fsync'd earlier.
Comment 2 Shehjar Tikoo 2010-05-14 01:13:04 EDT
Here are some alternatives in NFS:

1. Make the NFS client think that UNSTABLE writes are STABLE (a minimal sketch follows below).

2. Force any COMMIT to perform a getattr instead of calling fsync. Will ignore this option for now.
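
A minimal sketch of alternative 1, with illustrative names (the structure and function names here are hypothetical, not the actual GlusterFS code): the write itself is issued UNSTABLE, i.e. without an fsync, but the reply claims FILE_SYNC stability, so the client considers the data safe and never sends a COMMIT.

    /* Hypothetical sketch: lie about stability in the WRITE3 reply.
     * "committed" is the stability field RFC 1813 defines for the reply;
     * the surrounding structure name is illustrative. */
    static void
    fake_stable_reply (struct write3_resok *res)
    {
            /* Data was written UNSTABLE, but report FILE_SYNC so the
             * NFS client skips the subsequent COMMIT (and its fsync). */
            res->committed = FILE_SYNC;
    }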
Comment 3 Shehjar Tikoo 2010-05-14 04:30:28 EDT
Pasting huge comment from C file for future reference:

 * Before going into the write reply logic, here is a matrix that shows the
 * requirements for a write reply as given by RFC1813.
 *
 * Requested Write Type ||      Possible Returns
 * ==============================================
 * FILE_SYNC            ||      FILE_SYNC
 * DATA_SYNC            ||      DATA_SYNC or FILE_SYNC
 * UNSTABLE             ||      DATA_SYNC or FILE_SYNC or UNSTABLE
 *
 * Write types other than UNSTABLE are together called STABLE.
 * RS - Return Stable
 * RU - Return Unstable
 * WS - Write Stable
 * WU - Write Unstable
 *
 *+======================================================+
 *| Vol Opts -> || trusted-sync | async | sync  | unsafe |
 *| Write Type  ||              |       |       |        |
 *|-------------||--------------|-------|-------|--------|
 *| STABLE      ||      WS      |  WU   |  WS   |   WU   |
 *|             ||      RS      |  RS   |  RS   |   RS   |
 *|-------------||--------------|-------|-------|--------|
 *| UNSTABLE    ||      WU      |  WU   |  WS   |   WU   |
 *|             ||      RS      |  RU   |  RS   |   RS   |
 *+======================================================+
 *
 *
 * In English, these mean:
 * trusted-sync: Write a stable write as stable, write an unstable write as
 * unstable but send the stable flag to client so that the client does not send
 * a COMMIT subsequently. COMMIT results in an fsync and can cause a disk
 * bottleneck.
 *
 * async: Write stable as unstable but because we cannot return an unstable
 * flag on a stable write, return a stable flag. Write unstable as unstable.
 * Helps avoid the overhead of stable requests which need to be synced to disk
 * right away.
 *
 * sync: Write stable requests as stable, write unstable requests also as
 * stable and return stable also. Forces every write to be stable and may be
 * required in some situations.
 *
 * unsafe: Write stable as unstable and return a stable flag, write unstable as
 * unstable and here too return a stable flag. Does both tasks performed by
 * trusted-sync and async, i.e. avoids the overhead of stable writes and avoids
 * the overhead of any subsequent commits.
 *
 * ONLY trusted-sync is implemented at present.
 * Option names may change when implemented. Please change here too.
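
A compact way to read the matrix is as a pure policy function from (volume option, requested write type) to (how to write, what to return). The sketch below is illustrative only; the type and function names are hypothetical, not the actual GlusterFS implementation:

    /* Hypothetical encoding of the matrix above. */
    typedef enum { OPT_TRUSTED_SYNC, OPT_ASYNC, OPT_SYNC, OPT_UNSAFE } vol_opt_t;

    static void
    nfs3_write_policy (vol_opt_t opt, int requested_stable,
                       int *write_stable, int *reply_stable)
    {
            switch (opt) {
            case OPT_TRUSTED_SYNC: /* WS/RS for STABLE, WU/RS for UNSTABLE */
                    *write_stable = requested_stable;
                    *reply_stable = 1;
                    break;
            case OPT_ASYNC:        /* always WU; return stable only if requested */
                    *write_stable = 0;
                    *reply_stable = requested_stable;
                    break;
            case OPT_SYNC:         /* everything WS/RS */
                    *write_stable = 1;
                    *reply_stable = 1;
                    break;
            case OPT_UNSAFE:       /* always WU, always RS */
                    *write_stable = 0;
                    *reply_stable = 1;
                    break;
            }
    }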
Comment 4 Anand Avati 2010-05-21 00:32:07 EDT
PATCH: http://patches.gluster.com/patch/3286 in master (nfs: Introduce trusted-write and trusted-sync options)
Comment 5 Shehjar Tikoo 2010-06-01 23:38:19 EDT
Regression Test
Involves testing posix+iot+nfsx using dd with direct writes. With the new option introduced here, performance with the option enabled must be higher than with it disabled.

Test Case 1
We'll use the following two commands in this test:
Command-1:    $ dd if=/dev/zero of=/mnt/testfile bs=64k oflag=direct
Command-2:    $ dd if=/dev/zero of=/mnt/testfile bs=64k


1. Set up posix+iot+nfsx. In nfsx, set the following option (a sample volfile sketch follows this list):
     option nfs3.<volume-name>.trusted-write on

2. With the above option set, run Command-1. Record the throughput.

3. With the above option unset, run Command-1. This throughput must be much lower than the throughput in (2).

4. With the above option set, run Command-2. Record the throughput.

5. With the above option unset, run Command-2. This throughput must be much lower than the figure in (4).
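
For step 1, a posix+iot+nfsx volfile could look roughly like the sketch below. The directory path and volume names are illustrative, and the trusted-write option is assumed to be keyed by the name of the subvolume being exported (here "iot"):

    volume posix
      type storage/posix
      option directory /export/testdir
    end-volume

    volume iot
      type performance/io-threads
      subvolumes posix
    end-volume

    volume nfsx
      type nfs/server
      option nfs3.iot.trusted-write on
      subvolumes iot
    end-volume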


Test Case 2
We'll use the following two commands in this test:
Command-1:    $ dd if=/dev/zero of=/mnt/testfile bs=64k oflag=sync
Command-2:    $ dd if=/dev/zero of=/mnt/testfile bs=64k


1. Set up posix+iot+nfsx. In nfsx, set the following option:
     option nfs3.<volume-name>.trusted-sync on

2. With the above option set, run Command-1. Record the throughput.

3. With the above option unset, run Command-1. This throughput must be much lower than the throughput in (2).

4. With the above option set, run Command-2. Record the throughput.

5. With the above option unset, run Command-2. This throughput must be much lower than the figure in (4).
Comment 6 Lakshmipathi G 2010-07-08 03:35:09 EDT
Testing with nfs-beta-rc7 and nfs-beta-rc8 without 'option trusted-write', the dd command fails:

# dd if=/dev/zero of=/root/laks/mnt2/new/test bs=64k count=32000 
dd: closing output file `/root/laks/mnt2/new/test': Input/output error

The log file can be found under /share/tickets/924/.
Comment 7 Shehjar Tikoo 2010-07-08 05:18:23 EDT
The log shows huge delays from the on-disk file system for the write operations. That is what is causing NFS requests to time out. The NFS client returns EIO because we're mounting with the soft option, which forces the client to return EIO on a timeout.

This is a regression test for a performance issue, not a functional problem. The test must be run on real hardware for it to show the correct behaviour of the new options.
Comment 8 Lakshmipathi G 2010-07-08 06:13:39 EDT
The dd command failed because of the low timeout value passed as a mount option (-o timeo=10). It works with the default timeout.
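
For reference, the failing and working mounts presumably looked something like the commands below; the server and volume names are placeholders:

    # Failing case: soft mount with a very low timeout, so slow writes
    # make the client give up and return EIO.
    mount -t nfs -o soft,timeo=10 server:/volname /mnt

    # Working case: default timeout.
    mount -t nfs server:/volname /mnt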
Comment 9 Lakshmipathi G 2010-07-08 22:33:46 EDT
Verified with nfs-beta-rc8. Performance increased when the trusted-write or trusted-sync option is enabled.
Comment 10 Lakshmipathi G 2010-07-08 22:47:20 EDT
Regression test -  http://test.gluster.com/show_bug.cgi?id=77
