Note to self: Before finalizing this option, test the usefulness of write-behind for NFS loads, especially the blocking (wait) semantics required by O_DIRECT file opens on the NFS client. Test write-behind together with io-threads, and also with the GF_OPEN_WB flag set.
Mail from user:

The problem we've always had with Gluster NFS is that synchronous writes are very slow. Each write operation seems to take 14ms regardless of block size. Because virtual machines tend to perform small synchronous writes, this tends to impact VM performance significantly. Is there a way to make writes complete quicker, for example, as soon as data is replicated to 2 hosts' memory buffers?

[admin@domU-12-31-39-03-7C-14 mnt]$ dd if=/dev/zero of=zero bs=1K count=1000 oflag=direct
1000+0 records in
1000+0 records out
1024000 bytes (1.0 MB) copied, 14.0952 s, 72.6 kB/s
[admin@domU-12-31-39-03-7C-14 mnt]$ dd if=/dev/zero of=zero bs=8K count=1000 oflag=direct
1000+0 records in
1000+0 records out
8192000 bytes (8.2 MB) copied, 14.2794 s, 574 kB/s
[admin@domU-12-31-39-03-7C-14 mnt]$ dd if=/dev/zero of=zero bs=64K count=1000 oflag=direct
1000+0 records in
1000+0 records out
65536000 bytes (66 MB) copied, 18.9333 s, 3.5 MB/s
[admin@domU-12-31-39-03-7C-14 mnt]$

My reply:

Without going into much detail, when you say "synchronous", do you mean the VM is sending one NFS request at a time and blocking for a reply before sending the next NFS request, or do you mean it in the sense of the NFS protocol, which also defines its own sync and async requests? An NFS sync request means that the server should not reply before ensuring that the data is on the disk, as compared to async, where the server can reply while leaving file data in server memory, to be synced later.

> Is there a way to make writes complete quicker, for example, as soon
> as data is replicated to 2 hosts' memory buffers?

By opening the files with O_DIRECT, as done for the tests below, we're getting both behaviours above, i.e. one request at a time and a sync-to-disk behaviour, hence the low perf.

Going by AB's suggestion that because replication will be used, we can ignore NFS sync semantics, I made some changes in the NFS server. The perf we got is much better:

[root@domU-12-31-39-03-7C-14 mnt]# dd if=/dev/zero of=zero bs=64K count=1000 oflag=direct
1000+0 records in
1000+0 records out
65536000 bytes (66 MB) copied, 4.65845 s, 14.1 MB/s

User reply:

This looks pretty good. Performance is dramatically improved!

=====================

My concern is that the perf improvement depends on fooling the NFS client into thinking the data is on disk so that it does not send the COMMIT request. A COMMIT request translates to an fsync, resulting in lower performance. I'm not sure how safe this is. Perhaps we can introduce this as an option. Since replication will be used, we may be able to rely on the self-heal there in case one server goes down without the data having been fsync'd earlier.
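For reference, a minimal C sketch of what dd with oflag=direct is doing on the client side: each write() on an O_DIRECT descriptor is issued one at a time and blocks until the NFS client has a reply it considers stable, so every block costs a full round trip regardless of block size. The path, block size, and count are arbitrary and only mirror the dd runs above.

/* Minimal sketch of the client-side behaviour of "dd oflag=direct". */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const size_t blk   = 64 * 1024;  /* matches dd bs=64K */
        const int    count = 1000;       /* matches dd count=1000 */
        void        *buf   = NULL;
        int          fd, i;

        /* O_DIRECT requires block-aligned buffers. */
        if (posix_memalign(&buf, 4096, blk) != 0) {
                perror("posix_memalign");
                return 1;
        }
        memset(buf, 0, blk);

        fd = open("/mnt/zero", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (i = 0; i < count; i++) {
                /* Each iteration blocks for a full NFS round trip: the
                 * client sends a stable write and waits for the reply. */
                if (write(fd, buf, blk) != (ssize_t)blk) {
                        perror("write");
                        return 1;
                }
        }

        close(fd);
        free(buf);
        return 0;
}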
Here are some alternatives in NFS:

1. Make the NFS client think that UNSTABLE writes are STABLE.
2. Force any COMMIT to be a getattr instead of calling fsync. Will ignore this option for now.
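A rough sketch of alternative 1, for illustration only: the server performs the write unstably but forces the stability flag returned to the client to FILE_SYNC, so the client has no reason to follow up with a COMMIT. Only the stable_how values come from RFC 1813; the helper name and the trust_client parameter are hypothetical, not actual nfsx identifiers.

/* stable_how values as defined by RFC 1813. */
enum stable_how {
        UNSTABLE  = 0,
        DATA_SYNC = 1,
        FILE_SYNC = 2
};

/* Hypothetical helper: pick the stability flag to return to the client. */
static enum stable_how
write_reply_stability (enum stable_how requested, int trust_client)
{
        if (trust_client)
                return FILE_SYNC;  /* pretend the data already hit disk */
        return requested;          /* honest reply: client may COMMIT later */
}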
Pasting huge comment from C file for future reference:

 * Before going into the write reply logic, here is a matrix that shows the
 * requirements for a write reply as given by RFC1813.
 *
 * Requested Write Type || Possible Returns
 * ==============================================
 * FILE_SYNC            || FILE_SYNC
 * DATA_SYNC            || DATA_SYNC or FILE_SYNC
 * UNSTABLE             || DATA_SYNC or FILE_SYNC or UNSTABLE
 *
 * Write types other than UNSTABLE are together called STABLE.
 * RS - Return Stable
 * RU - Return Unstable
 * WS - Write Stable
 * WU - Write Unstable
 *
 *+======================================================+
 *| Vol Opts -> || trusted-sync | async | sync  | unsafe |
 *| Write Type  ||              |       |       |        |
 *|-------------||--------------|-------|-------|--------|
 *| STABLE      || WS           | WU    | WS    | WU     |
 *|             || RS           | RS    | RS    | RS     |
 *|-------------||--------------|-------|-------|--------|
 *| UNSTABLE    || WU           | WU    | WS    | WU     |
 *|             || RS           | RU    | RS    | RS     |
 *+======================================================+
 *
 * In english, these mean:
 *
 * trusted-sync: Write a stable write as stable, write an unstable write as
 * unstable but send the stable flag to client so that the client does not send
 * a COMMIT subsequently. COMMIT results in an fsync and can cause a disk
 * bottleneck.
 *
 * async: Write stable as unstable but because we cannot return an unstable
 * flag on a stable write, return a stable flag. Write unstable as unstable.
 * Helps avoid the overhead of stable requests which need to be synced to disk
 * right away.
 *
 * sync: Write stable requests as stable, write unstable requests also as
 * stable and return stable also. Forces every write to be stable and may be
 * required in some situations.
 *
 * unsafe: Write stable as unstable and return a stable flag, write unstable as
 * unstable and here too return a stable flag. Does both tasks performed by
 * trusted-sync and async, i.e. avoids the overhead of stable writes and avoids
 * the overhead of any subsequent commits.
 *
 * ONLY trusted-sync is implemented at present.
 * Option names may change when implemented. Please change here too.
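For illustration only, the matrix above can be encoded roughly as follows. The mode enum, struct, and function names here are hypothetical and are not the identifiers used in the nfsx source; this is just a direct transcription of the table into code.

/* Illustrative encoding of the table above: for a given volume option and
 * requested write type, decide whether to perform the write stably (WS/WU)
 * and whether to report it to the client as stable (RS/RU). */
enum wb_mode { MODE_TRUSTED_SYNC, MODE_ASYNC, MODE_SYNC, MODE_UNSAFE };

struct write_decision {
        int write_stable;   /* 1 = WS, 0 = WU */
        int reply_stable;   /* 1 = RS, 0 = RU */
};

static struct write_decision
decide_write (enum wb_mode mode, int request_is_stable)
{
        struct write_decision d = {0, 0};

        switch (mode) {
        case MODE_TRUSTED_SYNC:
                d.write_stable = request_is_stable; /* stable as stable, unstable as unstable */
                d.reply_stable = 1;                 /* always claim stable: no COMMIT follows  */
                break;
        case MODE_ASYNC:
                d.write_stable = 0;                 /* never sync right away                   */
                d.reply_stable = request_is_stable; /* stable requests must be answered stable */
                break;
        case MODE_SYNC:
                d.write_stable = 1;                 /* force every write to disk               */
                d.reply_stable = 1;
                break;
        case MODE_UNSAFE:
                d.write_stable = 0;                 /* never sync ...                          */
                d.reply_stable = 1;                 /* ... and never admit it                  */
                break;
        }
        return d;
}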
PATCH: http://patches.gluster.com/patch/3286 in master (nfs: Introduce trusted-write and trusted-sync options)
Regression Test

Involves testing posix+iot+nfsx using dd with direct writes. With the new option introduced here, the performance while using the option must be higher than when the option is not used.

Test Case 1

We'll use the following two commands in this test:

Command-1: $ dd if=/dev/zero of=/mnt/testfile bs=64k oflag=direct
Command-2: $ dd if=/dev/zero of=/mnt/testfile bs=64k

1. Set up posix+iot+nfsx. In nfsx, set the following option:
   option nfs3.<volume-name>.trusted-write on
2. With the above option set, run Command-1. Record the throughput.
3. With the above option unset, run Command-1. This throughput must be much lower than the throughput in (2).
4. With the above option set, run Command-2. Record the throughput.
5. With the above option unset, run Command-2. This throughput must be much lower than the figure in (4).

Test Case 2

We'll use the following two commands in this test:

Command-1: $ dd if=/dev/zero of=/mnt/testfile bs=64k oflag=sync
Command-2: $ dd if=/dev/zero of=/mnt/testfile bs=64k

1. Set up posix+iot+nfsx. In nfsx, set the following option:
   option nfs3.<volume-name>.trusted-sync on
2. With the above option set, run Command-1. Record the throughput.
3. With the above option unset, run Command-1. This throughput must be much lower than the throughput in (2).
4. With the above option set, run Command-2. Record the throughput.
5. With the above option unset, run Command-2. This throughput must be much lower than the figure in (4).
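If dd is not convenient on the test machine, a rough standalone equivalent of the Test Case 2 measurement could look like the sketch below, assuming the volume is mounted at /mnt. Pass "sync" on the command line to approximate Command-1 (oflag=sync) and omit it for Command-2; the path, block size, and count are arbitrary.

/* Rough standalone equivalent of the dd throughput measurement above:
 * write a fixed number of 64k blocks and report MB/s. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const size_t    blk   = 64 * 1024;
        const int       count = 1000;
        int             flags = O_WRONLY | O_CREAT | O_TRUNC;
        char           *buf;
        struct timespec t0, t1;
        double          secs;
        int             fd, i;

        if (argc > 1 && strcmp(argv[1], "sync") == 0)
                flags |= O_SYNC;                 /* like dd oflag=sync */

        buf = calloc(1, blk);
        fd = open("/mnt/testfile", flags, 0644);
        if (!buf || fd < 0) {
                perror("setup");
                return 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < count; i++)
                if (write(fd, buf, blk) != (ssize_t)blk) {
                        perror("write");
                        return 1;
                }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu bytes in %.2f s (%.2f MB/s)\n",
               blk * count, secs, blk * count / secs / 1e6);

        close(fd);
        free(buf);
        return 0;
}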
Testing with nfs-beta-rc7 and nfs-beta-rc8 without 'option trusted-write', the dd command fails:

# dd if=/dev/zero of=/root/laks/mnt2/new/test bs=64k count=32000
dd: closing output file `/root/laks/mnt2/new/test': Input/output error

The log file can be found under /share/tickets/924/
The log shows a huge delay from the disk file system for the write operations. That is what is causing NFS requests to time out. The NFS client returns EIO because we're mounting with the soft option, which forces the NFS client to return an EIO on a timeout. This is a regression test not for a functional problem but for a performance issue. The test must be run on real hardware for it to show the correct behaviour of the new options.
The dd command failed because of the low timeout value passed as a mount option (-o timeo=10). It works with the default timeout option.
Verified with nfs-beta-rc8. Performance increased when the trusted-write or trusted-sync option is enabled.
Regression test - http://test.gluster.com/show_bug.cgi?id=77