Description of problem:
Untar of a tarball is taking too much time.

Version-Release number of selected component (if applicable):
RHGS 3.2

How reproducible:
Every time

Steps to Reproduce:
1. Mount the gluster volume through gnfs.
2. Try to untar the tarball (a sketch with example commands follows the volume info below).

Actual results:
Untar takes around 3+ hours for a 6 GB file.

Expected results:
If we tweak some performance parameters, then untar should take less time to finish.

Additional info:

Volume information:

Volume Name: gluster
Type: Distributed-Replicate
Volume ID: 653a5d83-3b1b-49a7-9a64-2874f7f6031a
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: node0:/srv/gluster/brick01
Brick2: node1:/srv/gluster/brick01
Brick3: node2:/srv/gluster/brick01
Brick4: node3:/srv/gluster/brick01
Brick5: node0:/srv/gluster/brick02
Brick6: node1:/srv/gluster/brick02
Brick7: node2:/srv/gluster/brick02
Brick8: node3:/srv/gluster/brick02
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: enable
transport.address-family: inet
performance.client-io-threads: on
server.event-threads: 4
cluster.data-self-heal: off
performance.cache-size: 1GB
performance.write-behind-window-size: 1MB
cluster.entry-self-heal: off
cluster.lookup-optimize: on
performance.io-thread-count: 16
performance.io-cache: on
server.allow-insecure: on
server.outstanding-rpc-limit: 0
performance.read-ahead: disable
cluster.metadata-self-heal: off
cluster.readdir-optimize: on
client.event-threads: 4
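For reference, a minimal reproduction sketch. The server name, mount point
and tarball path are assumptions for illustration, not taken from the report:

    # Mount the volume over Gluster/NFS (gnfs) with NFSv3; "node0" and the
    # paths are placeholders.
    mount -t nfs -o vers=3 node0:/gluster /mnt/gluster

    # Time the extraction of the (hypothetical) ~6 GB tarball.
    time tar -xf /root/data.tar -C /mnt/gluster/untar-test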
Some more details about the *_SYNC options that NFS offers for WRITE
procedures:

[from https://tools.ietf.org/html/rfc1813#section-3.3.7]

   3.3.7 Procedure 7: WRITE - Write to file

      SYNOPSIS

         WRITE3res NFSPROC3_WRITE(WRITE3args) = 7;

         enum stable_how {
              UNSTABLE  = 0,
              DATA_SYNC = 1,
              FILE_SYNC = 2
         };

      ...

         stable
            If stable is FILE_SYNC, the server must commit the data written
            plus all file system metadata to stable storage before
            returning results. This corresponds to the NFS version 2
            protocol semantics. Any other behavior constitutes a protocol
            violation. If stable is DATA_SYNC, then the server must commit
            all of the data to stable storage and enough of the metadata to
            retrieve the data before returning. The server implementor is
            free to implement DATA_SYNC in the same fashion as FILE_SYNC,
            but with a possible performance drop. If stable is UNSTABLE,
            the server is free to commit any part of the data and the
            metadata to stable storage, including all or none, before
            returning a reply to the client. There is no guarantee whether
            or when any uncommitted data will subsequently be committed to
            stable storage. The only guarantees made by the server are that
            it will not destroy any data without changing the value of verf
            and that it will not commit the data and metadata at a level
            less than that requested by the client. See the discussion on
            COMMIT on page 92 for more information on if and when data is
            committed to stable storage.

There are (volume) options for Gluster/NFS to fake syncing and synced
writes:
 - nfs.trusted-sync
 - nfs.trusted-write

From xlators/nfs/server/src/nfs.c:

        { .key = {"nfs3.*.trusted-write"},
          .type = GF_OPTION_TYPE_BOOL,
          .default_value = "off",
          .description = "On an UNSTABLE write from client, return STABLE flag"
                         " to force client to not send a COMMIT request. In "
                         "some environments, combined with a replicated "
                         "GlusterFS setup, this option can improve write "
                         "performance. This flag allows user to trust Gluster"
                         " replication logic to sync data to the disks and "
                         "recover when required. COMMIT requests if received "
                         "will be handled in a default manner by fsyncing."
                         " STABLE writes are still handled in a sync manner. "
                         "Off by default."
        },
        { .key = {"nfs3.*.trusted-sync"},
          .type = GF_OPTION_TYPE_BOOL,
          .default_value = "off",
          .description = "All writes and COMMIT requests are treated as async."
                         " This implies that no write requests are guaranteed"
                         " to be on server disks when the write reply is "
                         "received at the NFS client. Trusted sync includes "
                         " trusted-write behaviour. Off by default."
        },
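For completeness, a sketch of how these volume options could be toggled with
the gluster CLI; the volume name "gluster" is taken from the volume info
above:

    # Return the STABLE flag on UNSTABLE writes so the client skips the
    # COMMIT request, trusting Gluster replication to sync the data.
    gluster volume set gluster nfs.trusted-write on

    # Treat all writes and COMMIT requests as async (includes the
    # trusted-write behaviour); see the protocol caveat below.
    gluster volume set gluster nfs.trusted-sync on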
The writes in the tcpdump show that FILE_SYNC is used. (wireshark filter:
"rpc.msgtyp == CALL && nfs.procedure_v3 == WRITE")

This means that the following applies:

   If stable is FILE_SYNC, the server must commit the data written plus all
   file system metadata to stable storage before returning results. [....]
   Any other behavior constitutes a protocol violation.

So, enabling the nfs.trusted-sync option "constitutes a protocol violation".

Because this bug is about small files getting extracted from a tarball (new
files, single write), the writes will be flushed on the close() syscall.
Applications (here 'tar') can then check whether the close() was successful
or writing the data failed somewhere. This is referred to as
"close-to-open", indicating that files have been completely written by the
time another process/user reads the newly created file.

I suspect that disabling the close-to-open semantics will improve the
performance for this use-case. However, it comes at the cost of a potential
inconsistency after calling close(): data that was expected to be written
may still be buffered or in transit.

For tarball extraction, one could consider using the "nocto" mount option on
the NFS-client side (see the example after the man page excerpt below).
Additional 'sync' calls or unmounting/remounting of the NFS-export will be
needed to guarantee syncing of the contents to the NFS-server; otherwise the
NFS-client may cache data locally.

Abhishek, I assume that this gives you sufficient insight into what can be
done with Gluster/NFS and the Linux kernel NFS-client. I am not sure there
really is a bug that we need to fix in Gluster here. If you agree, please
close this as NOTABUG or similar. Thanks!

Please see 'man 5 nfs' for further details on using the "nocto" mount
option:

   cto / nocto
          Selects whether to use close-to-open cache coherence semantics.
          If neither option is specified (or if cto is specified), the
          client uses close-to-open cache coherence semantics. If the nocto
          option is specified, the client uses a non-standard heuristic to
          determine when files on the server have changed.

          Using the nocto option may improve performance for read-only
          mounts, but should be used only if the data on the server changes
          only occasionally. The DATA AND METADATA COHERENCE section
          discusses the behavior of this option in more detail.

   ...

   DATA AND METADATA COHERENCE
          Some modern cluster file systems provide perfect cache coherence
          among their clients. Perfect cache coherence among disparate NFS
          clients is expensive to achieve, especially on wide area
          networks. As such, NFS settles for weaker cache coherence that
          satisfies the requirements of most file sharing types.

      Close-to-open cache consistency
          Typically file sharing is completely sequential. First client A
          opens a file, writes something to it, then closes it. Then client
          B opens the same file, and reads the changes.

          When an application opens a file stored on an NFS version 3
          server, the NFS client checks that the file exists on the server
          and is permitted to the opener by sending a GETATTR or ACCESS
          request. The NFS client sends these requests regardless of the
          freshness of the file's cached attributes.

          When the application closes the file, the NFS client writes back
          any pending changes to the file so that the next opener can view
          the changes. This also gives the NFS client an opportunity to
          report write errors to the application via the return code from
          close(2).

          The behavior of checking at open time and flushing at close time
          is referred to as close-to-open cache consistency, or CTO.
          It can be disabled for an entire mount point using the nocto
          mount option.
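To illustrate the suggestion above, a sketch of extracting with "nocto" and
an explicit flush afterwards; the server name and paths are placeholders:

    # Mount with close-to-open semantics disabled (NFSv3 against
    # Gluster/NFS).
    mount -t nfs -o vers=3,nocto node0:/gluster /mnt/gluster

    # Extract the tarball; close() no longer forces a per-file flush to
    # the NFS-server.
    tar -xf /root/data.tar -C /mnt/gluster/untar-test

    # Explicitly flush cached data to the NFS-server, since nocto drops
    # the flush-on-close guarantee; unmounting/remounting achieves the
    # same.
    sync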