Bug 1450745
| Summary: | [GSS]Untar of Tarball taking too much time | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Abhishek Kumar <abhishku> |
| Component: | gluster-nfs | Assignee: | Niels de Vos <ndevos> |
| Status: | CLOSED DEFERRED | QA Contact: | Manisha Saini <msaini> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.2 | CC: | abhishku, amukherj, bkunal, nbalacha, ndevos, pgurusid, pkarampu, pparsons, rgowdapp, rhs-bugs, sankarshan, srangana, storage-qa-internal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-04-05 12:36:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1474007 | | |
Description
Abhishek Kumar
2017-05-15 06:23:02 UTC
Some more details about the *_SYNC options that NFS offers for WRITE procedures [from https://tools.ietf.org/html/rfc1813#section-3.3.7]:

    3.3.7 Procedure 7: WRITE - Write to file

       SYNOPSIS

          WRITE3res NFSPROC3_WRITE(WRITE3args) = 7;

          enum stable_how {
               UNSTABLE  = 0,
               DATA_SYNC = 1,
               FILE_SYNC = 2
          };

       ...

       stable
          If stable is FILE_SYNC, the server must commit the data
          written plus all file system metadata to stable storage
          before returning results. This corresponds to the NFS
          version 2 protocol semantics. Any other behavior
          constitutes a protocol violation. If stable is DATA_SYNC,
          then the server must commit all of the data to stable
          storage and enough of the metadata to retrieve the data
          before returning. The server implementor is free to
          implement DATA_SYNC in the same fashion as FILE_SYNC, but
          with a possible performance drop. If stable is UNSTABLE,
          the server is free to commit any part of the data and the
          metadata to stable storage, including all or none, before
          returning a reply to the client. There is no guarantee
          whether or when any uncommitted data will subsequently be
          committed to stable storage. The only guarantees made by
          the server are that it will not destroy any data without
          changing the value of verf and that it will not commit the
          data and metadata at a level less than that requested by
          the client. See the discussion on COMMIT on page 92 for
          more information on if and when data is committed to
          stable storage.

There are (volume) options for Gluster/NFS to fake syncing and synced writes:

- nfs.trusted-sync
- nfs.trusted-write

From xlators/nfs/server/src/nfs.c:

    { .key           = {"nfs3.*.trusted-write"},
      .type          = GF_OPTION_TYPE_BOOL,
      .default_value = "off",
      .description   = "On an UNSTABLE write from client, return STABLE flag"
                       " to force client to not send a COMMIT request. In "
                       "some environments, combined with a replicated "
                       "GlusterFS setup, this option can improve write "
                       "performance. This flag allows user to trust Gluster"
                       " replication logic to sync data to the disks and "
                       "recover when required. COMMIT requests if received "
                       "will be handled in a default manner by fsyncing."
                       " STABLE writes are still handled in a sync manner. "
                       "Off by default."
    },
    { .key           = {"nfs3.*.trusted-sync"},
      .type          = GF_OPTION_TYPE_BOOL,
      .default_value = "off",
      .description   = "All writes and COMMIT requests are treated as async."
                       " This implies that no write requests are guaranteed"
                       " to be on server disks when the write reply is "
                       "received at the NFS client. Trusted sync includes "
                       " trusted-write behaviour. Off by default."
    },

The writes in the tcpdump show that FILE_SYNC is used.
(Wireshark display filter: "rpc.msgtyp == CALL && nfs.procedure_v3 == WRITE")
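To list the stable_how value of each WRITE call from a capture, something like the following tshark invocation can be used. This is a sketch: capture.pcap is a placeholder, and the nfs.write.stable field name is an assumption that may differ between Wireshark versions (0 = UNSTABLE, 1 = DATA_SYNC, 2 = FILE_SYNC):

    tshark -r capture.pcap \
        -Y 'rpc.msgtyp == 0 && nfs.procedure_v3 == 7' \
        -T fields -e nfs.write.stable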
This means that the following applies:

    If stable is FILE_SYNC, the server must commit the data
    written plus all file system metadata to stable storage
    before returning results. [....] Any other behavior
    constitutes a protocol violation.
So, enabling the nfs.trusted-sync option "constitutes a protocol violation".
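For reference, these options are toggled through the normal volume-set interface; a sketch, with VOLNAME as a placeholder (and with the caveat above that trusted-sync trades protocol correctness for speed):

    # return STABLE for UNSTABLE writes, trusting Gluster replication
    gluster volume set VOLNAME nfs.trusted-write on
    # treat all writes and COMMITs as async (includes trusted-write behaviour)
    gluster volume set VOLNAME nfs.trusted-sync on
    # revert to the default (off)
    gluster volume reset VOLNAME nfs.trusted-sync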
Because this bug is about small files getting extracted from a tarball (new files, single write), the writes will be flushed on the close() syscall. Applications (here 'tar') can then check whether the close() was successful or writing the data failed somewhere. This is referred to as "close-to-open", indicating that files have been completely written by the time another process/user reads the newly created file.
I suspect that disabling the close-to-open semantics will improve the performance for this use-case. However, it comes at the cost of potential inconsistency after calling close(): data that was expected to be written may still be buffered or in transit.
For tarball extraction, one could consider using the "nocto" mount option on the NFS-client side. Additional 'sync' calls, or unmounting/remounting of the NFS-export, will then be needed to guarantee syncing of the contents to the NFS-server; otherwise the NFS-client may cache data locally.
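A sketch of that workflow; server:/export, /mnt/untar and archive.tar are placeholders:

    mount -t nfs -o vers=3,nocto server:/export /mnt/untar
    tar -C /mnt/untar -xf archive.tar
    sync                  # flush client-side cached data to the server
    umount /mnt/untar     # alternatively, unmounting also flushes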
Abhishek, I assume that this gives you sufficient insight into what can be done with Gluster/NFS and the Linux kernel NFS-client. I am not sure there really is a bug that needs fixing in Gluster here. If you agree, please close this as NOTABUG or similar. Thanks!
Please see 'man 5 nfs' for further details on using the "nocto" mount option:
cto / nocto Selects whether to use close-to-open cache coherence
semantics. If neither option is specified (or if cto is
specified), the client uses close-to-open cache coherence
semantics. If the nocto option is specified, the
client uses a non-standard heuristic to determine when
files on the server have changed.
Using the nocto option may improve performance for read-
only mounts, but should be used only if the data on the
server changes only occasionally. The DATA AND METADATA
COHERENCE section discusses the behavior of this option
in more detail.
...
DATA AND METADATA COHERENCE
Some modern cluster file systems provide perfect cache coherence among
their clients. Perfect cache coherence among disparate NFS clients is
expensive to achieve, especially on wide area networks. As such, NFS
settles for weaker cache coherence that satisfies the requirements of
most file sharing types.
Close-to-open cache consistency
Typically file sharing is completely sequential. First client A opens
a file, writes something to it, then closes it. Then client B opens
the same file, and reads the changes.
When an application opens a file stored on an NFS version 3 server, the
NFS client checks that the file exists on the server and is permitted
to the opener by sending a GETATTR or ACCESS request. The NFS client
sends these requests regardless of the freshness of the file's cached
attributes.
When the application closes the file, the NFS client writes back any
pending changes to the file so that the next opener can view the
changes. This also gives the NFS client an opportunity to report write
errors to the application via the return code from close(2).
The behavior of checking at open time and flushing at close time is
referred to as close-to-open cache consistency, or CTO. It can be
disabled for an entire mount point using the nocto mount option.
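As a side note on that last paragraph: because pending writes are flushed at close time, applications writing to NFS should check the return value of close(2). A minimal sketch in C, with a hypothetical path on an NFS mount:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* hypothetical file on an NFS mount */
        int fd = open("/mnt/untar/file.txt",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const char buf[] = "hello\n";
        if (write(fd, buf, sizeof(buf) - 1) < 0)
            perror("write");

        /* With close-to-open semantics the NFS client flushes pending
         * writes here; a server-side write error can surface as a
         * failing close(), not necessarily as a failing write(). */
        if (close(fd) != 0) {
            fprintf(stderr, "close: %s\n", strerror(errno));
            return 1;
        }
        return 0;
    }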