Bug 1450745 - [GSS]Untar of Tarball taking too much time
Summary: [GSS]Untar of Tarball taking too much time
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-nfs
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Niels de Vos
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks: 1474007
 
Reported: 2017-05-15 06:23 UTC by Abhishek Kumar
Modified: 2020-12-14 08:39 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-05 12:36:00 UTC
Embargoed:



Description Abhishek Kumar 2017-05-15 06:23:02 UTC
Description of problem:

Untar of Tarball taking too much time

Version-Release number of selected component (if applicable):

RHGS : 3.2

How reproducible:

Every time

Steps to Reproduce:
1. Mount the gluster volume through gnfs
2. Try to untar the tarball (see the sketch below)
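
A minimal sketch of the reproducer, assuming the server and volume names from the volume information below; the tarball path is a placeholder:

    # mount the gluster volume over gNFS (NFSv3)
    mount -t nfs -o vers=3 node0:/gluster /mnt/gluster

    # time the extraction onto the NFS mount
    time tar -xf /root/archive.tar -C /mnt/gluster/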

Actual results:

Untar takes around 3+ hours for a 6 GB file

Expected results:

If we tweak some performance parameters, then untar should take less time to finish.

Additional info:

Volume Information :

Volume Name: gluster
Type: Distributed-Replicate
Volume ID: 653a5d83-3b1b-49a7-9a64-2874f7f6031a
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: node0:/srv/gluster/brick01
Brick2: node1:/srv/gluster/brick01
Brick3: node2:/srv/gluster/brick01
Brick4: node3:/srv/gluster/brick01
Brick5: node0:/srv/gluster/brick02
Brick6: node1:/srv/gluster/brick02
Brick7: node2:/srv/gluster/brick02
Brick8: node3:/srv/gluster/brick02
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: enable
transport.address-family: inet
performance.client-io-threads: on
server.event-threads: 4
cluster.data-self-heal: off
performance.cache-size: 1GB
performance.write-behind-window-size: 1MB
cluster.entry-self-heal: off
cluster.lookup-optimize: on
performance.io-thread-count: 16
performance.io-cache: on
server.allow-insecure: on
server.outstanding-rpc-limit: 0
performance.read-ahead: disable
cluster.metadata-self-heal: off
cluster.readdir-optimize: on
client.event-threads: 4
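
For reference, a minimal sketch of how individual volume options can be inspected and changed at run time; the option and value below are placeholders, not a tuning recommendation:

    # show the current value of a single option (or use "all" to list everything)
    gluster volume get gluster performance.write-behind-window-size

    # change an option on the running volume; the value here is only an example
    gluster volume set gluster performance.write-behind-window-size 4MB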

Comment 38 Niels de Vos 2017-06-21 07:18:32 UTC
Some more details about the *_SYNC options that NFS offers for WRITE procedures:

[from https://tools.ietf.org/html/rfc1813#section-3.3.7]
3.3.7 Procedure 7: WRITE - Write to file

   SYNOPSIS

      WRITE3res NFSPROC3_WRITE(WRITE3args) = 7;

      enum stable_how {
           UNSTABLE  = 0,
           DATA_SYNC = 1,
           FILE_SYNC = 2
      };

...

      stable
         If stable is FILE_SYNC, the server must commit the data
         written plus all file system metadata to stable storage
         before returning results. This corresponds to the NFS
         version 2 protocol semantics. Any other behavior
         constitutes a protocol violation. If stable is
         DATA_SYNC, then the server must commit all of the data
         to stable storage and enough of the metadata to
         retrieve the data before returning.  The server
         implementor is free to implement DATA_SYNC in the same
         fashion as FILE_SYNC, but with a possible performance
         drop.  If stable is UNSTABLE, the server is free to
         commit any part of the data and the metadata to stable
         storage, including all or none, before returning a
         reply to the client. There is no guarantee whether or
         when any uncommitted data will subsequently be
         committed to stable storage. The only guarantees made
         by the server are that it will not destroy any data
         without changing the value of verf and that it will not
         commit the data and metadata at a level less than that
         requested by the client. See the discussion on COMMIT
         on page 92 for more information on if and when
         data is committed to stable storage.


There are (volume) options for Gluster/NFS to fake syncing and synced writes:
 - nfs.trusted-sync
 - nfs.trusted-write

From xlators/nfs/server/src/nfs.c:

        { .key  = {"nfs3.*.trusted-write"},
          .type = GF_OPTION_TYPE_BOOL,
          .default_value = "off",
          .description = "On an UNSTABLE write from client, return STABLE flag"
                         " to force client to not send a COMMIT request. In "
                         "some environments, combined with a replicated "
                         "GlusterFS setup, this option can improve write "
                         "performance. This flag allows user to trust Gluster"
                         " replication logic to sync data to the disks and "
                         "recover when required. COMMIT requests if received "
                         "will be handled in a default manner by fsyncing."
                         " STABLE writes are still handled in a sync manner. "
                         "Off by default."

        },
        { .key  = {"nfs3.*.trusted-sync"},
          .type = GF_OPTION_TYPE_BOOL,
          .default_value = "off",
          .description = "All writes and COMMIT requests are treated as async."
                         " This implies that no write requests are guaranteed"
                         " to be on server disks when the write reply is "
                         "received at the NFS client. Trusted sync includes "
                         " trusted-write behaviour. Off by default."

        },
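
Both map to regular volume options. A minimal sketch of enabling them from the CLI, using the volume name from the report above; note that either option trades durability guarantees for performance:

    # return the STABLE flag on UNSTABLE writes so the client skips COMMIT requests
    gluster volume set gluster nfs.trusted-write on

    # treat all writes and COMMIT requests as async (implies trusted-write behaviour)
    gluster volume set gluster nfs.trusted-sync on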

Comment 39 Niels de Vos 2017-06-21 08:00:18 UTC
The writes in the tcpdump show that FILE_SYNC is used.
  (wireshark filter "rpc.msgtyp == CALL && nfs.procedure_v3 == WRITE")
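
A minimal sketch of checking such a capture with tshark, using the numeric equivalents of the filter above (rpc.msgtyp 0 = CALL, nfs.procedure_v3 7 = WRITE per the RFC excerpt); the capture file name is a placeholder:

    # list NFSv3 WRITE calls from the capture and show the requested stable_how level
    tshark -r trace.pcap -Y "rpc.msgtyp == 0 && nfs.procedure_v3 == 7" -V | grep -i stable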

This means that the following applies:

         If stable is FILE_SYNC, the server must commit the data
         written plus all file system metadata to stable storage
         before returning results. [....] Any other behavior
         constitutes a protocol violation.

So, enabling the nfs.trusted-sync option "constitutes a protocol violation".


Because this bug is about small files getting extracted from a tarball (new files, single write), the writes will be flushed on the close() syscall. Applications (here 'tar') can then check whether the close() was successful or writing data failed somewhere. This is referred to as "close-to-open", indicating that files have been completely written once another process/user reads the newly created file.

I suspect that disabling the close-to-open semantics will improve the performance for this use-case. However, it comes at the cost of potential inconsistency after calling close(): data that was expected to be written may still be buffered or in transit.

For tarball extraction, one could consider using the "nocto" mount option on the NFS-client side. Additional 'sync' calls or unmounting/remounting of the NFS-export will be needed to guarantee syncing of the contents to the NFS-server; otherwise the NFS-client may cache data locally.
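
A minimal sketch of such a client-side mount, assuming the same server and volume as in the report (paths are placeholders):

    # mount with close-to-open cache consistency disabled
    mount -t nfs -o vers=3,nocto node0:/gluster /mnt/gluster

    # extract, then explicitly flush client-side cached data to the NFS-server
    time tar -xf /root/archive.tar -C /mnt/gluster/
    sync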


Abhishek, I assume that this gives you sufficient insight into what can be done with Gluster/NFS and the Linux kernel NFS-client. I am not sure there really is a bug that needs fixing in Gluster here. If you agree, please close this as NOTABUG or similar. Thanks!


Please see 'man 5 nfs' for further details on using the "nocto" mount option:

       cto / nocto    Selects whether to use close-to-open cache coherence
                      semantics. If neither option is specified (or if cto is
                      specified), the client uses close-to-open cache
                      coherence semantics. If the nocto option is specified,
                      the client uses a non-standard heuristic to determine
                      when files on the server have changed.

                      Using the nocto option may improve performance for
                      read-only mounts, but should be used only if the data
                      on the server changes only occasionally. The DATA AND
                      METADATA COHERENCE section discusses the behavior of
                      this option in more detail.

...

DATA AND METADATA COHERENCE
       Some modern cluster file systems provide perfect cache coherence among
       their clients. Perfect cache coherence among disparate NFS clients is
       expensive to achieve, especially on wide area networks. As such, NFS
       settles for weaker cache coherence that satisfies the requirements of
       most file sharing types.

   Close-to-open cache consistency
       Typically file sharing is completely sequential. First client A opens
       a file, writes something to it, then closes it. Then client B opens
       the same file, and reads the changes.

       When an application opens a file stored on an NFS version 3 server,
       the NFS client checks that the file exists on the server and is
       permitted to the opener by sending a GETATTR or ACCESS request. The
       NFS client sends these requests regardless of the freshness of the
       file's cached attributes.

       When the application closes the file, the NFS client writes back any
       pending changes to the file so that the next opener can view the
       changes. This also gives the NFS client an opportunity to report
       write errors to the application via the return code from close(2).

       The behavior of checking at open time and flushing at close time is
       referred to as close-to-open cache consistency, or CTO. It can be
       disabled for an entire mount point using the nocto mount option.

