Bug 1566467

Summary: [nfs-ganesha] NFS SETATTR requests are not durable with NFS-Ganesha backed by CephFS
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Ram Raja <rraja>
Component: CephFSAssignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED ERRATA QA Contact: Ramakrishnan Periyasamy <rperiyas>
Severity: high Docs Contact: Aron Gunn <agunn>
Priority: high    
Version: 3.0CC: agunn, ceph-eng-bugs, ceph-qe-bugs, jlayton, john.spray, mbenjamin, pdonnell, rperiyas, sostapov, vakulkar
Target Milestone: z3   
Target Release: 3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: RHEL: ceph-12.2.4-10.el7cp Ubuntu: ceph_12.2.4-14redhat1xenial Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-15 18:20:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
patch to add ceph testcase to demonstrate problem none

Description Ram Raja 2018-04-12 11:32:47 UTC
Description of problem:

With NFS-Ganesha backed by CephFS, setattr request changes from a NFS client might be lost even though the NFS client received a reply that changes were successfully applied. This could happen if the NFS-Ganesha server crashed, and the libcephfs client had cached the setattr request and not sent it to the MDS, but replied success to the application.

From https://tracker.ceph.com/issues/23291,
" Zheng pointed out that we could end up with a setattr request being cached in the libcephfs client after it has replied with success to the application. This is problematic for ganesha:

1/ NFS client does a NFS SETATTR
2/ ganesha issues setattrx to libcephfs, result is cached in client and not sent to MDS
3/ ganesha replies with success to client
4/ ganesha crashes

setattrx changes are then lost."

Version-Release number of selected component (if applicable): RHCS 3.0.z2

Comment 3 Ram Raja 2018-04-12 11:51:21 UTC
Patches to fix this have been merged upstream Ceph repo's master branch and NFS-Ganesha repo's next branch. We'd need to backport these patches.

Patch in Ceph Master
--------------------
commit 78628a5fde277927126d4d6e5d2957e7e6d42fb1
Author: Jeff Layton <jlayton>
Date:   Wed Mar 14 12:25:36 2018 -0400

    client: add ceph_ll_sync_inode
    
    Ganesha's NFS SETATTR requests are expected to be synchronous, but
    libcephfs can cache attribute changes if the client has the right caps.
    An fsync after changing the attributes would flush the caps back to the
    MDS, but we may not have an Fh in the context of a SETATTR request.
    
    Add a way to sync a specific Inode, even when we don't have easy access
    to a Fh.
    
    Tracker: http://tracker.ceph.com/issues/23291
    Signed-off-by: Jeff Layton <jlayton>

Patches in NFS-Ganesha 'next' branch
------------------------------------
commit fdd020fab3034ef96571646450c28c2ec9329b70
Author: Jeff Layton <jlayton>
Date:   Mon Apr 2 14:42:18 2018 -0400

    cmake: add USE_FSAL_CEPH_LL_SYNC_INODE to config-h.in.cmake
    
    As Dan points out, I left this (rather important) bit out.
    
    Change-Id: I69aa582a203dc099577c591f474367381a535d17
    Reported-by: Daniel Gryniewicz <dang>
    Signed-off-by: Jeff Layton <jlayton>


commit d9fe5c92a1543a2c8b1c1a91c4d610f4adb92e07
Author: Jeff Layton <jlayton>
Date:   Wed Mar 14 13:49:48 2018 -0400

    FSAL_CEPH: add check for and call to ceph_ll_sync_inode
    
    libcephfs can cache the results of a setattr request after replying
    to it. This violates the semantics that NFS expects. Add a call to
    ceph_ll_sync_inode after the setattr, if the function exists.
    
    Change-Id: Ibbe86fe547bc571dd71e7a5d93cc13111c03910a
    Signed-off-by: Jeff Layton <jlayton>

Comment 10 Jeff Layton 2018-04-12 19:34:14 UTC
Created attachment 1421030 [details]
patch to add ceph testcase to demonstrate problem

This patch demonstrates the problem with a cephfs testcase. It fails for me, but if you uncomment the ceph_ll_sync_inode call, it works. If you use Zheng's reclaim branch, you can uncomment the other calls and it then runs quicker too.

So, I believe this problem is theoretically possible with ganesha, but I'm not sure exactly how to make it occur on demand.

Comment 12 Jeff Layton 2018-04-13 17:04:48 UTC
This may be uncovering a different issue as well.

With a ceph pull from earlier this week, I'm seeing ~5s stalls in the ll_sync_inode calls during a ganesha SETATTR op. It looks like we're issuing the fsync to the MDS, but it's taking a full tick to reply.

That said, I need to confirm that and turn up the debugging.

Comment 13 Jeff Layton 2018-04-13 19:21:11 UTC
http://tracker.ceph.com/issues/23714 for the slow ll_sync_inode problem.

Comment 14 Jeff Layton 2018-04-18 14:02:52 UTC
PR for slow ll_sync_inode calls is here:

    https://github.com/ceph/ceph/pull/21498

We'll definitely want to pull that patch in as part of this backport (assuming it passes testing and gets merged, of course).

Comment 20 Ramakrishnan Periyasamy 2018-05-07 04:22:58 UTC
> No reliable reproducer is known so QE should just run basic regression
> testing on this.

Since there is no reliable reproducer and basic regression is enough to move this bz to verified we have ran basic tests with IO for CephFS-NFS as mentioned below.

1. Basic IO workloads using IO tools like Smallfiles, Crefi, fio and "rm -rf" of files and folders in another thread. All these IO were ran parallel in same NFS mount on different directory.
2. IO Workloads used during testing (repeated below workloads after completion of IO in another directory)
    fio: sudo ./fio --name=global --rw=readwrite --size=50G --name=job1 --name=job2 --directory=/mnt/nfs/_dir4_1/ --verify=meta --runtime=2h --time_based
    crefi: for i in create rename chmod chown chgrp symlink hardlink setxattr truncate; do sudo ./crefi.py --multi -b 10 -d 10 -n 1000 -t sparse --random --min=1K --max=512K /mnt/nfs/@#dir8/ ; done
    smallfiles: for i in create append overwrite delete create mkdir read rename readdir symlink rmdir stat chmod setxattr rename getxattr ls-l ; do sudo python ./smallfile_cli.py --operation $i --threads 8 --file-size 1024 --files 2048 --top /mnt/nfs/dir10/ ; done
3. Restart of ganesha service during IO in progress, not seen any failures, IO's were running without any issues (i.e. observed a delay in IO but not faiure)
4. With 2 NFS client(only mount no NFS HA), running IO's from both client on different directory. Used the same workloads as in 1 & 2 from both clients.
5. Restart of ganesha service while IO's in progress from both clients.
6. Basic command tests like creating files and directories with special characters, scp, vi editors, files with big names, etc.
7. During Client IO's MDS failover

Moving this bz to verified state.

Comment 23 errata-xmlrpc 2018-05-15 18:20:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1563