Bug 1566467

Summary:

[nfs-ganesha] NFS SETATTR requests are not durable with NFS-Ganesha backed by CephFS

Product:

[Red Hat Storage] Red Hat Ceph Storage

Reporter:

Ram Raja <rraja>

Component:

CephFS

Assignee:

Matt Benjamin (redhat) <mbenjamin>

Status:

CLOSED ERRATA

QA Contact:

Ramakrishnan Periyasamy <rperiyas>

Severity:

high

Docs Contact:

Aron Gunn <agunn>

Priority:

high

Version:

3.0

CC:

agunn, ceph-eng-bugs, ceph-qe-bugs, jlayton, john.spray, mbenjamin, pdonnell, rperiyas, sostapov, vakulkar

Target Milestone:

Target Release:

3.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

RHEL: ceph-12.2.4-10.el7cp Ubuntu: ceph_12.2.4-14redhat1xenial

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-05-15 18:20:31 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
patch to add ceph testcase to demonstrate problem	none

Description Ram Raja 2018-04-12 11:32:47 UTC

Description of problem:

With NFS-Ganesha backed by CephFS, setattr request changes from a NFS client might be lost even though the NFS client received a reply that changes were successfully applied. This could happen if the NFS-Ganesha server crashed, and the libcephfs client had cached the setattr request and not sent it to the MDS, but replied success to the application.

From https://tracker.ceph.com/issues/23291,
" Zheng pointed out that we could end up with a setattr request being cached in the libcephfs client after it has replied with success to the application. This is problematic for ganesha:

1/ NFS client does a NFS SETATTR
2/ ganesha issues setattrx to libcephfs, result is cached in client and not sent to MDS
3/ ganesha replies with success to client
4/ ganesha crashes

setattrx changes are then lost."

Version-Release number of selected component (if applicable): RHCS 3.0.z2

Comment 3 Ram Raja 2018-04-12 11:51:21 UTC

Patches to fix this have been merged upstream Ceph repo's master branch and NFS-Ganesha repo's next branch. We'd need to backport these patches.

Patch in Ceph Master
--------------------
commit 78628a5fde277927126d4d6e5d2957e7e6d42fb1
Author: Jeff Layton <jlayton>
Date:   Wed Mar 14 12:25:36 2018 -0400

    client: add ceph_ll_sync_inode
    
    Ganesha's NFS SETATTR requests are expected to be synchronous, but
    libcephfs can cache attribute changes if the client has the right caps.
    An fsync after changing the attributes would flush the caps back to the
    MDS, but we may not have an Fh in the context of a SETATTR request.
    
    Add a way to sync a specific Inode, even when we don't have easy access
    to a Fh.
    
    Tracker: http://tracker.ceph.com/issues/23291
    Signed-off-by: Jeff Layton <jlayton>

Patches in NFS-Ganesha 'next' branch
------------------------------------
commit fdd020fab3034ef96571646450c28c2ec9329b70
Author: Jeff Layton <jlayton>
Date:   Mon Apr 2 14:42:18 2018 -0400

    cmake: add USE_FSAL_CEPH_LL_SYNC_INODE to config-h.in.cmake
    
    As Dan points out, I left this (rather important) bit out.
    
    Change-Id: I69aa582a203dc099577c591f474367381a535d17
    Reported-by: Daniel Gryniewicz <dang>
    Signed-off-by: Jeff Layton <jlayton>


commit d9fe5c92a1543a2c8b1c1a91c4d610f4adb92e07
Author: Jeff Layton <jlayton>
Date:   Wed Mar 14 13:49:48 2018 -0400

    FSAL_CEPH: add check for and call to ceph_ll_sync_inode
    
    libcephfs can cache the results of a setattr request after replying
    to it. This violates the semantics that NFS expects. Add a call to
    ceph_ll_sync_inode after the setattr, if the function exists.
    
    Change-Id: Ibbe86fe547bc571dd71e7a5d93cc13111c03910a
    Signed-off-by: Jeff Layton <jlayton>

Comment 10 Jeff Layton 2018-04-12 19:34:14 UTC

Created attachment 1421030 [details]
patch to add ceph testcase to demonstrate problem

This patch demonstrates the problem with a cephfs testcase. It fails for me, but if you uncomment the ceph_ll_sync_inode call, it works. If you use Zheng's reclaim branch, you can uncomment the other calls and it then runs quicker too.

So, I believe this problem is theoretically possible with ganesha, but I'm not sure exactly how to make it occur on demand.

Comment 12 Jeff Layton 2018-04-13 17:04:48 UTC

This may be uncovering a different issue as well.

With a ceph pull from earlier this week, I'm seeing ~5s stalls in the ll_sync_inode calls during a ganesha SETATTR op. It looks like we're issuing the fsync to the MDS, but it's taking a full tick to reply.

That said, I need to confirm that and turn up the debugging.

Comment 13 Jeff Layton 2018-04-13 19:21:11 UTC

http://tracker.ceph.com/issues/23714 for the slow ll_sync_inode problem.

Comment 14 Jeff Layton 2018-04-18 14:02:52 UTC

PR for slow ll_sync_inode calls is here:

    https://github.com/ceph/ceph/pull/21498

We'll definitely want to pull that patch in as part of this backport (assuming it passes testing and gets merged, of course).

Comment 20 Ramakrishnan Periyasamy 2018-05-07 04:22:58 UTC

> No reliable reproducer is known so QE should just run basic regression
> testing on this.

Since there is no reliable reproducer and basic regression is enough to move this bz to verified we have ran basic tests with IO for CephFS-NFS as mentioned below.

1. Basic IO workloads using IO tools like Smallfiles, Crefi, fio and "rm -rf" of files and folders in another thread. All these IO were ran parallel in same NFS mount on different directory.
2. IO Workloads used during testing (repeated below workloads after completion of IO in another directory)
    fio: sudo ./fio --name=global --rw=readwrite --size=50G --name=job1 --name=job2 --directory=/mnt/nfs/_dir4_1/ --verify=meta --runtime=2h --time_based
    crefi: for i in create rename chmod chown chgrp symlink hardlink setxattr truncate; do sudo ./crefi.py --multi -b 10 -d 10 -n 1000 -t sparse --random --min=1K --max=512K /mnt/nfs/@#dir8/ ; done
    smallfiles: for i in create append overwrite delete create mkdir read rename readdir symlink rmdir stat chmod setxattr rename getxattr ls-l ; do sudo python ./smallfile_cli.py --operation $i --threads 8 --file-size 1024 --files 2048 --top /mnt/nfs/dir10/ ; done
3. Restart of ganesha service during IO in progress, not seen any failures, IO's were running without any issues (i.e. observed a delay in IO but not faiure)
4. With 2 NFS client(only mount no NFS HA), running IO's from both client on different directory. Used the same workloads as in 1 & 2 from both clients.
5. Restart of ganesha service while IO's in progress from both clients.
6. Basic command tests like creating files and directories with special characters, scp, vi editors, files with big names, etc.
7. During Client IO's MDS failover

Moving this bz to verified state.

Comment 23 errata-xmlrpc 2018-05-15 18:20:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1563