Bug 1699402

Summary: smallfile caused kernel Cephfs crash in RHOCS (OpenShift-on-Ceph)
Product: Red Hat Enterprise Linux 7
Reporter: Ben England <bengland>
Component: kernel
Assignee: Jeff Layton <jlayton>
kernel sub component: CephFS
QA Contact: Hemanth Kumar <hyelloji>
Status: CLOSED ERRATA
Docs Contact: John Wilkins <jowilkin>
Severity: high
Priority: high
CC: ceph-eng-bugs, cmaiolin, jlayton, pdonnell, rhandlin, sweil, tmuthami, vakulkar, vereddy, zguo, zyan
Version: 7.6
Target Milestone: rc
Target Release: 7.7
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-3.10.0-1128.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-29 21:00:47 UTC
Type: Bug

Description Ben England 2019-04-12 15:42:00 UTC
Description of problem:

A smallfile workload on 100 pods across 3 nodes crashed the kernel CephFS client against a Ceph Mimic cluster.  The system had been running smallfile in various ways for a couple of hours before this happened.  I know Mimic is neither RHCS 3.2 nor RHCS 4.0 but falls in between; at the time, rook.io did not support Nautilus.

Version-Release number of selected component (if applicable):

Ceph Mimic container image ceph/ceph_v13 deployed with rook.io 
docker images | grep rook =>
"docker.io/rook/ceph  master  087c372bb156  6 days ago  698 MB"
OpenShift 3.11
RHEL7.6 kernel-3.10.0-957.el7.x86_64
smallfile version quay.io/bengland2/smallfile:20190411.b

How reproducible:

Not sure yet.

Steps to Reproduce (details below):
1. Deploy OpenShift 3.11 (ask Elko Kuric for details) on RHEL7.6
2. Deploy rook.io using specified config files
3. Bring up a CephFS filesystem
4. Run the smallfile benchmark for hours in a set of 100 pods with CephFS bind mounts provisioned by Kubernetes


Actual results:

Kernel crash

Expected results:

No crash.  

Additional info:

Crash is available at 

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/rhocs-bene-2019-04-12-10-19/crash-e23-h19-740xd-2019-04-12-00-55-56.tgz

It appears this smallfile command was running at the time of the crash:

Running oc -n smallfile rsh smallfile-pod-22l9q /smallfile/smallfile_cli.py --top /mnt/cephfs --launch-by-daemon Y --output-json /tmp/result.json --response-times Y --host-set /tmp/smf.pod.id.list --files 10000 --file-size 4 --threads 1 --operation delete

sosreports for all nodes (i.e. configuration) available at 

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/rhocs-bene-2019-04-12-10-19/smf-rhocs-tests/sosreports/

log of entire set of smallfile runs available at:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/rhocs-bene-2019-04-12-10-19/smf-test-runs.2019-04-11-16-55.log

executing this script:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/rhocs/alias2/smf-test-runs.sh

individual smallfile results available at:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/rhocs-bene-2019-04-12-10-19/smf-rhocs-tests/

Comment 3 Yan, Zheng 2019-04-13 08:55:02 UTC
BUG is at
[125150.911742] kernel BUG at fs/ceph/mds_client.c:2100!
  BUG_ON(p > end);

[125150.912549]  [<ffffffffc0cf5352>] __do_request+0x342/0x430 [ceph]
[125150.912569]  [<ffffffffc0cf6a8d>] ceph_mdsc_do_request+0x9d/0x280 [ceph]
[125150.912589]  [<ffffffffc0cd6901>] ceph_d_revalidate+0x221/0x520 [ceph]
[125150.912608]  [<ffffffff8925c43a>] ? __d_lookup+0x7a/0x160
[125150.912623]  [<ffffffff8924d2ea>] lookup_fast+0x1da/0x230
[125150.912637]  [<ffffffff8925156d>] path_lookupat+0x16d/0x8b0
[125150.912653]  [<ffffffff8921bcb5>] ? kmem_cache_alloc+0x35/0x1f0
[125150.912668]  [<ffffffff89252b2f>] ? getname_flags+0x4f/0x1a0
[125150.912682]  [<ffffffff89251cdb>] filename_lookup+0x2b/0xc0
[125150.912696]  [<ffffffff89253cc7>] user_path_at_empty+0x67/0xc0
[125150.912712]  [<ffffffff891ebc6d>] ? handle_mm_fault+0x39d/0x9b0
[125150.913402]  [<ffffffff89253d31>] user_path_at+0x11/0x20
[125150.914045]  [<ffffffff89246bc3>] vfs_fstatat+0x63/0xc0
[125150.914681]  [<ffffffff89246f7e>] SYSC_newstat+0x2e/0x60
[125150.915314]  [<ffffffff89138b36>] ? __audit_syscall_exit+0x1e6/0x280
[125150.915943]  [<ffffffff8924743e>] SyS_newstat+0xe/0x10
[125150.916567]  [<ffffffff89774ddb>] system_call_fastpath+0x22/0x27

Encoding a lookup request is simple; there is no req->r_xxx_drop. No idea what happened here.
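
For context, the check that fires here is the bounds check at the end of request encoding: the request is serialized into the preallocated msg->front buffer by advancing a cursor p toward end. A minimal sketch of that pattern (illustrative only, simplified from the request-encoding path in fs/ceph/mds_client.c; not the actual kernel source):

    void *p = msg->front.iov_base;
    void *end = p + msg->front.iov_len;   /* front was sized up front from the path lengths */

    /* ... encode the head, the filepaths, any releases and the timestamp,
     * advancing p through the buffer as we go ... */

    BUG_ON(p > end);                      /* fs/ceph/mds_client.c:2100 in this kernel */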

Comment 4 Jeff Layton 2019-04-13 11:07:42 UTC
I'm poking around in this code at the moment, so I'll go ahead and grab this one.

Comment 7 Jeff Layton 2019-04-15 11:17:02 UTC
Notes from vmcore:

I believe the struct ceph_mds_request is at 0xffff982ba7a69000, and ceph_msg is at 0xffff983a82aebc00. This is the "front":

crash> struct ceph_msg.front ffff983a82aebc00
  front = {
    iov_base = 0xffff98396bf36180, 
    iov_len = 0x8b
  }

So, "end" should be 0xffff98396bf3620b. After encoding the filepaths:

crash> struct ceph_mds_request.r_request_release_offset 0xffff982ba7a69000
  r_request_release_offset = 0x8a

So it looks like encoding the filepaths ended up taking almost all of the buffer, before we even got to encoding the timespec, which is probably what pushed it over the limit.
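
Spelling out the arithmetic from those two values (my reading of the numbers above, not additional data from the dump):

    end = 0xffff98396bf36180 + 0x8b = 0xffff98396bf3620b   (front is 0x8b = 139 bytes)
    r_request_release_offset = 0x8a = 138

So after the head and the filepaths were encoded, only a single byte of the front buffer remained, and the ceph_timespec still to be encoded is larger than that, which is enough to push p past end.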

Comment 9 Jeff Layton 2019-04-15 11:56:04 UTC
We're dealing with a dentry revalidation here, so this is just a lookup and r_dentry is set in the request. We calculate the frontlen of the request like this:

        len = sizeof(*head) +
                pathlen1 + pathlen2 + 2*(1 + sizeof(u32) + sizeof(u64)) +
                sizeof(struct ceph_timespec);


pathlen2 should be 0, and the rest of the fields are fixed-length, so pathlen1 should be 17 bytes (0x11). Looking at r_dentry though:

  d_name = {
    {
      {
        hash = 0xb7f43245, 
        len = 0x18
      }, 
      hash_len = 0x18b7f43245
    }, 
    name = 0xffff98445a522ff8 "starting_gate.tmp.notyet"
  }, 

Just this component is 0x18 bytes long, and it's not at the root of the fs.
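
To tie the numbers together (my arithmetic, using only the values quoted above and assuming struct ceph_timespec is 8 bytes here):

    front length             = 0x8b = 139 bytes
    fixed overhead           = sizeof(*head) + 2*(1 + 4 + 8) + 8 = 139 - 17 = 122 bytes
    => the buffer was sized for pathlen1 = 17, pathlen2 = 0
    d_name.len at crash time = 0x18 = 24 bytes

So the name being encoded was 7 bytes longer than the name the buffer was sized for, which matches the single spare byte seen at r_request_release_offset in comment 7 plus the 8-byte timespec that still had to be written: 138 + 8 = 146 > 139.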

Comment 10 Jeff Layton 2019-04-15 12:47:57 UTC
set_request_path_attr is what sets the pointers to the path strings and the length. In the case of a nosnap dentry, we just return a pointer to the d_name string.

        rcu_read_lock();
        if (!dir)
                dir = d_inode_rcu(dentry->d_parent);
        if (dir && ceph_snap(dir) == CEPH_NOSNAP) {
                *pino = ceph_ino(dir);
                rcu_read_unlock();
                *ppath = dentry->d_name.name;
                *ppathlen = dentry->d_name.len;
                return 0;
        }
        rcu_read_unlock();

But when we go to actually encode the thing, we don't pass in the ppathlen:

    ceph_encode_filepath(&p, end, ino1, path1);

I think we hit a race of some sort here, where the length of the string changed after we did the initial set_request_path_attr, but before the encoding. I'm not yet sure how that could happen however.
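
A stripped-down illustration of that mismatch (not the real kernel code; names and structure simplified): the length used to size the message is captured once from the dentry, but the bytes are measured again at encode time, so a rename in between leaves the two out of sync.

    /* Illustrative sketch only -- simplified from the interaction described above. */
    const char *path1 = dentry->d_name.name;   /* raw pointer into the dentry */
    u32 sized_len     = dentry->d_name.len;    /* length captured when sizing the buffer */

    /* msg->front is allocated based on sized_len ... */

    /* ... later, the encoder derives the length from the string itself
     * (effectively strlen(path1)); if a concurrent rename swapped the
     * dentry name in the meantime, that length can exceed sized_len and
     * the copy runs past end, tripping the BUG_ON. */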

Comment 11 Jeff Layton 2019-04-15 12:57:20 UTC
Ok, I think the dentry probably got involved in a rename while we were working with it. __d_move will call switch_names to swap the dentry names. I suspect that's what happened here. It could be other things too, but basically I think build_dentry_path is likely not safe.

I think we'll probably need to stop build_dentry_path from passing back pointers to dentry->d_name.name. We either need to hold d_lock in order to stabilize the pointers (probably not feasible), or make a copy of the name before returning.

Still looking over the code to figure out the right thing to do here.
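
One way to picture the "make a copy" option (a sketch of the idea only, with a hypothetical helper name; the real fix may look quite different):

    /* Hypothetical sketch: snapshot the name under d_lock so the pointer and
     * length handed back can no longer change underneath the encoder.
     * Needs <linux/dcache.h>, <linux/spinlock.h>, <linux/string.h>. */
    static int copy_dentry_name(struct dentry *dentry, const char **ppath,
                                int *ppathlen)
    {
            char *name;

            spin_lock(&dentry->d_lock);
            /* GFP_ATOMIC because we allocate while holding d_lock */
            name = kstrndup(dentry->d_name.name, dentry->d_name.len, GFP_ATOMIC);
            spin_unlock(&dentry->d_lock);
            if (!name)
                    return -ENOMEM;

            *ppath = name;              /* stable private copy */
            *ppathlen = strlen(name);   /* length now matches *ppath by construction */
            return 0;                   /* caller frees *ppath when the request is done */
    }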

Comment 12 Yan, Zheng 2019-04-15 14:00:10 UTC
good job.

Comment 13 Jeff Layton 2019-04-15 14:03:03 UTC
Thanks. Upstream code is pretty much the same here too, so I'm working on a patch for that first. We can discuss/fix it there and then backport to RHEL7.

Comment 14 Jeff Layton 2019-04-15 18:05:40 UTC
Patch posted upstream:

https://marc.info/?l=ceph-devel&m=155534672111970&w=2

Comment 22 Ben England 2020-01-24 22:31:28 UTC
Suggestion: run this with a RHEL 8.1 CephFS client and servers, running OCP 4.3 and OCS 4.2 GA.  If it doesn't happen there, then we mark it fixed.  I can't really do that until I have the same hardware config.  It doesn't say this here (sigh), but I was running this test on a bare-metal cluster; OCP 4.3 will support bare-metal deployments with UPI, and OCS should run on that.

Comment 25 Jan Stancek 2020-03-12 07:49:59 UTC
Patch(es) committed on kernel-3.10.0-1128.el7

Comment 32 Veera Raghava Reddy 2020-08-27 07:04:53 UTC
Ceph QE is running sanity tests. Will update the BZ once the tests are complete.

Comment 35 errata-xmlrpc 2020-09-29 21:00:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4060