Bug 1699402 - smallfile caused kernel Cephfs crash in RHOCS (OpenShift-on-Ceph)
Summary: smallfile caused kernel Cephfs crash in RHOCS (OpenShift-on-Ceph)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.7
Assignee: Jeff Layton
QA Contact: Hemanth Kumar
Docs Contact: John Wilkins
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-12 15:42 UTC by Ben England
Modified: 2020-09-29 21:01 UTC
CC List: 11 users

Fixed In Version: kernel-3.10.0-1128.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-29 21:00:47 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHSA-2020:4060  0        None      None    None     2020-09-29 21:01:26 UTC

Description Ben England 2019-04-12 15:42:00 UTC
Description of problem:

A smallfile workload on 100 pods across 3 nodes crashed the kernel Cephfs client against a Ceph Mimic cluster.  The system had been running smallfile in various ways for a couple of hours before this happened.  I know Mimic is neither RHCS 3.2 nor RHCS 4.0 but sits in between them; at the time, rook.io did not support Nautilus.

Version-Release number of selected component (if applicable):

Ceph Mimic container image ceph/ceph_v13 deployed with rook.io 
docker images | grep rook => "docker.io/rook/ceph   master   087c372bb156   6 days ago   698 MB"
OpenShift 3.11
RHEL7.6 kernel-3.10.0-957.el7.x86_64
smallfile version quay.io/bengland2/smallfile:20190411.b

How reproducible:

Not sure yet.

Steps to Reproduce (details below):
1. Deploy OpenShift 3.11 (ask Elko Kuric for details) on RHEL7.6
2. Deploy rook.io using specified config files
3. Bring up a Cephfs filesystem
4. Run the smallfile benchmark for hours in a set of 100 pods with Cephfs bind mounts provisioned by Kubernetes


Actual results:

Kernel crash

Expected results:

No crash.  

Additional info:

Crash is available at 

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/rhocs-bene-2019-04-12-10-19/crash-e23-h19-740xd-2019-04-12-00-55-56.tgz

It appears this smallfile command was running at the time of the crash:

Running oc -n smallfile rsh smallfile-pod-22l9q /smallfile/smallfile_cli.py --top /mnt/cephfs --launch-by-daemon Y --output-json /tmp/result.json --response-times Y --host-set /tmp/smf.pod.id.list --files 10000 --file-size 4 --threads 1 --operation delete

sosreports for all nodes (i.e. configuration) available at 

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/rhocs-bene-2019-04-12-10-19/smf-rhocs-tests/sosreports/

log of entire set of smallfile runs available at:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/rhocs-bene-2019-04-12-10-19/smf-test-runs.2019-04-11-16-55.log

executing this script:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/rhocs/alias2/smf-test-runs.sh

individual smallfile results available at:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/rhocs-bene-2019-04-12-10-19/smf-rhocs-tests/

Comment 3 Yan, Zheng 2019-04-13 08:55:02 UTC
BUG is at
[125150.911742] kernel BUG at fs/ceph/mds_client.c:2100!
  BUG_ON(p > end);

[125150.912549]  [<ffffffffc0cf5352>] __do_request+0x342/0x430 [ceph]
[125150.912569]  [<ffffffffc0cf6a8d>] ceph_mdsc_do_request+0x9d/0x280 [ceph]
[125150.912589]  [<ffffffffc0cd6901>] ceph_d_revalidate+0x221/0x520 [ceph]
[125150.912608]  [<ffffffff8925c43a>] ? __d_lookup+0x7a/0x160
[125150.912623]  [<ffffffff8924d2ea>] lookup_fast+0x1da/0x230
[125150.912637]  [<ffffffff8925156d>] path_lookupat+0x16d/0x8b0
[125150.912653]  [<ffffffff8921bcb5>] ? kmem_cache_alloc+0x35/0x1f0
[125150.912668]  [<ffffffff89252b2f>] ? getname_flags+0x4f/0x1a0
[125150.912682]  [<ffffffff89251cdb>] filename_lookup+0x2b/0xc0
[125150.912696]  [<ffffffff89253cc7>] user_path_at_empty+0x67/0xc0
[125150.912712]  [<ffffffff891ebc6d>] ? handle_mm_fault+0x39d/0x9b0
[125150.913402]  [<ffffffff89253d31>] user_path_at+0x11/0x20
[125150.914045]  [<ffffffff89246bc3>] vfs_fstatat+0x63/0xc0
[125150.914681]  [<ffffffff89246f7e>] SYSC_newstat+0x2e/0x60
[125150.915314]  [<ffffffff89138b36>] ? __audit_syscall_exit+0x1e6/0x280
[125150.915943]  [<ffffffff8924743e>] SyS_newstat+0xe/0x10
[125150.916567]  [<ffffffff89774ddb>] system_call_fastpath+0x22/0x27

Encoding a lookup request is simple; there is no req->r_xxx_drop. No idea what happened.

Comment 4 Jeff Layton 2019-04-13 11:07:42 UTC
I'm poking around in this code at the moment, so I'll go ahead and grab this one.

Comment 7 Jeff Layton 2019-04-15 11:17:02 UTC
Notes from vmcore:

I believe the struct ceph_mds_request is at 0xffff982ba7a69000, and ceph_msg is at 0xffff983a82aebc00. This is the "front":

crash> struct ceph_msg.front ffff983a82aebc00
  front = {
    iov_base = 0xffff98396bf36180, 
    iov_len = 0x8b
  }

So, "end" should be 0xffff98396bf3620b. After encoding the filepaths:

crash> struct ceph_mds_request.r_request_release_offset 0xffff982ba7a69000
  r_request_release_offset = 0x8a

So it looks like encoding the filepaths ended up taking almost all of the buffer, before we even got to encoding the timespec, which is probably what pushed it over the limit.
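
If I'm reading r_request_release_offset the same way (the cursor position right after the filepaths were encoded), and assuming struct ceph_timespec is the usual 8 bytes (two __le32 fields), the arithmetic works out to:

  front:                 0xffff98396bf36180 .. 0xffff98396bf3620b   (0x8b = 139 bytes)
  p after the filepaths: 0xffff98396bf36180 + 0x8a = 0xffff98396bf3620a   (1 byte left before end)
  p after the timespec:  0xffff98396bf3620a + 8 = 0xffff98396bf36212,  7 bytes past end

With no releases to encode for this request (comment 3), the timespec is what pushes p past end and trips BUG_ON(p > end).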

Comment 9 Jeff Layton 2019-04-15 11:56:04 UTC
We're dealing with a dentry revalidation here, so this is just a lookup and r_dentry is set in the request. We calculate the frontlen of the request like this:

        len = sizeof(*head) +
                pathlen1 + pathlen2 + 2*(1 + sizeof(u32) + sizeof(u64)) +
                sizeof(struct ceph_timespec);


pathlen2 should be 0, and the rest of the fields are fixed-length, so pathlen1 should be 17 bytes (0x11). Looking at r_dentry, though:

  d_name = {
    {
      {
        hash = 0xb7f43245, 
        len = 0x18
      }, 
      hash_len = 0x18b7f43245
    }, 
    name = 0xffff98445a522ff8 "starting_gate.tmp.notyet"
  }, 

This component alone is 0x18 (24) bytes long, and it's not at the root of the fs.
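
Putting numbers on that:

  budgeted: pathlen1 = 17 bytes (0x11), from the frontlen calculation above
  actual:   d_name.len = 0x18 = 24 bytes
  delta:    24 - 17 = 7 bytes over budget

which lines up with comment 7: the front came up exactly 7 bytes short (1 byte left where the 8-byte timespec had to go).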

Comment 10 Jeff Layton 2019-04-15 12:47:57 UTC
set_request_path_attr is what sets the pointers to the path strings and the length. In the case of a nosnap dentry, we just return a pointer to the d_name string.

        rcu_read_lock();
        if (!dir)
                dir = d_inode_rcu(dentry->d_parent);
        if (dir && ceph_snap(dir) == CEPH_NOSNAP) {
                *pino = ceph_ino(dir);
                rcu_read_unlock();
                *ppath = dentry->d_name.name;
                *ppathlen = dentry->d_name.len;
                return 0;
        }
        rcu_read_unlock();

But when we go to actually encode the thing, we don't pass in the ppathlen:

    ceph_encode_filepath(&p, end, ino1, path1);

I think we hit a race of some sort here, where the length of the string changed after we did the initial set_request_path_attr, but before the encoding. I'm not yet sure how that could happen however.
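
To make the mismatch concrete, here is a minimal userspace sketch of the pattern (not kernel code; capture_path and encode_path are made-up stand-ins for set_request_path_attr and ceph_encode_filepath): the buffer is budgeted from a length captured up front, but the encoder rederives the length from the string itself, so a name that grows in between walks the cursor past the budgeted end.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for set_request_path_attr: capture a pointer and a length. */
static void capture_path(const char *dname, const char **ppath, size_t *ppathlen)
{
        *ppath = dname;               /* points into a name that can still change */
        *ppathlen = strlen(dname);
}

/* Stand-in for ceph_encode_filepath: the copy length comes from the
 * string at encode time, not from the *ppathlen captured earlier. */
static char *encode_path(char *p, const char *path)
{
        uint32_t len = strlen(path);  /* recomputed here; may have grown */

        memcpy(p, &len, sizeof(len));
        p += sizeof(len);
        memcpy(p, path, len);
        p += len;
        return p;
}

int main(void)
{
        char name[32] = "starting_gate.tmp";       /* 17 bytes when we size the buffer */
        char storage[64];                          /* oversized so the demo itself stays safe */
        const char *path;
        size_t pathlen;

        capture_path(name, &path, &pathlen);

        /* Budget the "front" from the captured length, as in comment 9. */
        char *end = storage + sizeof(uint32_t) + pathlen;

        /* The name changes before we get around to encoding (e.g. a rename). */
        strcpy(name, "starting_gate.tmp.notyet");  /* now 24 bytes */

        char *p = encode_path(storage, path);
        if (p > end)                               /* this is where BUG_ON(p > end) fires */
                printf("overran the budgeted front by %td bytes\n", p - end);
        return 0;
}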

Comment 11 Jeff Layton 2019-04-15 12:57:20 UTC
Ok, I think the dentry probably got involved in a rename while we were working on it. __d_move will call switch_names to swap the dentry names. I suspect that's what happened here. It could be other things too, but basically I think build_dentry_path is likely not safe.

I think we'll probably need to stop build_dentry_path from passing back pointers to dentry->d_name.name. We either need to hold d_lock in order to stabilize the pointers (probably not feasible), or make a copy of the name before returning.

Still looking over the code to figure out the right thing to do here.
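
For what it's worth, the lengths are consistent with that theory: "starting_gate.tmp" is exactly 17 bytes and "starting_gate.tmp.notyet" is 24 (0x18), so a name swap between sizing and encoding would account for the 7-byte overrun exactly. A rough userspace sketch of the copy idea (just the idea, not the eventual patch; snapshot_name and name_lock are made-up names, with the mutex standing in for d_lock):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t name_lock = PTHREAD_MUTEX_INITIALIZER;   /* stand-in for d_lock */

/*
 * Take a stable snapshot of the name while holding the lock, so the length
 * used to size the buffer and the bytes copied at encode time are guaranteed
 * to describe the same string.  Caller frees the copy.
 */
static char *snapshot_name(const char *dname, size_t *ppathlen)
{
        char *copy;

        pthread_mutex_lock(&name_lock);
        *ppathlen = strlen(dname);
        copy = malloc(*ppathlen + 1);
        if (copy) {
                memcpy(copy, dname, *ppathlen);
                copy[*ppathlen] = '\0';
        }
        pthread_mutex_unlock(&name_lock);
        return copy;
}

The sizing code and the encoder would then both work from the copy, so a concurrent __d_move can no longer change the length out from under us, at the cost of an extra allocation per request.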

Comment 12 Yan, Zheng 2019-04-15 14:00:10 UTC
good job.

Comment 13 Jeff Layton 2019-04-15 14:03:03 UTC
Thanks. Upstream code is pretty much the same here too, so I'm working on a patch for that first. We can discuss/fix it there and then backport to RHEL7.

Comment 14 Jeff Layton 2019-04-15 18:05:40 UTC
Patch posted upstream:

https://marc.info/?l=ceph-devel&m=155534672111970&w=2

Comment 22 Ben England 2020-01-24 22:31:28 UTC
Suggestion: run this with a RHEL 8.1 Cephfs client and servers, running OCP 4.3 and OCS 4.2 GA. If it doesn't happen there, then we mark it fixed. I can't really do that until I have the same hardware config. It doesn't say so here (sigh), but I was running this test on a bare-metal cluster; OCP 4.3 will support bare-metal deployments with UPI, and OCS should run on that.

Comment 25 Jan Stancek 2020-03-12 07:49:59 UTC
Patch(es) committed on kernel-3.10.0-1128.el7

Comment 32 Veera Raghava Reddy 2020-08-27 07:04:53 UTC
Ceph QE is running sanity tests. Will update the BZ once the tests are complete.

Comment 35 errata-xmlrpc 2020-09-29 21:00:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4060

