Bug 1819483
Summary: | [Tracker for BZ #1900111] Ceph MDS won't run in OCS with millions of files | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Ben England <bengland>
Component: | ceph | Assignee: | Douglas Fuller <dfuller>
Status: | CLOSED ERRATA | QA Contact: | Warren <wusui>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.3 | CC: | dcritch, ekuric, gfarnum, kramdoss, madam, muagarwa, ocs-bugs, owasserm, pdonnell, ratamir, sewagner, sostapov
Target Milestone: | --- | Keywords: | Performance, Tracking
Target Release: | OCS 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | 4.8.0-416.ci | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-08-03 18:15:11 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1900111 | |
Bug Blocks: | | |
Description
Ben England
2020-04-01 00:06:13 UTC
After raising the memory request and limit for the MDS from 8 GiB to 40 GiB, the workload completed successfully and the MDS is still up. Memory size of the MDS is ~36.5 GB, over 4 times the default memory limit. Data for this run is in http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/ocp4/fsd-bz/fatmds . Here is the must-gather data from after the test completed: must-gather.local.8666068670366363853/ and a screenshot of the workload throughput history: fsdrift.png

After the workload had stopped for a few hours, I see that little of the 22 GB of memory allocated to the process is actually in use. What's up with that? It appears that the MDS process's tcmalloc library is not releasing unused memory back to the operating system, which prevents other pods from gaining access to it. In this case, actual RSS is 22.8 GiB, of which only 4.1 GiB (about 20%) is actually in use!

```
[root@e24-h17-740xd ~]# cephpod tell mds.example-storagecluster-cephfilesystem-a heap stats
2020-04-01 12:59:15.165 7f6355ffb700  0 client.1063733 ms_handle_reset on v2:10.128.1.16:6800/2259801570
2020-04-01 12:59:15.182 7f6356ffd700  0 client.1063748 ms_handle_reset on v2:10.128.1.16:6800/2259801570
mds.example-storagecluster-cephfilesystem-a tcmalloc heap stats:------------------------------------------------
MALLOC:     4498427416 ( 4290.0 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +  19586313344 (18679.0 MiB) Bytes in central cache freelist
MALLOC: +      8644352 (    8.2 MiB) Bytes in transfer cache freelist
MALLOC: +     26730088 (   25.5 MiB) Bytes in thread cache freelists
MALLOC: +    201195520 (  191.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  24321310720 (23194.6 MiB) Actual memory used (physical + swap)
MALLOC: +  14993752064 (14299.2 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  39315062784 (37493.8 MiB) Virtual address space used
MALLOC:
MALLOC:        2619510              Spans in use
MALLOC:             21              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
```

The original problem is reproducible: when I lowered the memory limit from 40 GiB back to 8 GiB and started a new test, within 45 minutes the MDS had gone into CrashLoopBackOff (CLBO). Once again, increasing the memory limit to 40 GiB (an edit to the deployment) brought it out of CLBO and into the Running state immediately.

This time I also made some other changes:

- increased the CPU core limit to 6, to see if it would go any faster than it did with 3
- put the metadata pool on SSD and the data pool on HDD

So far this hasn't made a difference, but I expect it will when the metadata pool becomes uncacheable. The second part was done with this change:

```
ceph osd crush rule create-replicated ssd default host ssd
ceph osd crush rule create-replicated hdd default host hdd
ceph osd pool set example-storagecluster-cephfilesystem-metadata crush_rule ssd
ceph osd pool set example-storagecluster-cephfilesystem-data0 crush_rule hdd
```

Then, to make it move the pools faster:

```
ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.0'
```

I resolved the problem in comment 3 with the command:

```
cephpod tell mds.example-storagecluster-cephfilesystem-a injectargs '--mds_cache_reservation 0.95'
```

Heap release did not work because the MDS was tracking client "caps" (client-side cache); the above command forces the clients to release their "caps", and then the MDS can let go of them. Then you set it back to the default.
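A condensed, hedged sketch of the cap-recall workaround described above, for anyone reproducing it. It assumes `cephpod` is the reporter's wrapper for running `ceph` inside the toolbox pod, that the MDS name matches the heap-stats output above, and that 0.05 is the default `mds_cache_reservation` (as stated later in this report). This is the reporter's temporary mitigation, not a general fix.

```
# Hedged sketch of the temporary workaround above; "cephpod" and the MDS name
# are assumptions taken from this report and will differ on other clusters.
MDS=mds.example-storagecluster-cephfilesystem-a

# Reserve 95% of the cache so the MDS aggressively recalls client caps.
cephpod tell "$MDS" injectargs '--mds_cache_reservation 0.95'

# Once caps are released, ask tcmalloc to return freed pages to the kernel.
cephpod tell "$MDS" heap release

# Confirm that "Bytes in central cache freelist" has dropped.
cephpod tell "$MDS" heap stats

# Restore the default reservation (0.05, per the default quoted below).
cephpod tell "$MDS" injectargs '--mds_cache_reservation 0.05'
```

The `heap release` step corresponds to the ReleaseFreeMemory() hint in the tcmalloc output above; as described in this report, it only helps once the MDS has actually dropped the cached metadata pinned by client caps.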
Is that a Ceph or an OCS bug?

IMO comment 5 *might* be a Ceph bug. But overall this seems to be an OCS bug - when I reconfigured as above, I just ran a test for 20 hours with fs-drift and Cephfs stayed up and completed the run. I'll have to try it with smaller files, but I was able to fill the storage up to 60% with a 1/2 MB average file size. It wasn't that hard to reconfigure Kubernetes to supply different amounts of memory to the MDS (see the sketch below); it seems like it would be possible to adjust the memory limit based on whether Cephfs was being used or not. I'll try to make it run with less memory, and/or use more MDS servers, and look into how comment 5 happened.

```
[root@e24-h17-740xd ~]# cephpod df
RAW STORAGE:
    CLASS     SIZE      AVAIL     USED       RAW USED     %RAW USED
    hdd       83 TiB    41 TiB    42 TiB       42 TiB         50.83
    ssd       13 TiB    13 TiB    1.5 GiB      49 GiB          0.36
    TOTAL     97 TiB    54 TiB    42 TiB       42 TiB         43.82

POOLS:
    POOL                                               ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    example-storagecluster-cephblockpool                1     1.4 GiB         433     4.2 GiB      0.01       9.2 TiB
    example-storagecluster-cephfilesystem-metadata      4     6.4 GiB       6.47k     6.9 GiB      0.06       3.8 TiB
    example-storagecluster-cephfilesystem-data0         6      13 TiB      50.40M      42 TiB     64.01       7.9 TiB
```

...

We have upcoming (upstream) kernel and userspace patches that cause clients to more proactively drop unused caps, which may help, but in general, if the client is holding references to inodes or dentries, the MDS must cache those inodes/dentries in memory as well. However, if setting a very aggressive cache reservation caused the MDS to recall a bunch of client caps, it sounds like either:

* the MDS was (buggily) not requesting that clients drop cached data previously, or
* the MDS cache setting is mismatched in comparison to the container memory limits, or
* there's something else more complicated going on.

Perhaps the following: the heap stats dump you posted does seem to indicate that the MDS had released a bunch of memory back to the allocator, but it hadn't returned that memory to the kernel. You say that invoking the heap release command didn't actually give any memory back? Were these machines using hugepages? That might account for it, if I correctly recall some work Mark did with the OSDs, the hugepage allocator, and tcmalloc.

Moving this out to OCS 4.5 for further analysis. 4.3 is almost out, and 4.4 is pretty much closed.

Thx Greg, will retry with OCS 4.5 and RHCOS 4.4 in scale lab with 26-node cluster. I think some of MDS memory problem was caps, but I think the version I was using had hugepages disabled already. I can manually reduce caps and see how this impacts the problem. As I recall, the problem went away when I released all the caps by changing "mds cache reservation" from 0.05 default to something really high. But I shouldn't have to do this manually.

(In reply to Ben England from comment #10)
> Thx Greg, will retry with OCS 4.5 and RHCOS 4.4 in scale lab with 26-node
> cluster. I think some of MDS memory problem was caps, but I think the
> version I was using had hugepages disabled already. I can manually reduce
> caps and see how this impacts the problem. As I recall, the problem went
> away when I released all the caps by changing "mds cache reservation" from
> 0.05 default to something really high. But I shouldn't have to do this
> manually.

Let us know what the results are!

Moving out of OCS 4.5 for now.

Was unable to re-run in the scale lab due to other issues that took higher priority - see the document that was produced for OCS 4.5. This should be part of a QE test, I think. cc'ing Karthick.
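Ben notes above that it wasn't hard to reconfigure Kubernetes to give the MDS more memory. Below is a minimal sketch of doing that declaratively rather than hand-editing the MDS deployment (which the operator may reconcile away). It assumes a StorageCluster named `ocs-storagecluster` in the `openshift-storage` namespace, that this OCS release honors a `spec.resources.mds` override, and that the MDS pods carry an `app=rook-ceph-mds` label; these are typical-install assumptions, not facts confirmed in this bug.

```
# Hedged sketch: raise the MDS memory request/limit to the 40 GiB used in the
# test above via the StorageCluster CR. The resource name, namespace, the
# spec.resources.mds field, and the pod label are assumptions about a typical
# OCS install and may differ on a given cluster.
oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge \
  -p '{"spec":{"resources":{"mds":{"requests":{"memory":"40Gi"},"limits":{"memory":"40Gi"}}}}}'

# Verify the new limit propagated to the MDS pods.
oc -n openshift-storage get pods -l app=rook-ceph-mds \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.limits.memory}{"\n"}{end}'
```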
(In reply to Ben England from comment #5)
> I resolved the problem in comment 3 with the command:
>
> cephpod tell mds.example-storagecluster-cephfilesystem-a injectargs
> '--mds_cache_reservation 0.95'
>
> Heap release did not work because MDS was tracking client "caps"
> (client-side cache), the above command forces the clients to release their
> "caps" and then MDS can let go of them. Then you set it back to the default.

This will effectively reduce the MDS cache size to 5% of its target. So the MDS will be aggressively recalling caps from clients, which apparently helps with fixing this. It may be that for OCS we are allowing clients to hold too many caps. Maybe try (see the hedged sketch at the end of this report):

```
ceph config mds mds_max_caps_per_client 100K
```

Although, from the workload description, I'm not really sure the clients are holding on to too many caps (100k files total?). It may simply be that the configured cache memory limit is too high as well.

Who owns the next step here from QE?

My point is that this workload should not fail catastrophically with OCS defaults. Short-term, it is OK if it runs slow and can be tuned to be faster (for now ;-). Suggestions above about caps might do that. Long-term, I think MDS memory (and eventually number of MDS pods) has to be adjustable based on the amount of files (metadata) under management by Cephfs if you want respectable performance at scale without wasting lots of memory on MDS in cases where Cephfs is not used.

(In reply to Yaniv Kaul from comment #15)
> Who owns the next step here from QE?

I'll take the AI for the ocsqe-qpas team to reproduce the issue.

(In reply to Ben England from comment #16)
> My point is that this workload should not fail catastrophically with OCS
> defaults. Short-term, it is OK if it runs slow and can be tuned to be
> faster (for now ;-). Suggestions above about caps might do that.
> Long-term, I think MDS memory (and eventually number of MDS pods) has to be
> adjustable based on the amount of files (metadata) under management by
> Cephfs if you want respectable performance at scale without wasting lots of
> memory on MDS in cases where Cephfs is not used.

So essentially expose a custom metric (number of files?!) and, based on it, Kube will restart the MDS (passive first, active later?) with new memory/CPU values?

Thanks Karthick. Clearing needinfo.

Hi Karthick, did you get a chance to work on this? Should we keep it in OCS 4.6?

(In reply to Mudit Agarwal from comment #20)
> Hi Karthick, did you get a chance to work on this?
> Should we keep it in OCS 4.6?

Warren is working on automating this scenario. We should most probably have an update by the end of this week. Raising needinfo on Warren to update the bug.

Not a blocker, moving it out. The dependent Ceph BZ is targeted for 4.2z2.

I ran tests on this today that created and deleted a million small files. The MDSes were up and running at the end and did not go down during the run. Is this behavior good enough to consider this BZ verified?

The MDSes remained up after 2 million small files were written and deleted today. I am marking this as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003
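For reference, a hedged sketch of the cap-limit tuning suggested in the thread above, expressed against Ceph's centralized config store. The 100000 value simply mirrors the "100K" suggestion, and the cache-memory figure is purely illustrative; neither is an OCS default nor a value verified in this bug.

```
# Hedged sketch of the suggested cap-limit tuning; values are illustrative.
ceph config set mds mds_max_caps_per_client 100000

# Optionally keep the MDS cache memory target below the pod memory limit;
# 4 GiB here is an example figure, not an OCS default.
ceph config set mds mds_cache_memory_limit 4294967296

# Confirm the settings took effect.
ceph config get mds mds_max_caps_per_client
ceph config get mds mds_cache_memory_limit
```

As noted in the discussion above, whether this tuning is appropriate depends on how many caps the clients actually hold; it trades client-side caching for MDS memory headroom.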