Bug 1819483
Summary: | [Tracker for BZ #1900111] Ceph MDS won't run in OCS with millions of files | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Ben England <bengland>
Component: | ceph | Assignee: | Douglas Fuller <dfuller>
Status: | CLOSED ERRATA | QA Contact: | Warren <wusui>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.3 | CC: | dcritch, ekuric, gfarnum, kramdoss, madam, muagarwa, ocs-bugs, owasserm, pdonnell, ratamir, sewagner, sostapov
Target Milestone: | --- | Keywords: | Performance, Tracking
Target Release: | OCS 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | 4.8.0-416.ci | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-08-03 18:15:11 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1900111 | |
Bug Blocks: | | |
Description
Ben England
2020-04-01 00:06:13 UTC
After raising the memory request and limit for the MDS from 8 GiB to 40 GiB, the workload completed successfully and the MDS is still up. Memory size of the MDS is ~36.5 GB, over 4 times the default memory limit. Data for this run is in http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/ocp4/fsd-bz/fatmds . Here is the must-gather data from after the test completed: must-gather.local.8666068670366363853/ and a screenshot of the workload throughput history: fsdrift.png

After the workload had stopped for a few hours, I see that little of the 22 GB of memory allocated to the process is actually in use. What's up with that? It appears that the MDS process's tcmalloc library is not releasing unused memory back to the operating system, which prevents other pods from gaining access to it. In this case, actual RSS is 22.8 GiB, of which only 4.1 GiB (about 20%) is actually in use!

```
[root@e24-h17-740xd ~]# cephpod tell mds.example-storagecluster-cephfilesystem-a heap stats
2020-04-01 12:59:15.165 7f6355ffb700  0 client.1063733 ms_handle_reset on v2:10.128.1.16:6800/2259801570
2020-04-01 12:59:15.182 7f6356ffd700  0 client.1063748 ms_handle_reset on v2:10.128.1.16:6800/2259801570
mds.example-storagecluster-cephfilesystem-a tcmalloc heap stats:------------------------------------------------
MALLOC:     4498427416 ( 4290.0 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +  19586313344 (18679.0 MiB) Bytes in central cache freelist
MALLOC: +      8644352 (    8.2 MiB) Bytes in transfer cache freelist
MALLOC: +     26730088 (   25.5 MiB) Bytes in thread cache freelists
MALLOC: +    201195520 (  191.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  24321310720 (23194.6 MiB) Actual memory used (physical + swap)
MALLOC: +  14993752064 (14299.2 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  39315062784 (37493.8 MiB) Virtual address space used
MALLOC:
MALLOC:        2619510              Spans in use
MALLOC:             21              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
```

The original problem is reproducible: when I lowered the memory limit from 40 GiB back to 8 GiB and started a new test, within 45 minutes the MDS had gone into CrashLoopBackOff (CLBO). Once again, increasing the memory limit to 40 GiB (an edit to the deployment) brought it out of CLBO and into the Running state immediately.

This time I also made some other changes:

- increased the CPU core limit to 6, to see if it would go any faster than it did with 3
- put the metadata pool on SSD and the data pool on HDD

So far this hasn't made a difference, but I expect it will when the metadata pool becomes uncacheable. The second part was done with this change:

```
ceph osd crush rule create-replicated ssd default host ssd
ceph osd crush rule create-replicated hdd default host hdd
ceph osd pool set example-storagecluster-cephfilesystem-metadata crush_rule ssd
ceph osd pool set example-storagecluster-cephfilesystem-data0 crush_rule hdd
```

Then, to make it move the pools faster:

```
ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.0'
```

I resolved the problem in comment 3 with the command:

```
cephpod tell mds.example-storagecluster-cephfilesystem-a injectargs '--mds_cache_reservation 0.95'
```

Heap release did not work because the MDS was tracking client "caps" (client-side cache); the above command forces the clients to release their "caps", and then the MDS can let go of them. Then you set it back to the default.
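A condensed, hedged sketch of the cap-recall workaround described above, for anyone reproducing it. It assumes `cephpod` is the reporter's wrapper for running `ceph` inside the toolbox pod, that the MDS name matches the heap-stats output above, and that 0.05 is the default `mds_cache_reservation` (as stated later in this report). This is the reporter's temporary mitigation, not a general fix.

```
# Hedged sketch of the temporary workaround above; "cephpod" and the MDS name
# are assumptions taken from this report and will differ on other clusters.
MDS=mds.example-storagecluster-cephfilesystem-a

# Reserve 95% of the cache so the MDS aggressively recalls client caps.
cephpod tell "$MDS" injectargs '--mds_cache_reservation 0.95'

# Once caps are released, ask tcmalloc to return freed pages to the kernel.
cephpod tell "$MDS" heap release

# Confirm that "Bytes in central cache freelist" has dropped.
cephpod tell "$MDS" heap stats

# Restore the default reservation (0.05, per the default quoted below).
cephpod tell "$MDS" injectargs '--mds_cache_reservation 0.05'
```

The `heap release` step corresponds to the ReleaseFreeMemory() hint in the tcmalloc output above; as described in this report, it only helps once the MDS has actually dropped the cached metadata pinned by client caps.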
Is that a Ceph or an OCS bug?

IMO comment 5 *might* be a Ceph bug. But overall this seems to be an OCS bug - when I reconfigured as above, I just ran a test for 20 hours with fs-drift and Cephfs stayed up and completed the run. I'll have to try it with smaller files, but I was able to fill the storage up to 60% with a 1/2 MB average file size. It wasn't that hard to reconfigure Kubernetes to supply different amounts of memory to the MDS (see the sketch below); it seems like it would be possible to adjust the memory limit based on whether Cephfs was being used or not. I'll try to make it run with less memory, and/or use more MDS servers, and look into how comment 5 happened.

```
[root@e24-h17-740xd ~]# cephpod df
RAW STORAGE:
    CLASS     SIZE      AVAIL     USED       RAW USED     %RAW USED
    hdd       83 TiB    41 TiB    42 TiB       42 TiB         50.83
    ssd       13 TiB    13 TiB    1.5 GiB      49 GiB          0.36
    TOTAL     97 TiB    54 TiB    42 TiB       42 TiB         43.82

POOLS:
    POOL                                               ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    example-storagecluster-cephblockpool                1     1.4 GiB         433     4.2 GiB      0.01       9.2 TiB
    example-storagecluster-cephfilesystem-metadata      4     6.4 GiB       6.47k     6.9 GiB      0.06       3.8 TiB
    example-storagecluster-cephfilesystem-data0         6      13 TiB      50.40M      42 TiB     64.01       7.9 TiB
```

...

We have upcoming (upstream) kernel and userspace patches that cause clients to more proactively drop unused caps, which may help, but in general, if the client is holding references to inodes or dentries, the MDS must cache those inodes/dentries in memory as well. However, if setting a very aggressive cache reservation caused the MDS to recall a bunch of client caps, it sounds like either:

* the MDS was (buggily) not requesting that clients drop cached data previously, or
* the MDS cache setting is mismatched in comparison to the container memory limits, or
* there's something else more complicated going on.

Perhaps the following: the heap stats dump you posted does seem to indicate that the MDS had released a bunch of memory back to the allocator, but it hadn't returned that memory to the kernel. You say that invoking the heap release command didn't actually give any memory back? Were these machines using hugepages? That might account for it, if I correctly recall some work Mark did with the OSDs, the hugepage allocator, and tcmalloc.

Moving this out to OCS 4.5 for further analysis. 4.3 is almost out, and 4.4 is pretty much closed.

Thx Greg, will retry with OCS 4.5 and RHCOS 4.4 in scale lab with 26-node cluster. I think some of MDS memory problem was caps, but I think the version I was using had hugepages disabled already. I can manually reduce caps and see how this impacts the problem. As I recall, the problem went away when I released all the caps by changing "mds cache reservation" from 0.05 default to something really high. But I shouldn't have to do this manually.

(In reply to Ben England from comment #10)
> Thx Greg, will retry with OCS 4.5 and RHCOS 4.4 in scale lab with 26-node
> cluster. I think some of MDS memory problem was caps, but I think the
> version I was using had hugepages disabled already. I can manually reduce
> caps and see how this impacts the problem. As I recall, the problem went
> away when I released all the caps by changing "mds cache reservation" from
> 0.05 default to something really high. But I shouldn't have to do this
> manually.

Let us know what the results are!

Moving out of OCS 4.5 for now.

Was unable to re-run in the scale lab due to other issues that took higher priority - see the document that was produced for OCS 4.5. This should be part of a QE test, I think. cc'ing Karthick.
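Ben notes above that it wasn't hard to reconfigure Kubernetes to give the MDS more memory. Below is a minimal sketch of doing that declaratively rather than hand-editing the MDS deployment (which the operator may reconcile away). It assumes a StorageCluster named `ocs-storagecluster` in the `openshift-storage` namespace, that this OCS release honors a `spec.resources.mds` override, and that the MDS pods carry an `app=rook-ceph-mds` label; these are typical-install assumptions, not facts confirmed in this bug.

```
# Hedged sketch: raise the MDS memory request/limit to the 40 GiB used in the
# test above via the StorageCluster CR. The resource name, namespace, the
# spec.resources.mds field, and the pod label are assumptions about a typical
# OCS install and may differ on a given cluster.
oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge \
  -p '{"spec":{"resources":{"mds":{"requests":{"memory":"40Gi"},"limits":{"memory":"40Gi"}}}}}'

# Verify the new limit propagated to the MDS pods.
oc -n openshift-storage get pods -l app=rook-ceph-mds \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.limits.memory}{"\n"}{end}'
```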
(In reply to Ben England from comment #5)
> I resolved the problem in comment 3 with the command:
>
> cephpod tell mds.example-storagecluster-cephfilesystem-a injectargs
> '--mds_cache_reservation 0.95'
>
> Heap release did not work because MDS was tracking client "caps"
> (client-side cache), the above command forces the clients to release their
> "caps" and then MDS can let go of them. Then you set it back to the default.

This will effectively reduce the MDS cache size to 5% of its target. So the MDS will be aggressively recalling caps from clients, which apparently helps with fixing this. It may be that for OCS we are allowing clients to hold too many caps. Maybe try (see the hedged sketch at the end of this report):

```
ceph config mds mds_max_caps_per_client 100K
```

Although, from the workload description, I'm not really sure the clients are holding on to too many caps (100k files total?). It may simply be that the configured cache memory limit is too high as well.

Who owns the next step here from QE?

My point is that this workload should not fail catastrophically with OCS defaults. Short-term, it is OK if it runs slow and can be tuned to be faster (for now ;-). Suggestions above about caps might do that. Long-term, I think MDS memory (and eventually number of MDS pods) has to be adjustable based on the amount of files (metadata) under management by Cephfs if you want respectable performance at scale without wasting lots of memory on MDS in cases where Cephfs is not used.

(In reply to Yaniv Kaul from comment #15)
> Who owns the next step here from QE?

I'll take the AI for the ocsqe-qpas team to reproduce the issue.

(In reply to Ben England from comment #16)
> My point is that this workload should not fail catastrophically with OCS
> defaults. Short-term, it is OK if it runs slow and can be tuned to be
> faster (for now ;-). Suggestions above about caps might do that.
> Long-term, I think MDS memory (and eventually number of MDS pods) has to be
> adjustable based on the amount of files (metadata) under management by
> Cephfs if you want respectable performance at scale without wasting lots of
> memory on MDS in cases where Cephfs is not used.

So essentially expose a custom metric (number of files?!) and, based on it, Kube will restart the MDS (passive first, active later?) with new memory/CPU values?

Thanks Karthick. Clearing needinfo.

Hi Karthick, did you get a chance to work on this? Should we keep it in OCS 4.6?

(In reply to Mudit Agarwal from comment #20)
> Hi Karthick, did you get a chance to work on this?
> Should we keep it in OCS 4.6?

Warren is working on automating this scenario. We should most probably have an update by the end of this week. Raising needinfo on Warren to update the bug.

Not a blocker, moving it out. The dependent Ceph BZ is targeted for 4.2z2.

I ran tests on this today that created and deleted a million small files. The MDSes were up and running at the end and did not go down during the run. Is this behavior good enough to consider this BZ verified?

The MDSes remained up after 2 million small files were written and deleted today. I am marking this as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003
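For reference, a hedged sketch of the cap-limit tuning suggested in the thread above, expressed against Ceph's centralized config store. The 100000 value simply mirrors the "100K" suggestion, and the cache-memory figure is purely illustrative; neither is an OCS default nor a value verified in this bug.

```
# Hedged sketch of the suggested cap-limit tuning; values are illustrative.
ceph config set mds mds_max_caps_per_client 100000

# Optionally keep the MDS cache memory target below the pod memory limit;
# 4 GiB here is an example figure, not an OCS default.
ceph config set mds mds_cache_memory_limit 4294967296

# Confirm the settings took effect.
ceph config get mds mds_max_caps_per_client
ceph config get mds mds_cache_memory_limit
```

As noted in the discussion above, whether this tuning is appropriate depends on how many caps the clients actually hold; it trades client-side caching for MDS memory headroom.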