- Description of problem: After a multi-hour mixed-workload CephFS test, the Ceph MDS service is down and can't restart. OSDs are fully functional and rados bench works fine.

- Version of all relevant components (if applicable):
  version 4.3.3  True  False  10d  Cluster version is 4.3.3
  image: quay.io/rhceph-dev/ocs-olm-operator:latest-4.3
  containerImage: quay.io/ocs-dev/ocs-operator:4.3.0
  rhcos_url: http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/ocp4/rhcos-4.3.0-x86_64/
  openshift_release_url: http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/ocp4/openshift-4.3/

- hw config: 7 Dell 740xd, each with 192 GiB RAM, 56 cores, 2 25-GbE NIC ports, 1 Samsung NVMe SSD, and 8-12 HDDs.

- Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
  As a performance engineer, I can't positively answer questions about scalability and durability of CephFS in OCS because of this problem.

- Is there any workaround available to the best of your knowledge?
  I'll experiment with raising the memory request and limit for the MDS pods. This seems to get the MDS back on its feet enough to abort the test. I'm doing this by oc edit of deployment.apps/rook-ceph-mds-example-storagecluster-cephfilesystem-{a,b} so that the memory limit is 40 GiB instead of 8 GiB, then deleting the replicasets and the pods (a command sketch follows the additional info below). It could perhaps go higher. Raising MDS memory is recommended in the Ceph documentation:
  https://docs.ceph.com/docs/master/cephfs/add-remove-mds/#provisioning-hardware-for-an-mds
  The Ceph documentation also recommends an all-SSD metadata pool; I think I can try that out as well.

- Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

- Is this issue reproducible? Don't know yet; it takes a day to reproduce and then bring the pool back down to zero again (after deleting the CephFS PVC).

- Can this issue reproduce from the UI? Don't know.

- If this is a regression, please provide more details to justify this:
  Not sure this is a regression; not many people test CephFS with 40 million objects in the data pool. However, NVIDIA has run RHCS with a lot of files, though they had problems with small files too.

Steps to Reproduce:
1. install OCP using the procedure below
2. install OCS using the procedure below
3. run ripsaw's fs-drift benchmark with the parameters below, and wait

- Actual results:
  I attempt to restart the MDS by deleting the active and backup MDS pods, resulting in:

  [root@e24-h17-740xd must-gather]# ocos get pod | grep mds
  rook-ceph-mds-example-storagecluster-cephfilesystem-a-79ff5bfqv   0/1   OOMKilled   1   56s
  rook-ceph-mds-example-storagecluster-cephfilesystem-b-649bp9pkj   1/1   Running     0   11s

  You can see memory climb right up to 8 GB RSS for the MDS pod; it is then OOMKilled, and after a while Kubernetes gives up on restarting it. I have to clear out the pools manually and see whether the object count in the pool was the source of the problem or not (perhaps directories got too big).

- Expected results:
  The MDS *never* goes down, though its performance may degrade under some circumstances (if it can't cache enough metadata). At the very least, if it encounters a condition it can't handle, it should log an error message and inform the user of the limitation.

- Additional info:
  Where do I begin :-)  Files for this bz are here:
  http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/ocp4/fsd-bz/
  Anyone outside Red Hat will not be able to access them; contact me at bengland and I will make them available to you, no secrets here.
Must-gather info is in this sub-directory: must-gather.local.6842679217637098243/
fs-drift benchmark CR is here: fsd-benchmark-10hr.yaml
OCP4 deployment done with these UPI-based playbooks: https://github.com/bengland2/ocp4_upi_baremetal
OCS deployment done with this script and scripts in the same directory: http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/openshift/upi/ocs-bringup/all-of-it.sh
ripsaw deployment done as described in the documentation here: https://github.com/cloud-bulldozer/ripsaw/blob/master/docs/installation.md

This is a workload consisting of 50 pods accessing a shared filesystem of up to 100,000 files of at most 1 MiB in size, with a 20-TiB filesystem limit. So there should only be 100,000 files.

Workload documentation is here: https://github.com/cloud-bulldozer/ripsaw/blob/master/docs/fs-drift.md

HTH -ben
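A rough sketch of the memory-limit workaround mentioned above, assuming the MDS deployments live in the usual openshift-storage namespace and that the MDS pods carry Rook's app=rook-ceph-mds label (adjust names to your cluster; this is not the exact oc edit I used):

# raise memory request and limit on both MDS deployments
oc -n openshift-storage set resources deployment \
   rook-ceph-mds-example-storagecluster-cephfilesystem-a \
   rook-ceph-mds-example-storagecluster-cephfilesystem-b \
   --requests=memory=40Gi --limits=memory=40Gi
# then recycle the old pods so the new limit takes effect
oc -n openshift-storage delete pod -l app=rook-ceph-mds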
After raising the memory request and limit for the MDS to 40 GiB from 8 GiB, the workload completed successfully, and the MDS is still up. Memory size of the MDS is ~36.5 GB, over 4 times the default memory limit.

Data for this run is in http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/ocp4/fsd-bz/fatmds

Here is the must-gather data from after the test completed: must-gather.local.8666068670366363853/
and a screenshot of the workload throughput history: fsdrift.png
After the workload had stopped for a few hours, I see that little of the 22 GB of memory allocated to the process is actually in use. What's up with that? It appears that the MDS process's tcmalloc library is not releasing unused memory back to the operating system, which prevents other pods from gaining access to it. In this case, actual RSS is 22.8 GiB, of which 4.1 GiB (about 20%) is actually in use!

[root@e24-h17-740xd ~]# cephpod tell mds.example-storagecluster-cephfilesystem-a heap stats
2020-04-01 12:59:15.165 7f6355ffb700  0 client.1063733 ms_handle_reset on v2:10.128.1.16:6800/2259801570
2020-04-01 12:59:15.182 7f6356ffd700  0 client.1063748 ms_handle_reset on v2:10.128.1.16:6800/2259801570
mds.example-storagecluster-cephfilesystem-a tcmalloc heap stats:------------------------------------------------
MALLOC:       4498427416 ( 4290.0 MiB) Bytes in use by application
MALLOC: +              0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    19586313344 (18679.0 MiB) Bytes in central cache freelist
MALLOC: +        8644352 (    8.2 MiB) Bytes in transfer cache freelist
MALLOC: +       26730088 (   25.5 MiB) Bytes in thread cache freelists
MALLOC: +      201195520 (  191.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =    24321310720 (23194.6 MiB) Actual memory used (physical + swap)
MALLOC: +    14993752064 (14299.2 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =    39315062784 (37493.8 MiB) Virtual address space used
MALLOC:
MALLOC:          2619510              Spans in use
MALLOC:               21              Thread heaps in use
MALLOC:             8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
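For completeness, the heap-release attempt referred to in later comments looked roughly like this (same cephpod wrapper as above; it asks tcmalloc to return freelist memory to the OS):

cephpod tell mds.example-storagecluster-cephfilesystem-a heap release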
The original problem is reproducible: when I lowered the memory limit from 40 GiB back to 8 GiB and started a new test, within 45 minutes the MDS had gone into CLBO (CrashLoopBackOff) state. And once again, increasing the memory limit to 40 GiB (by editing the deployment) brought it out of CLBO immediately into Running state.

This time I also made some other changes:
- increased the CPU core limit to 6 to see if it would go any faster than it did with 3
- put the metadata pool on SSD and the data pool on HDD

So far this hasn't made a difference, but I expect it will when the metadata pool becomes uncacheable. The second part was done with these commands:

ceph osd crush rule create-replicated ssd default host ssd
ceph osd crush rule create-replicated hdd default host hdd
ceph osd pool set example-storagecluster-cephfilesystem-metadata crush_rule ssd
ceph osd pool set example-storagecluster-cephfilesystem-data0 crush_rule hdd

Then, to make it move the pools faster:

ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.0'
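To double-check that the pools actually picked up the new rules, something like this should do (a sketch, using the same cephpod wrapper as in the other comments):

cephpod osd pool get example-storagecluster-cephfilesystem-metadata crush_rule
cephpod osd pool get example-storagecluster-cephfilesystem-data0 crush_rule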
I resolved the problem in comment 3 with the command:

cephpod tell mds.example-storagecluster-cephfilesystem-a injectargs '--mds_cache_reservation 0.95'

Heap release did not work because the MDS was tracking client "caps" (client-side cache state); the above command forces the clients to release their caps, and then the MDS can let go of them. Then you set it back to the default.
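For reference, the whole sequence as a sketch (0.05 is the default mds_cache_reservation mentioned later in this bug):

# shrink the effective cache so the MDS recalls caps from clients
cephpod tell mds.example-storagecluster-cephfilesystem-a injectargs '--mds_cache_reservation 0.95'
# once memory has come back down, restore the default
cephpod tell mds.example-storagecluster-cephfilesystem-a injectargs '--mds_cache_reservation 0.05'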
Is that a Ceph or an OCS bug?
IMO comment 5 *might* be a Ceph bug. But overall this seems to be an OCS bug - after I reconfigured as above, I ran a test for 20 hours with fs-drift and CephFS stayed up and completed the run. I'll have to try it with smaller files, but I was able to fill the storage up to 60% with a 1/2-MB average file size.

It wasn't that hard to reconfigure Kubernetes to supply different amounts of memory to the MDS. It seems like it would be possible to adjust the memory limit based on whether CephFS was being used or not. I'll try to make it run with less memory and/or use more MDS servers, and look into how comment 5 happened.

[root@e24-h17-740xd ~]# cephpod df
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED
    hdd       83 TiB     41 TiB     42 TiB       42 TiB          50.83
    ssd       13 TiB     13 TiB     1.5 GiB      49 GiB           0.36
    TOTAL     97 TiB     54 TiB     42 TiB       42 TiB          43.82

POOLS:
    POOL                                             ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    example-storagecluster-cephblockpool              1     1.4 GiB         433     4.2 GiB      0.01       9.2 TiB
    example-storagecluster-cephfilesystem-metadata    4     6.4 GiB       6.47k     6.9 GiB      0.06       3.8 TiB
    example-storagecluster-cephfilesystem-data0       6      13 TiB      50.40M      42 TiB     64.01       7.9 TiB
...
We have upcoming (upstream) kernel and userspace patches that cause clients to more proactively drop unused caps, which may help, but in general if the client is holding references to inodes or dentries, the MDS must cache those inodes/dentries in memory as well.

However, if setting a very aggressive cache reservation caused the MDS to recall a bunch of client caps, it sounds like either
* the MDS was (buggily) not requesting clients to drop cached data previously, or
* the MDS cache setting is mismatched in comparison to the container memory limits, or
* there's something else more complicated going on. Perhaps the following:

The heap stats dump you posted does seem to indicate that the MDS had released a bunch of memory back to the allocator, but it hadn't returned that memory to the kernel. You say that invoking the heap release command didn't actually give any memory back? Were these machines using hugepages? That might account for it, if I correctly recall some work Mark did with the OSDs, the hugepage allocator, and tcmalloc.
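If it helps, one quick way to check the hugepage question on a worker node (a sketch; <node-name> is a placeholder, and this only looks at transparent hugepages):

oc debug node/<node-name> -- chroot /host cat /sys/kernel/mm/transparent_hugepage/enabled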
Moving this out to ocs 4.5 for further analysis. 4.3 is almost out. 4.4 is pretty much closed.
Thx Greg, will retry with OCS 4.5 and RHCOS 4.4 in the scale lab with a 26-node cluster. I think some of the MDS memory problem was caps, but I think the version I was using already had hugepages disabled. I can manually reduce caps and see how this impacts the problem. As I recall, the problem went away when I released all the caps by changing "mds cache reservation" from the 0.05 default to something really high. But I shouldn't have to do this manually.
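One way to watch the caps while doing that (a sketch, assuming the same cephpod wrapper; session ls reports num_caps per client session):

cephpod tell mds.example-storagecluster-cephfilesystem-a session ls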
(In reply to Ben England from comment #10)
> Thx Greg, will retry with OCS 4.5 and RHCOS 4.4 in the scale lab with a
> 26-node cluster. I think some of the MDS memory problem was caps, but I
> think the version I was using already had hugepages disabled. I can
> manually reduce caps and see how this impacts the problem. As I recall, the
> problem went away when I released all the caps by changing "mds cache
> reservation" from the 0.05 default to something really high. But I
> shouldn't have to do this manually.

Let us know what the results are!
Moving out of ocs 4.5 for now.
I was unable to re-run in the scale lab due to other issues that took higher priority - see the document that was produced for OCS 4.5. This should be part of a QE test, I think. cc'ing Karthick.
(In reply to Ben England from comment #5)
> I resolved the problem in comment 3 with the command:
>
> cephpod tell mds.example-storagecluster-cephfilesystem-a injectargs
> '--mds_cache_reservation 0.95'
>
> Heap release did not work because the MDS was tracking client "caps"
> (client-side cache state); the above command forces the clients to release
> their caps, and then the MDS can let go of them. Then you set it back to
> the default.

This will effectively reduce the MDS cache size to 5% of its target, so the MDS will be aggressively recalling caps from clients, which apparently helps with fixing this. It may be that for OCS we are allowing clients to hold too many caps. Maybe try:

> ceph config mds mds_max_caps_per_client 100K

Although, from the workload description, I'm not really sure the clients are holding on to too many caps (100k files total?). It may simply be that the configured cache memory limit is too high as well.
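Spelled out as a complete command, that suggestion would look roughly like this (a sketch, assuming the same cephpod wrapper used earlier in this bug and writing the value in plain digits):

cephpod config set mds mds_max_caps_per_client 100000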
Who owns the next step here from QE?
My point is that this workload should not fail catastrophically with OCS defaults. Short-term, it is OK if it runs slow and can be tuned to be faster (for now ;-). The suggestions above about caps might do that. Long-term, I think MDS memory (and eventually the number of MDS pods) has to be adjustable based on the amount of files (metadata) under management by CephFS, if you want respectable performance at scale without wasting lots of memory on the MDS in cases where CephFS is not used.
(In reply to Yaniv Kaul from comment #15)
> Who owns the next step here from QE?

I'll take the AI (action item) for the ocsqe-qpas team to reproduce the issue.
(In reply to Ben England from comment #16)
> My point is that this workload should not fail catastrophically with OCS
> defaults. Short-term, it is OK if it runs slow and can be tuned to be
> faster (for now ;-). The suggestions above about caps might do that.
> Long-term, I think MDS memory (and eventually the number of MDS pods) has
> to be adjustable based on the amount of files (metadata) under management
> by CephFS, if you want respectable performance at scale without wasting
> lots of memory on the MDS in cases where CephFS is not used.

So essentially expose a custom metric (number of files?) and, based on it, have Kube restart the MDS (passive first, active later?) with new memory/CPU values?
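As an aside, if the StorageCluster CR in this OCS version exposes per-component resources (that is an assumption on my part, as are the CR name and the 16Gi example value), the knob might look roughly like:

oc -n openshift-storage patch storagecluster example-storagecluster --type merge \
   -p '{"spec":{"resources":{"mds":{"requests":{"memory":"16Gi"},"limits":{"memory":"16Gi"}}}}}'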
Thanks Karthick. Clearing needinfo
Hi Karthick,

Did you get a chance to work on this?
Should we keep it in OCS 4.6?
(In reply to Mudit Agarwal from comment #20)
> Did you get a chance to work on this?
> Should we keep it in OCS 4.6?

Warren is working on automating this scenario. We should most probably have an update by the end of this week. Raising needinfo on Warren to update the bug.
Not a blocker, moving it out.
The dependent Ceph BZ is targeted for 4.2z2.
I ran tests on this today that created and deleted a million small files. The MDSes were up and running at the end and did not go down during the run. Is this behavior good enough to consider this BZ verified?
The MDSes remained up after 2 million small files were written and deleted today. I am marking this as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3003