Bug 1637153
| Summary: | ceph-osd containers' memory RSS inflates until they hit CGroup limit | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Ben England <bengland> |
| Component: | RADOS | Assignee: | Josh Durgin <jdurgin> |
| Status: | CLOSED NOTABUG | QA Contact: | Manohar Murthy <mmurthy> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.1 | CC: | acalhoun, ceph-eng-bugs, chshort, dgurtner, dzafman, hnallurv, jdurgin, johfulto, kchai, lifeistrue, mmuench, mnelson, nojha, smalleni, vumrao |
| Target Milestone: | rc | | |
| Target Release: | 4.* | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-13 21:51:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1638148 | | |
Description
Ben England
2018-10-08 19:01:34 UTC
I got the dump_mempools data (Mark Nelson's suggestion) this morning during a random 4-KB RBD write run with 512 guests, and although I don't yet have the slick graph, the results are very simple to understand. The only two pools getting significant use are the pglog pool (only up to 200 MiB) and the buffer_anon pool. I can see that the buffer_anon pool is growing large, up to 4 GiB. This isn't supposed to happen, is it? No other pool uses significant memory. I'm trying to get some stats or graphs to show this, but it's pretty obvious. Is there any tuning that controls this?

For example, see this output file:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/scale-lab-2018-sep/mempools-2018-10-10-rndwr4k.log

I was sampling every 60 seconds. Look at OSDs 110, 111, 116 and 134: they show this pool getting above 3 GB, whereas the pglog pool does not seem to get bigger than 200 MB (probably because we used the tuning

```
osd max pg log entries = 3000
osd min pg log entries = 3000
```

suggested by Sage). In some cases you see the pool sizes drop to zero because of an OSD restart caused by the CGroup memory limit.

I'll have a graph soon, but I wanted to get started on why this particular pool blew up. I'm going to look at the same data collected during a 4-KB random read test. I'm also going to try to correlate it with /var/log/messages, which shows a LOT of "no reply from" messages where OSD X couldn't talk to OSD Y right around when the OOM kills were happening. Is there any backpressure happening when buffer space gets used up or primary OSDs can't talk to PG peers, or do writes just keep coming in from RADOS clients regardless?

And here's the graph:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/scale-lab-2018-sep/2018-10-10-rnd4kwr-buffer-anon.png

The same test with random reads never uses more than 35 MiB of buffer_anon space.

On a different subject, Mark also suggested that we look at tcmalloc behavior and see if it was giving free memory back to the kernel. It appears that it is not: 1.5 GB in this example was not given back. This may not seem like a big deal, but multiplied by 34 OSDs that's 50 GB of RAM, or 20% of physical memory on this host. Perhaps tcmalloc should be a little more willing to give back free memory, or cap the amount of free memory that it keeps somehow?

```
[root@overcloud-compute-11 /]# ceph daemon osd.495 heap dump
osd.495 dumping heap profile now.
------------------------------------------------
MALLOC:      483712848 (  461.3 MiB) Bytes in use by application
MALLOC: +   1612136448 ( 1537.5 MiB) Bytes in page heap freelist
MALLOC: +     48597552 (   46.3 MiB) Bytes in central cache freelist
MALLOC: +      6358528 (    6.1 MiB) Bytes in transfer cache freelist
MALLOC: +     27644032 (   26.4 MiB) Bytes in thread cache freelists
MALLOC: +      7471104 (    7.1 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2185920512 ( 2084.7 MiB) Actual memory used (physical + swap)
MALLOC: +     30154752 (   28.8 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   2216075264 ( 2113.4 MiB) Virtual address space used
MALLOC:
MALLOC:          53473 Spans in use
MALLOC:             45 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
{
    "error": "(0) Success",
    "success": true
}
```

This problem potentially prevents resolution of bug 1638148, so I updated the "Blocks" field. Targeting RHCS 3.2; I believe this is a blocker.
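For reference, a minimal sketch of the kind of per-minute sampling loop that would produce a log like the one linked above, run on each OSD host. The admin-socket glob and output path are illustrative, not taken from the actual test harness:

```sh
#!/bin/bash
# Sample mempool usage from every local OSD admin socket once a minute.
# In a containerized deployment this runs inside (or execs into) each
# ceph-osd container, since that is where the .asok files live.
while true; do
    date -u '+%Y-%m-%d %H:%M:%S'
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        echo "== ${sock} =="
        ceph daemon "${sock}" dump_mempools
    done
    sleep 60
done >> /tmp/mempools-sampler.log 2>&1
```

The buffer_anon and pglog figures quoted above come from the per-pool "bytes" counters in that dump_mempools output.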
Trying to reproduce in a RHCS container cluster without OpenStack to simplify, and then I will try to come up with the simplest, smallest reproducer. But with different, older hardware I'm not sure that this will work.

osd_client_message_size_cap (the largest client data message allowed in memory) is set to 500 MB by default, so one thing we can try is setting this value to 50 MB and seeing what the impact is.

I think we should see if this problem still happens when we switch to Bluestore; if it doesn't happen with Bluestore, then that's probably the best resolution. Mark Nelson suggests trying a heap release command when this happens, and he says Bluestore will not have this problem because there is automatic heap release code in there. If we can verify that this is the case, then we could just close this as "fixed in Bluestore". Tim and I are working on reproducing this problem in a smaller cluster first, using pbench-fio with lots of /dev/rbdN devices per host.

Tim and I can't reproduce with either Filestore or Bluestore at small scale (25 OSDs); we're looking for an opportunity to get into some site with ~500 OSDs to try to reproduce with RHCS 3.2.

I had the same issue with 108 OSDs. We have some RBDs with snapshots; when we do some snapshot cleanup by removing snapshots, it sometimes causes cascading OOMs for all OSDs across the cluster. Another scenario: we have a cluster with 3 racks, each rack has 3 hosts, each host has 12 OSDs, and each OSD has an 8T disk. The cluster has one pool with 3 replicas and 1024 PGs (we are about to increase the number). When we add a new host (12 OSDs, each OSD with a 10T disk) to a rack, during the backfilling we observe OOMs on the new server, which kill some OSDs, and we have to restart some OSDs to release some memory.

From the Nautilus slides: "OSD hard limit on PG log length - Avoids corner cases that could cause OSD memory utilization to grow unbounded". So I think we should retest with Nautilus in this environment and verify that with Bluestore we no longer hit this bz; if so, then the resolution is "fixed in next release", and perhaps backport to Luminous.

The hard limit patches have been backported to Luminous and should be available in the latest 3.2.

Neha, you were right. I was seeing a similar problem in ceph-mds, and I did dump_mempools and saw this:

```
sh-4.2# ceph daemon /run/ceph/ceph-mds.myfs-b.asok dump_mempools
...
    "buffer_anon": {
        "items": 144834,
        "bytes": 604427203
    },
...
```

As you mentioned:

```
sh-4.2# ceph daemon /run/ceph/ceph-mds.myfs-b.asok config show | grep osd_client_message_size_cap
    "osd_client_message_size_cap": "524288000",
```

I then lowered it:

```
sh-4.2# ceph daemon /run/ceph/ceph-mds.myfs-b.asok config set osd_client_message_size_cap 50000000
```

and saw this:

```
    "buffer_anon": {
        "items": 23380,
        "bytes": 93826424
    },
```

Reducing consumption by about 1/2 GB. Multiply by the number of Ceph daemons in the system and it adds up. Why do we need such a large default value for osd_client_message_size_cap?

Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate. Regards, Giri

Ben, have you seen anything like this with Nautilus + Bluestore?

Maybe Alex Calhoun has; cc'ing him. Most testing is being done with RHOCS right now, but I don't think that matters too much, since the problem is in the OSDs, which are containerized in both cases. I would like to use the https://github.com/cloud-bulldozer/ripsaw fio workload to test this.
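For anyone repeating the osd_client_message_size_cap experiment cluster-wide rather than on a single daemon's admin socket, a rough sketch follows. The 50 MB value mirrors the suggestion above; the socket path is illustrative, and note that a runtime injectargs change does not persist across daemon restarts:

```sh
# Runtime change on all running OSDs (pre-Nautilus style); some options
# may warn that a restart is required before they take full effect.
ceph tell osd.* injectargs '--osd_client_message_size_cap 52428800'

# On Nautilus and later the centralized config database can be used instead,
# which does persist:
ceph config set osd osd_client_message_size_cap 52428800

# Verify on one daemon's admin socket (socket path is illustrative):
ceph daemon /var/run/ceph/ceph-osd.0.asok config get osd_client_message_size_cap

# The manual "heap release" Mark Nelson mentions can be issued per OSD:
ceph tell osd.0 heap release
```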
Mark Nelson is working on an issue involving OSD RSS inflating past osd_memory_target. Our thinking is that THP (transparent hugepages) is responsible for this; he wants to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise", where the current default is "always". I think this is a good idea because the use of THP is very application-specific, and this would allow Ceph to turn it off while allowing other apps to turn it on. BTW, Fedora 30 (5.2.18-200.fc30.x86_64) already has "madvise" as the default. So perhaps a tuned profile should be created to enable this on systems that don't already default to it, something like the tuned profile in the initial post but with

```
[vm]
transparent_hugepages=madvise
```

(a sketch of such a profile is included at the end of this comment).

Ceph has since been altered to disable transparent hugepages in Ceph daemons using the prctl() system call. Bluestore is now used more heavily (how many sites still use Filestore?), and Bluestore does O_DIRECT I/O so it does not use the kernel buffer cache; this too limits RSS consumption. Finally, I think work was done to periodically do a "heap release" in the Ceph OSD and MDS daemons; I think this is in RHCS 4, ask Mark Nelson or Josh Durgin.

To close this bz, someone needs to redo the OpenStack+Ceph tests in the scale lab with OSP16 and see whether the problem has been solved. We need something like Prometheus or Browbeat to track Ceph daemon memory consumption during such a test, and if "ceph tell osd.N heap stats" output were added to the Ceph manager prometheus module it would make it easier to understand what is happening. Sai Sindhur Malleni is the tech lead for Perf & Scale testing for OpenStack right now, so I cc'ed him.

I do not have that information.
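A rough sketch of the kind of tuned profile suggested above; the profile name and the parent profile it includes are placeholders (the original profile from the initial post is not reproduced here), so adapt both to whatever base profile the OSD hosts already use:

```sh
# Create and activate a custom tuned profile that sets THP to "madvise".
mkdir -p /etc/tuned/ceph-osd-madvise
cat > /etc/tuned/ceph-osd-madvise/tuned.conf <<'EOF'
[main]
summary=throughput-performance plus THP set to madvise for Ceph OSD hosts
include=throughput-performance

[vm]
transparent_hugepages=madvise
EOF
tuned-adm profile ceph-osd-madvise

# Confirm the kernel setting took effect:
cat /sys/kernel/mm/transparent_hugepage/enabled
```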