Description of problem:

Under a really heavy I/O load to Ceph RBD volumes, Tim Wilkinson and I have observed that a hyperconverged OpenStack-Ceph cluster breaks down almost immediately, with tens of OSDs being killed and restarted. We appear to have plenty of free memory and the system is not memory-overcommitted. The collapse is caused by a chain reaction:

- some Ceph OSD process exceeds 5 GB in RSS size
- the OOM killer runs and ceph-osd processes are killed
- Ceph goes into recovery mode
- the remaining OSDs consume much more memory than before
- those OSDs hit the cgroup limit and crash, leaving the survivors under even more pressure, and so on
- Ceph reaches a point where some data is inaccessible (not lost) because too many OSDs are down
- VMs hang and eventually emit stack traces reporting I/O timeouts of > 120 seconds

This appears to have nothing to do with OpenStack. Manual intervention is sometimes required to get the system running again. This should never happen.

The hosts are OpenStack-HCI compute hosts with 256 GB RAM. There are exactly 34 OSDs per host, each allowed to consume up to 5 GB of RSS (cgroup limit) for a total of 170 GB, and we also run 34-35 VMs of 1 GB each. In theory the VMs use about 50 GB (34 * 1.5 GB per VM, including overhead), leaving ~35 GB of unused RAM. However, with 128 VMs, KSM disabled, and ceph.conf osd_{max,min}_pg_log_entries=3000, we have run the same fio random read and write workload successfully.

Each VM has a 95 GB Cinder volume with an XFS filesystem on it. The Cinder volume is preallocated (dd zeroes it before XFS is put in place). Ceph space is about 16.57% used out of 865 TB across 476 OSDs (34 2-TB OSDs/host x 14 hosts).

Note that lighter loads may not trigger this behavior, at least not right away. We are trying to measure what workloads are safe.

We use AZs (Availability Zones) so that the Nova scheduler has no choice about where the VMs run. This is necessary for failover testing: it lets us leave one of the Ceph OSD hosts with no VMs on it, so that we are free to bring it down for Ceph failover testing. So that OpenStack Nova will reserve memory for the Ceph OSDs, we use the nova.conf tuning reserved_host_memory_mb=191500.

Version-Release number of selected component (if applicable):

Tim Wilkinson and I are currently running container image 3-13, which contains ceph-osd-12.2.5-42.el7cp.x86_64 (RHCS 3.1). This can be found in a recent RHCS 3.1 build dated 9/12/18: /rel-eng/RHCEPH-3.1-RHEL-7-20180912.0/compose/OSD/x86_64/os/Packages

How reproducible: every time

Steps to Reproduce:
1. deploy OpenStack RHOSP 13 but use Ceph container 3-13 instead of the default 3-12
2. run fio random writes from 256 or 512 VMs targeting Cinder volumes
3. monitor the ceph-osd containers' RSS size using the script below

Actual results: Some OSDs quickly inflate until they hit the CGroup limit. This triggers backfilling, which causes other OSDs to inflate.

Expected results: OSDs should remain stable with an RSS size below the CGroup limit.

Next steps: Is this a RADOS bz or an RBD bz? Can we isolate the cause of the memory inflation further? Ben is trying to gather info using "ceph daemon osd.N dump_mempools", as suggested by Mark Nelson.
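The monitoring script referenced in step 3 is not attached to this comment; a minimal sketch of that kind of RSS sampler (hypothetical, not the exact script we ran; the log path is arbitrary) would be:

#!/bin/bash
# Hypothetical sketch: log the RSS of every ceph-osd process on this host
# once a minute.  Containerized OSDs are still visible in the host's process
# table, so this can run directly on the compute host.
LOG=/var/tmp/ceph-osd-rss.log
while true; do
    ts=$(date +%s)
    # ps reports rss in KiB; one output line per ceph-osd process
    ps -C ceph-osd -o pid=,rss= | while read -r pid rss; do
        echo "$ts pid=$pid rss_kib=$rss"
    done >> "$LOG"
    sleep 60
done

Plotting rss_kib per PID over time gives per-OSD RSS curves.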
Additional info:

Related bzs:
- 1628670 - filed against OpenStack (resolved by kernel tuning)
- 1576095 - ceph-osd memory leak (RHCS 2.5)
- 1599507 - continually increasing OSD mem utilization (RHCS 2.5)

In a related bz, ceph-osd memory growth was fixed by the change described here: https://bugzilla.redhat.com/show_bug.cgi?id=1576095#c79 So if this problem were caused by the absence of that commit, it would already be fixed, but it is not.

For an example of an OOM kill caused by the CGroup memory limit (MANY of these OOM kills are occurring), see: http://pastebin.test.redhat.com/654924

And here is a graph of ceph-osd RSS size over time during the fio random write test: http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/scale-lab-2018-sep/rhcs31-osd_sizes.png

The non-default ceph.conf tuning recommended earlier by Sage and Josh was:

[osd]
...
osd_max_pg_log_entries=3000
osd_min_pg_log_entries=3000

Below is the "ceph-osd" tuned profile that we are using in /usr/lib/tuned/ceph-osd/tuned.conf:

[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance

[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

[vm]
transparent_hugepages=never

[sysfs]
/sys/kernel/mm/ksm/run=0
I got the dump_mempools data (Mark Nelson's suggestion) this morning during a random 4-KB RBD write run with 512 guests, and although I don't yet have the slick graph, the results are very simple to understand. The only 2 pools getting significant use are the pglog pool (which stays under about 200 MiB) and the buffer_anon pool. I can see that the buffer_anon pool is growing large, up to 4 GiB. This isn't supposed to happen, is it? No other pool uses significant memory. I'm trying to get some stats or graphs to show this, but it's pretty obvious. Is there any tuning that controls this?

For example, see this output file: http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/scale-lab-2018-sep/mempools-2018-10-10-rndwr4k.log

I was sampling every 60 seconds. Look at OSDs 110, 111, 116 and 134. They show the buffer_anon pool getting above 3 GB, whereas the pglog pool does not seem to get bigger than 200 MB (probably because we used the tuning "osd max pg log entries = 3000" and "osd min pg log entries = 3000" suggested by Sage). In some cases you see the pool sizes drop to zero because of an OSD restart caused by the CGroup memory limit.

I'll have a graph soon, but I wanted to get started on why this particular pool blew up. I'm going to look at the same data collected during a 4-KB random read test. I'm also going to try to correlate it with /var/log/messages, which shows a LOT of "no reply from" messages, where OSD X couldn't talk to OSD Y, right around when the OOM kills were happening. Is there any backpressure when buffer space gets used up or primary OSDs can't talk to PG peers, or do writes just keep coming in from RADOS clients regardless?
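For reference, a minimal sketch of this kind of 60-second mempool sampler (not necessarily the exact commands used; it assumes jq is installed and the OSD admin sockets are reachable, e.g. by wrapping the ceph command in a docker exec for containerized OSDs):

# Sample buffer_anon and osd_pglog usage from every local OSD admin socket.
# The jq expression tolerates both the flat dump_mempools layout and the
# newer layout nested under "mempool"/"by_pool".
while true; do
    ts=$(date +%s)
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        ceph daemon "$sock" dump_mempools 2>/dev/null | \
            jq -r --arg ts "$ts" --arg sock "$sock" \
               '(.mempool.by_pool // .) |
                "\($ts) \($sock) buffer_anon=\(.buffer_anon.bytes) osd_pglog=\(.osd_pglog.bytes)"'
    done
    sleep 60
done >> /var/tmp/ceph-osd-mempools.log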
And here's the graph: http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/scale-lab-2018-sep/2018-10-10-rnd4kwr-buffer-anon.png

The same test with random reads never uses more than 35 MiB of buffer_anon space.
On a different subject, Mark also suggested that we look at tcmalloc behavior and see if it was giving free memory back to the kernel. It appears that it is not: in this example, 1.5 GB was not given back. This may not seem like a big deal, but multiplied by 34 OSDs that's about 50 GB of RAM, or 20% of physical memory on this host. Perhaps tcmalloc should be a little more willing to give back free memory, or cap the amount of free memory that it keeps somehow?

[root@overcloud-compute-11 /]# ceph daemon osd.495 heap dump
osd.495 dumping heap profile now.
------------------------------------------------
MALLOC:      483712848 (  461.3 MiB) Bytes in use by application
MALLOC: +   1612136448 ( 1537.5 MiB) Bytes in page heap freelist
MALLOC: +     48597552 (   46.3 MiB) Bytes in central cache freelist
MALLOC: +      6358528 (    6.1 MiB) Bytes in transfer cache freelist
MALLOC: +     27644032 (   26.4 MiB) Bytes in thread cache freelists
MALLOC: +      7471104 (    7.1 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2185920512 ( 2084.7 MiB) Actual memory used (physical + swap)
MALLOC: +     30154752 (   28.8 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   2216075264 ( 2113.4 MiB) Virtual address space used
MALLOC:
MALLOC:          53473 Spans in use
MALLOC:             45 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
{
    "error": "(0) Success",
    "success": true
}
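For what it's worth, that ReleaseFreeMemory() call can be triggered through the same admin-socket "heap" command family as the heap dump above; a sketch (asok paths as on our hosts, adjust for the containerized layout):

# Ask tcmalloc in every local OSD to hand its freelists back to the kernel.
# A second "heap dump" afterwards should show "Bytes in page heap freelist"
# shrink and "Bytes released to OS" grow.
for sock in /var/run/ceph/ceph-osd.*.asok; do
    ceph daemon "$sock" heap release
done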
This problem potentially prevents resolution of bz 1638148, so I updated the "blocks" field.
Targeting RHCS 3.2; I believe this is a blocker. We are trying to reproduce in a RHCS container cluster without OpenStack to simplify things, and then will try to come up with the simplest, smallest reproducer, but with different, older hardware I'm not sure this will work.
osd_client_message_size_cap (the largest client data message allowed in memory) is set to 500 MB by default, so one thing we can try is setting this value to 50 MB and seeing what the impact is.
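A sketch of how that experiment could be applied (52428800 bytes = 50 MiB; the messenger throttles are sized when the OSD starts, so the runtime injection may not take full effect until a restart, which is why the ceph.conf stanza is shown as well):

# runtime attempt, across all OSDs
ceph tell 'osd.*' injectargs '--osd_client_message_size_cap=52428800'

# persistent, in ceph.conf on the OSD hosts (requires an OSD restart)
[osd]
osd_client_message_size_cap = 52428800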
I think we should see if this problem still happens when we switch to Bluestore; if it doesn't happen with Bluestore, then that's probably the best resolution.
Mark Nelson suggests trying a heap release command when this happens, and he says Bluestore will not have this problem because there is automatic heap release code in there. If we can verify that this is the case, then we could just close this as "Fixed in Bluestore". Tim and I are working on reproducing this problem in a smaller cluster first, using pbench-fio with lots of /dev/rbdN per host; a sketch of the fio side is below.
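The fio side of that reproducer is just small random writes against many mapped RBD devices; a sketch of the kind of job file involved (device names, queue depths and runtimes are illustrative, and pbench-fio generates something equivalent):

[global]
ioengine=libaio
direct=1
rw=randwrite
bs=4k
iodepth=16
runtime=600
time_based=1
group_reporting=1

[rbd0]
filename=/dev/rbd0

[rbd1]
filename=/dev/rbd1

# ...one job section per mapped /dev/rbdN on the host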
Tim and I can't reproduce this with either Filestore or Bluestore at small scale (25 OSDs). We are looking for an opportunity to get into a site with ~500 OSDs to try to reproduce it with RHCS 3.2.
I had the same issue with 108 OSDs. We have some RBD images with snapshots; when we do snapshot cleanup by removing snapshots, it sometimes causes cascading OOMs of all OSDs across the cluster. Another scenario: we have a cluster with 3 racks, each rack has 3 hosts, each host has 12 OSDs, and each OSD has an 8 TB disk. The cluster has one pool with 3 replicas and 1024 PGs (we are about to increase that number). When we add a new host (12 OSDs, each with a 10 TB disk) to a rack, during the backfilling we observe OOMs on the new server, which kill some OSDs, and we have to restart some OSDs to release memory.
From the Nautilus slides: "OSD hard limit on PG log length - Avoids corner cases that could cause OSD memory utilization to grow unbounded". So I think we should retest with Nautilus in this environment and verify that with Bluestore we no longer hit this bz; if so, then the resolution is "fixed in next release", and perhaps backport to Luminous.
The hard limit patches have been backported to Luminous and should be available in the latest 3.2.
Neha, you were right. I was seeing a similar problem in ceph-mds, so I ran dump_mempools and saw this:

sh-4.2# ceph daemon /run/ceph/ceph-mds.myfs-b.asok dump_mempools
...
    "buffer_anon": {
        "items": 144834,
        "bytes": 604427203
    },
...

As you mentioned:

sh-4.2# ceph daemon /run/ceph/ceph-mds.myfs-b.asok config show | grep osd_client_message_size_cap
    "osd_client_message_size_cap": "524288000",

I then lowered it:

sh-4.2# ceph daemon /run/ceph/ceph-mds.myfs-b.asok config set osd_client_message_size_cap 50000000

and saw this:

    "buffer_anon": {
        "items": 23380,
        "bytes": 93826424
    },

That reduced consumption by about 1/2 GB. Multiply that by the number of Ceph daemons in the system and it adds up. Why do we need such a large default value for osd_client_message_size_cap?
Updating the QA Contact to Hemant. Hemant will reroute this to the appropriate QE Associate. Regards, Giri
Ben, have you seen anything like this with nautilus + bluestore?
Maybe Alex Calhoun has, cc'ing him. Most testing is being done with RHOCS right now but I don't think that matters too much, since the problem is in the OSDs, which are containerized in both cases. Would like to use https://github.com/cloud-bulldozer/ripsaw fio workload to test this.
Mark Nelson is working on an issue involving OSD RSS inflating past osd_memory_target. Our thinking is that THP is responsible for this; he wants to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise", where the current default is "always". I think this is a good idea because the use of THP is very application-specific, and this would allow Ceph to turn it off while allowing other apps to turn it on. BTW, Fedora 30 (5.2.18-200.fc30.x86_64) already has "madvise" as the default. So perhaps a tuned profile should be created to enable this on systems that don't already default to it, something like the tuned profile in the initial post but with transparent_hugepages=madvise in the [vm] section (spelled out below).
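That is, /usr/lib/tuned/ceph-osd/tuned.conf would become something like:

[main]
summary=ceph-osd tuned profile, THP restricted to madvise
include=throughput-performance

[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

[vm]
transparent_hugepages=madvise

[sysfs]
/sys/kernel/mm/ksm/run=0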
Ceph has been altered to disable transparent hugepages in Ceph daemons using the prctl() system call. Bluestore is now used more heavily (how many sites still use Filestore?), and Bluestore does O_DIRECT I/O, so it does not use the kernel buffer cache; this too limits RSS consumption. Finally, I think work was done to periodically do a "heap release" in the Ceph OSD and MDS daemons; I believe this is in RHCS 4 (ask Mark Nelson or Josh Durgin).

To close this bz, someone needs to redo the OpenStack+Ceph tests in the scale lab with OSP 16 and see if the problem has been solved. We need something like Prometheus or Browbeat to track Ceph daemon memory consumption during such a test, and if "ceph tell osd.N heap stats" were added to the Ceph manager prometheus module, it would be easier to understand what is happening. Sai Sindhur Malleni is the tech lead for Perf & Scale testing for OpenStack right now; I cc'ed him.
I do not have that information