Description of problem:

Under a really heavy I/O load, Tim Wilkinson and I have observed a hyperconverged OpenStack-Ceph cluster collapse, multiple times, almost immediately. We appear to have plenty of free memory and the system is not memory-overcommitted. The collapse is caused by a chain reaction:

- the OOM killer runs
- random nova instances and ceph-osd processes are killed
- Ceph goes into recovery mode
- OSDs consume much more memory than before
- OSDs hit their cgroup limit and crash
- the remaining OSDs are under even more pressure than before, and so on
- Ceph reaches a point where some data is inaccessible (not lost) because too many OSDs are down
- VMs hang and eventually log stack traces stating that I/O timeouts of > 120 seconds are occurring

Manual intervention is required to get the system running again. This should never happen.

The hosts are OpenStack-HCI compute hosts with 256 GB RAM. There are exactly 34 OSDs per host, and we allow each to consume up to 5 GB of RSS (the cgroup limit is 6 GB, but most OSDs stay well below this) for a total of 170 GB. We also run 34-35 VMs of 1 GB each; in theory this uses another ~50 GB (34*1.5), leaving ~35 GB of unused RAM. In practice we see far more free memory than 35 GB. The OOM kill happens without any warning.

However, with half the number of VMs, KSM disabled, and osd_{max,min}_pg_log_entries=3000 in ceph.conf, we have run the same fio random read and write workload successfully.

Each VM has a 95-GB cinder volume with an XFS filesystem on it. The cinder volume is preallocated (dd zeroes to it before XFS is put in place). Ceph space is about 16.57% used out of 865 TB across 476 OSDs (34 2-TB OSDs/host x 14 hosts).

Note that lighter loads may not trigger this behavior, at least not right away. We are trying to measure what workloads are safe.

We use AZs (Availability Zones) so that the Nova scheduler has no choice about where the VMs run. This is necessary for failover testing, so that when we bring down an OSD host, it doesn't also blow away VMs that were running on that host.

Version-Release number of selected component (if applicable):

ceph-common-12.2.4-10.el7cp.x86_64 in the container, container image is 3-9
rhosp-release-13.0-3.el7ost.noarch

How reproducible: very

Steps to Reproduce:
1. Deploy RHOSP 13 in hyperconverged mode.
2. Create 512 VMs and spread them evenly across 14 RHOSP computeosd hosts for a total of 36 VMs/host, using AZs so there is no chance of scheduler involvement (takes the Nova scheduler out of the picture).
3. Create a 100-GB cinder volume for each VM, attach it, initialize it to all zeroes with dd, and then put an XFS filesystem on it (a sketch of this step follows the Expected results below).
4. Run pbench-fio to first populate each XFS filesystem with 1 very large file (about 95 GB in size).
5. Run a pbench-fio 4-KB random write workload on these files in parallel.

Actual results:

The OOM killer kills OSDs, and the cluster soon degrades to an unusable state. Docker even became inaccessible; we had to restart docker to restart the OSDs.

Expected results:

Response time increases, but OSDs do not go down, and the system behaves stably and fairly (no user gets locked out or sees really long response times).
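For reference, a minimal sketch of step 3 of the reproducer, assuming the OpenStack CLI is available and the volume appears inside the guest as /dev/vdb (names, AZ, and device are illustrative, not the exact commands we ran):

# on a client node with OpenStack credentials: create and attach a 100-GB volume per VM
openstack volume create --size 100 --availability-zone <AZ> vol-vm01
openstack server add volume vm01 vol-vm01

# inside the guest: preallocate the whole device with zeroes, then format and mount it
dd if=/dev/zero of=/dev/vdb bs=1M oflag=direct    # runs until the device is full
mkfs.xfs -f /dev/vdb
mkdir -p /mnt/ceph
mount /dev/vdb /mnt/ceph
mkdir -p /mnt/ceph/fio                            # matches directory= in the fio job file below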
Additional info:

Hardware is described here: http://wiki.scalelab.redhat.com/assignments/#cloud11

Configuration outline: 3 controllers, 14 computeosd hosts (SM6048R), 1 undercloud host. Each SM6048R has:

- 256 GB RAM
- 2 Broadwell CPU sockets
- 2 40-GbE ports
- 36 HDDs behind an LSI3108 controller with 1 GB WB cache (we only use 34)
- 2 Intel P3700 NVMe SSDs (800 GB each)

We are trying to isolate the root cause, which could include one or more of the following (a quick way to inspect these tunables is sketched after the fio job file below):

- Ceph OSDs default to osd_min_pg_log_entries << osd_max_pg_log_entries, which allows OSD memory consumption to grow dramatically during recovery.
- The Ceph container CGroup limit of 5 GB is reached too easily, causing the ceph-osd process to crash due to a failed memory allocation. This triggers further recovery and memory pressure.
- bz 1628652 - Ceph creates tens of thousands of empty /var/lib/ceph/tmp/tmp.* directories for reasons unknown to me, then it spawns many processes doing this to the same directories:

    find /var/lib/ceph -mindepth 1 -maxdepth 3 -exec chown ceph:ceph {} \;

  This pegs the system disk and makes docker commands very slow.
- KSM (kernel same-page merging) is enabled by default; it should not be.
- Transparent hugepages are not persistently disabled by ceph-ansible (across reboots); this can cause memory fragmentation and makes the kernel VM subsystem work harder to recycle pages. Not a factor so far in these experiments because the systems haven't been rebooted, but it may affect customers.
- The RHEL7 defaults vm.dirty_ratio=40 and vm.dirty_background_ratio=10 can cause memory to be consumed by writes, since the HDDs cannot keep up with the load. I hypothesize that this should be tuned to vm.dirty_ratio=10 and vm.dirty_background_ratio=5.
- vm.min_free_kbytes = 4 GB, so that takes a huge chunk of memory out of play. I can see 1-2 GB, but 4? That's almost 2% of RAM, maybe more if you account for the hysteresis-curve behavior of kswapd's memory recycling.

The workload is:

/opt/pbench-agent/bench-scripts/pbench-fio --sysinfo=none --max-failures=0 \
  --samples=${samples} -t ${oper} -b ${bs} \
  --client-file=${PWD}/vms.list.${inst} \
  --job-file=/tmp/fio.job

The fio job file is something like this:

# cat /tmp/fio.job
[global]
ioengine=libaio
bs=4k
iodepth=4
direct=1
fsync_on_close=1
time_based=1
runtime=3160
clocksource=clock_gettime
ramp_time=10
startdelay=64
rate_iops=200

[fio]
rw=randwrite
size=95g
write_bw_log=fio
write_iops_log=fio
write_lat_log=fio
write_hist_log=fio
numjobs=1
per_job_logs=1
log_avg_msec=60000
log_hist_msec=60000
directory=/mnt/ceph/fio
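Here is the quick inspection referenced above - a sketch of checking (and temporarily applying) the suspect settings on a compute host; these are the standard RHEL 7 sysfs/sysctl knobs, nothing project-specific:

# current state of the suspected tunables
cat /sys/kernel/mm/ksm/run                          # 1 means KSM is enabled
cat /sys/kernel/mm/transparent_hugepage/enabled     # [always] means THP is on
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.min_free_kbytes

# non-persistent application of the hypothesized settings (a tuned profile,
# shown later in this bug, is the persistent way to do it)
echo 0     > /sys/kernel/mm/ksm/run
echo never > /sys/kernel/mm/transparent_hugepage/enabled
sysctl -w vm.dirty_ratio=10 vm.dirty_background_ratio=5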
Created attachment 1485161 [details] graph of Ceph OSD memory consumption
The preceding attachment shows results of a test that ran for over an hour with these tunings: osd_{max,min}_pg_log_entries = 3000, as Sage suggested, and this tuned profile:

[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
[vm]
transparent_hugepages=never
[sysfs]
/sys/kernel/mm/ksm/run=0

You can see some OSDs' memory consumption climb to the 6-GB CGroup limit, at which point the OSD is OOM-killed. This triggers massive recovery activity. This part of the problem may be a Ceph bug.

However, Tim found evidence that a guest VM was OOM-killed, which would not be accounted for by the Ceph OSD CGroup limit. We'll add the log to this bz.
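A sketch of how OSD container memory can be tracked against the CGroup limit on a compute host; the docker-*.scope memory cgroup path is how the OSD containers show up on these hosts, but treat the exact path as an assumption for your own deployment:

# usage vs. limit for each docker memory cgroup (bytes); containers close to
# 6442450944 (6 GB) are the ones at risk of a CGroup OOM kill
for cg in /sys/fs/cgroup/memory/system.slice/docker-*.scope; do
  printf '%s %s/%s\n' "$cg" "$(cat $cg/memory.usage_in_bytes)" "$(cat $cg/memory.limit_in_bytes)"
done

# or, less precisely but more conveniently:
docker stats --no-stream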
Created attachment 1485197 [details]
/var/log/messages covering Sep 7 2018 OOM kill

An OOM kill took down multiple guest VMs on Sep 7, 2018 at 9:28:18 AM because there was no free memory. However, there should have been! None of the 34 OSDs on the system at that time were bigger than ~1.2 GB RSS, and none of the ~34 guest VMs were bigger than 1 GB RSS, so arithmetic shows there should have been tons of free memory. This was without any tuning, everything at defaults.

The hypothesis is that the kernel VM subsystem could not recycle memory fast enough with KSM and/or THP enabled. So far we haven't noticed another one of these since the ceph-osd tuned profile was created and applied (it lowers vm.dirty_ratio and turns off KSM and THP), but we haven't checked all the logs yet.
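The RSS numbers above come from checking the processes directly; a sketch of that kind of spot check, using the same ansible pattern as elsewhere in this bug (the exact ps fields are illustrative):

# top resident-set sizes (KB) for ceph-osd and qemu-kvm on every compute host
ansible -f 15 -m shell -a \
  'ps -C ceph-osd,qemu-kvm -o rss=,comm= --sort=-rss | head -5' all

# overall memory picture on one host
free -g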
I checked all the /var/log/messages files; over the last 2 days, the only OOM killing was because of the ceph-osd cgroup limit being reached. Here is the count of the number of times this happened - it was a LOT, several times per hour on each host.

# ansible -f 15 -m shell -a \
  'awk "/invoked oom-killer/&&!/ansible/" /var/log/messages | wc -l' \
  all > invoked-oom-killer.log
# awk '/SUCCESS/{ip=$1}!/SUCCESS/{print ip, $1}' invoked-oom-killer.log \
  | sort -k1 > invoked-oom-killer.sort.log
# ansible -f 15 -m shell -a \
  'awk "/killed as a result of limit of/&&!/ansible/" /var/log/messages | wc -l' \
  all > as-result-of-limit.log
# awk '/SUCCESS/{ip=$1}!/SUCCESS/{print ip, $1}' as-result-of-limit.log \
  | sort -k1 > as-result-of-limit.sort.log
# diff -u invoked-oom-killer.sort.log as-result-of-limit.sort.log
# cat invoked-oom-killer.sort.log
192.168.24.52 115
192.168.24.53 299
192.168.24.57 249
192.168.24.58 271
192.168.24.59 186
192.168.24.60 441
192.168.24.63 255
192.168.24.64 324
192.168.24.65 265
192.168.24.66 135
192.168.24.67 308
192.168.24.68 290
192.168.24.70 147
192.168.24.75 174

Note that ceph.conf was tuned with osd_{max,min}_pg_log_entries=3000 and all OSDs were subsequently restarted to avoid this - it could have been worse without this tuning, we don't know yet. (A sketch of how to confirm the running values on a containerized OSD appears after the log excerpt below.)

[global]
osd_max_pg_log_entries = 3000
osd_min_pg_log_entries = 3000
cluster network = 172.19.0.0/24
log file = /dev/null
mon host = 172.18.0.11,172.18.0.13,172.18.0.10
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128
osd_pool_default_size = 3
public network = 172.18.0.0/24
...

[osd]
osd journal size = 5120
osd mkfs options xfs = -f -i size=2048
osd mkfs type = xfs
osd mount options xfs = noatime,largeio,inode64,swalloc

Here's an example of what these CGroup out-of-memory events looked like:

Sep 19 01:12:42 overcloud-compute-11 kernel: tp_fstore_op invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
Sep 19 01:12:42 overcloud-compute-11 kernel: tp_fstore_op cpuset=docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope mems_allowed=0-1
Sep 19 01:12:42 overcloud-compute-11 kernel: CPU: 14 PID: 489324 Comm: tp_fstore_op Kdump: loaded Tainted: G ------------ T 3.10.0-862.3.3.el7.x86_64 #1
Sep 19 01:12:42 overcloud-compute-11 kernel: Hardware name: Supermicro SSG-6048R-E1CR36H/X10DRH-iT, BIOS 2.0 12/17/2015
Sep 19 01:12:42 overcloud-compute-11 kernel: Call Trace:
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba90e78e>] dump_stack+0x19/0x1b
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba90a110>] dump_header+0x90/0x229
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba4d805b>] ? cred_has_capability+0x6b/0x120
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba40b538>] ? try_get_mem_cgroup_from_mm+0x28/0x60
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba397c44>] oom_kill_process+0x254/0x3d0
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba4d813e>] ? selinux_capable+0x2e/0x40
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba40f326>] mem_cgroup_oom_synchronize+0x546/0x570
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba40e7a0>] ? mem_cgroup_charge_common+0xc0/0xc0
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba3984d4>] pagefault_out_of_memory+0x14/0x90
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba908232>] mm_fault_error+0x6a/0x157
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba91b8b6>] __do_page_fault+0x496/0x4f0
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba91b945>] do_page_fault+0x35/0x90
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba917788>] page_fault+0x28/0x30
Sep 19 01:12:42 overcloud-compute-11 kernel: Task in /system.slice/docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope killed as a result of limit of /system.slice/docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope
Sep 19 01:12:42 overcloud-compute-11 kernel: memory: usage 6291456kB, limit 6291456kB, failcnt 512703
Sep 19 01:12:42 overcloud-compute-11 kernel: memory+swap: usage 6291456kB, limit 12582912kB, failcnt 0
Sep 19 01:12:42 overcloud-compute-11 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Sep 19 01:12:42 overcloud-compute-11 kernel: Memory cgroup stats for /system.slice/docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope: cache:19464KB rss:6271912KB rss_huge:0KB mapped_file:4KB swap:0KB inactive_anon:0KB active_anon:6271908KB inactive_file:14732KB active_file:4228KB unevictable:0KB
Sep 19 01:12:42 overcloud-compute-11 kernel: [ pid ]   uid   tgid total_vm      rss nr_ptes swapents oom_score_adj name
Sep 19 01:12:42 overcloud-compute-11 kernel: [25256]     0  25256     3014      483      11        0             0 entrypoint.sh
Sep 19 01:12:42 overcloud-compute-11 kernel: [488002]  167 488002  1775093  1572599    3206        0             0 ceph-osd
Sep 19 01:12:42 overcloud-compute-11 kernel: Memory cgroup out of memory: Kill process 683041 (rocksdb:bg0) score 1001 or sacrifice child
Sep 19 01:12:42 overcloud-compute-11 kernel: Killed process 488002 (ceph-osd) total-vm:7100372kB, anon-rss:6271108kB, file-rss:19288kB, shmem-rss:0kB
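And here is the sketch mentioned above for confirming that a restarted OSD actually picked up the pg_log settings; the container name and OSD id are placeholders - use whatever `docker ps` shows for the OSD in question:

# query the running daemon through its admin socket, from inside the OSD container
docker exec <ceph-osd-container> ceph daemon osd.<id> config get osd_max_pg_log_entries
docker exec <ceph-osd-container> ceph daemon osd.<id> config get osd_min_pg_log_entries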
Related bzs: 1576095, 1599507

This errata claims that a similar problem was fixed in RHCS 2.5:

https://access.redhat.com/errata/RHSA-2018:2261

Did this fix make it into the ceph-osd-12.2.4-10.el7cp.x86_64 used by RHOSP 13, and is it relevant?
This issue still exists with newer versions of the relevant components:

ceph-common-12.2.4-42.el7cp.x86_64
rhosp-release-13.0-9.el7ost.noarch
container image is 3-12
This issue still exists with container image 3-13, which contains RHCS 3.1. However, the system-wide OOM kills disappear with the tuning described above plus the Nova memory tuning reserved_host_memory_mb=191500. All remaining OOM kills result from the CGroup limit being reached; that is a Ceph problem, not an OpenStack problem, so I am opening a separate bz on it.

To resolve this bz we need to lower the dirty ratio and disable KSM. Here's a tuned profile that does this (minus comments):

[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
[sysfs]
/sys/kernel/mm/ksm/run=0

If this file is installed with OpenStack as /usr/lib/tuned/ceph-osd-hci/tuned.conf and you run:

# tuned-adm profile ceph-osd-hci

then these changes to the kernel configuration are made persistently.

Sebastian Han said that ceph-ansible has been changed to persistently disable THP for Filestore (but not for Bluestore, which is good IMO). See

https://github.com/ceph/ceph-ansible/issues/1013#issuecomment-425001139

So I think with the above change we can resolve this as far as OpenStack is concerned.
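A sketch of doing that by hand on a single compute host, plus a quick verification (the tuned-adm commands are standard; the profile content is exactly the one above):

mkdir -p /usr/lib/tuned/ceph-osd-hci
cat > /usr/lib/tuned/ceph-osd-hci/tuned.conf <<'EOF'
[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
[sysfs]
/sys/kernel/mm/ksm/run=0
EOF

tuned-adm profile ceph-osd-hci
tuned-adm active                 # expect: Current active profile: ceph-osd-hci
sysctl vm.dirty_ratio            # expect: vm.dirty_ratio = 10
cat /sys/kernel/mm/ksm/run       # expect: 0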
(In reply to Ben England from comment #9)
> To resolve this bz we need to lower the dirty ratio and disable KSM.
> Here's a tuned profile that does this (minus comments):
> [...]
> So I think with the above change we can resolve this as far as OpenStack
> is concerned.

Ben,

OK, makes sense. HCI users should use a new tuned profile (e.g. ceph-osd-hci) as defined above. They currently set the throughput-performance profile [1], but TripleO cannot define arbitrary profiles at the moment, and the puppet-tripleo code that sets the profile basically execs a predefined profile [2].

Thus, for OSP13 I will provide an example in this bug of how to use a preboot script to define the tuned profile above and pass an override to set the new profile. I will set a needinfo on myself to provide that example and then pass the new proposed content along to the docs team for any RHHI-C or OSP HCI documentation to include the new example.

John

[1] https://github.com/openstack/tripleo-heat-templates/blob/714680051ee514ab56b87b1fde47f8745514d951/roles/ComputeHCI.yaml#L13
[2] https://github.com/openstack/puppet-tripleo/blob/6f790d624198eeb9219b26848c05f0edafd09dab/manifests/profile/base/tuned.pp
A related bz for ceph-osd process RSS growth is 1637153. That bz has to be fixed too, but it is assigned to the Ceph team, not the OpenStack team. Tim Wilkinson in Perf & Scale has created a dedicated containerized Ceph cluster with 24 OSDs (no OpenStack) to see if we can observe the problem there; if so, it will be easier to isolate.
*** Bug 1639434 has been marked as a duplicate of this bug. ***
I now have a working example where TripleO applies the desired tuned profile, so it looks like we can address this directly instead of providing a more complicated doc.
What RHOSP build should we expect this in? The basic direction looks good; we just need to see the end result. Thx -ben
Documentation https://review.openstack.org/#/c/628261
(In reply to Ben England from comment #20)
> What RHOSP build should we expect this in? The basic direction looks good;
> we just need to see the end result. Thx -ben

The changes have merged into master and are being backported to queens upstream. A future z-stream release of 13, from the next time it does an import, should contain this change.

https://review.openstack.org/#/q/topic:bug/1800232+(status:open+OR+status:merged)
Great - thank you. BTW, Tim and I were unable to reproduce this problem in testing on a smaller cluster. We also do not know whether it happens with Bluestore in RHCS 3.2, which has some OSD memory management features not present in RHCS 3.1. If it does not happen with Bluestore, then RHCS 4.0 is supposed to deliver a migration playbook for getting all RHCS customers onto Bluestore.
Point of clarification.

(In reply to John Fulton from comment #10)
> OK, makes sense. HCI users should use a new tuned profile (e.g.
> ceph-osd-hci) as defined above. They currently set the throughput-performance
> profile [1], but TripleO cannot define arbitrary profiles at the moment...

To resolve this bug, TripleO, as far back as queens, now CAN define arbitrary tuned profiles, and how to do this is documented at:

https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/tuned.html

We also ship a profile based on Ben's recommendation in this bug:

https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/environments/tuned-ceph-filestore-hci.yaml

so that those using HCI with Ceph Filestore may deploy with "-e environments/tuned-ceph-filestore-hci.yaml" (a sketch of the deploy command is below). The bug is in POST, and a future z-stream for OSP13 should pick up this change.
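For reference, deploying with that environment file looks roughly like this (a sketch - the installed template path is the usual one for tripleo-heat-templates, and the trailing arguments stand in for whatever the existing deploy command already passes):

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/tuned-ceph-filestore-hci.yaml \
  ...   # your existing -e environment files, roles data, etc.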
Thanks. By addressing the bug this way, we can deal with any future changes to tuning recommendations without changing software. Sounds like you can close it, as far as I'm concerned.
Verified on puppet-tripleo-8.4.1-2
The doc note is good, particularly the part about how you don't need to do it if Bluestore is used.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0939