Bug 1628670
Summary: [Ceph] Memory pressure leads to OpenStack-Ceph HCI cluster meltdown with filestore

Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Target Release: 13.0 (Queens)
Target Milestone: z6
Hardware: Unspecified
OS: Unspecified

Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Triaged, ZStream
Type: Bug
Fixed In Version: puppet-tripleo-8.4.1-2.el7ost openstack-tripleo-heat-templates-8.3.1-5.el7ost
Last Closed: 2019-04-30 17:27:35 UTC

Reporter: Ben England <bengland>
Assignee: John Fulton <johfulto>
QA Contact: Yogev Rabl <yrabl>
CC: amcleod, dgurtner, gcharot, jdurgin, johfulto, jschluet, jtaleric, lhh, mburns, nlevine, pgrist, rhos-docs, sisadoun, srevivo, twilkins

Doc Type: Bug Fix
Doc Text:
    With this update, there is a new `TunedCustomProfile` parameter that can contain a string in INI format. This parameter describes a custom tuned profile that is based on heavy I/O load testing.
    This update also includes a new environment file for users of hyperconverged Ceph deployments who are using the Ceph filestore storage backend. This environment file creates `/etc/tuned/ceph-filestore-osd-hci/tuned.conf` and sets the tuned profile to an active state. Do not use the new environment file with Ceph bluestore.
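As a rough illustration of what the doc text above describes: `TunedCustomProfile` is named in the doc text, while `TunedProfileName` and the per-role `ComputeHCIParameters` scoping are assumptions about how such an environment file might be structured, not a copy of the shipped file. The sysctl and sysfs values come from the tuned profile recommended later in this bug.

```yaml
# Sketch only: a user-supplied environment file for filestore HCI nodes.
parameter_defaults:
  ComputeHCIParameters:
    # Name of the profile tuned should activate (assumed companion parameter).
    TunedProfileName: 'ceph-filestore-osd-hci'
    # INI-format profile content, written to /etc/tuned/<name>/tuned.conf.
    TunedCustomProfile: |
      [main]
      summary=ceph-filestore OSD hyperconverged tuned profile
      include=throughput-performance
      [sysctl]
      vm.dirty_ratio = 10
      vm.dirty_background_ratio = 3
      [sysfs]
      /sys/kernel/mm/ksm/run=0
```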
Description
Ben England
2018-09-13 17:12:24 UTC
Created attachment 1485161 [details]
graph of Ceph OSD memory consumption
The preceding attachment shows results of a test that ran over an hour with these tunings: osd_{max,min}_pg_log_entries = 3000, as Sage suggested, and this tuned profile:

```
[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
[vm]
transparent_hugepages=never
[sysfs]
/sys/kernel/mm/ksm/run=0
```

You can see some OSDs' memory consumption climb to the 6-GB CGroup limit, and then the OSD is OOM-killed. This triggers massive recovery activity. This part of the problem may be a Ceph bug.

However, Tim found evidence that a guest VM was OOM-killed. This would not be accounted for by the Ceph OSD CGroup limit. We'll add the log to this bz.

Created attachment 1485197 [details]
/var/log/messages covering Sep 7 2018 OOM kill
The OOM killer took down multiple guest VMs on Sep 7, 2018 at 9:28:18 AM because there was no free memory. However, there should have been: none of the 34 OSDs on the system at that time were bigger than ~1.2 GB RSS and none of the ~34 guest VMs were bigger than 1 GB RSS, so simple arithmetic (roughly 34 x 1.2 GB + 34 x 1 GB, about 75 GB in use) shows that there should have been tons of free memory.

This was without any tuning, everything at defaults. The hypothesis was that the kernel VM subsystem could not recycle memory fast enough with KSM and/or THP enabled.

So far we haven't noticed one of these since the ceph-osd tuned profile was created and put into use (it lowers vm.dirty_ratio and turns off KSM and THP), but we haven't checked all the logs yet.
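A quick, illustrative way to confirm on a node that the profile's settings are actually in effect (the sysctl names and sysfs paths are the ones the profile touches; the specific verification commands are not from this bug):

```shell
# tuned-adm active
# sysctl vm.dirty_ratio vm.dirty_background_ratio
# cat /sys/kernel/mm/ksm/run                        # 0 = KSM disabled
# cat /sys/kernel/mm/transparent_hugepage/enabled   # [never] = THP disabled
```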
I checked all the /var/log/messages files: over the last 2 days the only OOM killing was because the ceph-osd cgroup limit was being reached. Here is the count of the number of times this happened; it was a LOT, several times per hour on each host.

```
# ansible -f 15 -m shell -a \
  'awk "/invoked oom-killer/&&!/ansible/" /var/log/messages | wc -l' \
  all > invoked-oom-killer.log
# awk '/SUCCESS/{ip=$1}!/SUCCESS/{print ip, $1}' invoked-oom-killer.log \
  | sort -k1 > invoked-oom-killer.sort.log
# ansible -f 15 -m shell -a \
  'awk "/killed as a result of limit of/&&!/ansible/" /var/log/messages | wc -l' \
  all > as-result-of-limit.log
# awk '/SUCCESS/{ip=$1}!/SUCCESS/{print ip, $1}' as-result-of-limit.log \
  | sort -k1 > as-result-of-limit.sort.log
# diff -u invoked-oom-killer.sort.log as-result-of-limit.sort.log
# cat invoked-oom-killer.sort.log
192.168.24.52 115
192.168.24.53 299
192.168.24.57 249
192.168.24.58 271
192.168.24.59 186
192.168.24.60 441
192.168.24.63 255
192.168.24.64 324
192.168.24.65 265
192.168.24.66 135
192.168.24.67 308
192.168.24.68 290
192.168.24.70 147
192.168.24.75 174
```

Note that ceph.conf was tuned with osd_{max,min}_pg_log_entries=3000 and all OSDs were subsequently restarted to avoid this; it could have been worse without this tuning, we don't know yet.

```
[global]
osd_max_pg_log_entries = 3000
osd_min_pg_log_entries = 3000
cluster network = 172.19.0.0/24
log file = /dev/null
mon host = 172.18.0.11,172.18.0.13,172.18.0.10
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128
osd_pool_default_size = 3
public network = 172.18.0.0/24
...
[osd]
osd journal size = 5120
osd mkfs options xfs = -f -i size=2048
osd mkfs type = xfs
osd mount options xfs = noatime,largeio,inode64,swalloc
```

Here's an example of what these CGroup out-of-memory events looked like:

```
Sep 19 01:12:42 overcloud-compute-11 kernel: tp_fstore_op invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
Sep 19 01:12:42 overcloud-compute-11 kernel: tp_fstore_op cpuset=docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope mems_allowed=0-1
Sep 19 01:12:42 overcloud-compute-11 kernel: CPU: 14 PID: 489324 Comm: tp_fstore_op Kdump: loaded Tainted: G ------------ T 3.10.0-862.3.3.el7.x86_64 #1
Sep 19 01:12:42 overcloud-compute-11 kernel: Hardware name: Supermicro SSG-6048R-E1CR36H/X10DRH-iT, BIOS 2.0 12/17/2015
Sep 19 01:12:42 overcloud-compute-11 kernel: Call Trace:
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba90e78e>] dump_stack+0x19/0x1b
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba90a110>] dump_header+0x90/0x229
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba4d805b>] ? cred_has_capability+0x6b/0x120
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba40b538>] ? try_get_mem_cgroup_from_mm+0x28/0x60
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba397c44>] oom_kill_process+0x254/0x3d0
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba4d813e>] ? selinux_capable+0x2e/0x40
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba40f326>] mem_cgroup_oom_synchronize+0x546/0x570
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba40e7a0>] ? mem_cgroup_charge_common+0xc0/0xc0
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba3984d4>] pagefault_out_of_memory+0x14/0x90
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba908232>] mm_fault_error+0x6a/0x157
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba91b8b6>] __do_page_fault+0x496/0x4f0
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba91b945>] do_page_fault+0x35/0x90
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba917788>] page_fault+0x28/0x30
Sep 19 01:12:42 overcloud-compute-11 kernel: Task in /system.slice/docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope killed as a result of limit of /system.slice/docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope
Sep 19 01:12:42 overcloud-compute-11 kernel: memory: usage 6291456kB, limit 6291456kB, failcnt 512703
Sep 19 01:12:42 overcloud-compute-11 kernel: memory+swap: usage 6291456kB, limit 12582912kB, failcnt 0
Sep 19 01:12:42 overcloud-compute-11 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Sep 19 01:12:42 overcloud-compute-11 kernel: Memory cgroup stats for /system.slice/docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope: cache:19464KB rss:6271912KB rss_huge:0KB mapped_file:4KB swap:0KB inactive_anon:0KB active_anon:6271908KB inactive_file:14732KB active_file:4228KB unevictable:0KB
Sep 19 01:12:42 overcloud-compute-11 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Sep 19 01:12:42 overcloud-compute-11 kernel: [25256] 0 25256 3014 483 11 0 0 entrypoint.sh
Sep 19 01:12:42 overcloud-compute-11 kernel: [488002] 167 488002 1775093 1572599 3206 0 0 ceph-osd
Sep 19 01:12:42 overcloud-compute-11 kernel: Memory cgroup out of memory: Kill process 683041 (rocksdb:bg0) score 1001 or sacrifice child
Sep 19 01:12:42 overcloud-compute-11 kernel: Killed process 488002 (ceph-osd) total-vm:7100372kB, anon-rss:6271108kB, file-rss:19288kB, shmem-rss:0kB
```

Related bzs: 1576095, 1599507

This errata claims that a similar problem was fixed in RHCS 2.5. Did this fix make it into the RHCS ceph-osd-12.2.4-10.el7cp.x86_64 used by RHOSP 13, and is it relevant? https://access.redhat.com/errata/RHSA-2018:2261

This issue still exists with newer versions of the relevant components:
ceph-common-12.2.4-42.el7cp.x86_64
rhosp-release-13.0-9.el7ost.noarch
container image 3-12

This issue still exists with container image 3-13, which contains RHCS 3.1. However, the system-wide OOM kills disappear with the tuning described above plus the Nova memory tuning reserved_host_memory_mb=191500. All remaining OOM kills result from the CGroup limit being reached. This is a Ceph problem, not an OpenStack problem; opening a separate bz on this.

To resolve this bz we need to lower the dirty ratio and disable KSM. Here's a tuned profile that does this (minus comments):

```
[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
[sysfs]
/sys/kernel/mm/ksm/run=0
```

If this file is installed with OpenStack as /usr/lib/tuned/ceph-osd-hci/tuned.conf and you run:

```
# tuned-adm profile ceph-osd-hci
```

the changes to the kernel configuration are made persistent.

Sebastian Han said that ceph-ansible has been changed to persistently disable THP for Filestore (but not for Bluestore, which is good IMO). See https://github.com/ceph/ceph-ansible/issues/1013#issuecomment-425001139

So I think with the above change we can resolve this as far as OpenStack is concerned.
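Spelled out as a complete command sequence, the manual installation described in the comment above would look roughly like the following sketch (the path and profile name are the ones given in the comment; the exact sequence is illustrative):

```shell
# mkdir -p /usr/lib/tuned/ceph-osd-hci
# cat > /usr/lib/tuned/ceph-osd-hci/tuned.conf <<'EOF'
[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
[sysfs]
/sys/kernel/mm/ksm/run=0
EOF
# tuned-adm profile ceph-osd-hci
```

Because tuned records the active profile and reapplies it when the service starts, the settings survive a reboot, which is what makes this approach persistent.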
(In reply to Ben England from comment #9)
> This issue still exists with container image 3-13, which contains RHCS 3.1.
> However, the system-wide OOM kills disappear with the tuning described above
> + the Nova memory tuning reserved_host_memory_mb=191500. All remaining OOM
> kills result from the CGroup limit being reached. This is a Ceph problem,
> not an OpenStack problem; opening a separate bz on this.
>
> To resolve this bz we need to lower the dirty ratio and disable KSM. Here's a
> tuned profile that does this (minus comments):
>
> [main]
> summary=ceph-osd Filestore tuned profile
> include=throughput-performance
> [sysctl]
> vm.dirty_ratio = 10
> vm.dirty_background_ratio = 3
> [sysfs]
> /sys/kernel/mm/ksm/run=0
>
> If this file is installed with OpenStack as
> /usr/lib/tuned/ceph-osd-hci/tuned.conf and you run:
>
> # tuned-adm profile ceph-osd-hci
>
> the changes to the kernel configuration are made persistent.
>
> Sebastian Han said that ceph-ansible has been changed to persistently
> disable THP for Filestore (but not for Bluestore, which is good IMO). See
> https://github.com/ceph/ceph-ansible/issues/1013#issuecomment-425001139
>
> So I think with the above change we can resolve this as far as OpenStack is
> concerned.

Ben,

OK, makes sense. HCI users should use a new tuned profile (e.g. ceph-osd-hci) as defined above. They currently set the throughput-performance profile [1], but TripleO cannot define arbitrary profiles at the moment, and the puppet-tripleo code that sets the profile basically execs a predefined profile [2]. Thus, for OSP13 I will provide an example in this bug of how to use a preboot script to define the tuned profile above and pass an override to set the new profile. I will set a needinfo to myself to provide that example and then pass the new proposed content along to the docs team for any RHHI-C or OSP HCI documentation to include the new example.

John

[1] https://github.com/openstack/tripleo-heat-templates/blob/714680051ee514ab56b87b1fde47f8745514d951/roles/ComputeHCI.yaml#L13
[2] https://github.com/openstack/puppet-tripleo/blob/6f790d624198eeb9219b26848c05f0edafd09dab/manifests/profile/base/tuned.pp
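John later landed a TripleO change for this (see the follow-up comments below); purely to illustrate the interim preboot-script idea described in this comment, a firstboot override could look roughly like the following sketch. OS::TripleO::NodeUserData, OS::Heat::MultipartMime, and OS::Heat::SoftwareConfig are standard TripleO/Heat interfaces, but the template body, the path /home/stack/templates/tuned-firstboot.yaml, and the profile name ceph-osd-hci are illustrative assumptions, not the fix that was actually delivered.

```yaml
heat_template_version: 2016-10-14

description: >
  Illustrative firstboot userdata that writes a custom tuned profile on HCI
  nodes and activates it (sketch only).

resources:

  userdata:
    type: OS::Heat::MultipartMime
    properties:
      parts:
      - config: {get_resource: tuned_profile_config}

  tuned_profile_config:
    type: OS::Heat::SoftwareConfig
    properties:
      config: |
        #!/bin/bash
        set -e
        # Write the profile suggested earlier in this bug.
        mkdir -p /etc/tuned/ceph-osd-hci
        cat > /etc/tuned/ceph-osd-hci/tuned.conf <<'EOF'
        [main]
        summary=ceph-osd Filestore tuned profile
        include=throughput-performance
        [sysctl]
        vm.dirty_ratio = 10
        vm.dirty_background_ratio = 3
        [sysfs]
        /sys/kernel/mm/ksm/run=0
        EOF
        # Activate the profile; tuned reapplies it on every boot.
        tuned-adm profile ceph-osd-hci

outputs:
  # TripleO expects the firstboot template to expose the MultipartMime
  # resource as OS::stack_id so it can be attached as server user data.
  OS::stack_id:
    value: {get_resource: userdata}
```

Such a template would then be registered through an environment file passed to the deploy command, for example:

```yaml
resource_registry:
  OS::TripleO::NodeUserData: /home/stack/templates/tuned-firstboot.yaml
```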
A related bz for ceph-osd process RSS growth is 1637153. That bz has to be fixed too, but it is assigned to the Ceph team, not the OpenStack team. Tim Wilkinson in Perf & Scale has created a dedicated containerized Ceph cluster with 24 OSDs (no OpenStack) to see if we can observe the problem there; if so, it will be easier to isolate.

*** Bug 1639434 has been marked as a duplicate of this bug. ***

I now have a working example where TripleO applies the desired tuned profile, so it looks like we can address this directly instead of providing a more complicated doc.

What RHOSP build should we expect this in? The basic direction looks good, just need to see the end result. Thx -ben

Documentation https://review.openstack.org/#/c/628261

(In reply to Ben England from comment #20)
> What RHOSP build should we expect this in? The basic direction looks good,
> just need to see the end result. Thx -ben

The changes have merged into master and are being backported to queens upstream. A future z-stream release of 13, from the next time it does an import, should contain this change.

https://review.openstack.org/#/q/topic:bug/1800232+(status:open+OR+status:merged)

Great - thank you. BTW, Tim and I failed to reproduce this problem in testing on a smaller cluster. We also do not know if it happens with Bluestore in RHCS 3.2, which has some OSD memory management features not present in RHCS 3.1. If it does not happen with Bluestore, then RHCS 4.0 is supposed to deliver a migration playbook for getting all RHCS customers onto Bluestore.

Point of clarification.

(In reply to John Fulton from comment #10)
> OK, makes sense. HCI users should use a new tuned profile (e.g.
> ceph-osd-hci) as defined above. They currently set the throughput-performance
> profile [1], but TripleO cannot define arbitrary profiles at the moment...

To resolve this bug, TripleO, as far back as queens, now CAN define arbitrary tuned profiles, and how to do this is documented at:

https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/tuned.html

We also ship the profile based on Ben's recommendation in this bug:

https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/environments/tuned-ceph-filestore-hci.yaml

so that those using HCI with Ceph filestore may deploy with "-e environments/tuned-ceph-filestore-hci.yaml". The bug is in POST, and a future z-stream for OSP13 should pick up this change.

Thanks; by addressing the bug this way, we can deal with any future changes to tuning recommendations without changing software. Sounds like you can close it to me.

Verified on puppet-tripleo-8.4.1-2

The doc note is good, particularly the part about how you don't need to do it if Bluestore is used.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0939
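For reference, the resolution described above amounts to adding one environment file to an existing HCI deployment command, roughly as follows; the --templates default path is the standard tripleo-heat-templates location, and the trailing placeholder stands in for whatever environment files the deployment already passes:

```shell
$ openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/tuned-ceph-filestore-hci.yaml \
    <existing network, role, and Ceph environment files>
```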