Bug 1628670
Summary: [Ceph] Memory pressure leads to OpenStack-Ceph HCI cluster meltdown with filestore

Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Target Release: 13.0 (Queens)
Target Milestone: z6
Hardware: Unspecified
OS: Unspecified

Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Triaged, ZStream
Type: Bug
Fixed In Version: puppet-tripleo-8.4.1-2.el7ost openstack-tripleo-heat-templates-8.3.1-5.el7ost
Last Closed: 2019-04-30 17:27:35 UTC

Reporter: Ben England <bengland>
Assignee: John Fulton <johfulto>
QA Contact: Yogev Rabl <yrabl>
CC: amcleod, dgurtner, gcharot, jdurgin, johfulto, jschluet, jtaleric, lhh, mburns, nlevine, pgrist, rhos-docs, sisadoun, srevivo, twilkins

Doc Type: Bug Fix
Doc Text:
    With this update, there is a new `TunedCustomProfile` parameter that can contain a string in INI format. This parameter describes a custom tuned profile that is based on heavy I/O load testing.
    This update also includes a new environment file for users of hyperconverged Ceph deployments who are using the Ceph filestore storage backend. This environment file creates `/etc/tuned/ceph-filestore-osd-hci/tuned.conf` and sets the tuned profile to an active state. Do not use the new environment file with Ceph bluestore.
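As a rough illustration of what the doc text above describes: `TunedCustomProfile` is named in the doc text, while `TunedProfileName` and the per-role `ComputeHCIParameters` scoping are assumptions about how such an environment file might be structured, not a copy of the shipped file. The sysctl and sysfs values come from the tuned profile recommended later in this bug.

```yaml
# Sketch only: a user-supplied environment file for filestore HCI nodes.
parameter_defaults:
  ComputeHCIParameters:
    # Name of the profile tuned should activate (assumed companion parameter).
    TunedProfileName: 'ceph-filestore-osd-hci'
    # INI-format profile content, written to /etc/tuned/<name>/tuned.conf.
    TunedCustomProfile: |
      [main]
      summary=ceph-filestore OSD hyperconverged tuned profile
      include=throughput-performance
      [sysctl]
      vm.dirty_ratio = 10
      vm.dirty_background_ratio = 3
      [sysfs]
      /sys/kernel/mm/ksm/run=0
```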
Description
Ben England
2018-09-13 17:12:24 UTC
Created attachment 1485161 [details]
graph of Ceph OSD memory consumption
The preceding attachment shows results of a test that ran over an hour with these tunings: osd_{max,min}_pg_log_entries = 3000, as Sage suggested, and this tuned profile:

```
[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
[vm]
transparent_hugepages=never
[sysfs]
/sys/kernel/mm/ksm/run=0
```

You can see some OSDs' memory consumption climb to the 6-GB CGroup limit, and then the OSD is OOM-killed. This triggers massive recovery activity. This part of the problem may be a Ceph bug.

However, Tim found evidence that a guest VM was OOM-killed. This would not be accounted for by the Ceph OSD CGroup limit. We'll add the log to this bz.

Created attachment 1485197 [details]
/var/log/messages covering Sep 7 2018 OOM kill
The OOM killer took down multiple guest VMs on Sep 7, 2018 at 9:28:18 AM because there was no free memory. However, there should have been: none of the 34 OSDs on the system at that time were bigger than ~1.2 GB RSS and none of the ~34 guest VMs were bigger than 1 GB RSS, so simple arithmetic (roughly 34 x 1.2 GB + 34 x 1 GB, about 75 GB in use) shows that there should have been tons of free memory.

This was without any tuning, everything at defaults. The hypothesis was that the kernel VM subsystem could not recycle memory fast enough with KSM and/or THP enabled.

So far we haven't noticed one of these since the ceph-osd tuned profile was created and put into use (it lowers vm.dirty_ratio and turns off KSM and THP), but we haven't checked all the logs yet.
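A quick, illustrative way to confirm on a node that the profile's settings are actually in effect (the sysctl names and sysfs paths are the ones the profile touches; the specific verification commands are not from this bug):

```shell
# tuned-adm active
# sysctl vm.dirty_ratio vm.dirty_background_ratio
# cat /sys/kernel/mm/ksm/run                        # 0 = KSM disabled
# cat /sys/kernel/mm/transparent_hugepage/enabled   # [never] = THP disabled
```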
I checked all the /var/log/messages files: over the last 2 days the only OOM killing was because the ceph-osd cgroup limit was being reached. Here is the count of the number of times this happened; it was a LOT, several times per hour on each host.

```
# ansible -f 15 -m shell -a \
  'awk "/invoked oom-killer/&&!/ansible/" /var/log/messages | wc -l' \
  all > invoked-oom-killer.log
# awk '/SUCCESS/{ip=$1}!/SUCCESS/{print ip, $1}' invoked-oom-killer.log \
  | sort -k1 > invoked-oom-killer.sort.log
# ansible -f 15 -m shell -a \
  'awk "/killed as a result of limit of/&&!/ansible/" /var/log/messages | wc -l' \
  all > as-result-of-limit.log
# awk '/SUCCESS/{ip=$1}!/SUCCESS/{print ip, $1}' as-result-of-limit.log \
  | sort -k1 > as-result-of-limit.sort.log
# diff -u invoked-oom-killer.sort.log as-result-of-limit.sort.log
# cat invoked-oom-killer.sort.log
192.168.24.52 115
192.168.24.53 299
192.168.24.57 249
192.168.24.58 271
192.168.24.59 186
192.168.24.60 441
192.168.24.63 255
192.168.24.64 324
192.168.24.65 265
192.168.24.66 135
192.168.24.67 308
192.168.24.68 290
192.168.24.70 147
192.168.24.75 174
```

Note that ceph.conf was tuned with osd_{max,min}_pg_log_entries=3000 and all OSDs were subsequently restarted to avoid this; it could have been worse without this tuning, we don't know yet.

```
[global]
osd_max_pg_log_entries = 3000
osd_min_pg_log_entries = 3000
cluster network = 172.19.0.0/24
log file = /dev/null
mon host = 172.18.0.11,172.18.0.13,172.18.0.10
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128
osd_pool_default_size = 3
public network = 172.18.0.0/24
...
[osd]
osd journal size = 5120
osd mkfs options xfs = -f -i size=2048
osd mkfs type = xfs
osd mount options xfs = noatime,largeio,inode64,swalloc
```

Here's an example of what these CGroup out-of-memory events looked like:

```
Sep 19 01:12:42 overcloud-compute-11 kernel: tp_fstore_op invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
Sep 19 01:12:42 overcloud-compute-11 kernel: tp_fstore_op cpuset=docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope mems_allowed=0-1
Sep 19 01:12:42 overcloud-compute-11 kernel: CPU: 14 PID: 489324 Comm: tp_fstore_op Kdump: loaded Tainted: G ------------ T 3.10.0-862.3.3.el7.x86_64 #1
Sep 19 01:12:42 overcloud-compute-11 kernel: Hardware name: Supermicro SSG-6048R-E1CR36H/X10DRH-iT, BIOS 2.0 12/17/2015
Sep 19 01:12:42 overcloud-compute-11 kernel: Call Trace:
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba90e78e>] dump_stack+0x19/0x1b
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba90a110>] dump_header+0x90/0x229
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba4d805b>] ? cred_has_capability+0x6b/0x120
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba40b538>] ? try_get_mem_cgroup_from_mm+0x28/0x60
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba397c44>] oom_kill_process+0x254/0x3d0
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba4d813e>] ? selinux_capable+0x2e/0x40
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba40f326>] mem_cgroup_oom_synchronize+0x546/0x570
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba40e7a0>] ? mem_cgroup_charge_common+0xc0/0xc0
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba3984d4>] pagefault_out_of_memory+0x14/0x90
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba908232>] mm_fault_error+0x6a/0x157
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba91b8b6>] __do_page_fault+0x496/0x4f0
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba91b945>] do_page_fault+0x35/0x90
Sep 19 01:12:42 overcloud-compute-11 kernel: [<ffffffffba917788>] page_fault+0x28/0x30
Sep 19 01:12:42 overcloud-compute-11 kernel: Task in /system.slice/docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope killed as a result of limit of /system.slice/docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope
Sep 19 01:12:42 overcloud-compute-11 kernel: memory: usage 6291456kB, limit 6291456kB, failcnt 512703
Sep 19 01:12:42 overcloud-compute-11 kernel: memory+swap: usage 6291456kB, limit 12582912kB, failcnt 0
Sep 19 01:12:42 overcloud-compute-11 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Sep 19 01:12:42 overcloud-compute-11 kernel: Memory cgroup stats for /system.slice/docker-dfc4834c21742c5709fc806b51d0e0054b54010796139e7ff89265626640e178.scope: cache:19464KB rss:6271912KB rss_huge:0KB mapped_file:4KB swap:0KB inactive_anon:0KB active_anon:6271908KB inactive_file:14732KB active_file:4228KB unevictable:0KB
Sep 19 01:12:42 overcloud-compute-11 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Sep 19 01:12:42 overcloud-compute-11 kernel: [25256] 0 25256 3014 483 11 0 0 entrypoint.sh
Sep 19 01:12:42 overcloud-compute-11 kernel: [488002] 167 488002 1775093 1572599 3206 0 0 ceph-osd
Sep 19 01:12:42 overcloud-compute-11 kernel: Memory cgroup out of memory: Kill process 683041 (rocksdb:bg0) score 1001 or sacrifice child
Sep 19 01:12:42 overcloud-compute-11 kernel: Killed process 488002 (ceph-osd) total-vm:7100372kB, anon-rss:6271108kB, file-rss:19288kB, shmem-rss:0kB
```

Related bzs: 1576095, 1599507

This errata claims that a similar problem was fixed in RHCS 2.5. Did this fix make it into the RHCS ceph-osd-12.2.4-10.el7cp.x86_64 used by RHOSP 13, and is it relevant? https://access.redhat.com/errata/RHSA-2018:2261

This issue still exists with newer versions of the relevant components:
ceph-common-12.2.4-42.el7cp.x86_64
rhosp-release-13.0-9.el7ost.noarch
container image 3-12

This issue still exists with container image 3-13, which contains RHCS 3.1. However, the system-wide OOM kills disappear with the tuning described above plus the Nova memory tuning reserved_host_memory_mb=191500. All remaining OOM kills result from the CGroup limit being reached. This is a Ceph problem, not an OpenStack problem; opening a separate bz on this.

To resolve this bz we need to lower the dirty ratio and disable KSM. Here's a tuned profile that does this (minus comments):

```
[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
[sysfs]
/sys/kernel/mm/ksm/run=0
```

If this file is installed with OpenStack as /usr/lib/tuned/ceph-osd-hci/tuned.conf and you run:

```
# tuned-adm profile ceph-osd-hci
```

the changes to the kernel configuration are made persistent.

Sebastian Han said that ceph-ansible has been changed to persistently disable THP for Filestore (but not for Bluestore, which is good IMO). See https://github.com/ceph/ceph-ansible/issues/1013#issuecomment-425001139

So I think with the above change we can resolve this as far as OpenStack is concerned.
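Spelled out as a complete command sequence, the manual installation described in the comment above would look roughly like the following sketch (the path and profile name are the ones given in the comment; the exact sequence is illustrative):

```shell
# mkdir -p /usr/lib/tuned/ceph-osd-hci
# cat > /usr/lib/tuned/ceph-osd-hci/tuned.conf <<'EOF'
[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
[sysfs]
/sys/kernel/mm/ksm/run=0
EOF
# tuned-adm profile ceph-osd-hci
```

Because tuned records the active profile and reapplies it when the service starts, the settings survive a reboot, which is what makes this approach persistent.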
(In reply to Ben England from comment #9)
> This issue still exists with container image 3-13, which contains RHCS 3.1.
> However, the system-wide OOM kills disappear with the tuning described above
> + the Nova memory tuning reserved_host_memory_mb=191500. All remaining OOM
> kills result from the CGroup limit being reached. This is a Ceph problem,
> not an OpenStack problem; opening a separate bz on this.
>
> To resolve this bz we need to lower the dirty ratio and disable KSM. Here's a
> tuned profile that does this (minus comments):
>
> [main]
> summary=ceph-osd Filestore tuned profile
> include=throughput-performance
> [sysctl]
> vm.dirty_ratio = 10
> vm.dirty_background_ratio = 3
> [sysfs]
> /sys/kernel/mm/ksm/run=0
>
> If this file is installed with OpenStack as
> /usr/lib/tuned/ceph-osd-hci/tuned.conf and you run:
>
> # tuned-adm profile ceph-osd-hci
>
> the changes to the kernel configuration are made persistent.
>
> Sebastian Han said that ceph-ansible has been changed to persistently
> disable THP for Filestore (but not for Bluestore, which is good IMO). See
> https://github.com/ceph/ceph-ansible/issues/1013#issuecomment-425001139
>
> So I think with the above change we can resolve this as far as OpenStack is
> concerned.

Ben,

OK, makes sense. HCI users should use a new tuned profile (e.g. ceph-osd-hci) as defined above. They currently set the throughput-performance profile [1], but TripleO cannot define arbitrary profiles at the moment, and the puppet-tripleo code that sets the profile basically execs a predefined profile [2]. Thus, for OSP13 I will provide an example in this bug of how to use a preboot script to define the tuned profile above and pass an override to set the new profile. I will set a needinfo to myself to provide that example and then pass the new proposed content along to the docs team for any RHHI-C or OSP HCI documentation to include the new example.

John

[1] https://github.com/openstack/tripleo-heat-templates/blob/714680051ee514ab56b87b1fde47f8745514d951/roles/ComputeHCI.yaml#L13
[2] https://github.com/openstack/puppet-tripleo/blob/6f790d624198eeb9219b26848c05f0edafd09dab/manifests/profile/base/tuned.pp
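John later landed a TripleO change for this (see the follow-up comments below); purely to illustrate the interim preboot-script idea described in this comment, a firstboot override could look roughly like the following sketch. OS::TripleO::NodeUserData, OS::Heat::MultipartMime, and OS::Heat::SoftwareConfig are standard TripleO/Heat interfaces, but the template body, the path /home/stack/templates/tuned-firstboot.yaml, and the profile name ceph-osd-hci are illustrative assumptions, not the fix that was actually delivered.

```yaml
heat_template_version: 2016-10-14

description: >
  Illustrative firstboot userdata that writes a custom tuned profile on HCI
  nodes and activates it (sketch only).

resources:

  userdata:
    type: OS::Heat::MultipartMime
    properties:
      parts:
      - config: {get_resource: tuned_profile_config}

  tuned_profile_config:
    type: OS::Heat::SoftwareConfig
    properties:
      config: |
        #!/bin/bash
        set -e
        # Write the profile suggested earlier in this bug.
        mkdir -p /etc/tuned/ceph-osd-hci
        cat > /etc/tuned/ceph-osd-hci/tuned.conf <<'EOF'
        [main]
        summary=ceph-osd Filestore tuned profile
        include=throughput-performance
        [sysctl]
        vm.dirty_ratio = 10
        vm.dirty_background_ratio = 3
        [sysfs]
        /sys/kernel/mm/ksm/run=0
        EOF
        # Activate the profile; tuned reapplies it on every boot.
        tuned-adm profile ceph-osd-hci

outputs:
  # TripleO expects the firstboot template to expose the MultipartMime
  # resource as OS::stack_id so it can be attached as server user data.
  OS::stack_id:
    value: {get_resource: userdata}
```

Such a template would then be registered through an environment file passed to the deploy command, for example:

```yaml
resource_registry:
  OS::TripleO::NodeUserData: /home/stack/templates/tuned-firstboot.yaml
```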
A related bz for ceph-osd process RSS growth is 1637153. That bz has to be fixed too, but it is assigned to the Ceph team, not the OpenStack team. Tim Wilkinson in Perf & Scale has created a dedicated containerized Ceph cluster with 24 OSDs (no OpenStack) to see if we can observe the problem there; if so, it will be easier to isolate.

*** Bug 1639434 has been marked as a duplicate of this bug. ***

I now have a working example where TripleO applies the desired tuned profile, so it looks like we can address this directly instead of providing a more complicated doc.

What RHOSP build should we expect this in? The basic direction looks good, just need to see the end result. Thx -ben

Documentation https://review.openstack.org/#/c/628261

(In reply to Ben England from comment #20)
> What RHOSP build should we expect this in? The basic direction looks good,
> just need to see the end result. Thx -ben

The changes have merged into master and are being backported to queens upstream. A future z-stream release of 13, from the next time it does an import, should contain this change.

https://review.openstack.org/#/q/topic:bug/1800232+(status:open+OR+status:merged)

Great - thank you. BTW, Tim and I failed to reproduce this problem in testing on a smaller cluster. We also do not know if it happens with Bluestore in RHCS 3.2, which has some OSD memory management features not present in RHCS 3.1. If it does not happen with Bluestore, then RHCS 4.0 is supposed to deliver a migration playbook for getting all RHCS customers onto Bluestore.

Point of clarification.

(In reply to John Fulton from comment #10)
> OK, makes sense. HCI users should use a new tuned profile (e.g.
> ceph-osd-hci) as defined above. They currently set the throughput-performance
> profile [1], but TripleO cannot define arbitrary profiles at the moment...

To resolve this bug, TripleO, as far back as queens, now CAN define arbitrary tuned profiles, and how to do this is documented at:

https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/tuned.html

We also ship the profile based on Ben's recommendation in this bug:

https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/environments/tuned-ceph-filestore-hci.yaml

so that those using HCI with Ceph filestore may deploy with "-e environments/tuned-ceph-filestore-hci.yaml". The bug is in POST, and a future z-stream for OSP13 should pick up this change.

Thanks; by addressing the bug this way, we can deal with any future changes to tuning recommendations without changing software. Sounds like you can close it to me.

Verified on puppet-tripleo-8.4.1-2

The doc note is good, particularly the part about how you don't need to do it if Bluestore is used.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0939
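For reference, the resolution described above amounts to adding one environment file to an existing HCI deployment command, roughly as follows; the --templates default path is the standard tripleo-heat-templates location, and the trailing placeholder stands in for whatever environment files the deployment already passes:

```shell
$ openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/tuned-ceph-filestore-hci.yaml \
    <existing network, role, and Ceph environment files>
```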