Bug 1700027 - Mimic: auto tuned osd-memory-target causes OOM memory kills
Summary: Mimic: auto tuned osd-memory-target causes OOM memory kills
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Rook
Version: 4.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 4.0
Assignee: Sébastien Han
QA Contact: Ben England
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-15 15:45 UTC by Ben England
Modified: 2019-10-01 16:37 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-01 16:37:28 UTC
Embargoed:




Links
Github rook/rook pull 3005 (closed): ceph: osd, do not auto tune osd-memory-target (last updated 2021-02-03 22:02:36 UTC)

Description Ben England 2019-04-15 15:45:54 UTC
Description of problem:

In testing with RHOCS and the fio workload, we got Ceph into a state where one of the OSDs reported RocksDB corruption, and the rest of the cluster did not backfill.   The application hung and did not complete.  More data is needed but we have some and wanted to get your input on how to proceed.


Version-Release number of selected component (if applicable):

OpenShift 3.11  atomic-openshift-3.11.98-1.git.0.0cbaff3.el7.x86_64
Ceph Mimic upstream  docker.io/ceph/ceph:v13  id 4775d1a2eff9 4 weeks ago
rook.io  docker.io/rook/ceph:master  9 days ago
fio v3.12

Note that the same workload and configuration succeeded when a much smaller amount of data was used.    

Waiting for Nautilus image to be available for install with rook.io so we can switch to a code base that is more up to date and supported.


How reproducible:

It happened the first time; we don't know whether it would happen again.

Steps to Reproduce:
1.  install openshift 3.11
2.  install ceph with rook.io 
3.  deploy ceph storage class for Cephfs
4.  deploy fio pods using that storage class
5.  run the fio workload using pbench-fio

Actual results:

fio hung, 1 OSD went down, and PGs stayed in a degraded state (no backfill).


Expected results:

fio completes successfully, and no OSDs go down.  Or if the system is out of space, then fio should return ENOSPC and should not hang.


Additional info:

all result files/logs are in 

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/fio-osd-crash/


fio never completed with this workload, which required an 80 GB file per fio pod, so a total of 8 TB of data was written; with 3x replication that consumed 24 TB of the 38 TB of raw space, or about 62%.  OSD 23 had only 18% free space left, though.

Fio output is in nohup.out, which includes the fio job file.  If you log into the fio pods you can see that each of them has a file of the form /mnt/cephfs/<ip-address>.job.0.0, since there is only 1 job per pod and fio --client is being used.  fio is run with this command:

/usr/local/bin/fio  --client=10.130.1.176 .... --client=10.130.1.160 --max-jobs=1 --output-format=json /var/lib/pbench-agent/fio_ceph-rook-rook-hdd80GBperf-pod-read_bs_4_2019.04.12T13.21.03/1-read-4KiB/sample1/fio.job

OOM kills of Ceph OSDs can be seen in the logs/ subdirectory.  One of those OSDs, OSD 2, was unable to restart; it hit a BlueStore RocksDB data corruption error at 2019-04-15 14:35:24.096.  This OSD's latest restart attempt is captured in crashed-osd-2-oc-logs.log, but we don't have the logs from when it first went down; was it an OOM kill?
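
For reference, OOM kills of the OSD processes can be spotted on the hosts with something like this (a rough sketch; the exact kernel message wording varies by kernel version):

# on the OSD host: look for the kernel OOM killer hitting ceph-osd
dmesg -T | grep -i ceph-osd | grep -iE 'oom|out of memory|killed process'
# or via the kernel messages in the journal
journalctl -k --no-pager | grep -i ceph-osd | grep -iE 'oom|killed process'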

ceph.conf from inside the OSDs shows that the osd_memory_target was set to 6 GB.   But when I try to talk to the daemon directly, using 

docker exec -it <container-name> bash

I get:

[root@e23-h01-740xd /]# ceph daemon mon.a config show
Can't get admin socket path: unable to get conf option admin_socket for mon.a: warning: line 18: 'osd_pool_default_size' in section 'global' redefined 
warning: line 19: 'osd_pool_default_min_size' in section 'global' redefined 
warning: line 20: 'osd_pool_default_pg_num' in section 'global' redefined 
warning: line 21: 'osd_pool_default_pgp_num' in section 'global' redefined 
parse error setting 'cluster_addr' to ''
parse error setting 'public_addr' to ''

which might mean that it never parsed and used the osd_memory_target parameter.
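
One way around the broken ceph.conf would be to point the ceph CLI at the OSD's admin socket directly; this is only a sketch, and the asok path follows the pattern rook uses for its OSD data directories here (adjust the OSD id):

# from inside the OSD container; -c /dev/null avoids parsing the bad ceph.conf
ceph -c /dev/null daemon /var/lib/rook/osd2/rook-ceph-osd.2.asok config get osd_memory_target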

Comment 1 Mark Nelson 2019-04-15 17:32:54 UTC
This appears to be the relevant line in the logs:

2019-04-15 14:35:24.096 7f2224ffed80  3 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:485] db.wal/001071.log: dropping 3632 bytes; Corruption: missing start of fragmented record(2)

That's pointing to a consistency issue related to the WAL.  There have been previous reports of something similar on ceph-users:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022039.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028795.html

This pull request in upstream RocksDB may be relevant:

https://github.com/facebook/rocksdb/pull/3603

Mark

Comment 2 Ben England 2019-04-15 18:23:33 UTC
Hi Mark, did this fix make it into Nautilus but not Mimic?  If so, that's a reasonable working hypothesis; very soon we'll be able to try it with Nautilus using rook.io and see.

Comment 3 Ben England 2019-04-15 22:09:17 UTC
Somehow we got a data loss event out of this (OBJECT_UNFOUND below)!  This should not happen as a result of losing a single OSD.  We were using size=3 min_size=2 for all pools.  Perhaps this is because multiple OSDs were OOM-killed.  If we avoid the OOM kills, I bet we avoid this problem, which I have never seen before.  I'm going to recreate the cluster without setting the "public addr" and "cluster addr" parameters, and this time make sure osd_memory_target is applied by asking the OSD daemon what it is set to.  It should default to 4 GB, much lower than the CGroup limit.

BTW, deleting the fio pods did NOT clean up the files.  That's too bad because the bad PG below was in cephfs_data and might have been corrected by removing the RADOS objects for that file.  Fortunately it did not unmount the filesystem in the pods so I was able to go and delete the files by hand using this:

for p in `ocrc get pod | awk '/fio-pod/{ print $1 }'` ; do echo $p ; ocrc rsh $p sh -c 'ip=$(ifconfig | awk "/ inet 10/{print \$2}") ; echo $ip ; rm -fv /mnt/cephfs/$ip.job.0.0' ; done

sh-4.2# ceph health detail
HEALTH_ERR 3983/6083727 objects misplaced (0.065%); 1/2027909 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 3986/6083727 objects degraded (0.066%), 1 pg degraded, 1 pg undersized
OBJECT_MISPLACED 3983/6083727 objects misplaced (0.065%)
OBJECT_UNFOUND 1/2027909 objects unfound (0.000%)
    pg 2.128 has 1 unfound objects
PG_DAMAGED Possible data damage: 1 pg recovery_unfound
    pg 2.128 is active+forced_recovery+recovery_unfound+undersized+degraded+remapped+forced_backfill, acting [9,19], 1 unfound
PG_DEGRADED Degraded data redundancy: 3986/6083727 objects degraded (0.066%), 1 pg degraded, 1 pg undersized
    pg 2.128 is stuck undersized for 5270.321579, current state active+forced_recovery+recovery_unfound+undersized+degraded+remapped+forced_backfill, last acting [9,19]

sh-4.2# ceph -s
  cluster:
    id:     e16f75fb-2b14-44e5-ae25-b7da5d007837
    health: HEALTH_ERR
            3983/6083727 objects misplaced (0.065%)
            1/2027909 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 3986/6083727 objects degraded (0.066%), 1 pg degraded, 1 pg undersized
 
  services:
    mon: 3 daemons, quorum a,b,c
    mgr: a(active)
    mds: myfs-1/1/1 up  {0=myfs-b=up:active}, 1 up:standby-replay
    osd: 23 osds: 23 up, 23 in; 1 remapped pgs
 
  data:
    pools:   2 pools, 1024 pgs
    objects: 2.03 M objects, 7.7 TiB
    usage:   24 TiB used, 14 TiB / 38 TiB avail
    pgs:     3986/6083727 objects degraded (0.066%)
             3983/6083727 objects misplaced (0.065%)
             1/2027909 objects unfound (0.000%)
             1022 active+clean
             1    active+forced_recovery+recovery_unfound+undersized+degraded+remapped+forced_backfill
             1    active+clean+scrubbing+deep
 
  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
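
For the unfound object itself, if no OSD ever turns out to have a copy, the standard (and lossy) options are the following; noting them only for completeness, since they throw away the unfound data:

# list which objects in the PG are unfound
ceph pg 2.128 list_unfound
# last resort, discards/rolls back the unfound object (data loss)
ceph pg 2.128 mark_unfound_lost revert    # or 'delete' if there is no prior version to revert to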

Comment 4 Ben England 2019-04-16 14:52:25 UTC
I ripped down the bad cluster and redeployed.  I still cannot deploy Nautilus successfully, so I went back to ceph/ceph_v14.  As for the cause of the OOM memory kills, I found it: the CGroup limit is not above osd_memory_target (here they are equal).  However, it's not my fault.  If rook.io is doing this, it led to data loss and needs to be fixed ASAP.

[root@e23-h03-740xd ceph]# ocrc rsh rook-ceph-osd-11-5f59dd4bb5-gp2c9
sh-4.2# ceph daemon /var/lib/rook/osd11/rook-ceph-osd.11.asok config show | grep osd_memory_target
    "osd_memory_target": "8388608000",

CGroup limits are listed below...

[root@e23-h03-740xd ceph]# oc describe node/e24-h03-740xd.alias.bos.scalelab.redhat.com
Name:               e24-h03-740xd.alias.bos.scalelab.redhat.com
...
Non-terminated Pods:         (17 in total)
  Namespace                  Name                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                     ------------  ----------  ---------------  -------------
...
  rook-ceph                  rook-ceph-osd-12-c567bf489-txqqj         1 (1%)        1 (1%)      8000Mi (4%)      8000Mi (4%)
  rook-ceph                  rook-ceph-osd-14-6b96d869f9-f47k4        1 (1%)        1 (1%)      8000Mi (4%)      8000Mi (4%)
...

This is not what I specified in my config map in cluster-ben.yaml for my most recent attempt:

apiVersion: v1
kind: Namespace
metadata:
  name: rook-ceph
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: "\n
   [global] \n
   public network = 10.1.0.0/16 \n
   cluster network = 10.2.0.0/16 \n
   [mds] \n
   mds_cache_memory_limit = 4000000000 \n
   [osd]\n
   osd_memory_target = 6000000000\n"
---
...


[root@e23-h03-740xd ceph]# ocrc get cm/rook-ceph-config -o yaml
apiVersion: v1
data:
  ceph.conf: "[global]\nlog_file                  = \nmon_cluster_log_file      =
    \nmon_allow_pool_delete     = true\nmon_max_pg_per_osd        = 1000\nosd_pg_bits
    \              = 11\nosd_pgp_bits              = 11\nosd_pool_default_size     =
    1\nosd_pool_default_min_size = 1\nosd_pool_default_pg_num   = 100\nosd_pool_default_pgp_num
    \ = 100\nrbd_default_features      = 3\nfatal_signal_handlers     = false\npublic
    network            = 10.1.0.0/16\ncluster network           = 10.2.0.0/16\n\n[mds]\nmds_cache_memory_limit
    = 4000000000\n\n[osd]\nosd_memory_target = 6000000000\n\n"
kind: ConfigMap
metadata:
  creationTimestamp: 2019-04-16T13:22:07Z
  name: rook-ceph-config
  namespace: rook-ceph
  resourceVersion: "6705507"
  selfLink: /api/v1/namespaces/rook-ceph/configmaps/rook-ceph-config
  uid: a23d3e02-604a-11e9-8dac-246e96bdcfda


So how am I to blame for this?  What could the user do differently?  BTW I couldn't even do this check before, because I was specifying these parameters:

osd pool default size = 3
osd pool default min size = 2

Whereas rook.io was also setting them to 1 (a bad idea; the default should be min(3, OSD-node-count)), and the resulting redefined settings (see the warnings above) meant the "ceph daemon" command wouldn't even talk to the daemon.
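
For reference, once the config override stops tripping up the CLI, the effective replication settings can be cross-checked like this (pool name as referred to above; the rook-created pool names may differ):

# did size=3 / min_size=2 actually take effect, or did the rook defaults of 1 win?
ceph osd pool ls detail
ceph osd pool get cephfs_data size
ceph osd pool get cephfs_data min_size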

Comment 5 Ben England 2019-04-16 15:44:44 UTC
sorry I meant I went back to ceph/ceph_v13

Comment 9 Sébastien Han 2019-04-16 20:04:29 UTC
Ben, can you describe/inspect the osd container, please? Which version of rook?
Rook sets osd-memory-target to the pod's limit.

Is the osd consuming more memory than the target?
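
Something along these lines would show both sides (pod name taken from comment 4; adjust to the actual pod):

# memory limit kubernetes enforces on the osd container
oc -n rook-ceph get pod rook-ceph-osd-11-5f59dd4bb5-gp2c9 \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# target the daemon is actually running with
oc -n rook-ceph exec rook-ceph-osd-11-5f59dd4bb5-gp2c9 -- \
  ceph daemon /var/lib/rook/osd11/rook-ceph-osd.11.asok config get osd_memory_target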

Comment 10 Ben England 2019-04-16 21:35:02 UTC
rook version = docker.io/rook/ceph:master, 047e0878ff14, 3 days ago

I appreciate that rook.io is trying to make things easier by removing the need for the end user to tune osd_memory_target, but setting osd_memory_target and the CGroup limit to the same value is causing the OOM kills.  I'd suggest setting the CGroup limit to osd_memory_target plus a margin of 1/2 GB + (0.1 x osd_memory_target).  If we still get an OOM kill in that case, then it is clearly caused by the OSD consuming too much memory.  The osd_memory_target implementation is not perfect.  We can work on reducing the gap between osd_memory_target and the CGroup limit with performance testing in future releases, but for now we have to live with the gap.
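
To put rough numbers on that margin with the current target (this is just the arithmetic behind the suggestion, not a tested recommendation):

target=8388608000                                   # current osd_memory_target (8000 MiB)
limit=$(( target + 512*1024*1024 + target / 10 ))   # + 1/2 GiB + 10% of the target
echo "$(( limit / 1024 / 1024 )) MiB"               # prints 9312 MiB, vs the 8000 MiB CGroup limit set today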

Here are a couple of examples of the memory usage in the OSD when the OOM kill occurred, from dmesg-logs/e24-h03-740xd/tmp/dmesg (other examples in dmesg-logs/{e23,e24}-h01-740xd/tmp/dmesg).

[55703.179053] Memory cgroup stats for /kubepods.slice/kubepods-podcd1cbfa0_5bc9_11e9_8dac_246e96bdcfda.slice/docker-df3983cba86afe458e1623621713ce2b0208ac0775d9db1e1713e8f6a40eed2a.scope: cache:144KB rss:8190824KB rss_huge:51200KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:8190816KB inactive_file:16KB active_file:0KB unevictable:0KB
...
[55706.573201] Memory cgroup stats for /kubepods.slice/kubepods-podcd38e8c8_5bc9_11e9_8dac_246e96bdcfda.slice/docker-20a2e270e4818911f95a3db21b7fe40a60b37499ab10603b7747f9605db5ab6f.scope: cache:144KB rss:8191856KB rss_huge:67584KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:8191796KB inactive_file:28KB active_file:0KB unevictable:0KB

and from above comment 4:

    "osd_memory_target": "8388608000",
CGroup limit = 8000Mi 

8000*1024*1024 = 8388608000, 

so as you say, the osd_memory_target == the CGroup limit.

8000*1024 = 8192000 KB, and both of these examples above have a total memory consumption (approximately rss+rss_huge) > 8192000 KB.

Also, it would be good to have an escape hatch that lets us override the rook default for osd_memory_target, just in case the default isn't working for somebody.  Perhaps a rook parameter that makes the 10% CGroup-limit margin the default and lets the user adjust it.  This is a general problem for any storage service, not just Ceph OSDs, right?

Does that make sense?  HTH -ben

Comment 11 Mark Nelson 2019-04-16 21:52:53 UTC
Regarding the osd_memory_target, we only control mapped memory, not RSS memory.  There's no guarantee that, when we've unmapped memory in the OSD, the kernel will reclaim it (and certainly no guarantee that it will reclaim it immediately).  Sometimes the kernel chooses not to because there isn't memory pressure (not this case), and sometimes it simply can't reclaim it, especially when transparent huge pages are enabled.  Typically I see overage at less than 10%, but it's possible that it could be higher in some situations.  The container memory limit will absolutely need to be higher than the osd_memory_target, though.

Comment 12 Mark Nelson 2019-04-16 21:54:36 UTC
I should also mention that the BlueStore cache can currently exceed its limit because we trim the caches in a loop (every 50 ms or so, if I remember correctly).  I'm planning on changing this to trim on write (which helps, but doesn't totally fix memory overage), but I'm not sure that change will be backported to Nautilus.

Comment 13 Ben England 2019-04-16 22:13:59 UTC
I left a comment on the PR.  I like the osdMemoryTargetSafetyFactor var; it lets people fine-tune the safety margin later.

Comment 14 Ben England 2019-04-17 17:14:17 UTC
Awaiting a container image with the PR.  Meanwhile, Shekhar Berry and I tried re-running the fio load; I manually set osd_memory_target like this before the test:

ceph tell osd.* injectargs --osd_memory_target 6000000000

But this did not work; we are still getting OOM kills like this one during the fio run:

https://pastebin.com/2dBH4gqL

Rook.io is forcing its value for osd_memory_target by running the OSDs this way; see the end of the command line:

[root@e23-h01-740xd ~]# ps awux | grep ceph-osd
root     112679 85.1  0.9 2932148 1943764 ?     Ssl  16:36  30:53 ceph-osd --foreground --id 21 --conf /var/lib/rook/osd21/rook-ceph.config --osd-data /var/lib/rook/osd21 --keyring /var/lib/rook/osd21/keyring --cluster rook-ceph --osd-uuid 9580a63c-90c8-4384-8c3b-2bad1bba50e5 --osd-memory-target 8388608000
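
To check whether the injected 6 GB value actually stuck on a given OSD, versus the value rook put on the command line (asok path pattern as in comment 4):

# value the daemon is using right now
ceph daemon /var/lib/rook/osd21/rook-ceph-osd.21.asok config get osd_memory_target
# settings that differ from the built-in defaults
ceph daemon /var/lib/rook/osd21/rook-ceph-osd.21.asok config diff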

Comment 15 Ben England 2019-10-01 13:11:40 UTC
Please close; I think this is part of another bug (rook setting osd_memory_target = resources:osd:memory:limit) that was already fixed.  AFAIK no one has reproduced this in a while with Nautilus; this bz was on Mimic.

