Description of problem:

In testing with RHOCS and the fio workload, we got Ceph into a state where one of the OSDs reported RocksDB corruption and the rest of the cluster did not backfill. The application hung and did not complete. More data is needed, but we have some and wanted to get your input on how to proceed.

Version-Release number of selected component (if applicable):

OpenShift 3.11: atomic-openshift-3.11.98-1.git.0.0cbaff3.el7.x86_64
Ceph Mimic upstream: docker.io/ceph/ceph:v13, id 4775d1a2eff9, 4 weeks ago
rook.io: docker.io/rook/ceph:master, 9 days ago
fio: v3.12

Note that the same workload and configuration succeeded when a much smaller amount of data was used. We are waiting for a Nautilus image to be available for install with rook.io so we can switch to a code base that is more up to date and supported.

How reproducible:

It happened the first time; we don't know whether it would happen again.

Steps to Reproduce:
1. install OpenShift 3.11
2. install Ceph with rook.io
3. deploy a Ceph storage class for CephFS
4. deploy fio pods using that storage class
5. run the fio workload using pbench-fio

Actual results:

fio hung, 1 OSD went down, and PGs stayed in a degraded state (no backfill).

Expected results:

fio completes successfully and no OSDs go down. Or, if the system is out of space, fio should return ENOSPC rather than hang.

Additional info:

All result files/logs are in http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/fio-osd-crash/

fio never completed with this workload, which required an 80 GB file per fio pod, so a total of 8 TB of space was consumed. With 3-way replication that comes to 24 TB out of the system's 38 TB of raw space, or about 62%. OSD 23 had only 18% free space left, though.

fio output is in nohup.out, which includes the fio job file. If you log into the fio pods you can see that each of them has a file of the form /mnt/cephfs/<ip-address>.job.0.0, since there is only 1 job per pod and fio --client is being used.
fio is running using this command:

/usr/local/bin/fio --client=10.130.1.176 .... --client=10.130.1.160 --max-jobs=1 --output-format=json /var/lib/pbench-agent/fio_ceph-rook-rook-hdd80GBperf-pod-read_bs_4_2019.04.12T13.21.03/1-read-4KiB/sample1/fio.job

OOM kills of Ceph OSDs can be seen in the logs/ subdirectory. One of those OSDs, OSD 2, was unable to restart; it had a BlueStore RocksDB data corruption error at 2019-04-15 14:35:24.096. This OSD's latest restart attempt is captured in crashed-osd-2-oc-logs.log, but we don't have the logs from when it actually went down for the first time. Was it an OOM kill?

ceph.conf from inside the OSDs shows that osd_memory_target was set to 6 GB. But when I try to talk to the daemon directly, using "docker exec -it <container-name> bash", I get:

[root@e23-h01-740xd /]# ceph daemon mon.a config show
Can't get admin socket path: unable to get conf option admin_socket for mon.a:
warning: line 18: 'osd_pool_default_size' in section 'global' redefined
warning: line 19: 'osd_pool_default_min_size' in section 'global' redefined
warning: line 20: 'osd_pool_default_pg_num' in section 'global' redefined
warning: line 21: 'osd_pool_default_pgp_num' in section 'global' redefined
parse error setting 'cluster_addr' to ''
parse error setting 'public_addr' to ''

which might mean that it never parsed and used the osd_memory_target parameter.
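The "redefined" warnings above are what happens when the same key appears twice in one section of the merged ceph.conf (here, because both the rook-generated config and the user override set the pool defaults; see the config map discussion later in this bz). As a rough illustration only, Python's configparser shows the same duplicate-key situation and its last-value-wins resolution; this is not Ceph's parser, and Ceph additionally emits the warnings shown above.

```python
# Illustrative sketch (not Ceph's parser): duplicate keys in one
# section, as when rook's generated defaults and a user override
# both set osd_pool_default_size in [global].
import configparser

conf_text = """[global]
osd_pool_default_size = 1
osd_pool_default_min_size = 1
osd_pool_default_size = 3
osd_pool_default_min_size = 2
"""

# strict=False tolerates the duplicates; the last value wins
cp = configparser.ConfigParser(strict=False)
cp.read_string(conf_text)
print(cp["global"]["osd_pool_default_size"])      # "3"
print(cp["global"]["osd_pool_default_min_size"])  # "2"
```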
This appears to be the relevant line in the logs:

2019-04-15 14:35:24.096 7f2224ffed80 3 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:485] db.wal/001071.log: dropping 3632 bytes; Corruption: missing start of fragmented record(2)

That's pointing to a consistency issue related to the WAL. There have been previous reports of something similar on ceph-users:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022039.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028795.html

This particular commit may be relevant in upstream RocksDB:

https://github.com/facebook/rocksdb/pull/3603

Mark
Hi Mark, did this fix make it into Nautilus but not Mimic? If so, that's a reasonable working hypothesis. Very soon we'll be able to try it with Nautilus using rook.io and see.
Somehow we got a data loss event out of this (OBJECT_UNFOUND below)! This should not happen as a result of losing a single OSD; we were using size=3, min_size=2 for all pools. Perhaps this is because multiple OSDs were OOM-killed. If we avoid the OOM kills, I bet we avoid this problem, which I have never seen before.

I'm going to recreate the cluster without setting the "public addr" and "cluster addr" parameters, and this time make sure that osd_memory_target is set by asking the OSD daemon what it is set to. It should default to 4 GB, much lower than the CGroup limit.

BTW, deleting the fio pods did NOT clean up the files. That's too bad, because the bad PG below was in cephfs_data and might have been corrected by removing the RADOS objects for that file. Fortunately it did not unmount the filesystem in the pods, so I was able to go and delete the files by hand using this:

for p in `ocrc get pod | awk '/fio-pod/{ print $1 }'` ; do echo $p ; ocrc rsh $p sh -c 'ip=$(ifconfig | awk "/ inet 10/{print \$2}") ; echo $ip ; rm -fv /mnt/cephfs/$ip.job.0.0' ; done

sh-4.2# ceph health detail
HEALTH_ERR 3983/6083727 objects misplaced (0.065%); 1/2027909 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 3986/6083727 objects degraded (0.066%), 1 pg degraded, 1 pg undersized
OBJECT_MISPLACED 3983/6083727 objects misplaced (0.065%)
OBJECT_UNFOUND 1/2027909 objects unfound (0.000%)
    pg 2.128 has 1 unfound objects
PG_DAMAGED Possible data damage: 1 pg recovery_unfound
    pg 2.128 is active+forced_recovery+recovery_unfound+undersized+degraded+remapped+forced_backfill, acting [9,19], 1 unfound
PG_DEGRADED Degraded data redundancy: 3986/6083727 objects degraded (0.066%), 1 pg degraded, 1 pg undersized
    pg 2.128 is stuck undersized for 5270.321579, current state active+forced_recovery+recovery_unfound+undersized+degraded+remapped+forced_backfill, last acting [9,19]

sh-4.2# ceph -s
  cluster:
    id:     e16f75fb-2b14-44e5-ae25-b7da5d007837
    health: HEALTH_ERR
            3983/6083727 objects misplaced (0.065%)
            1/2027909 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 3986/6083727 objects degraded (0.066%), 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: a(active)
    mds: myfs-1/1/1 up {0=myfs-b=up:active}, 1 up:standby-replay
    osd: 23 osds: 23 up, 23 in; 1 remapped pgs

  data:
    pools:   2 pools, 1024 pgs
    objects: 2.03 M objects, 7.7 TiB
    usage:   24 TiB used, 14 TiB / 38 TiB avail
    pgs:     3986/6083727 objects degraded (0.066%)
             3983/6083727 objects misplaced (0.065%)
             1/2027909 objects unfound (0.000%)
             1022 active+clean
             1    active+forced_recovery+recovery_unfound+undersized+degraded+remapped+forced_backfill
             1    active+clean+scrubbing+deep

  io:
    client: 852 B/s rd, 1 op/s rd, 0 op/s wr
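The numbers in the health output above can be cross-checked with a little arithmetic: the per-replica object count is exactly 3x the unique object count, consistent with the size=3 pools mentioned earlier, and the degraded percentage matches.

```python
# Sanity check on the ceph health figures quoted above.
unique_objects = 2_027_909    # denominator of OBJECT_UNFOUND (unique objects)
replica_objects = 6_083_727   # denominator of OBJECT_MISPLACED/PG_DEGRADED (object copies)

# 3 copies per object, consistent with size=3 on all pools
assert replica_objects == 3 * unique_objects

degraded = 3986
misplaced = 3983
print(f"degraded:  {100 * degraded / replica_objects:.3f}%")   # 0.066%, as reported
print(f"misplaced: {100 * misplaced / replica_objects:.3f}%")  # 0.065%, as reported
```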
I ripped down the bad cluster and redeployed. I still cannot deploy Nautilus successfully, so I went back to ceph/ceph_v14.

As for the cause of the OOM kills, I found it: CGroup limit < osd_memory_target. However, it's not my fault. If rook.io is doing this, it led to data loss and needs to be fixed ASAP.

[root@e23-h03-740xd ceph]# ocrc rsh rook-ceph-osd-11-5f59dd4bb5-gp2c9
sh-4.2# ceph daemon /var/lib/rook/osd11/rook-ceph-osd.11.asok config show | grep osd_memory_target
    "osd_memory_target": "8388608000",

CGroup limits are listed below...

[root@e23-h03-740xd ceph]# oc describe node/e24-h03-740xd.alias.bos.scalelab.redhat.com
Name: e24-h03-740xd.alias.bos.scalelab.redhat.com
...
Non-terminated Pods: (17 in total)
  Namespace   Name                               CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------   ----                               ------------  ----------  ---------------  -------------
  ...
  rook-ceph   rook-ceph-osd-12-c567bf489-txqqj   1 (1%)        1 (1%)      8000Mi (4%)      8000Mi (4%)
  rook-ceph   rook-ceph-osd-14-6b96d869f9-f47k4  1 (1%)        1 (1%)      8000Mi (4%)      8000Mi (4%)
  ...

This is not what I specified in my config map in cluster-ben.yaml for my most recent attempt:

apiVersion: v1
kind: Namespace
metadata:
  name: rook-ceph
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: "\n [global] \n public network = 10.1.0.0/16 \n cluster network = 10.2.0.0/16 \n [mds] \n mds_cache_memory_limit = 4000000000 \n [osd]\n osd_memory_target = 6000000000\n"
---
...
[root@e23-h03-740xd ceph]# ocrc get cm/rook-ceph-config -o yaml
apiVersion: v1
data:
  ceph.conf: "[global]\nlog_file = \nmon_cluster_log_file = \nmon_allow_pool_delete = true\nmon_max_pg_per_osd = 1000\nosd_pg_bits = 11\nosd_pgp_bits = 11\nosd_pool_default_size = 1\nosd_pool_default_min_size = 1\nosd_pool_default_pg_num = 100\nosd_pool_default_pgp_num = 100\nrbd_default_features = 3\nfatal_signal_handlers = false\npublic network = 10.1.0.0/16\ncluster network = 10.2.0.0/16\n\n[mds]\nmds_cache_memory_limit = 4000000000\n\n[osd]\nosd_memory_target = 6000000000\n\n"
kind: ConfigMap
metadata:
  creationTimestamp: 2019-04-16T13:22:07Z
  name: rook-ceph-config
  namespace: rook-ceph
  resourceVersion: "6705507"
  selfLink: /api/v1/namespaces/rook-ceph/configmaps/rook-ceph-config
  uid: a23d3e02-604a-11e9-8dac-246e96bdcfda

So how am I to blame for this? What could the user do differently?

BTW, I couldn't even do this check before, because I was specifying these parameters:

osd pool default size = 3
osd pool default min size = 2

whereas rook.io was also setting them to 1 (a bad idea; the default should be min(3, OSD-node-count)), and consequently the "ceph daemon" command wouldn't even talk to the daemon.
sorry I meant I went back to ceph/ceph_v13
Ben, can you describe/inspect the osd container, please? Which version of rook? Rook sets osd-memory-target to pod's limit. Is the osd consuming more memory than the target?
rook version: docker.io/rook/ceph:master, 047e0878ff14, 3 days ago

I appreciate that rook.io is trying to make things easier by taking away the need for the end user to tune osd_memory_target, but setting osd_memory_target and the CGroup limit to the same value is causing the OOM kills. I'd suggest setting the CGroup limit to osd_memory_target plus a margin of 1/2 GB + (0.1 x osd_memory_target). If we still get an OOM kill in that case, then it clearly is caused by the OSD consuming too much memory. The osd_memory_target implementation is not perfect. We can work on reducing the gap between osd_memory_target and the CGroup limit with performance testing in future releases, but for now we have to live with the gap.

Here are a couple of examples of the memory usage in the OSD when the OOM kill occurred, from dmesg-logs/e24-h03-740xd/tmp/dmesg (other examples in dmesg-logs/{e23,e24}-h01-740xd/tmp/dmesg):

[55703.179053] Memory cgroup stats for /kubepods.slice/kubepods-podcd1cbfa0_5bc9_11e9_8dac_246e96bdcfda.slice/docker-df3983cba86afe458e1623621713ce2b0208ac0775d9db1e1713e8f6a40eed2a.scope: cache:144KB rss:8190824KB rss_huge:51200KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:8190816KB inactive_file:16KB active_file:0KB unevictable:0KB
...
[55706.573201] Memory cgroup stats for /kubepods.slice/kubepods-podcd38e8c8_5bc9_11e9_8dac_246e96bdcfda.slice/docker-20a2e270e4818911f95a3db21b7fe40a60b37499ab10603b7747f9605db5ab6f.scope: cache:144KB rss:8191856KB rss_huge:67584KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:8191796KB inactive_file:28KB active_file:0KB unevictable:0KB

and from comment 4 above:

"osd_memory_target": "8388608000",
CGroup limit = 8000Mi

8000*1024*1024 = 8388608000, so as you say, osd_memory_target == the CGroup limit. 8000*1024 = 8192000 KB, and both of the examples above have a total memory consumption (approximately rss+rss_huge) > 8192000 KB.
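The arithmetic above can be spelled out in a few lines. Note that the rss+rss_huge sum is the approximation used in this comment, not an exact cgroup accounting metric.

```python
# Check whether the dmesg memory numbers exceed the pod's cgroup limit.
CGROUP_LIMIT_MIB = 8000                        # "8000Mi" from the pod spec
limit_bytes = CGROUP_LIMIT_MIB * 1024 * 1024   # equals the 8388608000 osd_memory_target
limit_kib = CGROUP_LIMIT_MIB * 1024            # 8192000 KB, as computed in the comment

# "Memory cgroup stats" samples from dmesg, in KB
samples = [
    {"rss": 8190824, "rss_huge": 51200},
    {"rss": 8191856, "rss_huge": 67584},
]

for s in samples:
    approx_kib = s["rss"] + s["rss_huge"]      # the comment's approximation of usage
    print(approx_kib, approx_kib > limit_kib)  # both samples exceed the limit
```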
Also, it would be a good escape hatch if we could override the rook default for osd_memory_target, just in case the default isn't working for somebody. Perhaps a rook parameter that makes the 10% CGroup-limit margin a default and lets the user adjust it. This is a general problem for any storage service, not just Ceph OSDs, right? Does that make sense? HTH -ben
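The margin suggested in the previous two comments can be sketched as a formula. This is a hypothetical helper, not Rook code: the 0.5 GiB and 10% figures come from the suggestion above and the safety factor is the adjustable parameter being proposed.

```python
# Hypothetical sketch: derive the cgroup memory limit from
# osd_memory_target instead of setting the two equal.
def cgroup_limit_bytes(osd_memory_target: int, safety_factor: float = 0.10) -> int:
    """Proposed limit: target + 1/2 GiB + safety_factor * target."""
    half_gib = 512 * 1024 * 1024
    return osd_memory_target + half_gib + int(osd_memory_target * safety_factor)

target = 6_000_000_000  # osd_memory_target from cluster-ben.yaml
print(cgroup_limit_bytes(target))  # 7136870912, comfortably above the target
```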
Regarding the osd_memory_target, we only control mapped memory, not RSS memory. There's no guarantee that when we've unmapped memory in the OSD that the kernel will reclaim it (and certainly no guarantee that it will reclaim it immediately). Sometimes the kernel chooses not to because there isn't memory pressure (not this case) and sometimes it simply can't reclaim it, especially when transparent huge pages is enabled. Typically I see overage at less than 10%, but it's possible that it could be higher in some situations. The container memory limit will absolutely need to be higher than the osd_memory_target though.
I should also mention that the BlueStore cache currently can exceed its limit, as we trim the caches in a loop (every 50ms or so, if I remember correctly). I'm planning on changing this to trim on write (which helps, but doesn't totally fix memory overage), but I'm not sure that change will be backported to Nautilus.
I left a comment on the PR. I like the osdMemoryTargetSafetyFactor var; it lets people fine-tune the safety margin later.
Awaiting a container image with the PR. Meanwhile, Shekhar Berry and I tried re-running the fio load. I manually set osd_memory_target like this before the test:

ceph tell osd.* injectargs --osd_memory_target 6000000000

But this did not work; we are still getting OOM kills like this one during the fio run: https://pastebin.com/2dBH4gqL

Rook.io is forcing its value for osd_memory_target by running OSDs this way (see the end of the command line):

[root@e23-h01-740xd ~]# ps awux | grep ceph-osd
root 112679 85.1 0.9 2932148 1943764 ? Ssl 16:36 30:53 ceph-osd --foreground --id 21 --conf /var/lib/rook/osd21/rook-ceph.config --osd-data /var/lib/rook/osd21 --keyring /var/lib/rook/osd21/keyring --cluster rook-ceph --osd-uuid 9580a63c-90c8-4384-8c3b-2bad1bba50e5 --osd-memory-target 8388608000
Please close, I think this is part of another bug (rook setting osd_memory_target = resources:osd:memory:limit) that was already fixed. No one has reproduced this in a while AFAIK with Nautilus, this bz was on Mimic.