+++ This bug was initially created as a clone of Bug #2241185 +++

Description of problem (please be as detailed as possible and provide log snippets):

[RDR] Ceph reports OSDs as down but the OSD pods are in Running state

Version of all relevant components (if applicable):
OCP version:- 4.14.0-0.nightly-2023-09-20-033502
ODF version:- 4.14.0-136
CEPH version:- ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)
ACM version:- 2.9.0-109
SUBMARINER version:- devel
VOLSYNC version:- volsync-product.v0.7.4
VOLSYNC method:- destinationCopyMethod: LocalDirect

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an RDR cluster
2. Run workloads for a few days
3. Observe the Ceph status and the OSD pod status

Actual results:

ceph status:
  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive
            Degraded data redundancy: 154474/231711 objects degraded (66.667%), 131 pgs degraded, 169 pgs undersized

  services:
    mon:        3 daemons, quorum d,e,f (age 116m)
    mgr:        a(active, since 116m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 1 up (since 110m), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 77.24k objects, 294 GiB
    usage:   884 GiB used, 5.1 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             154474/231711 objects degraded (66.667%)
             131 undersized+degraded+peered
             38 undersized+peered

oc get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-5f7f4cb466-9rcwq   2/2     Running   2          9h
rook-ceph-osd-1-6cbcd67bfb-dpf4n   2/2     Running   0          121m
rook-ceph-osd-2-5f5cfb4f6d-xnhx2   2/2     Running   2          26h

Snippet from the OSD pod log:

debug2023-09-28T11:26:47.928+0000 7f5df6527640 1 osd.2 pg_epoch: 20604 pg[12.14( v 20573'397443 (20475'395441,20573'397443] local-lis/les=20489/20490 n=1375 ec=31/22 lis/c=20596/18223 les/c/f=20597/18224/0 sis=20604) [0,2,1] r=1 lpr=20604 pi=[18223,20604)/2 crt=20573'397443 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640 1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 luod=0'0 lua=20592'59504 crt=20603'59507 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df5d26640 1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 luod=0'0 crt=20564'82189 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367 debug2023-09-28T11:26:48.945+0000 7f5df9d2e640 1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 crt=20603'59507 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray debug2023-09-28T11:26:48.945+0000 7f5df5d26640 1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 crt=20564'82189 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray debug2023-09-28T11:26:48.946+0000 7f5dfbd32640 1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 luod=0'0 crt=20573'454858 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367 debug2023-09-28T11:26:48.946+0000 7f5dfbd32640 1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 crt=20573'454858 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray debug2023-09-28T11:26:50.971+0000 7f5df7529640 1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 luod=0'0 crt=20594'95848 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [1,0,2] -> [1,0,2], acting [1,0] -> [1,0,2], acting_primary 1 -> 1, up_primary 1 -> 1, role -1 -> 2, features acting 4540138320759226367 upacting 4540138320759226367 debug2023-09-28T11:26:50.971+0000 7f5df7529640 1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 crt=20594'95848 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray debug2023-09-28T11:27:07.822+0000 7f5df5d26640 0 log_channel(cluster) log [DBG] : 11.d scrub starts debug2023-09-28T11:27:08.026+0000 7f5df5d26640 0 log_channel(cluster) log [DBG] : 11.d scrub ok debug2023-09-28T11:27:08.871+0000 7f5dfbd32640 0 log_channel(cluster) log [DBG] : 12.1 scrub starts debug2023-09-28T11:27:20.526+0000 7f5df7d2a640 0 log_channel(cluster) log [DBG] : 12.1 scrub ok debug2023-09-28T11:27:37.647+0000 7f5dfad30640 0 log_channel(cluster) log [DBG] : 12.13 scrub starts debug2023-09-28T11:27:47.852+0000 7f5dfad30640 0 log_channel(cluster) log [DBG] : 12.13 scrub ok debug2023-09-28T11:27:51.828+0000 7f5df952d640 0 log_channel(cluster) log [DBG] : 11.e scrub starts debug2023-09-28T11:27:52.066+0000 7f5df952d640 0 log_channel(cluster) log [DBG] : 11.e scrub ok debug2023-09-28T11:28:51.854+0000 7f5df852b640 0 
log_channel(cluster) log [DBG] : 5.0 scrub starts
debug2023-09-28T11:28:52.296+0000 7f5df852b640 0 log_channel(cluster) log [DBG] : 5.0 scrub ok
debug2023-09-28T11:28:52.891+0000 7f5df9d2e640 0 log_channel(cluster) log [DBG] : 11.1d scrub starts
debug2023-09-28T11:28:53.130+0000 7f5df5d26640 0 log_channel(cluster) log [DBG] : 11.1d scrub ok
/bin/bash: line 9: 412 Killed ceph-osd --foreground --id 2 --fsid 928c8604-6082-4802-893c-3222d5f150ae --setuser ceph --setgroup ceph --crush-location="root=default host=ocs-deviceset-thin-csi-odf-1-data-0457vv rack=rack2" --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.0.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41
+ RESTART_INTERVAL=24
+ rc=0
+ '[' 0 -eq 0 ']'
+ touch /tmp/osd-sleep
OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed
+ echo 'OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed'
+ wait
+ sleep 24h

Expected results:
The OSD pod should not be in Running state if Ceph reports the OSD as down.

Additional info:

--- Additional comment from RHEL Program Management on 2023-09-28 13:27:46 UTC ---

This bug, having no release flag set previously, is now set with release flag 'odf-4.14.0' to '?', and so is being proposed to be fixed in the ODF 4.14.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-09-28 13:27:46 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Santosh Pillai on 2023-09-28 15:44:21 UTC ---

We added a new change to how OSDs are created to support https://bugzilla.redhat.com/show_bug.cgi?id=2223959. If the OSD daemon exits with a 0 exit code, we assume that it is due to OSD flapping. It looks like in this case the bash script received exit code 0 when the OSD went down, but the exit was caused by something other than OSD flapping. Need to investigate more.

--- Additional comment from Radoslaw Zarzynski on 2023-09-28 16:10:18 UTC ---

> /bin/bash: line 9: 412 Killed ceph-osd --foreground

It looks like the ceph-osd process was killed by a signal, but the orchestrator decided not to respawn the container and is, IIUC, running `sleep 24h` instead of `ceph-osd`. Under such conditions it is expected to see a divergence between `ceph -s` and the containers' statuses.

The interesting question is what generated the signal that terminated the OSD, and why. However, there is no OSD log, no dmesg, nor anything else that could help identify the source of this signal.
Without showing the signal's origin assigning the case to RADOS is premature. Regards, Radek --- Additional comment from Prashant Dhange on 2023-09-28 16:12:32 UTC --- The OSD was killed by external entity : /bin/bash: line 9: 412 Killed ceph-osd --foreground --id 2 --fsid 928c8604-6082-4802-893c-3222d5f150ae --setuser ceph --setgroup ceph --crush-location="root=default host=ocs-deviceset-thin-csi-odf-1-data-0457vv rack=rack2" --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.0.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41 Based on the dmesg logs provided, the OSD was killed by OM killer : [Thu Sep 28 11:31:40 2023] msgr-worker-1 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=3, oom_score_adj=-997 [Thu Sep 28 11:31:40 2023] CPU: 11 PID: 16132 Comm: msgr-worker-1 Not tainted 5.14.0-284.32.1.el9_2.x86_64 #1 [Thu Sep 28 11:31:40 2023] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 [Thu Sep 28 11:31:40 2023] Call Trace: [Thu Sep 28 11:31:40 2023] <TASK> [Thu Sep 28 11:31:40 2023] dump_stack_lvl+0x34/0x48 [Thu Sep 28 11:31:40 2023] dump_header+0x4a/0x201 [Thu Sep 28 11:31:40 2023] oom_kill_process.cold+0xb/0x10 [Thu Sep 28 11:31:40 2023] out_of_memory+0xed/0x2e0 [Thu Sep 28 11:31:40 2023] mem_cgroup_out_of_memory+0x13a/0x150 [Thu Sep 28 11:31:40 2023] try_charge_memcg+0x6df/0x7a0 [Thu Sep 28 11:31:40 2023] ? __alloc_skb+0x8e/0x1d0 [Thu Sep 28 11:31:40 2023] mem_cgroup_charge_skmem+0x3f/0xf0 [Thu Sep 28 11:31:40 2023] __sk_mem_raise_allocated+0x247/0x4a0 [Thu Sep 28 11:31:40 2023] __sk_mem_schedule+0x2b/0x40 [Thu Sep 28 11:31:40 2023] tcp_sendmsg_locked+0x6f9/0xac0 [Thu Sep 28 11:31:40 2023] tcp_sendmsg+0x28/0x40 [Thu Sep 28 11:31:40 2023] sock_sendmsg+0x5b/0x70 [Thu Sep 28 11:31:40 2023] ____sys_sendmsg+0x230/0x270 [Thu Sep 28 11:31:40 2023] ? import_iovec+0x17/0x20 [Thu Sep 28 11:31:40 2023] ? sendmsg_copy_msghdr+0x7d/0xa0 [Thu Sep 28 11:31:40 2023] ___sys_sendmsg+0x81/0xc0 [Thu Sep 28 11:31:40 2023] ? inet_recvmsg+0x5c/0x140 [Thu Sep 28 11:31:40 2023] ? sock_read_iter+0x92/0x100 [Thu Sep 28 11:31:40 2023] ? new_sync_read+0x166/0x180 [Thu Sep 28 11:31:40 2023] ? __fget_light+0x9f/0x130 [Thu Sep 28 11:31:40 2023] __sys_sendmsg+0x59/0xa0 [Thu Sep 28 11:31:40 2023] do_syscall_64+0x59/0x90 [Thu Sep 28 11:31:40 2023] ? syscall_exit_to_user_mode+0x12/0x30 [Thu Sep 28 11:31:40 2023] ? do_syscall_64+0x69/0x90 [Thu Sep 28 11:31:40 2023] ? syscall_exit_to_user_mode+0x12/0x30 [Thu Sep 28 11:31:40 2023] ? do_syscall_64+0x69/0x90 [Thu Sep 28 11:31:40 2023] ? syscall_exit_work+0x11a/0x150 [Thu Sep 28 11:31:40 2023] ? syscall_exit_to_user_mode+0x12/0x30 [Thu Sep 28 11:31:40 2023] ? do_syscall_64+0x69/0x90 [Thu Sep 28 11:31:40 2023] ? irqtime_account_irq+0x3c/0xb0 [Thu Sep 28 11:31:40 2023] ? __irq_exit_rcu+0x46/0xe0 [Thu Sep 28 11:31:40 2023] ? 
sysvec_apic_timer_interrupt+0x3c/0x90 [Thu Sep 28 11:31:40 2023] entry_SYSCALL_64_after_hwframe+0x63/0xcd [Thu Sep 28 11:31:40 2023] RIP: 0033:0x7f5e1a1e3a2d [Thu Sep 28 11:31:40 2023] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 da c1 f4 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 2e c2 f4 ff 48 [Thu Sep 28 11:31:40 2023] RSP: 002b:00007f5e18750930 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [Thu Sep 28 11:31:40 2023] RAX: ffffffffffffffda RBX: 0000000003b50c29 RCX: 00007f5e1a1e3a2d [Thu Sep 28 11:31:40 2023] RDX: 000000000000c000 RSI: 00007f5e18750990 RDI: 0000000000000066 [Thu Sep 28 11:31:40 2023] RBP: 000000000000c000 R08: 0000000000000000 R09: 00007f5e187549d0 [Thu Sep 28 11:31:40 2023] R10: 0000000000010000 R11: 0000000000000293 R12: 000000000000c3cc [Thu Sep 28 11:31:40 2023] R13: 0000560992735340 R14: 0000000003b50c29 R15: 00007f5e18750990 [Thu Sep 28 11:31:40 2023] </TASK> [Thu Sep 28 11:31:40 2023] memory: usage 5242880kB, limit 5242880kB, failcnt 75036 [Thu Sep 28 11:31:40 2023] swap: usage 0kB, limit 0kB, failcnt 0 [Thu Sep 28 11:31:40 2023] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod872d48c7_1d2b_4781_a9ce_85819219ee47.slice/crio-00973b15ebc4f4e724c0883a636f08d13b34f31760791e432d17d38f6b774d9b.scope: [Thu Sep 28 11:31:40 2023] anon 5347696640 file 163840 kernel 16015360 kernel_stack 1261568 pagetables 11128832 percpu 0 sock 4829184 vmalloc 0 shmem 0 zswap 0 zswapped 0 file_mapped 0 file_dirty 0 file_writeback 0 swapcached 0 anon_thp 2097152 file_thp 0 shmem_thp 0 inactive_anon 5347545088 active_anon 24576 inactive_file 147456 active_file 0 unevictable 0 slab_reclaimable 2407680 slab_unreclaimable 1151208 slab 3558888 workingset_refault_anon 0 workingset_refault_file 85 workingset_activate_anon 0 workingset_activate_file 31 workingset_restore_anon 0 workingset_restore_file 0 workingset_nodereclaim 0 pgscan 166907 pgsteal 161735 pgscan_kswapd 0 pgscan_direct 166907 pgsteal_kswapd 0 pgsteal_direct 161735 pgfault 3111055 pgmajfault 0 pgrefill 46159 pgactivate 43386 pgdeactivate 46159 pglazyfree 0 pglazyfreed 0 zswpin 0 zswpout 0 thp_fault_alloc 15 thp_collapse_alloc 0 [Thu Sep 28 11:31:40 2023] Tasks state (memory values in pages): [Thu Sep 28 11:31:40 2023] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [Thu Sep 28 11:31:40 2023] [ 16080] 0 16080 1097 810 40960 0 -997 bash [Thu Sep 28 11:31:40 2023] [ 16123] 167 16123 1522377 1312915 11108352 0 -997 ceph-osd [Thu Sep 28 11:31:40 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-00973b15ebc4f4e724c0883a636f08d13b34f31760791e432d17d38f6b774d9b.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod872d48c7_1d2b_4781_a9ce_85819219ee47.slice/crio-00973b15ebc4f4e724c0883a636f08d13b34f31760791e432d17d38f6b774d9b.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod872d48c7_1d2b_4781_a9ce_85819219ee47.slice/crio-00973b15ebc4f4e724c0883a636f08d13b34f31760791e432d17d38f6b774d9b.scope,task=ceph-osd,pid=16123,uid=167 [Thu Sep 28 11:31:40 2023] Memory cgroup out of memory: Killed process 16123 (ceph-osd) total-vm:6089508kB, anon-rss:5217632kB, file-rss:34028kB, shmem-rss:0kB, UID:167 pgtables:10848kB oom_score_adj:-997 [Thu Sep 28 11:31:40 2023] libceph: osd2 (2)242.0.255.251:6800 socket closed (con state OPEN) Though OSD daemon was consuming less than 3GB of memory, Have we deployed ODF nodes with 
enough system resources? How much RAM is on the ODF nodes?

Kindly provide access to this cluster to investigate this issue further.

--- Additional comment from Prashant Dhange on 2023-09-28 16:18:13 UTC ---

(In reply to Prashant Dhange from comment #5)
> [Thu Sep 28 11:31:40 2023] Tasks state (memory values in pages):
> [Thu Sep 28 11:31:40 2023] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> [Thu Sep 28 11:31:40 2023] [ 16080] 0 16080 1097 810 40960 0 -997 bash
> [Thu Sep 28 11:31:40 2023] [ 16123] 167 16123 1522377 1312915 11108352 0 -997 ceph-osd

My bad, these values are in pages, so the OSD was consuming around 5.8 GB. That is more than the default osd_memory_target. The OSD can, however, consume more than the defined osd_memory_target under different circumstances. Need to investigate it further.
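To make the unit conversion explicit (an editorial aside, not part of the original comment; the numbers are the ceph-osd row of the dmesg task table above, which the kernel reports in 4 KiB pages):

# Convert the ceph-osd task-table values from 4 KiB pages to kB
echo "total_vm: $(( 1522377 * 4 )) kB"   # 6089508 kB, ~5.8 GiB (matches total-vm in the OOM kill line)
echo "rss:      $(( 1312915 * 4 )) kB"   # 5251660 kB, ~5.0 GiB resident (anon-rss + file-rss)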
--- Additional comment from Santosh Pillai on 2023-09-28 16:28:58 UTC ---

(In reply to Prashant Dhange from comment #5)
> The OSD was killed by an external entity :
> /bin/bash: line 9: 412 Killed ceph-osd --foreground --id 2 [...]
>
> Based on the dmesg logs provided, the OSD was killed by the OOM killer :

Is it expected for a Ceph OSD to stop with exit code 0 when killed by the OOM killer?

(We recently made a change in Rook to make the OSD pod sleep for 24h when the OSD daemon exits with a 0 status code. This is to support the OSD flapping scenario, where the OSD daemon exits with exit code 0 when it is marked down 6 times within 600 seconds.)

--- Additional comment from Santosh Pillai on 2023-09-28 16:30:18 UTC ---

Adding needinfo for comment #7

--- Additional comment from Prashant Dhange on 2023-09-28 19:52:00 UTC ---

(In reply to Santosh Pillai from comment #7)
> Is it expected for a Ceph OSD to stop with exit code 0 when killed by the OOM killer?

No. The ceph-osd daemon should exit with code 137 when killed by the OOM killer. Is the ceph-osd daemon running through the shell script or config.json?
The oc get events and oc describe pod output should give more details on the ceph-osd daemon exit code if it is being exec'ed by runc directly. We do see an event similar to the one below if a Ceph daemon is killed by the OOM killer:

    State:          Running
      Started:      Thu, 21 Sep 2023 15:57:43 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 21 Sep 2023 15:51:24 +0200
      Finished:     Thu, 21 Sep 2023 15:54:58 +0200
    Ready:          True
    Restart Count:  6

--- Additional comment from Travis Nielsen on 2023-09-28 21:06:15 UTC ---

Prashant, is this example of the OOMKilled event from this cluster, or from a previous cluster? With the recent change for the OSD to run from a script, the OSD may not be responding as expected to the OOMKill event.

On a related note, I was experimenting with the OSD pod for a related issue today and saw that if I exec into the OSD pod and attempt to kill the OSD process (to force a restart), it ignores most kill signals except SIGTERM. Seems like we need to capture more signals.

--- Additional comment from Prashant Dhange on 2023-09-28 23:23:14 UTC ---

(In reply to Travis Nielsen from comment #10)
> Prashant, is this example of the OOMKilled event from this cluster, or from a previous cluster?

This is from one of the previous occurrences we observed in standalone as well as in ODF environments. We should see similar behavior whether the daemon is running inside a pod or on a RHEL/CoreOS node and getting killed by the OOM killer. I tried the perl command below to demonstrate this behavior.

# perl -wE 'my @xs; for (1..2**20) { push @xs, q{a} x 2**20 }; say scalar @xs;'
Killed
# echo $?
137
# dmesg
...
[37820430.866938] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-19948.scope,task=perl,pid=902192,uid=1000
[37820430.866951] Out of memory: Killed process 902192 (perl) total-vm:63155268kB, anon-rss:62688288kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:123632kB oom_score_adj:0
[37820433.527503] oom_reaper: reaped process 902192 (perl), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

> With the recent change for the OSD to run from a script,
> the OSD may not be responding as expected to the OOMKill event.

Okay. I may have to reproduce this with an OSD to confirm, but AFAIR ceph-osd will return 137 to its parent process in the event of an OOM kill.

> On a related note, I was experimenting with the OSD pod for a related issue
> today and saw that if I exec into the OSD pod and attempt to kill the OSD
> process (to force a restart), it ignores most kill signals except SIGTERM.
> Seems like we need to capture more signals.

Is it by sending a kill signal to the OSD pid ("kill -KILL <osd-pid>")?

--- Additional comment from Pratik Surve on 2023-09-29 14:24:30 UTC ---

> Though OSD daemon was consuming less than 3GB of memory, have we deployed ODF nodes with enough system resources? How much RAM is on the ODF nodes?

Hi Prashant,

Yes, the node has enough system resources. The node was created with 16 CPUs, 64 GB RAM, and a 120 GB root disk.
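As a practical aside on the exit-code check suggested above, something like the following should show the kubelet's view of the last termination (a hedged example: the pod name is taken from the report above, and the "osd" container name is the usual Rook convention and may differ):

# Illustrative only: inspect the last terminated state of the OSD container
oc -n openshift-storage get pod rook-ceph-osd-2-5f5cfb4f6d-xnhx2 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="osd")].lastState.terminated}{"\n"}'
oc -n openshift-storage describe pod rook-ceph-osd-2-5f5cfb4f6d-xnhx2 | grep -A 7 'Last State'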
--- Additional comment from Prashant Dhange on 2023-10-02 18:33:46 UTC ---

(In reply to Pratik Surve from comment #12)
> > Though OSD daemon was consuming less than 3GB of memory, have we deployed ODF nodes with enough system resources? How much RAM is on the ODF nodes?
>
> Hi Prashant,
> Yes, the node has enough system resources. The node was created with 16 CPUs, 64 GB RAM, and a 120 GB root disk.

I do not see any memory leak issue with the OSD daemons, but if we are seeing OSDs getting killed frequently by the OOM killer, then kindly start collecting the osd perf dump output every minute until we experience the OOM killer event for one of the OSD daemons.

Run this script on the bastion node or wherever oc is installed (make sure we have an oc login session for an extended period of time):
--------------------------------------------------------------------------------------------------------------------------------------------
#!/bin/bash
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
LOG_DIR="$PWD/osd_perf_dump_data.$(date +%F_%H-%M-%S)"
SLEEP_TIME=60

mkdir $LOG_DIR
cd $LOG_DIR

while true
do
  for osd in $(oc exec $TOOLS_POD -- ceph osd ls)
  do
    echo "Collecting perf dump for osd.$osd"
    oc exec $TOOLS_POD -- ceph tell osd.$osd perf dump > $LOG_DIR/osd.$osd.perf_dump.$(date +%F_%H-%M-%S).txt
  done
  echo "Sleeping for $SLEEP_TIME seconds"
  sleep $SLEEP_TIME
done
--------------------------------------------------------------------------------------------------------------------------------------------
NOTE: Change the LOG_DIR path accordingly, as the above script will create the log dir in the current directory.

Let me know if you have any further questions.
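A hypothetical way to keep the collector running unattended (the file name collect_osd_perf_dump.sh is assumed here, not part of the original comment):

# Run the collection script in the background so it survives the terminal session
nohup bash collect_osd_perf_dump.sh > collect_osd_perf_dump.log 2>&1 &
tail -f collect_osd_perf_dump.log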
--- Additional comment from Travis Nielsen on 2023-10-02 19:22:11 UTC ---

Adding dev ack; we need to fix this for 4.14, or consider reverting the sleep for flapping OSDs.

Note we've marked this as a blocker for upstream releases: https://github.com/rook/rook/issues/12972

--- Additional comment from Blaine Gardner on 2023-10-02 21:35:13 UTC ---

I think I found an error with the bash script used to handle flapping OSDs. Will discuss with Travis and Santosh offline to issue a patch in time for the release.

--- Additional comment from RHEL Program Management on 2023-10-03 05:13:00 UTC ---

This BZ is being approved for the ODF 4.14.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.14.0'.

--- Additional comment from RHEL Program Management on 2023-10-03 05:13:00 UTC ---

Since this bug has been approved for the ODF 4.14.0 release through release flag 'odf-4.14.0+', the Target Release is being set to 'ODF 4.14.0'.

--- Additional comment from Pratik Surve on 2023-10-03 12:46:56 UTC ---

Hi Prashant,

I have collected the script output here: http://magna012.ceph.redhat.com/script_output.zip.

When some OSDs went down, no logs were collected:

sh dev_script.sh
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds

Before running the script, I saw that the OSDs had recovered:

$ ceph status
  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum d,e,f (age 5d)
    mgr:        a(active, since 5d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 3d), 3 in (since 11d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 328.79k objects, 1.2 TiB
    usage:   3.7 TiB used, 2.3 TiB / 6 TiB avail
    pgs:     169 active+clean

  io:
    client: 889 MiB/s rd, 272 KiB/s wr, 245 op/s rd, 12 op/s wr

When I initiated failover and relocate operations for some workloads, I saw that the OSDs went down again:

$ ceph status
  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive
            Degraded data redundancy: 454044/681066 objects degraded (66.667%), 133 pgs degraded, 169 pgs undersized

  services:
    mon:        3 daemons, quorum d,e,f (age 5d)
    mgr:        a(active, since 5d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 1 up (since 3m), 3 in (since 11d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 227.02k objects, 877 GiB
    usage:   2.6 TiB used, 3.4 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             454044/681066 objects degraded (66.667%)
             133 undersized+degraded+peered
             36 undersized+peered

--- Additional comment from Prashant Dhange on 2023-10-03 23:06:37 UTC ---

(In reply to Pratik Surve from comment #18)
> Hi Prashant,
>
> I have collected the script output here: http://magna012.ceph.redhat.com/script_output.zip.
>
> When some OSDs went down, no logs were collected

Both osd.0 and osd.2 were killed by the OOM killer. From the perf dump output, most of the memory was in use by the BlueStore cache, the majority of it for caching kv and kv_onode:

osd.2.perf_dump.2023-10-03_17-53-07.txt
    "bluestore-pricache": {
        "target_bytes": 4294967296,
        "mapped_bytes": 4192239616,
        "unmapped_bytes": 2023022592,
        "heap_bytes": 6215262208,
        "cache_bytes": 2845415830
    },
...
    "bluestore-pricache:kv": {
        "pri0_bytes": 67108864,
        "pri1_bytes": 0,
        "pri2_bytes": 7443216,
        "pri3_bytes": 3600880,
        "pri4_bytes": 15534992,
        "pri5_bytes": 13541328,
        "pri6_bytes": 372139648,
        "pri7_bytes": 0,
        "pri8_bytes": 0,
        "pri9_bytes": 0,
        "pri10_bytes": 0,
        "pri11_bytes": 93203112,
        "reserved_bytes": 81739384,
        "committed_bytes": 654311424
    },
    "bluestore-pricache:kv_onode": {
        "pri0_bytes": 67108864,
        "pri1_bytes": 0,
        "pri2_bytes": 0,
        "pri3_bytes": 0,
        "pri4_bytes": 0,
        "pri5_bytes": 0,
        "pri6_bytes": 0,
        "pri7_bytes": 0,
        "pri8_bytes": 0,
        "pri9_bytes": 0,
        "pri10_bytes": 1155541216,
        "pri11_bytes": 8284721,
        "reserved_bytes": 77688047,
        "committed_bytes": 1308622848
    },
...

I have recovered the cluster by restarting the osd.0 and osd.2 pods. Now the cluster is in HEALTH_OK state.

> When I initiated failover and relocate operations for some workloads, I saw
> that the OSDs went down again

Can you collect the perf/heap/mempools dump using the script below and perform the failover again to reproduce this issue?
-------------------------------------------------------------------------------------------------------------------------------------
#!/bin/bash
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
LOG_DIR="$PWD/osd_mem_dump_data.$(date +%F_%H-%M-%S)"
SLEEP_TIME=60

mkdir $LOG_DIR
mkdir -p "$LOG_DIR/perf_dump/"
mkdir -p "$LOG_DIR/heap_dump/"
mkdir -p "$LOG_DIR/dump_mempools/"
cd $LOG_DIR

for osd in $(oc exec $TOOLS_POD -- ceph osd ls)
do
  echo "Starting heap profiler for osd.$osd"
  oc exec $TOOLS_POD -- ceph tell osd.$osd heap start_profiler
done

while true
do
  for osd in $(oc exec $TOOLS_POD -- ceph osd ls)
  do
    echo "Collecting perf dump for osd.$osd"
    oc exec $TOOLS_POD -- ceph tell osd.$osd perf dump > $LOG_DIR/perf_dump/osd.$osd.perf_dump.$(date +%F_%H-%M-%S).txt
    echo "Collecting dump_mempools for osd.$osd"
    oc exec $TOOLS_POD -- ceph tell osd.$osd dump_mempools > $LOG_DIR/dump_mempools/osd.$osd.dump_mempools.$(date +%F_%H-%M-%S).txt
    echo "Collecting heap dump for osd.$osd"
    oc exec $TOOLS_POD -- ceph tell osd.$osd heap dump > $LOG_DIR/heap_dump/osd.$osd.heap_dump.$(date +%F_%H-%M-%S).txt
  done
  echo "Sleeping for $SLEEP_TIME seconds"
  sleep $SLEEP_TIME
done
-------------------------------------------------------------------------------------------------------------------------------------

--- Additional comment from Prashant Dhange on 2023-10-03 23:20:02 UTC ---

@prsurve I have cloned this BZ to RADOS BZ#2242025 to investigate the higher memory allocations for the BlueStore cache, which are causing the OSD daemons to get killed by the OOM killer. Kindly submit the data collected using the script mentioned in comment #19 to BZ#2242025.

Let's keep this BZ to address the issue of stopping the OSD pod from restarting when the OSD daemon is flapping frequently.

--- Additional comment from Pratik Surve on 2023-10-05 09:50:04 UTC ---

Hi Prashant,

I'm not hitting the issue on my current setup. I'm going to redeploy the cluster, see if I can reproduce the same issue, and then provide the script output mentioned in comment #19.

--- Additional comment from Elvir Kuric on 2023-10-13 07:18:35 UTC ---

This issue is visible with:

ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)

Flapping OSD logs are below; restarting the OSD helps it move to Running state and it joins the cluster.
--- + trap sigterm SIGTERM + wait + ceph-osd --foreground --id 52 --fsid 543d8344-4233-4669-b5ab-0d08e3e62a36 --setuser ceph --setgroup ceph '--crush-location=root=default host=d29-h11-000-r750' --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false debug2023-10-10T14:48:34.024+0000 7fd95817b5c0 0 set uid:gid to 167:167 (ceph:ceph) debug2023-10-10T14:48:34.024+0000 7fd95817b5c0 0 ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable), process ceph-osd, pid 1270884 debug2023-10-10T14:48:34.024+0000 7fd95817b5c0 0 pidfile_write: ignore empty --pid-file debug2023-10-10T14:48:34.024+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.024+0000 7fd95817b5c0 0 starting osd.52 osd_data /var/lib/ceph/osd/ceph-52 /var/lib/ceph/osd/ceph-52/journal debug2023-10-10T14:48:34.025+0000 7fd95817b5c0 -1 Falling back to public interface debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 0 load: jerasure load: lrc debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 0 osd.52:0.OSDShard using op scheduler mClockScheduler debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 0 osd.52:1.OSDShard using op scheduler mClockScheduler debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 0 osd.52:2.OSDShard using op scheduler mClockScheduler debug2023-10-10T14:48:34.030+0000 
7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 0 osd.52:3.OSDShard using op scheduler mClockScheduler debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 0 osd.52:4.OSDShard using op scheduler mClockScheduler debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 0 osd.52:5.OSDShard using op scheduler mClockScheduler debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 0 osd.52:6.OSDShard using op scheduler mClockScheduler debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 0 osd.52:7.OSDShard using op scheduler mClockScheduler debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1 osd.52 0 OSD::pre_init: object store '/var/lib/ceph/osd/ceph-52' is currently in use. (Is ceph-osd already running?) debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1 ** ERROR: osd pre_init failed: (16) Device or resource busy + RESTART_INTERVAL=24 + rc=0 + '[' 0 -eq 0 ']' + touch /tmp/osd-sleep + echo 'OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed' OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. 
Restart the pod manually once the flapping issue is fixed
+ wait
+ sleep 24h

--- Additional comment from Blaine Gardner on 2023-10-13 15:30:07 UTC ---

A fix for this was merged yesterday here: https://github.com/red-hat-storage/rook/pull/528

The logs just shared above look to still be using the unpatched version of ODF. Please try again with the latest ODF build that contains the commit dd3395050531d91ac32ee71fe2c488e20decc9c1. The head of the build is likely at ed872f1565ead1f1eb260c297a17d3db38f82e90 (the merge commit).
To enable this new Rook setting from the OCS operator, the latest Rook API must be updated at the same time, which pulls in some more changes from upstream than what is strictly required for this setting. See the linked PR. The additional changes are not considered risky, but neither can they be separated from this change.

The Rook PR had been ack'ed a while back by QE, but coming this late in the cycle we just need to confirm that QE has the bandwidth available to test this change for flapping OSDs within the 4.14.0 timeline. The change to support flapping OSDs is not a trivial change. If we go ahead and merge this, we also need to consider the possibility of reverting it if we find any regressions in OSDs during this testing.

Elad, please consider whether to ack this.
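For context on the mechanism discussed throughout this bug, below is a minimal, illustrative bash sketch of how an OSD wrapper could tell a clean exit (treated as possible flapping) apart from a signal-driven kill such as the OOM killer. This is an assumption-laden sketch, not the actual Rook script or the merged fix:

#!/bin/bash
# Illustrative sketch only; not the Rook implementation.
ceph-osd --foreground "$@" &
osd_pid=$!
# Forward SIGTERM to the OSD so a normal pod shutdown still works.
trap 'kill -TERM "$osd_pid" 2>/dev/null' SIGTERM

wait "$osd_pid"
rc=$?

if [ "$rc" -eq 0 ]; then
    # A clean exit is what a flapping OSD produces after being marked down
    # repeatedly, so hold the pod instead of letting it crash-loop.
    echo "OSD exited with code 0, possibly due to flapping; sleeping 24h"
    sleep 24h
elif [ "$rc" -ge 128 ]; then
    # 128 + signal number, e.g. 137 = SIGKILL, which is typical of the OOM killer.
    echo "OSD terminated by signal $((rc - 128)); exiting so the pod restarts"
    exit "$rc"
else
    exit "$rc"
fi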
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days