Bug 2244517 - [RDR] Ceph reports OSDs down but the pods are in Running state
Summary: [RDR] Ceph reports OSDs down but the pods are in Running state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Santosh Pillai
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On: 2241185
Blocks: 2242025
 
Reported: 2023-10-16 17:41 UTC by Santosh Pillai
Modified: 2024-03-08 04:26 UTC (History)
15 users

Fixed In Version: 4.14.0-154
Doc Type: No Doc Update
Doc Text:
Clone Of: 2241185
Environment:
Last Closed: 2023-11-08 18:55:27 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 2214 0 None open bug 2244517 : [DNM] set FlappingRestartIntervalHours to 24 hrs 2023-10-16 17:43:36 UTC
Github red-hat-storage ocs-operator pull 2219 0 None open Bug 2244517: set FlappingRestartIntervalHours to 24 hrs 2023-10-17 13:10:59 UTC
Red Hat Product Errata RHSA-2023:6832 0 None None None 2023-11-08 18:56:29 UTC

Description Santosh Pillai 2023-10-16 17:41:57 UTC
+++ This bug was initially created as a clone of Bug #2241185 +++

Description of problem (please be as detailed as possible and provide log
snippets):
[RDR] Ceph reports OSDs as down but the OSD pods are in Running state


Version of all relevant components (if applicable):
OCP version:- 4.14.0-0.nightly-2023-09-20-033502
ODF version:- 4.14.0-136
CEPH version:- ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)
ACM version:- 2.9.0-109
SUBMARINER version:- devel
VOLSYNC version:- volsync-product.v0.7.4
VOLSYNC method:- destinationCopyMethod: LocalDirect

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster
2. Run workloads for a few days
3. Observe the Ceph status and OSD pod status (one way to do this is sketched below)
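
For reference, one way to run step 3 from the command line (sketch; namespace and pod labels taken from the commands used later in this bug, not from the original report):

TOOLS_POD=$(oc -n openshift-storage get pods -l app=rook-ceph-tools -o name)
oc -n openshift-storage exec $TOOLS_POD -- ceph status
oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide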


Actual results:

cephstatus       
  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive
            Degraded data redundancy: 154474/231711 objects degraded (66.667%), 131 pgs degraded, 169 pgs undersized
 
  services:
    mon:        3 daemons, quorum d,e,f (age 116m)
    mgr:        a(active, since 116m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 1 up (since 110m), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 77.24k objects, 294 GiB
    usage:   884 GiB used, 5.1 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             154474/231711 objects degraded (66.667%)
             131 undersized+degraded+peered
             38  undersized+peered

oc get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-5f7f4cb466-9rcwq   2/2     Running   2          9h
rook-ceph-osd-1-6cbcd67bfb-dpf4n   2/2     Running   0          121m
rook-ceph-osd-2-5f5cfb4f6d-xnhx2   2/2     Running   2          26h


snippet from osd pod log

debug2023-09-28T11:26:47.928+0000 7f5df6527640  1 osd.2 pg_epoch: 20604 pg[12.14( v 20573'397443 (20475'395441,20573'397443] local-lis/les=20489/20490 n=1375 ec=31/22 lis/c=20596/18223 les/c/f=20597/18224/0 sis=20604) [0,2,1] r=1 lpr=20604 pi=[18223,20604)/2 crt=20573'397443 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640  1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 luod=0'0 lua=20592'59504 crt=20603'59507 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df5d26640  1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 luod=0'0 crt=20564'82189 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640  1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 crt=20603'59507 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df5d26640  1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 crt=20564'82189 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640  1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 luod=0'0 crt=20573'454858 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640  1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 crt=20573'454858 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:50.971+0000 7f5df7529640  1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 luod=0'0 crt=20594'95848 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [1,0,2] -> [1,0,2], acting [1,0] -> [1,0,2], acting_primary 1 -> 1, up_primary 1 -> 1, role -1 -> 2, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:50.971+0000 7f5df7529640  1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 crt=20594'95848 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:27:07.822+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.d scrub starts
debug2023-09-28T11:27:08.026+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.d scrub ok
debug2023-09-28T11:27:08.871+0000 7f5dfbd32640  0 log_channel(cluster) log [DBG] : 12.1 scrub starts
debug2023-09-28T11:27:20.526+0000 7f5df7d2a640  0 log_channel(cluster) log [DBG] : 12.1 scrub ok
debug2023-09-28T11:27:37.647+0000 7f5dfad30640  0 log_channel(cluster) log [DBG] : 12.13 scrub starts
debug2023-09-28T11:27:47.852+0000 7f5dfad30640  0 log_channel(cluster) log [DBG] : 12.13 scrub ok
debug2023-09-28T11:27:51.828+0000 7f5df952d640  0 log_channel(cluster) log [DBG] : 11.e scrub starts
debug2023-09-28T11:27:52.066+0000 7f5df952d640  0 log_channel(cluster) log [DBG] : 11.e scrub ok
debug2023-09-28T11:28:51.854+0000 7f5df852b640  0 log_channel(cluster) log [DBG] : 5.0 scrub starts
debug2023-09-28T11:28:52.296+0000 7f5df852b640  0 log_channel(cluster) log [DBG] : 5.0 scrub ok
debug2023-09-28T11:28:52.891+0000 7f5df9d2e640  0 log_channel(cluster) log [DBG] : 11.1d scrub starts
debug2023-09-28T11:28:53.130+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.1d scrub ok
/bin/bash: line 9:   412 Killed                  ceph-osd --foreground --id 2 --fsid 928c8604-6082-4802-893c-3222d5f150ae --setuser ceph --setgroup ceph --crush-location="root=default host=ocs-deviceset-thin-csi-odf-1-data-0457vv rack=rack2" --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.0.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41
+ RESTART_INTERVAL=24
+ rc=0
+ '[' 0 -eq 0 ']'
+ touch /tmp/osd-sleep
OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed
+ echo 'OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed'
+ wait
+ sleep 24h
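
For context, a hedged reconstruction of the wrapper logic implied by the `set -x` trace above (variable names and exact flow are assumptions; the real Rook-generated script differs in its details):

#!/bin/bash
RESTART_INTERVAL=24

# Run the OSD in the background and wait on it (a later log in this bug shows
# the real script also does `trap sigterm SIGTERM`).
ceph-osd --foreground --id "${OSD_ID}" &   # OSD_ID and the long flag list come from the pod spec
wait $!
rc=$?

if [ "$rc" -eq 0 ]; then
    # Exit code 0 is assumed to mean "OSD marked down for flapping": keep the
    # pod Running instead of letting kubelet restart the daemon immediately.
    touch /tmp/osd-sleep
    echo 'OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed'
    sleep "${RESTART_INTERVAL}h" &
    wait
fi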


Expected results:
The OSD pod should not be in Running state if Ceph reports the OSD as down.

Additional info:

--- Additional comment from RHEL Program Management on 2023-09-28 13:27:46 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.14.0' to '?', and so is being proposed to be fixed at the ODF 4.14.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-09-28 13:27:46 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Santosh Pillai on 2023-09-28 15:44:21 UTC ---

We added a new change on how OSDs are created to support https://bugzilla.redhat.com/show_bug.cgi?id=2223959

If the OSD daemon exits with a 0 exit code, we are assuming that it's due to OSD flapping. Looks like in this case, the bash script received the exit code 0 when the OSD went down, but the exit was caused by something other than OSD flapping. 

Need to investigate more.

--- Additional comment from Radoslaw Zarzynski on 2023-09-28 16:10:18 UTC ---

> /bin/bash: line 9:   412 Killed                  ceph-osd --foreground

It looks like the ceph-osd process has been killed by a signal,
but the orchestrator decided not to respawn the container
and is, IIUC, running `sleep 24h` instead of `ceph-osd`.

Under such conditions it's expected to see a divergence
between `ceph -s` and containers' statuses.

The interesting thing might be what generated the signal
that terminated the OSD, and why. However, there is no OSD
log, no dmesg, nor anything else that could help with
identifying the source of this signal.

Without knowing the signal's origin, assigning the case
to RADOS is premature.

Regards,
Radek

--- Additional comment from Prashant Dhange on 2023-09-28 16:12:32 UTC ---

The OSD was killed by an external entity:
/bin/bash: line 9:   412 Killed                  ceph-osd --foreground --id 2 --fsid 928c8604-6082-4802-893c-3222d5f150ae --setuser ceph --setgroup ceph --crush-location="root=default host=ocs-deviceset-thin-csi-odf-1-data-0457vv rack=rack2" --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.0.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41

Based on the dmesg logs provided, the OSD was killed by the OOM killer:
[Thu Sep 28 11:31:40 2023] msgr-worker-1 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=3, oom_score_adj=-997
[Thu Sep 28 11:31:40 2023] CPU: 11 PID: 16132 Comm: msgr-worker-1 Not tainted 5.14.0-284.32.1.el9_2.x86_64 #1
[Thu Sep 28 11:31:40 2023] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[Thu Sep 28 11:31:40 2023] Call Trace:
[Thu Sep 28 11:31:40 2023]  <TASK>
[Thu Sep 28 11:31:40 2023]  dump_stack_lvl+0x34/0x48
[Thu Sep 28 11:31:40 2023]  dump_header+0x4a/0x201
[Thu Sep 28 11:31:40 2023]  oom_kill_process.cold+0xb/0x10
[Thu Sep 28 11:31:40 2023]  out_of_memory+0xed/0x2e0
[Thu Sep 28 11:31:40 2023]  mem_cgroup_out_of_memory+0x13a/0x150
[Thu Sep 28 11:31:40 2023]  try_charge_memcg+0x6df/0x7a0
[Thu Sep 28 11:31:40 2023]  ? __alloc_skb+0x8e/0x1d0
[Thu Sep 28 11:31:40 2023]  mem_cgroup_charge_skmem+0x3f/0xf0
[Thu Sep 28 11:31:40 2023]  __sk_mem_raise_allocated+0x247/0x4a0
[Thu Sep 28 11:31:40 2023]  __sk_mem_schedule+0x2b/0x40
[Thu Sep 28 11:31:40 2023]  tcp_sendmsg_locked+0x6f9/0xac0
[Thu Sep 28 11:31:40 2023]  tcp_sendmsg+0x28/0x40
[Thu Sep 28 11:31:40 2023]  sock_sendmsg+0x5b/0x70
[Thu Sep 28 11:31:40 2023]  ____sys_sendmsg+0x230/0x270
[Thu Sep 28 11:31:40 2023]  ? import_iovec+0x17/0x20
[Thu Sep 28 11:31:40 2023]  ? sendmsg_copy_msghdr+0x7d/0xa0
[Thu Sep 28 11:31:40 2023]  ___sys_sendmsg+0x81/0xc0
[Thu Sep 28 11:31:40 2023]  ? inet_recvmsg+0x5c/0x140
[Thu Sep 28 11:31:40 2023]  ? sock_read_iter+0x92/0x100
[Thu Sep 28 11:31:40 2023]  ? new_sync_read+0x166/0x180
[Thu Sep 28 11:31:40 2023]  ? __fget_light+0x9f/0x130
[Thu Sep 28 11:31:40 2023]  __sys_sendmsg+0x59/0xa0
[Thu Sep 28 11:31:40 2023]  do_syscall_64+0x59/0x90
[Thu Sep 28 11:31:40 2023]  ? syscall_exit_to_user_mode+0x12/0x30
[Thu Sep 28 11:31:40 2023]  ? do_syscall_64+0x69/0x90
[Thu Sep 28 11:31:40 2023]  ? syscall_exit_to_user_mode+0x12/0x30
[Thu Sep 28 11:31:40 2023]  ? do_syscall_64+0x69/0x90
[Thu Sep 28 11:31:40 2023]  ? syscall_exit_work+0x11a/0x150
[Thu Sep 28 11:31:40 2023]  ? syscall_exit_to_user_mode+0x12/0x30
[Thu Sep 28 11:31:40 2023]  ? do_syscall_64+0x69/0x90
[Thu Sep 28 11:31:40 2023]  ? irqtime_account_irq+0x3c/0xb0
[Thu Sep 28 11:31:40 2023]  ? __irq_exit_rcu+0x46/0xe0
[Thu Sep 28 11:31:40 2023]  ? sysvec_apic_timer_interrupt+0x3c/0x90
[Thu Sep 28 11:31:40 2023]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[Thu Sep 28 11:31:40 2023] RIP: 0033:0x7f5e1a1e3a2d
[Thu Sep 28 11:31:40 2023] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 da c1 f4 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 2e c2 f4 ff 48
[Thu Sep 28 11:31:40 2023] RSP: 002b:00007f5e18750930 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[Thu Sep 28 11:31:40 2023] RAX: ffffffffffffffda RBX: 0000000003b50c29 RCX: 00007f5e1a1e3a2d
[Thu Sep 28 11:31:40 2023] RDX: 000000000000c000 RSI: 00007f5e18750990 RDI: 0000000000000066
[Thu Sep 28 11:31:40 2023] RBP: 000000000000c000 R08: 0000000000000000 R09: 00007f5e187549d0
[Thu Sep 28 11:31:40 2023] R10: 0000000000010000 R11: 0000000000000293 R12: 000000000000c3cc
[Thu Sep 28 11:31:40 2023] R13: 0000560992735340 R14: 0000000003b50c29 R15: 00007f5e18750990
[Thu Sep 28 11:31:40 2023]  </TASK>
[Thu Sep 28 11:31:40 2023] memory: usage 5242880kB, limit 5242880kB, failcnt 75036
[Thu Sep 28 11:31:40 2023] swap: usage 0kB, limit 0kB, failcnt 0
[Thu Sep 28 11:31:40 2023] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod872d48c7_1d2b_4781_a9ce_85819219ee47.slice/crio-00973b15ebc4f4e724c0883a636f08d13b34f31760791e432d17d38f6b774d9b.scope:
[Thu Sep 28 11:31:40 2023] anon 5347696640
                           file 163840
                           kernel 16015360
                           kernel_stack 1261568
                           pagetables 11128832
                           percpu 0
                           sock 4829184
                           vmalloc 0
                           shmem 0
                           zswap 0
                           zswapped 0
                           file_mapped 0
                           file_dirty 0
                           file_writeback 0
                           swapcached 0
                           anon_thp 2097152
                           file_thp 0
                           shmem_thp 0
                           inactive_anon 5347545088
                           active_anon 24576
                           inactive_file 147456
                           active_file 0
                           unevictable 0
                           slab_reclaimable 2407680
                           slab_unreclaimable 1151208
                           slab 3558888
                           workingset_refault_anon 0
                           workingset_refault_file 85
                           workingset_activate_anon 0
                           workingset_activate_file 31
                           workingset_restore_anon 0
                           workingset_restore_file 0
                           workingset_nodereclaim 0
                           pgscan 166907
                           pgsteal 161735
                           pgscan_kswapd 0
                           pgscan_direct 166907
                           pgsteal_kswapd 0
                           pgsteal_direct 161735
                           pgfault 3111055
                           pgmajfault 0
                           pgrefill 46159
                           pgactivate 43386
                           pgdeactivate 46159
                           pglazyfree 0
                           pglazyfreed 0
                           zswpin 0
                           zswpout 0
                           thp_fault_alloc 15
                           thp_collapse_alloc 0
[Thu Sep 28 11:31:40 2023] Tasks state (memory values in pages):
[Thu Sep 28 11:31:40 2023] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Thu Sep 28 11:31:40 2023] [  16080]     0 16080     1097      810    40960        0          -997 bash
[Thu Sep 28 11:31:40 2023] [  16123]   167 16123  1522377  1312915 11108352        0          -997 ceph-osd
[Thu Sep 28 11:31:40 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-00973b15ebc4f4e724c0883a636f08d13b34f31760791e432d17d38f6b774d9b.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod872d48c7_1d2b_4781_a9ce_85819219ee47.slice/crio-00973b15ebc4f4e724c0883a636f08d13b34f31760791e432d17d38f6b774d9b.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod872d48c7_1d2b_4781_a9ce_85819219ee47.slice/crio-00973b15ebc4f4e724c0883a636f08d13b34f31760791e432d17d38f6b774d9b.scope,task=ceph-osd,pid=16123,uid=167
[Thu Sep 28 11:31:40 2023] Memory cgroup out of memory: Killed process 16123 (ceph-osd) total-vm:6089508kB, anon-rss:5217632kB, file-rss:34028kB, shmem-rss:0kB, UID:167 pgtables:10848kB oom_score_adj:-997
[Thu Sep 28 11:31:40 2023] libceph: osd2 (2)242.0.255.251:6800 socket closed (con state OPEN)

Though the OSD daemon was consuming less than 3 GB of memory, have we deployed the ODF nodes with enough system resources? How much RAM is on the ODF nodes?

Kindly provide access to this cluster to investigate this issue further.
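
A quick way to answer the resource question above from the CLI (sketch, not part of the original comment):

oc get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
oc describe node <worker-node> | grep -A 8 'Allocated resources'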

--- Additional comment from Prashant Dhange on 2023-09-28 16:18:13 UTC ---

(In reply to Prashant Dhange from comment #5)
> The OSD was killed by external entity :
> [...]
> [Thu Sep 28 11:31:40 2023] Tasks state (memory values in pages):
> [Thu Sep 28 11:31:40 2023] [  pid  ]   uid  tgid total_vm      rss
> pgtables_bytes swapents oom_score_adj name
> [Thu Sep 28 11:31:40 2023] [  16080]     0 16080     1097      810    40960 
> 0          -997 bash
> [Thu Sep 28 11:31:40 2023] [  16123]   167 16123  1522377  1312915 11108352 
> 0          -997 ceph-osd
My bad, these values are in pages, so the OSD was consuming around 5.8 GB. That is more than the default osd_memory_target, though an OSD can consume more than the defined osd_memory_target under certain circumstances. Need to investigate it further.
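
For reference, a sketch of how the configured osd_memory_target can be checked from the toolbox (tools-pod lookup as in the collection scripts further down; not part of the original comment):

TOOLS_POD=$(oc -n openshift-storage get pods -l app=rook-ceph-tools -o name)
oc -n openshift-storage exec $TOOLS_POD -- ceph config get osd osd_memory_target
oc -n openshift-storage exec $TOOLS_POD -- ceph tell osd.2 config get osd_memory_target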

--- Additional comment from Santosh Pillai on 2023-09-28 16:28:58 UTC ---

(In reply to Prashant Dhange from comment #5)
> The OSD was killed by external entity :
> [...]
> 
> Based on the dmesg logs provided, the OSD was killed by the OOM killer:

Is it expected for the ceph OSD to stop with exit code 0 when killed by the OOM killer?

(We recently made a change in Rook to make the OSD pod sleep for 24h when the OSD daemon exits with a 0 status code. This is to support the OSD flapping scenario, where the OSD daemon exits with code 0 when it is marked down 6 times within 600 seconds.)

--- Additional comment from Santosh Pillai on 2023-09-28 16:30:18 UTC ---

Adding needinfo for comment #7

--- Additional comment from Prashant Dhange on 2023-09-28 19:52:00 UTC ---

(In reply to Santosh Pillai from comment #7)
> (In reply to Prashant Dhange from comment #5)
> > The OSD was killed by external entity :
> > [...]
> 
> Is it expected for the ceph OSD to stop with exit code 0 when killed by the OOM
> killer?
> 
> (We recently made a change in rook to make the OSD pod to sleep for 24h when
> OSD daemon exits with 0 status code. This is to support the OSD flapping
> scenario where OSD daemon exits with 0 exit code when its marked down 6
> times within 600 seconds)
No. The ceph-osd daemon should exit with code 137 when killed by the OOM killer. Is the ceph-osd daemon running through a shell script or config.json? `oc get events` and `oc describe pod` should give more details on the ceph-osd daemon exit code if it is being exec'ed by runc directly. 

We do see an event similar to the one below if a Ceph daemon is killed by the OOM killer:
    State:          Running
      Started:      Thu, 21 Sep 2023 15:57:43 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 21 Sep 2023 15:51:24 +0200
      Finished:     Thu, 21 Sep 2023 15:54:58 +0200
    Ready:          True
    Restart Count:  6
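
In addition to `oc describe pod`, a sketch of a one-liner that pulls the last terminated state of the OSD containers (namespace and label as used elsewhere in this bug):

oc -n openshift-storage get pods -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\t"}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}{end}'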

--- Additional comment from Travis Nielsen on 2023-09-28 21:06:15 UTC ---

Prashant, is this example of the OOMKilled event from this cluster, or from a previous cluster? With the recent change for the OSD to run from a script, the OSD may not be responding as expected to the OOMKill event. 

On a related note, I was experimenting with the OSD pod for a related issue today and saw that if I exec into the OSD pod and attempt to kill the OSD process (to force a restart), it ignores most kill signals except SIGTERM. Seems like we need to capture more signals.
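
A minimal sketch of the kind of extra signal handling being suggested (handler name and signal list are assumptions, not the actual Rook script):

# Trap more than just SIGTERM and forward the signal to the ceph-osd child,
# so a kill against the wrapper or the daemon behaves as expected.
forward_signal() {
    echo "signal received, forwarding SIGTERM to ceph-osd (pid ${osd_pid})"
    kill -TERM "${osd_pid}"
}
trap forward_signal SIGTERM SIGINT SIGQUIT SIGHUP

ceph-osd --foreground --id "${OSD_ID}" &   # OSD_ID is a placeholder
osd_pid=$!
wait "${osd_pid}"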

--- Additional comment from Prashant Dhange on 2023-09-28 23:23:14 UTC ---

(In reply to Travis Nielsen from comment #10)
> Prashant Is this example of the OOMKilled event from this cluster? Or from a
> previous cluster? 
This is from one of the previous occurrences we observed in standalone as well as ODF environments. We should see similar behavior whether the daemon is running inside a pod or on a RHEL/CoreOS node when it gets killed by the OOM killer. I tried the perl command below to demonstrate this behavior.

# perl -wE 'my @xs; for (1..2**20) { push @xs, q{a} x 2**20 }; say scalar @xs;'
Killed

# echo $?
137

# dmesg
...
[37820430.866938] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-19948.scope,task=perl,pid=902192,uid=1000
[37820430.866951] Out of memory: Killed process 902192 (perl) total-vm:63155268kB, anon-rss:62688288kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:123632kB oom_score_adj:0
[37820433.527503] oom_reaper: reaped process 902192 (perl), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

> With the change in OSD recently to now run from a script,
> the OSD may not be responding as expected to the OOMKill event. 
Okay. I may have to reproduce this with an OSD to confirm, but AFAIR ceph-osd returns 137 to its parent process in the event of an OOM kill.

> 
> On a related note, I was experimenting with the OSD pod for a related issue
> today and saw that if I exec to the OSD pod and attempt to kill the OSD
> process (to force a restart), it ignores most kill codes except SIGTERM.
> Seems like we need to capture more signals.
Is that by sending a kill signal to the OSD pid ("kill -KILL <osd-pid>")?
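
As a small aside, the 137 behaviour is also easy to demonstrate with plain bash, since `wait` reports 128+signal for a child killed by SIGKILL (sketch, not from the original comment):

sleep 300 &
child=$!
kill -KILL "$child"
wait "$child"
echo "exit status: $?"   # prints 137 (128 + 9)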

--- Additional comment from Pratik Surve on 2023-09-29 14:24:30 UTC ---

> Though OSD daemon was consuming less than 3GB of memory, Have we deployed ODF nodes with enough system resources? How much RAM is on ODF nodes ?

Hi Prashant,
Yes, the nodes have enough system resources. Each node was created with 16 CPUs, 64 GB RAM, and a 120 GB root disk.

--- Additional comment from Prashant Dhange on 2023-10-02 18:33:46 UTC ---

(In reply to Pratik Surve from comment #12)
> > Though OSD daemon was consuming less than 3GB of memory, Have we deployed ODF nodes with enough system resources? How much RAM is on ODF nodes ?
> 
> Hi Prashant,
> Yes, the node has enough system resources. node was created with 16cpu 64 GB
> RAM and 120 GB root disk.

I do not suspect any memory leak issue with the OSD daemons, but if we are seeing OSDs getting killed frequently by the OOM killer, then kindly start collecting the OSD perf dump output every minute until we hit the OOM killer event for one of the OSD daemons.

Run this script on the bastion node, or wherever oc is installed (make sure the oc login session lasts for an extended period of time):
--------------------------------------------------------------------------------------------------------------------------------------------
#!/bin/bash
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
LOG_DIR="$PWD/osd_perf_dump_data.$(date +%F_%H-%M-%S)"
SLEEP_TIME=60

mkdir $LOG_DIR
cd $LOG_DIR

while true
do
    for osd in $(oc -n openshift-storage exec $TOOLS_POD -- ceph osd ls)
    do
        echo "Collecting perf dump for osd.$osd"
        oc -n openshift-storage exec $TOOLS_POD -- ceph tell osd.$osd perf dump > $LOG_DIR/osd.$osd.perf_dump.$(date +%F_%H-%M-%S).txt
    done
    echo "Sleeping for $SLEEP_TIME seconds"
    sleep $SLEEP_TIME
done
--------------------------------------------------------------------------------------------------------------------------------------------
NOTE: Change the LOG_DIR path accordingly, as the above script will create the log directory in the current working directory.

Let me know if you have any further questions.

--- Additional comment from Travis Nielsen on 2023-10-02 19:22:11 UTC ---

Adding dev ack, we need to fix this for 4.14, or consider reverting the sleep for flapping OSDs. 

Note we've marked this as a blocker for upstream releases: https://github.com/rook/rook/issues/12972

--- Additional comment from Blaine Gardner on 2023-10-02 21:35:13 UTC ---

I think I found an error in the bash script used to handle flapping OSDs. Will discuss with Travis and Santosh offline to issue a patch in time for the release.

--- Additional comment from RHEL Program Management on 2023-10-03 05:13:00 UTC ---

This BZ is being approved for the ODF 4.14.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.14.0'.

--- Additional comment from RHEL Program Management on 2023-10-03 05:13:00 UTC ---

Since this bug has been approved for the ODF 4.14.0 release, through release flag 'odf-4.14.0+', the Target Release is being set to 'ODF 4.14.0'.

--- Additional comment from Pratik Surve on 2023-10-03 12:46:56 UTC ---

Hi Prashant,


I have collected the script output here http://magna012.ceph.redhat.com/script_output.zip. 

When some OSDs went down, there were no logs collected:

sh dev_script.sh
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds
Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds


Collecting perf dump for osd.0
Error ENXIO: problem getting command descriptions from osd.0
command terminated with exit code 6
Collecting perf dump for osd.1
Collecting perf dump for osd.2
Error ENXIO: problem getting command descriptions from osd.2
command terminated with exit code 6
Sleeping for 60 seconds



Before running the script, I saw that the OSDs had recovered:

$cephstatus
  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_OK
 
  services:
    mon:        3 daemons, quorum d,e,f (age 5d)
    mgr:        a(active, since 5d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 3d), 3 in (since 11d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 328.79k objects, 1.2 TiB
    usage:   3.7 TiB used, 2.3 TiB / 6 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   889 MiB/s rd, 272 KiB/s wr, 245 op/s rd, 12 op/s wr


When I initiated failover and relocate operations for some workloads, I saw that the OSDs went down again:


$cephstatus
  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive
            Degraded data redundancy: 454044/681066 objects degraded (66.667%), 133 pgs degraded, 169 pgs undersized
 
  services:
    mon:        3 daemons, quorum d,e,f (age 5d)
    mgr:        a(active, since 5d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 1 up (since 3m), 3 in (since 11d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 227.02k objects, 877 GiB
    usage:   2.6 TiB used, 3.4 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             454044/681066 objects degraded (66.667%)
             133 undersized+degraded+peered
             36  undersized+peered

--- Additional comment from Prashant Dhange on 2023-10-03 23:06:37 UTC ---

(In reply to Pratik Surve from comment #18)
> Hi Prashant,
> 
> 
> I have collected the script output here
> http://magna012.ceph.redhat.com/script_output.zip. 
> 
> when some OSD went down there were no log collected 
Both osd.0 and osd.2 were killed by OOM Killer.

From the perf dump output, most of the memory was in use by the bluestore cache, the majority of it for caching kv and kv_onode:

osd.2.perf_dump.2023-10-03_17-53-07.txt 

    "bluestore-pricache": {
        "target_bytes": 4294967296,
        "mapped_bytes": 4192239616,
        "unmapped_bytes": 2023022592,
        "heap_bytes": 6215262208,
        "cache_bytes": 2845415830
    },
    ...
        "bluestore-pricache:kv": {
        "pri0_bytes": 67108864,
        "pri1_bytes": 0,
        "pri2_bytes": 7443216,
        "pri3_bytes": 3600880,
        "pri4_bytes": 15534992,
        "pri5_bytes": 13541328,
        "pri6_bytes": 372139648,
        "pri7_bytes": 0,
        "pri8_bytes": 0,
        "pri9_bytes": 0,
        "pri10_bytes": 0,
        "pri11_bytes": 93203112,
        "reserved_bytes": 81739384,
        "committed_bytes": 654311424
    },
    "bluestore-pricache:kv_onode": {
        "pri0_bytes": 67108864,
        "pri1_bytes": 0,
        "pri2_bytes": 0,
        "pri3_bytes": 0,
        "pri4_bytes": 0,
        "pri5_bytes": 0,
        "pri6_bytes": 0,
        "pri7_bytes": 0,
        "pri8_bytes": 0,
        "pri9_bytes": 0,
        "pri10_bytes": 1155541216,
        "pri11_bytes": 8284721,
        "reserved_bytes": 77688047,
        "committed_bytes": 1308622848
    },
    ...

I have recovered the cluster by restarting the osd.0 and osd.2 pods. The cluster is now in HEALTH_OK state.
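
For the record, a sketch of this kind of restart (pod labels assumed from Rook's OSD pod labelling; deleting the pods lets their deployments recreate them):

oc -n openshift-storage delete pod -l app=rook-ceph-osd,ceph-osd-id=0
oc -n openshift-storage delete pod -l app=rook-ceph-osd,ceph-osd-id=2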

> When i initiated failover and relocated operations for some workloads i saw
> that osd went down again

Can you collect the perf/heap/mempools dumps using the script below and perform a failover again to reproduce this issue?

-------------------------------------------------------------------------------------------------------------------------------------
#!/bin/bash
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
LOG_DIR="$PWD/osd_mem_dump_data.$(date +%F_%H-%M-%S)"
SLEEP_TIME=60

mkdir $LOG_DIR
mkdir -p "$LOG_DIR/perf_dump/"
mkdir -p "$LOG_DIR/heap_dump/"
mkdir -p "$LOG_DIR/dump_mempools/"

cd $LOG_DIR

# Start the heap profiler on every OSD once, before the collection loop.
for osd in $(oc -n openshift-storage exec $TOOLS_POD -- ceph osd ls)
do
    echo "Starting heap profiler for osd.$osd"
    oc -n openshift-storage exec $TOOLS_POD -- ceph tell osd.$osd heap start_profiler
done

while true
do
    for osd in $(oc -n openshift-storage exec $TOOLS_POD -- ceph osd ls)
    do
        echo "Collecting perf dump for osd.$osd"
        oc -n openshift-storage exec $TOOLS_POD -- ceph tell osd.$osd perf dump > $LOG_DIR/perf_dump/osd.$osd.perf_dump.$(date +%F_%H-%M-%S).txt
        echo "Collecting dump_mempools for osd.$osd"
        oc -n openshift-storage exec $TOOLS_POD -- ceph tell osd.$osd dump_mempools > $LOG_DIR/dump_mempools/osd.$osd.dump_mempools.$(date +%F_%H-%M-%S).txt
        echo "Collecting heap dump for osd.$osd"
        oc -n openshift-storage exec $TOOLS_POD -- ceph tell osd.$osd heap dump > $LOG_DIR/heap_dump/osd.$osd.heap_dump.$(date +%F_%H-%M-%S).txt
    done
    echo "Sleeping for $SLEEP_TIME seconds"
    sleep $SLEEP_TIME
done
-------------------------------------------------------------------------------------------------------------------------------------

--- Additional comment from Prashant Dhange on 2023-10-03 23:20:02 UTC ---

@prsurve I have cloned this BZ to RADOS BZ#2242025 to investigate the higher memory allocations for the bluestore cache, which are causing the OSD daemons to be killed by the OOM killer. Kindly submit the data collected using the script mentioned in comment #19 to BZ#2242025. Let's keep this BZ to address the issue of stopping the OSD pod from restarting when the OSD daemon is flapping frequently.

--- Additional comment from Pratik Surve on 2023-10-05 09:50:04 UTC ---

Hi Prashant,

I'm not hitting the issue on my current setup. I'm going to redeploy the cluster, see if I can reproduce the same issue, and then provide the output of the script added in comment #19.

--- Additional comment from Elvir Kuric on 2023-10-13 07:18:35 UTC ---

This issue is visible with: ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)

Flapping OSD logs are below; restarting the OSD helps it move to the running state, and it rejoins the cluster. 

--- 
+ trap sigterm SIGTERM
+ wait
+ ceph-osd --foreground --id 52 --fsid 543d8344-4233-4669-b5ab-0d08e3e62a36 --setuser ceph --setgroup ceph '--crush-location=root=default host=d29-h11-000-r750' --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false
debug2023-10-10T14:48:34.024+0000 7fd95817b5c0  0 set uid:gid to 167:167 (ceph:ceph)
debug2023-10-10T14:48:34.024+0000 7fd95817b5c0  0 ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable), process ceph-osd, pid 1270884
debug2023-10-10T14:48:34.024+0000 7fd95817b5c0  0 pidfile_write: ignore empty --pid-file
debug2023-10-10T14:48:34.024+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.024+0000 7fd95817b5c0  0 starting osd.52 osd_data /var/lib/ceph/osd/ceph-52 /var/lib/ceph/osd/ceph-52/journal
debug2023-10-10T14:48:34.025+0000 7fd95817b5c0 -1 Falling back to public interface
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  0 load: jerasure load: lrc 
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  0 osd.52:0.OSDShard using op scheduler mClockScheduler
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  0 osd.52:1.OSDShard using op scheduler mClockScheduler
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  0 osd.52:2.OSDShard using op scheduler mClockScheduler
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  0 osd.52:3.OSDShard using op scheduler mClockScheduler
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0  0 osd.52:4.OSDShard using op scheduler mClockScheduler
debug2023-10-10T14:48:34.030+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0  1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0  0 osd.52:5.OSDShard using op scheduler mClockScheduler
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0  1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0  0 osd.52:6.OSDShard using op scheduler mClockScheduler
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0  1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 19660800.00 bytes/second
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0  0 osd.52:7.OSDShard using op scheduler mClockScheduler
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1 bluestore-rdr(/var/lib/ceph/osd/ceph-52) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-52/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1 osd.52 0 OSD::pre_init: object store '/var/lib/ceph/osd/ceph-52' is currently in use. (Is ceph-osd already running?)
debug2023-10-10T14:48:34.031+0000 7fd95817b5c0 -1  ** ERROR: osd pre_init failed: (16) Device or resource busy
+ RESTART_INTERVAL=24
+ rc=0
+ '[' 0 -eq 0 ']'
+ touch /tmp/osd-sleep
+ echo 'OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed'
OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed
+ wait
+ sleep 24h

--- Additional comment from Blaine Gardner on 2023-10-13 15:30:07 UTC ---

A fix for this was merged yesterday here: https://github.com/red-hat-storage/rook/pull/528

The logs just shared above look to still be from the un-patched version of ODF. 

Please try again with the latest ODF build that contains the commit dd3395050531d91ac32ee71fe2c488e20decc9c1

The head of the build is likely at ed872f1565ead1f1eb260c297a17d3db38f82e90 (the merge commit)
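
A sketch of one way to confirm which build is actually deployed before retesting (resource names assumed):

oc -n openshift-storage get csv -o custom-columns=NAME:.metadata.name,VERSION:.spec.version
oc -n openshift-storage get deployment rook-ceph-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'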

Comment 4 Travis Nielsen 2023-10-16 17:52:23 UTC
To enable this new Rook setting from the OCS operator, the latest Rook API must be pulled in at the same time, which brings in more changes from upstream than are strictly required for this setting. See the linked PR. The additional changes are not considered risky, but neither can they be separated from this change.
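
For illustration, once the linked PRs land, the resulting setting on the CephCluster CR could be checked roughly like this (the CR name and field path are assumptions based on the PR titles and the upstream Rook API, not confirmed here):

oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
  -o jsonpath='{.spec.storage.flappingRestartIntervalHours}{"\n"}'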

The Rook PR had been acked a while back by QE, but coming this late in the cycle we just need to confirm that QE has the bandwidth available to test this change for flapping OSDs within the 4.14.0 timeline. The change to support flapping OSDs is not a trivial change. If we go ahead and merge this, we also need to consider the possibility of reverting it if we find any regressions in OSDs during this testing. 

Elad, please consider whether to ack this.

Comment 10 errata-xmlrpc 2023-11-08 18:55:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

Comment 11 Red Hat Bugzilla 2024-03-08 04:26:18 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

