Description of problem (please be as detailed as possible and provide log snippets):

[RDR] Ceph reports OSDs down but the OSD pods are in Running state.

Version of all relevant components (if applicable):
OCP version:- 4.14.0-0.nightly-2023-09-20-033502
ODF version:- 4.14.0-136
CEPH version:- ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)
ACM version:- 2.9.0-109
SUBMARINER version:- devel
VOLSYNC version:- volsync-product.v0.7.4
VOLSYNC method:- destinationCopyMethod: LocalDirect

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an RDR cluster
2. Run workloads for a few days
3. Observe the Ceph status and the OSD pod status

Actual results:

ceph status:

  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive
            Degraded data redundancy: 154474/231711 objects degraded (66.667%), 131 pgs degraded, 169 pgs undersized

  services:
    mon:        3 daemons, quorum d,e,f (age 116m)
    mgr:        a(active, since 116m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 1 up (since 110m), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 77.24k objects, 294 GiB
    usage:   884 GiB used, 5.1 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             154474/231711 objects degraded (66.667%)
             131 undersized+degraded+peered
             38  undersized+peered

oc get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-5f7f4cb466-9rcwq   2/2     Running   2          9h
rook-ceph-osd-1-6cbcd67bfb-dpf4n   2/2     Running   0          121m
rook-ceph-osd-2-5f5cfb4f6d-xnhx2   2/2     Running   2          26h

Snippet from the OSD pod log:

debug2023-09-28T11:26:47.928+0000 7f5df6527640 1 osd.2 pg_epoch: 20604 pg[12.14( v 20573'397443 (20475'395441,20573'397443] local-lis/les=20489/20490 n=1375 ec=31/22 lis/c=20596/18223 les/c/f=20597/18224/0 sis=20604) [0,2,1] r=1 lpr=20604 pi=[18223,20604)/2 crt=20573'397443 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640 1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 luod=0'0 lua=20592'59504 crt=20603'59507 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df5d26640 1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 luod=0'0 crt=20564'82189 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640 1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 crt=20603'59507 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df5d26640 1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 crt=20564'82189 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640 1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 luod=0'0 crt=20573'454858 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640 1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 crt=20573'454858 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:50.971+0000 7f5df7529640 1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 luod=0'0 crt=20594'95848 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [1,0,2] -> [1,0,2], acting [1,0] -> [1,0,2], acting_primary 1 -> 1, up_primary 1 -> 1, role -1 -> 2, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:50.971+0000 7f5df7529640 1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 crt=20594'95848 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:27:07.822+0000 7f5df5d26640 0 log_channel(cluster) log [DBG] : 11.d scrub starts
debug2023-09-28T11:27:08.026+0000 7f5df5d26640 0 log_channel(cluster) log [DBG] : 11.d scrub ok
debug2023-09-28T11:27:08.871+0000 7f5dfbd32640 0 log_channel(cluster) log [DBG] : 12.1 scrub starts
debug2023-09-28T11:27:20.526+0000 7f5df7d2a640 0 log_channel(cluster) log [DBG] : 12.1 scrub ok
debug2023-09-28T11:27:37.647+0000 7f5dfad30640 0 log_channel(cluster) log [DBG] : 12.13 scrub starts
debug2023-09-28T11:27:47.852+0000 7f5dfad30640 0 log_channel(cluster) log [DBG] : 12.13 scrub ok
debug2023-09-28T11:27:51.828+0000 7f5df952d640 0 log_channel(cluster) log [DBG] : 11.e scrub starts
debug2023-09-28T11:27:52.066+0000 7f5df952d640 0 log_channel(cluster) log [DBG] : 11.e scrub ok
debug2023-09-28T11:28:51.854+0000 7f5df852b640 0 log_channel(cluster) log [DBG] : 5.0 scrub starts
debug2023-09-28T11:28:52.296+0000 7f5df852b640 0 log_channel(cluster) log [DBG] : 5.0 scrub ok
debug2023-09-28T11:28:52.891+0000 7f5df9d2e640 0 log_channel(cluster) log [DBG] : 11.1d scrub starts
debug2023-09-28T11:28:53.130+0000 7f5df5d26640 0 log_channel(cluster) log [DBG] : 11.1d scrub ok
/bin/bash: line 9: 412 Killed    ceph-osd --foreground --id 2 --fsid 928c8604-6082-4802-893c-3222d5f150ae --setuser ceph --setgroup ceph --crush-location="root=default host=ocs-deviceset-thin-csi-odf-1-data-0457vv rack=rack2" --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.0.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41
+ RESTART_INTERVAL=24
+ rc=0
+ '[' 0 -eq 0 ']'
+ touch /tmp/osd-sleep
OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed
+ echo 'OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed'
+ wait
+ sleep 24h

Expected results:
The OSD pod should not remain in Running state if Ceph reports the OSD as down.

Additional info:
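For reference, the state above was gathered with the ODF toolbox and oc, and the manual restart mentioned in the wrapper's log message can be performed along the lines below. The commands are illustrative only: they assume the default openshift-storage namespace and a rook-ceph-tools toolbox deployment, and reuse the pod/deployment names from the outputs above.

# Observe the mismatch between Ceph's view and the pod status:
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
oc -n openshift-storage get pods -l app=rook-ceph-osd

# Once the underlying issue is fixed, restart the sleeping OSD pod manually,
# e.g. for osd.2 from the output above:
oc -n openshift-storage delete pod rook-ceph-osd-2-5f5cfb4f6d-xnhx2
# or roll the whole deployment:
oc -n openshift-storage rollout restart deployment/rook-ceph-osd-2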
We added a new change to how OSDs are created to support https://bugzilla.redhat.com/show_bug.cgi?id=2223959

If the OSD daemon exits with exit code 0, we assume that the exit was due to OSD flapping. It looks like in this case the bash script received exit code 0 when the OSD went down, but the exit was caused by something other than OSD flapping. We need to investigate this further.
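For context, the OSD startup wrapper behaves roughly like the sketch below, reconstructed from the shell trace in the pod log above. It is illustrative only, not the exact script shipped in the image; OSD_ID and "$@" stand in for whatever the pod spec actually passes to ceph-osd.

#!/bin/bash
# Illustrative reconstruction of the OSD wrapper behaviour seen in the log trace.
RESTART_INTERVAL=24

# Run the OSD in the foreground; OSD_ID and "$@" are placeholders for the id and
# flags supplied by the pod spec.
ceph-osd --foreground --id "${OSD_ID}" "$@"
rc=$?

if [ "${rc}" -eq 0 ]; then
    # Exit code 0 is taken to mean the OSD was marked down for flapping, so the
    # pod is kept alive (and reported Running) instead of letting kubelet restart it.
    touch /tmp/osd-sleep
    echo "OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for ${RESTART_INTERVAL} hours. Restart the pod manually once the flapping issue is fixed"
    sleep "${RESTART_INTERVAL}h" &
    wait
fi

As the trace shows, this branch was taken even though the ceph-osd process was killed, which is why the pod stayed Running while Ceph reported the OSD as down.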
Adding needinfo for comment #7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832