Bug 2241185 - [RDR] Ceph reports OSD down but pods are in Running state
Summary: [RDR] Ceph reports OSD down but pods are in Running state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Santosh Pillai
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On:
Blocks: 2242025 2244517
 
Reported: 2023-09-28 13:27 UTC by Pratik Surve
Modified: 2023-11-08 18:56 UTC
CC: 13 users

Fixed In Version: 4.14.0-150
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2242025 2244517
Environment:
Last Closed: 2023-11-08 18:54:58 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage rook pull 528 0 None Merged Bug 2242433: osd: use bash script if restart interval is > 0 2023-10-16 05:59:43 UTC
Github rook rook issues 12972 0 None open OSD pod goes to sleep when osd daemon is OOMKilled 2023-10-02 19:22:11 UTC
Github rook rook pull 12992 0 None open osd: fix osd script to handle sigterm 2023-10-05 12:40:06 UTC
Red Hat Product Errata RHSA-2023:6832 0 None None None 2023-11-08 18:56:32 UTC

Description Pratik Surve 2023-09-28 13:27:38 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[RDR] Ceph reports OSD down but pods are in Running state


Version of all relevant components (if applicable):
OCP version:- 4.14.0-0.nightly-2023-09-20-033502
ODF version:- 4.14.0-136
CEPH version:- ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)
ACM version:- 2.9.0-109
SUBMARINER version:- devel
VOLSYNC version:- volsync-product.v0.7.4
VOLSYNC method:- destinationCopyMethod: LocalDirect

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster.
2. Run workloads for a few days.
3. Observe the Ceph status and the OSD pod status (see the commands below).
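
For reference, the checks in step 3 can be run as follows (assuming the default openshift-storage namespace and a deployed rook-ceph-tools toolbox; adjust the names to your environment):

# Ceph's view of the OSDs, from the toolbox pod
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd tree

# Kubernetes' view of the OSD pods
oc -n openshift-storage get pods -l app=rook-ceph-osd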


Actual results:

ceph status
  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive
            Degraded data redundancy: 154474/231711 objects degraded (66.667%), 131 pgs degraded, 169 pgs undersized
 
  services:
    mon:        3 daemons, quorum d,e,f (age 116m)
    mgr:        a(active, since 116m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 1 up (since 110m), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 77.24k objects, 294 GiB
    usage:   884 GiB used, 5.1 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             154474/231711 objects degraded (66.667%)
             131 undersized+degraded+peered
             38  undersized+peered

oc get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-5f7f4cb466-9rcwq   2/2     Running   2          9h
rook-ceph-osd-1-6cbcd67bfb-dpf4n   2/2     Running   0          121m
rook-ceph-osd-2-5f5cfb4f6d-xnhx2   2/2     Running   2          26h


Snippet from the OSD pod log:

debug2023-09-28T11:26:47.928+0000 7f5df6527640  1 osd.2 pg_epoch: 20604 pg[12.14( v 20573'397443 (20475'395441,20573'397443] local-lis/les=20489/20490 n=1375 ec=31/22 lis/c=20596/18223 les/c/f=20597/18224/0 sis=20604) [0,2,1] r=1 lpr=20604 pi=[18223,20604)/2 crt=20573'397443 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640  1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 luod=0'0 lua=20592'59504 crt=20603'59507 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df5d26640  1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 luod=0'0 crt=20564'82189 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640  1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 crt=20603'59507 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df5d26640  1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 crt=20564'82189 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640  1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 luod=0'0 crt=20573'454858 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640  1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 crt=20573'454858 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:50.971+0000 7f5df7529640  1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 luod=0'0 crt=20594'95848 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [1,0,2] -> [1,0,2], acting [1,0] -> [1,0,2], acting_primary 1 -> 1, up_primary 1 -> 1, role -1 -> 2, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:50.971+0000 7f5df7529640  1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 crt=20594'95848 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:27:07.822+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.d scrub starts
debug2023-09-28T11:27:08.026+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.d scrub ok
debug2023-09-28T11:27:08.871+0000 7f5dfbd32640  0 log_channel(cluster) log [DBG] : 12.1 scrub starts
debug2023-09-28T11:27:20.526+0000 7f5df7d2a640  0 log_channel(cluster) log [DBG] : 12.1 scrub ok
debug2023-09-28T11:27:37.647+0000 7f5dfad30640  0 log_channel(cluster) log [DBG] : 12.13 scrub starts
debug2023-09-28T11:27:47.852+0000 7f5dfad30640  0 log_channel(cluster) log [DBG] : 12.13 scrub ok
debug2023-09-28T11:27:51.828+0000 7f5df952d640  0 log_channel(cluster) log [DBG] : 11.e scrub starts
debug2023-09-28T11:27:52.066+0000 7f5df952d640  0 log_channel(cluster) log [DBG] : 11.e scrub ok
debug2023-09-28T11:28:51.854+0000 7f5df852b640  0 log_channel(cluster) log [DBG] : 5.0 scrub starts
debug2023-09-28T11:28:52.296+0000 7f5df852b640  0 log_channel(cluster) log [DBG] : 5.0 scrub ok
debug2023-09-28T11:28:52.891+0000 7f5df9d2e640  0 log_channel(cluster) log [DBG] : 11.1d scrub starts
debug2023-09-28T11:28:53.130+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.1d scrub ok
/bin/bash: line 9:   412 Killed                  ceph-osd --foreground --id 2 --fsid 928c8604-6082-4802-893c-3222d5f150ae --setuser ceph --setgroup ceph --crush-location="root=default host=ocs-deviceset-thin-csi-odf-1-data-0457vv rack=rack2" --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.0.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41
+ RESTART_INTERVAL=24
+ rc=0
+ '[' 0 -eq 0 ']'
+ touch /tmp/osd-sleep
OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed
+ echo 'OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed'
+ wait
+ sleep 24h
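
For context, the "+ ..." lines above are the trace of the wrapper script that Rook uses to launch the OSD daemon. A minimal sketch of the tail of that logic, reconstructed only from the trace (not the actual Rook script), looks roughly like this:

# Reconstruction of the behaviour visible in the trace above; not the real
# Rook script. Only RESTART_INTERVAL, the rc check, /tmp/osd-sleep and the
# 24h sleep are taken from the trace.
RESTART_INTERVAL=24

# ceph-osd is started earlier in the script; at this point it has already
# exited, and rc holds the exit code the script observed (0 in this trace,
# even though the daemon itself was Killed).
if [ "$rc" -eq 0 ]; then
    touch /tmp/osd-sleep
    echo "OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for $RESTART_INTERVAL hours. Restart the pod manually once the flapping issue is fixed"
    wait
    sleep "${RESTART_INTERVAL}h"
fi

Because the script only sleeps instead of exiting, the container stays up and the pod is reported as Running even though the OSD daemon is gone.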


Expected results:
The OSD pod should not be in Running state if Ceph reports the OSD as down.

Additional info:

Comment 3 Santosh Pillai 2023-09-28 15:44:21 UTC
We added a change to how OSDs are created in order to support https://bugzilla.redhat.com/show_bug.cgi?id=2223959

If the OSD daemon exits with exit code 0, we assume that it is due to OSD flapping. It looks like in this case the bash script received exit code 0 when the OSD went down, but the OSD going down was caused by something other than OSD flapping.

Need to investigate more.
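
For illustration, one way for the wrapper to tell these cases apart is the shell exit status of the daemon: a child killed by SIGKILL (e.g. OOMKilled) returns 128+9=137, so only a genuine exit code 0 should trigger the flapping sleep. A minimal sketch of that idea (illustrative only, not necessarily what the linked rook PRs implement; OSD_ID and RESTART_INTERVAL are placeholders):

# Illustrative sketch; see rook/rook PR 12992 for the actual fix.
ceph-osd --foreground --id "$OSD_ID" "$@"
rc=$?

if [ "$rc" -eq 0 ]; then
    # Clean exit: treat as possible OSD flapping and keep the pod alive.
    touch /tmp/osd-sleep
    sleep "${RESTART_INTERVAL}h"
else
    # Non-zero exit (for example 137 = 128+SIGKILL when the daemon is
    # OOMKilled): propagate the failure so Kubernetes restarts the container.
    exit "$rc"
fi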

Comment 8 Santosh Pillai 2023-09-28 16:30:18 UTC
Adding needinfo for comment #7

Comment 27 errata-xmlrpc 2023-11-08 18:54:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

