Bug 2241185

Summary: [RDR] Ceph reports OSD down but pods are in Running state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Pratik Surve <prsurve>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: CLOSED ERRATA
QA Contact: Pratik Surve <prsurve>
Severity: urgent
Priority: unspecified
Version: 4.14
CC: bniver, brgardne, ebenahar, ekuric, kramdoss, kseeger, muagarwa, nojha, odf-bz-bot, pdhange, sapillai, sostapov, tnielsen
Target Release: ODF 4.14.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.14.0-150
Doc Type: No Doc Update
Type: Bug
Last Closed: 2023-11-08 18:54:58 UTC
Clones: 2242025, 2244517
Bug Blocks: 2242025, 2244517

Description Pratik Surve 2023-09-28 13:27:38 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[RDR] Ceph reports OSD down but pods are in Running state


Version of all relevant components (if applicable):
OCP version:- 4.14.0-0.nightly-2023-09-20-033502
ODF version:- 4.14.0-136
CEPH version:- ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)
ACM version:- 2.9.0-109
SUBMARINER version:- devel
VOLSYNC version:- volsync-product.v0.7.4
VOLSYNC method:- destinationCopyMethod: LocalDirect

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster
2. Run workloads for a few days
3. Observe the Ceph status and the OSD pod status (example commands below)
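
The following commands can be used for step 3. This is a sketch that assumes the default openshift-storage namespace and that the rook-ceph toolbox is deployed under the deployment name rook-ceph-tools; adjust names to your environment.

oc -n openshift-storage get pods -l app=rook-ceph-osd
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status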


Actual results:

ceph status
  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive
            Degraded data redundancy: 154474/231711 objects degraded (66.667%), 131 pgs degraded, 169 pgs undersized
 
  services:
    mon:        3 daemons, quorum d,e,f (age 116m)
    mgr:        a(active, since 116m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 1 up (since 110m), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 77.24k objects, 294 GiB
    usage:   884 GiB used, 5.1 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             154474/231711 objects degraded (66.667%)
             131 undersized+degraded+peered
             38  undersized+peered

oc get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-5f7f4cb466-9rcwq   2/2     Running   2          9h
rook-ceph-osd-1-6cbcd67bfb-dpf4n   2/2     Running   0          121m
rook-ceph-osd-2-5f5cfb4f6d-xnhx2   2/2     Running   2          26h
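
To map the OSDs Ceph reports as down to the Running pods above, something like the following can be used (a sketch assuming the rook-ceph-tools toolbox deployment; the numeric OSD ID corresponds to the rook-ceph-osd-<id> deployment/pod name):

oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd tree
# OSD IDs shown as "down" in the tree should be compared against the
# rook-ceph-osd-<id> pods above, which still report STATUS Running.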


Snippet from the OSD pod log:

debug2023-09-28T11:26:47.928+0000 7f5df6527640  1 osd.2 pg_epoch: 20604 pg[12.14( v 20573'397443 (20475'395441,20573'397443] local-lis/les=20489/20490 n=1375 ec=31/22 lis/c=20596/18223 les/c/f=20597/18224/0 sis=20604) [0,2,1] r=1 lpr=20604 pi=[18223,20604)/2 crt=20573'397443 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640  1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 luod=0'0 lua=20592'59504 crt=20603'59507 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df5d26640  1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 luod=0'0 crt=20564'82189 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640  1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 crt=20603'59507 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df5d26640  1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 crt=20564'82189 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640  1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 luod=0'0 crt=20573'454858 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640  1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 crt=20573'454858 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:50.971+0000 7f5df7529640  1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 luod=0'0 crt=20594'95848 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [1,0,2] -> [1,0,2], acting [1,0] -> [1,0,2], acting_primary 1 -> 1, up_primary 1 -> 1, role -1 -> 2, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:50.971+0000 7f5df7529640  1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 crt=20594'95848 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:27:07.822+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.d scrub starts
debug2023-09-28T11:27:08.026+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.d scrub ok
debug2023-09-28T11:27:08.871+0000 7f5dfbd32640  0 log_channel(cluster) log [DBG] : 12.1 scrub starts
debug2023-09-28T11:27:20.526+0000 7f5df7d2a640  0 log_channel(cluster) log [DBG] : 12.1 scrub ok
debug2023-09-28T11:27:37.647+0000 7f5dfad30640  0 log_channel(cluster) log [DBG] : 12.13 scrub starts
debug2023-09-28T11:27:47.852+0000 7f5dfad30640  0 log_channel(cluster) log [DBG] : 12.13 scrub ok
debug2023-09-28T11:27:51.828+0000 7f5df952d640  0 log_channel(cluster) log [DBG] : 11.e scrub starts
debug2023-09-28T11:27:52.066+0000 7f5df952d640  0 log_channel(cluster) log [DBG] : 11.e scrub ok
debug2023-09-28T11:28:51.854+0000 7f5df852b640  0 log_channel(cluster) log [DBG] : 5.0 scrub starts
debug2023-09-28T11:28:52.296+0000 7f5df852b640  0 log_channel(cluster) log [DBG] : 5.0 scrub ok
debug2023-09-28T11:28:52.891+0000 7f5df9d2e640  0 log_channel(cluster) log [DBG] : 11.1d scrub starts
debug2023-09-28T11:28:53.130+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.1d scrub ok
/bin/bash: line 9:   412 Killed                  ceph-osd --foreground --id 2 --fsid 928c8604-6082-4802-893c-3222d5f150ae --setuser ceph --setgroup ceph --crush-location="root=default host=ocs-deviceset-thin-csi-odf-1-data-0457vv rack=rack2" --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.0.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41
+ RESTART_INTERVAL=24
+ rc=0
+ '[' 0 -eq 0 ']'
+ touch /tmp/osd-sleep
OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed
+ echo 'OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed'
+ wait
+ sleep 24h


Expected results:
The OSD pod should not be in Running state if Ceph reports the OSD as down.

Additional info:

Comment 3 Santosh Pillai 2023-09-28 15:44:21 UTC
We added a change to how OSDs are created to support https://bugzilla.redhat.com/show_bug.cgi?id=2223959

If the OSD daemon exits with exit code 0, we assume that it is due to OSD flapping. It looks like in this case the bash script received exit code 0 when the OSD went down, but the exit was caused by something other than OSD flapping.

Need to investigate more.
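
For reference, a minimal sketch of that exit-code handling, reconstructed from the OSD pod log above (RESTART_INTERVAL, /tmp/osd-sleep, and the sleep/wait behaviour are taken from the log; the surrounding structure is an approximation, not the actual Rook-generated script):

#!/bin/bash
# "$@" stands in for the full set of ceph-osd flags seen in the pod log.
ceph-osd --foreground "$@"
rc=$?

RESTART_INTERVAL=24
if [ "$rc" -eq 0 ]; then
    # Exit code 0 is treated as OSD flapping: instead of letting the container
    # exit (and the pod restart), park the pod so an admin can intervene.
    touch /tmp/osd-sleep
    echo "OSD daemon exited with code 0, possibly due to OSD flapping." \
         "The OSD pod will sleep for ${RESTART_INTERVAL} hours." \
         "Restart the pod manually once the flapping issue is fixed"
    sleep "${RESTART_INTERVAL}h" &
    wait
else
    exit "$rc"
fi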

Comment 8 Santosh Pillai 2023-09-28 16:30:18 UTC
Adding needinfo for comment #7

Comment 27 errata-xmlrpc 2023-11-08 18:54:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832