Bug 2241185 - [RDR] Ceph reports OSD down but pods are in Running state
Summary: [RDR] Ceph reports OSD down but pods are in Running state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Santosh Pillai
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On:
Blocks: 2242025 2244517
 
Reported: 2023-09-28 13:27 UTC by Pratik Surve
Modified: 2023-11-08 18:56 UTC
CC: 13 users

Fixed In Version: 4.14.0-150
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2242025 2244517
Environment:
Last Closed: 2023-11-08 18:54:58 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage rook pull 528 0 None Merged Bug 2242433: osd: use bash script if restart interval is > 0 2023-10-16 05:59:43 UTC
Github rook rook issues 12972 0 None open OSD pod goes to sleep when osd daemon is OOMKilled 2023-10-02 19:22:11 UTC
Github rook rook pull 12992 0 None open osd: fix osd script to handle sigterm 2023-10-05 12:40:06 UTC
Red Hat Product Errata RHSA-2023:6832 0 None None None 2023-11-08 18:56:32 UTC

Description Pratik Surve 2023-09-28 13:27:38 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[RDR] Ceph reports OSD down but pods are in Running state


Version of all relevant components (if applicable):
OCP version:- 4.14.0-0.nightly-2023-09-20-033502
ODF version:- 4.14.0-136
CEPH version:- ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)
ACM version:- 2.9.0-109
SUBMARINER version:- devel
VOLSYNC version:- volsync-product.v0.7.4
VOLSYNC method:- destinationCopyMethod: LocalDirect

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster.
2. Run workloads for a few days.
3. Observe the Ceph status and the OSD pod status (see the commands below).
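
For reference, the checks in step 3 can be run as follows (assuming the default openshift-storage namespace and a deployed rook-ceph-tools toolbox; adjust the names to your environment):

# Ceph's view of the OSDs, from the toolbox pod
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd tree

# Kubernetes' view of the OSD pods
oc -n openshift-storage get pods -l app=rook-ceph-osd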


Actual results:

ceph status
  cluster:
    id:     928c8604-6082-4802-893c-3222d5f150ae
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive
            Degraded data redundancy: 154474/231711 objects degraded (66.667%), 131 pgs degraded, 169 pgs undersized
 
  services:
    mon:        3 daemons, quorum d,e,f (age 116m)
    mgr:        a(active, since 116m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 1 up (since 110m), 3 in (since 6d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 77.24k objects, 294 GiB
    usage:   884 GiB used, 5.1 TiB / 6 TiB avail
    pgs:     100.000% pgs not active
             154474/231711 objects degraded (66.667%)
             131 undersized+degraded+peered
             38  undersized+peered

oc get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-5f7f4cb466-9rcwq   2/2     Running   2          9h
rook-ceph-osd-1-6cbcd67bfb-dpf4n   2/2     Running   0          121m
rook-ceph-osd-2-5f5cfb4f6d-xnhx2   2/2     Running   2          26h


Snippet from the OSD pod log:

debug2023-09-28T11:26:47.928+0000 7f5df6527640  1 osd.2 pg_epoch: 20604 pg[12.14( v 20573'397443 (20475'395441,20573'397443] local-lis/les=20489/20490 n=1375 ec=31/22 lis/c=20596/18223 les/c/f=20597/18224/0 sis=20604) [0,2,1] r=1 lpr=20604 pi=[18223,20604)/2 crt=20573'397443 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640  1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 luod=0'0 lua=20592'59504 crt=20603'59507 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df5d26640  1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 luod=0'0 crt=20564'82189 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.945+0000 7f5df9d2e640  1 osd.2 pg_epoch: 20605 pg[1.1d( v 20603'59507 (19815'57422,20603'59507] local-lis/les=20488/20489 n=1094 ec=29/11 lis/c=20488/18226 les/c/f=20489/18227/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18226,20605)/3 crt=20603'59507 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.945+0000 7f5df5d26640  1 osd.2 pg_epoch: 20605 pg[1.5( v 20564'82189 (18532'79706,20564'82189] local-lis/les=20488/20489 n=1075 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[20488,20605)/1 crt=20564'82189 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640  1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 luod=0'0 crt=20573'454858 mlcod 0'0 active mbc={}] start_peering_interval up [0,2,1] -> [0,2,1], acting [0,1] -> [0,2,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:48.946+0000 7f5dfbd32640  1 osd.2 pg_epoch: 20605 pg[12.9( v 20573'454858 (20474'453032,20573'454858] local-lis/les=20489/20490 n=1397 ec=31/22 lis/c=20596/18241 les/c/f=20597/18242/0 sis=20605) [0,2,1] r=1 lpr=20605 pi=[18241,20605)/3 crt=20573'454858 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
debug2023-09-28T11:26:50.971+0000 7f5df7529640  1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 luod=0'0 crt=20594'95848 mlcod 0'0 active mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] start_peering_interval up [1,0,2] -> [1,0,2], acting [1,0] -> [1,0,2], acting_primary 1 -> 1, up_primary 1 -> 1, role -1 -> 2, features acting 4540138320759226367 upacting 4540138320759226367
debug2023-09-28T11:26:50.971+0000 7f5df7529640  1 osd.2 pg_epoch: 20607 pg[1.2( v 20597'95850 (18572'93811,20597'95850] local-lis/les=20488/20489 n=1099 ec=29/11 lis/c=20596/20488 les/c/f=20597/20489/0 sis=20607) [1,0,2] r=2 lpr=20607 pi=[20488,20607)/1 crt=20594'95848 mlcod 0'0 unknown NOTIFY mbc={} ps=[b66~7,b74~7,b82~5,b89~4,b8f~2,b92~1,b96~1,b99~2,b9c~1,b9e~1,ba6~1,baa~1]] state<Start>: transitioning to Stray
debug2023-09-28T11:27:07.822+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.d scrub starts
debug2023-09-28T11:27:08.026+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.d scrub ok
debug2023-09-28T11:27:08.871+0000 7f5dfbd32640  0 log_channel(cluster) log [DBG] : 12.1 scrub starts
debug2023-09-28T11:27:20.526+0000 7f5df7d2a640  0 log_channel(cluster) log [DBG] : 12.1 scrub ok
debug2023-09-28T11:27:37.647+0000 7f5dfad30640  0 log_channel(cluster) log [DBG] : 12.13 scrub starts
debug2023-09-28T11:27:47.852+0000 7f5dfad30640  0 log_channel(cluster) log [DBG] : 12.13 scrub ok
debug2023-09-28T11:27:51.828+0000 7f5df952d640  0 log_channel(cluster) log [DBG] : 11.e scrub starts
debug2023-09-28T11:27:52.066+0000 7f5df952d640  0 log_channel(cluster) log [DBG] : 11.e scrub ok
debug2023-09-28T11:28:51.854+0000 7f5df852b640  0 log_channel(cluster) log [DBG] : 5.0 scrub starts
debug2023-09-28T11:28:52.296+0000 7f5df852b640  0 log_channel(cluster) log [DBG] : 5.0 scrub ok
debug2023-09-28T11:28:52.891+0000 7f5df9d2e640  0 log_channel(cluster) log [DBG] : 11.1d scrub starts
debug2023-09-28T11:28:53.130+0000 7f5df5d26640  0 log_channel(cluster) log [DBG] : 11.1d scrub ok
/bin/bash: line 9:   412 Killed                  ceph-osd --foreground --id 2 --fsid 928c8604-6082-4802-893c-3222d5f150ae --setuser ceph --setgroup ceph --crush-location="root=default host=ocs-deviceset-thin-csi-odf-1-data-0457vv rack=rack2" --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.0.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41
+ RESTART_INTERVAL=24
+ rc=0
+ '[' 0 -eq 0 ']'
+ touch /tmp/osd-sleep
OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed
+ echo 'OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for 24 hours. Restart the pod manually once the flapping issue is fixed'
+ wait
+ sleep 24h
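
For context, the "+ ..." lines above are the trace of the wrapper script that Rook uses to launch the OSD daemon. A minimal sketch of the tail of that logic, reconstructed only from the trace (not the actual Rook script), looks roughly like this:

# Reconstruction of the behaviour visible in the trace above; not the real
# Rook script. Only RESTART_INTERVAL, the rc check, /tmp/osd-sleep and the
# 24h sleep are taken from the trace.
RESTART_INTERVAL=24

# ceph-osd is started earlier in the script; at this point it has already
# exited, and rc holds the exit code the script observed (0 in this trace,
# even though the daemon itself was Killed).
if [ "$rc" -eq 0 ]; then
    touch /tmp/osd-sleep
    echo "OSD daemon exited with code 0, possibly due to OSD flapping. The OSD pod will sleep for $RESTART_INTERVAL hours. Restart the pod manually once the flapping issue is fixed"
    wait
    sleep "${RESTART_INTERVAL}h"
fi

Because the script only sleeps instead of exiting, the container stays up and the pod is reported as Running even though the OSD daemon is gone.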


Expected results:
The OSD pod should not be in Running state if Ceph reports the OSD as down.

Additional info:

Comment 3 Santosh Pillai 2023-09-28 15:44:21 UTC
We added a change to how OSDs are created in order to support https://bugzilla.redhat.com/show_bug.cgi?id=2223959

If the OSD daemon exits with exit code 0, we assume that it is due to OSD flapping. It looks like in this case the bash script received exit code 0 when the OSD went down, but the OSD going down was caused by something other than OSD flapping.

Need to investigate more.
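
For illustration, one way for the wrapper to tell these cases apart is the shell exit status of the daemon: a child killed by SIGKILL (e.g. OOMKilled) returns 128+9=137, so only a genuine exit code 0 should trigger the flapping sleep. A minimal sketch of that idea (illustrative only, not necessarily what the linked rook PRs implement; OSD_ID and RESTART_INTERVAL are placeholders):

# Illustrative sketch; see rook/rook PR 12992 for the actual fix.
ceph-osd --foreground --id "$OSD_ID" "$@"
rc=$?

if [ "$rc" -eq 0 ]; then
    # Clean exit: treat as possible OSD flapping and keep the pod alive.
    touch /tmp/osd-sleep
    sleep "${RESTART_INTERVAL}h"
else
    # Non-zero exit (for example 137 = 128+SIGKILL when the daemon is
    # OOMKilled): propagate the failure so Kubernetes restarts the container.
    exit "$rc"
fi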

Comment 8 Santosh Pillai 2023-09-28 16:30:18 UTC
Adding needinfo for comment #7

Comment 27 errata-xmlrpc 2023-11-08 18:54:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

