Description of problem (please be as detailed as possible and provide log snippets):

- ODF cluster used as the backend for a 3scale project.
- The case started off with the below pods in CrashLoopBackOff, and osd.1 and osd.2 were down.

rook-ceph-crashcollector-ocp-xq4fg-worker-ocs-dndtv-6d5bbdlrx4n   1/1   Running            0                56m   172.26.2.3    ocp-xq4fg-worker-ocs-dndtv   <none>   <none>
rook-ceph-crashcollector-ocp-xq4fg-worker-ocs-kdmdv-5f8985zd5sv   1/1   Running            0                71m   172.27.2.12   ocp-xq4fg-worker-ocs-kdmdv   <none>   <none>
rook-ceph-crashcollector-ocp-xq4fg-worker-ocs-tb22v-8486f4868t7   1/1   Running            0                84m   172.24.4.18   ocp-xq4fg-worker-ocs-tb22v   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58f58df98xjfh   2/2   Running            0                84m   172.24.4.16   ocp-xq4fg-worker-ocs-tb22v   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f74549b77bd9j   2/2   Running            0                64m   172.27.2.15   ocp-xq4fg-worker-ocs-kdmdv   <none>   <none>
rook-ceph-mgr-a-5f45f48656-9sqr7                                  2/2   Running            0                84m   172.24.4.17   ocp-xq4fg-worker-ocs-tb22v   <none>   <none>
rook-ceph-mon-c-77b6d94bb5-hhrn2                                  2/2   Running            0                28m   172.26.2.17   ocp-xq4fg-worker-ocs-dndtv   <none>   <none>
rook-ceph-mon-e-6b9d888fc9-pf7gq                                  2/2   Running            3                26d   172.24.4.11   ocp-xq4fg-worker-ocs-tb22v   <none>   <none>
rook-ceph-mon-f-596ff854bb-6s9gh                                  2/2   Running            0                28m   172.27.2.16   ocp-xq4fg-worker-ocs-kdmdv   <none>   <none>
rook-ceph-operator-7fcb865999-srw2t                               1/1   Running            0                44m   172.26.2.13   ocp-xq4fg-worker-ocs-dndtv   <none>   <none>
rook-ceph-osd-0-6957867bc6-dgdv6                                  2/2   Running            9 (19m ago)      42m   172.24.4.22   ocp-xq4fg-worker-ocs-tb22v   <none>   <none>
rook-ceph-osd-1-66fcb9d68c-qf2h2                                  1/2   Running            9 (5m47s ago)    80m   172.27.2.7    ocp-xq4fg-worker-ocs-kdmdv   <none>   <none>
rook-ceph-osd-2-5d8579f7f4-qzw89                                  1/2   CrashLoopBackOff   12 (2m29s ago)   41m   172.26.2.15   ocp-xq4fg-worker-ocs-dndtv   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5f655f45fmd2   1/2   CrashLoopBackOff   15 (2m52s ago)   71m   172.27.2.13   ocp-xq4fg-worker-ocs-kdmdv   <none>   <none>

[amenon@supportshell-1 must_gather_commands]$ cat ceph_status
  cluster:
    id:     a68b4792-6284-4f1d-9d20-e42ed96b59ef
    health: HEALTH_ERR
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            132/272233 objects unfound (0.048%)
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 177 pgs inactive, 99 pgs down
            Possible data damage: 29 pgs recovery_unfound
            Degraded data redundancy: 221154/816699 objects degraded (27.079%), 60 pgs degraded

  services:
    mon: 3 daemons, quorum c,e,f (age 29m)
    mgr: a(active, since 85m)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 1 up (since 22s), 3 in (since 7M)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   11 pools, 177 pgs
    objects: 272.23k objects, 525 GiB
    usage:   1.6 TiB used, 1.4 TiB / 3 TiB avail
    pgs:     100.000% pgs not active
             221154/816699 objects degraded (27.079%)
             132/272233 objects unfound (0.048%)
             99 down
             31 undersized+degraded+peered
             29 recovery_unfound+undersized+degraded+peered
             18 undersized+peered

- Please note that at this stage only osd.0 was up.

[amenon@supportshell-1 must_gather_commands]$ cat ceph_osd_tree
ID   CLASS  WEIGHT   TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
 -1         3.00000  root default
 -4         1.00000      rack rack0
 -3         1.00000          host ocp-xq4fg-worker-ocs-dndtv
  2    hdd  1.00000              osd.2                         down   1.00000  1.00000
-12         1.00000      rack rack1
-11         1.00000          host ocp-xq4fg-worker-ocs-kdmdv
  1    hdd  1.00000              osd.1                         down   1.00000  1.00000
 -8         1.00000      rack rack2
 -7         1.00000          host ocp-xq4fg-worker-ocs-tb22v
  0    hdd  1.00000              osd.0                           up   1.00000  1.00000

- The customer replaced the OSD nodes for osd.2 and osd.0.
- At this stage, osd.0 was down. It looks like osd.1 and osd.0 were flapping.
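For reference, the outputs above come from the must-gather. A minimal sketch of how the same state can be re-checked live, assuming the rook-ceph toolbox pod is deployed in the default openshift-storage namespace with its default app=rook-ceph-tools label (the TOOLS variable name is purely illustrative):

  # Locate the toolbox pod (assumes the default app=rook-ceph-tools label)
  TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)

  # Overall cluster state, OSD topology, and restart counts of the OSD pods
  oc -n openshift-storage rsh $TOOLS ceph status
  oc -n openshift-storage rsh $TOOLS ceph osd tree
  oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide

  # Crash history, useful for correlating the osd.1/osd.2 CrashLoopBackOff windows for RCA
  oc -n openshift-storage rsh $TOOLS ceph crash ls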
[bhull@supportshell-1 must_gather_commands]$ more ceph_osd_tree
ID   CLASS  WEIGHT   TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
 -1         3.00000  root default
 -4         1.00000      rack rack0
 -3         1.00000          host ocp-xq4fg-worker-ocs-bnsfg
  2    hdd  1.00000              osd.2                           up   1.00000  1.00000
-12         1.00000      rack rack1
-11         1.00000          host ocp-xq4fg-worker-ocs-kdmdv
  1    hdd  1.00000              osd.1                           up   1.00000  1.00000
 -8         1.00000      rack rack2
 -7         1.00000          host ocp-xq4fg-worker-ocs-tb22v
  0    hdd  1.00000              osd.0                         down   1.00000  1.00000

- Currently we have all 3 OSDs up and running.
- All 3 OSD pods are running:

rook-ceph-osd-0-8bdc48b56-k4m6x    2/2   Running   0                2m25s   172.26.4.28   ocp-xq4fg-worker-ocs-ffgzx   <none>   <none>
rook-ceph-osd-1-66fcb9d68c-2j2ht   2/2   Running   44 (7h56m ago)   10h     172.27.2.33   ocp-xq4fg-worker-ocs-kdmdv   <none>   <none>
rook-ceph-osd-2-7f7bfd7fc5-7cpnm   2/2   Running   0                7h30m   172.25.4.29   ocp-xq4fg-worker-ocs-bnsfg   <none>   <none>

- But mon.e is down; its pod is stuck in Pending (see the scheduling-check sketch under Additional info below).

rook-ceph-mon-e-6b9d888fc9-djkcs   0/2   Pending   0   12m   <none>   <none>   <none>   <none>

- The cluster is currently backfilling, but the main issue is that we have 46 pgs incomplete + 1 pg in recovery_unfound.

[amenon@supportshell-1 must_gather_commands]$ cat ceph_status
  cluster:
    id:     a68b4792-6284-4f1d-9d20-e42ed96b59ef
    health: HEALTH_ERR
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            1/3 mons down, quorum f,g
            4/223819 objects unfound (0.002%)
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
            Reduced data availability: 47 pgs inactive, 46 pgs incomplete
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 221461/671457 objects degraded (32.982%), 98 pgs degraded, 112 pgs undersized
            29 slow ops, oldest one blocked for 11261 sec, daemons [osd.0,osd.2] have slow ops.

  services:
    mon: 3 daemons, quorum f,g (age 6h), out of quorum: e
    mgr: a(active, since 8h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 3m), 3 in (since 3m); 70 remapped pgs

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   11 pools, 177 pgs
    objects: 223.82k objects, 439 GiB
    usage:   947 GiB used, 2.1 TiB / 3 TiB avail
    pgs:     26.554% pgs not active
             221461/671457 objects degraded (32.982%)
             118865/671457 objects misplaced (17.703%)
             4/223819 objects unfound (0.002%)
             49 active+undersized+degraded+remapped+backfill_wait
             47 active+undersized+degraded
             46 incomplete
             14 active+undersized
             10 active+clean+remapped
              9 active+clean
              1 active+undersized+degraded+remapped+backfilling
              1 recovery_unfound+undersized+degraded+remapped+peered

- Ceph is still in HEALTH_ERR because of the 46 pgs incomplete + 1 pg in recovery_unfound.
- The customer confirmed that the old osd.0 is empty. As we don't have any data on osd.0, the 46 pgs are still stuck incomplete and we are not able to recover them.
- Ceph is concerned about these 46 pgs, whose acting sets are [1,2] and [2,1], so the data is in question at this point.

[WRN] PG_AVAILABILITY: Reduced data availability: 46 pgs inactive, 46 pgs incomplete
    pg 1.0 is incomplete, acting [1,2]
    pg 1.1 is incomplete, acting [2,1]
    pg 1.14 is incomplete, acting [2,1]
    .
    .
    .

- Need help from engineering on whether we can recover these pgs and get to RCA (a diagnostic sketch follows below).
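To characterize the 46 incomplete PGs and the recovery_unfound PG while waiting on engineering, a hedged, read-only diagnostic sketch, assumed to be run from the toolbox pod (pg 1.0 is only an example ID taken from the health detail above; the recovery_unfound pgid placeholder must be filled in from health detail):

  # Full health detail, including the per-PG incomplete list and which OSDs/CRUSH nodes carry flags
  ceph health detail
  ceph osd dump | grep flags          # cluster-wide flags; per-OSD flags are listed under OSD_FLAGS in health detail

  # Per-PG peering state: blockers show up under "recovery_state"
  # (e.g. "down_osds_we_would_probe", "peering_blocked_by")
  ceph pg 1.0 query | less

  # Stuck-inactive summary and the objects Ceph knows about but cannot find a copy of
  ceph pg dump_stuck inactive
  ceph pg <recovery_unfound pgid> list_unfound

The last-resort options (ceph pg <pgid> mark_unfound_lost revert|delete for the unfound objects, or ceph-objectstore-tool --op mark-complete for the incomplete PGs) discard whatever data only existed on the lost OSD, so they are intentionally not listed as steps here and would only be run with engineering sign-off.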
Version of all relevant components (if applicable):
- ODF v4.10.12
- ceph version 16.2.7-126.el8cp

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
- Yes. Business services are currently unavailable; the ODF cluster is used as the backend for their 3scale project. The customer will try to bring up the applications once backfilling is completed.

Is there any workaround available to the best of your knowledge?
N/A

Additional info:
All must-gathers are available in supportshell under ~/03531113
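As referenced in the mon.e note above, a short sketch for pulling the scheduler's reason for the Pending mon pod, assuming the usual causes after a node replacement (node affinity still pinned to the removed node, or an unbound PVC). The pod name is the one from the output above; the rook-ceph-mon-e deployment name, the cluster.ocs.openshift.io/openshift-storage node label, and PVC-backed mons are assumptions based on ODF defaults:

  # The Events section at the bottom shows why the scheduler cannot place the pod
  oc -n openshift-storage describe pod rook-ceph-mon-e-6b9d888fc9-djkcs | tail -n 20

  # Node affinity of the mon deployment vs. the currently labelled storage nodes
  oc -n openshift-storage get deployment rook-ceph-mon-e -o yaml | grep -A10 nodeAffinity
  oc get nodes -l cluster.ocs.openshift.io/openshift-storage -o wide

  # PVC backing the mon, if mons are PVC-backed in this cluster
  oc -n openshift-storage get pvc | grep mon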