Description of problem (please be detailed as possible and provide log snippets):

From case 03489442: The customer accidentally deleted 3 nodes from AWS and rebuilt them from AWS shortly afterwards; 2 of the deleted nodes were storage nodes. After that we saw 16 OSDs down:

openshift-storage   rook-ceph-osd-0-69dc458f97-d2spj    0/2   Pending   0   12h
openshift-storage   rook-ceph-osd-1-7b4cb48447-ssrcg    0/2   Pending   0   12h
openshift-storage   rook-ceph-osd-10-77bcfc7dcc-7fkmr   0/2   Pending   0   12h
openshift-storage   rook-ceph-osd-11-c4544797d-fmxk9    0/2   Pending   0   11h
openshift-storage   rook-ceph-osd-12-bcfc77d94-xclzf    0/2   Pending   0   11h
openshift-storage   rook-ceph-osd-13-dfbf556fd-xm4pv    0/2   Pending   0   12h
openshift-storage   rook-ceph-osd-14-5c4f6656db-qfxzk   0/2   Pending   0   11h
openshift-storage   rook-ceph-osd-15-ff5f7cb49-ph7h7    0/2   Pending   0   12h
openshift-storage   rook-ceph-osd-16-b546dcf65-t97ch    0/2   Pending   0   12h
openshift-storage   rook-ceph-osd-17-5c8df8dcd5-cs2t8   0/2   Pending   0   11h
openshift-storage   rook-ceph-osd-2-7b8b5986cb-5kgq4    0/2   Pending   0   11h
openshift-storage   rook-ceph-osd-3-c84669957-sd6ml     0/2   Pending   0   10h
openshift-storage   rook-ceph-osd-4-bd45cbdd-fhpc9      0/2   Pending   0   10h
openshift-storage   rook-ceph-osd-5-8456875449-l6r59    0/2   Pending   0   10h
openshift-storage   rook-ceph-osd-8-5d76df9687-kvdvw    0/2   Pending   0   10h
openshift-storage   rook-ceph-osd-9-76b9d87c95-kbhbg    0/2   Pending   0   12h

We then followed case 03447387 to re-create osd-0, which is now up and running (see #428). Later on, osd-2 also came up. The remaining OSDs have not come up so far, and we then saw NooBaa stuck in the Terminating state.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product?
(please explain in detail what is the user impact)?

The customer is not very familiar with the Multicloud Object Gateway, and we do not see RGW in ceph -s so far, which is why we cannot risk deleting NooBaa. We have checked that the customer still has enough capacity to keep their business running. We would like someone from the Engineering team to check the NooBaa status and recommend how to move forward (a read-only diagnostic sketch is included under Additional info below).

Is there any workaround available to the best of your knowledge?

We are still waiting for the remaining OSDs to come up; backfill is also in progress on the backend.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug
(1 - very simple, 5 - very complex)?

Can this issue be reproducible?

NA

Can this issue reproduce from the UI?

NA

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
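For triage, a minimal read-only diagnostic sketch, assuming a standard ODF/OCS deployment (the pod name is taken from the listing above; the node label and the NooBaa CR name "noobaa" are the usual ODF defaults and have not been verified against this cluster):

    # Why is an OSD pod Pending? Check the scheduling events.
    oc -n openshift-storage describe pod rook-ceph-osd-3-c84669957-sd6ml

    # Confirm the rebuilt nodes carry the ODF storage label; OSD pods will not schedule onto unlabelled nodes.
    oc get nodes -l cluster.ocs.openshift.io/openshift-storage

    # Check whether the OSD device-set PVCs still reference volumes from the deleted nodes.
    oc -n openshift-storage get pvc | grep ocs-deviceset

    # Inspect the NooBaa CR that appears stuck in Terminating (pending finalizers are the usual cause).
    oc -n openshift-storage get noobaa noobaa -o yaml | grep -B2 -A5 finalizers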
ceph health detail

sh-4.4$ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; Reduced data availability: 10 pgs inactive, 4 pgs incomplete; 1 daemons have recently crashed; 56 slow ops, oldest one blocked for 15249 sec, daemons [osd.13,osd.3,osd.5,osd.9] have slow ops.
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client ip-10-40-9-207:csi-cephfs-node failing to respond to capability release client_id: 6560476
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 7513 secs
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): 3 slow requests are blocked > 30 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 10 pgs inactive, 4 pgs incomplete
    pg 2.1c is stuck inactive for 4h, current state unknown, last acting []
    pg 2.24 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.27 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.3f is incomplete, acting [3,19,12] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 2.b8 is incomplete, acting [0,3,12] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 2.e9 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.189 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.1b2 is incomplete, acting [19,3,18] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 2.1c7 is incomplete, acting [13,0,3] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 4.30 is stuck inactive for 4h, current state unknown, last acting []
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    client.admin crashed on host rook-ceph-osd-15-75875f74b4-mcl2f at 2023-04-18T12:22:56.598997Z
[WRN] SLOW_OPS: 56 slow ops, oldest one blocked for 15249 sec, daemons [osd.13,osd.3,osd.5,osd.9] have slow ops.
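For the PG_AVAILABILITY and RECENT_CRASH warnings above, a minimal read-only sketch (run from the rook-ceph toolbox pod; the PG ID and pool name are taken from the health output above) to collect what Engineering will likely ask for:

    # List the stuck/inactive PGs and any OSDs still reported down.
    ceph pg dump_stuck inactive
    ceph osd tree down

    # Query one of the incomplete PGs to see which missing OSDs it is waiting to probe.
    ceph pg 2.3f query

    # Current replication settings of the affected pool. The health output suggests lowering min_size,
    # but that trades data safety for availability and should only be done under Engineering guidance.
    ceph osd pool get ocs-storagecluster-cephblockpool size
    ceph osd pool get ocs-storagecluster-cephblockpool min_size

    # Details of the recently reported daemon crash.
    ceph crash ls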