Bug 2187580
| Summary: | [GSS] CU deleted 3 nodes from AWS, 2 related to storage then 16 OSDs down, we try to rebuild OSDs and so on | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | lema |
| Component: | ceph | Assignee: | Radoslaw Zarzynski <rzarzyns> |
| ceph sub component: | RADOS | QA Contact: | Elad <ebenahar> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | bkunal, bniver, hnallurv, juqiao, lsantann, muagarwa, ocs-bugs, odf-bz-bot, rzarzyns, sostapov, tnielsen |
| Version: | 4.10 | Flags: | lsantann: needinfo? (rzarzyns) |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-04 06:47:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
lema
2023-04-18 03:28:48 UTC
```
sh-4.4$ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; Reduced data availability: 10 pgs inactive, 4 pgs incomplete; 1 daemons have recently crashed; 56 slow ops, oldest one blocked for 15249 sec, daemons [osd.13,osd.3,osd.5,osd.9] have slow ops.
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client ip-10-40-9-207:csi-cephfs-node failing to respond to capability release client_id: 6560476
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 7513 secs
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): 3 slow requests are blocked > 30 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 10 pgs inactive, 4 pgs incomplete
    pg 2.1c is stuck inactive for 4h, current state unknown, last acting []
    pg 2.24 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.27 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.3f is incomplete, acting [3,19,12] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 2.b8 is incomplete, acting [0,3,12] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 2.e9 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.189 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.1b2 is incomplete, acting [19,3,18] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 2.1c7 is incomplete, acting [13,0,3] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 4.30 is stuck inactive for 4h, current state unknown, last acting []
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    client.admin crashed on host rook-ceph-osd-15-75875f74b4-mcl2f at 2023-04-18T12:22:56.598997Z
[WRN] SLOW_OPS: 56 slow ops, oldest one blocked for 15249 sec, daemons [osd.13,osd.3,osd.5,osd.9] have slow ops.
```
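The PG_AVAILABILITY lines repeat Ceph's own hint that lowering the pool's min_size from 2 may let the incomplete PGs go active. A minimal sketch of how that hint could be inspected and applied from the rook-ceph toolbox pod (pool name and PG id are taken from the log above; this is a temporary availability-over-durability trade, not a fix for the lost OSDs):

```shell
# Query one of the incomplete PGs to see why peering is stuck
# (look at "recovery_state" and "peering_blocked_by" in the output).
ceph pg 2.3f query

# Check the pool's current replication settings.
ceph osd pool get ocs-storagecluster-cephblockpool size
ceph osd pool get ocs-storagecluster-cephblockpool min_size

# Temporarily lower min_size so PGs with only one surviving replica
# can serve I/O. CAUTION: data written while min_size=1 has no redundancy.
ceph osd pool set ocs-storagecluster-cephblockpool min_size 1

# After recovery completes and PGs are active+clean, restore the default.
ceph osd pool set ocs-storagecluster-cephblockpool min_size 2
```

Details of the reported daemon crash can be pulled with `ceph crash ls` and `ceph crash info <crash-id>`.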