Bug 2187580 - [GSS] CU deleted 3 nodes from AWS, 2 related to storage then 16 OSDs down, we try to rebuild OSDs and so on [NEEDINFO]
Summary: [GSS] CU deleted 3 nodes from AWS, 2 related to storage then 16 OSDs down, we...
Keywords:
Status: CLOSED DUPLICATE of bug 2174612
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Radoslaw Zarzynski
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-18 03:28 UTC by lema
Modified: 2023-08-09 16:37 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-04 06:47:13 UTC
Embargoed:
lsantann: needinfo? (rzarzyns)



Description lema 2023-04-18 03:28:48 UTC
Description of problem (please be as detailed as possible and provide log snippets):

From case 03489442: 

The customer (CU) accidentally deleted 3 nodes from AWS and rebuilt them from AWS shortly afterwards; 2 of the nodes were related to storage.

After that, we saw 16 OSDs down.

openshift-storage  rook-ceph-osd-0-69dc458f97-d2spj                                 0/2    Pending    0         12h
openshift-storage  rook-ceph-osd-1-7b4cb48447-ssrcg                                 0/2    Pending    0         12h
openshift-storage  rook-ceph-osd-10-77bcfc7dcc-7fkmr                                0/2    Pending    0         12h
openshift-storage  rook-ceph-osd-11-c4544797d-fmxk9                                 0/2    Pending    0         11h
openshift-storage  rook-ceph-osd-12-bcfc77d94-xclzf                                 0/2    Pending    0         11h
openshift-storage  rook-ceph-osd-13-dfbf556fd-xm4pv                                 0/2    Pending    0         12h
openshift-storage  rook-ceph-osd-14-5c4f6656db-qfxzk                                0/2    Pending    0         11h
openshift-storage  rook-ceph-osd-15-ff5f7cb49-ph7h7                                 0/2    Pending    0         12h
openshift-storage  rook-ceph-osd-16-b546dcf65-t97ch                                 0/2    Pending    0         12h
openshift-storage  rook-ceph-osd-17-5c8df8dcd5-cs2t8                                0/2    Pending    0         11h
openshift-storage  rook-ceph-osd-2-7b8b5986cb-5kgq4                                 0/2    Pending    0         11h
openshift-storage  rook-ceph-osd-3-c84669957-sd6ml                                  0/2    Pending    0         10h
openshift-storage  rook-ceph-osd-4-bd45cbdd-fhpc9                                   0/2    Pending    0         10h
openshift-storage  rook-ceph-osd-5-8456875449-l6r59                                 0/2    Pending    0         10h
openshift-storage  rook-ceph-osd-8-5d76df9687-kvdvw                                 0/2    Pending    0         10h
openshift-storage  rook-ceph-osd-9-76b9d87c95-kbhbg                                 0/2    Pending    0         12h
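
For reference, a listing like the one above can be produced with a command along these lines (the app=rook-ceph-osd label and the openshift-storage namespace are the defaults for an ODF install and are assumptions here):

# list only the OSD pods in the ODF namespace
oc -n openshift-storage get pods -l app=rook-ceph-osd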

We then followed case 03447387 to re-create osd-0, which worked; osd-0 has been up and running since #428.

Later on, osd-2 also came up.

But the remaining OSDs have not come up so far, and we then saw NooBaa in a Terminating state.
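
For context, the OSD re-creation attempted here roughly follows the standard ODF 4.x flow (a sketch only; the exact steps are in case 03447387, and osd.0 is used as an example ID):

# scale down the deployment of the failed OSD
oc -n openshift-storage scale deployment rook-ceph-osd-0 --replicas=0
# run the OSD removal job from the ocs-osd-removal template
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -f -
# confirm the removal job completed before the operator re-creates the OSD
oc -n openshift-storage logs -l job-name=ocs-osd-removal-job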


Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product?
(please explain in detail what is the user impact)?

The customer is not very familiar with the Multicloud Object Gateway (NooBaa), and I have not seen RGW in ceph -s so far; that is why we cannot risk deleting NooBaa.

But we have checked, and so far the customer still has enough capacity for their business. We would like someone from the Engineering team to check the NooBaa status and give us a recommendation on how to move forward.
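
A quick way to confirm whether RGW is deployed and what state NooBaa is in (assuming the default openshift-storage namespace) would be:

# RGW is typically not deployed on AWS/cloud installs; the object service there is NooBaa/MCG
ceph -s | grep -i rgw
oc -n openshift-storage get pods | grep rgw
# NooBaa system resource and pods
oc -n openshift-storage get noobaa
oc -n openshift-storage get pods | grep noobaa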

Is there any workaround available to the best of your knowledge?

We are still waiting for the remaining OSDs to come up; backend backfill is also in progress now.
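
Recovery and backfill progress can be watched with the usual Ceph status commands, for example:

ceph -s            # overall health and recovery/backfill progress
ceph pg stat       # PG counts per state
ceph osd df tree   # per-OSD utilization while data rebalances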

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue be reproducible?

NA

Can this issue reproduce from the UI?

NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 7 Levy Sant'Anna 2023-04-18 18:36:02 UTC
ceph health detail

sh-4.4$ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; Reduced data availability: 10 pgs inactive, 4 pgs incomplete; 1 daemons have recently crashed; 56 slow ops, oldest one blocked for 15249 sec, daemons [osd.13,osd.3,osd.5,osd.9] have slow ops.
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client ip-10-40-9-207:csi-cephfs-node failing to respond to capability release client_id: 6560476
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 7513 secs
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): 3 slow requests are blocked > 30 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 10 pgs inactive, 4 pgs incomplete
    pg 2.1c is stuck inactive for 4h, current state unknown, last acting []
    pg 2.24 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.27 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.3f is incomplete, acting [3,19,12] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 2.b8 is incomplete, acting [0,3,12] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 2.e9 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.189 is stuck inactive for 4h, current state unknown, last acting []
    pg 2.1b2 is incomplete, acting [19,3,18] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 2.1c7 is incomplete, acting [13,0,3] (reducing pool ocs-storagecluster-cephblockpool min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 4.30 is stuck inactive for 4h, current state unknown, last acting []
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    client.admin crashed on host rook-ceph-osd-15-75875f74b4-mcl2f at 2023-04-18T12:22:56.598997Z
[WRN] SLOW_OPS: 56 slow ops, oldest one blocked for 15249 sec, daemons [osd.13,osd.3,osd.5,osd.9] have slow ops.
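
To dig further into the incomplete PGs and the reported crash, something along these lines may help (pg 2.3f is taken from the output above as an example; <crash-id> is a placeholder):

ceph pg 2.3f query                                            # detailed peering state of one incomplete PG
ceph osd pool get ocs-storagecluster-cephblockpool min_size   # current min_size, before considering the hint above
ceph crash ls                                                 # list recently reported daemon crashes
ceph crash info <crash-id>                                    # details for a specific crash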

