Bug 2143554 - [MS] Ceph health is not okay after Tier4b tests on ROSA4.11 V2.0.9 clusters. OSD recovery takes longer than expected
Summary: [MS] Ceph health is not okay after Tier4b tests on ROSA4.11 V2.0.9 clusters. ...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
: ---
Assignee: Nobody
QA Contact: Neha Berry
URL:
Whiteboard:
Duplicates: 2143555
Depends On:
Blocks:
Reported: 2022-11-17 08:02 UTC by suchita
Modified: 2023-08-09 17:00 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-03 13:20:27 UTC
Embargoed:



Description suchita 2022-11-17 08:02:43 UTC
Description of problem:
Ceph health is not okay during Tier4b tests on ROSA4.11 V2.0.9 clusters.
OSD recovery takes longer than expected.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy a ROSA 4.11 provider and consumer cluster.
2. Perform day 2 operations / run the Tier4b regression tests.
3. (Will update with more specific reproducer details after analysis.)


Actual results:
An OSD is down and Ceph health is not okay.

Expected results:
Ceph health should be okay and the OSD should recover within the timeout.
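
The expected behavior above could be checked with a small polling helper. This is only a sketch, not the check the Tier4b tests actually use: the function names, the 600 second timeout, and the 10 second interval are all hypothetical, and it assumes the `ceph` CLI is reachable from wherever the script runs (for example, from a toolbox pod).

```python
import subprocess
import time


def ceph_health_status():
    """Return the first token of `ceph health` output, e.g. HEALTH_OK."""
    # Assumption: the `ceph` CLI is on PATH and can reach the cluster.
    out = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True, check=True
    ).stdout
    tokens = out.split()
    return tokens[0] if tokens else ""


def wait_for_health_ok(get_status=ceph_health_status, timeout=600,
                       interval=10, sleep=time.sleep, clock=time.monotonic):
    """Poll until the status is HEALTH_OK.

    Returns the elapsed time in seconds, or raises TimeoutError if the
    cluster has not recovered within `timeout` seconds (hypothetical default).
    """
    start = clock()
    while True:
        status = get_status()
        elapsed = clock() - start
        if status == "HEALTH_OK":
            return elapsed
        if elapsed >= timeout:
            raise TimeoutError(
                f"Ceph still reports {status!r} after {elapsed:.0f}s"
            )
        sleep(interval)
```

Because the helper returns the elapsed time, the same call could also be used to record how long recovery takes on each run.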

Additional info:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1013/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-c2p/sgatfane-c2p_20221116T060915/multicluster/logs/test_report_1668602746.html
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1011/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-16npr/sgatfane-16npr_20221116T043318/multicluster/logs/test_report_1668602449.html
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-16npr/sgatfane-16npr_20221116T043318/multicluster/logs/test_report_1668602490.html

Must gather:

Comment 1 suchita 2022-11-17 08:04:43 UTC
*** Bug 2143555 has been marked as a duplicate of this bug. ***

Comment 2 Elad 2022-11-20 11:25:06 UTC
Suchita, would it be possible to compare the time it takes the OSD to recover between an SDN based cluster and an OVN one?

Comment 8 suchita 2023-04-10 13:47:38 UTC
This has been observed where the Provider is SDN and the Consumer is OVN.

With SDN Provider + SDN Consumer, we have a few runs where this behavior was observed inconsistently.

This time we have Provider 4.11 (OVN) and Consumer 4.11 (OVN); in the first 2 attempts on the freshly deployed cluster, this behavior was not observed.

However, because this issue is inconsistent, the OSD rebalancing time is not the same every time.

Comment 9 suchita 2023-04-10 13:56:05 UTC
I have observed a similar issue mostly while running node operation tests on the upgraded cluster.

Comment 10 Rewant 2023-07-03 13:20:27 UTC
All our providers and consumers are already on ROSA4.11, and going forward with the new service we are not going to have providers on SDN. Hence moving this to WONTFIX.

