Bug 2143554

Summary: [MS] Ceph health is not okay after Tier4b tests on ROSA4.11 V2.0.9 clusters. OSD recovery taking more time than expected
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: suchita <sgatfane>
Component: odf-managed-service Assignee: Nobody <nobody>
Status: CLOSED WONTFIX QA Contact: Neha Berry <nberry>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.10 CC: aeyal, ebenahar, ocs-bugs, odf-bz-bot, resoni, sgatfane
Target Milestone: --- Keywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-03 13:20:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description suchita 2022-11-17 08:02:43 UTC
Description of problem:
Ceph health is not okay during Tier4b tests on ROSA4.11 V2.0.9 clusters.
OSD recovery is taking more time than expected.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy a ROSA 4.11 provider/consumer cluster
2. Perform day 2 operations / run Tier4b regression tests
3. (More specific reproducer details will be added after analysis)


Actual results:
An OSD is down and Ceph health is not okay

Expected results:
Ceph health should be okay and the OSD should recover within the timeout
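One way to verify the expected state above is to inspect the output of `ceph status --format json` (on ODF clusters this is typically run from the rook-ceph tools pod). The sketch below is illustrative only: `summarize_ceph_status` is a hypothetical helper, the sample payload is fabricated to resemble the failure state (one OSD down), and the exact JSON field layout varies across Ceph releases, so adjust the keys to your version's output.

```python
import json

def summarize_ceph_status(status_json: str) -> dict:
    """Summarize health and OSD availability from `ceph status --format json` output.

    Assumes the osdmap counters sit directly under "osdmap"; older Ceph
    releases nest them one level deeper."""
    status = json.loads(status_json)
    osdmap = status["osdmap"]
    return {
        "health": status["health"]["status"],
        "osds_total": osdmap["num_osds"],
        "osds_up": osdmap["num_up_osds"],
        "osds_in": osdmap["num_in_osds"],
        # Recovery is complete only once every OSD is back up.
        "all_osds_up": osdmap["num_up_osds"] == osdmap["num_osds"],
    }

# Fabricated sample resembling the failure reported here: 3 OSDs, one down.
sample = json.dumps({
    "health": {"status": "HEALTH_WARN"},
    "osdmap": {"num_osds": 3, "num_up_osds": 2, "num_in_osds": 3},
})
print(summarize_ceph_status(sample))
```

On a live cluster the JSON would come from something like `oc rsh -n openshift-storage deploy/rook-ceph-tools ceph status --format json` (pod and namespace names depend on the deployment).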

Additional info:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1013/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-c2p/sgatfane-c2p_20221116T060915/multicluster/logs/test_report_1668602746.html
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1011/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-16npr/sgatfane-16npr_20221116T043318/multicluster/logs/test_report_1668602449.html
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-16npr/sgatfane-16npr_20221116T043318/multicluster/logs/test_report_1668602490.html

Must gather:

Comment 1 suchita 2022-11-17 08:04:43 UTC
*** Bug 2143555 has been marked as a duplicate of this bug. ***

Comment 2 Elad 2022-11-20 11:25:06 UTC
Suchita, would it be possible to compare the time it takes the OSD to recover between an SDN-based cluster and an OVN one?
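One way to make that comparison concrete is to poll `ceph status` on each cluster at a fixed interval and record how long health stays degraded. The helper below is a hypothetical sketch (not part of the test framework): it takes chronologically ordered (timestamp, health) samples and returns the time from the first degraded sample to the first subsequent HEALTH_OK.

```python
from datetime import datetime, timedelta
from typing import Optional

def recovery_duration(samples: list[tuple[datetime, str]]) -> Optional[timedelta]:
    """Return the time from the first non-OK sample to the first subsequent
    HEALTH_OK, or None if the cluster never recovered in the sampled window."""
    degraded_at = None
    for ts, health in samples:
        if health != "HEALTH_OK" and degraded_at is None:
            degraded_at = ts  # start of the degraded window
        elif health == "HEALTH_OK" and degraded_at is not None:
            return ts - degraded_at
    return None

# Example: cluster degrades at 10:00 and is back to HEALTH_OK by 10:12.
t0 = datetime(2022, 11, 16, 10, 0)
samples = [
    (t0, "HEALTH_WARN"),
    (t0 + timedelta(minutes=6), "HEALTH_WARN"),
    (t0 + timedelta(minutes=12), "HEALTH_OK"),
]
print(recovery_duration(samples))  # 0:12:00
```

Running the same polling loop against an SDN-based and an OVN-based cluster after the same OSD restart would give directly comparable durations.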

Comment 8 suchita 2023-04-10 13:47:38 UTC
This has been observed where the provider is SDN and the consumer is OVN.

With an SDN provider + SDN consumer, we have a few runs where this behavior is observed inconsistently.

This time we have a 4.11 provider (OVN) and a 4.11 consumer (OVN); in the first two attempts on the freshly deployed cluster this behavior was not observed.

However, as this issue is inconsistent, the OSD rebalancing time is not the same every run.

Comment 9 suchita 2023-04-10 13:56:05 UTC
I have observed a similar issue mostly while running node operation tests on the upgraded cluster.

Comment 10 Rewant 2023-07-03 13:20:27 UTC
As all our providers and consumers are already on ROSA 4.11, and going forward with the new service we will not have providers on SDN, moving this to WONTFIX.