Description of problem:
Ceph health is not OK during Tier4b tests on ROSA 4.11 V2.0.9 clusters. OSD recovery takes longer than expected.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy a ROSA 4.11 provider/consumer cluster.
2. Perform day 2 operations / run the tier4b regression tests.
3. (More specific reproducer details will be added after analysis.)

Actual results:
An OSD is down and Ceph health is not OK.

Expected results:
Ceph health should be OK and the OSD should recover within a timeout.

Additional info:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1013/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-c2p/sgatfane-c2p_20221116T060915/multicluster/logs/test_report_1668602746.html
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1011/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-16npr/sgatfane-16npr_20221116T043318/multicluster/logs/test_report_1668602449.html
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-16npr/sgatfane-16npr_20221116T043318/multicluster/logs/test_report_1668602490.html

Must gather:
*** Bug 2143555 has been marked as a duplicate of this bug. ***
Suchita, would it be possible to compare the time it takes the OSD to recover on an SDN-based cluster with the time on an OVN-based one?
This has been observed where the provider is SDN and the consumer is OVN. With an SDN provider and an SDN consumer, we have a few runs where this behavior shows up inconsistently. This time we have a 4.11 (OVN) provider and a 4.11 (OVN) consumer, and in the first 2 attempts on the freshly deployed cluster the behavior was not observed. However, the issue is intermittent, and the OSD rebalancing time is not the same on every run.
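To make the SDN vs OVN comparison concrete, recovery time could be measured with a small polling helper like the one below. This is only a sketch, not what the tier4b tests actually do: `wait_for_health_ok`, its parameters, and the injected `get_health` callable are hypothetical names, and the 600 s timeout and 10 s interval are placeholder values. In a real run `get_health` might wrap the output of `ceph health`, which begins with `HEALTH_OK`, `HEALTH_WARN`, or `HEALTH_ERR`.

```python
import time
from typing import Callable


def wait_for_health_ok(
    get_health: Callable[[], str],
    timeout_s: float = 600.0,   # placeholder value, not taken from the test suite
    interval_s: float = 10.0,   # placeholder value
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll get_health() until it reports HEALTH_OK.

    Returns the elapsed seconds until health was OK.
    Raises TimeoutError if health is still not OK once timeout_s has passed.
    """
    start = clock()
    while True:
        status = get_health()
        elapsed = clock() - start
        # The status string may carry a detail suffix, so match on the prefix.
        if status.startswith("HEALTH_OK"):
            return elapsed
        if elapsed >= timeout_s:
            raise TimeoutError(
                f"health still {status!r} after {elapsed:.0f}s"
            )
        sleep(interval_s)
```

Recording the returned value for each run on each network type would give comparable numbers instead of a pass/fail result against a single timeout.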
I have observed a similar issue, mostly while running node operation tests on an upgraded cluster.
All our providers and consumers are already on ROSA 4.11, and going forward with the new service we are not going to have providers on SDN. Hence moving this to won't fix.