Description of problem (please be as detailed as possible and provide log snippets):

Lately we see a lot of teardown failures in the add_capacity test case. I recently added a check that the re-balance must be reported complete 3 times in a row before it is considered done: https://github.com/red-hat-storage/ocs-ci/pull/6578/files (a rough sketch of this check logic appears after the report details below).

Even after this change, where I can see it completed 3 times with HEALTH_OK:

2022-10-29 04:14:25 04:14:25 - MainThread - ocs_ci.ocs.cluster - INFO - Re-balance completed! This is attempt 1 out of 3 repeats. This rebalance check needs to prove it 3 times in row.
2022-10-29 04:15:09 04:15:09 - MainThread - ocs_ci.ocs.cluster - INFO - Re-balance completed! This is attempt 2 out of 3 repeats. This rebalance check needs to prove it 3 times in row.
2022-10-29 04:16:09 04:16:08 - MainThread - ocs_ci.ocs.cluster - INFO - Re-balance completed! This is attempt 3 out of 3 repeats. This rebalance check needs to prove it 3 times in row.

I see that even between the attempts the health flips to HEALTH_WARN and back to HEALTH_OK. See here:

2022-10-29 04:15:39 04:15:39 - MainThread - ocs_ci.ocs.cluster - INFO - {'status': 'HEALTH_WARN', 'checks': {'PG_AVAILABILITY': {'severity': 'HEALTH_WARN', 'summary': {'message': 'Reduced data availability: 1 pg peering', 'count': 1}, 'muted': False}}, 'mutes': []}
2022-10-29 04:15:39 04:15:39 - MainThread - ocs_ci.ocs.cluster - INFO - [{'state_name': 'active+clean', 'count': 192}, {'state_name': 'peering', 'count': 1}]
2022-10-29 04:15:54 04:15:54 - MainThread - ocs_ci.ocs.cluster - INFO - {'status': 'HEALTH_OK', 'checks': {}, 'mutes': []}
2022-10-29 04:15:54 04:15:54 - MainThread - ocs_ci.ocs.cluster - INFO - [{'state_name': 'active+clean', 'count': 189}, {'state_name': 'peering', 'count': 2}, {'state_name': 'remapped+peering', 'count': 2}]
2022-10-29 04:16:09 04:16:08 - MainThread - ocs_ci.ocs.cluster - INFO - {'status': 'HEALTH_OK', 'checks': {}, 'mutes': []}
2022-10-29 04:16:09 04:16:08 - MainThread - ocs_ci.ocs.cluster - INFO - [{'state_name': 'active+clean', 'count': 193}]
2022-10-29 04:16:11 04:16:11 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN Reduced data availability: 1 pg peering
2022-10-29 04:16:11 , Retrying in 30 seconds...

So the teardown health check eventually fails with:

2022-10-29 04:36:41 E ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN Reduced data availability: 1 pg peering

From what I can see it happens only on AWS and started roughly 2-3 weeks ago, with no change to the test case. Is this expected behavior, or is something wrong in Ceph or AWS?

Version of all relevant components (if applicable):
ODF: 4.12.0-82
OCP: 4.12 nightly

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Add capacity.
2. Wait for the re-balance to complete - Ceph health is OK.
3. The health then flip-flops between HEALTH_WARN (Reduced data availability: 1 pg peering) and HEALTH_OK.
Actual results:
Teardown fails with HEALTH_WARN "Reduced data availability: 1 pg peering".

Expected results:
Ceph health does not keep flipping between HEALTH_OK and HEALTH_WARN after the re-balance has completed.

Additional info:
Job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/5907/
Parent job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-acceptance/299/
Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-299ai3c33-a/j-299ai3c33-a_20221029T020428/logs/failed_testcase_ocs_logs_1667013094/test_add_capacity_ocs_logs/
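For context, the "3 times in a row" re-balance check referenced in the description behaves roughly like the sketch below. This is not the actual ocs-ci code; ceph_rebalance_complete() is a hypothetical stand-in for the real check that inspects `ceph health`/PG states via the rook-ceph toolbox:

```python
# Minimal sketch, assuming a hypothetical ceph_rebalance_complete() helper.
# Not the ocs-ci implementation; it only illustrates the "N consecutive
# successful samples" logic described in the report.
import time


def ceph_rebalance_complete():
    """Hypothetical helper: return True when the cluster currently reports
    the re-balance as done (e.g. HEALTH_OK and all PGs active+clean)."""
    raise NotImplementedError


def wait_for_rebalance(required_in_a_row=3, interval=30, timeout=1800):
    """Return True only after the re-balance looks complete
    `required_in_a_row` consecutive times; a single bad sample
    (e.g. a PG briefly peering) resets the streak."""
    streak = 0
    deadline = time.time() + timeout
    while time.time() < deadline:
        if ceph_rebalance_complete():
            streak += 1
            if streak >= required_in_a_row:
                return True
        else:
            streak = 0  # one WARN/peering sample breaks the streak
        time.sleep(interval)
    return False
```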
@pbalogh Yes, this is normal behavior. When you add new capacity to an existing cluster, remapping occurs and new OSD maps are published; it does not all happen at once, and PGs can move around while the cluster is being reweighted. The health warning does not mean an OSD is down: while a PG is peering/remapped, it shows up in the health status.
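Given that these peering/remapping warnings are expected to be transient after adding capacity, one way a teardown health check could cope is to keep retrying as long as the only active warnings are known-transient PG checks. The sketch below is only illustrative and uses assumed helper names (ceph_health, wait_for_stable_health, TRANSIENT_CHECKS), not the ocs-ci API; it parses the same JSON structure shown in the log snippets above:

```python
# Illustrative sketch only: retry the health check while the only active
# warnings are transient PG checks (assumption: PG_AVAILABILITY/PG_DEGRADED
# are safe to wait out shortly after capacity is added).
import json
import subprocess
import time

TRANSIENT_CHECKS = {"PG_AVAILABILITY", "PG_DEGRADED"}


def ceph_health(cmd=("ceph", "health", "--format", "json")):
    """Run `ceph health --format json` (in OCS this would be executed inside
    the rook-ceph toolbox pod) and return the parsed dict, e.g.
    {'status': 'HEALTH_WARN', 'checks': {...}, 'mutes': []}."""
    return json.loads(subprocess.check_output(cmd))


def wait_for_stable_health(tries=20, delay=30):
    health = None
    for _ in range(tries):
        health = ceph_health()
        if health["status"] == "HEALTH_OK":
            return True
        # Keep waiting only if every active check is a known-transient one.
        if not set(health.get("checks", {})).issubset(TRANSIENT_CHECKS):
            break
        time.sleep(delay)
    raise RuntimeError(f"Ceph health did not settle: {health}")
```

In practice this would run against the toolbox pod and could reuse the existing retry utility already visible in the logs (ocs_ci.utility.retry), rather than a hand-rolled loop.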
Closing