Bug 2138917

Summary: [AWS] HEALTH_WARN Reduced data availability: 1 pg peering after add capacity and re-balance completed
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Petr Balogh <pbalogh>
Component: ceph
Assignee: Nitzan mordechai <nmordech>
ceph sub component: RADOS
QA Contact: Elad <ebenahar>
Status: CLOSED NOTABUG
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: bniver, madam, mkasturi, muagarwa, nmordech, ocs-bugs, odf-bz-bot, pdhiran
Version: 4.12
Keywords: Regression
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-14 12:57:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Petr Balogh 2022-10-31 16:54:47 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Lately we have been seeing a lot of teardown failures in the add_capacity test case.
I recently added a check that requires the re-balance completion check to pass 3 times before considering the re-balance complete:
https://github.com/red-hat-storage/ocs-ci/pull/6578/files
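
For illustration only, the "3 times in a row" logic from that PR works roughly like the minimal sketch below. This is not the ocs-ci code; `check` is a hypothetical callable standing in for whatever function reports the re-balance state:

import time

def wait_for_rebalance(check, repeats=3, interval=60, timeout=1800):
    """Return True once check() succeeds `repeats` times in a row."""
    consecutive = 0
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            consecutive += 1
            print(f"Re-balance completed! This is attempt {consecutive} out of {repeats} repeats.")
            if consecutive >= repeats:
                return True
        else:
            consecutive = 0  # any failed check resets the streak
        time.sleep(interval)
    return False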

Even after this change, I can see the check complete 3 times with HEALTH_OK:
2022-10-29 04:14:25  04:14:25 - MainThread - ocs_ci.ocs.cluster - INFO  - Re-balance completed! This is attempt 1 out of 3 repeats. This rebalance check needs to prove it 3 times in row.
2022-10-29 04:15:09  04:15:09 - MainThread - ocs_ci.ocs.cluster - INFO  - Re-balance completed! This is attempt 2 out of 3 repeats. This rebalance check needs to prove it 3 times in row.
2022-10-29 04:16:09  04:16:08 - MainThread - ocs_ci.ocs.cluster - INFO  - Re-balance completed! This is attempt 3 out of 3 repeats. This rebalance check needs to prove it 3 times in row.

I can see that even between the attempts the status changes to HEALTH_WARN and then back to HEALTH_OK.

See here:
2022-10-29 04:15:39  04:15:39 - MainThread - ocs_ci.ocs.cluster - INFO  - {'status': 'HEALTH_WARN', 'checks': {'PG_AVAILABILITY': {'severity': 'HEALTH_WARN', 'summary': {'message': 'Reduced data availability: 1 pg peering', 'count': 1}, 'muted': False}}, 'mutes': []}
2022-10-29 04:15:39  04:15:39 - MainThread - ocs_ci.ocs.cluster - INFO  - [{'state_name': 'active+clean', 'count': 192}, {'state_name': 'peering', 'count': 1}]
2022-10-29 04:15:54  04:15:54 - MainThread - ocs_ci.ocs.cluster - INFO  - {'status': 'HEALTH_OK', 'checks': {}, 'mutes': []}
2022-10-29 04:15:54  04:15:54 - MainThread - ocs_ci.ocs.cluster - INFO  - [{'state_name': 'active+clean', 'count': 189}, {'state_name': 'peering', 'count': 2}, {'state_name': 'remapped+peering', 'count': 2}]
2022-10-29 04:16:09  04:16:08 - MainThread - ocs_ci.ocs.cluster - INFO  - {'status': 'HEALTH_OK', 'checks': {}, 'mutes': []}
2022-10-29 04:16:09  04:16:08 - MainThread - ocs_ci.ocs.cluster - INFO  - [{'state_name': 'active+clean', 'count': 193}]
2022-10-29 04:16:11  04:16:11 - MainThread - ocs_ci.utility.retry - WARNING  - Ceph cluster health is not OK. Health: HEALTH_WARN Reduced data availability: 1 pg peering
2022-10-29 04:16:11  , Retrying in 30 seconds...
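
The dictionaries in the log above can be reproduced by querying Ceph directly. A minimal sketch, assuming the ceph CLI is reachable (in ODF that typically means running inside the rook-ceph toolbox pod):

import json
import subprocess

def ceph_json(*args):
    """Run a ceph command and parse its JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)

health = ceph_json("health")    # e.g. {'status': 'HEALTH_OK', 'checks': {}, 'mutes': []}
status = ceph_json("status")
pgs_by_state = status["pgmap"].get("pgs_by_state", [])
# e.g. [{'state_name': 'active+clean', 'count': 192}, {'state_name': 'peering', 'count': 1}]
print(health["status"], pgs_by_state)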


So it's failing with:
2022-10-29 04:36:41  E           ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN Reduced data availability: 1 pg peering


From what I can see, this happens only on AWS and started roughly 2-3 weeks ago, with no changes to the test case.

Is this normal behavior, or is something wrong in Ceph or in AWS?


Version of all relevant components (if applicable):
ODF: 4.12.0-82
OCP: 4.12 nightly


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Add capacity
2. Wait for re-balance to complete - ceph health is OK
3. The cluster status then flip-flops again between HEALTH_WARN (Reduced data availability: 1 pg peering) and HEALTH_OK.


Actual results:
HEALTH_WARN Reduced data availability: 1 pg peering

Expected results:
Ceph health should not change back and forth between HEALTH_OK and HEALTH_WARN after the re-balance completes.


Additional info:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/5907/
Parent job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-acceptance/299/
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-299ai3c33-a/j-299ai3c33-a_20221029T020428/logs/failed_testcase_ocs_logs_1667013094/test_add_capacity_ocs_logs/

Comment 8 Nitzan mordechai 2022-11-03 06:43:25 UTC
@pbalogh Yes, this is normal behavior. When you add new capacity to an existing cluster, remapping occurs and new OSD maps are published. This does not happen all at once; PGs can move around while the cluster is being reweighted.
A health warning does not mean that an OSD is down; while a PG is peering/remapped, it will show up in the health status.
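
Since the warnings are transient, one option for the test is to only fail if the cluster does not return to HEALTH_OK and stay there for a settle window. A minimal sketch, where `get_health_status` is a hypothetical helper returning the current status string ('HEALTH_OK', 'HEALTH_WARN', ...):

import time

def wait_for_stable_health(get_health_status, settle=180, poll=15, timeout=1800):
    """Return True once health has stayed HEALTH_OK for `settle` seconds."""
    deadline = time.time() + timeout
    ok_since = None
    while time.time() < deadline:
        if get_health_status() == "HEALTH_OK":
            if ok_since is None:
                ok_since = time.time()
            if time.time() - ok_since >= settle:
                return True
        else:
            ok_since = None  # a transient warning resets the settle timer
        time.sleep(poll)
    return False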

Comment 11 Petr Balogh 2022-11-14 12:57:40 UTC
Closing