Bug 2138917 - [AWS] HEALTH_WARN Reduced data availability: 1 pg peering after add capacity and re-balance completed
Summary: [AWS] HEALTH_WARN Reduced data availability: 1 pg peering after add capacity ...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Nitzan mordechai
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-10-31 16:54 UTC by Petr Balogh
Modified: 2023-08-09 16:37 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-14 12:57:40 UTC
Embargoed:



Description Petr Balogh 2022-10-31 16:54:47 UTC
Description of problem (please be as detailed as possible and provide log snippets):
Lately we have been seeing a lot of teardown failures in the add_capacity test case.
I recently added a check that has to pass 3 times in a row before the re-balance is considered completed:
https://github.com/red-hat-storage/ocs-ci/pull/6578/files
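For context, a rough sketch of the kind of repeated check that PR adds (illustrative only, not the actual ocs-ci code): re-balance is only declared complete after 3 consecutive passing checks, using "ceph status --format json", whose pgmap "pgs_by_state" list is exactly the structure the log lines below print.

import json
import subprocess
import time

CONSECUTIVE_PASSES = 3   # the check must succeed 3 times in a row
POLL_INTERVAL = 30       # seconds between checks


def pgs_by_state():
    """Return the pgmap 'pgs_by_state' list from 'ceph status --format json'."""
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    return json.loads(out)["pgmap"]["pgs_by_state"]


def rebalance_complete():
    """Illustrative stand-in for the ocs-ci check: all PGs are active+clean."""
    return all(s["state_name"] == "active+clean" for s in pgs_by_state())


def wait_for_rebalance(timeout=1800):
    """Succeed only after CONSECUTIVE_PASSES passing checks in a row."""
    passes = 0
    deadline = time.time() + timeout
    while time.time() < deadline:
        if rebalance_complete():
            passes += 1
            print(f"Re-balance completed! Attempt {passes} of {CONSECUTIVE_PASSES}.")
            if passes >= CONSECUTIVE_PASSES:
                return True
        else:
            passes = 0  # a single failed check resets the streak
        time.sleep(POLL_INTERVAL)
    return False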

Even after this change, I can see it reported as completed 3 times with HEALTH_OK:
2022-10-29 04:14:25  04:14:25 - MainThread - ocs_ci.ocs.cluster - INFO  - Re-balance completed! This is attempt 1 out of 3 repeats. This rebalance check needs to prove it 3 times in row.
2022-10-29 04:15:09  04:15:09 - MainThread - ocs_ci.ocs.cluster - INFO  - Re-balance completed! This is attempt 2 out of 3 repeats. This rebalance check needs to prove it 3 times in row.
2022-10-29 04:16:09  04:16:08 - MainThread - ocs_ci.ocs.cluster - INFO  - Re-balance completed! This is attempt 3 out of 3 repeats. This rebalance check needs to prove it 3 times in row.

I see that even between the attempts the status flips to HEALTH_WARN and back to HEALTH_OK.

See here:
2022-10-29 04:15:39  04:15:39 - MainThread - ocs_ci.ocs.cluster - INFO  - {'status': 'HEALTH_WARN', 'checks': {'PG_AVAILABILITY': {'severity': 'HEALTH_WARN', 'summary': {'message': 'Reduced data availability: 1 pg peering', 'count': 1}, 'muted': False}}, 'mutes': []}
2022-10-29 04:15:39  04:15:39 - MainThread - ocs_ci.ocs.cluster - INFO  - [{'state_name': 'active+clean', 'count': 192}, {'state_name': 'peering', 'count': 1}]
2022-10-29 04:15:54  04:15:54 - MainThread - ocs_ci.ocs.cluster - INFO  - {'status': 'HEALTH_OK', 'checks': {}, 'mutes': []}
2022-10-29 04:15:54  04:15:54 - MainThread - ocs_ci.ocs.cluster - INFO  - [{'state_name': 'active+clean', 'count': 189}, {'state_name': 'peering', 'count': 2}, {'state_name': 'remapped+peering', 'count': 2}]
2022-10-29 04:16:09  04:16:08 - MainThread - ocs_ci.ocs.cluster - INFO  - {'status': 'HEALTH_OK', 'checks': {}, 'mutes': []}
2022-10-29 04:16:09  04:16:08 - MainThread - ocs_ci.ocs.cluster - INFO  - [{'state_name': 'active+clean', 'count': 193}]
2022-10-29 04:16:11  04:16:11 - MainThread - ocs_ci.utility.retry - WARNING  - Ceph cluster health is not OK. Health: HEALTH_WARN Reduced data availability: 1 pg peering
2022-10-29 04:16:11  , Retrying in 30 seconds...


So it's failing with:
2022-10-29 04:36:41  E           ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN Reduced data availability: 1 pg peering


From what I can see, this happens only on AWS and it started roughly 2-3 weeks ago with no change to the test case.

Is this normal behavior, or is something wrong in Ceph or on AWS?


Version of all relevant components (if applicable):
ODF: 4.12.0-82
OCP: 4.12 nightly


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Add capacity
2. Wait for re-balance to complete - ceph health is OK
3. The status then flip-flops between HEALTH_WARN (Reduced data availability: 1 pg peering) and HEALTH_OK (see the sketch after this list).
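A minimal way to observe step 3 from a node or toolbox pod with access to the ceph CLI (an illustrative watcher, not part of the test case): poll "ceph health detail" in JSON form and log every status transition.

import json
import subprocess
import time


def health_status():
    """Parsed 'ceph health detail --format json', the same structure the ocs-ci log prints."""
    out = subprocess.check_output(["ceph", "health", "detail", "--format", "json"])
    return json.loads(out)


def watch_health(interval=15, duration=1200):
    """Log every HEALTH_OK <-> HEALTH_WARN transition during 'duration' seconds."""
    last = None
    end = time.time() + duration
    while time.time() < end:
        health = health_status()
        status = health["status"]
        if status != last:
            checks = ", ".join(health.get("checks", {})) or "none"
            print(f"{time.strftime('%H:%M:%S')} {status} (active checks: {checks})")
            last = status
        time.sleep(interval)


if __name__ == "__main__":
    watch_health()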


Actual results:
HEALTH_WARN Reduced data availability: 1 pg peering

Expected results:
Ceph health should not keep flipping back and forth between HEALTH_OK and HEALTH_WARN after the re-balance has completed.


Additional info:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/5907/
Parent job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-acceptance/299/
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-299ai3c33-a/j-299ai3c33-a_20221029T020428/logs/failed_testcase_ocs_logs_1667013094/test_add_capacity_ocs_logs/

Comment 8 Nitzan mordechai 2022-11-03 06:43:25 UTC
@pbalogh Yes, this is normal behavior. When you add new capacity to an existing cluster, remapping occurs and new osdmaps are published. It does not all happen at once, and PGs can move around while the cluster is being reweighted.
A health warning does not mean that an OSD is down; whenever a PG is peering/remapped we will see it in the health status.
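One way a test could accommodate this behavior (an illustrative sketch, not the ocs-ci fix): treat a PG_AVAILABILITY warning whose only cause is peering/remapped PGs as transient and fail only if it persists past a grace period. The field layout below matches the health JSON shown in the description.

import json
import subprocess
import time

TRANSIENT_KEYWORDS = ("peering", "remapped")  # PG states expected during re-balance


def transient_peering_only(health):
    """True if the only active check is PG_AVAILABILITY caused by peering/remapped PGs."""
    checks = health.get("checks", {})
    if set(checks) != {"PG_AVAILABILITY"}:
        return False
    message = checks["PG_AVAILABILITY"]["summary"]["message"]
    return any(word in message for word in TRANSIENT_KEYWORDS)


def assert_health_ok(grace=300, interval=15):
    """Fail only if the cluster stays unhealthy beyond the grace period."""
    deadline = time.time() + grace
    while True:
        out = subprocess.check_output(["ceph", "health", "detail", "--format", "json"])
        health = json.loads(out)
        if health["status"] == "HEALTH_OK":
            return
        if not transient_peering_only(health) or time.time() > deadline:
            raise RuntimeError(f"Ceph cluster health is not OK. Health: {health['status']}")
        time.sleep(interval)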

Comment 11 Petr Balogh 2022-11-14 12:57:40 UTC
Closing

