Bug 2088438

Summary: HEALTH_WARN on freshly deployed cluster
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Vijay Avuthu <vavuthu>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED NOTABUG
QA Contact: Neha Berry <nberry>
Severity: low
Priority: unspecified
Version: 4.11
CC: jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2022-05-25 13:40:26 UTC

Description Vijay Avuthu 2022-05-19 12:54:11 UTC
Description of problem (please be as detailed as possible and provide log snippets):

HEALTH_WARN on a freshly deployed cluster; it recovered after a single retry


Version of all relevant components (if applicable):

openshift installer (4.11.0-0.nightly-2022-05-18-010528)
ocs-registry:4.11.0-69


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
Yes

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Intermittent

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install ODF using ocs-ci (VSPHERE UPI KMS VAULT)
2. During teardown, check Ceph health (essentially the loop sketched below)
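For reference, the teardown health check is essentially the following loop (a minimal sketch in Python, not the actual ocs-ci helper; the toolbox pod name is taken from the job log below and the 30-second retry interval from the warning message):

import subprocess
import time

TOOLS_POD = "rook-ceph-tools-65cb756c9f-pvgvs"  # toolbox pod name as seen in the job log

def ceph_health(namespace="openshift-storage"):
    # Run `ceph health` inside the rook-ceph-tools pod and return its output.
    cmd = ["oc", "-n", namespace, "exec", TOOLS_POD, "--", "ceph", "health"]
    return subprocess.check_output(cmd, text=True).strip()

def wait_for_health_ok(retries=20, delay=30):
    # Poll `ceph health`, retrying every `delay` seconds, the way the automation does.
    for _ in range(retries):
        health = ceph_health()
        if health.startswith("HEALTH_OK"):
            return health
        print(f"Ceph cluster health is not OK. Health: {health}, retrying in {delay} seconds...")
        time.sleep(delay)
    raise TimeoutError("Ceph cluster did not reach HEALTH_OK")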


Actual results:

2022-05-18 14:28:17  08:58:17 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-65cb756c9f-pvgvs -- ceph health
2022-05-18 14:28:17  08:58:17 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN Reduced data availability: 1 pg peering
2022-05-18 14:28:17  , Retrying in 30 seconds...

NOTE: it recovered on the next retry

Expected results:

Ceph health should be OK on a newly deployed cluster

Additional info:

mon logs:

2022-05-18T08:51:46.667191912Z cluster 2022-05-18T08:51:45.649373+0000 mon.a (mon.0) 532 : cluster [DBG] osdmap e69: 3 total, 3 up, 3 in
2022-05-18T08:51:47.659511530Z debug 2022-05-18T08:51:47.658+0000 7fd31aeb4700  1 mon.a@0(leader).osd e70 do_prune osdmap full prune enabled
2022-05-18T08:51:47.661193028Z debug 2022-05-18T08:51:47.659+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)

2022-05-18T08:51:47.677647782Z audit 2022-05-18T08:51:46.664099+0000 mon.a (mon.0) 535 : audit [INF] from='mgr.24140 ' entity='mgr.a' cmd=[{"prefix": "osd pool set", "pool": "ocs-storagecluster-cephobjectstore.rgw.control", "var": "pg_num_actual", "val": "28"}]: dispatch
2022-05-18T08:51:47.677666665Z cluster 2022-05-18T08:51:47.661095+0000 mon.a (mon.0) 536 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)


2022-05-18T08:51:52.725615426Z debug 2022-05-18T08:51:52.725+0000 7fd31aeb4700  1 mon.a@0(leader).osd e75 do_prune osdmap full prune enabled
2022-05-18T08:51:52.727486757Z debug 2022-05-18T08:51:52.726+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:51:52.727486757Z debug 2022-05-18T08:51:52.726+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Cluster is now healthy


job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/4250//consoleFull

must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-006vukv21cs33-t4a/j-006vukv21cs33-t4a_20220518T081531/logs/failed_testcase_ocs_logs_1652862006/test_deployment_ocs_logs/


> Since we didn't see this issue on previous deployments (it is intermittent), we would like to know what causes the health check to fail on a freshly deployed cluster, and whether this is expected/OK to see on newly deployed clusters.

Comment 2 Nitin Goyal 2022-05-19 14:04:21 UTC
Moving it to rook

Comment 3 Travis Nielsen 2022-05-19 18:17:59 UTC
According to your log, there is a 5-second window where the health warning is raised, then it was cleared. Is that 5-second warning really a concern? During cluster creation I thought Ceph would always have intermittent health warnings while the services are still initializing. Are you saying this is a new behavior? After OSDs start and pools are created, it always takes a few seconds for the PGs to settle. 


2022-05-18T08:51:47.661193028Z debug 2022-05-18T08:51:47.659+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)

...
2022-05-18T08:51:52.727486757Z debug 2022-05-18T08:51:52.726+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)

Comment 4 Vijay Avuthu 2022-05-20 01:35:42 UTC
(In reply to Travis Nielsen from comment #3)
> According to your log, there is a 5-second window where the health warning
> is raised, then it was cleared. Is that 5-second warning really a concern?
> During cluster creation I thought Ceph would always have intermittent health
> warnings while the services are still initializing. Are you saying this is a
> new behavior? After OSDs start and pools are created, it always takes a few
> seconds for the PGs to settle. 
> 
> 2022-05-18T08:51:47.661193028Z debug 2022-05-18T08:51:47.659+0000
> 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed:
> Reduced data availability: 1 pg peering (PG_AVAILABILITY)
> 
> ...
> 2022-05-18T08:51:52.727486757Z debug 2022-05-18T08:51:52.726+0000
> 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared:
> PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)

We didn't see health warnings in the teardown phase of the deployment test in previous jobs.

From the logs, this behaviour was flipping for a few minutes (must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-006vukv21cs33-t4a/j-006vukv21cs33-t4a_20220518T081531/logs/failed_testcase_ocs_logs_1652862006/test_deployment_ocs_logs/)

$ egrep -i "Health check failed|Health check cleared" namespaces/openshift-storage/pods/rook-ceph-mon-a-79cb4c96c9-qpj9d/mon/mon/logs/current.log
2022-05-18T08:47:33.461978070Z debug 2022-05-18T08:47:33.460+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: mon is allowing insecure global_id reclaim (AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED)
2022-05-18T08:47:33.517419111Z cluster 2022-05-18T08:47:33.461956+0000 mon.a (mon.0) 7 : cluster [WRN] Health check failed: mon is allowing insecure global_id reclaim (AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED)
2022-05-18T08:49:16.620033685Z debug 2022-05-18T08:49:16.618+0000 7fd3186af700  0 log_channel(cluster) log [WRN] : Health check failed: OSD count 0 < osd_pool_default_size 3 (TOO_FEW_OSDS)
2022-05-18T08:49:16.677372330Z cluster 2022-05-18T08:49:16.619798+0000 mon.a (mon.0) 123 : cluster [WRN] Health check failed: OSD count 0 < osd_pool_default_size 3 (TOO_FEW_OSDS)
2022-05-18T08:49:23.409971275Z debug 2022-05-18T08:49:23.408+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: TOO_FEW_OSDS (was: OSD count 2 < osd_pool_default_size 3)
2022-05-18T08:49:23.936210262Z cluster 2022-05-18T08:49:23.409861+0000 mon.a (mon.0) 138 : cluster [INF] Health check cleared: TOO_FEW_OSDS (was: OSD count 2 < osd_pool_default_size 3)
2022-05-18T08:49:41.551818517Z debug 2022-05-18T08:49:41.550+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED (was: mon is allowing insecure global_id reclaim)
2022-05-18T08:49:42.566440963Z cluster 2022-05-18T08:49:41.551672+0000 mon.a (mon.0) 153 : cluster [INF] Health check cleared: AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED (was: mon is allowing insecure global_id reclaim)
2022-05-18T08:50:03.787853301Z debug 2022-05-18T08:50:03.786+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)
2022-05-18T08:50:04.468124830Z cluster 2022-05-18T08:50:03.787821+0000 mon.a (mon.0) 366 : cluster [WRN] Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)
2022-05-18T08:50:04.794200267Z debug 2022-05-18T08:50:04.792+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons available)
2022-05-18T08:50:05.473909268Z cluster 2022-05-18T08:50:04.794102+0000 mon.a (mon.0) 370 : cluster [INF] Health check cleared: MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons available)
2022-05-18T08:51:47.661193028Z debug 2022-05-18T08:51:47.659+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:51:47.677666665Z cluster 2022-05-18T08:51:47.661095+0000 mon.a (mon.0) 536 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:51:52.727486757Z debug 2022-05-18T08:51:52.726+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:51:53.745340867Z cluster 2022-05-18T08:51:52.727427+0000 mon.a (mon.0) 553 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:51:59.589481247Z debug 2022-05-18T08:51:59.588+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 2 pgs inactive, 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:51:59.821351025Z cluster 2022-05-18T08:51:59.589411+0000 mon.a (mon.0) 576 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:52:05.643842508Z debug 2022-05-18T08:52:05.642+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 1 pg peering)
2022-05-18T08:52:05.853442204Z cluster 2022-05-18T08:52:05.643744+0000 mon.a (mon.0) 591 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 1 pg peering)
2022-05-18T08:52:23.183596689Z debug 2022-05-18T08:52:23.182+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:52:24.201905792Z cluster 2022-05-18T08:52:23.183493+0000 mon.a (mon.0) 643 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:52:29.600365870Z debug 2022-05-18T08:52:29.598+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:52:29.721380540Z cluster 2022-05-18T08:52:29.600295+0000 mon.a (mon.0) 657 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:52:43.641645475Z debug 2022-05-18T08:52:43.640+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg inactive, 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:52:43.767932952Z cluster 2022-05-18T08:52:43.641572+0000 mon.a (mon.0) 693 : cluster [WRN] Health check failed: Reduced data availability: 1 pg inactive, 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:52:49.608663964Z debug 2022-05-18T08:52:49.607+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 2 pgs peering)
2022-05-18T08:52:50.624597521Z cluster 2022-05-18T08:52:49.608498+0000 mon.a (mon.0) 711 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 2 pgs peering)
2022-05-18T08:52:59.612711976Z debug 2022-05-18T08:52:59.611+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg inactive, 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:52:59.855596523Z cluster 2022-05-18T08:52:59.612467+0000 mon.a (mon.0) 751 : cluster [WRN] Health check failed: Reduced data availability: 1 pg inactive, 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:53:04.910549587Z debug 2022-05-18T08:53:04.909+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 2 pgs peering)
2022-05-18T08:53:05.936069148Z cluster 2022-05-18T08:53:04.910519+0000 mon.a (mon.0) 764 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 2 pgs peering)
2022-05-18T08:53:33.671035694Z debug 2022-05-18T08:53:33.669+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:53:34.689902582Z cluster 2022-05-18T08:53:33.670996+0000 mon.a (mon.0) 822 : cluster [WRN] Health check failed: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:53:36.716237937Z debug 2022-05-18T08:53:36.714+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 1 pg peering)
2022-05-18T08:53:37.725569809Z cluster 2022-05-18T08:53:36.716219+0000 mon.a (mon.0) 830 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 1 pg peering)
2022-05-18T08:53:59.633768400Z debug 2022-05-18T08:53:59.632+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:54:00.004776863Z cluster 2022-05-18T08:53:59.633663+0000 mon.a (mon.0) 893 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:54:03.629429068Z debug 2022-05-18T08:54:03.628+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs peering)
2022-05-18T08:54:04.056796020Z cluster 2022-05-18T08:54:03.629327+0000 mon.a (mon.0) 905 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs peering)
2022-05-18T08:54:19.644829980Z debug 2022-05-18T08:54:19.643+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:54:20.300936542Z cluster 2022-05-18T08:54:19.644733+0000 mon.a (mon.0) 944 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2022-05-18T08:54:23.702815621Z debug 2022-05-18T08:54:23.701+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs peering)
2022-05-18T08:54:24.722381253Z cluster 2022-05-18T08:54:23.702731+0000 mon.a (mon.0) 954 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs peering)
2022-05-18T08:54:39.649372386Z debug 2022-05-18T08:54:39.647+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:54:39.959143901Z cluster 2022-05-18T08:54:39.649257+0000 mon.a (mon.0) 1029 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:54:43.061661339Z debug 2022-05-18T08:54:43.060+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:54:44.074344205Z cluster 2022-05-18T08:54:43.061513+0000 mon.a (mon.0) 1041 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:54:53.198857535Z debug 2022-05-18T08:54:53.197+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:54:54.208340926Z cluster 2022-05-18T08:54:53.198833+0000 mon.a (mon.0) 1074 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:54:57.246042929Z debug 2022-05-18T08:54:57.245+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:54:58.257316740Z cluster 2022-05-18T08:54:57.245980+0000 mon.a (mon.0) 1086 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:56:23.132776959Z debug 2022-05-18T08:56:23.131+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:56:24.149706747Z cluster 2022-05-18T08:56:23.132741+0000 mon.a (mon.0) 1259 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:56:27.662701501Z debug 2022-05-18T08:56:27.661+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:56:28.197837715Z cluster 2022-05-18T08:56:27.662662+0000 mon.a (mon.0) 1268 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:57:02.969662630Z debug 2022-05-18T08:57:02.968+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:57:03.979222132Z cluster 2022-05-18T08:57:02.969627+0000 mon.a (mon.0) 1333 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:57:07.015448466Z debug 2022-05-18T08:57:07.014+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:57:08.033629473Z cluster 2022-05-18T08:57:07.015366+0000 mon.a (mon.0) 1341 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:57:59.735302814Z debug 2022-05-18T08:57:59.733+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:58:00.093250241Z cluster 2022-05-18T08:57:59.735244+0000 mon.a (mon.0) 1441 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:58:02.768527912Z debug 2022-05-18T08:58:02.767+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:58:03.777360277Z cluster 2022-05-18T08:58:02.768486+0000 mon.a (mon.0) 1447 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:58:12.779607330Z debug 2022-05-18T08:58:12.778+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:58:13.789396631Z cluster 2022-05-18T08:58:12.779575+0000 mon.a (mon.0) 1465 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:58:19.743389818Z debug 2022-05-18T08:58:19.742+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:58:20.768977588Z cluster 2022-05-18T08:58:19.743352+0000 mon.a (mon.0) 1474 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:59:12.832246598Z debug 2022-05-18T08:59:12.831+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:59:13.841800994Z cluster 2022-05-18T08:59:12.832210+0000 mon.a (mon.0) 1570 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:59:16.877674347Z debug 2022-05-18T08:59:16.876+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:59:17.887540448Z cluster 2022-05-18T08:59:16.877625+0000 mon.a (mon.0) 1578 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:59:42.803776865Z debug 2022-05-18T08:59:42.803+0000 7fd31aeb4700  0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:59:43.814063002Z cluster 2022-05-18T08:59:42.803755+0000 mon.a (mon.0) 1632 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2022-05-18T08:59:49.774397313Z debug 2022-05-18T08:59:49.772+0000 7fd31aeb4700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2022-05-18T08:59:49.898388559Z cluster 2022-05-18T08:59:49.774263+0000 mon.a (mon.0) 1649 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)

Comment 5 Travis Nielsen 2022-05-23 18:41:27 UTC
These health warnings all look normal during cluster creation, or for example during pool creation when the PGs take a bit to settle. It all looks expected from Rook's perspective.

Are you expecting no health warnings? Or what is your expectation? Do you want the warnings to be automatically suppressed by ceph for some time period? Rook is just passing on the health warnings, nothing rook can do about these. Please clarify your expectations, and then it seems this needs to move to the ceph component, but I suspect it will be closed as by design. Health warnings are a good thing to show accurate status for the ceph cluster.

Comment 6 Vijay Avuthu 2022-05-24 08:30:41 UTC
(In reply to Travis Nielsen from comment #5)
> These health warnings all look normal during cluster creation, or for
> example during pool creation when the PGs take a bit to settle. It all looks
> expected from Rook's perspective.
> 
> Are you expecting no health warnings? Or what is your expectation? Do you
> want the warnings to be automatically suppressed by ceph for some time
> period? Rook is just passing on the health warnings, nothing rook can do
> about these. Please clarify your expectations, and then it seems this needs
> to move to the ceph component, but I suspect it will be closed as by design.
> Health warnings are a good thing to show accurate status for the ceph
> cluster.

We do expect health warnings right after cluster deployment, and we have multiple places in the automation where we check and wait for Ceph health to become OK (to settle down).

e.g.:

2022-05-18 14:19:52  08:49:51 - MainThread - ocs_ci.ocs.cluster - INFO - PVC ocs-deviceset-0-data-0h57cs is in Bound state
2022-05-18 14:19:52  08:49:51 - MainThread - ocs_ci.ocs.cluster - INFO - PVC ocs-deviceset-1-data-0z7v8v is in Bound state
2022-05-18 14:19:52  08:49:51 - MainThread - ocs_ci.ocs.cluster - INFO - PVC ocs-deviceset-2-data-044dwh is in Bound state
2022-05-18 14:19:52  08:49:51 - MainThread - ocs_ci.ocs.cluster - INFO - PVC rook-ceph-mon-a is in Bound state
2022-05-18 14:19:52  08:49:51 - MainThread - ocs_ci.ocs.cluster - INFO - PVC rook-ceph-mon-b is in Bound state
2022-05-18 14:19:52  08:49:51 - MainThread - ocs_ci.ocs.cluster - INFO - PVC rook-ceph-mon-c is in Bound state
2022-05-18 14:19:52  08:49:52 - MainThread - ocs_ci.ocs.cluster - INFO - Validating all mon pods have PVC
2022-05-18 14:19:52  08:49:52 - MainThread - ocs_ci.ocs.cluster - INFO - Validating all osd pods have PVC

2022-05-18 14:22:32  08:52:32 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-65cb756c9f-pvgvs -- ceph health
2022-05-18 14:22:32  08:52:32 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK


2022-05-18 14:26:39  08:56:39 - MainThread - ocs_ci.ocs.resources.storage_cluster - INFO - Verifying ceph health
2022-05-18 14:26:41  08:56:40 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK


2022-05-18 14:27:17  08:57:17 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-65cb756c9f-pvgvs -- ceph health
2022-05-18 14:27:17  08:57:17 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.


2022-05-18 14:28:16  08:58:16 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-65cb756c9f-pvgvs -- ceph health
2022-05-18 14:28:17  08:58:16 - MainThread - tests.conftest - INFO - Ceph health check failed at teardown

> If this is expected, what is a good amount of time to wait for Ceph health to settle down (without the flipping behaviour) on new deployments, so that we can include that wait time in the automation?

Comment 7 Travis Nielsen 2022-05-24 19:00:56 UTC
> If this is expected, what is a good amount of time to wait for Ceph health to settle down (without the flipping behaviour) on new deployments, so that we can include that wait time in the automation?

The wait time will depend a lot on the test environment, how big the cluster is, how many pools or other resources are being added, and so on. With all of these factors there is not a specific recommended time to wait for the cluster to settle. Perhaps you can watch for certain resources to be created such as all the mon, osd, rgw, mds, or other pods, then wait for a few minutes after that.
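One possible way to implement that suggestion in the automation (a hedged sketch, not an existing ocs-ci API; the app labels, grace period, and stability window below are assumptions) is to wait for the Ceph daemon pods to be Running, allow a grace period, and then require several consecutive HEALTH_OK polls before treating the cluster as settled:

import json
import subprocess
import time

NAMESPACE = "openshift-storage"
# Assumed rook app labels for the daemon pods to watch.
DAEMON_LABELS = ["rook-ceph-mon", "rook-ceph-osd", "rook-ceph-mds", "rook-ceph-rgw"]

def pods_running(app_label):
    # True if at least one pod with this app label exists and all of them are Running.
    out = subprocess.check_output(
        ["oc", "-n", NAMESPACE, "get", "pods", "-l", f"app={app_label}", "-o", "json"],
        text=True,
    )
    items = json.loads(out)["items"]
    return bool(items) and all(p["status"]["phase"] == "Running" for p in items)

def ceph_health(tools_pod):
    # Run `ceph health` inside the rook-ceph-tools pod and return its output.
    cmd = ["oc", "-n", NAMESPACE, "exec", tools_pod, "--", "ceph", "health"]
    return subprocess.check_output(cmd, text=True).strip()

def wait_for_settled_cluster(tools_pod, grace=180, stable_checks=5, poll=30):
    # 1. Wait until all expected daemon pods are Running.
    while not all(pods_running(label) for label in DAEMON_LABELS):
        time.sleep(poll)
    # 2. Give the PGs a few minutes to peer after the daemons are up.
    time.sleep(grace)
    # 3. Require several consecutive HEALTH_OK results so that a brief
    #    PG_AVAILABILITY flap does not count as "settled".
    ok_streak = 0
    while ok_streak < stable_checks:
        ok_streak = ok_streak + 1 if ceph_health(tools_pod).startswith("HEALTH_OK") else 0
        time.sleep(poll)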

Comment 8 Vijay Avuthu 2022-05-25 09:04:05 UTC
(In reply to Travis Nielsen from comment #7)
> > If this is expected, what is a good amount of time to wait for Ceph health to settle down (without the flipping behaviour) on new deployments, so that we can include that wait time in the automation?
> 
> The wait time will depend a lot on the test environment, how big the cluster
> is, how many pools or other resources are being added, and so on. With all
> of these factors there is not a specific recommended time to wait for the
> cluster to settle. Perhaps you can watch for certain resources to be created
> such as all the mon, osd, rgw, mds, or other pods, then wait for a few
> minutes after that.

Thanks @tnielsen. We can close this bug.