Bug 2292208 - Missing Data Points in Ceph Health Status Metrics when one of the monitors is downscaled
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Divyansh Kamboj
QA Contact: Harish NV Rao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-06-13 13:49 UTC by Daniel Osypenko
Modified: 2024-09-24 12:35 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-06-25 10:18:53 UTC
Embargoed:




Links:
Red Hat Issue Tracker OCSBZM-8516 (last updated 2024-09-04 11:40:06 UTC)

Description Daniel Osypenko 2024-06-13 13:49:39 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

This issue occurs repeatedly when one Ceph monitor pod is temporarily scaled down to 0 replicas, leaving 2 monitor pods running.

There are missing data points in the Ceph health status metrics retrieved from the Prometheus query_range API. The data points should be recorded at 15-second intervals, yet there is a gap of 45 seconds between the samples at 1715325580.828 and 1715325625.828, which means two data points are missing within this range.
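
For illustration, a minimal sketch of that arithmetic (a hypothetical helper, not part of ocs-ci) that lists the sample timestamps expected between two consecutive recorded samples but absent from the result quoted below:

def missing_timestamps(prev_ts, next_ts, step=15):
    """Timestamps expected between two consecutive recorded samples."""
    missing = []
    ts = prev_ts + step
    # anything further than half a step from the next recorded sample is a hole
    while ts < next_ts - step / 2:
        missing.append(ts)
        ts += step
    return missing

# Gap observed in the query_range result below:
print(missing_timestamps(1715325580.828, 1715325625.828, step=15))
# -> two missing samples, at ~1715325595.828 and ~1715325610.828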

2024-05-10 03:06:55,827 - MainThread - INFO - tests.functional.monitoring.conftest.measure_stop_ceph_mon.144 - Monitors to stop: ['rook-ceph-mon-c']
2024-05-10 03:06:55,827 - MainThread - INFO - tests.functional.monitoring.conftest.measure_stop_ceph_mon.145 - Monitors left to run: ['rook-ceph-mon-a', 'rook-ceph-mon-b']
...
2024-05-10 03:21:09,670 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.get.431 - params={'query': 'ceph_health_status', 'start': 1715324815.827674, 'end': 1715325663.9413576, 'step': 15}
2024-05-10 03:21:09,672 - MainThread - DEBUG - urllib3.connectionpool._new_conn.1019 - Starting new HTTPS connection (1): prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443
2024-05-10 03:21:09,693 - MainThread - DEBUG - urllib3.connectionpool._make_request.474 - https://prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443 "GET /api/v1/query_range?query=ceph_health_status&start=1715324815.827674&end=1715325663.9413576&step=15 HTTP/1.1" 200 439
2024-05-10 03:21:09,705 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.validate_status.304 - content value: {'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric': {'__name__': 'ceph_health_status', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.128.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7ffddcf45f-nkb8z', 'service': 'rook-ceph-mgr'}, 'values': [[1715324815.828, '0'], [1715324830.828, '0'], [1715324845.828, '0'], [1715324860.828, '0'], [1715324875.828, '1'], [1715324890.828, '1'], [1715324905.828, '1'], [1715324920.828, '1'], [1715324935.828, '1'], [1715324950.828, '1'], [1715324965.828, '1'], [1715324980.828, '1'], [1715324995.828, '1'], [1715325010.828, '1'], [1715325025.828, '1'], [1715325040.828, '1'], [1715325055.828, '1'], [1715325070.828, '1'], [1715325085.828, '1'], [1715325100.828, '1'], [1715325115.828, '1'], [1715325130.828, '1'], [1715325145.828, '1'], [1715325160.828, '1'], [1715325175.828, '1'], [1715325190.828, '1'], [1715325205.828, '1'], [1715325220.828, '1'], [1715325235.828, '1'], [1715325250.828, '1'], [1715325265.828, '1'], [1715325280.828, '1'], [1715325295.828, '1'], [1715325310.828, '1'], [1715325325.828, '1'], [1715325340.828, '1'], [1715325355.828, '1'], [1715325370.828, '1'], [1715325385.828, '1'], [1715325400.828, '1'], [1715325415.828, '1'], [1715325430.828, '1'], [1715325445.828, '1'], [1715325460.828, '1'], [1715325475.828, '1'], [1715325490.828, '1'], [1715325505.828, '1'], [1715325520.828, '1'], [1715325535.828, '1'], [1715325550.828, '1'], [1715325565.828, '1'], [1715325580.828, '1'], [1715325625.828, '0'], [1715325640.828, '0'], [1715325655.828, '0']]}]}}
2024-05-10 03:21:09,705 - MainThread - ERROR - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.query_range.597 - there are holes in prometheus data: result size is 55 while expected sample size is 56 +-1
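
For orientation, the expected sample size in the error above follows from the requested range and step. A hedged sketch of that arithmetic (an approximation only, not the actual ocs_ci/utility/prometheus.py check):

# Parameters taken from the query_range request logged above.
start, end, step = 1715324815.827674, 1715325663.9413576, 15

# A series sampled every `step` seconds over [start, end] holds roughly
# (end - start) / step points; the exact rounding in ocs-ci may differ,
# hence the "+-1" tolerance quoted in the error message.
expected = int((end - start) / step)   # -> 56
returned = 55                          # length of 'values' in the response above
print(f"expected about {expected} samples, got {returned}")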



Test test_monitoring_shows_mon_down is failing on a variety of platforms. 

Version of all relevant components (if applicable):

Cluster version 4.16.0-0.nightly-2024-05-08-222442
ODF Operator 4.16.0-95
Test run name OCS4-16-Downstream-OCP4-16-VSPHERE6-IPI-1AZ-RHCOS-VSAN-3M-3W-tier3

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
-

If this is a regression, please provide more details to justify this:
This is a regression; only 1 out of 10 test runs passes.

Steps to Reproduce:
1. Scale one Ceph monitor deployment (e.g. rook-ceph-mon-c) down to 0 replicas, leaving two monitors running.
2. Make a ranged Prometheus request similar to: GET https://prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443/api/v1/query_range?query=ceph_health_status&start=1715324815.827674&end=1715325663.9413576&step=15 (see the reproduction sketch below).
3. Inspect the returned series for gaps between consecutive timestamps.
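
A reproduction sketch, assuming an oc session logged in to the cluster; the Prometheus route host is copied from the log above and the monitor deployment name matches the one stopped in this run, so adjust both for your environment:

import subprocess
import time

import requests

NAMESPACE = "openshift-storage"
MON_DEPLOYMENT = "rook-ceph-mon-c"  # monitor stopped in the failing run
PROM_HOST = "prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com"

# 1. Scale one monitor down to 0 replicas, leaving two monitors running.
subprocess.run(
    ["oc", "scale", f"deployment/{MON_DEPLOYMENT}", "-n", NAMESPACE, "--replicas=0"],
    check=True,
)

start = time.time()
time.sleep(14 * 60)  # keep the monitor down roughly as long as the failing test did
end = time.time()

# Bring the monitor back before inspecting the metrics.
subprocess.run(
    ["oc", "scale", f"deployment/{MON_DEPLOYMENT}", "-n", NAMESPACE, "--replicas=1"],
    check=True,
)

# 2. Query the Prometheus query_range API for ceph_health_status at a 15 s step.
token = subprocess.run(
    ["oc", "whoami", "-t"], check=True, capture_output=True, text=True
).stdout.strip()
resp = requests.get(
    f"https://{PROM_HOST}/api/v1/query_range",
    params={"query": "ceph_health_status", "start": start, "end": end, "step": 15},
    headers={"Authorization": f"Bearer {token}"},
    verify=False,
)
resp.raise_for_status()

# 3. Check the returned series for gaps between consecutive timestamps.
values = resp.json()["data"]["result"][0]["values"]
for (prev, _), (cur, _) in zip(values, values[1:]):
    if cur - prev > 15 * 1.5:
        print(f"hole: {cur - prev:.0f}s gap between {prev} and {cur}")

When the issue reproduces, the final loop prints a single gap of roughly 45 seconds, matching the hole described above.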


Actual results:
data holes detected

Expected results:
no data holes detected

Additional info:
test logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/ocs-ci-logs-1715321166/by_outcome/failed/tests/functional/monitoring/prometheus/metrics/test_monitoring_negative.py/test_monitoring_shows_mon_down/logs

must-gather logs OCS http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/failed_testcase_ocs_logs_1715321166/test_monitoring_shows_mon_down_ocs_logs/j-075vi1cs33-t3/ocs_must_gather/

must-gather logs OCP http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/testcases_1715321166/j-075vi1cs33-t3/ocp_must_gather/

