Description of problem (please be as detailed as possible and provide log snippets):

Tier1 testcase "tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available" is failing with an AssertionError because results for ceph_pg_* and ceph_rocksdb_submit_transaction_sync are not being collected.

Version of all relevant components (if applicable):
ODF 4.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install an OCP 4.9 cluster
2. Deploy ODF 4.9 along with LSO
3. Execute the ocs-ci test "tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available"

Actual results:

Expected results:

Additional info:
tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available fails with errors like:

05:25:43 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_incomplete
05:25:44 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_degraded
05:25:44 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_backfill_unfound
05:25:45 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_stale
05:25:46 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_rocksdb_submit_transaction_sync
...

which leads to the failure of the test:

>       assert list_of_metrics_without_results == [], msg
E       AssertionError: OCS Monitoring should provide some value(s) for all tested metrics, so that the list of metrics without results is empty.
E       assert ['ceph_pg_inc...overing', ...] == []
E         Left contains 30 more items, first extra item: 'ceph_pg_incomplete'
E         Full diff:
E           [
E         +  ,
E         -  'ceph_pg_incomplete',
E         -  'ceph_pg_degraded',
E         -  'ceph_pg_backfill_unfound',
E         -  'ceph_pg_stale'...
E
E         ...Full output truncated (29 lines hidden), use '-vv' to show

tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py:155: AssertionError
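For reference, the failing check boils down to logic like the following minimal sketch (the function names and endpoint/token handling here are illustrative, not the actual ocs-ci helpers):

import requests

def query_metric(prometheus_url, token, metric):
    # Prometheus instant query API: GET /api/v1/query?query=<metric>
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": metric},
        headers={"Authorization": f"Bearer {token}"},
        verify=False,  # lab clusters often use self-signed certs
    )
    resp.raise_for_status()
    # "result" is an empty list when the query matches no time series
    return resp.json()["data"]["result"]

def metrics_without_results(prometheus_url, token, metrics):
    # Collect every metric whose instant query returns no samples;
    # the test then asserts this list is empty.
    return [m for m in metrics if not query_metric(prometheus_url, token, m)]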
Created attachment 1829744 [details]
log file for the testcase
must-gather logs: https://drive.google.com/file/d/1zQRblSxQx2Z5Rc0BzT4zVH2dwrjLfQLe/view?usp=sharing
ceph health is also HEALTH_OK

[root@rdr-aaruni-syd04-bastion-0 ocs-ci]# oc rsh rook-ceph-tools-f57d97cc6-plxpc
sh-4.4$
sh-4.4$ ceph -s
  cluster:
    id:     ce3a1148-ed7c-45c6-bc7f-870ba9300535
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 14h)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 14h), 3 in (since 4d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 241 pgs
    objects: 17.37k objects, 63 GiB
    usage:   185 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     241 active+clean

  io:
    client: 2.7 KiB/s rd, 19 KiB/s wr, 3 op/s rd, 4 op/s wr

sh-4.4$

Tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_* from the OCP console, but got "No datapoints found".

Is this the expected output when ceph health is HEALTH_OK?
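One way to narrow down whether the gap is in the exporter or in the monitoring stack is to scrape the ceph-mgr prometheus exporter directly. A minimal sketch, assuming the rook-ceph-mgr metrics service lives in the openshift-storage namespace on the default port 9283 (both are assumptions about this deployment):

# First, in another terminal (service name/namespace are assumptions):
#   oc -n openshift-storage port-forward svc/rook-ceph-mgr 9283:9283
import requests

raw = requests.get("http://localhost:9283/metrics").text
for line in raw.splitlines():
    if line.startswith(("ceph_pg_", "ceph_rocksdb_")):
        print(line)
# If the exporter publishes these series, a healthy cluster typically
# shows most ceph_pg_* states with value 0; if they are missing here,
# the problem is at the exporter rather than in OCP monitoring.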
Created attachment 1829746 [details]
Image of the OCP console where the "ceph_pg_forced_recovery" query produced "no datapoints" output
Prashant, did you get a chance to look into this issue?
Hi Mudit,

(In reply to Mudit Agarwal from comment #7)
> Prashant, did you get a chance to look into this issue?

Not yet. We had a discussion in our weekly downstream BZ scrub meeting about BZs reported for ocs-ci testcase failures. We feel these testcase failures are more environment-specific. @vumrao can provide more details on it.
(In reply to Aaruni Aggarwal from comment #5)
> ceph health is also HEALTH_OK
[...]
> Tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_*
> from the OCP console, but got "No datapoints found".
>
> Is this the expected output when ceph health is HEALTH_OK?

Hi Aaruni,

When the cluster is in HEALTH_OK from the output of "ceph -s", it implies that everything is fine from Ceph's perspective. Could you help us understand what the test is trying to achieve and your understanding of what the bug is? At the moment, it seems like something went wrong in the monitoring script.
@nojha, is there any comment that is private? I am not able to see which info is needed from me.
(In reply to Aaruni Aggarwal from comment #10)
> @nojha, is there any comment that is private? I am not able to see which
> info is needed from me.

You should be able to see it now.
(In reply to Neha Ojha from comment #9)
> When the cluster is in HEALTH_OK from the output of "ceph -s", it implies
> that everything is fine from Ceph's perspective. Could you help us
> understand what the test is trying to achieve and your understanding of
> what the bug is? At the moment, it seems like something went wrong in the
> monitoring script.

Hi Neha,

This is the testcase we are running:
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py#L136-L155

It performs Prometheus instant queries against the Ceph cluster. These are the queries the testcase runs:
https://github.com/red-hat-storage/ocs-ci/blob/8d7cbaf6e1b31627ad66add9de680bc1530be612/ocs_ci/ocs/metrics.py#L31-L274

I have also attached the log file here: https://bugzilla.redhat.com/show_bug.cgi?id=2011173#c3. It contains all the queries that are failing.

Some of the queries return "no datapoints found", so I wanted to know whether this is expected behaviour when ceph health is HEALTH_OK.
(In reply to Aaruni Aggarwal from comment #12)
> Some of the queries return "no datapoints found", so I wanted to know
> whether this is expected behaviour when ceph health is HEALTH_OK.

These queries are not direct ceph commands; they are queries that rely on data extracted from the ceph cluster, so you need to figure out how these values are derived. For example, there is nothing called "ceph_pg_incomplete" in Ceph itself. The states of the PGs in a ceph cluster can be seen in the output of "ceph -s" (in the example above, "pgs: 241 active+clean"); see https://docs.ceph.com/en/latest/rados/operations/pg-states/ for more details.
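For anyone mapping these metric names back to Ceph: the per-state PG counts that would back series like ceph_pg_incomplete live in the cluster's pgmap. A quick way to inspect them, as a sketch meant to run inside the toolbox pod where the ceph CLI is available:

import json
import subprocess

# "ceph status --format json" exposes the pgmap, including per-state PG
# counts, e.g. [{"state_name": "active+clean", "count": 241}] on this cluster.
status = json.loads(subprocess.check_output(["ceph", "status", "--format", "json"]))
for entry in status["pgmap"]["pgs_by_state"]:
    print(entry["state_name"], entry["count"])

On a healthy cluster this prints only "active+clean", which is consistent with problem states such as "incomplete" or "degraded" contributing no (or zero-valued) datapoints.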