Bug 2011173
| Summary: | [IBM P/Z] ocs-ci test related to test_ceph_metrics_available failing due to AssertionError | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aaruni Aggarwal <aaaggarw> |
| Component: | ceph | Assignee: | Prashant Dhange <pdhange> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Elad <ebenahar> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | bniver, madam, mbukatov, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdhange, vumrao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | ppc64le | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-14 16:18:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Aaruni Aggarwal, 2021-10-06 08:02:49 UTC)
The test tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available is failing with errors like:
05:25:43 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_incomplete
05:25:44 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_degraded
05:25:44 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_backfill_unfound
05:25:45 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_stale
05:25:46 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_rocksdb_submit_transaction_sync
...
...
...
This leads to the failure of the test:
> assert list_of_metrics_without_results == [], msg
E AssertionError: OCS Monitoring should provide some value(s) for all tested metrics, so that the list of metrics without results is empty.
E assert ['ceph_pg_inc...overing', ...] == []
E Left contains 30 more items, first extra item: 'ceph_pg_incomplete'
E Full diff:
E [
E + ,
E - 'ceph_pg_incomplete',
E - 'ceph_pg_degraded',
E - 'ceph_pg_backfill_unfound',
E - 'ceph_pg_stale'...
E
E ...Full output truncated (29 lines hidden), use '-vv' to show
tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py:155: AssertionError
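The assertion above comes from a loop that issues one Prometheus instant query per expected Ceph metric and collects the names that return no samples. Below is a minimal sketch of that logic; it is a hypothetical reconstruction, and `prometheus_query` is a placeholder for whatever helper the test uses, not the actual ocs-ci API.

```python
# Simplified, hypothetical reconstruction of the failing check; the real code
# lives in tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py
# and ocs_ci/ocs/metrics.py. `prometheus_query` stands in for whatever helper
# performs a Prometheus instant query and returns the list of result samples.

ceph_metrics = [
    "ceph_pg_incomplete",
    "ceph_pg_degraded",
    "ceph_pg_backfill_unfound",
    "ceph_pg_stale",
    "ceph_rocksdb_submit_transaction_sync",
    # ... the real list in ocs_ci/ocs/metrics.py is much longer
]


def check_ceph_metrics(prometheus_query):
    # Collect every metric name whose instant query returned no samples.
    metrics_without_results = [
        metric for metric in ceph_metrics if not prometheus_query(metric)
    ]
    msg = (
        "OCS Monitoring should provide some value(s) for all tested metrics, "
        "so that the list of metrics without results is empty."
    )
    # In this bug, ~30 ceph_pg_* and ceph_rocksdb_* names remain in the list,
    # so the assertion fails even though `ceph -s` reports HEALTH_OK.
    assert metrics_without_results == [], msg
```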
Created attachment 1829744 [details]
log file for the testcase
must-gather logs: https://drive.google.com/file/d/1zQRblSxQx2Z5Rc0BzT4zVH2dwrjLfQLe/view?usp=sharing

ceph health is also HEALTH_OK:
[root@rdr-aaruni-syd04-bastion-0 ocs-ci]# oc rsh rook-ceph-tools-f57d97cc6-plxpc
sh-4.4$
sh-4.4$ ceph -s
cluster:
id: ce3a1148-ed7c-45c6-bc7f-870ba9300535
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 14h)
mgr: a(active, since 4d)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 14h), 3 in (since 4d)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 13 pools, 241 pgs
objects: 17.37k objects, 63 GiB
usage: 185 GiB used, 1.3 TiB / 1.5 TiB avail
pgs: 241 active+clean
io:
client: 2.7 KiB/s rd, 19 KiB/s wr, 3 op/s rd, 4 op/s wr
sh-4.4$
I tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_* from the OCP console, but got "No datapoints found".
Is this expected output when ceph health is HEALTH_OK?
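The same instant queries can also be run outside the OCP console against the cluster Prometheus API. A minimal sketch follows, assuming the default thanos-querier route in openshift-monitoring and a token from `oc whoami -t`; the route host and token below are placeholders, not values verified on this cluster.

```python
# Hedged sketch: query the OpenShift cluster Prometheus through the
# thanos-querier route and report whether a metric has any datapoints.
# Assumes the route host from:
#   oc -n openshift-monitoring get route thanos-querier
# and a bearer token from `oc whoami -t` (usual defaults, not confirmed here).
import requests

THANOS_HOST = "thanos-querier-openshift-monitoring.apps.example.com"  # placeholder
TOKEN = "sha256~..."  # output of `oc whoami -t`


def instant_query(metric):
    resp = requests.get(
        f"https://{THANOS_HOST}/api/v1/query",
        params={"query": metric},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # lab cluster with self-signed certs
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


for m in ["ceph_pg_incomplete", "ceph_pg_degraded", "ceph_health_status"]:
    samples = instant_query(m)
    print(m, "->", samples if samples else "no datapoints")
```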
Created attachment 1829746 [details]
Attaching an image of the OCP console where the "ceph_pg_forced_recovery" query produced "no datapoints" output.
Prashant, did you get a chance to look into this issue?

Hi Mudit,

(In reply to Mudit Agarwal from comment #7)
> Prashant, did you get a chance to look into this issue?

Not yet. We had a discussion in our weekly downstream BZ scrub meeting on the BZs reported for ocs-ci test case failures. We feel these test case failures are more environment-specific. @vumrao can provide more details on it.

(In reply to Aaruni Aggarwal from comment #5)
> ceph health is also HEALTH_OK
> [...]
> Tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_*
> from OCP console, but got "No datapoints found".
>
> Is this expected output when ceph health is HEALTH_OK?

Hi Aaruni,

When the cluster is in HEALTH_OK from the output of "ceph -s", it implies that everything is fine from Ceph's perspective. Could you help us understand what the test is trying to achieve and your understanding of what the bug is? At the moment, it seems like something went wrong in the monitoring script.

@nojha, is there any comment which is private? I am not able to see which info is needed from me.

(In reply to Aaruni Aggarwal from comment #10)
> @nojha, is there any comment which is private? I am not able to see which
> info is needed from me.

You should be able to see it now.

(In reply to Neha Ojha from comment #9)
> When the cluster is in HEALTH_OK from the output of "ceph -s", it implies
> that everything is fine from Ceph's perspective. Could you help us
> understand what the test is trying to achieve and your understanding of
> what the bug is? At the moment, it seems like something went wrong in the
> monitoring script.

Hi Neha,

This is the test case which we are running:
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py#L136-L155

It performs Prometheus instant queries against the Ceph cluster.
Here are the queries which the test case is performing:
https://github.com/red-hat-storage/ocs-ci/blob/8d7cbaf6e1b31627ad66add9de680bc1530be612/ocs_ci/ocs/metrics.py#L31-L274

I have also attached the log file here: https://bugzilla.redhat.com/show_bug.cgi?id=2011173#c3, which contains all the queries that are failing.

Some of the queries are returning `no datapoints found`, so I wanted to know if this is expected behaviour when ceph health is HEALTH_OK.

(In reply to Aaruni Aggarwal from comment #12)
> This is the test case which we are running:
> https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py#L136-L155
> It performs Prometheus instant queries against the Ceph cluster.
> [...]
> Some of the queries are returning `no datapoints found`, so I wanted to know
> if this is expected behaviour when ceph health is HEALTH_OK.

These queries are not direct ceph commands; they seem to be queries that rely on ceph commands. You need to figure out how these are extracted from the ceph cluster. For example, there is nothing called "ceph_pg_incomplete" in ceph. The states of the PGs in a ceph cluster can be seen from the output of "ceph -s", in the above example "pgs: 241 active+clean"; see https://docs.ceph.com/en/latest/rados/operations/pg-states/ for more details.
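The ceph_pg_* and ceph_rocksdb_* series are typically generated by the ceph-mgr prometheus module rather than by a ceph CLI command, so one way to narrow this down is to read the mgr exporter output directly and see whether the series are present at the source. The sketch below assumes the usual Rook/ODF setup (rook-ceph-mgr service in the openshift-storage namespace exposing the module on port 9283, made reachable with `oc port-forward`); the service name, namespace, and port are assumptions, not values verified on this cluster.

```python
# Hedged sketch: list the ceph_pg_* series exposed by the ceph-mgr
# prometheus module. Assumes the exporter has been made reachable on
# localhost first, e.g. with:
#   oc -n openshift-storage port-forward svc/rook-ceph-mgr 9283:9283
# (common Rook defaults, not confirmed for this cluster).
import requests

metrics_text = requests.get("http://localhost:9283/metrics").text

# Keep only the metric names of sample lines that start with ceph_pg_,
# stripping any label set and the sample value.
pg_series = sorted(
    {
        line.split("{")[0].split()[0]
        for line in metrics_text.splitlines()
        if line.startswith("ceph_pg_")
    }
)
print("\n".join(pg_series) or "no ceph_pg_* series exported by the mgr module")
```

If the series show up here but return no datapoints in the console, the problem is on the scraping/monitoring side; if they are missing here as well, the mgr module on this build simply does not export them, which would make the test's expectation wrong for this environment.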