Bug 2011173 - [IBM P/Z] ocs-ci test related to test_ceph_metrics_available failing due to AssertionError
Summary: [IBM P/Z] ocs-ci test related to test_ceph_metrics_available failing due to AssertionError
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Prashant Dhange
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-06 08:02 UTC by Aaruni Aggarwal
Modified: 2023-08-09 16:37 UTC
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-14 16:18:58 UTC
Embargoed:


Attachments
log file for the testcase (60.50 KB, text/plain)
2021-10-06 08:06 UTC, Aaruni Aggarwal
Attaching the image of OCP console where "ceph_pg_forced_recovery" query produced "no datapoints" output (100.22 KB, image/png)
2021-10-06 08:21 UTC, Aaruni Aggarwal

Description Aaruni Aggarwal 2021-10-06 08:02:49 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The Tier1 test case "tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available" is failing with an AssertionError because results for the ceph_pg_* and ceph_rocksdb_submit_transaction_sync metrics are not being collected.

Version of all relevant components (if applicable):
ODF 4.9 


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an OCP 4.9 cluster.
2. Deploy ODF 4.9 along with LSO.
3. Execute the ocs-ci test (see the sketch after these steps):
"tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available"




Actual results:


Expected results:


Additional info:

Comment 2 Aaruni Aggarwal 2021-10-06 08:05:23 UTC
tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available fails with errors like:

05:25:43 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_incomplete
05:25:44 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_degraded
05:25:44 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_backfill_unfound
05:25:45 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_stale
05:25:46 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_rocksdb_submit_transaction_sync
...
...
...


These errors lead to the failure of the test:

>       assert list_of_metrics_without_results == [], msg
E       AssertionError: OCS Monitoring should provide some value(s) for all tested metrics, so that the list of metrics without results is empty.
E       assert ['ceph_pg_inc...overing', ...] == []
E         Left contains 30 more items, first extra item: 'ceph_pg_incomplete'
E         Full diff:
E           [
E         +  ,
E         -  'ceph_pg_incomplete',
E         -  'ceph_pg_degraded',
E         -  'ceph_pg_backfill_unfound',
E         -  'ceph_pg_stale'...
E         
E         ...Full output truncated (29 lines hidden), use '-vv' to show

tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py:155: AssertionError

Comment 3 Aaruni Aggarwal 2021-10-06 08:06:55 UTC
Created attachment 1829744 [details]
log file for the testcase

Comment 4 Aaruni Aggarwal 2021-10-06 08:11:03 UTC
must-gather logs : https://drive.google.com/file/d/1zQRblSxQx2Z5Rc0BzT4zVH2dwrjLfQLe/view?usp=sharing

Comment 5 Aaruni Aggarwal 2021-10-06 08:15:19 UTC
ceph health is also HEALTH_OK

[root@rdr-aaruni-syd04-bastion-0 ocs-ci]# oc rsh rook-ceph-tools-f57d97cc6-plxpc
sh-4.4$ 
sh-4.4$ ceph -s
  cluster:
    id:     ce3a1148-ed7c-45c6-bc7f-870ba9300535
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 14h)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 14h), 3 in (since 4d)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 241 pgs
    objects: 17.37k objects, 63 GiB
    usage:   185 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     241 active+clean
 
  io:
    client:   2.7 KiB/s rd, 19 KiB/s wr, 3 op/s rd, 4 op/s wr
 
sh-4.4$ 

I tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_* from the OCP console, but got "No datapoints found".

Is this the expected output when ceph health is HEALTH_OK?
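The same instant query can also be issued from the bastion against the cluster monitoring API, to rule out a console-side issue. A minimal sketch, assuming the default openshift-monitoring thanos-querier route and that the logged-in user is allowed to query it:

# Run one of the failing instant queries through the cluster monitoring API.
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://$HOST/api/v1/query" --data-urlencode 'query=ceph_pg_incomplete'

An empty "result" array here corresponds to the "No datapoints found" message in the console.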

Comment 6 Aaruni Aggarwal 2021-10-06 08:21:15 UTC
Created attachment 1829746 [details]
Attaching the image of OCP console where "ceph_pg_forced_recovery" query produced "no datapoints" output

Comment 7 Mudit Agarwal 2021-10-14 16:21:47 UTC
Prashant, did you get a chance to look into this issue?

Comment 8 Prashant Dhange 2021-10-15 04:33:02 UTC
Hi Mudit,

(In reply to Mudit Agarwal from comment #7)
> Prashant, did you get a chance to look into this issue?
Not yet. We had a discussion in our weekly downstream BZ scrub meeting about the BZs reported for ocs-ci test case failures. We feel these test case failures are more environment-specific. @vumrao can provide more details.

Comment 9 Neha Ojha 2021-10-18 23:34:55 UTC
(In reply to Aaruni Aggarwal from comment #5)
> ceph health is also HEALTH_OK
> 
> [root@rdr-aaruni-syd04-bastion-0 ocs-ci]# oc rsh
> rook-ceph-tools-f57d97cc6-plxpc
> sh-4.4$ 
> sh-4.4$ ceph -s
>   cluster:
>     id:     ce3a1148-ed7c-45c6-bc7f-870ba9300535
>     health: HEALTH_OK
>  
>   services:
>     mon: 3 daemons, quorum a,b,c (age 14h)
>     mgr: a(active, since 4d)
>     mds: 1/1 daemons up, 1 hot standby
>     osd: 3 osds: 3 up (since 14h), 3 in (since 4d)
>     rgw: 1 daemon active (1 hosts, 1 zones)
>  
>   data:
>     volumes: 1/1 healthy
>     pools:   13 pools, 241 pgs
>     objects: 17.37k objects, 63 GiB
>     usage:   185 GiB used, 1.3 TiB / 1.5 TiB avail
>     pgs:     241 active+clean
>  
>   io:
>     client:   2.7 KiB/s rd, 19 KiB/s wr, 3 op/s rd, 4 op/s wr
>  
> sh-4.4$ 
> 
> Tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_*
> from OCP console , but got "No datapoints found".
> 
> Is this expected output when ceph health is HEALTH_OK  ?

Hi Aaruni,

When the cluster is in HEALTH_OK from the output of "ceph -s", it implies that everything is fine from Ceph's perspective. Could you help us understand what the test is trying to achieve and your understanding of what the bug is? At the moment, it seems like something went wrong in the monitoring script.

Comment 10 Aaruni Aggarwal 2021-10-19 06:02:09 UTC
@nojha, is there a comment that is private? I am not able to see what info is needed from me.

Comment 11 Neha Ojha 2021-10-19 22:19:08 UTC
(In reply to Aaruni Aggarwal from comment #10)
> @nojha , Is there any comment which is private , because I am not
> able to see which info is needed from me ?

You should be able to see it now.

Comment 12 Aaruni Aggarwal 2021-10-20 06:43:22 UTC
(In reply to Neha Ojha from comment #9)
> When the cluster is in HEALTH_OK from the output of "ceph -s", it implies
> that everything is fine from Ceph's perspective. Could you help us
> understand what the test is trying to achieve and your understanding of what
> the bug is? At the moment, it seems like something went wrong in the
> monitoring script.


Hi Neha,

This is the test case we are running: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py#L136-L155
It performs Prometheus instant queries against the Ceph cluster.
Here are the queries the test case performs: https://github.com/red-hat-storage/ocs-ci/blob/8d7cbaf6e1b31627ad66add9de680bc1530be612/ocs_ci/ocs/metrics.py#L31-L274

I have also attached the log file here: https://bugzilla.redhat.com/show_bug.cgi?id=2011173#c3, which contains all the queries that are failing.

Some of the queries return "no datapoints found", so I wanted to know whether this is the expected behaviour when ceph health is HEALTH_OK.
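As a way to narrow this down, the exporter can also be scraped directly, bypassing the OCP monitoring stack, to see whether the ceph_pg_* series are emitted at all. A rough sketch; the service name and port are assumptions here, based on Rook's usual rook-ceph-mgr metrics service on 9283, and curl being available in the toolbox image:

# Scrape the ceph-mgr Prometheus exporter from the toolbox pod and look for
# the pg state series (service name/port assumed to be the Rook defaults).
oc -n openshift-storage rsh rook-ceph-tools-f57d97cc6-plxpc \
    curl -s http://rook-ceph-mgr.openshift-storage.svc:9283/metrics | grep '^ceph_pg_'

If the series appear here but not in the console, the gap is on the scraping/monitoring side; if they are missing here as well, the exporter simply does not emit them on this cluster.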

Comment 13 Neha Ojha 2021-10-27 16:59:48 UTC
(In reply to Aaruni Aggarwal from comment #12)
> This is the test case we are running: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py#L136-L155
> It performs Prometheus instant queries against the Ceph cluster.
> Here are the queries the test case performs: https://github.com/red-hat-storage/ocs-ci/blob/8d7cbaf6e1b31627ad66add9de680bc1530be612/ocs_ci/ocs/metrics.py#L31-L274
> 
> Some of the queries return "no datapoints found", so I wanted to know
> whether this is the expected behaviour when ceph health is HEALTH_OK.

These queries are not direct ceph commands; they seem to be queries that rely on ceph commands. You need to figure out how these are extracted from the ceph cluster. For example:

There is nothing called "ceph_pg_incomplete" in ceph. The states of the PGs in a ceph cluster can be seen from the output of "ceph -s", in the above example "pgs:     241 active+clean"; see https://docs.ceph.com/en/latest/rados/operations/pg-states/ for more details.
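For completeness, the PG states can also be inspected directly from the toolbox pod with standard ceph commands (illustrative only, not part of the test):

# Summary of PG states across the cluster.
ceph pg stat

# List only the PGs currently in a given state, e.g. incomplete or degraded.
# On a healthy cluster such as the one above these return nothing.
ceph pg ls incomplete
ceph pg ls degraded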

