Bug 2011173

Summary: [IBM P/Z] ocs-ci test related to test_ceph_metrics_available failing due to AssertionError
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Aaruni Aggarwal <aaaggarw>
Component: ceph
Assignee: Prashant Dhange <pdhange>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Elad <ebenahar>
Severity: medium
Priority: unspecified
Version: 4.9
CC: bniver, madam, mbukatov, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdhange, vumrao
Target Milestone: ---
Target Release: ---
Hardware: ppc64le
OS: Linux
Last Closed: 2022-02-14 16:18:58 UTC
Type: Bug
Attachments:
  log file for the testcase
  image of the OCP console where the "ceph_pg_forced_recovery" query produced "no datapoints" output

Description Aaruni Aggarwal 2021-10-06 08:02:49 UTC
Description of problem (please be as detailed as possible and provide log snippets):

The Tier1 test case "tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available" is failing with an AssertionError because results for the ceph_pg_* and ceph_rocksdb_submit_transaction_sync metrics are not being collected.

Version of all relevant components (if applicable):
ODF 4.9 


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an OCP 4.9 cluster.
2. Deploy ODF 4.9 along with LSO.
3. Execute the ocs-ci test
"tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available"
(see the pytest invocation sketch under Additional info below).




Actual results:


Expected results:


Additional info:
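
A minimal sketch of invoking just this test from Python, assuming it is run from the root of an already configured ocs-ci checkout with its dependencies installed (ocs-ci is normally driven through its own runner, so treat this only as an illustration):

# Minimal sketch, assuming an ocs-ci checkout whose cluster configuration
# is already set up for this environment.
import pytest

pytest.main([
    "tests/manage/monitoring/prometheusmetrics/"
    "test_monitoring_defaults.py::test_ceph_metrics_available",
    "-vv",  # show the full list of metrics without results on failure
])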

Comment 2 Aaruni Aggarwal 2021-10-06 08:05:23 UTC
tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py::test_ceph_metrics_available fails with errors like:

05:25:43 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_incomplete
05:25:44 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_degraded
05:25:44 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_backfill_unfound
05:25:45 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_pg_stale
05:25:46 - MainThread - ocs_ci.ocs.metrics - ERROR - failed to get results for ceph_rocksdb_submit_transaction_sync
...
...
...


These errors lead to the failure of the test:

>       assert list_of_metrics_without_results == [], msg
E       AssertionError: OCS Monitoring should provide some value(s) for all tested metrics, so that the list of metrics without results is empty.
E       assert ['ceph_pg_inc...overing', ...] == []
E         Left contains 30 more items, first extra item: 'ceph_pg_incomplete'
E         Full diff:
E           [
E         +  ,
E         -  'ceph_pg_incomplete',
E         -  'ceph_pg_degraded',
E         -  'ceph_pg_backfill_unfound',
E         -  'ceph_pg_stale'...
E         
E         ...Full output truncated (29 lines hidden), use '-vv' to show

tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py:155: AssertionError

Comment 3 Aaruni Aggarwal 2021-10-06 08:06:55 UTC
Created attachment 1829744 [details]
log file for the testcase

Comment 4 Aaruni Aggarwal 2021-10-06 08:11:03 UTC
must-gather logs : https://drive.google.com/file/d/1zQRblSxQx2Z5Rc0BzT4zVH2dwrjLfQLe/view?usp=sharing

Comment 5 Aaruni Aggarwal 2021-10-06 08:15:19 UTC
ceph health is also HEALTH_OK

[root@rdr-aaruni-syd04-bastion-0 ocs-ci]# oc rsh rook-ceph-tools-f57d97cc6-plxpc
sh-4.4$ 
sh-4.4$ ceph -s
  cluster:
    id:     ce3a1148-ed7c-45c6-bc7f-870ba9300535
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 14h)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 14h), 3 in (since 4d)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 241 pgs
    objects: 17.37k objects, 63 GiB
    usage:   185 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     241 active+clean
 
  io:
    client:   2.7 KiB/s rd, 19 KiB/s wr, 3 op/s rd, 4 op/s wr
 
sh-4.4$ 

Tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_* from the OCP console, but got "No datapoints found".

Is this the expected output when ceph health is HEALTH_OK?
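
One way to narrow this down might be to check whether the ceph-mgr prometheus module itself exposes the ceph_pg_* series, independently of the OCP monitoring stack. A rough sketch, assuming the module is enabled on its default port 9283 and that the script runs somewhere with network access to the active mgr (the service name below is a placeholder):

import requests

# Placeholder endpoint: adjust to the reachable rook-ceph-mgr metrics service/pod.
MGR_METRICS_URL = "http://rook-ceph-mgr:9283/metrics"

resp = requests.get(MGR_METRICS_URL, timeout=10)
resp.raise_for_status()

# Print every ceph_pg_* series the mgr currently exports, with its value.
for line in resp.text.splitlines():
    if line.startswith("ceph_pg_"):
        print(line)

If the series are present here (typically with a value of 0 for states no PG is currently in) but absent from the console, the gap would be on the scraping/monitoring side rather than in Ceph itself.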

Comment 6 Aaruni Aggarwal 2021-10-06 08:21:15 UTC
Created attachment 1829746 [details]
Attaching the image of the OCP console where the "ceph_pg_forced_recovery" query produced "no datapoints" output

Comment 7 Mudit Agarwal 2021-10-14 16:21:47 UTC
Prashant, did you get a chance to look into this issue?

Comment 8 Prashant Dhange 2021-10-15 04:33:02 UTC
Hi Mudit,

(In reply to Mudit Agarwal from comment #7)
> Prashant, did you get a chance to look into this issue?
Not yet. We had a discussion in our weekly downstream BZ scrub meeting about BZs reported for ocs-ci test case failures. We feel these test case failures are environment specific. @vumrao can provide more details.

Comment 9 Neha Ojha 2021-10-18 23:34:55 UTC
(In reply to Aaruni Aggarwal from comment #5)
> ceph health is also HEALTH_OK
> 
> [root@rdr-aaruni-syd04-bastion-0 ocs-ci]# oc rsh
> rook-ceph-tools-f57d97cc6-plxpc
> sh-4.4$ 
> sh-4.4$ ceph -s
>   cluster:
>     id:     ce3a1148-ed7c-45c6-bc7f-870ba9300535
>     health: HEALTH_OK
>  
>   services:
>     mon: 3 daemons, quorum a,b,c (age 14h)
>     mgr: a(active, since 4d)
>     mds: 1/1 daemons up, 1 hot standby
>     osd: 3 osds: 3 up (since 14h), 3 in (since 4d)
>     rgw: 1 daemon active (1 hosts, 1 zones)
>  
>   data:
>     volumes: 1/1 healthy
>     pools:   13 pools, 241 pgs
>     objects: 17.37k objects, 63 GiB
>     usage:   185 GiB used, 1.3 TiB / 1.5 TiB avail
>     pgs:     241 active+clean
>  
>   io:
>     client:   2.7 KiB/s rd, 19 KiB/s wr, 3 op/s rd, 4 op/s wr
>  
> sh-4.4$ 
> 
> Tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_*
> from OCP console , but got "No datapoints found".
> 
> Is this expected output when ceph health is HEALTH_OK  ?

Hi Aaruni,

When the cluster is in HEALTH_OK from the output of "ceph -s", it implies that everything is fine from Ceph's perspective. Could you help us understand what the test is trying to achieve and your understanding of what the bug is? At the moment, it seems like something went wrong in the monitoring script.

Comment 10 Aaruni Aggarwal 2021-10-19 06:02:09 UTC
@nojha, is there a private comment? I am not able to see what info is needed from me.

Comment 11 Neha Ojha 2021-10-19 22:19:08 UTC
(In reply to Aaruni Aggarwal from comment #10)
> @nojha , Is there any comment which is private , because I am not
> able to see which info is needed from me ?

You should be able to see it now.

Comment 12 Aaruni Aggarwal 2021-10-20 06:43:22 UTC
(In reply to Neha Ojha from comment #9)
> (In reply to Aaruni Aggarwal from comment #5)
> > ceph health is also HEALTH_OK
> > 
> > [root@rdr-aaruni-syd04-bastion-0 ocs-ci]# oc rsh
> > rook-ceph-tools-f57d97cc6-plxpc
> > sh-4.4$ 
> > sh-4.4$ ceph -s
> >   cluster:
> >     id:     ce3a1148-ed7c-45c6-bc7f-870ba9300535
> >     health: HEALTH_OK
> >  
> >   services:
> >     mon: 3 daemons, quorum a,b,c (age 14h)
> >     mgr: a(active, since 4d)
> >     mds: 1/1 daemons up, 1 hot standby
> >     osd: 3 osds: 3 up (since 14h), 3 in (since 4d)
> >     rgw: 1 daemon active (1 hosts, 1 zones)
> >  
> >   data:
> >     volumes: 1/1 healthy
> >     pools:   13 pools, 241 pgs
> >     objects: 17.37k objects, 63 GiB
> >     usage:   185 GiB used, 1.3 TiB / 1.5 TiB avail
> >     pgs:     241 active+clean
> >  
> >   io:
> >     client:   2.7 KiB/s rd, 19 KiB/s wr, 3 op/s rd, 4 op/s wr
> >  
> > sh-4.4$ 
> > 
> > Tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_*
> > from OCP console , but got "No datapoints found".
> > 
> > Is this expected output when ceph health is HEALTH_OK  ?
> 
> Hi Aaruni,
> 
> When the cluster is in HEALTH_OK from the output of "ceph -s", it implies
> that everything is fine from Ceph's perspective. Could you help us
> understand what the test is trying to achieve and your understanding of what
> the bug is? At the moment, it seems like something went wrong in the
> monitoring script.


Hi Neha,
This is the test case we are running: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py#L136-L155
It performs Prometheus instant queries against the Ceph cluster.
Here are the queries the test case runs: https://github.com/red-hat-storage/ocs-ci/blob/8d7cbaf6e1b31627ad66add9de680bc1530be612/ocs_ci/ocs/metrics.py#L31-L274

I have also attached the log file here: https://bugzilla.redhat.com/show_bug.cgi?id=2011173#c3, which contains all the queries that are failing.

Some of the queries return "no datapoints found", so I wanted to know whether this is expected behaviour when ceph health is HEALTH_OK.
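
For context, the core of that test boils down to one Prometheus instant query per metric name, collecting the names that come back empty. A simplified standalone sketch of the same check, assuming a bearer token with access to the cluster monitoring API and a placeholder query route (this is not the actual ocs-ci code, which goes through its own Prometheus helper):

import requests

# Placeholders: substitute the real monitoring/Thanos route and a valid token.
PROM_URL = "https://thanos-querier-openshift-monitoring.apps.example.com"
TOKEN = "REDACTED"

metrics = [
    "ceph_pg_incomplete",
    "ceph_pg_degraded",
    "ceph_rocksdb_submit_transaction_sync",
]
without_results = []

for name in metrics:
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": name},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # many test clusters use a self-signed router certificate
    )
    resp.raise_for_status()
    if not resp.json()["data"]["result"]:
        without_results.append(name)

# Mirrors the assertion at test_monitoring_defaults.py:155
assert without_results == [], f"metrics without results: {without_results}"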

Comment 13 Neha Ojha 2021-10-27 16:59:48 UTC
(In reply to Aaruni Aggarwal from comment #12)
> (In reply to Neha Ojha from comment #9)
> > (In reply to Aaruni Aggarwal from comment #5)
> > > ceph health is also HEALTH_OK
> > > 
> > > [root@rdr-aaruni-syd04-bastion-0 ocs-ci]# oc rsh
> > > rook-ceph-tools-f57d97cc6-plxpc
> > > sh-4.4$ 
> > > sh-4.4$ ceph -s
> > >   cluster:
> > >     id:     ce3a1148-ed7c-45c6-bc7f-870ba9300535
> > >     health: HEALTH_OK
> > >  
> > >   services:
> > >     mon: 3 daemons, quorum a,b,c (age 14h)
> > >     mgr: a(active, since 4d)
> > >     mds: 1/1 daemons up, 1 hot standby
> > >     osd: 3 osds: 3 up (since 14h), 3 in (since 4d)
> > >     rgw: 1 daemon active (1 hosts, 1 zones)
> > >  
> > >   data:
> > >     volumes: 1/1 healthy
> > >     pools:   13 pools, 241 pgs
> > >     objects: 17.37k objects, 63 GiB
> > >     usage:   185 GiB used, 1.3 TiB / 1.5 TiB avail
> > >     pgs:     241 active+clean
> > >  
> > >   io:
> > >     client:   2.7 KiB/s rd, 19 KiB/s wr, 3 op/s rd, 4 op/s wr
> > >  
> > > sh-4.4$ 
> > > 
> > > Tried running the queries related to ceph_pg_* and ceph_rocksdb_submit_*
> > > from OCP console , but got "No datapoints found".
> > > 
> > > Is this expected output when ceph health is HEALTH_OK  ?
> > 
> > Hi Aaruni,
> > 
> > When the cluster is in HEALTH_OK from the output of "ceph -s", it implies
> > that everything is fine from Ceph's perspective. Could you help us
> > understand what the test is trying to achieve and your understanding of what
> > the bug is? At the moment, it seems like something went wrong in the
> > monitoring script.
> 
> 
> Hii Neha 
> This is the testcase which we are running :
> https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/
> monitoring/prometheusmetrics/test_monitoring_defaults.py#L136-L155 
> It is performing prometheus instant queries on Cephcluster . 
> Here are the queries which the testcase is performing :
> https://github.com/red-hat-storage/ocs-ci/blob/
> 8d7cbaf6e1b31627ad66add9de680bc1530be612/ocs_ci/ocs/metrics.py#L31-L274 
> 
> I have also attached the log file here:
> https://bugzilla.redhat.com/show_bug.cgi?id=2011173#c3 , which contains all
> the queries that are failing. 
> 
> Some of the queries are returning `no datapoints found`, so wanted to know
> if this is expected behaviour when ceph health is HEALTH_OK

These queries are not direct ceph commands; they seem to be queries that rely on ceph commands. You need to figure out how these are extracted from the ceph cluster. For example:

There is nothing called "ceph_pg_incomplete" in ceph. The states of the PGs in a ceph cluster can be seen in the output of "ceph -s", in the above example "pgs:     241 active+clean". See https://docs.ceph.com/en/latest/rados/operations/pg-states/ for more details.
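
To make the mapping concrete: the ceph_pg_* series appear to come from the ceph-mgr prometheus module, which turns each PG state it knows about into a gauge named ceph_pg_<state> holding the number of PGs currently in that state. A rough illustration of that naming (a sketch of the idea using a few states from the pg-states documentation, not the module's actual code):

# Hypothetical illustration of how PG state names map onto ceph_pg_* metric
# names, using the "ceph -s" output above where all 241 PGs are active+clean.
pg_states = ["active", "clean", "incomplete", "degraded", "stale", "backfill_unfound"]
pg_state_counts = {"active": 241, "clean": 241}

for state in pg_states:
    print(f"ceph_pg_{state} {pg_state_counts.get(state, 0)}")

On a healthy cluster, series such as ceph_pg_incomplete would then be expected to report 0 rather than be missing entirely (assuming the exporter emits every known state), so whether they show up in the console depends on the export and scraping path, not on Ceph reporting a problem.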