1850947 – enabling RBD stats collection breaks ceph metrics endpoint

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	Ceph-Mgr Plugins
Sub Component:
Version:	4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.2
Assignee:	Boris Ranto
QA Contact:	Sunil Angadi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1779336

Reported:	2020-06-25 08:46 UTC by umanga
Modified:	2021-01-12 14:56 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-01-12 14:55:59 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1779336	0	unspecified	CLOSED	OCS Monitoring is missing ceph_rbd_* metrics	2023-08-09 16:37:41 UTC
Red Hat Product Errata	RHSA-2021:0081	0	None	None	None	2021-01-12 14:56:57 UTC

Description umanga 2020-06-25 08:46:08 UTC

Description of problem:

When the "rbd_stats_pools" list is empty
```
sh-4.4# ceph config get mgr mgr/prometheus/rbd_stats_pools
```

and I curl the ceph-mgr metrics endpoint, I get the following output:
```
$ curl -XGET 10.107.123.127:9283/metrics

# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1.0
# HELP ceph_mon_quorum_status Monitors in quorum
# TYPE ceph_mon_quorum_status gauge
ceph_mon_quorum_status{ceph_daemon="mon.a"} 1.0
ceph_mon_quorum_status{ceph_daemon="mon.b"} 1.0
ceph_mon_quorum_status{ceph_daemon="mon.c"} 1.0
# HELP ceph_fs_metadata FS Metadata
# TYPE ceph_fs_metadata untyped
# HELP ceph_mds_metadata MDS Metadata
# TYPE ceph_mds_metadata untyped
# HELP ceph_mon_metadata MON Metadata
# TYPE ceph_mon_metadata untyped
ceph_mon_metadata{ceph_daemon="mon.a",hostname="minikube",public_addr="10.104.211.218",rank="0",ceph_version="ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)"} 1.0
ceph_mon_metadata{ceph_daemon="mon.b",hostname="minikube",public_addr="10.96.251.169",rank="1",ceph_version="ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)"} 1.0
ceph_mon_metadata{ceph_daemon="mon.c",hostname="minikube",public_addr="10.109.161.142",rank="2",ceph_version="ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)"} 1.0
```

But when I set "rbd_stats_pools" to some value
```
sh-4.4# ceph config set mgr mgr/prometheus/rbd_stats_pools replicapool
sh-4.4# ceph config get mgr mgr/prometheus/rbd_stats_pools
replicapool
```

and curl the ceph-mgr metrics endpoint, I hit the following error:
```
$ curl -XGET 10.107.123.127:9283/metrics
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
    <title>500 Internal Server Error</title>
    <style type="text/css">
    #powered_by {
        margin-top: 20px;
        border-top: 2px solid black;
        font-style: italic;
    }

    #traceback {
        color: red;
    }
    </style>
</head>
    <body>
        <h2>500 Internal Server Error</h2>
        <p>The server encountered an unexpected condition which prevented it from fulfilling the request.</p>
        <pre id="traceback">Traceback (most recent call last):
  File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 638, in respond
    self._do_respond(path_info)
  File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 697, in _do_respond
    response.body = self.handler()
  File "/lib/python3.6/site-packages/cherrypy/lib/encoding.py", line 219, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1047, in metrics
    return self._metrics(instance)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1062, in _metrics
    instance.collect_cache = instance.collect()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 965, in collect
    self.get_rbd_stats()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 726, in get_rbd_stats
    'rbd_stats_pools_refresh_interval', 300)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
</pre>
    <div id="powered_by">
      <span>
        Powered by <a href="http://www.cherrypy.org">CherryPy 18.4.0</a>
      </span>
    </div>
    </body>
</html>
```

This error disappears as soon as I reset "rbd_stats_pools" to an empty list.
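
For scripted checks, the two responses above can be told apart without reading them by eye. This is a minimal sketch that assumes only the outputs shown in this report: a healthy scrape is plain text made of `# HELP` / `# TYPE` comments and samples, and the failure is an HTML page titled "500 Internal Server Error". `classify_metrics_body` is a hypothetical helper name, not part of Ceph or Prometheus.

```python
def classify_metrics_body(body):
    """Roughly classify a /metrics response body (hypothetical helper)."""
    text = body.lstrip()
    if not text:
        return "empty"
    # The failing response in this report is an HTML error page.
    if text.startswith("<!DOCTYPE") or text.lower().startswith("<html"):
        if "500 Internal Server Error" in text:
            return "server-error"
        return "html"
    # Anything else is assumed to be Prometheus text exposition output.
    return "metrics"


good = "# HELP ceph_health_status Cluster health status\nceph_health_status 1.0\n"
bad = "<!DOCTYPE html PUBLIC\n<html><title>500 Internal Server Error</title></html>"
print(classify_metrics_body(good), classify_metrics_body(bad))
```

The body would come from whatever HTTP client is in use; the helper only inspects the text and makes no request itself.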

Version-Release number of selected component (if applicable):

ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)


How reproducible:
100%

Steps to Reproduce:
1. Refer to description

Actual results:
The metrics endpoint returns a 500 Internal Server Error with the traceback shown above.

Expected results:
I should get the exported metrics list, as in any other case.

Additional info:
Tried on Rook-Ceph

Comment 3 umanga 2020-06-29 12:36:46 UTC

Looked a little more into it.

Turns out we hit this error only when we do `ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval <ANY_INTERVAL>`.

If we do `ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval ""`, the error disappears and stats collection starts again.
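
This is consistent with the option value reaching the module as a string once it has been set explicitly, while the unset case falls back to the integer default, which would explain the `'int' and 'str'` TypeError in the traceback. A minimal Python sketch of that failure mode and a defensive cast follows. `get_option`, `refresh_interval`, `needs_refresh`, and the plain dict used as a config store are hypothetical stand-ins for illustration, not the actual ceph-mgr code, and treating an empty string as "unset" is an assumption based on the behaviour described here.

```python
import time

DEFAULT_REFRESH_INTERVAL = 300  # seconds; the default visible in the traceback


def get_option(store, name, default):
    """Hypothetical config lookup: explicitly set values come back as strings."""
    return store.get(name, default)


def refresh_interval(store):
    """Return the interval as an int, using the default when unset or empty."""
    raw = get_option(store, "rbd_stats_pools_refresh_interval",
                     DEFAULT_REFRESH_INTERVAL)
    if raw is None or raw == "":
        return DEFAULT_REFRESH_INTERVAL
    return int(raw)


def needs_refresh(last_refresh, store, now=None):
    """True once the refresh interval has elapsed since last_refresh."""
    now = time.time() if now is None else now
    return last_refresh + refresh_interval(store) < now


unset = {}                                               # option never set
explicit = {"rbd_stats_pools_refresh_interval": "600"}   # option set via CLI

# Without the cast, adding the raw string to a number fails as in the report.
uncast_error = None
try:
    0 + get_option(explicit, "rbd_stats_pools_refresh_interval",
                   DEFAULT_REFRESH_INTERVAL)
except TypeError as exc:
    uncast_error = str(exc)

print(uncast_error)
print(refresh_interval(unset), refresh_interval(explicit))
```

Casting at the point of use keeps the arithmetic safe regardless of how the value was stored; declaring the option with an integer type at the source would be the alternative design.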

Comment 6 Boris Ranto 2020-07-14 16:50:06 UTC

@Umanga: Thanks for digging deeper into this. This is actually an easy fix:

https://github.com/ceph/ceph/pull/36102

Please review to get this fixed and back-ported quickly.

Comment 7 Ernesto Puerta 2020-07-16 17:50:00 UTC

Boris, as commented in that PR, I think the missing backport here is https://github.com/ceph/ceph/pull/33991/commits/6d5f88450e61122016fcf7b6cf9431dc67128d3d (not in Nautilus/4.*)

Comment 8 Boris Ranto 2020-07-16 19:09:14 UTC

@Umanga: You mentioned you were able to hit this in octopus, did it have the python fix Ernesto mentioned above in it? You should be able to check that by looking inside the '/usr/share/ceph/mgr/prometheus/module.py' in the ceph-mgr container.

Comment 9 umanga 2020-07-27 09:20:37 UTC

(In reply to Boris Ranto from comment #8)
> @Umanga: You mentioned you were able to hit this in octopus, did it have the
> python fix Ernesto mentioned above in it? You should be able to check that
> by looking inside the '/usr/share/ceph/mgr/prometheus/module.py' in the
> ceph-mgr container.

I no longer have that cluster to verify this, but I don't think it had this fix,
because when I checked the default value it was empty, not 300 as expected.

Comment 15 errata-xmlrpc 2021-01-12 14:55:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081
