Bug 1901036 - [Ceph-dashboard] Default https based dashboard Ceph metric endpoint broken
Summary: [Ceph-dashboard] Default https based dashboard Ceph metric endpoint broken
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Mgr Plugins
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2
Assignee: Boris Ranto
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-24 10:48 UTC by Sunil Angadi
Modified: 2021-01-12 14:58 UTC (History)

Fixed In Version: ceph-14.2.11-87.el8cp, ceph-14.2.11-87.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-12 14:58:11 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ceph ceph pull 38277 | 0 | None | closed | mgr/prometheus: Make module more stable | 2021-01-13 16:59:30 UTC
Red Hat Product Errata | RHSA-2021:0081 | 0 | None | None | None | 2021-01-12 14:58:31 UTC

Comment 3 Boris Ranto 2020-11-24 13:28:51 UTC
I looked at the cluster and have a couple of notes:

Upstream back-ported this change to nautilus (it is in 14.2.11):

https://github.com/ceph/ceph/pull/35918

This means that we are now spawning a new thread in the ceph-mgr prometheus module.

This seems to cause issues when the ceph-mgr daemon is restarted: after a restart, the thread fails to collect the metrics.

I can get the module to work properly after a couple of (forced) restarts, but it misbehaves again after another restart or two.

I am wondering if we should just revert the change in nautilus.

I am also wondering whether this is reproducible in Octopus. Maybe nautilus is just missing some code that improves thread shutdown?

Comment 4 Boris Ranto 2020-11-25 09:56:34 UTC
OK, some more details. The patch that I mentioned in my previous comment is actually hiding the real error. After playing with this a little more, I found that the metrics collection fails because of:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/prometheus/module.py", line 194, in collect
    data = self.mod.collect()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 977, in collect
    self.get_mgr_status()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 546, in get_mgr_status
    ceph_release = host_version[1].split()[-2] # e.g. nautilus
IndexError: list index out of range

host_version is defined as:

host_version = servers.get((mgr, 'mgr'), ('', ''))

However, the code does not handle the default value: the version string is empty, so there is nothing to split and the index lookup raises the IndexError above. That fails the whole metrics collection and eventually the whole metrics collection thread.
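A minimal sketch of the failure described above, outside of ceph-mgr. The function names, the guard, and the sample version string are hypothetical and only illustrate the indexing problem; they are not the module's code, apart from the one expression copied from the traceback.

```python
# The default returned when the mgr is not found in the servers map,
# as in: servers.get((mgr, 'mgr'), ('', ''))
DEFAULT_HOST_VERSION = ('', '')

# Illustrative (made-up) value for the case where the lookup succeeds.
SAMPLE_HOST_VERSION = ('host1', 'ceph version 14.2.11 (abc) nautilus (stable)')


def release_unsafe(host_version):
    # Same expression as the failing line in the traceback.
    return host_version[1].split()[-2]


def release_guarded(host_version):
    # Hypothetical guard: return None when the version string has
    # fewer than two whitespace-separated fields.
    parts = host_version[1].split()
    return parts[-2] if len(parts) >= 2 else None


if __name__ == '__main__':
    print(release_unsafe(SAMPLE_HOST_VERSION))
    print(release_guarded(DEFAULT_HOST_VERSION))
    try:
        release_unsafe(DEFAULT_HOST_VERSION)
    except IndexError as err:
        print('IndexError:', err)
```

With the default tuple, `''.split()` returns an empty list, so the `[-2]` lookup has nothing to index. A guard like this would only avoid the crash; it would not recover the release name.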

We only needed this to get the always-on modules for the given Ceph release. This is not something we should do anyway; we should instead join the modules for all releases.

This PR should fix this:

https://github.com/ceph/ceph/pull/38277
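The idea of joining the always-on modules for all releases can be sketched as follows. This is not the code from the PR: the function name and the sample mapping are hypothetical, and the sketch only assumes that the always-on modules are available as a mapping from release name to a list of module names.

```python
def all_always_on_modules(always_on_by_release):
    """Return the union of always-on modules across every release.

    No release name is needed, so nothing has to be parsed out of
    a version string.
    """
    merged = set()
    for modules in always_on_by_release.values():
        merged.update(modules)
    return merged


if __name__ == '__main__':
    # Made-up sample data, for illustration only.
    sample = {
        'nautilus': ['balancer', 'crash'],
        'octopus': ['balancer', 'crash', 'telemetry'],
    }
    print(sorted(all_always_on_modules(sample)))
```

Because the result no longer depends on the detected release, an empty or missing version string cannot break the lookup.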

Comment 11 errata-xmlrpc 2021-01-12 14:58:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081

