Bug 2114835 - prometheus reports an error during evaluation of CephPoolGrowthWarning alert rule
Summary: prometheus reports an error during evaluation of CephPoolGrowthWarning alert ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Dashboard
Version: 5.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 6.1
Assignee: Aashish sharma
QA Contact: Sayalee
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2192813
 
Reported: 2022-08-03 11:43 UTC by Martin Bukatovic
Modified: 2023-06-15 09:16 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.No CephPoolGrowthWarning alerts are fired on the dashboard
Previously, an incorrect query for the CephPoolGrowthWarning alert caused “Evaluating rule failed” errors to repeat indefinitely in the Prometheus logs of a stretch cluster. With this release, the query is fixed and no such errors are observed.
Clone Of:
Environment:
Last Closed: 2023-06-15 09:15:36 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph pull 48526 0 None open mgr/dashboard: Fix CephPoolGrowthWarning alert 2022-11-15 06:42:29 UTC
Red Hat Issue Tracker RHCEPH-4991 0 None None None 2022-08-03 11:52:53 UTC
Red Hat Issue Tracker RHCSDASH-811 0 None None None 2022-08-03 11:52:54 UTC
Red Hat Product Errata RHSA-2023:3623 0 None None None 2023-06-15 09:16:26 UTC

Description Martin Bukatovic 2022-08-03 11:43:22 UTC
Description of problem
======================

Error "Evaluating rule failed" for CephPoolGrowthWarning repeats indefinitely
in Prometheus logs of my stretched ceph cluster.

I'm reporting this bug after an initial discussion about the problem in the
rh-ceph chat room.

Version-Release number of selected component
============================================

compose id: RHCEPH-5.2-RHEL-8-20220715.ci.0
container: ceph-5.2-rhel-8-containers-candidate-66591-20220715201234

cephadm-16.2.8-76.el8cp.noarch
ceph-common-16.2.8-76.el8cp.x86_64
ceph-mgr-dashboard-16.2.8-76.el8cp.noarch
ceph-mgr-16.2.8-76.el8cp.x86_64
ceph-mon-16.2.8-76.el8cp.x86_64
cephfs-mirror-16.2.8-76.el8cp.x86_64
ceph-base-16.2.8-76.el8cp.x86_64
ceph-prometheus-alerts-16.2.8-76.el8cp.noarch
ceph-mgr-cephadm-16.2.8-76.el8cp.noarch
ceph-mgr-diskprediction-local-16.2.8-76.el8cp.noarch
ceph-mgr-modules-core-16.2.8-76.el8cp.noarch
ceph-mgr-rook-16.2.8-76.el8cp.noarch
ceph-radosgw-16.2.8-76.el8cp.x86_64
ceph-osd-16.2.8-76.el8cp.x86_64
ceph-mds-16.2.8-76.el8cp.x86_64
ceph-selinux-16.2.8-76.el8cp.x86_64
ceph-grafana-dashboards-16.2.8-76.el8cp.noarch
ceph-mgr-k8sevents-16.2.8-76.el8cp.noarch
ceph-iscsi-3.5-3.el8cp.noarch
ceph-immutable-object-cache-16.2.8-76.el8cp.x86_64

ceph version 16.2.8-76.el8cp (0643f29badd17e972dfdee80c4ee64dc272931a4) pacific (stable)

How reproducible
================

1/1

Steps to Reproduce
==================

1. Install a Ceph cluster via the Ceph orchestrator with the Ceph Dashboard and
   monitoring enabled, following the ODF Metro DR Stretched Ceph setup[1].
2. Restart all nodes of the cluster and wait for Ceph to become up and
   healthy again.
3. On the admin node where Prometheus is running, locate the systemd unit of
   the Prometheus instance: systemctl -l | grep ceph.*prometheus
4. Check the logs via journald, e.g.:
   journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/configuring_openshift_data_foundation_for_metro-dr_with_advanced_cluster_management/index

Actual results
==============

Error "Evaluating rule failed" repeats over and over again consuming most of
the prometheus logs. Here we see that about 97% of log lines are about this
error:

```
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | grep -v "Evaluating rule failed" | wc -l
244
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | grep "Evaluating rule failed" | wc -l
7855
```

The full error:

```
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | tail -1
Aug 03 17:08:02 osd-0 ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc-prometheus-osd-0[11519]: ts=2022-08-03T11:38:02.326Z caller=manager.go:609 level=warn component="rule manager" group=pools msg="Evaluating rule failed" rule="alert: CephPoolGrowthWarning\nexpr: (predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id) group_right()\n  ceph_pool_metadata) >= 95\nlabels:\n  oid: 1.3.6.1.4.1.50495.1.2.1.9.2\n  severity: warning\n  type: ceph_default\nannotations:\n  description: |\n    Pool '{{ $labels.name }}' will be full in less than 5 days assuming the average fill-up rate of the past 48 hours.\n  summary: Pool growth rate may soon exceed it's capacity\n" err="found duplicate series for the match group {pool_id=\"1\"} on the left hand-side of the operation: [{instance=\"10.1.161.89:9283\", job=\"ceph\", pool_id=\"1\"}, {instance=\"10.1.161.69:9283\", job=\"ceph\", pool_id=\"1\"}];many-to-many matching not allowed: matching labels must be unique on one side"
```
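
The "many-to-many matching not allowed" part of the error means that more than
one left-hand series shares the same pool_id match group. As a diagnostic
sketch (for investigation only, not part of the shipped alert rules), a query
like the following, run in the Prometheus UI, lists the affected pool IDs:

```
count by (pool_id) (predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5)) > 1
```

Any pool_id it returns has duplicate series on the left-hand side of the join,
which is what makes the rule evaluation fail.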

Expected results
================

There are no "Evaluating rule failed" errors in the Prometheus log.

Additional info
===============

I noticed this when I restarted all nodes of my Ceph cluster and then saw that
the Ceph Dashboard complains that "Could not reach Prometheus's API on
osd-0:9095/api/v1". While this resolved itself after a while, I noticed
that the Prometheus logs are spammed with the error message described in
this bug report.

See attached log dump fetched from the admin node:

```
# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service > ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.log
```

Details about the ceph cluster:

```
[root@osd-0 ~]# ceph osd lspools
1 device_health_metrics
2 rbdpool
3 cephfs.cephfs.meta
4 cephfs.cephfs.data
5 .rgw.root
6 default.rgw.log
7 default.rgw.control
8 default.rgw.meta
9 default.rgw.buckets.index
10 default.rgw.buckets.data
[root@osd-0 ~]# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    192 GiB  183 GiB  9.1 GiB   9.1 GiB       4.75
TOTAL  192 GiB  183 GiB  9.1 GiB   9.1 GiB       4.75
 
--- POOLS ---
POOL                       ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
device_health_metrics       1    1  356 KiB       17  1.4 MiB      0     21 GiB
rbdpool                     2   32  521 MiB      232  2.0 GiB   2.34     21 GiB
cephfs.cephfs.meta          3   32   24 KiB       22  212 KiB      0     21 GiB
cephfs.cephfs.data          4   32      0 B        0      0 B      0     21 GiB
.rgw.root                   5   32  1.3 KiB        4   64 KiB      0     21 GiB
default.rgw.log             6   32  3.6 KiB      209  544 KiB      0     21 GiB
default.rgw.control         7   32      0 B        8      0 B      0     21 GiB
default.rgw.meta            8   32  5.5 KiB       21  288 KiB      0     21 GiB
default.rgw.buckets.index   9   32      0 B       44      0 B      0     21 GiB
default.rgw.buckets.data   10   32    2 KiB        2   32 KiB      0     21 GiB
```

Comment 2 Martin Bukatovic 2022-08-03 12:03:25 UTC
The query in question is:

```
(predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id) group_right() ceph_pool_metadata) >= 95
```

The values of the ceph_pool_metadata metric (via a Prometheus query) look OK:

```
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name=".rgw.root", pool_id="5", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="cephfs.cephfs.data", pool_id="4", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="cephfs.cephfs.meta", pool_id="3", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.buckets.data", pool_id="10", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.buckets.index", pool_id="9", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.control", pool_id="7", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.log", pool_id="6", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.meta", pool_id="8", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="device_health_metrics", pool_id="1", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="rbdpool", pool_id="2", type="replicated"} 1
```

But the `predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5)` expression contains duplicate pool IDs (one series per exporter instance):

```
{instance="10.1.161.69:9283", job="ceph", pool_id="1"} 0.00002235054085654088
{instance="10.1.161.69:9283", job="ceph", pool_id="10"} 0.0000005170602257106434
{instance="10.1.161.69:9283", job="ceph", pool_id="2"} 0.14350355292439385
{instance="10.1.161.69:9283", job="ceph", pool_id="3"} 0.000003241211518389691
{instance="10.1.161.69:9283", job="ceph", pool_id="4"} 0
{instance="10.1.161.69:9283", job="ceph", pool_id="5"} 0.0000009754056280110012
{instance="10.1.161.69:9283", job="ceph", pool_id="6"} 0.00000830233856533691
{instance="10.1.161.69:9283", job="ceph", pool_id="7"} 0
{instance="10.1.161.69:9283", job="ceph", pool_id="8"} 0.00001201077839360858
{instance="10.1.161.69:9283", job="ceph", pool_id="9"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="1"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="2"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="3"} 0.000001370114318888227
{instance="10.1.161.89:9283", job="ceph", pool_id="4"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="5"} 0.0000006850576141914644
{instance="10.1.161.89:9283", job="ceph", pool_id="6"} 0.000005822959792567417
{instance="10.1.161.89:9283", job="ceph", pool_id="7"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="8"} 0.0000003425289207825699
```

Comment 13 Ken Dreyer (Red Hat) 2023-01-16 21:34:57 UTC
Merged to Quincy upstream in https://github.com/ceph/ceph/pull/49475. Will be in v17.2.6.

Comment 27 errata-xmlrpc 2023-06-15 09:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623

