Bug 2114835 - prometheus reports an error during evaluation of CephPoolGrowthWarning alert rule
Summary: prometheus reports an error during evaluation of CephPoolGrowthWarning alert ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Dashboard
Version: 5.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 6.1
Assignee: Aashish sharma
QA Contact: Sayalee
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2192813
 
Reported: 2022-08-03 11:43 UTC by Martin Bukatovic
Modified: 2023-06-15 09:16 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.No CephPoolGrowthWarning alerts are fired on the dashboard
Previously, an incorrect query for the CephPoolGrowthWarning alert caused “Evaluating rule failed” errors to repeat indefinitely in the Prometheus logs of a stretch cluster. With this release, the query is fixed and no such errors are observed.
Clone Of:
Environment:
Last Closed: 2023-06-15 09:15:36 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph pull 48526 0 None open mgr/dashboard: Fix CephPoolGrowthWarning alert 2022-11-15 06:42:29 UTC
Red Hat Issue Tracker RHCEPH-4991 0 None None None 2022-08-03 11:52:53 UTC
Red Hat Issue Tracker RHCSDASH-811 0 None None None 2022-08-03 11:52:54 UTC
Red Hat Product Errata RHSA-2023:3623 0 None None None 2023-06-15 09:16:26 UTC

Description Martin Bukatovic 2022-08-03 11:43:22 UTC
Description of problem
======================

Error "Evaluating rule failed" for CephPoolGrowthWarning repeats indefinitely
in Prometheus logs of my stretched ceph cluster.

I'm reporting this bug after an initial discussion about the problem in the
rh-ceph chat room.

Version-Release number of selected component
============================================

compose id: RHCEPH-5.2-RHEL-8-20220715.ci.0
container: ceph-5.2-rhel-8-containers-candidate-66591-20220715201234

cephadm-16.2.8-76.el8cp.noarch
ceph-common-16.2.8-76.el8cp.x86_64
ceph-mgr-dashboard-16.2.8-76.el8cp.noarch
ceph-mgr-16.2.8-76.el8cp.x86_64
ceph-mon-16.2.8-76.el8cp.x86_64
cephfs-mirror-16.2.8-76.el8cp.x86_64
ceph-base-16.2.8-76.el8cp.x86_64
ceph-prometheus-alerts-16.2.8-76.el8cp.noarch
ceph-mgr-cephadm-16.2.8-76.el8cp.noarch
ceph-mgr-diskprediction-local-16.2.8-76.el8cp.noarch
ceph-mgr-modules-core-16.2.8-76.el8cp.noarch
ceph-mgr-rook-16.2.8-76.el8cp.noarch
ceph-radosgw-16.2.8-76.el8cp.x86_64
ceph-osd-16.2.8-76.el8cp.x86_64
ceph-mds-16.2.8-76.el8cp.x86_64
ceph-selinux-16.2.8-76.el8cp.x86_64
ceph-grafana-dashboards-16.2.8-76.el8cp.noarch
ceph-mgr-k8sevents-16.2.8-76.el8cp.noarch
ceph-iscsi-3.5-3.el8cp.noarch
ceph-immutable-object-cache-16.2.8-76.el8cp.x86_64

ceph version 16.2.8-76.el8cp (0643f29badd17e972dfdee80c4ee64dc272931a4) pacific (stable)

How reproducible
================

1/1

Steps to Reproduce
==================

1. Install a Ceph cluster via the Ceph orchestrator with the Ceph Dashboard and
   monitoring enabled, following the ODF Metro DR Stretched Ceph setup[1].
2. Restart all nodes of the cluster and wait for Ceph to become up and
   healthy again.
3. On the admin node where Prometheus is running, locate the systemd unit of
   the Prometheus instance: systemctl -l | grep ceph.*prometheus
4. Check the logs via journald, e.g.:
   journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/configuring_openshift_data_foundation_for_metro-dr_with_advanced_cluster_management/index

Actual results
==============

Error "Evaluating rule failed" repeats over and over again consuming most of
the prometheus logs. Here we see that about 97% of log lines are about this
error:

```
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | grep -v "Evaluating rule failed" | wc -l
244
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | grep "Evaluating rule failed" | wc -l
7855
```

The full error:

```
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | tail -1
Aug 03 17:08:02 osd-0 ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc-prometheus-osd-0[11519]: ts=2022-08-03T11:38:02.326Z caller=manager.go:609 level=warn component="rule manager" group=pools msg="Evaluating rule failed" rule="alert: CephPoolGrowthWarning\nexpr: (predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id) group_right()\n  ceph_pool_metadata) >= 95\nlabels:\n  oid: 1.3.6.1.4.1.50495.1.2.1.9.2\n  severity: warning\n  type: ceph_default\nannotations:\n  description: |\n    Pool '{{ $labels.name }}' will be full in less than 5 days assuming the average fill-up rate of the past 48 hours.\n  summary: Pool growth rate may soon exceed it's capacity\n" err="found duplicate series for the match group {pool_id=\"1\"} on the left hand-side of the operation: [{instance=\"10.1.161.89:9283\", job=\"ceph\", pool_id=\"1\"}, {instance=\"10.1.161.69:9283\", job=\"ceph\", pool_id=\"1\"}];many-to-many matching not allowed: matching labels must be unique on one side"
```
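
The "many-to-many matching not allowed" part of the error means that more than
one left-hand series shares the same pool_id match group. As a diagnostic
sketch (for investigation only, not part of the shipped alert rules), a query
like the following, run in the Prometheus UI, lists the affected pool IDs:

```
count by (pool_id) (predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5)) > 1
```

Any pool_id it returns has duplicate series on the left-hand side of the join,
which is what makes the rule evaluation fail.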

Expected results
================

There are no "Evaluating rule failed" errors in the Prometheus log.

Additional info
===============

I noticed this when I restarted all nodes of my Ceph cluster and then saw that
the Ceph Dashboard complains that "Could not reach Prometheus's API on
osd-0:9095/api/v1". While this resolved itself after a while, I noticed
that the Prometheus logs are spammed with the error message described in
this bug report.

See attached log dump fetched from the admin node:

```
# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service > ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.log
```

Details about the ceph cluster:

```
[root@osd-0 ~]# ceph osd lspools
1 device_health_metrics
2 rbdpool
3 cephfs.cephfs.meta
4 cephfs.cephfs.data
5 .rgw.root
6 default.rgw.log
7 default.rgw.control
8 default.rgw.meta
9 default.rgw.buckets.index
10 default.rgw.buckets.data
[root@osd-0 ~]# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    192 GiB  183 GiB  9.1 GiB   9.1 GiB       4.75
TOTAL  192 GiB  183 GiB  9.1 GiB   9.1 GiB       4.75
 
--- POOLS ---
POOL                       ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
device_health_metrics       1    1  356 KiB       17  1.4 MiB      0     21 GiB
rbdpool                     2   32  521 MiB      232  2.0 GiB   2.34     21 GiB
cephfs.cephfs.meta          3   32   24 KiB       22  212 KiB      0     21 GiB
cephfs.cephfs.data          4   32      0 B        0      0 B      0     21 GiB
.rgw.root                   5   32  1.3 KiB        4   64 KiB      0     21 GiB
default.rgw.log             6   32  3.6 KiB      209  544 KiB      0     21 GiB
default.rgw.control         7   32      0 B        8      0 B      0     21 GiB
default.rgw.meta            8   32  5.5 KiB       21  288 KiB      0     21 GiB
default.rgw.buckets.index   9   32      0 B       44      0 B      0     21 GiB
default.rgw.buckets.data   10   32    2 KiB        2   32 KiB      0     21 GiB
```

Comment 2 Martin Bukatovic 2022-08-03 12:03:25 UTC
The query in question is:

```
(predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id) group_right() ceph_pool_metadata) >= 95
```

The values of the ceph_pool_metadata metric (via a Prometheus query) look OK:

```
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name=".rgw.root", pool_id="5", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="cephfs.cephfs.data", pool_id="4", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="cephfs.cephfs.meta", pool_id="3", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.buckets.data", pool_id="10", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.buckets.index", pool_id="9", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.control", pool_id="7", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.log", pool_id="6", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.meta", pool_id="8", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="device_health_metrics", pool_id="1", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="rbdpool", pool_id="2", type="replicated"} 1
```

But the `predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5)` expression contains duplicate pool IDs (one series per exporter instance):

```
{instance="10.1.161.69:9283", job="ceph", pool_id="1"} 0.00002235054085654088
{instance="10.1.161.69:9283", job="ceph", pool_id="10"} 0.0000005170602257106434
{instance="10.1.161.69:9283", job="ceph", pool_id="2"} 0.14350355292439385
{instance="10.1.161.69:9283", job="ceph", pool_id="3"} 0.000003241211518389691
{instance="10.1.161.69:9283", job="ceph", pool_id="4"} 0
{instance="10.1.161.69:9283", job="ceph", pool_id="5"} 0.0000009754056280110012
{instance="10.1.161.69:9283", job="ceph", pool_id="6"} 0.00000830233856533691
{instance="10.1.161.69:9283", job="ceph", pool_id="7"} 0
{instance="10.1.161.69:9283", job="ceph", pool_id="8"} 0.00001201077839360858
{instance="10.1.161.69:9283", job="ceph", pool_id="9"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="1"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="2"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="3"} 0.000001370114318888227
{instance="10.1.161.89:9283", job="ceph", pool_id="4"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="5"} 0.0000006850576141914644
{instance="10.1.161.89:9283", job="ceph", pool_id="6"} 0.000005822959792567417
{instance="10.1.161.89:9283", job="ceph", pool_id="7"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="8"} 0.0000003425289207825699
```

Comment 13 Ken Dreyer (Red Hat) 2023-01-16 21:34:57 UTC
Merged to Quincy upstream in https://github.com/ceph/ceph/pull/49475. Will be in v17.2.6.

Comment 27 errata-xmlrpc 2023-06-15 09:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623

