
Bug 2114835

Summary: prometheus reports an error during evaluation of CephPoolGrowthWarning alert rule
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Martin Bukatovic <mbukatov>
Component: Ceph-Dashboard
Assignee: Aashish sharma <aasharma>
Status: CLOSED ERRATA
QA Contact: Sayalee <saraut>
Severity: low
Docs Contact: Akash Raj <akraj>
Priority: unspecified
Version: 5.2
CC: aasharma, akraj, ceph-eng-bugs, cephqe-warriors, kdreyer, rmandyam, saraut, vereddy
Target Milestone: ---
Keywords: Rebase
Target Release: 6.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.No CephPoolGrowthWarning alerts are fired on the dashboard
Previously, an incorrect query for the CephPoolGrowthWarning alert caused “Evaluating rule failed” errors to repeat indefinitely in the Prometheus logs of a stretch cluster. With this release, the query is fixed and no errors are observed.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-15 09:15:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2192813    

Description Martin Bukatovic 2022-08-03 11:43:22 UTC
Description of problem
======================

Error "Evaluating rule failed" for CephPoolGrowthWarning repeats indefinitely
in Prometheus logs of my stretched ceph cluster.

I'm reporting this bug after initial discussion about the problem in rh-ceph
chat room.

Version-Release number of selected component
============================================

compose id: RHCEPH-5.2-RHEL-8-20220715.ci.0
container: ceph-5.2-rhel-8-containers-candidate-66591-20220715201234

cephadm-16.2.8-76.el8cp.noarch
ceph-common-16.2.8-76.el8cp.x86_64
ceph-mgr-dashboard-16.2.8-76.el8cp.noarch
ceph-mgr-16.2.8-76.el8cp.x86_64
ceph-mon-16.2.8-76.el8cp.x86_64
cephfs-mirror-16.2.8-76.el8cp.x86_64
ceph-base-16.2.8-76.el8cp.x86_64
ceph-prometheus-alerts-16.2.8-76.el8cp.noarch
ceph-mgr-cephadm-16.2.8-76.el8cp.noarch
ceph-mgr-diskprediction-local-16.2.8-76.el8cp.noarch
ceph-mgr-modules-core-16.2.8-76.el8cp.noarch
ceph-mgr-rook-16.2.8-76.el8cp.noarch
ceph-radosgw-16.2.8-76.el8cp.x86_64
ceph-osd-16.2.8-76.el8cp.x86_64
ceph-mds-16.2.8-76.el8cp.x86_64
ceph-selinux-16.2.8-76.el8cp.x86_64
ceph-grafana-dashboards-16.2.8-76.el8cp.noarch
ceph-mgr-k8sevents-16.2.8-76.el8cp.noarch
ceph-iscsi-3.5-3.el8cp.noarch
ceph-immutable-object-cache-16.2.8-76.el8cp.x86_64

ceph version 16.2.8-76.el8cp (0643f29badd17e972dfdee80c4ee64dc272931a4) pacific (stable)

How reproducible
================

1/1

Steps to Reproduce
==================

1. Install a Ceph cluster via the Ceph orchestrator with the Ceph Dashboard and
   monitoring enabled, following the ODF Metro DR stretched Ceph setup[1].
2. Restart all nodes of the cluster and wait for Ceph to be up and
   healthy again.
3. On the admin node where Prometheus is running, locate the systemd unit of
   the Prometheus instance: systemctl -l | grep ceph.*prometheus
4. Check the logs of that unit via journald (see the combined sketch below), e.g.:
   journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/configuring_openshift_data_foundation_for_metro-dr_with_advanced_cluster_management/index
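
A minimal shell sketch combining steps 3 and 4; the fsid-based unit name is the
one from this particular cluster and will differ elsewhere:

```
# Locate the containerized Prometheus unit deployed by cephadm.
systemctl list-units --full | grep -i 'ceph.*prometheus'

# Count how many journal lines of that unit report the failing rule evaluation.
journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service \
  | grep -c "Evaluating rule failed"
```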

Actual results
==============

Error "Evaluating rule failed" repeats over and over again consuming most of
the prometheus logs. Here we see that about 97% of log lines are about this
error:

```
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | grep -v "Evaluating rule failed" | wc -l
244
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | grep "Evaluating rule failed" | wc -l
7855
```

The full error:

```
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | tail -1
Aug 03 17:08:02 osd-0 ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc-prometheus-osd-0[11519]: ts=2022-08-03T11:38:02.326Z caller=manager.go:609 level=warn component="rule manager" group=pools msg="Evaluating rule failed" rule="alert: CephPoolGrowthWarning\nexpr: (predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id) group_right()\n  ceph_pool_metadata) >= 95\nlabels:\n  oid: 1.3.6.1.4.1.50495.1.2.1.9.2\n  severity: warning\n  type: ceph_default\nannotations:\n  description: |\n    Pool '{{ $labels.name }}' will be full in less than 5 days assuming the average fill-up rate of the past 48 hours.\n  summary: Pool growth rate may soon exceed it's capacity\n" err="found duplicate series for the match group {pool_id=\"1\"} on the left hand-side of the operation: [{instance=\"10.1.161.89:9283\", job=\"ceph\", pool_id=\"1\"}, {instance=\"10.1.161.69:9283\", job=\"ceph\", pool_id=\"1\"}];many-to-many matching not allowed: matching labels must be unique on one side"
```

Expected results
================

There are no "Evaluating rule failed" errors in prometheus log.

Additional info
===============

I noticed this after restarting all nodes of my Ceph cluster, when the Ceph
Dashboard complained that it "Could not reach Prometheus's API on
osd-0:9095/api/v1". While this resolved itself after a while, the Prometheus
logs remained spammed with the error message described in this bug report.

See attached log dump fetched from the admin node:

```
# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service > ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.log
```

Details about the ceph cluster:

```
[root@osd-0 ~]# ceph osd lspools
1 device_health_metrics
2 rbdpool
3 cephfs.cephfs.meta
4 cephfs.cephfs.data
5 .rgw.root
6 default.rgw.log
7 default.rgw.control
8 default.rgw.meta
9 default.rgw.buckets.index
10 default.rgw.buckets.data
[root@osd-0 ~]# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    192 GiB  183 GiB  9.1 GiB   9.1 GiB       4.75
TOTAL  192 GiB  183 GiB  9.1 GiB   9.1 GiB       4.75
 
--- POOLS ---
POOL                       ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
device_health_metrics       1    1  356 KiB       17  1.4 MiB      0     21 GiB
rbdpool                     2   32  521 MiB      232  2.0 GiB   2.34     21 GiB
cephfs.cephfs.meta          3   32   24 KiB       22  212 KiB      0     21 GiB
cephfs.cephfs.data          4   32      0 B        0      0 B      0     21 GiB
.rgw.root                   5   32  1.3 KiB        4   64 KiB      0     21 GiB
default.rgw.log             6   32  3.6 KiB      209  544 KiB      0     21 GiB
default.rgw.control         7   32      0 B        8      0 B      0     21 GiB
default.rgw.meta            8   32  5.5 KiB       21  288 KiB      0     21 GiB
default.rgw.buckets.index   9   32      0 B       44      0 B      0     21 GiB
default.rgw.buckets.data   10   32    2 KiB        2   32 KiB      0     21 GiB
```

Comment 2 Martin Bukatovic 2022-08-03 12:03:25 UTC
The query in question is:

```
(predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id) group_right() ceph_pool_metadata) >= 95
```

The values of the ceph_pool_metadata metric (via a Prometheus query) look OK:

```
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name=".rgw.root", pool_id="5", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="cephfs.cephfs.data", pool_id="4", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="cephfs.cephfs.meta", pool_id="3", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.buckets.data", pool_id="10", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.buckets.index", pool_id="9", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.control", pool_id="7", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.log", pool_id="6", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.meta", pool_id="8", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="device_health_metrics", pool_id="1", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="rbdpool", pool_id="2", type="replicated"} 1
```

But the result of the `predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5)` expression contains duplicate pool ids:

```
{instance="10.1.161.69:9283", job="ceph", pool_id="1"} 0.00002235054085654088
{instance="10.1.161.69:9283", job="ceph", pool_id="10"} 0.0000005170602257106434
{instance="10.1.161.69:9283", job="ceph", pool_id="2"} 0.14350355292439385
{instance="10.1.161.69:9283", job="ceph", pool_id="3"} 0.000003241211518389691
{instance="10.1.161.69:9283", job="ceph", pool_id="4"} 0
{instance="10.1.161.69:9283", job="ceph", pool_id="5"} 0.0000009754056280110012
{instance="10.1.161.69:9283", job="ceph", pool_id="6"} 0.00000830233856533691
{instance="10.1.161.69:9283", job="ceph", pool_id="7"} 0
{instance="10.1.161.69:9283", job="ceph", pool_id="8"} 0.00001201077839360858
{instance="10.1.161.69:9283", job="ceph", pool_id="9"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="1"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="2"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="3"} 0.000001370114318888227
{instance="10.1.161.89:9283", job="ceph", pool_id="4"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="5"} 0.0000006850576141914644
{instance="10.1.161.89:9283", job="ceph", pool_id="6"} 0.000005822959792567417
{instance="10.1.161.89:9283", job="ceph", pool_id="7"} 0
{instance="10.1.161.89:9283", job="ceph", pool_id="8"} 0.0000003425289207825699
```

Comment 13 Ken Dreyer (Red Hat) 2023-01-16 21:34:57 UTC
Merged to quincy upstream in https://github.com/ceph/ceph/pull/49475 . This will be in v17.2.6.

Comment 27 errata-xmlrpc 2023-06-15 09:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623