Description of problem
======================

Error "Evaluating rule failed" for CephPoolGrowthWarning repeats indefinitely in the Prometheus logs of my stretched Ceph cluster. I'm reporting this bug after an initial discussion of the problem in the rh-ceph chat room.

Version-Release number of selected component
============================================

compose id: RHCEPH-5.2-RHEL-8-20220715.ci.0
container: ceph-5.2-rhel-8-containers-candidate-66591-20220715201234

cephadm-16.2.8-76.el8cp.noarch
ceph-common-16.2.8-76.el8cp.x86_64
ceph-mgr-dashboard-16.2.8-76.el8cp.noarch
ceph-mgr-16.2.8-76.el8cp.x86_64
ceph-mon-16.2.8-76.el8cp.x86_64
cephfs-mirror-16.2.8-76.el8cp.x86_64
ceph-base-16.2.8-76.el8cp.x86_64
ceph-prometheus-alerts-16.2.8-76.el8cp.noarch
ceph-mgr-cephadm-16.2.8-76.el8cp.noarch
ceph-mgr-diskprediction-local-16.2.8-76.el8cp.noarch
ceph-mgr-modules-core-16.2.8-76.el8cp.noarch
ceph-mgr-rook-16.2.8-76.el8cp.noarch
ceph-radosgw-16.2.8-76.el8cp.x86_64
ceph-osd-16.2.8-76.el8cp.x86_64
ceph-mds-16.2.8-76.el8cp.x86_64
ceph-selinux-16.2.8-76.el8cp.x86_64
ceph-grafana-dashboards-16.2.8-76.el8cp.noarch
ceph-mgr-k8sevents-16.2.8-76.el8cp.noarch
ceph-iscsi-3.5-3.el8cp.noarch
ceph-immutable-object-cache-16.2.8-76.el8cp.x86_64

ceph version 16.2.8-76.el8cp (0643f29badd17e972dfdee80c4ee64dc272931a4) pacific (stable)

How reproducible
================

1/1

Steps to Reproduce
==================

1. Install a Ceph cluster via the Ceph orchestrator with the Ceph dashboard and monitoring enabled, following the ODF Metro DR stretched Ceph setup[1].
2. Restart all nodes of the cluster and wait for Ceph to be up and healthy again.
3. On the admin node where Prometheus is running, locate the systemd unit of the Prometheus instance:

   systemctl -l | grep ceph.*prometheus

4. Check its logs via journald, e.g.:

   journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/configuring_openshift_data_foundation_for_metro-dr_with_advanced_cluster_management/index

Actual results
==============

Error "Evaluating rule failed" repeats over and over again, consuming most of the Prometheus logs.
Here we can see that about 97% of the log lines are about this error:

```
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | grep -v "Evaluating rule failed" | wc -l
244
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | grep "Evaluating rule failed" | wc -l
7855
```

The full error:

```
[root@osd-0 ~]# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service | tail -1
Aug 03 17:08:02 osd-0 ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc-prometheus-osd-0[11519]: ts=2022-08-03T11:38:02.326Z caller=manager.go:609 level=warn component="rule manager" group=pools msg="Evaluating rule failed" rule="alert: CephPoolGrowthWarning\nexpr: (predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id) group_right()\n ceph_pool_metadata) >= 95\nlabels:\n oid: 1.3.6.1.4.1.50495.1.2.1.9.2\n severity: warning\n type: ceph_default\nannotations:\n description: |\n Pool '{{ $labels.name }}' will be full in less than 5 days assuming the average fill-up rate of the past 48 hours.\n summary: Pool growth rate may soon exceed it's capacity\n" err="found duplicate series for the match group {pool_id=\"1\"} on the left hand-side of the operation: [{instance=\"10.1.161.89:9283\", job=\"ceph\", pool_id=\"1\"}, {instance=\"10.1.161.69:9283\", job=\"ceph\", pool_id=\"1\"}];many-to-many matching not allowed: matching labels must be unique on one side"
```

Expected results
================

There are no "Evaluating rule failed" errors in the Prometheus log.

Additional info
===============

I noticed this when I restarted all nodes of my Ceph cluster and the Ceph dashboard started complaining that "Could not reach Prometheus's API on osd-0:9095/api/v1". While this resolved itself after a while, I noticed that the Prometheus logs are spammed with the error message as explained in this bug report. See the attached log dump fetched from the admin node:

```
# journalctl -u ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.service > ceph-d5e6fc12-077a-11ed-be0b-0050568fbefc.log
```

Details about the Ceph cluster:

```
[root@osd-0 ~]# ceph osd lspools
1 device_health_metrics
2 rbdpool
3 cephfs.cephfs.meta
4 cephfs.cephfs.data
5 .rgw.root
6 default.rgw.log
7 default.rgw.control
8 default.rgw.meta
9 default.rgw.buckets.index
10 default.rgw.buckets.data

[root@osd-0 ~]# ceph df
--- RAW STORAGE ---
CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd      192 GiB  183 GiB  9.1 GiB  9.1 GiB   4.75
TOTAL    192 GiB  183 GiB  9.1 GiB  9.1 GiB   4.75

--- POOLS ---
POOL                       ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics      1   1    356 KiB  17       1.4 MiB  0      21 GiB
rbdpool                    2   32   521 MiB  232      2.0 GiB  2.34   21 GiB
cephfs.cephfs.meta         3   32   24 KiB   22       212 KiB  0      21 GiB
cephfs.cephfs.data         4   32   0 B      0        0 B      0      21 GiB
.rgw.root                  5   32   1.3 KiB  4        64 KiB   0      21 GiB
default.rgw.log            6   32   3.6 KiB  209      544 KiB  0      21 GiB
default.rgw.control        7   32   0 B      8        0 B      0      21 GiB
default.rgw.meta           8   32   5.5 KiB  21       288 KiB  0      21 GiB
default.rgw.buckets.index  9   32   0 B      44       0 B      0      21 GiB
default.rgw.buckets.data   10  32   2 KiB    2        32 KiB   0      21 GiB
```
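For what it's worth, the ambiguity named in the err= part above can be confirmed straight from the Prometheus query UI. The following diagnostic query is my own sketch, not part of the shipped rules: it counts the left-hand-side series per pool_id, and any result above 1 is a match group that trips the many-to-many check:

```
# Diagnostic sketch: count series per pool_id on the left-hand side of the
# alert's join; any pool_id with a count > 1 makes `on(pool_id)` ambiguous.
count by (pool_id) (
  predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5)
) > 1
```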
The query in question is:

```
(predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id) group_right()
  ceph_pool_metadata) >= 95
```

The values of the ceph_pool_metadata metric (via a Prometheus query) look OK:

```
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name=".rgw.root", pool_id="5", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="cephfs.cephfs.data", pool_id="4", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="cephfs.cephfs.meta", pool_id="3", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.buckets.data", pool_id="10", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.buckets.index", pool_id="9", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.control", pool_id="7", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.log", pool_id="6", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="default.rgw.meta", pool_id="8", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="device_health_metrics", pool_id="1", type="replicated"} 1
ceph_pool_metadata{compression_mode="none", description="replica:4", instance="10.1.161.69:9283", job="ceph", name="rbdpool", pool_id="2", type="replicated"} 1
```

But the result of the `predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5)` expression contains duplicate pool_id values, differing only in the instance label:

```
{instance="10.1.161.69:9283", job="ceph", pool_id="1"}  0.00002235054085654088
{instance="10.1.161.69:9283", job="ceph", pool_id="10"} 0.0000005170602257106434
{instance="10.1.161.69:9283", job="ceph", pool_id="2"}  0.14350355292439385
{instance="10.1.161.69:9283", job="ceph", pool_id="3"}  0.000003241211518389691
{instance="10.1.161.69:9283", job="ceph", pool_id="4"}  0
{instance="10.1.161.69:9283", job="ceph", pool_id="5"}  0.0000009754056280110012
{instance="10.1.161.69:9283", job="ceph", pool_id="6"}  0.00000830233856533691
{instance="10.1.161.69:9283", job="ceph", pool_id="7"}  0
{instance="10.1.161.69:9283", job="ceph", pool_id="8"}  0.00001201077839360858
{instance="10.1.161.69:9283", job="ceph", pool_id="9"}  0
{instance="10.1.161.89:9283", job="ceph", pool_id="1"}  0
{instance="10.1.161.89:9283", job="ceph", pool_id="2"}  0
{instance="10.1.161.89:9283", job="ceph", pool_id="3"}  0.000001370114318888227
{instance="10.1.161.89:9283", job="ceph", pool_id="4"}  0
{instance="10.1.161.89:9283", job="ceph", pool_id="5"}  0.0000006850576141914644
{instance="10.1.161.89:9283", job="ceph", pool_id="6"}  0.000005822959792567417
{instance="10.1.161.89:9283", job="ceph", pool_id="7"}  0
{instance="10.1.161.89:9283", job="ceph", pool_id="8"}  0.0000003425289207825699
```
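Since the duplicated series differ only in their instance label, it looks like samples from more than one ceph-mgr exporter (presumably the active and a standby mgr around the restart) ended up inside the 2d lookback window. As an illustration only, and not the fix that was eventually shipped, one workaround sketch would be to collapse the instance dimension before the join:

```
# Workaround sketch (illustration, not the upstream fix): aggregate across
# exporter instances so that exactly one series per pool_id remains on the
# left-hand side, making the `on(pool_id)` match unambiguous again.
(
  max by (pool_id) (predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5))
  * on(pool_id) group_right() ceph_pool_metadata
) >= 95
```

Because group_right() keeps the labels of the ceph_pool_metadata side, the `{{ $labels.name }}` reference in the alert's description would still resolve.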
Merged upstream to the quincy branch in https://github.com/ceph/ceph/pull/49475. Will be in v17.2.6.
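For anyone checking whether a given build already carries the reworked rule: the alerts on this cluster presumably come from the ceph-prometheus-alerts package listed in the version information above, so something along these lines should show the deployed expression (the rule file path varies between releases, hence the placeholder):

```
# List the rule files shipped by the alerts package, then inspect the
# CephPoolGrowthWarning expression in the file reported by the first command.
rpm -ql ceph-prometheus-alerts
grep -B 1 -A 8 'CephPoolGrowthWarning' <rules-file-from-previous-command>
```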
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623