Bug 1821268 - Thanos Ruler should send alerts to all Alertmanager pods
Summary: Thanos Ruler should send alerts to all Alertmanager pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: unspecified
Target Milestone: ---
Target Release: 4.5.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-06 12:57 UTC by Simon Pasquier
Modified: 2020-07-13 17:25 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:25:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github coreos prometheus-operator pull 3125 0 None closed pkg/alertmanager: fix definition of web service port 2020-09-17 09:38:13 UTC
Github openshift cluster-monitoring-operator pull 745 0 None closed Bug 1821268: fix Alertmanager address for Thanos Ruler 2020-09-17 09:38:13 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:25:53 UTC

Description Simon Pasquier 2020-04-06 12:57:39 UTC
Description of problem:
Thanos Ruler sends alerts to the Alertmanager service (alertmanager-main.openshift-monitoring.svc) instead of all Alertmanager pods.
This means that each Alertmanager can end up with an incomplete view of the alerts, depending on which pod the service routes each request to. Alertmanager's HA design expects clients to send every alert to every replica.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Enable user workload monitoring (a ConfigMap sketch follows these steps).
2. Create a user alert that always fires.
***
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test
  namespace: default
spec:
  groups:
  - name: Test rules
    rules:
    - alert: Drill
      expr: vector(1)
      labels:
        severity: warning
***

3. Query each Alertmanager for the list of active alerts:

for i in {0,1,2}; do echo "alertmanager-main-$i"; oc exec -n openshift-monitoring -t alertmanager-main-$i -c alertmanager -- curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname'; done
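
For step 1, user workload monitoring in 4.5 is toggled through the cluster monitoring ConfigMap (tech preview at the time). A minimal sketch, assuming the tech-preview key name from the 4.5 documentation:

***
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Assumed 4.5 tech-preview toggle for user workload monitoring.
    techPreviewUserWorkload:
      enabled: true
***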


Actual results:

alertmanager-main-0
AlertmanagerReceiversNotConfigured
Watchdog
Drill
alertmanager-main-1
AlertmanagerReceiversNotConfigured
Watchdog
Drill
alertmanager-main-2
AlertmanagerReceiversNotConfigured
Watchdog

Expected results:
The "Drill" alert should be present for every Alertmanager.

Additional info:
Right now Thanos Ruler is configured with "alertmanager-main.openshift-monitoring.svc:9094". It needs to be "dnssrv+_web._tcp.alertmanager-operated.openshift-monitoring.svc" instead, so that Thanos Ruler resolves the SRV record of the headless service and sends alerts to every Alertmanager pod. But for this to work, a first fix is needed in prometheus-operator so that the web port of the alertmanager-operated service is defined correctly (see the linked pull requests).
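
For illustration only, the change amounts to swapping the URL scheme handed to Thanos Ruler. A rough sketch of the container argument before and after, assuming Thanos's dnssrv+ flag syntax; the real manifest is generated by cluster-monitoring-operator (see the linked pull requests):

***
# Hypothetical excerpt of the thanos-ruler container arguments,
# for illustration only.
args:
# Before: the ClusterIP service load-balances, so each alert reaches
# only one Alertmanager pod.
- --alertmanagers.url=http://alertmanager-main.openshift-monitoring.svc:9094
# After: dnssrv+ resolves the _web._tcp SRV record of the headless
# alertmanager-operated service, so alerts fan out to every replica.
# This relies on the prometheus-operator fix that defines the web
# service port correctly.
- --alertmanagers.url=dnssrv+http://_web._tcp.alertmanager-operated.openshift-monitoring.svc
***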

Comment 3 Junqi Zhao 2020-04-22 03:35:12 UTC
Tested with 4.5.0-0.nightly-2020-04-21-233210, following the steps in Comment 0. The "Drill" alert is present for every Alertmanager:
# for i in {0,1,2}; do echo "alertmanager-main-$i"; oc exec -n openshift-monitoring -t alertmanager-main-$i -c alertmanager -- curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname'; echo -e "\n"; done
alertmanager-main-0
AlertmanagerReceiversNotConfigured
Drill
CustomResourceDetected
KubePodCrashLooping
KubeDeploymentReplicasMismatch
Watchdog


alertmanager-main-1
AlertmanagerReceiversNotConfigured
Drill
CustomResourceDetected
KubePodCrashLooping
KubeDeploymentReplicasMismatch
Watchdog


alertmanager-main-2
AlertmanagerReceiversNotConfigured
Drill
CustomResourceDetected
KubePodCrashLooping
KubeDeploymentReplicasMismatch
Watchdog

Comment 4 errata-xmlrpc 2020-07-13 17:25:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

