Bug 1821268

Summary: Thanos Ruler should send alerts to all Alertmanager pods
Product: OpenShift Container Platform Reporter: Simon Pasquier <spasquie>
Component: Monitoring Assignee: Simon Pasquier <spasquie>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: unspecified Docs Contact:
Priority: low    
Version: 4.5 CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:25:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Simon Pasquier 2020-04-06 12:57:39 UTC
Description of problem:
Thanos Ruler sends alerts to the Alertmanager service (alertmanager-main.openshift-monitoring.svc) instead of all Alertmanager pods.
This means that each Alertmanager might get an incomplete view of the alerts, depending on which pod the service routes each request to.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Enable user workload monitoring.
2. Create a user alert that always fires.
***
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test
  namespace: default
spec:
  groups:
  - name: Test rules
    rules:
    - alert: Drill
      expr: vector(1)
      labels:
        severity: warning
***

3. Query each Alertmanager for the list of active alerts.

for i in {0,1,2}; do echo "alertmanager-main-$i"; oc exec -n openshift-monitoring -t  alertmanager-main-$i -c alertmanager --  curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname'; done


Actual results:

alertmanager-main-0
AlertmanagerReceiversNotConfigured
Watchdog
Drill
alertmanager-main-1
AlertmanagerReceiversNotConfigured
Watchdog
Drill
alertmanager-main-2
AlertmanagerReceiversNotConfigured
Watchdog

Expected results:
The "Drill" alert should be present for every Alertmanager.
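
The comparison between actual and expected results can be automated. The following is a minimal sketch, not part of the original report: `pods_missing_alert` is a hypothetical helper name, and it assumes its input has the same shape as the output of the loop in step 3 (a pod name line followed by one alert name per line).

```shell
# Hypothetical helper: print the pods whose alert list lacks a given alert.
# Input (stdin): a pod name line ("alertmanager-main-N") followed by one
# alert name per line, as printed by the loop in step 3.
pods_missing_alert() {
  awk -v alert="$1" '
    # A pod header starts a new group; report the previous pod if the
    # alert was never seen in its group.
    /^alertmanager-main-[0-9]+$/ {
      if (pod != "" && !found) print pod
      pod = $0; found = 0; next
    }
    # Exact match on the alert name within the current group.
    $0 == alert { found = 1 }
    # Report the last pod if needed.
    END { if (pod != "" && !found) print pod }
  '
}
```

Piping the step 3 loop into `pods_missing_alert Drill` prints nothing when every Alertmanager has the alert. With the actual results above as input, it prints only alertmanager-main-2.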

Additional info:
Right now Thanos Ruler is configured with "alertmanager-main.openshift-monitoring.svc:9094". It needs to be "dnssrv+_web._tcp.alertmanager-operated.openshift-monitoring.svc" instead. For this to work, a fix is needed in prometheus-operator first.
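
The "dnssrv+" prefix makes Thanos resolve the name with a DNS SRV lookup and use every returned target and port, so each Alertmanager pod becomes its own endpoint rather than sharing one service address. The following is a minimal sketch of how to inspect what such a lookup returns, not part of the original report: `srv_to_endpoints` is a hypothetical helper name, and it assumes `dig` is installed wherever the lookup runs.

```shell
# Hypothetical helper: turn `dig +short SRV` answer lines
# ("priority weight port target.") into "target:port" endpoints.
srv_to_endpoints() {
  # Keep only well-formed 4-field answers and drop the trailing root dot.
  awk 'NF == 4 { sub(/\.$/, "", $4); print $4 ":" $3 }'
}

# Usage (assumes dig is available and the name resolves from where it is
# run; the cluster domain may need to be appended to the name):
#   dig +short SRV _web._tcp.alertmanager-operated.openshift-monitoring.svc | srv_to_endpoints
```

One endpoint per Alertmanager pod is expected in the output, so a cluster with three replicas should list three endpoints.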

Comment 3 Junqi Zhao 2020-04-22 03:35:12 UTC
Tested with 4.5.0-0.nightly-2020-04-21-233210 and followed the steps in Comment 0. The "Drill" alert is present for every Alertmanager.
# for i in {0,1,2}; do echo "alertmanager-main-$i"; oc exec -n openshift-monitoring -t  alertmanager-main-$i -c alertmanager --  curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname'; echo -e "\n"; done
alertmanager-main-0
AlertmanagerReceiversNotConfigured
Drill
CustomResourceDetected
KubePodCrashLooping
KubeDeploymentReplicasMismatch
Watchdog


alertmanager-main-1
AlertmanagerReceiversNotConfigured
Drill
CustomResourceDetected
KubePodCrashLooping
KubeDeploymentReplicasMismatch
Watchdog


alertmanager-main-2
AlertmanagerReceiversNotConfigured
Drill
CustomResourceDetected
KubePodCrashLooping
KubeDeploymentReplicasMismatch
Watchdog

Comment 4 errata-xmlrpc 2020-07-13 17:25:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409