Bug 1821268

Summary:	Thanos Ruler should send alerts to all Alertmanager pods
Product:	OpenShift Container Platform	Reporter:	Simon Pasquier <spasquie>
Component:	Monitoring	Assignee:	Simon Pasquier <spasquie>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	unspecified	Docs Contact:
Priority:	low
Version:	4.5	CC:	alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-07-13 17:25:34 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Simon Pasquier 2020-04-06 12:57:39 UTC

Description of problem:
Thanos Ruler sends alerts to the Alertmanager service (alertmanager-main.openshift-monitoring.svc) instead of all Alertmanager pods.
This means that each Alertmanager might get an incomplete view of the alerts, depending on which pod the service routes a given request to.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
1. Enable user workload monitoring.
2. Create a user alert that always fires.
***
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test
  namespace: default
spec:
  groups:
  - name: Test rules
    rules:
    - alert: Drill
      expr: vector(1)
      labels:
        severity: warning
***

3. Query each Alertmanager for the list of active alerts.

for i in {0,1,2}; do echo "alertmanager-main-$i"; oc exec -n openshift-monitoring -t alertmanager-main-$i -c alertmanager -- curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname'; done

Actual results:

alertmanager-main-0
AlertmanagerReceiversNotConfigured
Watchdog
Drill
alertmanager-main-1
AlertmanagerReceiversNotConfigured
Watchdog
Drill
alertmanager-main-2
AlertmanagerReceiversNotConfigured
Watchdog

Expected results:
The "Drill" alert should be present for every Alertmanager.
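
To make the comparison less error-prone than reading the lists by eye, the output of the loop in step 3 can be piped through a small filter. The sketch below is only an illustration: missing_alert is a hypothetical helper name (not part of oc or Alertmanager), and it assumes the output format shown above, that is a pod name line followed by that pod's alert names.

```shell
# Hypothetical helper: list the Alertmanager pods that do NOT report a given alert.
# Reads the output of the step 3 loop on stdin (assumed format: one line with the
# pod name, followed by one line per alert name for that pod).
missing_alert() {
  awk -v want="$1" '
    # A pod name line starts a new section; flush the previous pod first.
    /^alertmanager-main-[0-9]+$/ {
      if (pod != "" && !found) print pod
      pod = $0; found = 0; next
    }
    # Any other line is an alert name belonging to the current pod.
    $0 == want { found = 1 }
    # Flush the last pod at end of input.
    END { if (pod != "" && !found) print pod }
  '
}
```

Appending "| missing_alert Drill" to the loop from step 3 prints the names of the pods that lack the alert. Empty output means every pod reported it.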

Additional info:
Right now Thanos Ruler is configured with "alertmanager-main.openshift-monitoring.svc:9094". It needs to be "dnssrv+_web._tcp.alertmanager-operated.openshift-monitoring.svc" instead. For this to work, however, a fix is first needed in prometheus-operator.

Comment 3 Junqi Zhao 2020-04-22 03:35:12 UTC

Tested with 4.5.0-0.nightly-2020-04-21-233210, following the steps in Comment 0. The "Drill" alert is present for every Alertmanager:
# for i in {0,1,2}; do echo "alertmanager-main-$i"; oc exec -n openshift-monitoring -t  alertmanager-main-$i -c alertmanager --  curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname'; echo -e "\n"; done
alertmanager-main-0
AlertmanagerReceiversNotConfigured
Drill
CustomResourceDetected
KubePodCrashLooping
KubeDeploymentReplicasMismatch
Watchdog


alertmanager-main-1
AlertmanagerReceiversNotConfigured
Drill
CustomResourceDetected
KubePodCrashLooping
KubeDeploymentReplicasMismatch
Watchdog


alertmanager-main-2
AlertmanagerReceiversNotConfigured
Drill
CustomResourceDetected
KubePodCrashLooping
KubeDeploymentReplicasMismatch
Watchdog

Comment 4 errata-xmlrpc 2020-07-13 17:25:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409