Bug 2099939 - with only the UWM Alertmanager enabled, a user project AlertmanagerConfig is not loaded into the UWM Alertmanager or the platform Alertmanager
Summary: with only the UWM Alertmanager enabled, a user project AlertmanagerConfig is not loaded into the UWM Alertmanager or the platform Alertmanager
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.12.0
Assignee: Joao Marcal
QA Contact: Junqi Zhao
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-22 04:12 UTC by Junqi Zhao
Modified: 2023-01-17 19:50 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, if you enabled an instance of Alertmanager dedicated to user-defined projects, a misconfiguration could occur in certain circumstances, and you would not be informed that the user-defined project Alertmanager config map settings did not load for either the main instance of Alertmanager or the instance dedicated to user-defined projects. With this release, if this misconfiguration occurs, the Cluster Monitoring Operator now displays a message that informs you of the issue and provides resolution steps. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2099939[*BZ#2099939*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:50:08 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/cluster-monitoring-operator pull 1724 (open): Bug 2099939: Sets status when UserAlertmanagerConfig is missconfigured (last updated 2022-07-20 12:09:19 UTC)
Red Hat Product Errata RHSA-2022:7399 (last updated 2023-01-17 19:50:25 UTC)

Description Junqi Zhao 2022-06-22 04:12:23 UTC
Description of problem:
This is a corner case; we could close it if the scenario is not valid.

With UWM and enableUserAlertmanagerConfig enabled in the cluster-monitoring-config config map, and only the UWM Alertmanager enabled, a user project AlertmanagerConfig is not loaded into either the UWM Alertmanager or the platform Alertmanager.

Enabled UWM and enableUserAlertmanagerConfig in the cluster-monitoring-config config map:
# oc -n openshift-monitoring get cm cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
kind: ConfigMap
metadata:
  creationTimestamp: "2022-06-22T02:47:57Z"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
  resourceVersion: "73038"
  uid: bac78028-354f-4fdd-81a2-bb3b3601744b
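
For reference, a config map with this payload could be created from a local file named config.yaml (a sketch of the generic mechanism, not a step recorded in this report; --from-file=config.yaml produces the required config.yaml data key):
# oc -n openshift-monitoring create configmap cluster-monitoring-config --from-file=config.yaml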


Enabled the UWM Alertmanager only:
# oc -n openshift-user-workload-monitoring get cm user-workload-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    alertmanager:
      enabled: true
kind: ConfigMap
metadata:
  creationTimestamp: "2022-06-22T02:48:04Z"
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
  resourceVersion: "86637"
  uid: 96e32301-16d1-47e7-8ced-fe0668b678cb

Default Alertmanager configuration in the UWM Alertmanager:
# oc -n openshift-user-workload-monitoring exec -c alertmanager alertmanager-user-workload-0 -- cat /etc/alertmanager/config/alertmanager.yaml
"receivers":
- "name": "Default"
"route":
  "group_by":
  - "namespace"
  "receiver": "Default"


Created an AlertmanagerConfig under user project ns1:
***********************
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: example-routing
  namespace: ns1
spec:
  route:
    receiver: default
    groupBy: [job]
  receivers:
  - name: default
    webhookConfigs:
    - url: https://example.org/post
***********************
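
For reference, the manifest above could be applied with something like the following (the file name example-routing.yaml is illustrative; the namespace is taken from the manifest itself):
# oc apply -f example-routing.yaml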

# oc -n ns1 get AlertmanagerConfig example-routing -oyaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  creationTimestamp: "2022-06-22T02:59:37Z"
  generation: 1
  name: example-routing
  namespace: ns1
  resourceVersion: "64491"
  uid: e31c35ad-bf78-4bc3-ace5-63b4a1f70c65
spec:
  receivers:
  - name: default
    webhookConfigs:
    - url: https://example.org/post
  route:
    groupBy:
    - job
    receiver: default

The ns1 AlertmanagerConfig is not loaded into the UWM Alertmanager:
# oc -n openshift-user-workload-monitoring exec -c alertmanager alertmanager-user-workload-0 -- cat /etc/alertmanager/config/alertmanager.yaml
"receivers":
- "name": "Default"
"route":
  "group_by":
  - "namespace"
  "receiver": "Default"

It is not loaded into the platform Alertmanager either:
# oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- cat /etc/alertmanager/config/alertmanager.yaml
"global":
  "resolve_timeout": "5m"
"inhibit_rules":
- "equal":
  - "namespace"
  - "alertname"
  "source_matchers":
  - "severity = critical"
  "target_matchers":
  - "severity =~ warning|info"
- "equal":
  - "namespace"
  - "alertname"
  "source_matchers":
  - "severity = warning"
  "target_matchers":
  - "severity = info"
- "equal":
  - "namespace"
  "source_matchers":
  - "alertname = InfoInhibitor"
  "target_matchers":
  - "severity = info"
"receivers":
- "name": "Default"
- "name": "Watchdog"
- "name": "Critical"
- "name": "null"
"route":
  "group_by":
  - "namespace"
  "group_interval": "5m"
  "group_wait": "30s"
  "receiver": "Default"
  "repeat_interval": "12h"
  "routes":
  - "matchers":
    - "alertname = Watchdog"
    "receiver": "Watchdog"
  - "matchers":
    - "alertname = InfoInhibitor"
    "receiver": "null"
  - "matchers":
    - "severity = critical"
    "receiver": "Critical"

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-21-151125

How reproducible:
always

Steps to Reproduce:
1. See the description above.

Actual results:
With only the UWM Alertmanager enabled, the user project AlertmanagerConfig is not loaded into the UWM Alertmanager or the platform Alertmanager.

Expected results:
Not sure what the expected result should be.

Additional info:

Comment 1 Simon Pasquier 2022-06-22 07:38:11 UTC
This is expected because settings from the UWM configmap take precedence. But CMO could surface the inconsistency in the Available condition with a specific reason/message (like we do already with the PrometheusDataPersistenceNotConfigured reason).
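
For reference, such a condition could be inspected with a command along these lines (jsonpath filtering is standard oc/kubectl; the Degraded condition type is what comment 5 later shows the fix actually sets):
# oc get co monitoring -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'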

Comment 5 Junqi Zhao 2022-09-01 10:19:38 UTC
Tested with 4.12.0-0.nightly-2022-08-31-101631 and followed the steps in comment 0, which do not attach a PV; the message is as below:
# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2022-09-01T00:50:08Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded
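
For context, persistent storage for Prometheus is configured in the same cluster-monitoring-config config map; a minimal sketch (assuming the cluster has a default storage class; the 40Gi size is illustrative) would add the following under config.yaml:
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi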

If we attach PVs for Prometheus and keep the same settings as in comment 0, we see the UserAlertmanagerMisconfigured message. The issue is fixed; changing to VERIFIED.
# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2022-09-01T09:54:38Z"
    message: 'Misconfigured Alertmanager:  Alertmanager for user-defined alerting
      is enabled in the openshift-monitoring/cluster-monitoring-config configmap by
      setting ''enableUserAlertmanagerConfig: true'' field. This conflicts with a
      dedicated Alertmanager instance enabled in  openshift-user-workload-monitoring/user-workload-monitoring-config.
      Alertmanager enabled in openshift-user-workload-monitoring takes precedence
      over the one in openshift-monitoring, so please remove the ''enableUserAlertmanagerConfig''
      field in openshift-monitoring/cluster-monitoring-config.'
    reason: UserAlertmanagerMisconfigured
    status: "False"
    type: Degraded
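
Following the resolution steps in that message, the platform config map would be reduced to the following (a sketch: the config from comment 0 with the conflicting enableUserAlertmanagerConfig field removed):
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true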

Comment 8 errata-xmlrpc 2023-01-17 19:50:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

