Bug 2099939

Summary: with only the UWM Alertmanager enabled, a user project AlertmanagerConfig is not loaded into the UWM Alertmanager or the platform Alertmanager
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.11
Target Release: 4.12.0
Hardware: Unspecified
OS: Unspecified
Severity: low
Priority: low
Status: CLOSED ERRATA
Reporter: Junqi Zhao <juzhao>
Assignee: Joao Marcal <jmarcal>
QA Contact: Junqi Zhao <juzhao>
Docs Contact: Brian Burt <bburt>
CC: anpicker, bburt
Doc Type: Bug Fix
Doc Text:
* Previously, if you enabled an instance of Alertmanager dedicated to user-defined projects, a misconfiguration could occur in certain circumstances, and you would not be informed that the user-defined project Alertmanager config map settings did not load for either the main instance of Alertmanager or the instance dedicated to user-defined projects. With this release, if this misconfiguration occurs, the Cluster Monitoring Operator now displays a message that informs you of the issue and provides resolution steps. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2099939[*BZ#2099939*])
Type: Bug
Last Closed: 2023-01-17 19:50:08 UTC

Description Junqi Zhao 2022-06-22 04:12:23 UTC
Description of problem:
This is a corner case; we could close it if the scenario is not valid.

With user workload monitoring (UWM) and enableUserAlertmanagerConfig enabled in the cluster-monitoring-config config map, and only the UWM Alertmanager enabled, a user project AlertmanagerConfig is not loaded into the UWM Alertmanager or the platform Alertmanager.

Enable UWM and enableUserAlertmanagerConfig in the cluster-monitoring-config config map:
# oc -n openshift-monitoring get cm cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
kind: ConfigMap
metadata:
  creationTimestamp: "2022-06-22T02:47:57Z"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
  resourceVersion: "73038"
  uid: bac78028-354f-4fdd-81a2-bb3b3601744b


Enable only the UWM Alertmanager:
# oc -n openshift-user-workload-monitoring get cm user-workload-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    alertmanager:
      enabled: true
kind: ConfigMap
metadata:
  creationTimestamp: "2022-06-22T02:48:04Z"
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
  resourceVersion: "86637"
  uid: 96e32301-16d1-47e7-8ced-fe0668b678cb
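
For a quick sanity check that the dedicated instance came up before inspecting its loaded configuration (the pod name matches the exec commands below):
# oc -n openshift-user-workload-monitoring get pod alertmanager-user-workload-0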

Default Alertmanager configuration in the UWM Alertmanager:
# oc -n openshift-user-workload-monitoring exec -c alertmanager alertmanager-user-workload-0 -- cat /etc/alertmanager/config/alertmanager.yaml
"receivers":
- "name": "Default"
"route":
  "group_by":
  - "namespace"
  "receiver": "Default"


Create an AlertmanagerConfig under user project ns1:
***********************
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: example-routing
  namespace: ns1
spec:
  route:
    receiver: default
    groupBy: [job]
  receivers:
  - name: default
    webhookConfigs:
    - url: https://example.org/post
***********************
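
Apply the manifest above (the file name is illustrative):
# oc -n ns1 apply -f example-routing.yaml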

# oc -n ns1 get AlertmanagerConfig example-routing -oyaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  creationTimestamp: "2022-06-22T02:59:37Z"
  generation: 1
  name: example-routing
  namespace: ns1
  resourceVersion: "64491"
  uid: e31c35ad-bf78-4bc3-ace5-63b4a1f70c65
spec:
  receivers:
  - name: default
    webhookConfigs:
    - url: https://example.org/post
  route:
    groupBy:
    - job
    receiver: default

The ns1 AlertmanagerConfig is not loaded into the UWM Alertmanager:
# oc -n openshift-user-workload-monitoring exec -c alertmanager alertmanager-user-workload-0 -- cat /etc/alertmanager/config/alertmanager.yaml
"receivers":
- "name": "Default"
"route":
  "group_by":
  - "namespace"
  "receiver": "Default"

It is not loaded into the platform Alertmanager either:
# oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- cat /etc/alertmanager/config/alertmanager.yaml
"global":
  "resolve_timeout": "5m"
"inhibit_rules":
- "equal":
  - "namespace"
  - "alertname"
  "source_matchers":
  - "severity = critical"
  "target_matchers":
  - "severity =~ warning|info"
- "equal":
  - "namespace"
  - "alertname"
  "source_matchers":
  - "severity = warning"
  "target_matchers":
  - "severity = info"
- "equal":
  - "namespace"
  "source_matchers":
  - "alertname = InfoInhibitor"
  "target_matchers":
  - "severity = info"
"receivers":
- "name": "Default"
- "name": "Watchdog"
- "name": "Critical"
- "name": "null"
"route":
  "group_by":
  - "namespace"
  "group_interval": "5m"
  "group_wait": "30s"
  "receiver": "Default"
  "repeat_interval": "12h"
  "routes":
  - "matchers":
    - "alertname = Watchdog"
    "receiver": "Watchdog"
  - "matchers":
    - "alertname = InfoInhibitor"
    "receiver": "null"
  - "matchers":
    - "severity = critical"
    "receiver": "Critical"

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-21-151125

How reproducible:
always

Steps to Reproduce:
1. See the description above.

Actual results:
With only the UWM Alertmanager enabled, the user project AlertmanagerConfig is not loaded into the UWM Alertmanager or the platform Alertmanager.

Expected results:
Not sure what the expected result should be.

Additional info:

Comment 1 Simon Pasquier 2022-06-22 07:38:11 UTC
This is expected because settings from the UWM configmap take precedence. But CMO could surface the inconsistency in the Available condition with a specific reason/message (like we do already with the PrometheusDataPersistenceNotConfigured reason).
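
The non-conflicting setup (spelled out later by the CMO message in comment 5) keeps the dedicated UWM Alertmanager and drops the enableUserAlertmanagerConfig field. A minimal sketch of the cluster-monitoring-config data after that change:
  config.yaml: |
    enableUserWorkload: true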

Comment 5 Junqi Zhao 2022-09-01 10:19:38 UTC
Tested with 4.12.0-0.nightly-2022-08-31-101631, following the steps in comment 0, which do not attach a PV; the message is as below:
# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2022-09-01T00:50:08Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded
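
Attaching persistent storage clears this condition. The report does not show the exact settings used; a minimal sketch of the kind of storage configuration added to cluster-monitoring-config, with the comment 0 settings kept (storage class and size are illustrative assumptions):
  config.yaml: |
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          resources:
            requests:
              storage: 40Gi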

If we attach PVs for Prometheus and keep the same settings as comment 0, we see the UserAlertmanagerMisconfigured message. The issue is fixed, so I am changing the status to VERIFIED.
# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2022-09-01T09:54:38Z"
    message: 'Misconfigured Alertmanager:  Alertmanager for user-defined alerting
      is enabled in the openshift-monitoring/cluster-monitoring-config configmap by
      setting ''enableUserAlertmanagerConfig: true'' field. This conflicts with a
      dedicated Alertmanager instance enabled in  openshift-user-workload-monitoring/user-workload-monitoring-config.
      Alertmanager enabled in openshift-user-workload-monitoring takes precedence
      over the one in openshift-monitoring, so please remove the ''enableUserAlertmanagerConfig''
      field in openshift-monitoring/cluster-monitoring-config.'
    reason: UserAlertmanagerMisconfigured
    status: "False"
    type: Degraded

Comment 8 errata-xmlrpc 2023-01-17 19:50:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399