Bug 1881258

Summary: Prometheus (Community) Operator clashes with cluster-monitoring-config prometheus-k8s PVCs
Product: OpenShift Container Platform
Reporter: Brendan Shirren <bshirren>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED UPSTREAM
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: medium
Version: 4.5
CC: aabhishe, alegrand, anpicker, bshirren, dahernan, dsover, ecordell, erooth, kakkoyun, lcosic, mloibl, naoto30, nhale, pkrupa, spasquie, surbania
Flags: bshirren: needinfo-
Target Release: 4.7.0
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Doc Text:
Cause: A race between installing prometheus-operator via OLM and the in-cluster prometheus-operator can cause a clash of custom resources, because OLM may register a different version of the monitoring CRDs.
Consequence: If custom PVC names are specified in the "cluster-monitoring-config" ConfigMap (openshift-monitoring), the Prometheus and Alertmanager PVCs can revert to default naming.
Fix: Delete the prometheus-operator pod in the openshift-monitoring namespace. The pod is recreated and the CRDs are reinitialized to the correct version.
Result: The custom PVC names take effect again.
Last Closed: 2021-01-07 08:15:10 UTC
Type: Bug
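The workaround in the Fix field of the Doc Text amounts to deleting a single pod. Below is a minimal sketch. The function name is hypothetical, and it assumes the in-cluster operator pod carries the label `app.kubernetes.io/name=prometheus-operator`; verify that with `oc -n openshift-monitoring get pods --show-labels` before relying on the selector.

```shell
#!/bin/sh
# Hypothetical helper for the workaround in the Doc Text: delete the
# in-cluster prometheus-operator pod; its Deployment recreates it.
# ASSUMPTION: the pod has the label app.kubernetes.io/name=prometheus-operator.
# Check first with: oc -n openshift-monitoring get pods --show-labels
restart_cluster_prometheus_operator() {
  ns="openshift-monitoring"
  selector="app.kubernetes.io/name=prometheus-operator"
  oc -n "$ns" delete pod -l "$selector"
}
```

After the replacement pod is running, listing the PVCs in openshift-monitoring again shows whether the custom names are being used.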

Description Brendan Shirren 2020-09-22 01:53:54 UTC
Description of problem:

Installing the Prometheus (Community) Operator causes the Prometheus and Alertmanager PVCs configured in the "cluster-monitoring-config" ConfigMap (openshift-monitoring) to revert to default naming.


Version-Release number of selected component (if applicable):

OCP v4.4 and v4.5


How reproducible: Always?


Steps to Reproduce:
1. Follow the documentation [1] to specify custom PVC names for prometheus-k8s in the "cluster-monitoring-config" ConfigMap.
2. Install the Prometheus (Community) Operator into another namespace.
3. Check the prometheus-k8s and alertmanager PVCs in openshift-monitoring.
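Step 2 can be scripted with an OLM OperatorGroup and Subscription. This is a sketch under stated assumptions: the package, channel, and catalog source are taken from the subscription shown under "Additional info"; the "metrics" namespace is the example from Comment 7; the OperatorGroup name, `sourceNamespace: openshift-marketplace`, and the function name are assumptions, not values from the report.

```shell
#!/bin/sh
# Sketch of reproduction step 2 (hypothetical helper).
# ASSUMPTIONS: namespace "metrics" (from Comment 7), OperatorGroup name
# "<ns>-og", and sourceNamespace openshift-marketplace. Package "prometheus",
# channel "beta" and source "community-operators" come from the report.
install_community_prometheus_operator() {
  ns="${1:-metrics}"
  oc create namespace "$ns" || return 1
  # The OperatorGroup scopes the operator to its own namespace; the CRDs it
  # installs are still cluster-wide.
  oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: ${ns}-og
  namespace: ${ns}
spec:
  targetNamespaces:
  - ${ns}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: prometheus
  namespace: ${ns}
spec:
  channel: beta
  name: prometheus
  source: community-operators
  sourceNamespace: openshift-marketplace
EOF
}
```

Usage: `install_community_prometheus_operator metrics`, then continue with step 3.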


Actual results:

The Prometheus and Alertmanager PVCs configured in the "cluster-monitoring-config" ConfigMap revert to default naming.


Expected results:

The Prometheus and Alertmanager PVCs configured in the "cluster-monitoring-config" ConfigMap keep their custom names.


Additional info:

NAME         PACKAGE      SOURCE                CHANNEL
prometheus   prometheus   community-operators   beta

NAME                                           DISPLAY                  VERSION                 REPLACES                                       PHASE
elasticsearch-operator.4.5.0-202009041228.p0   Elasticsearch Operator   4.5.0-202009041228.p0   elasticsearch-operator.4.5.0-202008100413.p0   Succeeded
prometheusoperator.0.37.0                      Prometheus Operator      0.37.0                  prometheusoperator.0.32.0                      Succeeded


This resulted in the PVC names configured in "cluster-monitoring-config" reverting to the defaults:

$ oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertman-pvc-alertmanager-main-0           Bound    pvc-ba2de818-781a-11ea-8d7d-005056340097   10Gi       RWO            thin           164d
alertman-pvc-alertmanager-main-1           Bound    pvc-ba33ade2-781a-11ea-8d7d-005056340097   10Gi       RWO            thin           164d
alertman-pvc-alertmanager-main-2           Bound    pvc-ba3a83aa-781a-11ea-8d7d-005056340097   10Gi       RWO            thin           164d
alertmanager-main-db-alertmanager-main-0   Bound    pvc-007ba60e-3d2d-4fc0-8669-f81ffec30e69   10Gi       RWO            thin           80d
alertmanager-main-db-alertmanager-main-1   Bound    pvc-076431be-a1fa-4f86-835c-c42cc5c20c93   10Gi       RWO            thin           80d
alertmanager-main-db-alertmanager-main-2   Bound    pvc-ffee27f8-bf01-446c-890b-f309ec86f4fa   10Gi       RWO            thin           80d
prom-pvc-prometheus-k8s-0                  Bound    pvc-bfff7fdf-781a-11ea-8d7d-005056340097   40Gi       RWO            thin           164d
prom-pvc-prometheus-k8s-1                  Bound    pvc-c00c6d58-781a-11ea-8d7d-005056340097   40Gi       RWO            thin           164d
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-171ce912-83be-48fb-a1c7-41289e5228f4   40Gi       RWO            thin           80d
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-1ef5c852-96de-4197-b1f1-0024888ccb52   40Gi       RWO            thin           80d
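The listing mixes both naming schemes. StatefulSet PVCs are named `<claim template>-<StatefulSet>-<ordinal>`, so the default templates `prometheus-k8s-db` and `alertmanager-main-db` show up as name prefixes. A minimal sketch (hypothetical helper, assuming the default `oc get pvc` table layout with a header row) that labels each claim:

```shell
#!/bin/sh
# Hypothetical helper: read `oc get pvc` table output on stdin and label
# each PVC as using the default claim-template name or a custom one.
# Assumes the first line is the header and the first column is NAME.
classify_monitoring_pvcs() {
  awk 'NR > 1 {
    if ($1 ~ /^(prometheus-k8s-db|alertmanager-main-db)-/) kind = "default"
    else kind = "custom"
    print $1, kind
  }'
}
```

Usage: `oc -n openshift-monitoring get pvc | classify_monitoring_pvcs`. Applied to the listing above, the five 164-day-old claims (`alertman-pvc-*`, `prom-pvc-*`) come out as custom and the five 80-day-old claims as default.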



$ oc -n openshift-monitoring get cm cluster-monitoring-config -o yaml
apiVersion: v1
data:
  config.yaml: |
    alertmanagerMain:
      nodeSelector:
        role: infra
      volumeClaimTemplate:
        metadata:
          name: alertman-pvc
        spec:
          storageClassName: "thin"
          resources:
            requests:
              storage: 10Gi
    prometheusK8s:
      nodeSelector:
        role: infra
      retention: 30d
      volumeClaimTemplate:
        metadata:
          name: prom-pvc
        spec:
          storageClassName: "thin"
          resources:
            requests:
              storage: 40Gi
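To see at which layer the custom name is lost, one can compare the claim-template name on the custom resource generated from this ConfigMap with the one on the resulting StatefulSet. A sketch under assumptions: the Prometheus and Alertmanager custom resources in openshift-monitoring are named `k8s` and `main`, the StatefulSets are `prometheus-k8s` and `alertmanager-main` (matching the pod names in the PVC listing), and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the claim-template name on the custom resource
# (kind + name) and on the StatefulSet generated from it.
# ASSUMPTION: CR names k8s/main and StatefulSets prometheus-k8s/alertmanager-main.
show_claim_template_names() {
  kind="$1" cr="$2" sts="$3"
  ns="openshift-monitoring"
  printf 'cr: '
  oc -n "$ns" get "$kind" "$cr" \
    -o jsonpath='{.spec.storage.volumeClaimTemplate.metadata.name}'
  printf '\nstatefulset: '
  oc -n "$ns" get statefulset "$sts" \
    -o jsonpath='{.spec.volumeClaimTemplates[*].metadata.name}'
  printf '\n'
}
```

Usage: `show_claim_template_names prometheus k8s prometheus-k8s` and `show_claim_template_names alertmanager main alertmanager-main`. With the configuration above, both lines should read `prom-pvc` (or `alertman-pvc`) when the custom names are honored; an empty or default value shows where the name was dropped.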



[1] https://docs.openshift.com/container-platform/4.4/monitoring/cluster_monitoring/configuring-the-monitoring-stack.html#configuring-persistent-storage

Comment 7 David Hernández Fernández 2020-10-15 07:50:00 UTC
The community Prometheus Operator is installed in a completely different namespace (i.e. the "metrics" namespace), yet that triggered the Prometheus and Alertmanager pods in openshift-monitoring to be recycled. The age of the StatefulSets, Deployments, etc. shows that their parent resources weren't replaced.

I guess expecting OLM to deny the installation of the community operator is not the way to go; instead, we should ensure that the same resource IDs are not reused, or that they do not affect the rest of the namespaces.

Comment 8 Simon Pasquier 2020-10-20 12:07:48 UTC
*** Bug 1889681 has been marked as a duplicate of this bug. ***

Comment 10 Naoto Sano 2020-11-12 01:56:58 UTC
In my opinion, CMO should use CRD names that differ from the community Prometheus Operator's, e.g. *.monitoring.openshift.io instead of *.monitoring.coreos.com.

So IMO this bug should be filed against CMO, not OLM.
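Comments 7 and 10 both hinge on the fact that CustomResourceDefinitions are cluster-scoped: whichever operator registers `prometheuses.monitoring.coreos.com` defines that type for every namespace, regardless of where the operator itself runs. A small sketch (hypothetical helper) that lists the CRDs in the shared `monitoring.coreos.com` group from `oc get crd -o name` output:

```shell
#!/bin/sh
# Hypothetical helper: keep only monitoring.coreos.com CRDs from
# `oc get crd -o name` output and strip the resource-type prefix.
monitoring_coreos_crds() {
  grep '\.monitoring\.coreos\.com$' |
    sed 's|^customresourcedefinition\.apiextensions\.k8s\.io/||'
}
```

Usage: `oc get crd -o name | monitoring_coreos_crds`. Each name appears only once cluster-wide, which is why an OLM-installed operator in "metrics" and the in-cluster operator in openshift-monitoring end up competing for the same definitions.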