Bug 2089224

Summary: openshift-monitoring/cluster-monitoring-config configmap always reverts to default settings
Product: OpenShift Container Platform
Component: HyperShift
Reporter: Junqi Zhao <juzhao>
Assignee: aaleman
QA Contact: Junqi Zhao <juzhao>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Version: 4.11
Target Release: 4.11.0
CC: aaleman, amuller, anpicker, calfonso, cewong, jmarcal, sjenning
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-08-10 11:13:27 UTC
Type: Bug

Description Junqi Zhao 2022-05-23 09:21:08 UTC
Description of problem:
Log in to a 4.11.0-0.nightly-2022-05-20-213928 HyperShift cluster with the guest cluster kubeconfig. A default cluster-monitoring-config configmap exists under the openshift-monitoring project (see bug 2089191). Update the configmap to attach PVs.
At first, the PVs are created and attached to the prometheus pods, but after a while cluster-monitoring-config is reverted to the default setting, which restarts the prometheus pods without any PVs attached.
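
All commands below are run against the guest cluster, e.g. (the kubeconfig path is a placeholder):
# export KUBECONFIG=<path-to-guest-cluster-kubeconfig>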

Update the configmap to attach PVs:
# oc -n openshift-monitoring get cm cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    alertmanagerMain: null
    enableUserWorkload: null
    grafana: null
    http: null
    k8sPrometheusAdapter: null
    kubeStateMetrics: null
    openshiftStateMetrics: null
    prometheusK8s:
      retention: 3h
      volumeClaimTemplate:
        metadata:
          name: prometheus
        spec:
          volumeMode: Filesystem
          resources:
            requests:
              storage: 10Gi
    prometheusOperator:
      logLevel: ""
      nodeSelector:
        kubernetes.io/os: linux
      tolerations: null
    telemeterClient: null
    thanosQuerier: null
kind: ConfigMap
metadata:
  creationTimestamp: "2022-05-23T03:29:46Z"
  labels:
    hypershift.io/managed: "true"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
  resourceVersion: "1516"
  uid: 02a7ae1e-4f03-4b85-bda5-871fa187cd10
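
For reference, an edit like the above can be applied with either of the following commands; the local file name is hypothetical:
# oc -n openshift-monitoring edit cm cluster-monitoring-config
# oc -n openshift-monitoring apply -f cluster-monitoring-config.yaml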

# oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                   STORAGECLASS   REASON   AGE
pvc-6c8c04dc-a009-4d46-a495-0b29290c6c4a   10Gi       RWO            Delete           Bound    openshift-monitoring/prometheus-prometheus-k8s-1        gp2                     53m
pvc-bac05a66-dcd0-4e24-aed4-e02fbabc5c8e   10Gi       RWO            Delete           Bound    openshift-monitoring/prometheus-prometheus-k8s-0        gp2                     53m

# oc -n openshift-monitoring get pvc
NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-prometheus-k8s-0        Bound    pvc-bac05a66-dcd0-4e24-aed4-e02fbabc5c8e   10Gi       RWO            gp2            53m
prometheus-prometheus-k8s-1        Bound    pvc-6c8c04dc-a009-4d46-a495-0b29290c6c4a   10Gi       RWO            gp2            53m

# oc -n openshift-monitoring get event | grep prometheus-k8s
...
64m         Normal    WaitForFirstConsumer     persistentvolumeclaim/prometheus-prometheus-k8s-0        waiting for first consumer to be created before binding
64m         Normal    ProvisioningSucceeded    persistentvolumeclaim/prometheus-prometheus-k8s-0        Successfully provisioned volume pvc-bac05a66-dcd0-4e24-aed4-e02fbabc5c8e using kubernetes.io/aws-ebs
64m         Normal    WaitForFirstConsumer     persistentvolumeclaim/prometheus-prometheus-k8s-1        waiting for first consumer to be created before binding
64m         Normal    ProvisioningSucceeded    persistentvolumeclaim/prometheus-prometheus-k8s-1        Successfully provisioned volume pvc-6c8c04dc-a009-4d46-a495-0b29290c6c4a using kubernetes.io/aws-ebs

After a while, the configmap is reverted:
# oc -n openshift-monitoring get cm cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    alertmanagerMain: null
    enableUserWorkload: null
    grafana: null
    http: null
    k8sPrometheusAdapter: null
    kubeStateMetrics: null
    openshiftStateMetrics: null
    prometheusK8s: null
    prometheusOperator:
      logLevel: ""
      nodeSelector:
        kubernetes.io/os: linux
      tolerations: null
    telemeterClient: null
    thanosQuerier: null
kind: ConfigMap
metadata:
  creationTimestamp: "2022-05-23T03:29:46Z"
  labels:
    hypershift.io/managed: "true"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
  resourceVersion: "60293"
  uid: 02a7ae1e-4f03-4b85-bda5-871fa187cd10
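
Note that the UID is unchanged while the resourceVersion jumped from "1516" to "60293", i.e. the object is rewritten in place rather than deleted and recreated. The revert can also be caught live with a watch, for example:
# oc -n openshift-monitoring get cm cluster-monitoring-config -w -o yaml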

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-20-213928 HyperShift cluster, accessed with the guest cluster kubeconfig

How reproducible:
always

Steps to Reproduce:
1. Update the cluster-monitoring-config configmap to attach PVs (see the YAML above).
2. Wait a while.
3. Check the configmap again.

Actual results:
After a while, the configmap is reverted to the default setting and the attached PVs are lost.

Expected results:
The configmap should not revert; user changes should persist.

Additional info:

Comment 1 Joao Marcal 2022-05-25 13:01:14 UTC
I think this is a bug for the HyperShift folks: it is not a problem in CMO but in the HyperShift controller, since CMO does not reset values under any condition unless the ConfigMap is removed. I suspect something in https://github.com/openshift/hypershift/blob/9fba0b6ed55808f86b1f9d5d13d2837cf5107b5e/control-plane-operator/hostedclusterconfigoperator/controllers/resources/monitoring/config.go#L20
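
One way to confirm which client keeps rewriting the object is to inspect its managed fields, for example:
# oc -n openshift-monitoring get cm cluster-monitoring-config --show-managed-fields -o yaml
If the suspicion above is right, the manager entries should show the HyperShift operator as the last writer of data.config.yaml.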

Comment 2 Cesar Wong 2022-05-26 21:02:08 UTC
This is definitely a bug in our reconciliation code. However, instead of fixing the reconciliation code, we should remove any reconciliation of this config.
@jmarcal if CMO can default the prometheus operator deployment node selector to not include master when running inside a cluster with a hosted control plane, then we can leave the config entirely to the user, as is the case with standalone OCP.
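
If that change lands, the effect should be visible in the guest cluster with something like:
# oc -n openshift-monitoring get deploy prometheus-operator -o jsonpath='{.spec.template.spec.nodeSelector}'
which should no longer include the master node-role selector on hosted control planes.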

Comment 4 Joao Marcal 2022-05-30 08:05:15 UTC
Just to clear the needinfo and make things more traceable: in CMO PR https://github.com/openshift/cluster-monitoring-operator/pull/1679 we changed the default prometheus operator deployment node selector so that it does not include master when running inside a cluster with a hosted control plane.

Comment 8 Junqi Zhao 2022-06-16 04:09:52 UTC
fix is in 4.11.0-0.nightly-2022-06-15-161625 and configmap could reloaded based on change

Comment 9 Junqi Zhao 2022-06-16 04:11:20 UTC
(In reply to Junqi Zhao from comment #8)
> fix is in 4.11.0-0.nightly-2022-06-15-161625 and configmap could reloaded
> based on change

Please ignore comment 8; it was pasted to this bug by mistake.

Comment 10 Junqi Zhao 2022-06-17 01:16:12 UTC
Tested a 4.11.0-0.nightly-2022-06-15-222801 HyperShift cluster with the guest cluster kubeconfig; the default cluster-monitoring-config configmap is now removed:
# oc -n openshift-monitoring get cm cluster-monitoring-config
Error from server (NotFound): configmaps "cluster-monitoring-config" not found

Following the steps in comment 0, we can configure monitoring now:
#  oc -n openshift-monitoring get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-prometheus-k8s-0   Bound    pvc-303fc231-2d44-4810-a49d-b7a510743d7e   10Gi       RWO            gp2            9m57s
#  oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep persistentVolumeClaim -A1
    persistentVolumeClaim:
      claimName: prometheus-prometheus-k8s-0
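
The custom retention from the configmap can be double-checked on the generated Prometheus object as well, for example:
# oc -n openshift-monitoring get prometheus k8s -o jsonpath='{.spec.retention}'
which should print 3h if the setting from the configmap was propagated.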

Comment 14 errata-xmlrpc 2022-08-10 11:13:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069