Bug 1832124
Summary: Upgrading from 4.3.18 to 4.4.3 causes Prometheus to create new PVCs

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Version | 4.4 |
| Target Milestone | --- |
| Target Release | 4.5.0 |
| Hardware | All |
| OS | All |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Reporter | Muhammad Aizuddin Zali <mzali> |
| Assignee | Paul Gier <pgier> |
| QA Contact | Junqi Zhao <juzhao> |
| Docs Contact | |
| CC | alegrand, anpicker, dyocum, erooth, juzhao, kakkoyun, lcosic, lmohanty, mloibl, mzali, pamoedom, pgier, pkrupa, rsandu, sdodson, surbania, wking |
| Keywords | Upgrades |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2020-07-13 17:35:12 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1833427, 1839911 (view as bug list) |
| Attachments | |

Doc Text (Bug Fix):

Cause: A bug in the handling of metadata related to the Prometheus PVC name can cause upgrade failures to or from versions 4.4.0-4.4.8.

Consequence: The upgrade may fail due to the lack of a new PV, and metric data will be lost if it is not manually migrated.

Fix: Copy data from the old persistent volumes to the new ones to retain metric data.

Result: Prometheus will use the copied data and will be able to access historical metrics, and the upgrade will complete.
Description (Muhammad Aizuddin Zali, 2020-05-06 07:00:22 UTC)

Created attachment 1686444 [details]
AWS Cluster Prom operator
Set metadata.name for prometheus and alertmanager:

# oc -n openshift-monitoring get cm/cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus
        ...
    alertmanagerMain:
      volumeClaimTemplate:
        metadata:
          name: alertmanager
        ...
kind: ConfigMap

# oc -n openshift-monitoring get pvc
NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-alertmanager-main-0   Bound    pvc-48d8c41d-d2ba-4a2f-aac8-ca3a373f79f3   4Gi        RWO            gp2            12m
alertmanager-alertmanager-main-1   Bound    pvc-c3df3cef-41b5-4c65-884d-85c67c478388   4Gi        RWO            gp2            12m
alertmanager-alertmanager-main-2   Bound    pvc-6de8a046-4d5d-4d2b-a847-3a080c65e79b   4Gi        RWO            gp2            12m
prometheus-prometheus-k8s-0        Bound    pvc-dbd99862-c7a9-46fb-a115-ec7d56c44347   10Gi       RWO            gp2            12m
prometheus-prometheus-k8s-1        Bound    pvc-7a0ba5ca-9853-439d-ac62-cb578712a85d   10Gi       RWO            gp2            12m

# oc -n openshift-monitoring get sts/alertmanager-main -oyaml
...
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: alertmanager
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 4Gi
      storageClassName: gp2
      volumeMode: Filesystem
...

# oc -n openshift-monitoring get sts/prometheus-k8s -oyaml
...
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: prometheus
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: gp2
      volumeMode: Filesystem
...

*** Bug 1793328 has been marked as a duplicate of this bug. ***
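An editorial aside on the mechanism behind the output above: a StatefulSet names each claim as <volumeClaimTemplate.metadata.name>-<pod name>, so when the template name rendered into the prometheus-k8s StatefulSet changes (for example between the default prometheus-k8s-db and a custom name such as prometheus), the pods bind brand-new, empty claims while the claims with the old prefix keep the historical data. A minimal way to compare the two on a live cluster, using only standard oc commands:

# Name(s) the operator rendered into the StatefulSet's volumeClaimTemplates:
oc -n openshift-monitoring get sts prometheus-k8s \
  -o jsonpath='{.spec.volumeClaimTemplates[*].metadata.name}{"\n"}'

# Claims that actually exist; any claim whose prefix does not match the name
# printed above is a leftover volume that still holds the old metric data:
oc -n openshift-monitoring get pvc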
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact? Is it serious enough to warrant blocking edges?
  example: Clusters lose all historic metrics from before the update. This might cause...
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: The data is gone; there is no possible remediation.
Is this a regression?
  example: No, minor-version-bumping updates have always cleared Prometheus data.
  example: Yes. 4.2 -> 4.3 preserved Prometheus data by..., so this is new in 4.3 -> 4.4.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  Customers upgrading from 4.3 to 4.4.0-4.4.8 who followed the documentation for configuring a local PVC for prometheus storage (https://docs.openshift.com/container-platform/4.4/monitoring/cluster_monitoring/configuring-the-monitoring-stack.html#configuring-a-local-persistent-volume-claim_configuring-monitoring).
  Customers upgrading from 4.4.0-4.4.8 to 4.4.9 or 4.5.x who followed the same documentation will likely also be affected.
  Customers upgrading from 4.3.x to 4.4.9 and higher will *not* be affected.
What is the impact? Is it serious enough to warrant blocking edges?
  Customers will need to either migrate prometheus data from one PV to another, or they will lose historic metric data. We believe this is serious enough to warrant blocking edges.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Remediation is moderately involved/difficult and involves copying data from one PV to another (see the sketch below, after the workaround link).
Is this a regression?
  Yes, this is a regression that affects releases between 4.4.0 and 4.4.8.

Created a related documentation issue: https://bugzilla.redhat.com/show_bug.cgi?id=1848738

FWIW, I have documented a workaround here (still in progress): https://access.redhat.com/solutions/5174781
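To make the "copy data from one PV to another" step a bit more concrete, here is a rough editorial sketch; it is not the KCS procedure linked above. It assumes the historical data sits in prometheus-k8s-db-prometheus-k8s-0 and the newly created, empty claim is prometheus-prometheus-k8s-0 (names taken from this report); check oc -n openshift-monitoring get pvc to see which claim actually holds your data, substitute your own claim names, and repeat for each replica ordinal. The helper image is an assumption too: any small image that provides sh and cp works, and depending on SCC/UID settings you may need to match the Prometheus pod's securityContext so the files stay readable.

# 1. Stop reconciliation and shut Prometheus down so neither RWO volume is attached.
#    Scaling down the CVO is disruptive; normally only do this under support guidance.
oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas=0
oc -n openshift-monitoring scale deployment cluster-monitoring-operator --replicas=0
oc -n openshift-monitoring scale deployment prometheus-operator --replicas=0
oc -n openshift-monitoring scale statefulset prometheus-k8s --replicas=0

# 2. Run a throwaway pod that mounts both claims and copies the whole volume
#    contents across (copying the volume root keeps the internal TSDB layout intact).
#    Both claims must be attachable from the same node; for EBS/RWO that means
#    the same availability zone.
cat <<'EOF' | oc -n openshift-monitoring create -f -
apiVersion: v1
kind: Pod
metadata:
  name: prom-data-copy
spec:
  restartPolicy: Never
  containers:
  - name: copy
    image: registry.access.redhat.com/ubi8/ubi-minimal   # assumption: any image with sh and cp
    command: ["sh", "-c", "cp -a /old/. /new/"]
    volumeMounts:
    - name: old
      mountPath: /old
    - name: new
      mountPath: /new
  volumes:
  - name: old
    persistentVolumeClaim:
      claimName: prometheus-k8s-db-prometheus-k8s-0   # claim that still holds the data
  - name: new
    persistentVolumeClaim:
      claimName: prometheus-prometheus-k8s-0          # empty claim the new StatefulSet binds
EOF

# Wait until the pod reports Completed, then remove it.
oc -n openshift-monitoring get pod prom-data-copy -w
oc -n openshift-monitoring delete pod prom-data-copy

# 3. Scale everything back up; the operators re-reconcile the monitoring stack.
oc -n openshift-monitoring scale statefulset prometheus-k8s --replicas=2
oc -n openshift-monitoring scale deployment prometheus-operator --replicas=1
oc -n openshift-monitoring scale deployment cluster-monitoring-operator --replicas=1
oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas=1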
Hi all, IHAC using OCS who recently upgraded from 4.4.6 to 4.4.9 and lost all previous PVCs from alertmanager and prometheus (they were simply replaced). I think the patch pushed via 4.4.9 (BZ#1833427) could be related; can someone please investigate/corroborate? Thanks.

From the case description:

~~~
oc -n openshift-monitoring create configmap cluster-monitoring-config --from-file=config.yaml

the config.yaml is:

prometheusOperator:
  baseImage: quay.io/coreos/prometheus-operator
  prometheusConfigReloaderBaseImage: quay.io/coreos/prometheus-config-reloader
  configReloaderBaseImage: quay.io/coreos/configmap-reload
  nodeSelector:
    node-role.kubernetes.io/infra: ""
prometheusK8s:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  retention: 48h
  baseImage: openshift/prometheus
  volumeClaimTemplate:
    metadata:
      name: ocs-prometheus-claim
    spec:
      storageClassName: ocs-storagecluster-ceph-rbd
      resources:
        requests:
          storage: 100Gi
alertmanagerMain:
  baseImage: openshift/prometheus-alertmanager
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  volumeClaimTemplate:
    metadata:
      name: ocs-alertmanager-claim
    spec:
      storageClassName: ocs-storagecluster-ceph-rbd
      resources:
        requests:
          storage: 20Gi
kubeStateMetrics:
  baseImage: quay.io/coreos/kube-state-metrics
  nodeSelector:
    node-role.kubernetes.io/infra: ""
grafana:
  baseImage: grafana/grafana
  nodeSelector:
    node-role.kubernetes.io/infra: ""
telemeterClient:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
k8sPrometheusAdapter:
  nodeSelector:
    node-role.kubernetes.io/infra: ""

the original volumes deployed 9 days ago are:

pvc-04f59982-fb5d-4f18-af3a-a881fce5de0b   100Gi   RWO   Delete   Bound   openshift-monitoring/prometheus-k8s-db-prometheus-k8s-0         ocs-storagecluster-ceph-rbd   9d
pvc-1cb2c2a5-cf47-465c-92c6-4bc64293efb3   20Gi    RWO   Delete   Bound   openshift-monitoring/alertmanager-main-db-alertmanager-main-0   ocs-storagecluster-ceph-rbd   9d
pvc-4c8429fe-547d-4bea-b8e3-3566290d2659   20Gi    RWO   Delete   Bound   openshift-monitoring/alertmanager-main-db-alertmanager-main-1   ocs-storagecluster-ceph-rbd   9d
pvc-5c0f0c9d-e98f-48e4-9815-7c4e8eea1abf   20Gi    RWO   Delete   Bound   openshift-monitoring/alertmanager-main-db-alertmanager-main-2   ocs-storagecluster-ceph-rbd   9d
pvc-a20075bb-4161-4714-bf87-86ffc4961da8   100Gi   RWO   Delete   Bound   openshift-monitoring/prometheus-k8s-db-prometheus-k8s-1         ocs-storagecluster-ceph-rbd   9d

I noticed that after the upgrade I have a second set of volumes and my Prometheus history is no longer displayed in Grafana:

pvc-0ea759d4-0b02-4829-8328-6e5b5ab9a2b0   20Gi    RWO   Delete   Bound   openshift-monitoring/ocs-alertmanager-claim-alertmanager-main-0   ocs-storagecluster-ceph-rbd   15h
pvc-29a18e3f-3c7b-44d5-baa7-091c874d8161   20Gi    RWO   Delete   Bound   openshift-monitoring/ocs-alertmanager-claim-alertmanager-main-1   ocs-storagecluster-ceph-rbd   15h
pvc-31bf1991-6d9b-4803-9881-604996e5528c   100Gi   RWO   Delete   Bound   openshift-monitoring/ocs-prometheus-claim-prometheus-k8s-1        ocs-storagecluster-ceph-rbd   15h
pvc-a91ae6f1-d764-45f5-9b09-cf0eaf096146   100Gi   RWO   Delete   Bound   openshift-monitoring/ocs-prometheus-claim-prometheus-k8s-0        ocs-storagecluster-ceph-rbd   15h
pvc-d8c70ddc-e6fb-41bc-9a56-39217bfe8198   20Gi    RWO   Delete   Bound   openshift-monitoring/ocs-alertmanager-claim-alertmanager-main-2   ocs-storagecluster-ceph-rbd   15h

what happened?
~~~

NOTE: I will attach the must-gather logs ASAP; it's a big one and I need to split it first, removing audit logs, etc.

Best Regards.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
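A closing editorial note for the OCS report above (not part of the original case): whether any data from the replaced claims is still recoverable depends on the reclaim policy of the underlying PVs. A Released PV with policy Retain still holds the data and can be re-bound to a new claim, whereas policy Delete (which is what the listings above show) means the backing volume is removed once the old claim is deleted. A quick way to check what is left:

# List monitoring-related PVs with their bound claim, phase, and reclaim policy.
oc get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,STATUS:.status.phase,RECLAIM:.spec.persistentVolumeReclaimPolicy \
  | grep -i -e prometheus -e alertmanager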