Description of problem:
The Prometheus deployment creates a new PVC instead of using the previous PVC when upgrading from 4.3.18 to 4.4.3.

Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
Only able to try once; assuming this is reproducible every time a cluster is upgraded from 4.3.18 to 4.4.3. The cluster is a standard deployment as per the official docs, nothing customized.

Steps to Reproduce:
1. Use the local-storage storage class for the PVs.
2. Upgrade from 4.3.18 to 4.4.3; the Prometheus deployment creates a new PVC instead of using the previous one.

mzali ~ oc get pods -l app=prometheus
NAME               READY   STATUS    RESTARTS   AGE
prometheus-k8s-0   0/7     Pending   0          54m
prometheus-k8s-1   0/7     Pending   0          54m

mzali ~ oc get pvc
NAME                                 STATUS    VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
local-sc-pvc-prometheus-k8s-0        Bound     local-pv-9f13a8e4   20Gi       RWO            local-sc       49d
local-sc-pvc-prometheus-k8s-1        Bound     local-pv-fb8df8fb   20Gi       RWO            local-sc       49d
prometheus-k8s-db-prometheus-k8s-0   Pending                                                 local-sc       54m
prometheus-k8s-db-prometheus-k8s-1   Pending                                                 local-sc       54m

mzali ~ oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                                STORAGECLASS   REASON   AGE
local-pv-9f13a8e4   20Gi       RWO            Delete           Bound      openshift-monitoring/local-sc-pvc-prometheus-k8s-0   local-sc                49d
local-pv-fb8df8fb   20Gi       RWO            Delete           Bound      openshift-monitoring/local-sc-pvc-prometheus-k8s-1   local-sc                49d
registry-pv         100Gi      RWX            Retain           Bound      openshift-image-registry/image-registry-storage                              28d
vault-pv            10Gi       RWO            Retain           Released   hashicorp/vault-storage                                                      7d20h

mzali ~ oc get sc
NAME       PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-sc   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  49d

Actual results:
The Prometheus upgrade failed because there is no free PV; the deployment did not use the existing PVC and created a new one instead.

Expected results:
The Prometheus upgrade should use the current PVC and should not create a new PVC.

Additional info:
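For context, the old `local-sc-pvc-*` PVC names above follow the StatefulSet pattern `<volumeClaimTemplate name>-<pod name>`, which is what you get when `metadata.name` is set in the monitoring config's volumeClaimTemplate. A config of roughly the following shape would produce them; this is a reconstructed sketch (the exact values were not captured from the affected cluster and are assumptions):

```yaml
# Hypothetical cluster-monitoring-config excerpt (values assumed, not taken
# from the affected cluster). With metadata.name set to "local-sc-pvc", the
# StatefulSet derives PVC names as <name>-<pod>, e.g.
# local-sc-pvc-prometheus-k8s-0. After the upgrade, the default template name
# "prometheus-k8s-db" is used instead, so new PVCs are requested.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: local-sc-pvc
        spec:
          storageClassName: local-sc
          resources:
            requests:
              storage: 20Gi
```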
Created attachment 1686444 [details] AWS Cluster Prom operator
Set metadata.name for prometheus and alertmanager:

# oc -n openshift-monitoring get cm/cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus
        ...
    alertmanagerMain:
      volumeClaimTemplate:
        metadata:
          name: alertmanager
        ...
kind: ConfigMap

# oc -n openshift-monitoring get pvc
NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-alertmanager-main-0   Bound    pvc-48d8c41d-d2ba-4a2f-aac8-ca3a373f79f3   4Gi        RWO            gp2            12m
alertmanager-alertmanager-main-1   Bound    pvc-c3df3cef-41b5-4c65-884d-85c67c478388   4Gi        RWO            gp2            12m
alertmanager-alertmanager-main-2   Bound    pvc-6de8a046-4d5d-4d2b-a847-3a080c65e79b   4Gi        RWO            gp2            12m
prometheus-prometheus-k8s-0        Bound    pvc-dbd99862-c7a9-46fb-a115-ec7d56c44347   10Gi       RWO            gp2            12m
prometheus-prometheus-k8s-1        Bound    pvc-7a0ba5ca-9853-439d-ac62-cb578712a85d   10Gi       RWO            gp2            12m

# oc -n openshift-monitoring get sts/alertmanager-main -oyaml
...
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: alertmanager
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 4Gi
      storageClassName: gp2
      volumeMode: Filesystem
...

# oc -n openshift-monitoring get sts/prometheus-k8s -oyaml
...
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: prometheus
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: gp2
      volumeMode: Filesystem
...
*** Bug 1793328 has been marked as a duplicate of this bug. ***
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Clusters lose all historic metrics from before the update. This might cause...

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: The data is gone; there is no possible remediation.

Is this a regression?
  example: No, minor-version-bumping updates have always cleared Prometheus data.
  example: Yes. 4.2 -> 4.3 preserved Prometheus data by..., so this is new in 4.3 -> 4.4.
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

Customers upgrading from 4.3 to 4.4.0-4.4.8 who followed the documentation for configuring a local PVC for Prometheus storage (https://docs.openshift.com/container-platform/4.4/monitoring/cluster_monitoring/configuring-the-monitoring-stack.html#configuring-a-local-persistent-volume-claim_configuring-monitoring). Customers upgrading from 4.4.0-4.4.8 to 4.4.9 or 4.5.x who followed the same documentation will likely also be affected. Customers upgrading from 4.3.x to 4.4.9 and higher will *not* be affected.

What is the impact? Is it serious enough to warrant blocking edges?

Customers will need to either migrate Prometheus data from one PV to another, or they will lose historic metric data. We believe this is serious enough to warrant blocking edges.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

Remediation is moderately involved/difficult and involves copying data from one PV to another.

Is this a regression?

Yes, this is a regression that affects releases between 4.4.0 and 4.4.8.
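For local (no-provisioner) volumes, one conceivable alternative to copying data is to rebind the old PV under the PVC name the upgraded StatefulSet expects. The following is only a hedged sketch, not the verified workaround (check the supported KCS article before attempting anything like this); the PV/PVC names are taken from the report above, and it assumes the old PVC has been deleted, the PV's reclaim policy switched to Retain beforehand, and its stale claimRef cleared so it returns to Available:

```yaml
# Hypothetical rebinding sketch, NOT the verified remediation. Pre-create the
# PVC name the 4.4 StatefulSet expects and pin it to the old PV via
# spec.volumeName, so the existing Prometheus data directory is reused.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-k8s-db-prometheus-k8s-0   # name expected after the upgrade
  namespace: openshift-monitoring
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-sc
  resources:
    requests:
      storage: 20Gi
  volumeName: local-pv-9f13a8e4   # old PV from the report above
```

Repeat per replica (e.g. `...-prometheus-k8s-1` pinned to `local-pv-fb8df8fb`), then delete the pending auto-created PVCs so the pods schedule against the rebound volumes.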
Created a related documentation issue: https://bugzilla.redhat.com/show_bug.cgi?id=1848738
FWIW, I have documented a workaround here (still in progress): https://access.redhat.com/solutions/5174781
Hi all, I have a customer using OCS who recently upgraded from 4.4.6 to 4.4.9 and lost all previous PVCs from Alertmanager and Prometheus (they were simply replaced). I think the patch pushed via 4.4.9 (BZ#1833427) could be related; can someone please investigate/corroborate? Thanks.

From the case description:
~~~
oc -n openshift-monitoring create configmap cluster-monitoring-config --from-file=config.yaml

the config.yaml is:

prometheusOperator:
  baseImage: quay.io/coreos/prometheus-operator
  prometheusConfigReloaderBaseImage: quay.io/coreos/prometheus-config-reloader
  configReloaderBaseImage: quay.io/coreos/configmap-reload
  nodeSelector:
    node-role.kubernetes.io/infra: ""
prometheusK8s:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  retention: 48h
  baseImage: openshift/prometheus
  volumeClaimTemplate:
    metadata:
      name: ocs-prometheus-claim
    spec:
      storageClassName: ocs-storagecluster-ceph-rbd
      resources:
        requests:
          storage: 100Gi
alertmanagerMain:
  baseImage: openshift/prometheus-alertmanager
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  volumeClaimTemplate:
    metadata:
      name: ocs-alertmanager-claim
    spec:
      storageClassName: ocs-storagecluster-ceph-rbd
      resources:
        requests:
          storage: 20Gi
kubeStateMetrics:
  baseImage: quay.io/coreos/kube-state-metrics
  nodeSelector:
    node-role.kubernetes.io/infra: ""
grafana:
  baseImage: grafana/grafana
  nodeSelector:
    node-role.kubernetes.io/infra: ""
telemeterClient:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
k8sPrometheusAdapter:
  nodeSelector:
    node-role.kubernetes.io/infra: ""

the original volumes deployed 9 days ago are:

pvc-04f59982-fb5d-4f18-af3a-a881fce5de0b   100Gi   RWO   Delete   Bound   openshift-monitoring/prometheus-k8s-db-prometheus-k8s-0         ocs-storagecluster-ceph-rbd   9d
pvc-1cb2c2a5-cf47-465c-92c6-4bc64293efb3   20Gi    RWO   Delete   Bound   openshift-monitoring/alertmanager-main-db-alertmanager-main-0   ocs-storagecluster-ceph-rbd   9d
pvc-4c8429fe-547d-4bea-b8e3-3566290d2659   20Gi    RWO   Delete   Bound   openshift-monitoring/alertmanager-main-db-alertmanager-main-1   ocs-storagecluster-ceph-rbd   9d
pvc-5c0f0c9d-e98f-48e4-9815-7c4e8eea1abf   20Gi    RWO   Delete   Bound   openshift-monitoring/alertmanager-main-db-alertmanager-main-2   ocs-storagecluster-ceph-rbd   9d
pvc-a20075bb-4161-4714-bf87-86ffc4961da8   100Gi   RWO   Delete   Bound   openshift-monitoring/prometheus-k8s-db-prometheus-k8s-1         ocs-storagecluster-ceph-rbd   9d

I noticed after the upgrade I have a second set of volumes, and my Prometheus history is no longer displayed in Grafana:

pvc-0ea759d4-0b02-4829-8328-6e5b5ab9a2b0   20Gi    RWO   Delete   Bound   openshift-monitoring/ocs-alertmanager-claim-alertmanager-main-0   ocs-storagecluster-ceph-rbd   15h
pvc-29a18e3f-3c7b-44d5-baa7-091c874d8161   20Gi    RWO   Delete   Bound   openshift-monitoring/ocs-alertmanager-claim-alertmanager-main-1   ocs-storagecluster-ceph-rbd   15h
pvc-31bf1991-6d9b-4803-9881-604996e5528c   100Gi   RWO   Delete   Bound   openshift-monitoring/ocs-prometheus-claim-prometheus-k8s-1        ocs-storagecluster-ceph-rbd   15h
pvc-a91ae6f1-d764-45f5-9b09-cf0eaf096146   100Gi   RWO   Delete   Bound   openshift-monitoring/ocs-prometheus-claim-prometheus-k8s-0        ocs-storagecluster-ceph-rbd   15h
pvc-d8c70ddc-e6fb-41bc-9a56-39217bfe8198   20Gi    RWO   Delete   Bound   openshift-monitoring/ocs-alertmanager-claim-alertmanager-main-2   ocs-storagecluster-ceph-rbd   15h

what happened?
~~~

NOTE: I will attach the must-gather logs ASAP; it's a big one and I need to split it first, removing audit logs, etc.

Best Regards.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475