Bug 1832124
Summary: Upgrading from 4.3.18 to 4.4.3 causes Prometheus to create new PVCs

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Version | 4.4 |
| Target Milestone | --- |
| Target Release | 4.5.0 |
| Hardware | All |
| OS | All |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Reporter | Muhammad Aizuddin Zali <mzali> |
| Assignee | Paul Gier <pgier> |
| QA Contact | Junqi Zhao <juzhao> |
| Docs Contact | |
| CC | alegrand, anpicker, dyocum, erooth, juzhao, kakkoyun, lcosic, lmohanty, mloibl, mzali, pamoedom, pgier, pkrupa, rsandu, sdodson, surbania, wking |
| Keywords | Upgrades |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2020-07-13 17:35:12 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1833427, 1839911 (view as bug list) |
| Attachments | |

Doc Text (Bug Fix):

Cause: A bug in the handling of metadata related to the Prometheus PVC name can cause upgrade failures to or from versions 4.4.0-4.4.8.

Consequence: The upgrade may fail due to the lack of a new PV, and metric data will be lost if it is not manually migrated.

Fix: Copy data from the old persistent volumes to the new ones to retain metric data.

Result: Prometheus will use the copied data and will be able to access historical metrics, and the upgrade will complete.
Description (Muhammad Aizuddin Zali, 2020-05-06 07:00:22 UTC)

Created attachment 1686444 [details]
AWS Cluster Prom operator
Set metadata.name for prometheus and alertmanager:

# oc -n openshift-monitoring get cm/cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus
        ...
    alertmanagerMain:
      volumeClaimTemplate:
        metadata:
          name: alertmanager
        ...
kind: ConfigMap

# oc -n openshift-monitoring get pvc
NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-alertmanager-main-0   Bound    pvc-48d8c41d-d2ba-4a2f-aac8-ca3a373f79f3   4Gi        RWO            gp2            12m
alertmanager-alertmanager-main-1   Bound    pvc-c3df3cef-41b5-4c65-884d-85c67c478388   4Gi        RWO            gp2            12m
alertmanager-alertmanager-main-2   Bound    pvc-6de8a046-4d5d-4d2b-a847-3a080c65e79b   4Gi        RWO            gp2            12m
prometheus-prometheus-k8s-0        Bound    pvc-dbd99862-c7a9-46fb-a115-ec7d56c44347   10Gi       RWO            gp2            12m
prometheus-prometheus-k8s-1        Bound    pvc-7a0ba5ca-9853-439d-ac62-cb578712a85d   10Gi       RWO            gp2            12m

# oc -n openshift-monitoring get sts/alertmanager-main -oyaml
...
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: alertmanager
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 4Gi
      storageClassName: gp2
      volumeMode: Filesystem
...

# oc -n openshift-monitoring get sts/prometheus-k8s -oyaml
...
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: prometheus
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: gp2
      volumeMode: Filesystem
...

*** Bug 1793328 has been marked as a duplicate of this bug. ***
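An editorial aside on the mechanism behind the output above: a StatefulSet names each claim as <volumeClaimTemplate.metadata.name>-<pod name>, so when the template name rendered into the prometheus-k8s StatefulSet changes (for example between the default prometheus-k8s-db and a custom name such as prometheus), the pods bind brand-new, empty claims while the claims with the old prefix keep the historical data. A minimal way to compare the two on a live cluster, using only standard oc commands:

# Name(s) the operator rendered into the StatefulSet's volumeClaimTemplates:
oc -n openshift-monitoring get sts prometheus-k8s \
  -o jsonpath='{.spec.volumeClaimTemplates[*].metadata.name}{"\n"}'

# Claims that actually exist; any claim whose prefix does not match the name
# printed above is a leftover volume that still holds the old metric data:
oc -n openshift-monitoring get pvc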
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact? Is it serious enough to warrant blocking edges?
  example: Clusters lose all historic metrics from before the update. This might cause...
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: The data is gone; there is no possible remediation.
Is this a regression?
  example: No, minor-version-bumping updates have always cleared Prometheus data.
  example: Yes. 4.2 -> 4.3 preserved Prometheus data by..., so this is new in 4.3 -> 4.4.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  Customers upgrading from 4.3 to 4.4.0-4.4.8 who followed the documentation for configuring a local PVC for prometheus storage (https://docs.openshift.com/container-platform/4.4/monitoring/cluster_monitoring/configuring-the-monitoring-stack.html#configuring-a-local-persistent-volume-claim_configuring-monitoring).
  Customers upgrading from 4.4.0-4.4.8 to 4.4.9 or 4.5.x who followed the same documentation will likely also be affected.
  Customers upgrading from 4.3.x to 4.4.9 and higher will *not* be affected.
What is the impact? Is it serious enough to warrant blocking edges?
  Customers will need to either migrate prometheus data from one PV to another, or they will lose historic metric data. We believe this is serious enough to warrant blocking edges.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Remediation is moderately involved/difficult and involves copying data from one PV to another (see the sketch below, after the workaround link).
Is this a regression?
  Yes, this is a regression that affects releases between 4.4.0 and 4.4.8.

Created a related documentation issue: https://bugzilla.redhat.com/show_bug.cgi?id=1848738

FWIW, I have documented a workaround here (still in progress): https://access.redhat.com/solutions/5174781
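To make the "copy data from one PV to another" step a bit more concrete, here is a rough editorial sketch; it is not the KCS procedure linked above. It assumes the historical data sits in prometheus-k8s-db-prometheus-k8s-0 and the newly created, empty claim is prometheus-prometheus-k8s-0 (names taken from this report); check oc -n openshift-monitoring get pvc to see which claim actually holds your data, substitute your own claim names, and repeat for each replica ordinal. The helper image is an assumption too: any small image that provides sh and cp works, and depending on SCC/UID settings you may need to match the Prometheus pod's securityContext so the files stay readable.

# 1. Stop reconciliation and shut Prometheus down so neither RWO volume is attached.
#    Scaling down the CVO is disruptive; normally only do this under support guidance.
oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas=0
oc -n openshift-monitoring scale deployment cluster-monitoring-operator --replicas=0
oc -n openshift-monitoring scale deployment prometheus-operator --replicas=0
oc -n openshift-monitoring scale statefulset prometheus-k8s --replicas=0

# 2. Run a throwaway pod that mounts both claims and copies the whole volume
#    contents across (copying the volume root keeps the internal TSDB layout intact).
#    Both claims must be attachable from the same node; for EBS/RWO that means
#    the same availability zone.
cat <<'EOF' | oc -n openshift-monitoring create -f -
apiVersion: v1
kind: Pod
metadata:
  name: prom-data-copy
spec:
  restartPolicy: Never
  containers:
  - name: copy
    image: registry.access.redhat.com/ubi8/ubi-minimal   # assumption: any image with sh and cp
    command: ["sh", "-c", "cp -a /old/. /new/"]
    volumeMounts:
    - name: old
      mountPath: /old
    - name: new
      mountPath: /new
  volumes:
  - name: old
    persistentVolumeClaim:
      claimName: prometheus-k8s-db-prometheus-k8s-0   # claim that still holds the data
  - name: new
    persistentVolumeClaim:
      claimName: prometheus-prometheus-k8s-0          # empty claim the new StatefulSet binds
EOF

# Wait until the pod reports Completed, then remove it.
oc -n openshift-monitoring get pod prom-data-copy -w
oc -n openshift-monitoring delete pod prom-data-copy

# 3. Scale everything back up; the operators re-reconcile the monitoring stack.
oc -n openshift-monitoring scale statefulset prometheus-k8s --replicas=2
oc -n openshift-monitoring scale deployment prometheus-operator --replicas=1
oc -n openshift-monitoring scale deployment cluster-monitoring-operator --replicas=1
oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas=1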
Hi all, IHAC using OCS who recently upgraded from 4.4.6 to 4.4.9 and lost all previous PVCs from alertmanager and prometheus (they were simply replaced). I think the patch pushed via 4.4.9 (BZ#1833427) could be related; can someone please investigate/corroborate? Thanks.

From the case description:

~~~
oc -n openshift-monitoring create configmap cluster-monitoring-config --from-file=config.yaml

the config.yaml is:

prometheusOperator:
  baseImage: quay.io/coreos/prometheus-operator
  prometheusConfigReloaderBaseImage: quay.io/coreos/prometheus-config-reloader
  configReloaderBaseImage: quay.io/coreos/configmap-reload
  nodeSelector:
    node-role.kubernetes.io/infra: ""
prometheusK8s:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  retention: 48h
  baseImage: openshift/prometheus
  volumeClaimTemplate:
    metadata:
      name: ocs-prometheus-claim
    spec:
      storageClassName: ocs-storagecluster-ceph-rbd
      resources:
        requests:
          storage: 100Gi
alertmanagerMain:
  baseImage: openshift/prometheus-alertmanager
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  volumeClaimTemplate:
    metadata:
      name: ocs-alertmanager-claim
    spec:
      storageClassName: ocs-storagecluster-ceph-rbd
      resources:
        requests:
          storage: 20Gi
kubeStateMetrics:
  baseImage: quay.io/coreos/kube-state-metrics
  nodeSelector:
    node-role.kubernetes.io/infra: ""
grafana:
  baseImage: grafana/grafana
  nodeSelector:
    node-role.kubernetes.io/infra: ""
telemeterClient:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
k8sPrometheusAdapter:
  nodeSelector:
    node-role.kubernetes.io/infra: ""

the original volumes deployed 9 days ago are:

pvc-04f59982-fb5d-4f18-af3a-a881fce5de0b   100Gi   RWO   Delete   Bound   openshift-monitoring/prometheus-k8s-db-prometheus-k8s-0         ocs-storagecluster-ceph-rbd   9d
pvc-1cb2c2a5-cf47-465c-92c6-4bc64293efb3   20Gi    RWO   Delete   Bound   openshift-monitoring/alertmanager-main-db-alertmanager-main-0   ocs-storagecluster-ceph-rbd   9d
pvc-4c8429fe-547d-4bea-b8e3-3566290d2659   20Gi    RWO   Delete   Bound   openshift-monitoring/alertmanager-main-db-alertmanager-main-1   ocs-storagecluster-ceph-rbd   9d
pvc-5c0f0c9d-e98f-48e4-9815-7c4e8eea1abf   20Gi    RWO   Delete   Bound   openshift-monitoring/alertmanager-main-db-alertmanager-main-2   ocs-storagecluster-ceph-rbd   9d
pvc-a20075bb-4161-4714-bf87-86ffc4961da8   100Gi   RWO   Delete   Bound   openshift-monitoring/prometheus-k8s-db-prometheus-k8s-1         ocs-storagecluster-ceph-rbd   9d

I noticed that after the upgrade I have a second set of volumes and my Prometheus history is no longer displayed in Grafana:

pvc-0ea759d4-0b02-4829-8328-6e5b5ab9a2b0   20Gi    RWO   Delete   Bound   openshift-monitoring/ocs-alertmanager-claim-alertmanager-main-0   ocs-storagecluster-ceph-rbd   15h
pvc-29a18e3f-3c7b-44d5-baa7-091c874d8161   20Gi    RWO   Delete   Bound   openshift-monitoring/ocs-alertmanager-claim-alertmanager-main-1   ocs-storagecluster-ceph-rbd   15h
pvc-31bf1991-6d9b-4803-9881-604996e5528c   100Gi   RWO   Delete   Bound   openshift-monitoring/ocs-prometheus-claim-prometheus-k8s-1        ocs-storagecluster-ceph-rbd   15h
pvc-a91ae6f1-d764-45f5-9b09-cf0eaf096146   100Gi   RWO   Delete   Bound   openshift-monitoring/ocs-prometheus-claim-prometheus-k8s-0        ocs-storagecluster-ceph-rbd   15h
pvc-d8c70ddc-e6fb-41bc-9a56-39217bfe8198   20Gi    RWO   Delete   Bound   openshift-monitoring/ocs-alertmanager-claim-alertmanager-main-2   ocs-storagecluster-ceph-rbd   15h

what happened?
~~~

NOTE: I will attach the must-gather logs ASAP; it's a big one and I need to split it first, removing audit logs, etc.

Best Regards.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
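A closing editorial note for the OCS report above (not part of the original case): whether any data from the replaced claims is still recoverable depends on the reclaim policy of the underlying PVs. A Released PV with policy Retain still holds the data and can be re-bound to a new claim, whereas policy Delete (which is what the listings above show) means the backing volume is removed once the old claim is deleted. A quick way to check what is left:

# List monitoring-related PVs with their bound claim, phase, and reclaim policy.
oc get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,STATUS:.status.phase,RECLAIM:.spec.persistentVolumeReclaimPolicy \
  | grep -i -e prometheus -e alertmanager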