Description of problem:

This bug report originates from https://bugzilla.redhat.com/show_bug.cgi?id=1967614#c21

Two Prometheus pods tied to one instance (node) due to their volumes is a high-severity bug, and the admin needs to take corrective action. Are we alerting on this situation now?

The PDB is what we want - these users are wasting resources (they expect Prometheus to be HA) and are not able to fix it. The product bug is not the PDB; the bug is that we allowed the cluster to get into this state and didn't notify the admin of why. I expect us to:
a) deliver an alert that flags this situation with corrective action
b) once that alert rate is down, redeliver the PDB in 4.9 to fix the issue
c) potentially broaden the alert, if necessary, to other similar cases

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Sometimes

Steps to Reproduce:
1. TBC

Actual results:
Nothing tells the cluster admin when both Prometheus pods are scheduled on the same node.

Expected results:
An alert fires.

Additional info:
See bug 1967614 and bug 1949262 for context.
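For what it's worth, a rough, illustrative way to check for this situation by hand is to count the prometheus-k8s pods per node with a PromQL query against kube-state-metrics' kube_pod_info metric. The expression below is only a sketch of the condition, not the exact rule that was eventually shipped:

# Sketch only: any result row means more than one prometheus-k8s replica is
# running on the same node, i.e. the pods are not spread for HA.
token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -k -G -H "Authorization: Bearer $token" \
  --data-urlencode 'query=count by (node) (kube_pod_info{namespace="openshift-monitoring", pod=~"prometheus-k8s-.*"}) > 1' \
  'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query'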
Tested with 4.9.0-0.nightly-2021-07-04-140102: bound PVs for the alertmanager/prometheus pods and scheduled all of these pods on the same node; the HighlyAvailableWorkloadIncorrectlySpread alert is triggered.

$ oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0   5/5   Running   0   3m26s   10.128.2.21   ip-10-0-165-247.ec2.internal   <none>   <none>
alertmanager-main-1   5/5   Running   0   3m26s   10.128.2.22   ip-10-0-165-247.ec2.internal   <none>   <none>
alertmanager-main-2   5/5   Running   0   3m26s   10.128.2.24   ip-10-0-165-247.ec2.internal   <none>   <none>
prometheus-k8s-0      7/7   Running   1   3m26s   10.128.2.23   ip-10-0-165-247.ec2.internal   <none>   <none>
prometheus-k8s-1      7/7   Running   1   3m26s   10.128.2.25   ip-10-0-165-247.ec2.internal   <none>   <none>

$ oc -n openshift-monitoring get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alert-alertmanager-main-0     Bound    pvc-a6c8d2e8-487b-471d-84b8-f87f8842055f   1Gi        RWO            gp2            3m35s
alert-alertmanager-main-1     Bound    pvc-f1f62b80-0cf7-421e-b6da-d28d87fd880d   1Gi        RWO            gp2            3m34s
alert-alertmanager-main-2     Bound    pvc-aecb815b-0bd1-40e3-a0f4-64f85a274989   1Gi        RWO            gp2            3m34s
prometheus-prometheus-k8s-0   Bound    pvc-e3a5829c-ada7-4134-b407-a5aeba21683b   2Gi        RWO            gp2            3m34s
prometheus-prometheus-k8s-1   Bound    pvc-74a9f586-2886-4a27-8613-9d1b17323a3f   2Gi        RWO            gp2            3m34s

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq
...
    {
      "labels": {
        "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
        "namespace": "openshift-monitoring",
        "node": "ip-10-0-165-247.ec2.internal",
        "severity": "warning",
        "workload": "alertmanager-main"
      },
      "annotations": {
        "description": "Workload openshift-monitoring/alertmanager-main is incorrectly spread across multiple nodes which breaks high-availability requirements. There are 3 pods on node ip-10-0-165-247.ec2.internal, where there should only be one. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
        "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
        "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
      },
      "state": "pending",
      "activeAt": "2021-07-05T03:09:35.421613011Z",
      "value": "3e+00"
    },
    {
      "labels": {
        "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
        "namespace": "openshift-monitoring",
        "node": "ip-10-0-165-247.ec2.internal",
        "severity": "warning",
        "workload": "prometheus-k8s"
      },
      "annotations": {
        "description": "Workload openshift-monitoring/prometheus-k8s is incorrectly spread across multiple nodes which breaks high-availability requirements. There are 2 pods on node ip-10-0-165-247.ec2.internal, where there should only be one. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
        "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
        "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
      },
      "state": "pending",
      "activeAt": "2021-07-05T03:09:35.421613011Z",
      "value": "2e+00"
    },
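Not part of the original verification, but the same API response can be narrowed down to just this alert with a jq filter, for example:

# Show only the HighlyAvailableWorkloadIncorrectlySpread entries from the alerts API.
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -k -H "Authorization: Bearer $token" \
  'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname == "HighlyAvailableWorkloadIncorrectlySpread")'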
The alert needed some corrections to handle cases where there are 3 instances of a particular workload but only 2 worker nodes, which is the minimum requirement for highly available OCP clusters. Moving the Bugzilla back to ASSIGNED to get the new changes verified again.
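For context, a quick way to see whether a cluster falls into that case (assuming the standard worker node-role label is in use) is to compare the worker count with the alertmanager replica count:

# With 2 schedulable workers and 3 alertmanager replicas, two pods must share
# a node, and the corrected alert should not fire for that unavoidable overlap.
oc get nodes -l node-role.kubernetes.io/worker= --no-headers | wc -l
oc -n openshift-monitoring get statefulset alertmanager-main -o jsonpath='{.spec.replicas}{"\n"}'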
Tested with 4.9.0-0.nightly-2021-07-11-143719, scenarios:
1. Bind PVs for the alertmanager/prometheus pods and schedule all of these pods on the same node: the HighlyAvailableWorkloadIncorrectlySpread alert is triggered for the alertmanager/prometheus pods, verified in Comment 5 (see the sketch after this comment for one way to force that co-location).
2. Bind PVs for the alertmanager/prometheus pods and spread these pods across the two nodes: the HighlyAvailableWorkloadIncorrectlySpread alert is not triggered.

# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0   5/5   Running   0   111s   10.129.2.16   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>   <none>
alertmanager-main-1   5/5   Running   0   111s   10.131.0.34   ci-ln-vhq70tb-f76d1-4gldx-worker-a-hkqkj   <none>   <none>
alertmanager-main-2   5/5   Running   0   111s   10.129.2.17   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>   <none>
prometheus-k8s-0      7/7   Running   0   2m     10.129.2.15   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>   <none>
prometheus-k8s-1      7/7   Running   0   2m     10.131.0.33   ci-ln-vhq70tb-f76d1-4gldx-worker-a-hkqkj   <none>   <none>

# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-41a9f119-6b07-41a2-a67c-63d2d1699a55   4Gi        RWO            standard       119s
alertmanager-main-db-alertmanager-main-1   Bound    pvc-7f7d4163-51b8-48f1-9fe2-db8a843341b3   4Gi        RWO            standard       119s
alertmanager-main-db-alertmanager-main-2   Bound    pvc-5bcbe21b-c5b1-4aa9-a59e-0af9066059f7   4Gi        RWO            standard       119s
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-a6d5445c-bdb6-49b7-9bbf-07a734a2fc17   10Gi       RWO            standard       2m8s
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-d07f9597-3af0-41bb-bcba-d594186d2b86   10Gi       RWO            standard       2m8s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels.alertname'
"AlertmanagerReceiversNotConfigured"
"Watchdog"
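As an aside, one possible way to reproduce scenario 1 on a cluster with several workers (an assumption on my side, not necessarily the procedure used above) is to cordon all but one worker before persistent storage is enabled for monitoring, so that every replica, and the PV created for it, lands on the remaining node:

# Hypothetical reproduction sketch: <keep-node> is a placeholder for the one
# worker that should receive all replicas. Cordon the other workers first,
# then (re)create the monitoring pods so the scheduler has only one choice.
for n in $(oc get nodes -l node-role.kubernetes.io/worker= -o name); do
  [ "$n" = "node/<keep-node>" ] || oc adm cordon "${n#node/}"
done
oc -n openshift-monitoring delete pod prometheus-k8s-0 prometheus-k8s-1 \
  alertmanager-main-0 alertmanager-main-1 alertmanager-main-2
# Undo afterwards with: oc adm uncordon <node>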
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759