Bug 1974832

| Summary: | The monitoring stack should alert when 2 Prometheus pods are scheduled on the same node | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | Monitoring | Assignee: | Damien Grisonnet <dgrisonn> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | anpicker, aos-bugs, dofinn, erooth, vjaypurk, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| : | 1981246 (view as bug list) | | |
| Last Closed: | 2021-10-18 17:35:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1981246 | | |
Description
Simon Pasquier
2021-06-22 16:01:30 UTC
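The behavior requested in the summary (alert when several replicas of a highly-available workload land on the same node) could in principle be expressed as a PromQL rule over kube-state-metrics data. The expression below is a hypothetical sketch for illustration only; it is not the rule that was actually shipped, and the `workload` label derivation is an assumption:

```
# Hypothetical sketch (not the shipped rule): count pods of each
# monitoring workload per node and flag any node hosting more than one.
count by (node, workload) (
  label_replace(
    kube_pod_info{namespace="openshift-monitoring", pod=~"(prometheus-k8s|alertmanager-main).*"},
    "workload", "$1", "pod", "(prometheus-k8s|alertmanager-main).*"
  )
) > 1
```

As the later comments show, a naive `> 1` threshold is too strict on small clusters, which is exactly the correction this bug went through.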
Tested with 4.9.0-0.nightly-2021-07-04-140102: after binding PVs for the alertmanager/prometheus pods and scheduling all of these pods onto the same node, the HighlyAvailableWorkloadIncorrectlySpread alert is triggered.

```
$ oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0   5/5   Running   0   3m26s   10.128.2.21   ip-10-0-165-247.ec2.internal   <none>   <none>
alertmanager-main-1   5/5   Running   0   3m26s   10.128.2.22   ip-10-0-165-247.ec2.internal   <none>   <none>
alertmanager-main-2   5/5   Running   0   3m26s   10.128.2.24   ip-10-0-165-247.ec2.internal   <none>   <none>
prometheus-k8s-0      7/7   Running   1   3m26s   10.128.2.23   ip-10-0-165-247.ec2.internal   <none>   <none>
prometheus-k8s-1      7/7   Running   1   3m26s   10.128.2.25   ip-10-0-165-247.ec2.internal   <none>   <none>

$ oc -n openshift-monitoring get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alert-alertmanager-main-0     Bound    pvc-a6c8d2e8-487b-471d-84b8-f87f8842055f   1Gi        RWO            gp2            3m35s
alert-alertmanager-main-1     Bound    pvc-f1f62b80-0cf7-421e-b6da-d28d87fd880d   1Gi        RWO            gp2            3m34s
alert-alertmanager-main-2     Bound    pvc-aecb815b-0bd1-40e3-a0f4-64f85a274989   1Gi        RWO            gp2            3m34s
prometheus-prometheus-k8s-0   Bound    pvc-e3a5829c-ada7-4134-b407-a5aeba21683b   2Gi        RWO            gp2            3m34s
prometheus-prometheus-k8s-1   Bound    pvc-74a9f586-2886-4a27-8613-9d1b17323a3f   2Gi        RWO            gp2            3m34s

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq
...
```

```json
{
  "labels": {
    "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
    "namespace": "openshift-monitoring",
    "node": "ip-10-0-165-247.ec2.internal",
    "severity": "warning",
    "workload": "alertmanager-main"
  },
  "annotations": {
    "description": "Workload openshift-monitoring/alertmanager-main is incorrectly spread across multiple nodes which breaks high-availability requirements. There are 3 pods on node ip-10-0-165-247.ec2.internal, where there should only be one. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
    "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
    "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
  },
  "state": "pending",
  "activeAt": "2021-07-05T03:09:35.421613011Z",
  "value": "3e+00"
},
{
  "labels": {
    "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
    "namespace": "openshift-monitoring",
    "node": "ip-10-0-165-247.ec2.internal",
    "severity": "warning",
    "workload": "prometheus-k8s"
  },
  "annotations": {
    "description": "Workload openshift-monitoring/prometheus-k8s is incorrectly spread across multiple nodes which breaks high-availability requirements. There are 2 pods on node ip-10-0-165-247.ec2.internal, where there should only be one. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
    "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
    "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
  },
  "state": "pending",
  "activeAt": "2021-07-05T03:09:35.421613011Z",
  "value": "2e+00"
},
```

---

The alert needed some corrections to handle cases where there are 3 instances of a particular workload but only 2 worker nodes, which is the minimum requirement for highly available OCP clusters. Moving the Bugzilla back to ASSIGNED to get the new changes verified again.

---

Tested with 4.9.0-0.nightly-2021-07-11-143719. Scenarios:

1. Bind PVs for the alertmanager/prometheus pods and schedule all of these pods onto the same node: the HighlyAvailableWorkloadIncorrectlySpread alert is triggered for the alertmanager/prometheus pods, verified in Comment 5.
2. Bind PVs for the alertmanager/prometheus pods and spread the pods across two nodes: the HighlyAvailableWorkloadIncorrectlySpread alert is not triggered.

```
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0   5/5   Running   0   111s   10.129.2.16   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>   <none>
alertmanager-main-1   5/5   Running   0   111s   10.131.0.34   ci-ln-vhq70tb-f76d1-4gldx-worker-a-hkqkj   <none>   <none>
alertmanager-main-2   5/5   Running   0   111s   10.129.2.17   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>   <none>
prometheus-k8s-0      7/7   Running   0   2m     10.129.2.15   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>   <none>
prometheus-k8s-1      7/7   Running   0   2m     10.131.0.33   ci-ln-vhq70tb-f76d1-4gldx-worker-a-hkqkj   <none>   <none>

# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-41a9f119-6b07-41a2-a67c-63d2d1699a55   4Gi        RWO            standard       119s
alertmanager-main-db-alertmanager-main-1   Bound    pvc-7f7d4163-51b8-48f1-9fe2-db8a843341b3   4Gi        RWO            standard       119s
alertmanager-main-db-alertmanager-main-2   Bound    pvc-5bcbe21b-c5b1-4aa9-a59e-0af9066059f7   4Gi        RWO            standard       119s
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-a6d5445c-bdb6-49b7-9bbf-07a734a2fc17   10Gi       RWO            standard       2m8s
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-d07f9597-3af0-41bb-bcba-d594186d2b86   10Gi       RWO            standard       2m8s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels.alertname'
"AlertmanagerReceiversNotConfigured"
"Watchdog"
```

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
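The correction discussed above (tolerating 3 replicas on a 2-worker cluster) amounts to comparing the per-node pod count against the best achievable spread rather than against a flat limit of one. The sketch below illustrates that logic only; the function name and shape are hypothetical, not the actual PromQL shipped in the fix:

```python
import math

def incorrectly_spread(pods_per_node: dict, total_nodes: int) -> dict:
    """Return the nodes hosting more pods of a workload than an even
    spread allows.

    With fewer nodes than replicas, some doubling-up is unavoidable:
    the most even spread puts ceil(replicas / nodes) pods on a node,
    so only counts above that threshold indicate a scheduling problem.
    """
    replicas = sum(pods_per_node.values())
    allowed = math.ceil(replicas / total_nodes)
    return {node: n for node, n in pods_per_node.items() if n > allowed}

# 3 alertmanager replicas on a 2-worker cluster: a 2 + 1 split is the
# best possible spread, so no alert should fire (matches scenario 2).
assert incorrectly_spread({"worker-a": 2, "worker-b": 1}, total_nodes=2) == {}

# All 3 replicas on one node of a 2-worker cluster: alert (scenario 1).
assert incorrectly_spread({"worker-a": 3}, total_nodes=2) == {"worker-a": 3}
```

This matches the verified behavior: with two workers, `alertmanager-main-0` and `alertmanager-main-2` sharing a node is accepted, while all replicas on one node still fires the alert.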