Bug 1981246 - The monitoring stack should alert when 2 Prometheus pods are scheduled on the same node
Summary: The monitoring stack should alert when 2 Prometheus pods are scheduled on the same node
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.z
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1974832
Blocks:
 
Reported: 2021-07-12 07:36 UTC by Damien Grisonnet
Modified: 2021-10-18 09:15 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1974832
Environment:
Last Closed: 2021-08-23 07:51:46 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift cluster-monitoring-operator pull 1276 (open): Bug 1981246: [4.8]: Add HighlyAvailableWorkloadIncorrectlySpread alert (last updated 2021-07-22 17:51:20 UTC)
Github openshift origin pull 26319 (open): Bug 1981246: [4.8]: test/e2e: allow workload incorrectly spread alert (last updated 2021-07-29 07:29:46 UTC)
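
The report does not quote the rule's exact expression or pending duration. As a hedged sketch (the /api/v1/rules endpoint is standard Prometheus; the token and exec pattern mirror the commands in comment 5 below), the shipped rule can be inspected on a live cluster with:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/rules' | jq '.data.groups[].rules[] | select(.name == "HighlyAvailableWorkloadIncorrectlySpread") | {name, query, duration}'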

Comment 5 Junqi Zhao 2021-08-12 10:27:34 UTC
Tested with 4.8.0-0.nightly-2021-08-12-022728. Scenarios:
1. Bind PVs for the alertmanager/prometheus pods and schedule all of the pods onto the same node; the HighlyAvailableWorkloadIncorrectlySpread alert is triggered for the alertmanager and prometheus workloads.
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0                            5/5     Running   0          109s    10.129.2.160   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>           <none>
alertmanager-main-1                            5/5     Running   0          109s    10.129.2.162   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>           <none>
alertmanager-main-2                            5/5     Running   0          109s    10.129.2.161   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>           <none>
prometheus-k8s-0                               7/7     Running   1          116s    10.129.2.158   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>           <none>
prometheus-k8s-1                               7/7     Running   1          116s    10.129.2.159   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>           <none>

# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-07e5e73e-70d7-40d9-85ad-d9d63a83d9b1   4Gi        RWO            gp2            119s
alertmanager-main-db-alertmanager-main-1   Bound    pvc-3c720964-155e-4986-adba-8430220ea10c   4Gi        RWO            gp2            119s
alertmanager-main-db-alertmanager-main-2   Bound    pvc-bb98124d-8f85-4829-ac7c-26a978685ed3   4Gi        RWO            gp2            119s
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-705a5da1-f25b-44e6-99ea-ad232b36c054   10Gi       RWO            gp2            2m6s
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-fdf59341-3b20-48e4-a4f4-d3d326b92a54   10Gi       RWO            gp2            2m6s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq
...
      {
        "labels": {
          "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
          "namespace": "openshift-monitoring",
          "severity": "warning",
          "workload": "prometheus-k8s"
        },
        "annotations": {
          "description": "Workload openshift-monitoring/prometheus-k8s is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
          "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
        },
        "state": "pending",
        "activeAt": "2021-08-12T10:03:35.421613011Z",
        "value": "1e+00"
      },
      {
        "labels": {
          "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
          "namespace": "openshift-monitoring",
          "severity": "warning",
          "workload": "alertmanager-main"
        },
        "annotations": {
          "description": "Workload openshift-monitoring/alertmanager-main is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
          "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
        },
        "state": "pending",
        "activeAt": "2021-08-12T10:03:35.421613011Z",
        "value": "1e+00"
      },
...
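
To narrow the full alerts payload to just this alert on re-checks, a hedged convenience filter over the same /api/v1/alerts endpoint used above (not part of the original verification):
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[] | select(.labels.alertname == "HighlyAvailableWorkloadIncorrectlySpread") | {workload: .labels.workload, state: .state}'
The "pending" state above is expected at first: an alerting rule with a pending ("for") duration only moves to "firing" once its condition has held for that long.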

2. Bind PVs for the alertmanager/prometheus pods and spread the pods across two nodes; the HighlyAvailableWorkloadIncorrectlySpread alert is not triggered.
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0                            5/5     Running   0          73s     10.128.2.24    ip-10-0-171-125.ap-northeast-2.compute.internal   <none>           <none>
alertmanager-main-1                            5/5     Running   0          73s     10.129.2.170   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>           <none>
alertmanager-main-2                            5/5     Running   0          73s     10.129.2.171   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>           <none>
prometheus-k8s-0                               7/7     Running   1          71s     10.128.2.23    ip-10-0-171-125.ap-northeast-2.compute.internal   <none>           <none>
prometheus-k8s-1                               7/7     Running   1          71s     10.129.2.169   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>           <none>

# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-41a85819-77d0-4bac-a533-b5b636ed36c9   4Gi        RWO            gp2            113s
alertmanager-main-db-alertmanager-main-1   Bound    pvc-4df38d90-3bec-4e28-a0a9-b872267dbb40   4Gi        RWO            gp2            113s
alertmanager-main-db-alertmanager-main-2   Bound    pvc-bae69be4-58df-42ac-8a5f-fc31a1e3bc5e   4Gi        RWO            gp2            113s
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-a80a416d-9bdc-4db0-8e3f-4998192992fc   10Gi       RWO            gp2            111s
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-c131add3-15d1-4a3b-8222-65b4a0ed8fb6   10Gi       RWO            gp2            111s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels.alertname'
"HighOverallControlPlaneCPU"
"CannotRetrieveUpdates"
"Watchdog"
"AlertmanagerReceiversNotConfigured"

Comment 6 Scott Dodson 2021-08-17 01:25:53 UTC
The alert shipped in 4.8.4; should this bug be CLOSED CURRENTRELEASE now?

Comment 8 ximhan 2021-08-20 07:26:57 UTC
OpenShift engineering has decided NOT to ship 4.8.6 on 8/23 due to the following issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1995785
All of the fixes will now be included in 4.8.7 on 8/30.

Comment 10 Junqi Zhao 2021-08-23 07:51:46 UTC
(In reply to Scott Dodson from comment #6)
> The alert shipped in 4.8.4; should this bug be CLOSED CURRENTRELEASE now?

Makes sense, setting to CURRENTRELEASE.

