Bug 1974832 - The monitoring stack should alert when 2 Prometheus pods are scheduled on the same node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1981246
 
Reported: 2021-06-22 16:01 UTC by Simon Pasquier
Modified: 2021-10-18 17:36 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 1981246
Environment:
Last Closed: 2021-10-18 17:35:57 UTC
Target Upstream Version:




Links:
- GitHub openshift/cluster-monitoring-operator pull 1242 (open): Bug 1974832: Add HighlyAvailableWorkloadIncorrectlySpread alert (last updated 2021-06-23 13:37:09 UTC)
- GitHub openshift/cluster-monitoring-operator pull 1262 (open): Bug 1974832: Improve HighlyAvailableWorkloadIncorrectlySpread to detect single point of failure (last updated 2021-07-05 08:59:52 UTC)
- GitHub openshift/runbooks pull 3 (open): Bug 1974832: Add runbook to HighlyAvailableWorkloadIncorrectlySpread (last updated 2021-06-28 14:38:28 UTC)
- Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:36:22 UTC)

Description Simon Pasquier 2021-06-22 16:01:30 UTC
Description of problem:

This bug report originates from https://bugzilla.redhat.com/show_bug.cgi?id=1967614#c21

Two Prometheus pods tied to one instance (node) due to their volumes is a high-severity bug, and the admin needs to take corrective action. Are we alerting on this situation now? The PDB is what we want: these users are wasting resources (they expect Prometheus to be HA) and are not able to fix it. The product bug is not the PDB; the bug is that we allowed the cluster to get into this state and didn't notify the admin of why.

I expect us to 

a) deliver an alert that flags this with corrective action
b) once that alert rate is down, redeliver the PDB in 4.9 to fix the issue
c) potentially broaden the alert if necessary to other similar cases


Version-Release number of selected component (if applicable):
4.8

How reproducible:
Sometimes

Steps to Reproduce:
1. TBC

Actual results:
Nothing tells the cluster admin when both Prometheus pods are scheduled on the same node.

Expected results:
An alert fires.

Additional info:

See bug 1967614 and bug 1949262 for context

Comment 5 Junqi Zhao 2021-07-05 03:16:35 UTC
Tested with 4.9.0-0.nightly-2021-07-04-140102: after binding PVs for the alertmanager/prometheus pods and scheduling these pods to the same node, the HighlyAvailableWorkloadIncorrectlySpread alert is triggered.
$ oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0                            5/5     Running   0          3m26s   10.128.2.21    ip-10-0-165-247.ec2.internal   <none>           <none>
alertmanager-main-1                            5/5     Running   0          3m26s   10.128.2.22    ip-10-0-165-247.ec2.internal   <none>           <none>
alertmanager-main-2                            5/5     Running   0          3m26s   10.128.2.24    ip-10-0-165-247.ec2.internal   <none>           <none>
prometheus-k8s-0                               7/7     Running   1          3m26s   10.128.2.23    ip-10-0-165-247.ec2.internal   <none>           <none>
prometheus-k8s-1                               7/7     Running   1          3m26s   10.128.2.25    ip-10-0-165-247.ec2.internal   <none>           <none>
$ oc -n openshift-monitoring get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alert-alertmanager-main-0     Bound    pvc-a6c8d2e8-487b-471d-84b8-f87f8842055f   1Gi        RWO            gp2            3m35s
alert-alertmanager-main-1     Bound    pvc-f1f62b80-0cf7-421e-b6da-d28d87fd880d   1Gi        RWO            gp2            3m34s
alert-alertmanager-main-2     Bound    pvc-aecb815b-0bd1-40e3-a0f4-64f85a274989   1Gi        RWO            gp2            3m34s
prometheus-prometheus-k8s-0   Bound    pvc-e3a5829c-ada7-4134-b407-a5aeba21683b   2Gi        RWO            gp2            3m34s
prometheus-prometheus-k8s-1   Bound    pvc-74a9f586-2886-4a27-8613-9d1b17323a3f   2Gi        RWO            gp2            3m34s
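The co-scheduling shown in the pod listing above can also be checked mechanically. A minimal sketch, using jq against `oc get pod -o json`-style output; the inline JSON here is a trimmed sample modeled on the listing above, not real cluster output — against a live cluster you would pipe `oc -n openshift-monitoring get pod -o json` instead of the here-doc:

```shell
# Count pods per node from `oc get pod -o json`-style output.
# The here-doc is a trimmed, hypothetical sample of the listing above.
cat <<'EOF' | jq -r '.items | group_by(.spec.nodeName) | .[] | "\(.[0].spec.nodeName): \(length) pods"'
{"items": [
  {"metadata": {"name": "prometheus-k8s-0"}, "spec": {"nodeName": "ip-10-0-165-247.ec2.internal"}},
  {"metadata": {"name": "prometheus-k8s-1"}, "spec": {"nodeName": "ip-10-0-165-247.ec2.internal"}}
]}
EOF
# prints: ip-10-0-165-247.ec2.internal: 2 pods
```

Any node that reports more than one pod of the same workload is a candidate for the alert.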


$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq
...
      {
        "labels": {
          "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
          "namespace": "openshift-monitoring",
          "node": "ip-10-0-165-247.ec2.internal",
          "severity": "warning",
          "workload": "alertmanager-main"
        },
        "annotations": {
          "description": "Workload openshift-monitoring/alertmanager-main is incorrectly spread across multiple nodes which breaks high-availability requirements. There are 3 pods on node ip-10-0-165-247.ec2.internal, where there should only be one. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
          "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
        },
        "state": "pending",
        "activeAt": "2021-07-05T03:09:35.421613011Z",
        "value": "3e+00"
      },
      {
        "labels": {
          "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
          "namespace": "openshift-monitoring",
          "node": "ip-10-0-165-247.ec2.internal",
          "severity": "warning",
          "workload": "prometheus-k8s"
        },
        "annotations": {
          "description": "Workload openshift-monitoring/prometheus-k8s is incorrectly spread across multiple nodes which breaks high-availability requirements. There are 2 pods on node ip-10-0-165-247.ec2.internal, where there should only be one. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
          "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
        },
        "state": "pending",
        "activeAt": "2021-07-05T03:09:35.421613011Z",
        "value": "2e+00"
      },
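Rather than scanning the full `/api/v1/alerts` payload by eye, the relevant entries can be filtered with jq. A sketch, where the inline JSON is a trimmed sample of the response above; on a cluster you would replace the here-doc with the `oc exec ... curl` command shown earlier in this comment:

```shell
# Pull only HighlyAvailableWorkloadIncorrectlySpread entries out of a
# Prometheus /api/v1/alerts response. The here-doc is a trimmed sample
# of the payload above, not live data.
cat <<'EOF' | jq -r '.data.alerts[] | select(.labels.alertname == "HighlyAvailableWorkloadIncorrectlySpread") | "\(.labels.workload) on \(.labels.node): \(.state)"'
{"data": {"alerts": [
  {"labels": {"alertname": "HighlyAvailableWorkloadIncorrectlySpread",
              "workload": "prometheus-k8s",
              "node": "ip-10-0-165-247.ec2.internal"},
   "state": "pending"},
  {"labels": {"alertname": "Watchdog"}, "state": "firing"}
]}}
EOF
# prints: prometheus-k8s on ip-10-0-165-247.ec2.internal: pending
```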

Comment 6 Damien Grisonnet 2021-07-05 08:59:30 UTC
The alert needed some corrections to handle the case of 3 instances of a workload running on only 2 worker nodes, which is the minimum node count for highly available OCP clusters.

Moving the Bugzilla back to ASSIGNED to get the new changes verified again.

Comment 8 Junqi Zhao 2021-07-12 04:26:27 UTC
Tested with 4.9.0-0.nightly-2021-07-11-143719. Scenarios:
1. Bind PVs for the alertmanager/prometheus pods and schedule these pods to the same node: the HighlyAvailableWorkloadIncorrectlySpread alert is triggered for the alertmanager/prometheus pods, verified in Comment 5.
2. Bind PVs for the alertmanager/prometheus pods and spread these pods across two nodes: the HighlyAvailableWorkloadIncorrectlySpread alert is not triggered.
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0                            5/5     Running   0          111s   10.129.2.16   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>           <none>
alertmanager-main-1                            5/5     Running   0          111s   10.131.0.34   ci-ln-vhq70tb-f76d1-4gldx-worker-a-hkqkj   <none>           <none>
alertmanager-main-2                            5/5     Running   0          111s   10.129.2.17   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>           <none>
prometheus-k8s-0                               7/7     Running   0          2m     10.129.2.15   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>           <none>
prometheus-k8s-1                               7/7     Running   0          2m     10.131.0.33   ci-ln-vhq70tb-f76d1-4gldx-worker-a-hkqkj   <none>           <none>
# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-41a9f119-6b07-41a2-a67c-63d2d1699a55   4Gi        RWO            standard       119s
alertmanager-main-db-alertmanager-main-1   Bound    pvc-7f7d4163-51b8-48f1-9fe2-db8a843341b3   4Gi        RWO            standard       119s
alertmanager-main-db-alertmanager-main-2   Bound    pvc-5bcbe21b-c5b1-4aa9-a59e-0af9066059f7   4Gi        RWO            standard       119s
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-a6d5445c-bdb6-49b7-9bbf-07a734a2fc17   10Gi       RWO            standard       2m8s
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-d07f9597-3af0-41bb-bcba-d594186d2b86   10Gi       RWO            standard       2m8s
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels.alertname'
"AlertmanagerReceiversNotConfigured"
"Watchdog"

Comment 15 errata-xmlrpc 2021-10-18 17:35:57 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

