Bug 1974832

Summary: The monitoring stack should alert when 2 Prometheus pods are scheduled on the same node
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: 4.9.0
Reporter: Simon Pasquier <spasquie>
Assignee: Damien Grisonnet <dgrisonn>
QA Contact: Junqi Zhao <juzhao>
Docs Contact:
CC: anpicker, aos-bugs, dofinn, erooth, vjaypurk, wking
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1981246 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:35:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1981246

Description Simon Pasquier 2021-06-22 16:01:30 UTC
Description of problem:

This bug report originates from https://bugzilla.redhat.com/show_bug.cgi?id=1967614#c21

Two Prometheus pods tied to a single node due to their volumes is a high-severity bug, and the admin needs to take corrective action. Are we alerting on this situation now? The PDB is what we want; these users are wasting resources (they expect Prometheus to be HA) and are not able to fix it. The product bug is not the PDB; the bug is that we allowed the cluster to get into this state and didn't notify the admin why.

I expect us to:

a) deliver an alert that flags this with corrective action (see the sketch below for the kind of condition such an alert could check)
b) once that alert rate is down, redeliver the PDB in 4.9 to fix the issue
c) potentially broaden the alert if necessary to other similar cases
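
For illustration only (this is not the rule that was eventually shipped), the condition in (a) can be approximated with a PromQL query over kube-state-metrics' kube_pod_info, which records the node each pod runs on; the selector and threshold below are assumptions for the sketch:

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -s -H "Authorization: Bearer $token" \
    --data-urlencode 'query=count by (node) (kube_pod_info{namespace="openshift-monitoring", pod=~"prometheus-k8s-.+"}) > 1' \
    https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query | jq '.data.result'

The query returns a series only when more than one prometheus-k8s pod sits on the same node; an alerting rule built on an expression like this (plus an equivalent check for alertmanager-main and a restriction to workloads backed by persistent volumes) is the kind of thing (a) asks for.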


Version-Release number of selected component (if applicable):
4.8

How reproducible:
Sometimes

Steps to Reproduce:
1. TBC
2.
3.

Actual results:
Nothing tells the cluster admin when both Prometheus pods are scheduled on the same node.

Expected results:
An alert fires.

Additional info:

See bug 1967614 and bug 1949262 for context.

Comment 5 Junqi Zhao 2021-07-05 03:16:35 UTC
Tested with 4.9.0-0.nightly-2021-07-04-140102: bound PVs for the alertmanager/prometheus pods and scheduled all of them to the same node; the HighlyAvailableWorkloadIncorrectlySpread alert is triggered.
$ oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0                            5/5     Running   0          3m26s   10.128.2.21    ip-10-0-165-247.ec2.internal   <none>           <none>
alertmanager-main-1                            5/5     Running   0          3m26s   10.128.2.22    ip-10-0-165-247.ec2.internal   <none>           <none>
alertmanager-main-2                            5/5     Running   0          3m26s   10.128.2.24    ip-10-0-165-247.ec2.internal   <none>           <none>
prometheus-k8s-0                               7/7     Running   1          3m26s   10.128.2.23    ip-10-0-165-247.ec2.internal   <none>           <none>
prometheus-k8s-1                               7/7     Running   1          3m26s   10.128.2.25    ip-10-0-165-247.ec2.internal   <none>           <none>
$ oc -n openshift-monitoring get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alert-alertmanager-main-0     Bound    pvc-a6c8d2e8-487b-471d-84b8-f87f8842055f   1Gi        RWO            gp2            3m35s
alert-alertmanager-main-1     Bound    pvc-f1f62b80-0cf7-421e-b6da-d28d87fd880d   1Gi        RWO            gp2            3m34s
alert-alertmanager-main-2     Bound    pvc-aecb815b-0bd1-40e3-a0f4-64f85a274989   1Gi        RWO            gp2            3m34s
prometheus-prometheus-k8s-0   Bound    pvc-e3a5829c-ada7-4134-b407-a5aeba21683b   2Gi        RWO            gp2            3m34s
prometheus-prometheus-k8s-1   Bound    pvc-74a9f586-2886-4a27-8613-9d1b17323a3f   2Gi        RWO            gp2            3m34s


$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq
...
      {
        "labels": {
          "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
          "namespace": "openshift-monitoring",
          "node": "ip-10-0-165-247.ec2.internal",
          "severity": "warning",
          "workload": "alertmanager-main"
        },
        "annotations": {
          "description": "Workload openshift-monitoring/alertmanager-main is incorrectly spread across multiple nodes which breaks high-availability requirements. There are 3 pods on node ip-10-0-165-247.ec2.internal, where there should only be one. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
          "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
        },
        "state": "pending",
        "activeAt": "2021-07-05T03:09:35.421613011Z",
        "value": "3e+00"
      },
      {
        "labels": {
          "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
          "namespace": "openshift-monitoring",
          "node": "ip-10-0-165-247.ec2.internal",
          "severity": "warning",
          "workload": "prometheus-k8s"
        },
        "annotations": {
          "description": "Workload openshift-monitoring/prometheus-k8s is incorrectly spread across multiple nodes which breaks high-availability requirements. There are 2 pods on node ip-10-0-165-247.ec2.internal, where there should only be one. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
          "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
        },
        "state": "pending",
        "activeAt": "2021-07-05T03:09:35.421613011Z",
        "value": "2e+00"
      },
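
As a cross-check of the alert values above, the per-node pod counts can be queried directly. This is an illustrative query against kube-state-metrics' kube_pod_info (using its created_by_name label), not the exact expression behind the alert:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -s -H "Authorization: Bearer $token" \
    --data-urlencode 'query=count by (node, created_by_name) (kube_pod_info{namespace="openshift-monitoring", created_by_name=~"prometheus-k8s|alertmanager-main"})' \
    https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query | jq '.data.result'

With everything pinned to ip-10-0-165-247.ec2.internal this should return two series, one per workload, with values 3 (alertmanager-main) and 2 (prometheus-k8s), matching the pod counts reported in the alert descriptions.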

Comment 6 Damien Grisonnet 2021-07-05 08:59:30 UTC
The alert needed some corrections to handle the case where a workload has 3 instances but only 2 worker nodes are available, which is the minimum for highly available OCP clusters.

Moving the Bugzilla back to ASSIGNED so that the new changes can be verified again.
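
For re-verification, the corrected expression can be inspected on a running cluster; the alert ships as part of a PrometheusRule object in openshift-monitoring (the exact object name may vary between releases, so grepping for the alert name is the simplest approach):

$ oc -n openshift-monitoring get prometheusrule -o yaml \
    | grep -A 20 'alert: HighlyAvailableWorkloadIncorrectlySpread'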

Comment 8 Junqi Zhao 2021-07-12 04:26:27 UTC
Tested with 4.9.0-0.nightly-2021-07-11-143719. Scenarios:
1. Bind PVs for the alertmanager/prometheus pods and schedule all of them to the same node: the HighlyAvailableWorkloadIncorrectlySpread alert is triggered for both the alertmanager and prometheus pods, verified in Comment 5.
2. Bind PVs for the alertmanager/prometheus pods and spread them across two nodes: the HighlyAvailableWorkloadIncorrectlySpread alert is not triggered.
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0                            5/5     Running   0          111s   10.129.2.16   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>           <none>
alertmanager-main-1                            5/5     Running   0          111s   10.131.0.34   ci-ln-vhq70tb-f76d1-4gldx-worker-a-hkqkj   <none>           <none>
alertmanager-main-2                            5/5     Running   0          111s   10.129.2.17   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>           <none>
prometheus-k8s-0                               7/7     Running   0          2m     10.129.2.15   ci-ln-vhq70tb-f76d1-4gldx-worker-b-f874d   <none>           <none>
prometheus-k8s-1                               7/7     Running   0          2m     10.131.0.33   ci-ln-vhq70tb-f76d1-4gldx-worker-a-hkqkj   <none>           <none>
# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-41a9f119-6b07-41a2-a67c-63d2d1699a55   4Gi        RWO            standard       119s
alertmanager-main-db-alertmanager-main-1   Bound    pvc-7f7d4163-51b8-48f1-9fe2-db8a843341b3   4Gi        RWO            standard       119s
alertmanager-main-db-alertmanager-main-2   Bound    pvc-5bcbe21b-c5b1-4aa9-a59e-0af9066059f7   4Gi        RWO            standard       119s
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-a6d5445c-bdb6-49b7-9bbf-07a734a2fc17   10Gi       RWO            standard       2m8s
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-d07f9597-3af0-41bb-bcba-d594186d2b86   10Gi       RWO            standard       2m8s
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels.alertname'
"AlertmanagerReceiversNotConfigured"
"Watchdog"

Comment 15 errata-xmlrpc 2021-10-18 17:35:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759