Tested with 4.8.0-0.nightly-2021-08-12-022728. Scenarios:

1. Bind PVs for the alertmanager/prometheus pods and schedule all of these pods to the same node; the HighlyAvailableWorkloadIncorrectlySpread alert is triggered for the alertmanager-main and prometheus-k8s workloads.

# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0   5/5   Running   0   109s   10.129.2.160   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>   <none>
alertmanager-main-1   5/5   Running   0   109s   10.129.2.162   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>   <none>
alertmanager-main-2   5/5   Running   0   109s   10.129.2.161   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>   <none>
prometheus-k8s-0      7/7   Running   1   116s   10.129.2.158   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>   <none>
prometheus-k8s-1      7/7   Running   1   116s   10.129.2.159   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>   <none>

# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-07e5e73e-70d7-40d9-85ad-d9d63a83d9b1   4Gi        RWO            gp2            119s
alertmanager-main-db-alertmanager-main-1   Bound    pvc-3c720964-155e-4986-adba-8430220ea10c   4Gi        RWO            gp2            119s
alertmanager-main-db-alertmanager-main-2   Bound    pvc-bb98124d-8f85-4829-ac7c-26a978685ed3   4Gi        RWO            gp2            119s
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-705a5da1-f25b-44e6-99ea-ad232b36c054   10Gi       RWO            gp2            2m6s
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-fdf59341-3b20-48e4-a4f4-d3d326b92a54   10Gi       RWO            gp2            2m6s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq
...
    {
      "labels": {
        "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
        "namespace": "openshift-monitoring",
        "severity": "warning",
        "workload": "prometheus-k8s"
      },
      "annotations": {
        "description": "Workload openshift-monitoring/prometheus-k8s is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
        "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
        "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
      },
      "state": "pending",
      "activeAt": "2021-08-12T10:03:35.421613011Z",
      "value": "1e+00"
    },
    {
      "labels": {
        "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
        "namespace": "openshift-monitoring",
        "severity": "warning",
        "workload": "alertmanager-main"
      },
      "annotations": {
        "description": "Workload openshift-monitoring/alertmanager-main is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
        "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
        "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
      },
      "state": "pending",
      "activeAt": "2021-08-12T10:03:35.421613011Z",
      "value": "1e+00"
    },
...
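Side note (not part of the original verification): the same alerts API call can be narrowed down to just this alert with a jq select, which is handy on clusters with many active alerts. This sketch assumes the same prometheus-k8s service, pod, and token used above:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -sk -H "Authorization: Bearer $token" \
    'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname == "HighlyAvailableWorkloadIncorrectlySpread") | {workload: .labels.workload, state: .state}'

In this run that should print one small object per affected workload (prometheus-k8s and alertmanager-main) instead of the full alert payload.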
2. Bind PVs for the alertmanager/prometheus pods and schedule these pods across two nodes; the HighlyAvailableWorkloadIncorrectlySpread alert is not triggered.

# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0   5/5   Running   0   73s   10.128.2.24    ip-10-0-171-125.ap-northeast-2.compute.internal   <none>   <none>
alertmanager-main-1   5/5   Running   0   73s   10.129.2.170   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>   <none>
alertmanager-main-2   5/5   Running   0   73s   10.129.2.171   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>   <none>
prometheus-k8s-0      7/7   Running   1   71s   10.128.2.23    ip-10-0-171-125.ap-northeast-2.compute.internal   <none>   <none>
prometheus-k8s-1      7/7   Running   1   71s   10.129.2.169   ip-10-0-213-244.ap-northeast-2.compute.internal   <none>   <none>

# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-41a85819-77d0-4bac-a533-b5b636ed36c9   4Gi        RWO            gp2            113s
alertmanager-main-db-alertmanager-main-1   Bound    pvc-4df38d90-3bec-4e28-a0a9-b872267dbb40   4Gi        RWO            gp2            113s
alertmanager-main-db-alertmanager-main-2   Bound    pvc-bae69be4-58df-42ac-8a5f-fc31a1e3bc5e   4Gi        RWO            gp2            113s
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-a80a416d-9bdc-4db0-8e3f-4998192992fc   10Gi       RWO            gp2            111s
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-c131add3-15d1-4a3b-8222-65b4a0ed8fb6   10Gi       RWO            gp2            111s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels.alertname'
"HighOverallControlPlaneCPU"
"CannotRetrieveUpdates"
"Watchdog"
"AlertmanagerReceiversNotConfigured"
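For a quick sanity check of the spread itself, the number of distinct nodes each of these StatefulSets lands on can be counted via the query API. This is only an illustrative sketch, not the shipped alert expression, and it assumes kube-state-metrics' kube_pod_info metric with its usual node/created_by_name labels (and reuses the token set above):

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -sk -G -H "Authorization: Bearer $token" \
    --data-urlencode 'query=count by (created_by_name) (count by (created_by_name, node) (kube_pod_info{namespace="openshift-monitoring", created_by_name=~"prometheus-k8s|alertmanager-main"}))' \
    'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query' | jq '.data.result'

With the pod placements shown here, this comes out to 1 node per workload in scenario 1 (everything co-located) and 2 nodes per workload in scenario 2, which lines up with the alert only appearing in the first case; the actual rule shipped with the cluster monitoring operator may use a different expression.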
The alert shipped in 4.8.4; should this bug be closed as CURRENTRELEASE now?
OpenShift engineering has decided NOT to ship 4.8.6 on 8/23 due to the following issue: https://bugzilla.redhat.com/show_bug.cgi?id=1995785. All fixes that were part of that release will now be included in 4.8.7 on 8/30.
(In reply to Scott Dodson from comment #6)
> The alert shipped in 4.8.4; should this bug be closed as CURRENTRELEASE now?

Makes sense; set to CURRENTRELEASE.