Bug 1955489
| Summary: | Alertmanager Statefulsets should have 2 replicas and hard affinity set | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | Brian Burt <bburt> |
| Priority: | low | | |
| Version: | 4.8 | CC: | anpicker, bburt, erooth, hongyli, jeder, rgudimet, wking |
| Target Milestone: | --- | Keywords: | ServiceDeliveryImpact |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: |
Previously, during {product-title} upgrades, the Alertmanager service could become unavailable because either the three Alertmanager pods were located on the same node or the nodes running the Alertmanager pods happened to reboot at the same time. This situation was possible because the Alertmanager pods had only soft anti-affinity rules for node placement and no pod disruption budget. This release enables hard anti-affinity rules and pod disruption budgets to ensure no downtime during patch upgrades for Alertmanager and other monitoring components.
Consequence: alert notifications were not dispatched for some period of time.
Fix: the cluster monitoring operator configures hard anti-affinity rules to ensure that the Alertmanager pods are scheduled on different nodes. It also provisions a pod disruption budget to ensure that at least one Alertmanager pod is always running.
Result: during upgrades, the nodes reboot in sequence, ensuring that at least one Alertmanager pod is always running (a sketch of the two resources follows the table below).
|
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:03:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
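As described in the Doc Text, the fix combines a hard pod anti-affinity rule with a pod disruption budget. Below is a minimal sketch of the two resources, using the alertmanager labels shown in the verification output later in this report; the operator-generated manifests quoted below are the authoritative versions.

# Hard anti-affinity: the scheduler must place each Alertmanager pod on a
# different node (topologyKey: kubernetes.io/hostname) rather than merely
# preferring to, as the previous soft rule did.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: alertmanager
      topologyKey: kubernetes.io/hostname
---
# PDB: with 2 replicas and maxUnavailable: 1, voluntary evictions (such as
# node drains during an upgrade) can take down at most one pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: alertmanager-main
  namespace: openshift-monitoring
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager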
Description Simon Pasquier 2021-04-30 08:49:47 UTC
Tested with the PR: as expected, the Alertmanager pods changed to 2 replicas, but the pods could not be started.
# oc -n openshift-monitoring get pdb alertmanager-main
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
alertmanager-main N/A 1 0 41m
# oc -n openshift-monitoring get pdb alertmanager-main -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-11-25T02:52:56Z"
  generation: 1
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.22.2
  name: alertmanager-main
  namespace: openshift-monitoring
  resourceVersion: "95240"
  uid: 94eea939-798d-48a0-8f24-aa89aaa525c2
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      alertmanager: main
      app.kubernetes.io/component: alert-router
      app.kubernetes.io/name: alertmanager
      app.kubernetes.io/part-of: openshift-monitoring
status:
  conditions:
  - lastTransitionTime: "2021-11-25T03:34:55Z"
    message: ""
    observedGeneration: 1
    reason: InsufficientPods
    status: "False"
    type: DisruptionAllowed
  currentHealthy: 0
  desiredHealthy: 1
  disruptionsAllowed: 0
  expectedPods: 2
  observedGeneration: 1
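The status fields explain why the rollout is wedged: disruptionsAllowed is currentHealthy minus desiredHealthy (floored at zero), so with 0 healthy pods and 1 desired the budget allows no disruptions. A convenience command (not part of the original report) to pull just those fields:

# oc -n openshift-monitoring get pdb alertmanager-main -o jsonpath='{.status.currentHealthy}/{.status.expectedPods} healthy, {.status.disruptionsAllowed} disruptions allowed{"\n"}'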
# while true; do date; oc -n openshift-monitoring get pod | grep alertmanager; sleep 10s; echo -e "\n"; done
Wed Nov 24 22:29:54 EST 2021
alertmanager-main-0 0/6 Terminating 0 3s
alertmanager-main-1 0/6 ContainerCreating 0 3s
Wed Nov 24 22:30:10 EST 2021
alertmanager-main-0 0/6 Terminating 0 1s
alertmanager-main-1 0/6 Terminating 0 1s
Wed Nov 24 22:30:25 EST 2021
alertmanager-main-0 0/6 Terminating 0 1s
alertmanager-main-1 0/6 Terminating 0 1s
Wed Nov 24 22:30:41 EST 2021
alertmanager-main-1 6/6 Terminating 0 6s
Wed Nov 24 22:30:56 EST 2021
alertmanager-main-1 6/6 Terminating 0 6s
Wed Nov 24 22:31:11 EST 2021
alertmanager-main-0 6/6 Terminating 0 4s
alertmanager-main-1 6/6 Terminating 0 4s
Wed Nov 24 22:31:27 EST 2021
alertmanager-main-0 0/6 Terminating 0 4s
alertmanager-main-1 0/6 Terminating 0 4s
Wed Nov 24 22:31:42 EST 2021
alertmanager-main-0 0/6 Terminating 0 3s
alertmanager-main-1 0/6 Terminating 0 3s
Wed Nov 24 22:31:57 EST 2021
alertmanager-main-0 0/6 ContainerCreating 0 0s
alertmanager-main-1 0/6 Pending 0 0s
Wed Nov 24 22:32:13 EST 2021
alertmanager-main-0 6/6 Terminating 0 6s
# oc -n openshift-monitoring get event | grep alertmanager-main
...
13s Warning FailedCreate statefulset/alertmanager-main create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
13s Warning FailedCreate statefulset/alertmanager-main create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
13s Normal SuccessfulCreate statefulset/alertmanager-main create Pod alertmanager-main-0 in StatefulSet alertmanager-main successful
13s Warning FailedCreate statefulset/alertmanager-main create Pod alertmanager-main-1 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
# oc -n openshift-monitoring logs prometheus-operator-84c85586d6-bpf2r
...
level=info ts=2021-11-25T02:53:02.3094545Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
level=info ts=2021-11-25T02:53:02.316758637Z caller=operator.go:741 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2021-11-25T02:53:02.426700671Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
level=info ts=2021-11-25T02:53:02.432181021Z caller=operator.go:741 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2021-11-25T02:53:02.50330463Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
level=info ts=2021-11-25T02:53:02.509158493Z caller=operator.go:741 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2021-11-25T02:53:02.553180316Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
...
Pod affinity is not among the StatefulSet spec fields that can be updated in place (see the Forbidden message above), so the prometheus-operator deletes and recreates the StatefulSet using foreground cascading deletion. The foregroundDeletion finalizer should be removed from the alertmanager-main StatefulSet; while it is present, the old object lingers with a deletionTimestamp set and the create/terminate loop above repeats:
# oc -n openshift-monitoring get sts alertmanager-main -oyaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    prometheus-operator-input-hash: "14523878381744334873"
  creationTimestamp: "2021-11-25T03:49:58Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-11-25T03:49:58Z"
  finalizers:
  - foregroundDeletion
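A sketch of how the stuck finalizer could be cleared by hand, assuming foregroundDeletion is the only finalizer on the object (if others are present, remove just that entry by index instead):

# oc -n openshift-monitoring patch sts alertmanager-main --type=json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'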
Pull request submitted.

Checked with 4.10.0-0.nightly-2021-12-12-184227; the fix is in it. Alertmanager StatefulSets have 2 replicas and hard affinity set:
# oc -n openshift-monitoring get pod | grep alertmanager-main
alertmanager-main-0 6/6 Running 0 4h13m
alertmanager-main-1 6/6 Running 0 4h12m
# oc -n openshift-monitoring get sts alertmanager-main -oyaml
...
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/component: alert-router
          app.kubernetes.io/instance: main
          app.kubernetes.io/name: alertmanager
          app.kubernetes.io/part-of: openshift-monitoring
      namespaces:
      - openshift-monitoring
      topologyKey: kubernetes.io/hostname
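One way to confirm that the hard rule actually spread the pods across nodes, using the label selector from the matchLabels above (the NODE column should show two different nodes):

# oc -n openshift-monitoring get pod -l app.kubernetes.io/name=alertmanager -o wide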
# oc -n openshift-monitoring get pdb alertmanager-main
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
alertmanager-main N/A 1 1 10h
# oc -n openshift-monitoring get pdb alertmanager-main -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-12-12T23:30:56Z"
  generation: 1
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.22.2
  name: alertmanager-main
  namespace: openshift-monitoring
  resourceVersion: "149472"
  uid: 74e9b3dd-a3c8-45fb-8b5a-6b627a0a3acd
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: alert-router
      app.kubernetes.io/instance: main
      app.kubernetes.io/name: alertmanager
      app.kubernetes.io/part-of: openshift-monitoring
status:
  conditions:
  - lastTransitionTime: "2021-12-13T06:01:03Z"
    message: ""
    observedGeneration: 1
    reason: SufficientPods
    status: "True"
    type: DisruptionAllowed
  currentHealthy: 2
  desiredHealthy: 1
  disruptionsAllowed: 1
  expectedPods: 2
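With this budget in place, the eviction API allows one Alertmanager pod to be disrupted but refuses a second concurrent eviction until the replacement pod is healthy again, which is what keeps at least one Alertmanager running through sequential node reboots. A hypothetical drain during an upgrade illustrates this (the node name is a placeholder):

# oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data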
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056