Bug 1955489
Summary: | Alertmanager Statefulsets should have 2 replicas and hard affinity set | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | high | Docs Contact: | Brian Burt <bburt> |
Priority: | low | ||
Version: | 4.8 | CC: | anpicker, bburt, erooth, hongyli, jeder, rgudimet, wking |
Target Milestone: | --- | Keywords: | ServiceDeliveryImpact |
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Previously, during {product-title} upgrades, the Alertmanager service might become unavailable because either the three Alertmanager pods were located on the same node or the nodes running the Alertmanager pods happened to reboot at the same time. This situation was possible because the Alertmanager pods had soft anti-affinity rules regarding node placement and no pod disruption budget. This release enables hard anti-affinity rules and pod disruption budgets to ensure no downtime during patch upgrades for the Alertmanager and other monitoring components.
Consequence: alert notifications would not be dispatched for some time.
Fix: the cluster monitoring operator configures hard anti-affinity rules to ensure that the Alertmanager pods are scheduled on different nodes. It also provisions a pod disruption budget to ensure that at least one Alertmanager pod is always running.
Result: during upgrades, the nodes should reboot in sequence to ensure that at least 1 Alertmanager pod is always running.
|
Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2022-03-10 16:03:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Simon Pasquier
2021-04-30 08:49:47 UTC
Tested with the PR. The Alertmanager pods changed to 2 as expected, but the pods cannot be started.

# oc -n openshift-monitoring get pdb alertmanager-main
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
alertmanager-main   N/A             1                 0                     41m

# oc -n openshift-monitoring get pdb alertmanager-main -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-11-25T02:52:56Z"
  generation: 1
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.22.2
  name: alertmanager-main
  namespace: openshift-monitoring
  resourceVersion: "95240"
  uid: 94eea939-798d-48a0-8f24-aa89aaa525c2
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      alertmanager: main
      app.kubernetes.io/component: alert-router
      app.kubernetes.io/name: alertmanager
      app.kubernetes.io/part-of: openshift-monitoring
status:
  conditions:
  - lastTransitionTime: "2021-11-25T03:34:55Z"
    message: ""
    observedGeneration: 1
    reason: InsufficientPods
    status: "False"
    type: DisruptionAllowed
  currentHealthy: 0
  desiredHealthy: 1
  disruptionsAllowed: 0
  expectedPods: 2
  observedGeneration: 1

# while true; do date; oc -n openshift-monitoring get pod | grep alertmanager; sleep 10s; echo -e "\n"; done
Wed Nov 24 22:29:54 EST 2021
alertmanager-main-0   0/6   Terminating         0   3s
alertmanager-main-1   0/6   ContainerCreating   0   3s

Wed Nov 24 22:30:10 EST 2021
alertmanager-main-0   0/6   Terminating   0   1s
alertmanager-main-1   0/6   Terminating   0   1s

Wed Nov 24 22:30:25 EST 2021
alertmanager-main-0   0/6   Terminating   0   1s
alertmanager-main-1   0/6   Terminating   0   1s

Wed Nov 24 22:30:41 EST 2021
alertmanager-main-1   6/6   Terminating   0   6s

Wed Nov 24 22:30:56 EST 2021
alertmanager-main-1   6/6   Terminating   0   6s

Wed Nov 24 22:31:11 EST 2021
alertmanager-main-0   6/6   Terminating   0   4s
alertmanager-main-1   6/6   Terminating   0   4s

Wed Nov 24 22:31:27 EST 2021
alertmanager-main-0   0/6   Terminating   0   4s
alertmanager-main-1   0/6   Terminating   0   4s

Wed Nov 24 22:31:42 EST 2021
alertmanager-main-0   0/6   Terminating   0   3s
alertmanager-main-1   0/6   Terminating   0   3s

Wed Nov 24 22:31:57 EST 2021
alertmanager-main-0   0/6   ContainerCreating   0   0s
alertmanager-main-1   0/6   Pending             0   0s

Wed Nov 24 22:32:13 EST 2021
alertmanager-main-0   6/6   Terminating   0   6s

# oc -n openshift-monitoring get event | grep alertmanager-main
...
13s   Warning   FailedCreate       statefulset/alertmanager-main   create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
13s   Warning   FailedCreate       statefulset/alertmanager-main   create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
13s   Normal    SuccessfulCreate   statefulset/alertmanager-main   create Pod alertmanager-main-0 in StatefulSet alertmanager-main successful
13s   Warning   FailedCreate       statefulset/alertmanager-main   create Pod alertmanager-main-1 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
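For context: because both pods are stuck terminating or recreating, the PDB never reaches its desired healthy count, which is why the DisruptionAllowed condition above reports InsufficientPods and disruptionsAllowed stays at 0. A quick way to confirm this (a sketch using standard jsonpath output, not part of the verification commands above):

# oc -n openshift-monitoring get pdb alertmanager-main -o jsonpath='{.status.currentHealthy}/{.status.desiredHealthy} healthy, disruptionsAllowed={.status.disruptionsAllowed}{"\n"}'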
# oc -n openshift-monitoring logs prometheus-operator-84c85586d6-bpf2r
...
level=info ts=2021-11-25T02:53:02.3094545Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
level=info ts=2021-11-25T02:53:02.316758637Z caller=operator.go:741 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2021-11-25T02:53:02.426700671Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
level=info ts=2021-11-25T02:53:02.432181021Z caller=operator.go:741 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2021-11-25T02:53:02.50330463Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
level=info ts=2021-11-25T02:53:02.509158493Z caller=operator.go:741 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2021-11-25T02:53:02.553180316Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
...

The "foregroundDeletion" finalizer should be removed from the alertmanager-main StatefulSet:

# oc -n openshift-monitoring get sts alertmanager-main -oyaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    prometheus-operator-input-hash: "14523878381744334873"
  creationTimestamp: "2021-11-25T03:49:58Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-11-25T03:49:58Z"
  finalizers:
  - foregroundDeletion
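A minimal sketch of clearing the stuck finalizer by hand while debugging (assumes cluster-admin access; shown for illustration only, it is not part of the fix itself):

# oc -n openshift-monitoring patch statefulset alertmanager-main --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

Once the finalizer is gone, the pending deletion should complete and the prometheus-operator can recreate the StatefulSet.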
Pull request submitted.

Checked with 4.10.0-0.nightly-2021-12-12-184227; the fix is in it. The Alertmanager StatefulSet has 2 replicas and hard anti-affinity set.

# oc -n openshift-monitoring get pod | grep alertmanager-main
alertmanager-main-0   6/6   Running   0   4h13m
alertmanager-main-1   6/6   Running   0   4h12m

# oc -n openshift-monitoring get sts alertmanager-main -oyaml
...
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: alert-router
                app.kubernetes.io/instance: main
                app.kubernetes.io/name: alertmanager
                app.kubernetes.io/part-of: openshift-monitoring
            namespaces:
            - openshift-monitoring
            topologyKey: kubernetes.io/hostname

# oc -n openshift-monitoring get pdb alertmanager-main
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
alertmanager-main   N/A             1                 1                     10h

# oc -n openshift-monitoring get pdb alertmanager-main -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-12-12T23:30:56Z"
  generation: 1
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.22.2
  name: alertmanager-main
  namespace: openshift-monitoring
  resourceVersion: "149472"
  uid: 74e9b3dd-a3c8-45fb-8b5a-6b627a0a3acd
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: alert-router
      app.kubernetes.io/instance: main
      app.kubernetes.io/name: alertmanager
      app.kubernetes.io/part-of: openshift-monitoring
status:
  conditions:
  - lastTransitionTime: "2021-12-13T06:01:03Z"
    message: ""
    observedGeneration: 1
    reason: SufficientPods
    status: "True"
    type: DisruptionAllowed
  currentHealthy: 2
  desiredHealthy: 1
  disruptionsAllowed: 1
  expectedPods: 2

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
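As an additional spot check of the hard anti-affinity (a sketch, not part of the verification above; it assumes the pod labels shown in the PDB selector), the two replicas should be scheduled on different nodes:

# oc -n openshift-monitoring get pod -l app.kubernetes.io/name=alertmanager -o wide
# oc -n openshift-monitoring get pod -l app.kubernetes.io/name=alertmanager -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'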