Bug 1955490
| Summary: | Thanos ruler Statefulsets should have 2 replicas and hard affinity set | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | Brian Burt <bburt> |
| Priority: | medium | | |
| Version: | 4.6 | CC: | anpicker, bburt, dgrisonn, erooth, juzhao, kai-uwe.rommel, oarribas |
| Target Milestone: | --- | Keywords: | EasyFix |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |

Doc Text:
Previously, the Thanos Ruler service would become unavailable when the node that contains the two Thanos Ruler pods experienced an outage. This situation occurred because the Thanos Ruler pods had only soft anti-affinity rules regarding node placement. Consequently, user-defined rules would not be evaluated until the node came back online.
With this release, the Cluster Monitoring Operator (CMO) now configures hard anti-affinity rules to ensure that the two Thanos Ruler pods are scheduled on different nodes. As a result, a single-node outage no longer creates a gap in user-defined rule evaluation.
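
For context, the difference between the two rule types in a pod spec looks roughly like this. This is an illustrative sketch only, with labels abbreviated; the exact manifest that CMO generates appears in the verification output below.

```yaml
# Soft anti-affinity: the scheduler prefers to spread matching pods across
# nodes, but will still co-locate them when no other node fits.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: thanos-ruler
        topologyKey: kubernetes.io/hostname
---
# Hard anti-affinity: matching pods must land on different nodes; a pod
# that cannot be placed stays Pending instead of sharing a node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: thanos-ruler
      topologyKey: kubernetes.io/hostname
```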

| Story Points: | --- | | |
| --- | --- | --- | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:03:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Simon Pasquier, 2021-04-30 08:52:21 UTC)
*** Bug 1950035 has been marked as a duplicate of this bug. ***

The pull request is on hold because of an issue with hard affinity and persistent volumes, as detailed in bug https://bugzilla.redhat.com/show_bug.cgi?id=1967614.

The PR has been closed for the reasons mentioned above. Moving the BZ back to ASSIGNED.

*** Bug 1997948 has been marked as a duplicate of this bug. ***

*** Bug 2016753 has been marked as a duplicate of this bug. ***

Checked with 4.10.0-0.nightly-2021-11-28-164900: the Thanos Ruler StatefulSet now has 2 replicas and hard anti-affinity set.

```
# oc -n openshift-user-workload-monitoring get pod -o wide | grep thanos-ruler
thanos-ruler-user-workload-0   3/3   Running   0   8m55s   10.129.2.65    ip-10-0-194-46.us-east-2.compute.internal   <none>   <none>
thanos-ruler-user-workload-1   3/3   Running   0   8m55s   10.128.2.123   ip-10-0-191-20.us-east-2.compute.internal   <none>   <none>
```

```
# oc -n openshift-user-workload-monitoring get sts thanos-ruler-user-workload -oyaml
...
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
  ...
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: thanos-ruler
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: user-workload
        app.kubernetes.io/managed-by: prometheus-operator
        app.kubernetes.io/name: thanos-ruler
        thanos-ruler: user-workload
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: thanos-ruler
                thanos-ruler: user-workload
            namespaces:
            - openshift-user-workload-monitoring
            topologyKey: kubernetes.io/hostname
```

```
# oc -n openshift-user-workload-monitoring get pdb thanos-ruler-user-workload -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-11-29T06:21:45Z"
  generation: 1
  labels:
    thanosRulerName: user-workload
  name: thanos-ruler-user-workload
  namespace: openshift-user-workload-monitoring
  resourceVersion: "149008"
  uid: 76c8db6f-f489-4493-8b43-84239abb9ff4
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-ruler
      thanos-ruler: user-workload
status:
  conditions:
  - lastTransitionTime: "2021-11-29T06:21:48Z"
    message: ""
    observedGeneration: 1
    reason: SufficientPods
    status: "True"
    type: DisruptionAllowed
  currentHealthy: 2
  desiredHealthy: 1
  disruptionsAllowed: 1
  expectedPods: 2
  observedGeneration: 1
```

Any news on when this will be fixed? It is still present in the latest 4.9 releases. The problem is annoying because it generates unnecessary alerts. Shouldn't it be an easy backport from the 4.10 nightlies into the current 4.9 stable releases?

We have no plan to backport the fix because it would not be easy: we consider switching from soft anti-affinity to hard anti-affinity too risky a change for a z-stream release.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
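
For anyone re-checking the fix on a 4.10 or later cluster, the relevant fields can be read back directly with jsonpath queries. This assumes the same resource names as in the verification above and that user workload monitoring is enabled; the expected values shown after each command are taken from the StatefulSet in the verification comment.

```
# oc -n openshift-user-workload-monitoring get sts thanos-ruler-user-workload \
    -o jsonpath='{.spec.replicas}{"\n"}'
2
# oc -n openshift-user-workload-monitoring get sts thanos-ruler-user-workload \
    -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].topologyKey}{"\n"}'
kubernetes.io/hostname
```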