Description of problem:
As mentioned in the conventions doc https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability, Thanos Ruler should have a replica count of 2 with hard anti-affinity set until we bring the descheduler into our product.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Follow-up of bug 1949262.
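For illustration, the hard anti-affinity the convention calls for would look roughly like this in the Thanos Ruler pod template (a sketch only; the exact labels and namespace are taken from the user-workload-monitoring deployment and may differ in other contexts):

```yaml
# Sketch of a hard (required) pod anti-affinity rule: the scheduler refuses
# to place two matching replicas on the same node (topologyKey: hostname),
# instead of merely preferring to spread them as soft anti-affinity does.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: thanos-ruler
      namespaces:
      - openshift-user-workload-monitoring
      topologyKey: kubernetes.io/hostname
```

With only soft (preferred) anti-affinity, both replicas can land on one node and a single node outage takes down the whole ruler, which is the failure mode this bug addresses.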
*** Bug 1950035 has been marked as a duplicate of this bug. ***
The pull request is on hold because of an issue with hard anti-affinity and persistent volumes, as detailed in bug https://bugzilla.redhat.com/show_bug.cgi?id=1967614.
The PR has been closed for the reasons mentioned above. Moving the BZ back to ASSIGNED.
*** Bug 1997948 has been marked as a duplicate of this bug. ***
*** Bug 2016753 has been marked as a duplicate of this bug. ***
https://github.com/openshift/cluster-monitoring-operator/pull/1341 has been merged
Checked with 4.10.0-0.nightly-2021-11-28-164900, the Thanos Ruler StatefulSet now has 2 replicas and hard anti-affinity set:

# oc -n openshift-user-workload-monitoring get pod -o wide | grep thanos-ruler
thanos-ruler-user-workload-0   3/3   Running   0   8m55s   10.129.2.65    ip-10-0-194-46.us-east-2.compute.internal   <none>   <none>
thanos-ruler-user-workload-1   3/3   Running   0   8m55s   10.128.2.123   ip-10-0-191-20.us-east-2.compute.internal   <none>   <none>

# oc -n openshift-user-workload-monitoring get sts thanos-ruler-user-workload -oyaml
...
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
  ...
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: thanos-ruler
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: user-workload
        app.kubernetes.io/managed-by: prometheus-operator
        app.kubernetes.io/name: thanos-ruler
        thanos-ruler: user-workload
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: thanos-ruler
                thanos-ruler: user-workload
            namespaces:
            - openshift-user-workload-monitoring
            topologyKey: kubernetes.io/hostname

# oc -n openshift-user-workload-monitoring get pdb thanos-ruler-user-workload -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-11-29T06:21:45Z"
  generation: 1
  labels:
    thanosRulerName: user-workload
  name: thanos-ruler-user-workload
  namespace: openshift-user-workload-monitoring
  resourceVersion: "149008"
  uid: 76c8db6f-f489-4493-8b43-84239abb9ff4
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-ruler
      thanos-ruler: user-workload
status:
  conditions:
  - lastTransitionTime: "2021-11-29T06:21:48Z"
    message: ""
    observedGeneration: 1
    reason: SufficientPods
    status: "True"
    type: DisruptionAllowed
  currentHealthy: 2
  desiredHealthy: 1
  disruptionsAllowed: 1
  expectedPods: 2
  observedGeneration: 1
Any news on when this will be fixed? It is still present in the latest 4.9 releases. The problem is annoying because it generates unnecessary alerts.
I mean, it should be an easy backport from the 4.10 nightlies into the current 4.9 stable releases, right?
We have no plan to backport the fix because it would not be easy: we consider switching from soft anti-affinity to hard anti-affinity too risky for a z-stream release.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056