Description of problem:
As described in the conventions doc https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability, both Prometheus and Alertmanager should run with a replica count of 2 and hard pod anti-affinity set, until the descheduler is brought into the product.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
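For reference, the difference between the soft spreading shipped today and the hard spreading the conventions doc asks for comes down to one field in the pod spec. A minimal sketch (the label selector here is illustrative, not copied from the shipped manifests):

    # soft anti-affinity: the scheduler prefers to spread replicas across
    # nodes but may still co-locate them when nodes are scarce
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: prometheus  # illustrative selector
            topologyKey: kubernetes.io/hostname

    # hard anti-affinity: a replica stays Pending rather than landing on a
    # node that already runs another replica
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus  # illustrative selector
          topologyKey: kubernetes.io/hostname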
Alertmanager needs to stay at 3 replicas with soft anti-affinity until the StatefulSet resource implements minReadySeconds [1].

[1] https://github.com/kubernetes/kubernetes/issues/65098
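For context, minReadySeconds already exists on Deployments; once the StatefulSet controller gains it, a rolling update only proceeds after each pod has been Ready for the configured window, which is what makes a 2-replica Alertmanager safe. A minimal sketch of how that would look (field name per the upstream proposal; the other values are illustrative):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: alertmanager-main
    spec:
      replicas: 3
      minReadySeconds: 30  # pod must stay Ready this long before it counts as available
      selector:
        matchLabels:
          app.kubernetes.io/name: alertmanager
      serviceName: alertmanager-operated
      template:
        ...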
tested with 4.8.0-0.nightly-2021-04-29-222100, hard anti-affinity is added to the Prometheus StatefulSets

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml
...
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: prometheus
                operator: In
                values:
                - k8s
            namespaces:
            - openshift-monitoring
            topologyKey: kubernetes.io/hostname

# oc -n openshift-user-workload-monitoring get sts prometheus-user-workload -oyaml
...
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: prometheus
                operator: In
                values:
                - user-workload
            namespaces:
            - openshift-user-workload-monitoring
            topologyKey: kubernetes.io/hostname
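One way to double-check that the hard rule actually spreads the pods is to list them with their nodes (the label key/values are the ones matched by the selectors above); the two replicas of each StatefulSet should land on different hosts:

# oc -n openshift-monitoring get pods -l prometheus=k8s -o wide
# oc -n openshift-user-workload-monitoring get pods -l prometheus=user-workload -o wide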
*** Bug 1955147 has been marked as a duplicate of this bug. ***
Unsetting target release for now.
The path forward is to wait for bug 1974832, which adds an alert to detect workloads with persistent storage that are scheduled on the same node. Together with this alert, a runbook will be provided to help users fix their clusters, so that hard pod anti-affinity on hostname can be re-enabled in 4.9.
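As an illustration of what such an alert can look like (a sketch, not the actual rule from bug 1974832; it assumes the standard kube-state-metrics series kube_pod_info and kube_pod_spec_volumes_persistentvolumeclaims_info, and the alert name is made up):

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: workload-spread-sketch
      namespace: openshift-monitoring
    spec:
      groups:
      - name: workload-spread
        rules:
        - alert: HAWorkloadWithStorageCoLocated  # hypothetical name
          expr: |
            # pods of the same owner that mount a PVC and share a node
            count by (namespace, created_by_name, node) (
              count by (namespace, pod, node, created_by_name) (
                  kube_pod_spec_volumes_persistentvolumeclaims_info
                * on (namespace, pod) group_left(node, created_by_name)
                  kube_pod_info
              )
            ) > 1
          for: 1h
          labels:
            severity: warning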
https://github.com/openshift/cluster-monitoring-operator/pull/1341 has been merged
checked with 4.10.0-0.nightly-2021-11-28-164900, the Prometheus StatefulSets now have 2 replicas and hard anti-affinity set

# oc -n openshift-monitoring get pod -o wide | grep prometheus-k8s
prometheus-k8s-0   6/6   Running   0   5m36s   10.129.2.60   ip-10-0-194-46.us-east-2.compute.internal    <none>   <none>
prometheus-k8s-1   6/6   Running   0   5m36s   10.131.0.23   ip-10-0-129-166.us-east-2.compute.internal   <none>   <none>

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml
...
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
...
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: prometheus
                app.kubernetes.io/name: prometheus
                app.kubernetes.io/part-of: openshift-monitoring
                prometheus: k8s
            namespaces:
            - openshift-monitoring
            topologyKey: kubernetes.io/hostname

# oc -n openshift-user-workload-monitoring get pod -o wide | grep prometheus-user-workload
prometheus-user-workload-0   5/5   Running   0   2m22s   10.129.2.64    ip-10-0-194-46.us-east-2.compute.internal   <none>   <none>
prometheus-user-workload-1   5/5   Running   0   2m22s   10.128.2.122   ip-10-0-191-20.us-east-2.compute.internal   <none>   <none>

# oc -n openshift-user-workload-monitoring get sts prometheus-user-workload -oyaml
...
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
...
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: prometheus
                app.kubernetes.io/name: prometheus
                app.kubernetes.io/part-of: openshift-monitoring
                prometheus: user-workload
            namespaces:
            - openshift-user-workload-monitoring
            topologyKey: kubernetes.io/hostname
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056