Bug 1957704 - Prometheus Statefulsets should have 2 replicas and hard affinity set
Summary: Prometheus Statefulsets should have 2 replicas and hard affinity set
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.z
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1957703
Blocks:
 
Reported: 2021-05-06 10:29 UTC by Damien Grisonnet
Modified: 2021-10-15 12:34 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1957703
Environment:
Last Closed: 2021-10-15 12:34:48 UTC
Target Upstream Version:
Embargoed:



Comment 1 Damien Grisonnet 2021-05-25 17:29:21 UTC
Backport to 4.7 is still waiting for patch manager approval.

Comment 2 Junqi Zhao 2021-05-31 12:08:31 UTC
Tested with the not-yet-merged PR: hard anti-affinity is added to the Prometheuses, and the prometheus-k8s and prometheus-user-workload pods are scheduled to different nodes.
# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep podAntiAffinity -A10
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: prometheus
                operator: In
                values:
                - k8s
            namespaces:
            - openshift-monitoring
            topologyKey: kubernetes.io/hostname
# oc -n openshift-user-workload-monitoring get sts prometheus-user-workload -oyaml  | grep podAntiAffinity -A10
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: prometheus
                operator: In
                values:
                - user-workload
            namespaces:
            - openshift-user-workload-monitoring
            topologyKey: kubernetes.io/hostname
# oc -n openshift-monitoring get pod -o wide | grep prometheus-k8s
prometheus-k8s-0                               6/6     Running   1          41m   10.129.2.10   ci-ln-ckgb6k2-f76d1-ks7lc-worker-d-dccrt   <none>           <none>
prometheus-k8s-1                               6/6     Running   1          41m   10.128.2.6    ci-ln-ckgb6k2-f76d1-ks7lc-worker-c-5t7fg   <none>           <none>
# oc -n openshift-user-workload-monitoring get po -o wide 
NAME                                  READY   STATUS    RESTARTS   AGE    IP            NODE                                       NOMINATED NODE   READINESS GATES
prometheus-operator-7f9c94d4b-4jdnb   2/2     Running   0          102s   10.129.0.51   ci-ln-ckgb6k2-f76d1-ks7lc-master-2         <none>           <none>
prometheus-user-workload-0            4/4     Running   1          96s    10.128.2.9    ci-ln-ckgb6k2-f76d1-ks7lc-worker-c-5t7fg   <none>           <none>
prometheus-user-workload-1            4/4     Running   1          96s    10.129.2.26   ci-ln-ckgb6k2-f76d1-ks7lc-worker-d-dccrt   <none>           <none>
thanos-ruler-user-workload-0          3/3     Running   0          94s    10.129.2.27   ci-ln-ckgb6k2-f76d1-ks7lc-worker-d-dccrt   <none>           <none>
thanos-ruler-user-workload-1          3/3     Running   0          94s    10.128.2.10   ci-ln-ckgb6k2-f76d1-ks7lc-worker-c-5t7fg   <none>           <none>

Comment 5 W. Trevor King 2021-06-22 23:23:49 UTC
Regarding the move back from POST: we are dropping PR 1186. The issue was bug 1967966, which showed problems with the transition to hard anti-affinity with a PDB guard when Prometheus was backed by a persistent volume [1]. So this whole hard-anti-affinity bug chain descended from bug 1949262 is back to pre-POST while we work out a fix that we can backport without surprising folks who currently have volumes that would block an attempt to push Prometheus out to separate nodes. Bug 1974832 may be the next thing that moves in this space, but we'll see.

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/1186#issuecomment-860766266
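For context, the "PDB guard" mentioned above pairs the hard anti-affinity with a PodDisruptionBudget, so that voluntary disruptions (node drains, etc.) can evict at most one of the two replicas at a time. A minimal sketch of such a guard for prometheus-k8s might look like the following; the object name is an illustrative assumption rather than what PR 1186 actually ships, the label selector mirrors the anti-affinity selector shown in comment 2, and policy/v1beta1 is used since this release line predates the policy/v1 GA:

```yaml
# Hypothetical PodDisruptionBudget guarding the 2-replica prometheus-k8s
# StatefulSet: with minAvailable: 1, at most one replica may be down due
# to a voluntary disruption at any time.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: prometheus-k8s            # illustrative name, not from the PR
  namespace: openshift-monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      prometheus: k8s             # same label the anti-affinity rule matches
```

The interaction described in bug 1967966 arises because a pod pinned to a node by its persistent volume cannot be rescheduled elsewhere to satisfy the new hard anti-affinity rule, while the PDB simultaneously limits how many replicas may be disrupted to force a move.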

