Bug 1635802
| Summary: | prometheus & alertmanager pods should use anti-affinity | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
| Component: | Monitoring | Assignee: | Frederic Branczyk <fbranczy> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | minden |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-03-14 02:17:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description: Justin Pierce, 2018-10-03 17:28:02 UTC
Thanks for the hint Justin. We have fixed this upstream [1] and we will propagate this to the cluster-monitoring-operator next.

[1] https://github.com/coreos/prometheus-operator/pull/1935

There is one remaining problem: the prometheus-operator and prometheus-k8s pods are created on the nodes matching the same nodeSelector.
In my testing, the nodeSelector is:
******************************************
nodeSelector:
  role: node
******************************************
There are 3 nodes labelled with role=node: ip-172-18-10-235.ec2.internal, ip-172-18-12-50.ec2.internal and ip-172-18-15-213.ec2.internal.
However, prometheus-operator-7566fcccc8-t7wc5 and prometheus-k8s-0 were both scheduled on the same node, ip-172-18-10-235.ec2.internal, and no prometheus or prometheus-operator pod was scheduled on node ip-172-18-12-50.ec2.internal.
# oc get node --show-labels | grep role=node
ip-172-18-10-235.ec2.internal Ready compute 7h v1.11.0+d4cacc0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-10-235.ec2.internal,node-role.kubernetes.io/compute=true,role=node
ip-172-18-12-50.ec2.internal Ready <none> 7h v1.11.0+d4cacc0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-12-50.ec2.internal,registry=enabled,role=node,router=enabled
ip-172-18-15-213.ec2.internal Ready compute 7h v1.11.0+d4cacc0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-15-213.ec2.internal,node-role.kubernetes.io/compute=true,role=node
*****************************************************************
# oc get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
alertmanager-main-0 3/3 Running 0 7h 10.131.0.4 ip-172-18-10-235.ec2.internal <none>
alertmanager-main-1 3/3 Running 0 7h 10.130.0.6 ip-172-18-12-50.ec2.internal <none>
alertmanager-main-2 3/3 Running 0 7h 10.129.0.5 ip-172-18-15-213.ec2.internal <none>
cluster-monitoring-operator-56bb5946c4-mzqdk 1/1 Running 0 7h 10.129.0.2 ip-172-18-15-213.ec2.internal <none>
grafana-56f6875b69-ljr8k 2/2 Running 0 7h 10.129.0.3 ip-172-18-15-213.ec2.internal <none>
kube-state-metrics-776f9667b-2lxdl 3/3 Running 0 7h 10.130.0.7 ip-172-18-12-50.ec2.internal <none>
node-exporter-gkdt6 2/2 Running 0 7h 172.18.1.217 ip-172-18-1-217.ec2.internal <none>
node-exporter-gnj2c 2/2 Running 0 7h 172.18.12.50 ip-172-18-12-50.ec2.internal <none>
node-exporter-l6md2 2/2 Running 0 7h 172.18.15.213 ip-172-18-15-213.ec2.internal <none>
node-exporter-wnsqd 2/2 Running 0 7h 172.18.10.235 ip-172-18-10-235.ec2.internal <none>
prometheus-k8s-0 4/4 Running 1 7h 10.131.0.3 ip-172-18-10-235.ec2.internal <none>
prometheus-k8s-1 4/4 Running 1 7h 10.129.0.4 ip-172-18-15-213.ec2.internal <none>
prometheus-operator-7566fcccc8-t7wc5 1/1 Running 0 7h 10.131.0.2 ip-172-18-10-235.ec2.internal <none>
*******************************************************************************
I think we should also add anti-affinity for the prometheus-operator pod:
# oc -n openshift-monitoring get pod prometheus-operator-7566fcccc8-t7wc5 -oyaml | grep -i affinity
Nothing is returned, so the prometheus-operator pod currently has no affinity rules.
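For illustration only, a soft anti-affinity rule of the kind suggested here could look roughly like the sketch below if placed under the prometheus-operator Deployment's pod template spec. This is a hypothetical sketch, not a shipped manifest: the weight, the topologyKey, and the choice to repel the prometheus-k8s pods (selected via their prometheus=k8s label) are all assumptions, and the reply below explains why this change was ultimately not pursued.

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        # prefer nodes that do not already run a prometheus-k8s pod
        labelSelector:
          matchExpressions:
          - key: prometheus
            operator: In
            values:
            - k8s
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname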
Image: ose-prometheus-operator-v3.11.28-1; the other image versions are also v3.11.28-1.

I don't think anti-affinity between Prometheus and the Prometheus Operator has any effect. The Prometheus Operator has no issue being spun up again on a different node or on the same node as Prometheus, whereas two Prometheus servers being on the same node has availability concerns.

(In reply to Frederic Branczyk from comment #4)
> I don't think anti-affinity between Prometheus and the Prometheus Operator
> has any effect. The Prometheus Operator has no issue being spun up again on
> a different node or on the same node as Prometheus, whereas two Prometheus
> servers being on the same node has availability concerns.

Thanks for the confirmation, please change this defect to ON_QA.

Both 3.11 and 4.0 have anti-affinity on the Prometheus and Alertmanager pods, moving to MODIFIED.

The prometheus & alertmanager pods use anti-affinity now, e.g.:
prometheus-k8s-0 pod:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: prometheus
          operator: In
          values:
          - k8s
      namespaces:
      - openshift-monitoring
cluster monitoring images: v3.11.88
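For reference, the grep output above appears truncated after the namespaces list. A complete preferred anti-affinity term in a pod spec also carries a weight and a topologyKey; a sketch of what the full stanza would typically look like follows. The weight of 100 and the kubernetes.io/hostname topology key are assumptions based on common prometheus-operator defaults, not values copied from this cluster.

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100                              # assumed value, not taken from the output above
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: prometheus
          operator: In
          values:
          - k8s
      namespaces:
      - openshift-monitoring
      topologyKey: kubernetes.io/hostname    # spread replicas across nodes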
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0407