Bug 1635802
Summary: | prometheus & alertmanager pods should use anti-affinity | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
Component: | Monitoring | Assignee: | Frederic Branczyk <fbranczy> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | | |
Version: | 3.11.0 | CC: | minden |
Target Milestone: | --- | | |
Target Release: | 3.11.z | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2019-03-14 02:17:59 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description (Justin Pierce, 2018-10-03 17:28:02 UTC)
Thanks for the hint Justin. We have fixed this upstream [1] and we will propagate this to the cluster-monitoring-operator next.

[1] https://github.com/coreos/prometheus-operator/pull/1935

There is one remaining problem: the prometheus-operator and prometheus-k8s pods are created on nodes that share the same nodeSelector. In my testing, the nodeSelector is:

```
nodeSelector:
  role: node
```

Three nodes are labelled with role=node: ip-172-18-10-235.ec2.internal, ip-172-18-12-50.ec2.internal, and ip-172-18-15-213.ec2.internal. However, prometheus-operator-7566fcccc8-t7wc5 and prometheus-k8s-0 were both created on the same node, ip-172-18-10-235.ec2.internal, and no prometheus or prometheus-operator pod was created on node ip-172-18-12-50.ec2.internal.

```
# oc get node --show-labels | grep role=node
ip-172-18-10-235.ec2.internal   Ready   compute   7h   v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-10-235.ec2.internal,node-role.kubernetes.io/compute=true,role=node
ip-172-18-12-50.ec2.internal    Ready   <none>    7h   v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-12-50.ec2.internal,registry=enabled,role=node,router=enabled
ip-172-18-15-213.ec2.internal   Ready   compute   7h   v1.11.0+d4cacc0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-15-213.ec2.internal,node-role.kubernetes.io/compute=true,role=node
```

```
# oc get pod -o wide
NAME                                           READY   STATUS    RESTARTS   AGE   IP              NODE                            NOMINATED NODE
alertmanager-main-0                            3/3     Running   0          7h    10.131.0.4      ip-172-18-10-235.ec2.internal   <none>
alertmanager-main-1                            3/3     Running   0          7h    10.130.0.6      ip-172-18-12-50.ec2.internal    <none>
alertmanager-main-2                            3/3     Running   0          7h    10.129.0.5      ip-172-18-15-213.ec2.internal   <none>
cluster-monitoring-operator-56bb5946c4-mzqdk   1/1     Running   0          7h    10.129.0.2      ip-172-18-15-213.ec2.internal   <none>
grafana-56f6875b69-ljr8k                       2/2     Running   0          7h    10.129.0.3      ip-172-18-15-213.ec2.internal   <none>
kube-state-metrics-776f9667b-2lxdl             3/3     Running   0          7h    10.130.0.7      ip-172-18-12-50.ec2.internal    <none>
node-exporter-gkdt6                            2/2     Running   0          7h    172.18.1.217    ip-172-18-1-217.ec2.internal    <none>
node-exporter-gnj2c                            2/2     Running   0          7h    172.18.12.50    ip-172-18-12-50.ec2.internal    <none>
node-exporter-l6md2                            2/2     Running   0          7h    172.18.15.213   ip-172-18-15-213.ec2.internal   <none>
node-exporter-wnsqd                            2/2     Running   0          7h    172.18.10.235   ip-172-18-10-235.ec2.internal   <none>
prometheus-k8s-0                               4/4     Running   1          7h    10.131.0.3      ip-172-18-10-235.ec2.internal   <none>
prometheus-k8s-1                               4/4     Running   1          7h    10.129.0.4      ip-172-18-15-213.ec2.internal   <none>
prometheus-operator-7566fcccc8-t7wc5           1/1     Running   0          7h    10.131.0.2      ip-172-18-10-235.ec2.internal   <none>
```

I think we should also add anti-affinity for the prometheus-operator pod; it currently has none configured:

```
# oc -n openshift-monitoring get pod prometheus-operator-7566fcccc8-t7wc5 -oyaml | grep -i affinity
```

Nothing is returned.

Image: ose-prometheus-operator-v3.11.28-1 (the other images are also v3.11.28-1).
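As a point of reference, the upstream fix referenced in [1] adds soft (preferred) pod anti-affinity to the StatefulSets the operator generates. Below is a minimal sketch of what such a stanza could look like inside the Alertmanager StatefulSet's `spec.template.spec`; the label key/values (`alertmanager: main`), the weight, and the topology key are assumptions that mirror the Prometheus example quoted in the verification comment further down, not an excerpt from the PR.

```yaml
# Illustrative only: assumed shape of the generated pod-template affinity
# for alertmanager-main; values are guesses mirroring the prometheus-k8s case.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # prefer spreading replicas across nodes
          labelSelector:
            matchExpressions:
            - key: alertmanager
              operator: In
              values:
              - main
          namespaces:
          - openshift-monitoring
```

Because the rule is preferred rather than required, the scheduler spreads the replicas when enough nodes match the nodeSelector, but still schedules them if they must share a node.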
I don't think anti-affinity between Prometheus and the Prometheus Operator would have any effect. The Prometheus Operator has no issue being spun up again on a different node or on the same node as Prometheus, whereas two Prometheus servers being on the same node has availability concerns.

(In reply to Frederic Branczyk from comment #4)
> I don't think anti-affinity of Prometheus and Prometheus Operator have any
> effect. The Prometheus Operator has no issue being spun up again on a
> different node or same node as Prometheus, whereas two Prometheus servers
> being on the same node has availability concerns.

Thanks for the confirmation; please change this defect to ON_QA.

Both 3.11 and 4.0 have anti-affinity on Prometheus and Alertmanager pods, moving to MODIFIED.

The prometheus & alertmanager pods now use anti-affinity, e.g. the prometheus-k8s-0 pod:

```
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: prometheus
          operator: In
          values:
          - k8s
      namespaces:
      - openshift-monitoring
```

Cluster monitoring images: v3.11.88

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0407
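For anyone re-verifying this on a cluster, a rough sketch of the checks used above, generalized to both components, might look like the following. The resource names (prometheus-k8s, alertmanager-main) are inferred from the pod listing earlier in this report, and the grep context length is arbitrary.

```sh
# Show the anti-affinity rendered into the generated StatefulSets
oc -n openshift-monitoring get statefulset prometheus-k8s -o yaml | grep -A 12 podAntiAffinity
oc -n openshift-monitoring get statefulset alertmanager-main -o yaml | grep -A 12 podAntiAffinity

# Confirm the replicas actually landed on different nodes
oc -n openshift-monitoring get pods -o wide | grep -E 'prometheus-k8s|alertmanager-main'
```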