Bug 1635802
| Summary: | prometheus & alertmanager pods should use anti-affinity | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
| Component: | Monitoring | Assignee: | Frederic Branczyk <fbranczy> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | minden |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-03-14 02:17:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description: Justin Pierce, 2018-10-03 17:28:02 UTC
Thanks for the hint Justin. We have fixed this upstream [1] and we will propagate this to the cluster-monitoring-operator next.

[1] https://github.com/coreos/prometheus-operator/pull/1935

There is one remaining problem: the prometheus-operator and prometheus-k8s pods are created on the nodes matching the same nodeSelector.
In my testing, the nodeSelector is:
******************************************
nodeSelector:
  role: node
******************************************
There are 3 nodes labelled with role=node: ip-172-18-10-235.ec2.internal, ip-172-18-12-50.ec2.internal and ip-172-18-15-213.ec2.internal.
However, prometheus-operator-7566fcccc8-t7wc5 and prometheus-k8s-0 were both scheduled on the same node, ip-172-18-10-235.ec2.internal, and no prometheus or prometheus-operator pod was scheduled on node ip-172-18-12-50.ec2.internal.
# oc get node --show-labels | grep role=node
ip-172-18-10-235.ec2.internal Ready compute 7h v1.11.0+d4cacc0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-10-235.ec2.internal,node-role.kubernetes.io/compute=true,role=node
ip-172-18-12-50.ec2.internal Ready <none> 7h v1.11.0+d4cacc0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-12-50.ec2.internal,registry=enabled,role=node,router=enabled
ip-172-18-15-213.ec2.internal Ready compute 7h v1.11.0+d4cacc0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-15-213.ec2.internal,node-role.kubernetes.io/compute=true,role=node
*****************************************************************
# oc get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
alertmanager-main-0 3/3 Running 0 7h 10.131.0.4 ip-172-18-10-235.ec2.internal <none>
alertmanager-main-1 3/3 Running 0 7h 10.130.0.6 ip-172-18-12-50.ec2.internal <none>
alertmanager-main-2 3/3 Running 0 7h 10.129.0.5 ip-172-18-15-213.ec2.internal <none>
cluster-monitoring-operator-56bb5946c4-mzqdk 1/1 Running 0 7h 10.129.0.2 ip-172-18-15-213.ec2.internal <none>
grafana-56f6875b69-ljr8k 2/2 Running 0 7h 10.129.0.3 ip-172-18-15-213.ec2.internal <none>
kube-state-metrics-776f9667b-2lxdl 3/3 Running 0 7h 10.130.0.7 ip-172-18-12-50.ec2.internal <none>
node-exporter-gkdt6 2/2 Running 0 7h 172.18.1.217 ip-172-18-1-217.ec2.internal <none>
node-exporter-gnj2c 2/2 Running 0 7h 172.18.12.50 ip-172-18-12-50.ec2.internal <none>
node-exporter-l6md2 2/2 Running 0 7h 172.18.15.213 ip-172-18-15-213.ec2.internal <none>
node-exporter-wnsqd 2/2 Running 0 7h 172.18.10.235 ip-172-18-10-235.ec2.internal <none>
prometheus-k8s-0 4/4 Running 1 7h 10.131.0.3 ip-172-18-10-235.ec2.internal <none>
prometheus-k8s-1 4/4 Running 1 7h 10.129.0.4 ip-172-18-15-213.ec2.internal <none>
prometheus-operator-7566fcccc8-t7wc5 1/1 Running 0 7h 10.131.0.2 ip-172-18-10-235.ec2.internal <none>
*******************************************************************************
I think we should also add anti-affinity for the prometheus-operator pod:
# oc -n openshift-monitoring get pod prometheus-operator-7566fcccc8-t7wc5 -oyaml | grep -i affinity
Nothing is returned, so the prometheus-operator pod currently has no affinity rules.
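For illustration only, a soft anti-affinity rule of the kind suggested here could look roughly like the sketch below if placed under the prometheus-operator Deployment's pod template spec. This is a hypothetical sketch, not a shipped manifest: the weight, the topologyKey, and the choice to repel the prometheus-k8s pods (selected via their prometheus=k8s label) are all assumptions, and the reply below explains why this change was ultimately not pursued.

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        # prefer nodes that do not already run a prometheus-k8s pod
        labelSelector:
          matchExpressions:
          - key: prometheus
            operator: In
            values:
            - k8s
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname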
Image: ose-prometheus-operator-v3.11.28-1; the other image versions are also v3.11.28-1.

I don't think anti-affinity between Prometheus and the Prometheus Operator has any effect. The Prometheus Operator has no issue being spun up again on a different node or on the same node as Prometheus, whereas two Prometheus servers being on the same node has availability concerns.

(In reply to Frederic Branczyk from comment #4)
> I don't think anti-affinity between Prometheus and the Prometheus Operator
> has any effect. The Prometheus Operator has no issue being spun up again on
> a different node or on the same node as Prometheus, whereas two Prometheus
> servers being on the same node has availability concerns.

Thanks for the confirmation, please change this defect to ON_QA.

Both 3.11 and 4.0 have anti-affinity on the Prometheus and Alertmanager pods, moving to MODIFIED.

The prometheus & alertmanager pods use anti-affinity now, e.g.:
prometheus-k8s-0 pod:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: prometheus
          operator: In
          values:
          - k8s
      namespaces:
      - openshift-monitoring
cluster monitoring images: v3.11.88
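For reference, the grep output above appears truncated after the namespaces list. A complete preferred anti-affinity term in a pod spec also carries a weight and a topologyKey; a sketch of what the full stanza would typically look like follows. The weight of 100 and the kubernetes.io/hostname topology key are assumptions based on common prometheus-operator defaults, not values copied from this cluster.

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100                              # assumed value, not taken from the output above
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: prometheus
          operator: In
          values:
          - k8s
      namespaces:
      - openshift-monitoring
      topologyKey: kubernetes.io/hostname    # spread replicas across nodes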
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0407