Bug 1812834

Summary:	thanos-querier has master tolerations set but no master node-selector set
Product:	OpenShift Container Platform	Reporter:	Sergiusz Urbaniak <surbania>
Component:	Monitoring	Assignee:	Sergiusz Urbaniak <surbania>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	low	Docs Contact:
Priority:	unspecified
Version:	4.3.0	CC:	alegrand, anpicker, cvogel, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Currently, it is possible for thanos-querier to be scheduled on both master and worker nodes, as only master tolerations are set. We should also set a master node-selector so that thanos-querier is guaranteed to be deployed on master nodes. Consequence: thanos-querier could be scheduled on both master and worker nodes. Fix: thanos-querier belongs on worker nodes, hence the master toleration has been removed. Additionally, as with every payload we deploy on worker nodes, we make this configurable via a new `thanosQuerier` section in the cluster-monitoring-operator configmap. Result: thanos-querier is deployed on worker nodes only and can be configured with a node selector, tolerations, and resources, as is done for the in-cluster Prometheus.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-08-04 18:05:00 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Sergiusz Urbaniak 2020-03-12 10:17:17 UTC

Currently, it is possible for thanos-querier to be scheduled both on master and worker nodes as we only have master tolerations set.

We should also set a master node-selector so that thanos-querier is guaranteed to be deployed on master nodes.
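
The difference between a toleration and a node selector can be sketched with a deliberately simplified model of the scheduling decision. The function name `can_schedule` and its yes/no arguments are hypothetical and are not part of any real tool; they only illustrate the rule that a toleration permits a pod to land on a tainted node, while only a node selector restricts it to matching nodes.

```shell
# Simplified, hypothetical model of two scheduling checks (not the real scheduler).
# Usage: can_schedule NODE_TAINTED POD_TOLERATES SELECTOR_MATCHES   (each "yes" or "no")
can_schedule() {
  node_tainted=$1 pod_tolerates=$2 selector_matches=$3
  # A node selector, if set, must match the node's labels.
  [ "$selector_matches" = yes ] || return 1
  # A NoSchedule taint blocks the pod unless the pod tolerates it.
  [ "$node_tainted" = no ] || [ "$pod_tolerates" = yes ]
}

# Toleration only, no selector (an absent selector matches every node):
can_schedule yes yes yes && echo "master: schedulable"
can_schedule no  yes yes && echo "worker: schedulable"
```

Under this model, a pod with only a master toleration is schedulable on both kinds of node, which matches the behaviour described above.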

Comment 8 Junqi Zhao 2020-03-19 09:45:19 UTC

Tested on 4.5.0-0.nightly-2020-03-18-115438: thanos-querier pods are now deployed on workers, and nodeSelector and tolerations for thanos-querier can be configured via the cluster-monitoring-config configmap.

verification steps:
1. thanos-querier pods are deployed on workers now
# oc get node | grep worker
ip-10-0-134-182.ap-northeast-2.compute.internal   Ready    worker   9h    v1.17.1
ip-10-0-150-200.ap-northeast-2.compute.internal   Ready    worker   9h    v1.17.1
ip-10-0-173-240.ap-northeast-2.compute.internal   Ready    worker   9h    v1.17.1

# oc -n openshift-monitoring get pod -o wide | grep thanos-querier
thanos-querier-56f9d46b78-gzd5n                4/4     Running   0          9h      10.128.2.9     ip-10-0-173-240.ap-northeast-2.compute.internal   <none>           <none>
thanos-querier-56f9d46b78-j4rpr                4/4     Running   0          9h      10.129.2.13    ip-10-0-150-200.ap-northeast-2.compute.internal   <none>           <none>

2. label master nodes
# for i in $(oc get node | grep master | awk '{print $1}'); do echo "$i"; oc label node "$i" thanosQuerier=deploy; done
3. add nodeSelector and tolerations to the cluster-monitoring-config configmap, so that the thanos-querier pods can be scheduled on master nodes

****
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      nodeSelector:
        thanosQuerier: deploy
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
****
4. Check that the configuration is present in the thanos-querier deployment, that the pods are scheduled to master nodes despite the NoSchedule taint, and that the thanos-querier pods work well
# oc -n openshift-monitoring get deploy thanos-querier -o yaml
...
      nodeSelector:
        thanosQuerier: deploy
...
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
...

# oc get node | grep master
ip-10-0-129-123.ap-northeast-2.compute.internal   Ready    master   10h   v1.17.1
ip-10-0-150-17.ap-northeast-2.compute.internal    Ready    master   10h   v1.17.1
ip-10-0-171-6.ap-northeast-2.compute.internal     Ready    master   10h   v1.17.1

# oc -n openshift-monitoring get pod -o wide | grep thanos-querier
thanos-querier-674cbbb6c7-9hzlt                4/4     Running   0          3m36s   10.129.0.50    ip-10-0-129-123.ap-northeast-2.compute.internal   <none>           <none>
thanos-querier-674cbbb6c7-n645t                4/4     Running   0          3m4s    10.130.0.50    ip-10-0-171-6.ap-northeast-2.compute.internal     <none>           <none>
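
Steps 3 and 4 above can be sketched as two small shell helpers. Both function names are hypothetical. The first prints the configmap from step 3 for a caller-chosen node label; note that applying it would replace any other settings already stored under `config.yaml`. The second compares saved listings and assumes the column layout shown in this report (node name in column 1 of `oc get node`, pod name in column 1 and node name in column 7 of `oc get pod -o wide`).

```shell
# Hypothetical helper: print a cluster-monitoring-config manifest for thanos-querier.
# $1: node label key, $2: node label value (both chosen by the caller)
render_thanos_querier_config() {
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      nodeSelector:
        $1: $2
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
EOF
}

# Hypothetical helper: list pods that run on a node missing from the allowed list.
# $1: file with `oc get node` lines, $2: file with `oc get pod -o wide` lines
# Prints nothing and returns 0 when every pod is on an allowed node.
pods_on_unexpected_nodes() {
  awk 'FILENAME == ARGV[1] { allowed[$1] = 1; next }
       !($7 in allowed)    { print $1 " is on " $7; bad = 1 }
       END                 { exit bad }' "$1" "$2"
}

# Review the manifest first; to apply it, pipe the output to: oc apply -f -
render_thanos_querier_config thanosQuerier deploy

# Example check against saved listings:
#   oc get node | grep master > masters.txt
#   oc -n openshift-monitoring get pod -o wide | grep thanos-querier > pods.txt
#   pods_on_unexpected_nodes masters.txt pods.txt
```

Working from saved listings keeps the check independent of a live cluster, so the same helper can be reused for step 1 by passing the worker listing instead.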

Comment 10 errata-xmlrpc 2020-08-04 18:05:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409