1812834 – thanos-querier is having master tolerations set but no master node-selector set

Summary: thanos-querier is having master tolerations set but no master node-selector set

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Sergiusz Urbaniak
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-03-12 10:17 UTC by Sergiusz Urbaniak
Modified:	2020-08-04 18:05 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: thanos-querier had only master tolerations set, with no master node selector. Consequence: Thanos Querier could be scheduled on both master and worker nodes. Fix: Thanos Querier belongs on worker nodes, so the master toleration has been removed. Additionally, as with every payload deployed on worker nodes, scheduling is now configurable via a new `thanosQuerier` section in the cluster-monitoring-operator configmap (`cluster-monitoring-config`). Result: Thanos Querier is deployed on worker nodes only and can be configured with a node selector, tolerations, and resources, as is done for the in-cluster Prometheus.
Clone Of:
Environment:
Last Closed:	2020-08-04 18:05:00 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 709	0	None	closed	Bug 1812834: schedule Thanos Querier on worker nodes, make resources configurable	2021-01-29 09:42:20 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-08-04 18:05:02 UTC

Description Sergiusz Urbaniak 2020-03-12 10:17:17 UTC

Currently, it is possible for thanos-querier to be scheduled both on master and worker nodes as we only have master tolerations set.

We should also set a master node selector so that thanos-querier is guaranteed to be deployed on master nodes.
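
The difference between a toleration and a node selector can be made concrete with a small shell sketch. It is a deliberately simplified model, not the scheduler's implementation: the function name `eligible` and its arguments are hypothetical, and it assumes that master nodes carry a NoSchedule taint and worker nodes do not.

```shell
# Simplified model of node filtering, for illustration only.
# eligible ROLE TOLERATES SELECTOR
#   ROLE       role of the candidate node: master or worker
#   TOLERATES  "yes" if the pod tolerates the master NoSchedule taint
#   SELECTOR   role required by the pod's nodeSelector, or "any"
eligible() {
  role=$1 tolerates=$2 selector=$3
  # A NoSchedule taint repels pods that do not tolerate it.
  if [ "$role" = master ] && [ "$tolerates" != yes ]; then
    echo no; return
  fi
  # A nodeSelector restricts the pod to matching nodes.
  if [ "$selector" != any ] && [ "$selector" != "$role" ]; then
    echo no; return
  fi
  echo yes
}

# Toleration only (the reported state): both roles are eligible.
eligible master yes any      # prints "yes"
eligible worker yes any      # prints "yes"
# Toleration plus a master selector: workers are excluded.
eligible worker yes master   # prints "no"
# No toleration and no selector: masters are excluded.
eligible master no any       # prints "no"
```

The point of the model is that a toleration only permits scheduling on tainted nodes; it never requires it. Pinning a pod to a set of nodes needs a node selector (or affinity) as well.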

Comment 8 Junqi Zhao 2020-03-19 09:45:19 UTC

Tested on 4.5.0-0.nightly-2020-03-18-115438. thanos-querier pods are now deployed on workers, and nodeSelector and tolerations for thanos-querier can be configured via the cluster-monitoring-config configmap.

Verification steps:
1. thanos-querier pods are deployed on workers now
# oc get node | grep worker
ip-10-0-134-182.ap-northeast-2.compute.internal   Ready    worker   9h    v1.17.1
ip-10-0-150-200.ap-northeast-2.compute.internal   Ready    worker   9h    v1.17.1
ip-10-0-173-240.ap-northeast-2.compute.internal   Ready    worker   9h    v1.17.1

# oc -n openshift-monitoring get pod -o wide | grep thanos-querier
thanos-querier-56f9d46b78-gzd5n                4/4     Running   0          9h      10.128.2.9     ip-10-0-173-240.ap-northeast-2.compute.internal   <none>           <none>
thanos-querier-56f9d46b78-j4rpr                4/4     Running   0          9h      10.129.2.13    ip-10-0-150-200.ap-northeast-2.compute.internal   <none>           <none>

2. Label the master nodes
# for i in $(oc get node | grep master | awk '{print $1}'); do echo "$i"; oc label node "$i" thanosQuerier=deploy; done
3. Add nodeSelector and tolerations to the cluster-monitoring-config configmap so that the pods can be scheduled on master nodes

****
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      nodeSelector:
        thanosQuerier: deploy
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
****
4. Check that the configuration appears in the thanos-querier deployment, that the pods are scheduled on master nodes despite the NoSchedule taint, and that the thanos-querier pods work well
# oc -n openshift-monitoring get deploy thanos-querier -oyaml
...
      nodeSelector:
        thanosQuerier: deploy
...
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
...

# oc get node | grep  master
ip-10-0-129-123.ap-northeast-2.compute.internal   Ready    master   10h   v1.17.1
ip-10-0-150-17.ap-northeast-2.compute.internal    Ready    master   10h   v1.17.1
ip-10-0-171-6.ap-northeast-2.compute.internal     Ready    master   10h   v1.17.1

# oc -n openshift-monitoring get pod -o wide | grep thanos-querier
thanos-querier-674cbbb6c7-9hzlt                4/4     Running   0          3m36s   10.129.0.50    ip-10-0-129-123.ap-northeast-2.compute.internal   <none>           <none>
thanos-querier-674cbbb6c7-n645t                4/4     Running   0          3m4s    10.130.0.50    ip-10-0-171-6.ap-northeast-2.compute.internal     <none>           <none>
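
The manual comparison in steps 1 and 4 (matching the NODE column of the pod listing against the node listing) could be scripted roughly as follows. This is a sketch only: the helper names `nodes_with_role`, `pod_nodes`, and `check_placement` are hypothetical, and the column positions assume the `oc get node` and `oc get pod -o wide` layouts shown above.

```shell
# Print the names of nodes whose ROLES column (field 3) contains ROLE.
# Reads `oc get node` output on stdin.
nodes_with_role() {
  awk -v role="$1" '$3 ~ role { print $1 }'
}

# Print the NODE column (field 7) of every pod whose name starts with PREFIX.
# Reads `oc get pod -o wide` output on stdin.
pod_nodes() {
  awk -v prefix="$1" 'index($1, prefix) == 1 { print $7 }'
}

# check_placement ROLE NODES_FILE PODS_FILE
# Succeeds only if every thanos-querier pod runs on a node with ROLE.
check_placement() {
  role=$1 nodes_file=$2 pods_file=$3
  allowed=$(nodes_with_role "$role" < "$nodes_file")
  result=0 found=0
  for node in $(pod_nodes thanos-querier < "$pods_file"); do
    found=1
    if printf '%s\n' "$allowed" | grep -qxF "$node"; then
      echo "ok: $node is a $role node"
    else
      echo "unexpected: $node is not a $role node"
      result=1
    fi
  done
  # Treat an empty pod listing as a failure rather than a vacuous pass.
  [ "$found" -eq 1 ] || { echo "no thanos-querier pods found"; return 1; }
  return "$result"
}

# Possible usage against a live cluster (not executed here):
#   oc get node > nodes.txt
#   oc -n openshift-monitoring get pod -o wide > pods.txt
#   check_placement worker nodes.txt pods.txt   # default configuration
#   check_placement master nodes.txt pods.txt   # after steps 2 and 3
```

Working from saved listings rather than calling `oc` inside the helpers keeps the check repeatable against the exact output that was captured for the bug.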

Comment 10 errata-xmlrpc 2020-08-04 18:05:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
