Bug 1812834
Summary: | thanos-querier is having master tolerations set but no master node-selector set | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Sergiusz Urbaniak <surbania> |
Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | low | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.3.0 | CC: | alegrand, anpicker, cvogel, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania |
Target Milestone: | --- | ||
Target Release: | 4.5.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
Only master tolerations were set on thanos-querier, without a matching master node selector, so it could be scheduled on both master and worker nodes.
Guaranteeing deployment on master nodes would also have required setting a master node selector.
Consequence:
Thanos Querier could be scheduled on both master and worker nodes.
Fix:
Thanos Querier belongs on worker nodes, so the master toleration has been removed.
Additionally, as with every payload we deploy on worker nodes, its scheduling is now configurable via a new `thanosQuerier` section in the cluster-monitoring-operator configmap.
Result:
Thanos Querier is deployed on worker nodes only and can be configured with a node selector, tolerations, and resources, as is done for the in-cluster Prometheus.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2020-08-04 18:05:00 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Sergiusz Urbaniak
2020-03-12 10:17:17 UTC
Tested on 4.5.0-0.nightly-2020-03-18-115438: thanos-querier pods are now deployed on workers, and nodeSelector and tolerations for thanos-querier can be configured via the cluster-monitoring-config configmap.

Verification steps:

1. thanos-querier pods are now deployed on workers.

# oc get node | grep worker
ip-10-0-134-182.ap-northeast-2.compute.internal   Ready   worker   9h   v1.17.1
ip-10-0-150-200.ap-northeast-2.compute.internal   Ready   worker   9h   v1.17.1
ip-10-0-173-240.ap-northeast-2.compute.internal   Ready   worker   9h   v1.17.1

# oc -n openshift-monitoring get pod -o wide | grep thanos-querier
thanos-querier-56f9d46b78-gzd5n   4/4   Running   0   9h   10.128.2.9    ip-10-0-173-240.ap-northeast-2.compute.internal   <none>   <none>
thanos-querier-56f9d46b78-j4rpr   4/4   Running   0   9h   10.129.2.13   ip-10-0-150-200.ap-northeast-2.compute.internal   <none>   <none>

2. Label the master nodes.

# for i in $(oc get node | grep master | awk '{print $1}'); do echo $i; oc label node $i thanosQuerier=deploy;done

3. Add nodeSelector and tolerations to the cluster-monitoring-config configmap so that the thanos-querier pods can be deployed on master nodes.

****
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      nodeSelector:
        thanosQuerier: deploy
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
****

4. Check that the configuration is present in the thanos-querier deployment, that the pods are scheduled to master nodes despite the NoSchedule taint, and that the thanos-querier pods work well.

# oc -n openshift-monitoring get deploy thanos-querier -oyaml
...
      nodeSelector:
        thanosQuerier: deploy
...
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
..

# oc get node | grep master
ip-10-0-129-123.ap-northeast-2.compute.internal   Ready   master   10h   v1.17.1
ip-10-0-150-17.ap-northeast-2.compute.internal    Ready   master   10h   v1.17.1
ip-10-0-171-6.ap-northeast-2.compute.internal     Ready   master   10h   v1.17.1

# oc -n openshift-monitoring get pod -o wide | grep thanos-querier
thanos-querier-674cbbb6c7-9hzlt   4/4   Running   0   3m36s   10.129.0.50   ip-10-0-129-123.ap-northeast-2.compute.internal   <none>   <none>
thanos-querier-674cbbb6c7-n645t   4/4   Running   0   3m4s    10.130.0.50   ip-10-0-171-6.ap-northeast-2.compute.internal     <none>   <none>

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
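
The verification above compares the node names in the pod listing against the node listing by eye. A minimal sketch of automating that comparison follows. The function name `pod_node_roles` is hypothetical and not part of this bug report, and it assumes the default column layouts seen in the output above (ROLES as the third column of `oc get node`, NODE as the seventh column of `oc get pod -o wide`).

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): print every pod whose name
# starts with a given prefix, followed by the role of the node it runs on.
#   $1: text output of `oc get node`
#   $2: text output of `oc -n openshift-monitoring get pod -o wide`
#   $3: pod name prefix, e.g. thanos-querier
# Assumption: column positions match the listings in this report; a RESTARTS
# value containing spaces would shift the NODE column.
pod_node_roles() {
  nodes_file=$(mktemp)
  printf '%s\n' "$1" > "$nodes_file"
  printf '%s\n' "$2" | awk -v prefix="$3" '
    # First input (node listing): remember the ROLES column per node name.
    NR == FNR { role[$1] = $3; next }
    # Second input (pod listing): look up the role of the NODE column.
    index($1, prefix) == 1 {
      print $1, (($7 in role) ? role[$7] : "unknown")
    }
  ' "$nodes_file" -
  rm -f "$nodes_file"
}

# Example against a live cluster (requires a logged-in oc session):
#   pod_node_roles "$(oc get node)" \
#     "$(oc -n openshift-monitoring get pod -o wide)" thanos-querier
```

With the default configuration every line should end in `worker`; after applying the configmap from step 3, every line should end in `master`.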