Bug 1948711
| Summary: | thanos querier and prometheus-adapter should have 2 replicas | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | ravig <rgudimet> |
| Component: | Monitoring | Assignee: | Damien Grisonnet <dgrisonn> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.8 | CC: | alegrand, anpicker, dgrisonn, erooth, hongyli, kakkoyun, lcosic, mbukatov, pkrupa, spasquie, sraje, wking |
| Target Milestone: | --- | Flags: | sraje: needinfo- |
| Target Release: | 4.8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-07-27 22:59:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1940933, 1984103 | ||
Description
ravig
2021-04-12 19:22:10 UTC
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-04-09-222447 True False 61m Cluster version is 4.8.0-0.nightly-2021-04-09-222447
# oc -n openshift-monitoring get deploy | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter 2/2 2 2 83m
thanos-querier 2/2 2 2 74m
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C4
app.kubernetes.io/part-of: openshift-monitoring
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
template:
metadata:
creationTimestamp: null
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C4
app.kubernetes.io/name: thanos-query
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
creationTimestamp: null
Raising priority to high and setting this as a release blocker, since bug 1940933 depends on this effort.
Some failures caused by this change were noticed on SNO (single-node OpenShift): https://bugzilla.redhat.com/show_bug.cgi?id=1950761. To fix the reported issue, we had to revert the changes made for this BZ until we figure out how to handle the SNO use case: https://github.com/openshift/cluster-monitoring-operator/pull/1122. So, I'm moving this BZ back to `Assigned`.
Tested with payload 4.8.0-0.nightly-2021-04-21-172405.
HA conventions applied to thanos-querier and prometheus-adapter:
- 2 replicas
- Hard pod anti-affinity on hostname
- maxUnavailable rollout strategy set to 1 (replacing the earlier 25%)
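The three conventions listed above can be checked mechanically. The following is a minimal bash sketch, not something taken from this report: `check_ha_conventions` is a hypothetical helper name, and it compares literal values (2, 1, `kubernetes.io/hostname`) that match the verification output in this comment. As background, Kubernetes rounds a percentage `maxUnavailable` down, so 25% of 2 replicas would evaluate to 0, which is presumably why an absolute value is used here.

```shell
# Hypothetical helper (an assumption, not from the bug report).
# Arguments: <replicas> <maxUnavailable> <anti-affinity topologyKey>
# Prints one line per violated convention; returns non-zero on any violation.
check_ha_conventions() {
  local replicas="$1" max_unavailable="$2" topology_key="$3"
  local rc=0
  if [ "$replicas" != "2" ]; then
    echo "replicas is '$replicas', expected 2"
    rc=1
  fi
  if [ "$max_unavailable" != "1" ]; then
    echo "maxUnavailable is '$max_unavailable', expected 1"
    rc=1
  fi
  if [ "$topology_key" != "kubernetes.io/hostname" ]; then
    echo "anti-affinity topologyKey is '$topology_key', expected kubernetes.io/hostname"
    rc=1
  fi
  return "$rc"
}

# Example use against a live cluster (requires oc and cluster access):
#   ns=openshift-monitoring
#   for d in prometheus-adapter thanos-querier; do
#     check_ha_conventions \
#       "$(oc -n "$ns" get deploy "$d" -o jsonpath='{.spec.replicas}')" \
#       "$(oc -n "$ns" get deploy "$d" -o jsonpath='{.spec.strategy.rollingUpdate.maxUnavailable}')" \
#       "$(oc -n "$ns" get deploy "$d" -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].topologyKey}')"
#   done
```

Keeping the comparison separate from the `oc` calls means the check itself can be exercised without a cluster.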
# oc -n openshift-monitoring get pod | grep prometheus-adapter
prometheus-adapter-66bc95f656-qf55x 1/1 Running 0 32m
prometheus-adapter-66bc95f656-svlbp 1/1 Running 0 32m
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -A4
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 1
type: RollingUpdate
template:
metadata:
annotations:
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -A4
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 1
type: RollingUpdate
template:
metadata:
annotations:
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: metrics-adapter
app.kubernetes.io/managed-by: cluster-monitoring-operator
app.kubernetes.io/name: prometheus-adapter
app.kubernetes.io/part-of: openshift-monitoring
topologyKey: kubernetes.io/hostname
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-querier
app.kubernetes.io/name: thanos-query
topologyKey: kubernetes.io/hostname
Correction: the payload version in comment 7 should be 4.8.0-0.nightly-2021-04-22-182303.
Tested with 4.8.0-0.nightly-2021-04-22-225832; no issue now. The prometheus-adapter and thanos-querier pods are scheduled to different nodes.
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter-58648c6759-6tldh 1/1 Running 0 17m 10.128.2.28 ip-10-0-135-62.ap-northeast-2.compute.internal <none> <none>
prometheus-adapter-58648c6759-hn6jn 1/1 Running 0 19m 10.131.0.35 ip-10-0-185-132.ap-northeast-2.compute.internal <none> <none>
thanos-querier-64d8d8ff75-gwhhv 5/5 Running 0 17m 10.128.2.30 ip-10-0-135-62.ap-northeast-2.compute.internal <none> <none>
thanos-querier-64d8d8ff75-nggsn 5/5 Running 0 19m 10.131.0.33 ip-10-0-185-132.ap-northeast-2.compute.internal <none> <none>
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C2
rollingUpdate:
maxSurge: 25%
maxUnavailable: 1
type: RollingUpdate
template:
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: metrics-adapter
app.kubernetes.io/managed-by: cluster-monitoring-operator
app.kubernetes.io/name: prometheus-adapter
app.kubernetes.io/part-of: openshift-monitoring
topologyKey: kubernetes.io/hostname
containers:
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C2
rollingUpdate:
maxSurge: 25%
maxUnavailable: 1
type: RollingUpdate
template:
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-querier
app.kubernetes.io/name: thanos-query
topologyKey: kubernetes.io/hostname
containers:
- args:
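The "scheduled to different nodes" check in the `oc get pod -o wide` output above was done by eye. It could be scripted along the following lines. This is a sketch under stated assumptions: `pods_on_distinct_nodes` is a hypothetical name, the input is assumed to have the pod name in column 1 (in the form `<deployment>-<replicaset-hash>-<suffix>`) and the node name in column 7, as in the output shown in this comment.

```shell
# Hypothetical helper (an assumption, not from the bug report).
# Reads `oc get pod -o wide` rows on stdin (header already filtered out) and
# reports every deployment that has more than one pod on the same node.
# Returns non-zero if any such collision is found.
pods_on_distinct_nodes() {
  awk '
    BEGIN { bad = 0 }
    {
      deploy = $1
      sub(/-[^-]+-[^-]+$/, "", deploy)   # drop ReplicaSet hash and pod suffix
      key = deploy SUBSEP $7             # column 7 is the NODE column
      if (key in seen) {
        printf "%s has more than one pod on %s\n", deploy, $7
        bad = 1
      }
      seen[key] = 1
    }
    END { exit bad }
  '
}

# Example use against a live cluster (requires oc and cluster access):
#   oc -n openshift-monitoring get pod -o wide --no-headers \
#     | grep -E "prometheus-adapter|thanos-querier" \
#     | pods_on_distinct_nodes
```

With hard anti-affinity on `kubernetes.io/hostname` in place, this check is expected to pass; a failure would suggest the affinity rule is missing or was not applied.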
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:2438