Description of problem:
Thanos Querier and prometheus-adapter should follow the high-availability conventions specified at https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability. Please ensure that maxUnavailable is 25% in the rollout strategy.
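For reference, the strategy stanza the conventions call for would look roughly like this (a minimal sketch using standard Kubernetes Deployment API fields; the exact rendering in the operator's manifests may differ):

  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate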
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-09-222447   True        False         61m     Cluster version is 4.8.0-0.nightly-2021-04-09-222447

# oc -n openshift-monitoring get deploy | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter   2/2     2            2           83m
thanos-querier       2/2     2            2           74m

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C4
    app.kubernetes.io/part-of: openshift-monitoring
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C4
    app.kubernetes.io/name: thanos-query
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
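For illustration only, the prometheus-adapter strategy could be brought in line by hand with a strategic-merge patch like the one below (a sketch; cluster-monitoring-operator reconciles this deployment and would revert any manual edit, so the real fix has to land in the operator's manifests):

# oc -n openshift-monitoring patch deploy prometheus-adapter \
    -p '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"}}}}'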
Raising priority to high and setting this as a release blocker, since bug 1940933 depends on this effort.
Some failures caused by this change were noticed on SNO (single-node OpenShift): https://bugzilla.redhat.com/show_bug.cgi?id=1950761. With hard pod anti-affinity on hostname and 2 replicas, the second pod can never be scheduled on a single-node cluster, so rollouts wedge there. To fix the reported issue, we had to revert the changes made for this BZ until we figure out how to handle the SNO use case: https://github.com/openshift/cluster-monitoring-operator/pull/1122. So, I'm moving this BZ back to `Assigned`.
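For context, one way this kind of conflict is sometimes avoided on single-node topologies (shown purely as an illustration, not necessarily the approach the linked PR or a future fix takes) is to downgrade the hard rule to soft anti-affinity, which the scheduler treats as a preference instead of a requirement:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: prometheus-adapter
              topologyKey: kubernetes.io/hostname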
Tested with payload 4.8.0-0.nightly-2021-04-21-172405.

HA conventions applied to thanos-querier and prometheus-adapter:
- 2 replicas
- Hard pod anti-affinity on hostname
- maxSurge rollout strategy set to 25%
- maxUnavailable rollout strategy set to 1

# oc -n openshift-monitoring get pod | grep prometheus-adapter
prometheus-adapter-66bc95f656-qf55x   1/1   Running   0   32m
prometheus-adapter-66bc95f656-svlbp   1/1   Running   0   32m

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -A4
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -A4
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: metrics-adapter
                app.kubernetes.io/managed-by: cluster-monitoring-operator
                app.kubernetes.io/name: prometheus-adapter
                app.kubernetes.io/part-of: openshift-monitoring
            topologyKey: kubernetes.io/hostname

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-layer
                app.kubernetes.io/instance: thanos-querier
                app.kubernetes.io/name: thanos-query
            topologyKey: kubernetes.io/hostname
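The same fields can also be checked directly with jsonpath instead of grep context windows (standard oc/kubectl output options, shown only as a convenience):

# oc -n openshift-monitoring get deploy prometheus-adapter -o jsonpath='{.spec.strategy.rollingUpdate}{"\n"}'
# oc -n openshift-monitoring get deploy thanos-querier -o jsonpath='{.spec.strategy.rollingUpdate}{"\n"}'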
Correcting the payload version in comment 7: it should be 4.8.0-0.nightly-2021-04-22-182303.
Tested with 4.8.0-0.nightly-2021-04-22-225832, no issue now; the prometheus-adapter and thanos-querier pods are scheduled to different nodes.

# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter-58648c6759-6tldh   1/1   Running   0   17m   10.128.2.28   ip-10-0-135-62.ap-northeast-2.compute.internal    <none>   <none>
prometheus-adapter-58648c6759-hn6jn   1/1   Running   0   19m   10.131.0.35   ip-10-0-185-132.ap-northeast-2.compute.internal   <none>   <none>
thanos-querier-64d8d8ff75-gwhhv       5/5   Running   0   17m   10.128.2.30   ip-10-0-135-62.ap-northeast-2.compute.internal    <none>   <none>
thanos-querier-64d8d8ff75-nggsn       5/5   Running   0   19m   10.131.0.33   ip-10-0-185-132.ap-northeast-2.compute.internal   <none>   <none>

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C2
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: metrics-adapter
                app.kubernetes.io/managed-by: cluster-monitoring-operator
                app.kubernetes.io/name: prometheus-adapter
                app.kubernetes.io/part-of: openshift-monitoring
            topologyKey: kubernetes.io/hostname
      containers:

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C2
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-layer
                app.kubernetes.io/instance: thanos-querier
                app.kubernetes.io/name: thanos-query
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
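To exercise the rollout behavior itself rather than just the spec, a restart can be triggered and watched (standard oc rollout subcommands; with maxUnavailable: 1 only one old pod should be taken down at a time):

# oc -n openshift-monitoring rollout restart deploy/prometheus-adapter
# oc -n openshift-monitoring rollout status deploy/prometheus-adapter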
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438