Bug 1948711

Summary: thanos querier and prometheus-adapter should have 2 replicas
Product: OpenShift Container Platform Reporter: ravig <rgudimet>
Component: Monitoring Assignee: Damien Grisonnet <dgrisonn>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: high Docs Contact:
Priority: high    
Version: 4.8 CC: alegrand, anpicker, dgrisonn, erooth, hongyli, kakkoyun, lcosic, mbukatov, pkrupa, spasquie, sraje, wking
Target Milestone: --- Flags: sraje: needinfo-
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:59:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1940933, 1984103    

Description ravig 2021-04-12 19:22:10 UTC
Description of problem:

Thanos querier and prometheus-adapter should follow the high-availability conventions specified at https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability, i.e. run 2 replicas with hard pod anti-affinity on hostname.

Please ensure that maxUnavailable is 25% in the rollout strategy.
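For reference, a quick way to check the relevant fields on both deployments (a sketch assuming the default deployment names in the openshift-monitoring namespace):

# oc -n openshift-monitoring get deploy prometheus-adapter -o jsonpath='{.spec.replicas}{" "}{.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}'
# oc -n openshift-monitoring get deploy thanos-querier -o jsonpath='{.spec.replicas}{" "}{.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}'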


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Junqi Zhao 2021-04-13 00:50:40 UTC
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-09-222447   True        False         61m     Cluster version is 4.8.0-0.nightly-2021-04-09-222447

# oc -n openshift-monitoring get deploy | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter            2/2     2            2           83m
thanos-querier                2/2     2            2           74m

# oc -n openshift-monitoring get deploy  prometheus-adapter -oyaml | grep maxUnavailable -C4
      app.kubernetes.io/part-of: openshift-monitoring
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C4
      app.kubernetes.io/name: thanos-query
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null

Comment 2 Damien Grisonnet 2021-04-15 08:36:23 UTC
Raising priority to high and setting this as a release blocker since bug 1940933 depends on this effort.

Comment 5 Damien Grisonnet 2021-04-19 09:39:47 UTC
Some failures caused by this change were noticed on SNO: https://bugzilla.redhat.com/show_bug.cgi?id=1950761.
To fix the reported issue, we had to revert the changes made for this BZ until we figure out how to handle the SNO use case: https://github.com/openshift/cluster-monitoring-operator/pull/1122. So I'm moving this BZ back to ASSIGNED.
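For context, with hard hostname anti-affinity and 2 replicas, a single-node (SNO) cluster has no second node to schedule the second pod onto, so it stays Pending; that is the kind of failure that forced the revert. A quick check (commands added here for illustration, not part of the original report):

# oc get nodes
# oc -n openshift-monitoring get pod --field-selector=status.phase=Pending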

Comment 7 hongyan li 2021-04-23 06:13:11 UTC
Tested with payload 4.8.0-0.nightly-2021-04-21-172405

HA conventions applied to thanos-querier and prometheus-adapter:

2 replicas
Hard pod anti-affinity on hostname
maxUnavailable rollout strategy set to 1 (instead of the 25% originally requested)

# oc -n openshift-monitoring get pod | grep prometheus-adapter
prometheus-adapter-66bc95f656-qf55x            1/1     Running   0          32m
prometheus-adapter-66bc95f656-svlbp            1/1     Running   0          32m

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -A4
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -A4
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: metrics-adapter
                app.kubernetes.io/managed-by: cluster-monitoring-operator
                app.kubernetes.io/name: prometheus-adapter
                app.kubernetes.io/part-of: openshift-monitoring
            topologyKey: kubernetes.io/hostname

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-layer
                app.kubernetes.io/instance: thanos-querier
                app.kubernetes.io/name: thanos-query
            topologyKey: kubernetes.io/hostname
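
The hard anti-affinity above can also be confirmed by checking that the two replicas of each deployment land on different nodes, for example (a sketch using the labels shown in the anti-affinity selectors above):

# oc -n openshift-monitoring get pod -l app.kubernetes.io/name=prometheus-adapter -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}'
# oc -n openshift-monitoring get pod -l app.kubernetes.io/name=thanos-query -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}'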

Comment 8 hongyan li 2021-04-23 07:22:08 UTC
Correction: the payload version in comment 7 should be 4.8.0-0.nightly-2021-04-22-182303

Comment 9 Junqi Zhao 2021-04-23 07:29:04 UTC
Tested with 4.8.0-0.nightly-2021-04-22-225832; no issue now, the prometheus-adapter and thanos-querier pods are scheduled to different nodes
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter-58648c6759-6tldh           1/1     Running   0          17m   10.128.2.28    ip-10-0-135-62.ap-northeast-2.compute.internal    <none>           <none>
prometheus-adapter-58648c6759-hn6jn           1/1     Running   0          19m   10.131.0.35    ip-10-0-185-132.ap-northeast-2.compute.internal   <none>           <none>
thanos-querier-64d8d8ff75-gwhhv               5/5     Running   0          17m   10.128.2.30    ip-10-0-135-62.ap-northeast-2.compute.internal    <none>           <none>
thanos-querier-64d8d8ff75-nggsn               5/5     Running   0          19m   10.131.0.33    ip-10-0-185-132.ap-northeast-2.compute.internal   <none>           <none>

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C2
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: metrics-adapter
                app.kubernetes.io/managed-by: cluster-monitoring-operator
                app.kubernetes.io/name: prometheus-adapter
                app.kubernetes.io/part-of: openshift-monitoring
            topologyKey: kubernetes.io/hostname
      containers:
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C2
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-layer
                app.kubernetes.io/instance: thanos-querier
                app.kubernetes.io/name: thanos-query
            topologyKey: kubernetes.io/hostname
      containers:
      - args:

Comment 15 errata-xmlrpc 2021-07-27 22:59:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438