Bug 1948711 - thanos querier and prometheus-adapter should have 2 replicas
Summary: thanos querier and prometheus-adapter should have 2 replicas
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1940933 1984103
Reported: 2021-04-12 19:22 UTC by ravig
Modified: 2023-08-07 08:41 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:59:29 UTC
Target Upstream Version:
Embargoed:
sraje: needinfo-


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1119 | 0 | None | closed | Bug 1948711: jsonnet: apply HA conventions | 2021-04-19 12:43:33 UTC
Github | openshift cluster-monitoring-operator pull 1124 | 0 | None | open | WIP: Bug 1948711: Apply HA conventions to prometheus-adapter and thanos-ruler | 2021-04-20 07:53:02 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:59:57 UTC

Description ravig 2021-04-12 19:22:10 UTC
Description of problem:

Thanos querier and prometheus-adapter should have 2 replicas, as specified in the high-availability conventions at https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability

Please ensure that maxUnavailable is 25% in the rollout strategy.



Comment 1 Junqi Zhao 2021-04-13 00:50:40 UTC
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-09-222447   True        False         61m     Cluster version is 4.8.0-0.nightly-2021-04-09-222447

# oc -n openshift-monitoring get deploy | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter            2/2     2            2           83m
thanos-querier                2/2     2            2           74m

# oc -n openshift-monitoring get deploy  prometheus-adapter -oyaml | grep maxUnavailable -C4
      app.kubernetes.io/part-of: openshift-monitoring
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C4
      app.kubernetes.io/name: thanos-query
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
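
For reference when reading the values above: Kubernetes resolves a percentage maxUnavailable to an absolute pod count by rounding down, and a percentage maxSurge by rounding up. The following is a minimal sketch of that arithmetic for a 2-replica deployment. The function names are hypothetical and are not part of oc or Kubernetes tooling.

```shell
#!/bin/sh
# Hypothetical helpers (illustration only) that mirror how a percentage in a
# Deployment's rollingUpdate strategy becomes an absolute pod count.

# maxUnavailable: percentage of the replica count, rounded down.
resolve_max_unavailable() {
    percent=$1
    replicas=$2
    echo $(( percent * replicas / 100 ))
}

# maxSurge: percentage of the replica count, rounded up.
resolve_max_surge() {
    percent=$1
    replicas=$2
    echo $(( (percent * replicas + 99) / 100 ))
}

echo "maxUnavailable: $(resolve_max_unavailable 25 2)"
echo "maxSurge: $(resolve_max_surge 25 2)"
```

Run as written, this prints maxUnavailable: 0 and maxSurge: 1. With only 2 replicas, a 25% maxUnavailable therefore allows no pod to be taken down during a rollout. That may be why the builds verified later in this bug use an absolute maxUnavailable of 1, although the bug itself does not state the reason.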

Comment 2 Damien Grisonnet 2021-04-15 08:36:23 UTC
Raising priority to high and setting the release blocker flag, since bug 1940933 depends on this effort.

Comment 5 Damien Grisonnet 2021-04-19 09:39:47 UTC
Some failures caused by this change were noticed on SNO (single-node OpenShift): https://bugzilla.redhat.com/show_bug.cgi?id=1950761.
To fix the reported issue, we had to revert the changes made for this BZ (https://github.com/openshift/cluster-monitoring-operator/pull/1122) until we figure out how to handle the SNO use case. I'm therefore moving this BZ back to `Assigned`.

Comment 7 hongyan li 2021-04-23 06:13:11 UTC
Tested with payload 4.8.0-0.nightly-2021-04-21-172405

HA conventions applied to thanos-querier and prometheus-adapter:

2 replicas
Hard pod anti-affinity on hostname
maxUnavailable in the rollout strategy set to 1 (originally requested as 25%)

# oc -n openshift-monitoring get pod | grep prometheus-adapter
prometheus-adapter-66bc95f656-qf55x            1/1     Running   0          32m
prometheus-adapter-66bc95f656-svlbp            1/1     Running   0          32m

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -A4
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -A4
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: metrics-adapter
                app.kubernetes.io/managed-by: cluster-monitoring-operator
                app.kubernetes.io/name: prometheus-adapter
                app.kubernetes.io/part-of: openshift-monitoring
            topologyKey: kubernetes.io/hostname

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-layer
                app.kubernetes.io/instance: thanos-querier
                app.kubernetes.io/name: thanos-query
            topologyKey: kubernetes.io/hostname
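
The manual grep checks in this comment could be bundled into one helper. The following is a minimal sketch, assuming the Deployment manifest is piped in as YAML. check_ha_conventions is a hypothetical name, not an oc subcommand, and it only looks for the three field values shown in the output above. It does not check the replica count.

```shell
#!/bin/sh
# check_ha_conventions: hypothetical helper (illustration only).
# Reads a Deployment manifest (YAML) on stdin and reports whether the
# rollout strategy and anti-affinity fields match the values verified here.
# Returns 0 when every expected line is present, 1 otherwise.
check_ha_conventions() {
    # Strip leading and trailing whitespace so whole lines can be compared.
    manifest=$(sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    status=0
    for expected in \
        'maxUnavailable: 1' \
        'requiredDuringSchedulingIgnoredDuringExecution:' \
        'topologyKey: kubernetes.io/hostname'
    do
        # -x matches whole lines only, -F treats the pattern as a fixed string.
        if printf '%s\n' "$manifest" | grep -qxF -- "$expected"; then
            echo "ok: $expected"
        else
            echo "MISSING: $expected"
            status=1
        fi
    done
    return $status
}
```

A possible invocation, reusing the command from this comment: oc -n openshift-monitoring get deploy thanos-querier -oyaml | check_ha_conventions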

Comment 8 hongyan li 2021-04-23 07:22:08 UTC
Correction: the payload version in comment 7 should be 4.8.0-0.nightly-2021-04-22-182303.

Comment 9 Junqi Zhao 2021-04-23 07:29:04 UTC
Tested with 4.8.0-0.nightly-2021-04-22-225832; no issue now. The prometheus-adapter and thanos-querier pods are scheduled to different nodes.
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter-58648c6759-6tldh           1/1     Running   0          17m   10.128.2.28    ip-10-0-135-62.ap-northeast-2.compute.internal    <none>           <none>
prometheus-adapter-58648c6759-hn6jn           1/1     Running   0          19m   10.131.0.35    ip-10-0-185-132.ap-northeast-2.compute.internal   <none>           <none>
thanos-querier-64d8d8ff75-gwhhv               5/5     Running   0          17m   10.128.2.30    ip-10-0-135-62.ap-northeast-2.compute.internal    <none>           <none>
thanos-querier-64d8d8ff75-nggsn               5/5     Running   0          19m   10.131.0.33    ip-10-0-185-132.ap-northeast-2.compute.internal   <none>           <none>
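
The node spread shown in the listing above can also be checked mechanically. The following is a minimal sketch. count_nodes is a hypothetical helper, and it assumes NODE is the seventh whitespace-separated column, as in this listing. Output where the RESTARTS column contains spaces would need a different column index.

```shell
#!/bin/sh
# count_nodes: hypothetical helper (illustration only).
# Reads `oc get pod -o wide` output on stdin and prints the number of
# distinct nodes hosting pods whose name starts with the given prefix.
# Assumes NODE is column 7 (NAME READY STATUS RESTARTS AGE IP NODE ...).
count_nodes() {
    prefix=$1
    awk -v p="$prefix" 'index($1, p) == 1 { print $7 }' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}
```

A possible invocation: oc -n openshift-monitoring get pod -o wide | count_nodes thanos-querier. For a 2-replica deployment with hard anti-affinity on hostname, the expected result is 2.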

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C2
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: metrics-adapter
                app.kubernetes.io/managed-by: cluster-monitoring-operator
                app.kubernetes.io/name: prometheus-adapter
                app.kubernetes.io/part-of: openshift-monitoring
            topologyKey: kubernetes.io/hostname
      containers:
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C2
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-layer
                app.kubernetes.io/instance: thanos-querier
                app.kubernetes.io/name: thanos-query
            topologyKey: kubernetes.io/hostname
      containers:
      - args:

Comment 15 errata-xmlrpc 2021-07-27 22:59:29 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

