1948711 – thanos querier and prometheus-adapter should have 2 replicas

Summary: thanos querier and prometheus-adapter should have 2 replicas

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Damien Grisonnet
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1940933 1984103

Reported:	2021-04-12 19:22 UTC by ravig
Modified:	2023-08-07 08:41 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:59:29 UTC
Target Upstream Version:
Embargoed:
Flags:	sraje: needinfo-

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1119	None	closed	Bug 1948711: jsonnet: apply HA conventions	2021-04-19 12:43:33 UTC
Github	openshift cluster-monitoring-operator pull 1124	None	open	WIP: Bug 1948711: Apply HA conventions to prometheus-adapter and thanos-ruler	2021-04-20 07:53:02 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 22:59:57 UTC

Description ravig 2021-04-12 19:22:10 UTC

Description of problem:

Thanos querier and prometheus-adapter should have 2 replicas, as specified in the high-availability conventions at https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability

Please ensure that maxUnavailable is 25% in the rollout strategy.


Comment 1 Junqi Zhao 2021-04-13 00:50:40 UTC

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-09-222447   True        False         61m     Cluster version is 4.8.0-0.nightly-2021-04-09-222447

# oc -n openshift-monitoring get deploy | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter            2/2     2            2           83m
thanos-querier                2/2     2            2           74m

# oc -n openshift-monitoring get deploy  prometheus-adapter -oyaml | grep maxUnavailable -C4
      app.kubernetes.io/part-of: openshift-monitoring
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C4
      app.kubernetes.io/name: thanos-query
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null

Comment 2 Damien Grisonnet 2021-04-15 08:36:23 UTC

Raising priority to high and setting this as a release blocker, since bug 1940933 depends on this effort.

Comment 5 Damien Grisonnet 2021-04-19 09:39:47 UTC

Some failures caused by this change were noticed on SNO: https://bugzilla.redhat.com/show_bug.cgi?id=1950761.
To fix the issue reported, we had to revert the changes made for this BZ, until we figure out how to handle the SNO use-case: https://github.com/openshift/cluster-monitoring-operator/pull/1122. So, I'm moving this BZ back to `Assigned`.

Comment 7 hongyan li 2021-04-23 06:13:11 UTC

Test with payload 4.8.0-0.nightly-2021-04-21-172405

HA conventions to thanos-querier and prometheus-adapter:

2 replicas
Hard pod anti-affinity on hostname
Set the maxUnavailable rollout strategy to 1 (originally 25%)
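
For context on the 25% and 1 values above: Kubernetes resolves a percentage maxUnavailable by rounding down and a percentage maxSurge by rounding up. The sketch below works through that arithmetic for a 2-replica deployment. The function names are illustrative only and are not part of oc or the cluster-monitoring-operator.

```shell
# Resolve a percentage rollout setting to an absolute pod count.
# Kubernetes rounds maxUnavailable down and maxSurge up.
# (Helper names are hypothetical, for illustration only.)
resolve_max_unavailable() {   # usage: resolve_max_unavailable REPLICAS PERCENT
  echo $(( $1 * $2 / 100 ))          # integer division rounds down
}

resolve_max_surge() {         # usage: resolve_max_surge REPLICAS PERCENT
  echo $(( ($1 * $2 + 99) / 100 ))   # adding 99 before dividing rounds up
}

resolve_max_unavailable 2 25   # prints 0
resolve_max_surge 2 25         # prints 1
```

With 2 replicas, maxUnavailable: 25% therefore resolves to 0 unavailable pods, whereas an absolute maxUnavailable: 1 allows one pod to be taken down during a rollout.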

# oc -n openshift-monitoring get pod | grep prometheus-adapter
prometheus-adapter-66bc95f656-qf55x            1/1     Running   0          32m
prometheus-adapter-66bc95f656-svlbp            1/1     Running   0          32m

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -A4
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -A4
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: metrics-adapter
                app.kubernetes.io/managed-by: cluster-monitoring-operator
                app.kubernetes.io/name: prometheus-adapter
                app.kubernetes.io/part-of: openshift-monitoring
            topologyKey: kubernetes.io/hostname

# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-layer
                app.kubernetes.io/instance: thanos-querier
                app.kubernetes.io/name: thanos-query
            topologyKey: kubernetes.io/hostname

Comment 8 hongyan li 2021-04-23 07:22:08 UTC

Correction: the payload version in comment 7 should be 4.8.0-0.nightly-2021-04-22-182303.

Comment 9 Junqi Zhao 2021-04-23 07:29:04 UTC

Tested with 4.8.0-0.nightly-2021-04-22-225832; no issue now. The prometheus-adapter and thanos-querier pods are scheduled to different nodes.
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-adapter|thanos-querier"
prometheus-adapter-58648c6759-6tldh           1/1     Running   0          17m   10.128.2.28    ip-10-0-135-62.ap-northeast-2.compute.internal    <none>           <none>
prometheus-adapter-58648c6759-hn6jn           1/1     Running   0          19m   10.131.0.35    ip-10-0-185-132.ap-northeast-2.compute.internal   <none>           <none>
thanos-querier-64d8d8ff75-gwhhv               5/5     Running   0          17m   10.128.2.30    ip-10-0-135-62.ap-northeast-2.compute.internal    <none>           <none>
thanos-querier-64d8d8ff75-nggsn               5/5     Running   0          19m   10.131.0.33    ip-10-0-185-132.ap-northeast-2.compute.internal   <none>           <none>
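
The "different nodes" check can also be done mechanically rather than by reading the listing. A minimal sketch follows, assuming the pod list is reduced to two columns (pod name, node name). The helper name distinct_nodes is hypothetical.

```shell
# Count the distinct nodes hosting pods whose name starts with PREFIX.
# Reads "NAME NODE" pairs on stdin, one pod per line.
# (distinct_nodes is an illustrative helper, not an oc subcommand.)
distinct_nodes() {
  awk -v prefix="$1" '
    index($1, prefix) == 1 { seen[$2] = 1 }          # pod name begins with prefix
    END { n = 0; for (k in seen) n++; print n }      # number of unique nodes
  '
}

# Possible usage against a live cluster:
#   oc -n openshift-monitoring get pod --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
#     | distinct_nodes thanos-querier
```

A result equal to the replica count (2 here) means each replica landed on its own node, which is what the hard anti-affinity on kubernetes.io/hostname should enforce.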

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C2
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: metrics-adapter
                app.kubernetes.io/managed-by: cluster-monitoring-operator
                app.kubernetes.io/name: prometheus-adapter
                app.kubernetes.io/part-of: openshift-monitoring
            topologyKey: kubernetes.io/hostname
      containers:
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep maxUnavailable -C2
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
# oc -n openshift-monitoring get deploy thanos-querier -oyaml | grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: query-layer
                app.kubernetes.io/instance: thanos-querier
                app.kubernetes.io/name: thanos-query
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
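
As an alternative to grepping the YAML, the same three fields can be read directly with jsonpath. This is only a sketch: the ha_summary helper is hypothetical, and the default namespace is taken from the commands above.

```shell
# Print replicas, maxUnavailable and anti-affinity topology keys for a deployment.
# (ha_summary is an illustrative helper; it only wraps `oc get -o jsonpath`.)
ha_summary() {
  deploy="$1"
  ns="${2:-openshift-monitoring}"
  printf 'replicas=%s\n' \
    "$(oc -n "$ns" get deploy "$deploy" -o jsonpath='{.spec.replicas}')"
  printf 'maxUnavailable=%s\n' \
    "$(oc -n "$ns" get deploy "$deploy" -o jsonpath='{.spec.strategy.rollingUpdate.maxUnavailable}')"
  printf 'topologyKey=%s\n' \
    "$(oc -n "$ns" get deploy "$deploy" -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution[*].topologyKey}')"
}

# Possible usage:
#   ha_summary prometheus-adapter
#   ha_summary thanos-querier
```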

Comment 15 errata-xmlrpc 2021-07-27 22:59:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438