Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1950761

Summary:

Monitoring operator deployments anti-affinity rules prevent their rollout on single-node

Product:

OpenShift Container Platform

Reporter:

Omer Tuchfeld <otuchfel>

Component:

Monitoring

Assignee:

Damien Grisonnet <dgrisonn>

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

urgent

Priority:

urgent

Version:

4.8

CC:

alegrand, anpicker, david.karlsen, dgrisonn, erooth, hongyli, juzhao, kakkoyun, lcosic, minmli, pkrupa, rfreiman, wking

Target Release:

4.8.0

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

No Doc Update

Last Closed:

2021-07-27 23:01:42 UTC

Type:

Bug

Attachments:

Description	Flags
thanos-querier deployment file	none
prometheus-adapter deployment file	none

Description Omer Tuchfeld 2021-04-18 12:42:20 UTC

Description of problem:
The anti-affinity rules that the monitoring operator creates for the prometheus-adapter deployment (and, I believe, the thanos-querier deployment as well) prevent rollout on a single-node cluster. The rollout needs a second, overlapping pod, and that pod cannot be scheduled on the only node because the first pod is already running there.


Version-Release number of selected component (if applicable):
I believe this was recently introduced by this PR: https://github.com/openshift/cluster-monitoring-operator/pull/1119

How reproducible:
Seen during CI twice:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-assisted-test-infra-master-e2e-metal-single-node-live-iso-periodic/1383571155050303488

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-single-node-live-iso/1383571212126392320

Actual results:
The monitoring operator doesn't become ready after installation because it waits for that rollout to finish. The rollout never ends because the anti-affinity rules make the scheduler refuse to place a second pod on the same node.


Expected results:
The rollout should complete on a single-node cluster and the operator should become ready.


Additional info:
The current anti-affinity rule is set to requiredDuringSchedulingIgnoredDuringExecution. Maybe the less strict preferredDuringSchedulingIgnoredDuringExecution could be used instead?
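
For illustration, the difference between the two rule types might look like the fragments below. This is only a sketch of the standard Kubernetes podAntiAffinity fields, not a copy of the manifests the operator generates: the label value, the topology key, and the weight are assumptions chosen for the example.

```yaml
# Hard rule (the kind described in this report): a pod may only be
# scheduled on a node that runs no other pod matching the selector.
# With a single node, a second matching pod stays Pending.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus-adapter  # assumed label value
      topologyKey: kubernetes.io/hostname             # assumed topology key
---
# Soft rule (the suggested alternative): the scheduler prefers to spread
# matching pods across nodes, but still places a pod on the same node
# when no other node is available.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                                     # assumed weight (allowed range is 1-100)
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-adapter  # assumed label value
        topologyKey: kubernetes.io/hostname             # assumed topology key
```

With the soft rule, multi-node clusters would still tend to spread the replicas across nodes, while a single-node cluster would no longer block the rollout.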

Comment 1 Rom Freiman 2021-04-18 13:01:37 UTC

This broke the single-node OpenShift (SNO) CI:
https://testgrid.k8s.io/redhat-single-node#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-single-node-live-iso

Comment 2 Omer Tuchfeld 2021-04-19 07:41:05 UTC

I created this WIP PR to verify that this is truly the cause, and it seems to confirm it: https://github.com/openshift/cluster-monitoring-operator/pull/1121

Comment 4 Junqi Zhao 2021-04-20 03:59:51 UTC

Tested with 4.8.0-0.nightly-2021-04-19-121657; the rollout issue no longer occurs. Attaching the prometheus-adapter and thanos-querier deployment files.

Comment 5 Junqi Zhao 2021-04-20 04:00:17 UTC

Created attachment 1773611
thanos-querier deployment file

Comment 6 Junqi Zhao 2021-04-20 04:00:54 UTC

Created attachment 1773612
prometheus-adapter deployment file

Comment 7 hongyan li 2021-04-20 06:46:27 UTC

Behavior is not as expected: the prometheus-adapter deployment should have an affinity rule.

# oc -n openshift-monitoring get deployment prometheus-adapter -o yaml | grep -A10 affinity
# oc -n openshift-monitoring get deployment thanos-querier -o yaml | grep -A10 affinity
--
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - thanos-query
              namespaces:

Comment 8 hongyan li 2021-04-20 07:16:34 UTC

(In reply to hongyan li from comment #7)

Confirmed with Damien: this is expected behavior for now, and the fix is temporary.

Comment 9 Maciej Szulik 2021-04-20 09:55:57 UTC

*** Bug 1950911 has been marked as a duplicate of this bug. ***

Comment 10 hongyan li 2021-04-23 05:51:26 UTC

*** Bug 1952762 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2021-07-27 23:01:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container
Platform 4.8.2 bug fix and security update), and where to find
the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438