Bug 1950761 - Monitoring operator deployments anti-affinity rules prevent their rollout on single-node
Summary: Monitoring operator deployments anti-affinity rules prevent their rollout on ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1950911 1952762
Depends On:
Blocks:
Reported: 2021-04-18 12:42 UTC by Omer Tuchfeld
Modified: 2021-09-21 17:05 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:01:42 UTC
Target Upstream Version:
Embargoed:


Attachments
thanos-querier deployment file (10.65 KB, text/plain)
2021-04-20 04:00 UTC, Junqi Zhao
prometheus-adapter deployment file (4.60 KB, text/plain)
2021-04-20 04:00 UTC, Junqi Zhao


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1122 | 0 | None | open | Bug 1950761: Revert: jsonnet: apply HA conventions | 2021-04-19 08:45:36 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:01:58 UTC

Description Omer Tuchfeld 2021-04-18 12:42:20 UTC
Description of problem:
The anti-affinity rules that the monitoring operator sets on the prometheus-adapter deployment (and, I believe, on the thanos-querier deployment as well) prevent rollout on a single-node cluster. The rollout needs a second, overlapping pod, and that pod cannot be scheduled on the only node because the first pod is already running there.


Version-Release number of selected component (if applicable):
I believe this was recently introduced by this PR: https://github.com/openshift/cluster-monitoring-operator/pull/1119

How reproducible:
Seen during CI twice:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-assisted-test-infra-master-e2e-metal-single-node-live-iso-periodic/1383571155050303488

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-single-node-live-iso/1383571212126392320

Steps to Reproduce:
1.
2.
3.

Actual results:
The monitoring operator doesn't become ready after installation because it waits for that rollout to finish. The rollout never completes because the anti-affinity rules make the scheduler refuse to place a second pod on the same node.


Expected results:
The rollout should complete on a single-node cluster and the operator should become ready


Additional info:
The current anti-affinity rule is set to requiredDuringSchedulingIgnoredDuringExecution. Maybe the less strict preferredDuringSchedulingIgnoredDuringExecution could be used instead?
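
For illustration only, a minimal sketch of the two rule types as they might appear in a pod template. The label value, the topologyKey, and the weight are assumptions chosen for the example, not values taken from the operator's manifests.

```yaml
# Strict form: the scheduler must not place a pod on a node that already
# runs a pod matching the selector. On a single node, the extra pod created
# during a rollout stays Pending.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus-adapter   # assumed label value
      topologyKey: kubernetes.io/hostname              # assumed topology key
---
# Softer form: the scheduler tries to spread matching pods across nodes but
# still schedules the pod when no other node is available.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                                      # assumed weight (valid range 1-100)
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-adapter # assumed label value
        topologyKey: kubernetes.io/hostname            # assumed topology key
```

With the softer form, multi-node clusters would still tend to spread replicas, while a single-node cluster could complete the rollout.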

Comment 2 Omer Tuchfeld 2021-04-19 07:41:05 UTC
I created this WIP PR to verify that this is truly the cause, and it seems to confirm it: https://github.com/openshift/cluster-monitoring-operator/pull/1121

Comment 4 Junqi Zhao 2021-04-20 03:59:51 UTC
Tested with 4.8.0-0.nightly-2021-04-19-121657; the rollout issue no longer occurs. Attaching the prometheus-adapter and thanos-querier deployment files.

Comment 5 Junqi Zhao 2021-04-20 04:00:17 UTC
Created attachment 1773611 [details]
thanos-querier deployment file

Comment 6 Junqi Zhao 2021-04-20 04:00:54 UTC
Created attachment 1773612 [details]
prometheus-adapter deployment file

Comment 7 hongyan li 2021-04-20 06:46:27 UTC
Behavior is not as expected: the prometheus-adapter deployment should have affinity rules.

#oc -n openshift-monitoring get deployment prometheus-adapter -oyaml|grep -A10 affinity
#oc -n openshift-monitoring get deployment thanos-querier -oyaml|grep -A10 affinity
--
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - thanos-query
              namespaces:

Comment 8 hongyan li 2021-04-20 07:16:34 UTC
(In reply to hongyan li from comment #7)
> Behavior is not as expected: the prometheus-adapter deployment should have
> affinity rules.
> 
> #oc -n openshift-monitoring get deployment prometheus-adapter -oyaml|grep
> -A10 affinity
> #oc -n openshift-monitoring get deployment thanos-querier -oyaml|grep -A10
> affinity
> --
>       affinity:
>         podAntiAffinity:
>           preferredDuringSchedulingIgnoredDuringExecution:
>           - podAffinityTerm:
>               labelSelector:
>                 matchExpressions:
>                 - key: app.kubernetes.io/name
>                   operator: In
>                   values:
>                   - thanos-query
>               namespaces:

Confirmed with Damien: this is expected behavior for now, and the fix is temporary.

Comment 9 Maciej Szulik 2021-04-20 09:55:57 UTC
*** Bug 1950911 has been marked as a duplicate of this bug. ***

Comment 10 hongyan li 2021-04-23 05:51:26 UTC
*** Bug 1952762 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2021-07-27 23:01:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

