Description of problem:
The anti-affinity rules for the prometheus-adapter deployment (and I believe the thanos-querier deployment as well) created by the monitoring operator prevent rollout on a single-node cluster because the overlapping second pod required for rollout cannot be scheduled on the single node where the first pod is already scheduled.
Version-Release number of selected component (if applicable):
I believe this was recently introduced by this PR: https://github.com/openshift/cluster-monitoring-operator/pull/1119
Seen during CI twice:
Steps to Reproduce:
Actual results:
The monitoring operator doesn't become ready after installation because it waits for that rollout to finish. The rollout never ends because the anti-affinity rules make the scheduler refuse to place a second pod on the same node.
Expected results:
The rollout should complete on a single-node cluster and the operator should become ready.
Additional info:
The current anti-affinity rule is set to requiredDuringSchedulingIgnoredDuringExecution. Maybe the less strict preferredDuringSchedulingIgnoredDuringExecution could be used instead?
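For illustration, the softer rule suggested above might look roughly like the fragment below in a deployment's pod template. This is only a sketch, not the operator's actual manifest: the weight of 100 and the kubernetes.io/hostname topology key are assumptions, and the label value is the one shown for thanos-querier further down in this report.

```yaml
# Sketch only: weight and topologyKey are assumed values, not taken from the operator.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" is a soft constraint: the scheduler may co-locate
          # replicas when no other node can satisfy the term.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100                # assumption; allowed range is 1-100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - thanos-query
              topologyKey: kubernetes.io/hostname   # assumption: spread per node
```

With a preferred term the scheduler still tries to spread replicas across nodes, but it can fall back to the same node when that is the only one available, so a rolling update is not blocked on a single-node cluster.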
It broke SNO CI.
I created this WIP PR to verify that this is truly the cause, and it seems to confirm it: https://github.com/openshift/cluster-monitoring-operator/pull/1121
Tested with 4.8.0-0.nightly-2021-04-19-121657; the rollout issue no longer occurs. Attaching the prometheus-adapter and thanos-querier deployment files.
Created attachment 1773611
thanos-querier deployment file
Created attachment 1773612
prometheus-adapter deployment file
Behavior is not as expected: the prometheus-adapter deployment should have affinity.
#oc -n openshift-monitoring get deployment prometheus-adapter -oyaml|grep -A10 affinity
#oc -n openshift-monitoring get deployment thanos-querier -oyaml|grep -A10 affinity
- key: app.kubernetes.io/name
(In reply to hongyan li from comment #7)
> Behavior is not as expected, deployment prometheus-adapter should have
> #oc -n openshift-monitoring get deployment prometheus-adapter -oyaml|grep
> -A10 affinity
> #oc -n openshift-monitoring get deployment thanos-querier -oyaml|grep -A10
> - podAffinityTerm:
> - key: app.kubernetes.io/name
> operator: In
> - thanos-query
Confirmed with Damien: this is expected behavior for now, and the fix is temporary.
*** Bug 1950911 has been marked as a duplicate of this bug. ***
*** Bug 1952762 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.