Description of problem:
The anti-affinity rules for the prometheus-adapter deployment (and I believe the thanos-querier deployment as well) created by the monitoring operator prevent rollout on a single-node cluster because the overlapping second pod required for rollout cannot be scheduled on the single node where the first pod is already scheduled.

Version-Release number of selected component (if applicable):
I believe this was recently introduced by this PR: https://github.com/openshift/cluster-monitoring-operator/pull/1119

How reproducible:
Seen during CI twice:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-assisted-test-infra-master-e2e-metal-single-node-live-iso-periodic/1383571155050303488
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-single-node-live-iso/1383571212126392320

Steps to Reproduce:
1.
2.
3.

Actual results:
The monitoring operator doesn't become ready after installation due to the issue described above (waiting for that rollout to finish). The rollout never ends because the scheduler refuses to schedule a second pod on the same node due to the anti-affinity rules.

Expected results:
The rollout should complete on a single-node cluster and the operator should become ready.

Additional info:
The current anti-affinity rule is set to requiredDuringSchedulingIgnoredDuringExecution; maybe a less strict preferredDuringSchedulingIgnoredDuringExecution can be used instead?
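For illustration, the difference between the two rule types mentioned under Additional info can be sketched roughly as follows. This is a hand-written fragment of a pod spec, not the manifest generated by the monitoring operator: the label value, the topology key, and the weight are assumptions chosen for the example.

```yaml
# Hard rule: the scheduler refuses to place a pod on a node that already
# runs a pod matching the selector. On a single-node cluster the extra
# pod needed during a rollout therefore stays Pending.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus-adapter   # assumed label
      topologyKey: kubernetes.io/hostname              # assumed topology key
---
# Soft rule: the scheduler prefers to spread matching pods across nodes,
# but still schedules the pod when no other node is available.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                                      # assumed weight (valid range 1-100)
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-adapter # assumed label
        topologyKey: kubernetes.io/hostname            # assumed topology key
```

With the soft form, spreading is still attempted on multi-node clusters, while a single-node cluster can schedule the overlapping pod and finish the rollout.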
It broke SNO CI. https://testgrid.k8s.io/redhat-single-node#periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-single-node-live-iso
I created this WIP PR to verify that this is truly the cause, and it seems to confirm it: https://github.com/openshift/cluster-monitoring-operator/pull/1121
Tested with 4.8.0-0.nightly-2021-04-19-121657; the rollout issue no longer occurs. Attaching the prometheus-adapter and thanos-querier deployment files.
Created attachment 1773611 [details] thanos-querier deployment file
Created attachment 1773612 [details] prometheus-adapter deployment file
Behavior is not as expected, deployment prometheus-adapter should have affinity

# oc -n openshift-monitoring get deployment prometheus-adapter -oyaml | grep -A10 affinity
# oc -n openshift-monitoring get deployment thanos-querier -oyaml | grep -A10 affinity
--
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - thanos-query
        namespaces:
(In reply to hongyan li from comment #7)
> Behavior is not as expected, deployment prometheus-adapter should have affinity
>
> # oc -n openshift-monitoring get deployment prometheus-adapter -oyaml | grep -A10 affinity
> # oc -n openshift-monitoring get deployment thanos-querier -oyaml | grep -A10 affinity
> --
> affinity:
>   podAntiAffinity:
>     preferredDuringSchedulingIgnoredDuringExecution:
>     - podAffinityTerm:
>         labelSelector:
>           matchExpressions:
>           - key: app.kubernetes.io/name
>             operator: In
>             values:
>             - thanos-query
>         namespaces:

Confirmed with Damien: this is expected behavior for now, and the fix is temporary.
*** Bug 1950911 has been marked as a duplicate of this bug. ***
*** Bug 1952762 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438