During a 4.8-to-4.8 PR job upgrade, Prometheus had both instances down at the same time (the pods were on the same node). Prometheus needs anti-affinity and a PDB, otherwise availability of metrics is violated.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1366424416887508992

Mar 01 18:14:32.764 W ns/openshift-monitoring pod/prometheus-k8s-1 node/ci-op-rzz100ym-db044-jl74q-worker-d-xzb54 reason/Deleted
Mar 01 18:14:32.885 W ns/openshift-monitoring pod/prometheus-k8s-0 node/ci-op-rzz100ym-db044-jl74q-worker-d-xzb54 reason/Deleted
Mar 01 18:14:32.910 I ns/openshift-monitoring pod/prometheus-k8s-1 node/ reason/Created
Mar 01 18:14:32.932 W ns/openshift-monitoring pod/prometheus-k8s-1 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:14:32.932 I ns/openshift-monitoring statefulset/prometheus-k8s reason/SuccessfulCreate create Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful
Mar 01 18:14:32.974 W ns/openshift-monitoring pod/prometheus-k8s-1 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:14:33.070 I ns/openshift-monitoring pod/prometheus-k8s-0 node/ reason/Created
Mar 01 18:14:33.080 W ns/openshift-monitoring pod/prometheus-k8s-0 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:14:33.081 I ns/openshift-monitoring statefulset/prometheus-k8s reason/SuccessfulCreate create Pod prometheus-k8s-0 in StatefulSet prometheus-k8s successful
Mar 01 18:14:33.112 W ns/openshift-monitoring pod/prometheus-k8s-0 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:14:35.369 W ns/openshift-monitoring pod/prometheus-k8s-1 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:15:05.240 W ns/openshift-monitoring pod/prometheus-k8s-0 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:15:32.883 - 216s W ns/openshift-monitoring pod/prometheus-k8s-1 node/ pod has been pending longer than a minute
Mar 01 18:15:33.883 - 217s W ns/openshift-monitoring pod/prometheus-k8s-0 node/ pod has been pending longer than a minute
Mar 01 18:18:49.576 W ns/openshift-monitoring pod/prometheus-k8s-1 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:18:49.703 W ns/openshift-monitoring pod/prometheus-k8s-0 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:19:00.433 W ns/openshift-monitoring pod/prometheus-k8s-1 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:19:00.433 W ns/openshift-monitoring pod/prometheus-k8s-0 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
Mar 01 18:19:10.491 I ns/openshift-monitoring pod/prometheus-k8s-1 node/ci-op-rzz100ym-db044-jl74q-worker-d-xzb54 reason/Scheduled
Mar 01 18:19:10.883 - 11s W ns/openshift-monitoring pod/prometheus-k8s-1 node/ci-op-rzz100ym-db044-jl74q-worker-d-xzb54 pod has been pending longer than a minute

The PR job in question tests that the Thanos querier reports continuous availability of the Watchdog alert. That check failed because for 4 minutes there was no Prometheus instance scheduled:

Mar 01 18:19:10.491 I ns/openshift-monitoring pod/prometheus-k8s-1 node/ci-op-rzz100ym-db044-jl74q-worker-d-xzb54 reason/Scheduled
Mar 01 18:19:11.445 I ns/openshift-monitoring pod/prometheus-k8s-0 node/ci-op-rzz100ym-db044-jl74q-worker-d-xzb54 reason/Scheduled

So the test correctly caught the lack of availability of Prometheus. If this bug is duped onto another bug, please ensure both the PDB and anti-affinity are set, not just one or the other.
The Prometheus StatefulSet has:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: prometheus
              operator: In
              values:
              - k8s
          namespaces:
          - openshift-monitoring
          topologyKey: kubernetes.io/hostname
        weight: 100

and Alertmanager has:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: alertmanager
              operator: In
              values:
              - main
          namespaces:
          - openshift-monitoring
          topologyKey: kubernetes.io/hostname
        weight: 100

thanos-querier has:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - thanos-query
          namespaces:
          - openshift-monitoring
          topologyKey: kubernetes.io/hostname
        weight: 100

prometheus-adapter has no anti-affinity at all, so:

1) possibly we need it on the adapter?
2) possibly the existing anti-affinity rules need to be weighted higher, or something else happened on the cluster that meant anti-affinity could not be enforced?

On the 6-node (3 worker, 3 master) cluster I'm looking at, the pods are mostly spread across nodes, but Alertmanager does have 2 pods on one node (and the third on another node), which would imply its anti-affinity request was not respected:

alertmanager-main-0   5/5   Running   0   18m   10.128.2.8    ip-10-0-191-156.us-east-2.compute.internal   <none>   <none>
alertmanager-main-1   5/5   Running   0   18m   10.131.0.19   ip-10-0-192-222.us-east-2.compute.internal   <none>   <none>
alertmanager-main-2   5/5   Running   0   18m   10.128.2.9    ip-10-0-191-156.us-east-2.compute.internal   <none>   <none>
Mar 01 18:14:33.112 W ns/openshift-monitoring pod/prometheus-k8s-0 reason/FailedScheduling 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.

It seems this is related to storage ("volume node affinity conflict").
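One way to confirm the storage angle (a sketch, not taken from the original report) is to look at the node affinity recorded on the PersistentVolumes; with zonal volumes such as GCE PDs, a PV can only be attached on nodes in its zone:

# List each PV together with the node selector terms it is pinned to.
# A "volume node affinity conflict" means the only nodes satisfying these
# terms were unschedulable (e.g. being drained) at scheduling time.
oc get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms}{"\n"}{end}'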
See: https://github.com/openshift/enhancements/pull/718/files

I guess the issue here is that we are using "preferredDuringSchedulingIgnoredDuringExecution" and not "requiredDuringSchedulingIgnoredDuringExecution": https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#always-co-located-in-the-same-node

But it sounds like a pod disruption budget also needs to be set, to ensure that at least 1 instance is kept running in cases where there is nowhere to schedule the 2nd instance.
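For illustration only (not necessarily the exact change that eventually merged), a hard anti-affinity rule for the Prometheus pods could look something like the following, reusing the label selector from the existing soft rule quoted above:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # Refuse to schedule a replica onto a node that already hosts a pod
      # matching this selector, instead of merely preferring not to.
      - labelSelector:
          matchExpressions:
          - key: prometheus
            operator: In
            values:
            - k8s
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname

Note that required rules carry no weight and their entries are pod affinity terms directly, unlike the preferred form which wraps the term under podAffinityTerm.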
Yes, a PDB is necessary to survive dual disruption during a normal upgrade, and spreading is necessary to survive a single host failure.
As Clayton is alluding to, we allow configuration of the parallelism in normal upgrades, so it is not uncommon to have 2+ nodes evicted / upgraded at once. Without a PDB that could easily end up hitting both Prometheus pods at the same time.
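A minimal sketch of the kind of PodDisruptionBudget being discussed (selector borrowed from the prometheus: k8s label shown earlier in this bug; the object CMO eventually ships may carry additional labels):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-k8s
  namespace: openshift-monitoring
spec:
  # Keep at least one Prometheus replica running during voluntary disruptions
  # such as the node drains performed by a parallel upgrade.
  minAvailable: 1
  selector:
    matchLabels:
      prometheus: k8s

With this in place, a drain that would take the second replica down while the first is still pending is blocked until the eviction can proceed without violating minAvailable.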
That, and the current configuration of "preferred" instead of "required", means 2 pods can actually still land on the same node.
Bug 1949262 targets the scheduling configuration for Prometheus and Alertmanager (e.g. switching from soft to hard affinity). This bug is about setting up the pod disruption budget. As far as I understand, Pawel has done the groundwork in kube-prometheus but he needs to wire it up in the cluster monitoring operator.
*** Bug 1953647 has been marked as a duplicate of this bug. ***
Since the discussion has been moved to upstream prometheus-operator, would it be meaningful to extend the scope of this BZ to thanos-ruler?
Yes. If Pawel thinks that is feasible, this BZ can cover all the resources managed by prometheus-operator (not just prometheus).
The solution that works for OpenShift already landed in CMO, so if needed let's open a new BZ for thanos-ruler. For a more generic solution we have an open issue[1] in prometheus-operator to move the PDB setting logic into the operator itself. I am setting MODIFIED as, from the OpenShift perspective, it should be fixed.

[1]: https://github.com/prometheus-operator/prometheus-operator/issues/3917
Upgraded from 4.8.0-0.nightly-2021-05-21-200728 to 4.8.0-0.nightly-2021-05-21-233425; no Prometheus instance went unavailable.

# oc -n openshift-monitoring get PodDisruptionBudget
NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
prometheus-adapter   1               N/A               1                     162m
prometheus-k8s       1               N/A               1                     158m

# oc -n openshift-monitoring get prometheus k8s -oyaml | grep podAntiAffinity -A10
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: prometheus
            app.kubernetes.io/name: prometheus
            app.kubernetes.io/part-of: openshift-monitoring
            prometheus: k8s
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep podAntiAffinity -A10
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: metrics-adapter
                app.kubernetes.io/name: prometheus-adapter
                app.kubernetes.io/part-of: openshift-monitoring
            namespaces:
            - openshift-monitoring
            topologyKey: kubernetes.io/hostname
      containers:
Moving this BZ back to assigned since the fix was reverted in [1], because of 4.8 blocker bug 1967614. [1] https://github.com/openshift/cluster-monitoring-operator/pull/1204
Can we get an updated summary of what's happening with this bug? For the time being I'm marking it as a blocker for 4.9 given it was reported well before 4.8 GA and we've had 3 months to come up with a plan by now.
We have a few prerequisites that need to happen before we can fix this specific issue. This is the plan we've discussed with Damien so far:

1. The cluster monitoring operator sets the Upgradeable condition to false whenever it detects that stateful pods aren't correctly spread across nodes (bug 1995924, addressed by https://github.com/openshift/cluster-monitoring-operator/pull/1330).
2. We're adding a script to the upgrade CI jobs ensuring that pods are correctly spread before the upgrade happens (see https://github.com/openshift/release/pull/21258, and the sketch below).
3. We backport bug 1995924 to 4.8.z so CMO blocks the upgrade to 4.9 when monitoring pods aren't correctly balanced.
4. We enable hard anti-affinity + PDB for the monitoring stateful workloads that can support it (*): bug 1933847, bug 1949262, bug 1955490. I'll work on a work-in-progress PR ASAP but the bits should already be there (we did the implementation in 4.8 already but we had to revert).

We believe that steps 1, 2 and 3 are doable before 4.9 GA. Step 4 might be more risky (changing the scheduling constraints can break things, as we experienced during the 4.8 dev cycle) but if we notice regressions, a revert is always possible.

(*) Which excludes Alertmanager because we run 3 replicas of it and the minimal number of workers is 2, so there might be environments where 2 pods end up on the same node no matter what. The plan (as described in bug 1955489) is to scale down to 2 replicas, but we don't have all the features needed yet in Prometheus operator to make that happen in 4.9.
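Regarding step 2, the actual CI script lives in the linked release PR; as a rough sketch (assuming the prometheus=k8s pod label used throughout this bug), the spread check amounts to something like:

# Print any node name that hosts more than one prometheus-k8s replica.
# Empty output means the pods are correctly spread before the upgrade starts.
oc -n openshift-monitoring get pods -l prometheus=k8s \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -d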
*** Bug 1992446 has been marked as a duplicate of this bug. ***
https://github.com/openshift/cluster-monitoring-operator/pull/1341 has been merged
Setting this bug as verified, since bug https://bugzilla.redhat.com/show_bug.cgi?id=1995924 has been verified.
Tested with payload 4.10.0-0.nightly-2021-11-23-090522:

Label one node, schedule prometheus-k8s to that node with a node selector, and create PVCs at the same time.
Remove the label from the node and delete the node selector from the config map.
Annotate one PVC with "openshift.io/cluster-monitoring-drop-pvc": "yes" by editing the PVC (see the sketch after this comment).

% oc -n openshift-monitoring get pod -owide |grep prometheus-k8s
prometheus-k8s-0   7/7   Running   0   8m4s   10.129.2.41    ip-10-0-189-252.ap-northeast-2.compute.internal   <none>   <none>
prometheus-k8s-1   7/7   Running   0   100m   10.128.2.192   ip-10-0-147-182.ap-northeast-2.compute.internal   <none>   <none>

% oc adm upgrade
Cluster version is 4.10.0-0.nightly-2021-11-23-090522

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.10
Updates:

VERSION   IMAGE

Annotating the PVC solved the single point of failure.
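For reference, the drop-pvc annotation from the steps above can presumably also be applied with oc annotate instead of editing the PVC (PVC name taken from the output in the next comment; pick whichever replica's volume should be dropped):

# Mark one of the co-located Prometheus PVCs for deletion so that its pod can
# be rescheduled onto a different node.
oc -n openshift-monitoring annotate pvc prometheus-k8s-db-prometheus-k8s-0 \
  openshift.io/cluster-monitoring-drop-pvc=yes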
Correcting comment 33:

Set up a cluster with OCP 4.9.8.
Label one worker node, schedule prometheus-k8s to the node with a node selector, and create PVCs at the same time. Two instances of prometheus-k8s are running on the same node:

hongyli@hongyli-mac Downloads % oc -n openshift-monitoring get pod -owide |grep prometheus-k8s
prometheus-k8s-0   7/7   Running   0   61m   10.128.2.188   ip-10-0-147-182.ap-northeast-2.compute.internal   <none>   <none>
prometheus-k8s-1   7/7   Running   0   72m   10.128.2.187   ip-10-0-147-182.ap-northeast-2.compute.internal   <none>   <none>

hongyli@hongyli-mac Downloads % oc -n openshift-monitoring get pvc
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-k8s-db-prometheus-k8s-0   Bound    pvc-202c790a-5de0-438f-89d0-a6f4a0455c2c   10Gi       RWO            gp2            61m
prometheus-k8s-db-prometheus-k8s-1   Bound    pvc-6e9245d3-899e-4d70-9a12-14b2a6f3e5b6   10Gi       RWO            gp2            72m

hongyli@hongyli-mac Downloads % oc adm upgrade
Cluster version is 4.9.8

Upgradeable=False

  Reason: WorkloadSinglePointOfFailure
  Message: Cluster operator monitoring should not be upgraded between minor versions: Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure. Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.10
Updates:

VERSION                         IMAGE
4.10.0-0.ci-2021-11-20-192530   registry.ci.openshift.org/ocp/release@sha256:b59c2a8b6347e3bd91ce8920d5b1c84c4d267e7ea3085ddbd0419d4d6e95843c

Remove the node selector from the config map and remove the label from the node.
Annotate one PVC with "openshift.io/cluster-monitoring-drop-pvc": "yes" by editing the PVC.

% oc -n openshift-monitoring get pod -owide |grep prometheus-k8s
prometheus-k8s-0   7/7   Running   0   8m4s   10.129.2.41    ip-10-0-189-252.ap-northeast-2.compute.internal   <none>   <none>
prometheus-k8s-1   7/7   Running   0   100m   10.128.2.192   ip-10-0-147-182.ap-northeast-2.compute.internal   <none>   <none>

% oc adm upgrade
Cluster version is 4.9.8

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.10
Updates:

VERSION                         IMAGE
4.10.0-0.ci-2021-11-20-192530   registry.ci.openshift.org/ocp/release@sha256:b59c2a8b6347e3bd91ce8920d5b1c84c4d267e7ea3085ddbd0419d4d6e95843c

% oc adm upgrade --to='4.10.0-0.nightly-2021-11-23-090522'

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-23-090522   True        False         12h     Cluster version is 4.10.0-0.nightly-2021-11-23-090522

% oc -n openshift-monitoring get PodDisruptionBudget prometheus-k8s -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-11-23T12:57:14Z"
  generation: 1
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 2.30.3
  name: prometheus-k8s
  namespace: openshift-monitoring
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: openshift-monitoring
      prometheus: k8s
status:

% oc -n openshift-monitoring get sts prometheus-k8s -oyaml|grep podAntiAffinity -A10
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: prometheus
                app.kubernetes.io/name: prometheus
                app.kubernetes.io/part-of: openshift-monitoring
                prometheus: k8s
            namespaces:
            - openshift-monitoring

% oc -n openshift-user-workload-monitoring get poddisruptionbudget prometheus-user-workload -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-11-24T02:39:50Z"
  generation: 1
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 2.30.3
  name: prometheus-user-workload
  namespace: openshift-user-workload-monitoring
  resourceVersion: "533928"
  uid: efa467ec-2a00-4c6d-a2f3-636fbc543425
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: openshift-monitoring
      prometheus: user-workload

% oc -n openshift-user-workload-monitoring get poddisruptionbudget thanos-ruler-user-workload -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-11-24T02:39:55Z"
  generation: 1
  labels:
    thanosRulerName: user-workload
  name: thanos-ruler-user-workload
  namespace: openshift-user-workload-monitoring
  resourceVersion: "534024"
  uid: d2a70bc9-34d8-4eac-bc93-0d31e675eec3
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-ruler
      thanos-ruler: user-workload

% oc -n openshift-user-workload-monitoring get sts prometheus-user-workload -oyaml|grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: prometheus
                app.kubernetes.io/name: prometheus
                app.kubernetes.io/part-of: openshift-monitoring
                prometheus: user-workload
            namespaces:
            - openshift-user-workload-monitoring

% oc -n openshift-user-workload-monitoring get sts thanos-ruler-user-workload -oyaml|grep affinity -A10
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: thanos-ruler
                thanos-ruler: user-workload
            namespaces:
            - openshift-user-workload-monitoring
            topologyKey: kubernetes.io/hostname
      containers:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056