Description of problem:

The KubeDeploymentReplicasMismatch alert will never fire. The alert expression is:

(kube_deployment_spec_replicas{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"} != kube_deployment_status_replicas_available{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}) and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}[5m]) == 0) and cluster:control_plane:all_nodes_ready

The right-hand side expression ("cluster:control_plane:all_nodes_ready") has no labels while the left-hand side has many. Because PromQL's default one-to-one vector matching only pairs samples with identical label sets, the evaluation will always return an empty result.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-06-21-191858

How reproducible:
Always

Steps to Reproduce:
1. Create a deployment that can't be fulfilled by the scheduler, for instance using node anti-affinity and a number of replicas greater than the number of workers.

Actual results:
KubeDeploymentReplicasMismatch doesn't fire.

Expected results:
KubeDeploymentReplicasMismatch should fire.

Additional info:
I haven't actually tried the steps to reproduce, but I'm almost sure about the issue. Please take some time to verify that I'm totally wrong though :)
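As background on why the expression always comes back empty, here is a minimal Python sketch (a simplified toy model of my own, not Prometheus source code) of PromQL's default one-to-one vector matching: a binary operator like `!=` or `and` only pairs samples whose full label sets are identical on both sides, so an unlabeled recording rule can never match the labeled kube-state-metrics series.

```python
import operator

def one_to_one(lhs, rhs, op):
    """Toy model of PromQL one-to-one vector matching (not Prometheus code).

    lhs/rhs map a frozenset of (label, value) pairs to a sample value.
    A pair of samples only participates when their label sets are identical.
    """
    out = {}
    for labels, lval in lhs.items():
        if labels in rhs and op(lval, rhs[labels]):
            out[labels] = lval
    return out

# Left-hand side: per-deployment series carrying several labels.
lhs = {
    frozenset({("namespace", "openshift-etcd"),
               ("deployment", "etcd-quorum-guard"),
               ("job", "kube-state-metrics")}): 3.0,
}
# Right-hand side: cluster:control_plane:all_nodes_ready has no labels at all.
rhs = {frozenset(): 1.0}

# No label set on the left equals the empty label set on the right,
# so the one-to-one operation produces an empty vector.
print(one_to_one(lhs, rhs, operator.ne))  # -> {}
```

The same mismatch applies to the trailing `and cluster:control_plane:all_nodes_ready` clause, which is why the gate silences the alert entirely instead of gating it.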
I'm going to drop 1245 here; there's a revert in flight [1]. Once the revert lands, we can take another pass at fixing the issue in this bug; we don't need explicit links to the 1245/1249 waffling.

Compare this background CI rate from before 1249:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeDeploymentReplicasMismatch+fired&maxAge=336h&type=junit' | grep -o '[0-9]* days ago' | sort -n | uniq -c
      1 3 days ago
      4 5 days ago
      2 6 days ago
      4 7 days ago
      1 8 days ago
      2 9 days ago
      4 10 days ago
      2 11 days ago
      4 13 days ago

with today's flood:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeDeploymentReplicasMismatch+fired&maxAge=24h&type=junit' | grep -o 'hours ago' | wc -l
167

I'm pretty fuzzy on the mechanism, though. Poking around at [2], PromeCIeus shows:

kube_deployment_status_replicas_available{namespace="openshift-etcd",deployment="etcd-quorum-guard"}

dropping from 3 to 2 at 15:57:25Z and returning to 3 at 16:18:43Z. I'd expect the PDB guard to allow that to tick back up to 3 as the machine-config operator moved from node to node rolling the MachineConfigPool. But for some reason, Prometheus doesn't appear to be noticing the brief "all 3 ready again" windows. Or maybe the Deployment controller is slower to declare the recovered pods ready than the disruption controller, so we start draining the next etcd pod before the Deployment controller thinks we're back to 3.

However, both before and after monitoring#1249, cluster:control_plane:all_nodes_ready should gate the whole alert expr. In PromeCIeus, that metric drops out at 16:06:52Z, comes back at 16:11:53Z, drops back out at 16:14:52Z, and comes back again at 16:18:53Z. Despite that, PromeCIeus shows:

ALERTS{alertname="KubeDeploymentReplicasMismatch",namespace="openshift-etcd"}

pending *the entire time* from 15:57:53Z to 16:17:52Z, and then firing from 16:17:53Z to 16:18:52Z.
Pasting the full expression from before 1249 into PromeCIeus gives no hits, so it makes sense that the revert will recover CI. But pasting the full expression from after 1249 into PromeCIeus has the expected holes from where cluster:control_plane:all_nodes_ready was out, so I don't understand how the alert could possibly have been pending during that time. It's possible that the 1249 expression was fine, and is just uncovering some unrelated bug in whatever it is that evaluates the PrometheusRules.

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/1249
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/843/pull-ci-openshift-oc-master-e2e-aws-upgrade/1408423006006415360
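For reference, the surprise here is about how the `for` clause is expected to behave. A minimal Python sketch (my simplified model of the documented semantics, not the actual rule-evaluator code) of that expectation: an alert goes pending when the expr first returns a sample, fires only once the expr has held continuously for the `for` duration, and any evaluation where the expr returns nothing should reset it to inactive. Under that model, pending-through-the-gaps is exactly what shouldn't happen.

```python
def alert_states(expr_active, for_intervals):
    """Simplified model of Prometheus `for` handling (assumption, not source).

    expr_active: per-evaluation booleans (did the expr return a sample?).
    for_intervals: how many consecutive active evaluations before firing.
    Returns one of "inactive"/"pending"/"firing" per evaluation.
    """
    states, streak = [], 0
    for active in expr_active:
        streak = streak + 1 if active else 0  # any empty result resets
        if streak == 0:
            states.append("inactive")
        elif streak >= for_intervals:
            states.append("firing")
        else:
            states.append("pending")
    return states

# A gap in the expr (like the all_nodes_ready drop-outs) should reset pending:
print(alert_states([True, True, False, True, True, True], for_intervals=3))
# -> ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```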
err, and in that comment^, a bunch of my 1249 references were meant to be 1245 references (talking about the PR getting reverted, not about the revert itself).
Test with payload 4.9.0-0.nightly-2021-07-01-104253

Get expected alert by adding a label to a node and creating podAntiAffinity:

token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts | jq | grep -A10 KubeDeploymentReplicasMismatch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6326    0  6326    0     0   441k      0 --:--:-- --:--:-- --:--:--  441k
        "alertname": "KubeDeploymentReplicasMismatch",
        "container": "kube-rbac-proxy-main",
        "deployment": "prometheus-example-app",
        "endpoint": "https-main",
        "job": "kube-state-metrics",
        "namespace": "default",
        "prometheus": "openshift-monitoring/k8s",
        "service": "kube-state-metrics",
        "severity": "warning"
      },
------------------------------------------------------
# oc label node $work_node_name prometheus=example-app
# oc apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-example-app
  name: prometheus-example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: prometheus-example-app
  template:
    metadata:
      labels:
        app: prometheus-example-app
    spec:
      nodeSelector:
        prometheus: example-app
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - prometheus-example-app
            topologyKey: kubernetes.io/hostname
      containers:
      - image: quay.io/openshifttest/prometheus-example-app@sha256:6db44b54b8a6de24f947de9dfb16454a270a4e3a7cd69a14d61948512891ef96
        imagePullPolicy: IfNotPresent
        name: prometheus-example-app
EOF
I'm moving back to assigned. So far, we have 1245, reverted in 1249, and 1253, reverted in 1259. We'll need a third attempt that doesn't break CI before we can verify this bug.
PR 1261 is in 4.9.0-0.nightly-2021-07-06-133521 and later payloads
# oc label node $work_node_name prometheus=example-app
# oc apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-example-app
  name: prometheus-example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: prometheus-example-app
  template:
    metadata:
      labels:
        app: prometheus-example-app
    spec:
      nodeSelector:
        prometheus: example-app
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - prometheus-example-app
            topologyKey: kubernetes.io/hostname
      containers:
      - image: quay.io/openshifttest/prometheus-example-app@sha256:6db44b54b8a6de24f947de9dfb16454a270a4e3a7cd69a14d61948512891ef96
        imagePullPolicy: IfNotPresent
        name: prometheus-example-app
EOF

Get expected alert by adding a label to a node and creating podAntiAffinity:

token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq
---------------------
      {
        "labels": {
          "alertname": "KubeDeploymentReplicasMismatch",
          "container": "kube-rbac-proxy-main",
          "deployment": "prometheus-example-app",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "default",
          "service": "kube-state-metrics",
          "severity": "warning"
        },
        "annotations": {
          "description": "Deployment default/prometheus-example-app has not matched the expected number of replicas for longer than 15 minutes. This indicates that cluster infrastructure is unable to start or restart the necessary components. This most often occurs when one or more nodes are down or partioned from the cluster, or a fault occurs on the node that prevents the workload from starting. In rare cases this may indicate a new version of a cluster component cannot start due to a bug or configuration error. Assess the pods for this deployment to verify they are running on healthy nodes and then contact support.",
          "summary": "Deployment has not matched the expected number of replicas"
        },
        "state": "pending",
        "activeAt": "2021-07-08T07:54:22.563881375Z",
        "value": "2e+00"
      },
---------------------
Continuation of comment 10:

oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep -A30 KubeDeploymentReplicasMismatch
    - alert: KubeDeploymentReplicasMismatch
      annotations:
        description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has
          not matched the expected number of replicas for longer than 15 minutes.
          This indicates that cluster infrastructure is unable to start or restart
          the necessary components. This most often occurs when one or more nodes
          are down or partioned from the cluster, or a fault occurs on the node
          that prevents the workload from starting. In rare cases this may indicate
          a new version of a cluster component cannot start due to a bug or
          configuration error. Assess the pods for this deployment to verify they
          are running on healthy nodes and then contact support.
        summary: Deployment has not matched the expected number of replicas
      expr: |
        ((( kube_deployment_spec_replicas{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"} > kube_deployment_status_replicas_available{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"} ) and ( changes(kube_deployment_status_replicas_updated{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[5m]) == 0 )) * on() group_left cluster:control_plane:all_nodes_ready) > 0
      for: 15m
      labels:
        severity: warning

Test with payload 4.9.0-0.nightly-2021-07-07-021823
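The fixed expr gates with `* on() group_left` instead of a bare `and`, so the unlabeled gate series can match every labeled left-hand sample. A toy Python sketch of that semantics (my own model under the PromQL vector-matching rules, not Prometheus code): matching `on()` the empty label set pairs the single gate sample with each LHS sample, `group_left` keeps the LHS labels on the result, and the trailing `> 0` drops everything while the gate is 0.

```python
def gate_on_empty_group_left(lhs, gate_value, threshold=0):
    """Toy model of `(lhs * on() group_left gate) > threshold` (not Prometheus code).

    Matching on() the empty label set lets the single unlabeled gate sample
    pair with every LHS sample; group_left keeps the LHS labels on the result.
    """
    product = {labels: v * gate_value for labels, v in lhs.items()}
    return {labels: v for labels, v in product.items() if v > threshold}

lhs = {frozenset({("namespace", "default"),
                  ("deployment", "prometheus-example-app")}): 2.0}

print(gate_on_empty_group_left(lhs, gate_value=1))  # gate open: sample survives
print(gate_on_empty_group_left(lhs, gate_value=0))  # gate closed: empty result
```

Unlike the original `and` form, this can actually produce samples, while still going quiet whenever cluster:control_plane:all_nodes_ready is 0.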
Checked from CI results: KubeDeploymentReplicasMismatch will fire when the expression matches.

https://search.ci.openshift.org/?search=alert+KubeDeploymentReplicasMismatch+fired&maxAge=24h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

For example, KubeDeploymentReplicasMismatch fired when the current replica count did not match the expected number of replicas. Checked #1414233172719177728:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-4.9/1414233172719177728

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Run #0: Failed 14s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jul 11 16:20:03.773: Unexpected alerts fired or pending after the test run:

alert KubeDeploymentReplicasMismatch fired for 120 seconds with labels: {container="kube-rbac-proxy-main", deployment="thanos-querier", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning"}

Checked the logs:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-4.9/1414233172719177728/artifacts/e2e-aws/pods/openshift-monitoring_thanos-querier-7b64df4c89-trcfg_oauth-proxy.log

The oauth-proxy container error is:

2021/07/11 16:06:15 reverseproxy.go:480: http: proxy error: dial tcp [::1]:9090: connect: connection refused

so the alert triggered as expected.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759