Bug 1974830 - KubeDeploymentReplicasMismatch alert will never fire
Summary: KubeDeploymentReplicasMismatch alert will never fire
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Philip Gough
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-06-22 15:52 UTC by Simon Pasquier
Modified: 2021-10-18 17:36 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:35:57 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1261 0 None open Bug 1974830: Update KubeDeploymentReplicasMismatch alert 2021-07-02 12:00:50 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:36:22 UTC

Description Simon Pasquier 2021-06-22 15:52:52 UTC
Description of problem:
The KubeDeploymentReplicasMismatch alert will never fire.

The alert expression is '(kube_deployment_spec_replicas{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"} != kube_deployment_status_replicas_available{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}) and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}[5m]) == 0) and cluster:control_plane:all_nodes_ready'. The right-hand side ("cluster:control_plane:all_nodes_ready") has no labels while the left-hand side has many; because PromQL's 'and' operator requires matching label sets on both sides by default, the evaluation returns an empty result.
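The mismatch can be seen in isolation (a sketch using the real series names but simplified selectors; `and on()` is just one possible way to gate on a labelless series, not necessarily the fix that will merge):

```promql
# Default vector matching for 'and' requires identical label sets on both
# sides. The left side carries {namespace, deployment, job, ...} while
# cluster:control_plane:all_nodes_ready has no labels, so this is empty:
(kube_deployment_spec_replicas != kube_deployment_status_replicas_available)
  and cluster:control_plane:all_nodes_ready

# Ignoring all labels during matching lets the single labelless gate
# series apply to every left-hand series:
(kube_deployment_spec_replicas != kube_deployment_status_replicas_available)
  and on() (cluster:control_plane:all_nodes_ready == 1)
```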

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-06-21-191858

How reproducible:
Always

Steps to Reproduce:
1. Create a deployment that can't be fulfilled by the scheduler, for instance using node anti-affinity and a number of replicas greater than the number of workers.

Actual results:
KubeDeploymentReplicasMismatch doesn't fire.

Expected results:
KubeDeploymentReplicasMismatch should fire.

Additional info:
I haven't actually tried the steps to reproduce, but I'm almost sure about the issue. Please take some time to verify in case I'm totally wrong though :)

Comment 1 W. Trevor King 2021-06-25 21:10:21 UTC
I'm going to drop 1245 here; there's a revert in flight [1].  Once the revert lands, we can take another pass at fixing the issue in this bug; we don't need explicit links to the 1245/1249 waffling.  Compare this background CI rate from before 1249:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeDeploymentReplicasMismatch+fired&maxAge=336h&type=junit' | grep -o '[0-9]* days ago' | sort -n | uniq -c
      1 3 days ago
      4 5 days ago
      2 6 days ago
      4 7 days ago
      1 8 days ago
      2 9 days ago
      4 10 days ago
      2 11 days ago
      4 13 days ago

with today's flood:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+KubeDeploymentReplicasMismatch+fired&maxAge=24h&type=junit' | grep -o 'hours ago' | wc -l
167

I'm pretty fuzzy on the mechanism though.  Poking around at [2], PromeCIeus shows:

  kube_deployment_status_replicas_available{namespace="openshift-etcd",deployment="etcd-quorum-guard"}

dropping from 3 to 2 at 15:57:25Z and returning to 3 at 16:18:43Z.  I'd expect the PDB guard to allow that to tick back up to 3 as the machine-config operator moved from node to node rolling the MachineConfigPool.  But for some reason, Prometheus doesn't appear to be noticing the brief "all 3 ready again" windows.  Or maybe the Deployment controller is slower to declare the recovered pods ready than the disruption controller, so we start draining the next etcd pod before the Deployment controller thinks we got back to 3.

However, both before and after monitoring#1249, cluster:control_plane:all_nodes_ready should gate the whole alert expr.  In PromeCIeus, that metric drops out at 16:06:52Z, comes back at 16:11:53Z, drops back out at 16:14:52Z, and comes back again at 16:18:53Z.  Despite that, PromeCIeus shows:

  ALERTS{alertname="KubeDeploymentReplicasMismatch",namespace="openshift-etcd"}

pending *the entire time* from 15:57:53Z to 16:17:52Z, and then firing from 16:17:53Z to 16:18:52Z.

Pasting the full expression from before 1249 into PromeCIeus gives no hits, so it makes sense that the revert will recover CI.  But pasting the full expression from after 1249 into PromeCIeus has the expected holes from where cluster:control_plane:all_nodes_ready was out, so I don't understand how the alert could possibly have been pending during that time.  It's possible that the 1249 expression was fine, and is just uncovering some unrelated bug in whatever it is that evaluates the PrometheusRules.

[1]: https://github.com/openshift/cluster-monitoring-operator/pull/1249
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/843/pull-ci-openshift-oc-master-e2e-aws-upgrade/1408423006006415360

Comment 2 W. Trevor King 2021-06-25 21:16:04 UTC
err, and in that comment^, a bunch of my 1249 references were meant to be 1245 references (talking about the PR getting reverted, not about the revert itself).

Comment 6 hongyan li 2021-07-02 03:42:12 UTC
Test with payload 4.9.0-0.nightly-2021-07-01-104253

Got the expected alert by adding a label to a node and creating a deployment with podAntiAffinity:
token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts|jq |grep -A10 KubeDeploymentReplicasMismatch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6326    0  6326    0     0   441k      0 --:--:-- --:--:-- --:--:--  441k
        "alertname": "KubeDeploymentReplicasMismatch",
        "container": "kube-rbac-proxy-main",
        "deployment": "prometheus-example-app",
        "endpoint": "https-main",
        "job": "kube-state-metrics",
        "namespace": "default",
        "prometheus": "openshift-monitoring/k8s",
        "service": "kube-state-metrics",
        "severity": "warning"
      },

------------------------------------------------------
# oc label node $work_node_name prometheus=example-app
# oc apply -f -<<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-example-app
  name: prometheus-example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: prometheus-example-app
  template:
    metadata:
      labels:
        app: prometheus-example-app
    spec:
      nodeSelector:
        prometheus: example-app
      affinity:
        podAntiAffinity: 
          requiredDuringSchedulingIgnoredDuringExecution: 
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - prometheus-example-app
            topologyKey: kubernetes.io/hostname 
      containers:
      - image: quay.io/openshifttest/prometheus-example-app@sha256:6db44b54b8a6de24f947de9dfb16454a270a4e3a7cd69a14d61948512891ef96
        imagePullPolicy: IfNotPresent
        name: prometheus-example-app
EOF

Comment 7 W. Trevor King 2021-07-02 04:08:01 UTC
I'm moving back to assigned.  So far, we have 1245, reverted in 1249, and 1253, reverted in 1259.  We'll need a third attempt that doesn't break CI before we can verify this bug.

Comment 9 hongyan li 2021-07-08 07:52:56 UTC
PR 1261 is in 4.9.0-0.nightly-2021-07-06-133521 and later payloads

Comment 10 hongyan li 2021-07-08 07:57:28 UTC
# oc label node $work_node_name prometheus=example-app
# oc apply -f -<<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-example-app
  name: prometheus-example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: prometheus-example-app
  template:
    metadata:
      labels:
        app: prometheus-example-app
    spec:
      nodeSelector:
        prometheus: example-app
      affinity:
        podAntiAffinity: 
          requiredDuringSchedulingIgnoredDuringExecution: 
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - prometheus-example-app
            topologyKey: kubernetes.io/hostname 
      containers:
      - image: quay.io/openshifttest/prometheus-example-app@sha256:6db44b54b8a6de24f947de9dfb16454a270a4e3a7cd69a14d61948512891ef96
        imagePullPolicy: IfNotPresent
        name: prometheus-example-app
EOF

Got the expected alert by adding a label to a node and creating a deployment with podAntiAffinity:
token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq

---------------------
      {
        "labels": {
          "alertname": "KubeDeploymentReplicasMismatch",
          "container": "kube-rbac-proxy-main",
          "deployment": "prometheus-example-app",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "default",
          "service": "kube-state-metrics",
          "severity": "warning"
        },
        "annotations": {
          "description": "Deployment default/prometheus-example-app has not matched the expected number of replicas for longer than 15 minutes. This indicates that cluster infrastructure is unable to start or restart the necessary components. This most often occurs when one or more nodes are down or partioned from the cluster, or a fault occurs on the node that prevents the workload from starting. In rare cases this may indicate a new version of a cluster component cannot start due to a bug or configuration error. Assess the pods for this deployment to verify they are running on healthy nodes and then contact support.",
          "summary": "Deployment has not matched the expected number of replicas"
        },
        "state": "pending",
        "activeAt": "2021-07-08T07:54:22.563881375Z",
        "value": "2e+00"
      },
---------------------

Comment 11 hongyan li 2021-07-08 08:03:02 UTC
Continuation of comment 10.

oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml|grep -A30 KubeDeploymentReplicasMismatch 
      - alert: KubeDeploymentReplicasMismatch
        annotations:
          description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has
            not matched the expected number of replicas for longer than 15 minutes. This
            indicates that cluster infrastructure is unable to start or restart the necessary
            components. This most often occurs when one or more nodes are down or partioned
            from the cluster, or a fault occurs on the node that prevents the workload
            from starting. In rare cases this may indicate a new version of a cluster
            component cannot start due to a bug or configuration error. Assess the pods
            for this deployment to verify they are running on healthy nodes and then contact
            support.
          summary: Deployment has not matched the expected number of replicas
        expr: |
          (((
            kube_deployment_spec_replicas{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}
              >
            kube_deployment_status_replicas_available{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}
          ) and (
            changes(kube_deployment_status_replicas_updated{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[5m])
              ==
            0
          )) * on() group_left cluster:control_plane:all_nodes_ready) > 0
        for: 15m
        labels:
          severity: warning

Test with payload 4.9.0-0.nightly-2021-07-07-021823
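For reference, the `* on() group_left` construct in the fixed expression above multiplies each left-hand series by the value of the single labelless `cluster:control_plane:all_nodes_ready` series (1 when all control-plane nodes are ready, 0 otherwise) while keeping the left-hand labels; the trailing `> 0` then drops the results that were zeroed out by the gate. A minimal sketch of the pattern, with hypothetical metric names:

```promql
# labeled_metric keeps its labels in the output; gate is a single
# labelless 0/1 series. Results zeroed by the gate are dropped by '> 0'.
(labeled_metric * on() group_left gate) > 0
```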

Comment 12 Junqi Zhao 2021-07-12 08:16:37 UTC
Checked the CI results: KubeDeploymentReplicasMismatch now fires when its expression matches.
https://search.ci.openshift.org/?search=alert+KubeDeploymentReplicasMismatch+fired&maxAge=24h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
For example, KubeDeploymentReplicasMismatch fired when the current replica count did not match the expected number of replicas:
checked #1414233172719177728
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-4.9/1414233172719177728
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Run #0: Failed	14s
fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jul 11 16:20:03.773: Unexpected alerts fired or pending after the test run:

alert KubeDeploymentReplicasMismatch fired for 120 seconds with labels: {container="kube-rbac-proxy-main", deployment="thanos-querier", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning"}

Checked the logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-4.9/1414233172719177728/artifacts/e2e-aws/pods/openshift-monitoring_thanos-querier-7b64df4c89-trcfg_oauth-proxy.log
The oauth-proxy container error was:
2021/07/11 16:06:15 reverseproxy.go:480: http: proxy error: dial tcp [::1]:9090: connect: connection refused

so the alert was triggered as expected.

Comment 19 errata-xmlrpc 2021-10-18 17:35:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

