Bug 1814723 - How to handle alert KubeDaemonSetMisScheduled
Summary: How to handle alert KubeDaemonSetMisScheduled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Lili Cosic
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-18 15:04 UTC by Hongkai Liu
Modified: 2020-07-13 17:23 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:22:38 UTC
Target Upstream Version:
Embargoed:


Attachments
prome.query (432.08 KB, image/png), attached 2020-03-18 15:04 UTC by Hongkai Liu


Links
Github openshift cluster-monitoring-operator pull 739 (closed): Bug 1814723: Bump dependencies (last updated 2021-01-06 10:02:44 UTC)
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:23:02 UTC)

Description Hongkai Liu 2020-03-18 15:04:02 UTC
Created attachment 1671140 [details]
prome.query

Description of problem:
https://coreos.slack.com/archives/CV1UZU53R/p1584521800135300
Three instances of this alert fired this morning.
AlertManagerAPP  4:56 AM
[FIRING:3] KubeDaemonSetMisScheduled kube-state-metrics (https-main 10.131.62.127:8443 kube-state-metrics-6bcc97c9d6-mmrhm openshift-monitoring/k8s kube-state-metrics warning)



Version-Release number of selected component (if applicable):
oc --context build01 get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-04-222846   True        False         12d     Cluster version is 4.3.0-0.nightly-2020-03-04-222846

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Checked with the Prometheus UI (see the screenshot):

oc --context build01 get ds -n openshift-image-registry
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-ca   11        11        11      11           11          kubernetes.io/os=linux   48d

oc --context build01 get ds -n openshift-machine-config-operator
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
machine-config-daemon   11        11        11      11           11          kubernetes.io/os=linux            48d


oc --context build01 get ds -n openshift-dns
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   11        11        11      11           11          kubernetes.io/os=linux   48d

Are those DSs supposed to run a pod on every node?
How should a cluster admin act on this alert?
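
For example, a quick way to check whether the label actually covers every node (a sketch against the same cluster context; if both counts match the DESIRED value of 11, the selector matches every node):

oc --context build01 get nodes --no-headers | wc -l
oc --context build01 get nodes -l kubernetes.io/os=linux --no-headers | wc -l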


Expected results:
A document describing how to deal with this alert would be great.

Additional info:
------------
alert: KubeDaemonSetMisScheduled
expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}
  > 0
for: 10m
labels:
  severity: warning
annotations:
  message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
    }} are running where they are not supposed to run.'
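
For reference, the recent history of the underlying metric can be graphed in the Prometheus UI with a query along these lines (a sketch; the 1h window is arbitrary):

max_over_time(kube_daemonset_status_number_misscheduled{namespace=~"(openshift-.*|kube-.*|default|logging)"}[1h])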

Comment 1 Lili Cosic 2020-03-18 16:58:07 UTC
Were the daemonsets actually running where they were supposed to be running or not? Are you asking just for docs, or are you saying there is a bug in the alert?

Comment 2 Hongkai Liu 2020-03-18 18:26:59 UTC
If I understand the alert's message correctly, it fired because "the pods of the DS are running where they are not supposed to run".

However, all 3 DSs use the node selector kubernetes.io/os=linux, and I think that label is present on every node of the cluster.
Then why did the alert fire in the first place?


And yes, documentation about how to handle this alert would be nice too.

Comment 3 Lili Cosic 2020-03-19 08:27:38 UTC
Was a new node started on the cluster? Relevant issue: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/347
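
One thing worth checking (a sketch; node churn can leave taints behind, e.g. ToBeDeletedByClusterAutoscaler or node.kubernetes.io/not-ready) is whether any node carries a taint the DS pods do not tolerate:

oc --context build01 get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints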

Comment 4 Hongkai Liu 2020-03-19 13:22:46 UTC
See the Prometheus query for the alerts and cluster_autoscaler_nodes_count:

https://imgur.com/a/SePLZQG

We have the autoscaler set up on the cluster.
It seems a node went back and forth between Ready and NotReady.
I am not sure whether this was caused by new nodes or by the same node changing status.
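
To correlate the flapping with node readiness, I could also graph the kube-state-metrics readiness series next to the misscheduled metric (a sketch):

sum(kube_node_status_condition{condition="Ready",status="true"})
kube_daemonset_status_number_misscheduled{namespace=~"(openshift-.*|kube-.*|default|logging)"}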

Comment 5 Hongkai Liu 2020-03-19 13:32:15 UTC
About comment https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/347#issuecomment-585075577
We already have a `10m` `for` clause in the alert definition, yet the alert query still evaluated to 1 for over 10m in this case.
Is this about the sensitivity of the alert?

This is the confusing part to me:
The 3 DSs should run on all nodes with the label kubernetes.io/os=linux.
Why is node status related to firing such an alert?
I can imagine it making sense only if the label were removed and the DS pod were still running there.
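
If it is only sensitivity, one way to make the expression ignore short flaps would be something like this (a sketch only, not necessarily what any fix actually shipped):

min_over_time(kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}[10m]) > 0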

Comment 6 Lili Cosic 2020-03-23 07:31:22 UTC
> The 3 DSs should run on all nodes with label kubernetes.io/os=linux.
> Why is the node status related to firing such an alert?

It seems like there are problems with the node or the autoscaler. The metric and alert look at misscheduled DS pods on the nodes.
> The number of nodes running a daemon pod but are not supposed to.

Note this is a warning alert, so it seems to be doing its job: warning that there is a problem. The alert is not critical, so you/the customer should not be paged on it. But as a warning, this alert makes perfect sense.

Comment 7 Hongkai Liu 2020-03-23 13:08:33 UTC
> Note this is a warning alert so seems to be doing its job, warning there is a problem, the alert is not critical so you/customer should not be paged on it. But as a warning this alert makes perfect sense.


I would like to know what a customer is expected to do when seeing such a warning.

Comment 8 Lili Cosic 2020-03-23 14:11:42 UTC
> I would like to know what a customer is expected to do when seeing such a warning?

Open a ticket to investigate what is wrong with the nodes. Are there no other alerts firing on the cluster, only that one?
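
You can list everything currently firing in the Prometheus UI with the built-in ALERTS series, e.g.:

ALERTS{alertstate="firing"}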

Comment 9 Hongkai Liu 2020-03-23 18:24:22 UTC
> Open ticket to investigate what is wrong with the nodes.
Ack

> Is there no other alerts firing on the cluster only that one?

No other alerts fired within a 2-hour range before/after it.

Comment 11 Lili Cosic 2020-03-26 07:20:36 UTC
I guess this can be closed; I think this answers all your questions?

Comment 12 Lili Cosic 2020-04-01 14:40:31 UTC
Reopening after discussion on slack.

Comment 16 Hongkai Liu 2020-04-22 18:41:01 UTC
Would this be backported to 4.3?

Comment 19 errata-xmlrpc 2020-07-13 17:22:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

