Bug 1814723
Summary: How to handle alert KubeDaemonSetMisScheduled

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Version | 4.3.0 |
| Target Release | 4.5.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | low |
| Priority | low |
| Reporter | Hongkai Liu <hongkliu> |
| Assignee | Lili Cosic <lcosic> |
| QA Contact | Junqi Zhao <juzhao> |
| CC | alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania |
| Keywords | Reopened |
| Type | Bug |
| Doc Type | No Doc Update |
| Last Closed | 2020-07-13 17:22:38 UTC |
Description

Hongkai Liu 2020-03-18 15:04:02 UTC

Were the daemonsets actually running where they were supposed to be running or not? Are you asking just for docs, or are you saying there is a bug in the alert?

If I understand the alert's message correctly, it fires because "the pods of the DS are running where they are not supposed to run". However, all 3 DSs have the node selector kubernetes.io/os=linux, and I think that label is present on every node of the cluster. Then why did it fire in the first place? And yes, documentation about how to handle this alert would be nice too.

Was there a new node started on the cluster? Relevant issue: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/347

See the Prometheus query about the alerts and cluster_autoscaler_nodes_count: https://imgur.com/a/SePLZQG. We have the autoscaler set up on the cluster. It seems we had a node flapping between Ready and NotReady. I am not sure whether it was caused by new nodes or by the same node changing status.

About the comment https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/347#issuecomment-585075577: we already have a `for: 10m` clause in the alert definition, but the alert query returned 1 over 10m in this case. Is it about the sensitivity of the alert?

This is the confusing part to me: the 3 DSs should run on all nodes with the label kubernetes.io/os=linux. Why is the node status related to firing such an alert? I can imagine it making sense only if the label were removed and the DS were still running there.

> The 3 DSs should run on all nodes with label kubernetes.io/os=linux.
> Why is the node status related to firing such an alert?

It seems like there are problems with the node or the autoscaler. The metric and alert look at the misscheduled DS pods on that node:

> The number of nodes running a daemon pod but are not supposed to.

Note this is a warning alert, so it seems to be doing its job: warning that there is a problem. The alert is not critical, so you/the customer should not be paged on it. But as a warning this alert makes perfect sense.
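For context, the alert under discussion comes from the upstream kubernetes-mixin project. A sketch of roughly what the rule looked like around this release, assuming the `for: 10m` clause mentioned above and paraphrasing the alert message quoted in the thread (exact annotation wording may differ by version):

```yaml
- alert: KubeDaemonSetMisScheduled
  expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    message: >-
      {{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }}
      are running where they are not supposed to run.
```

Since `kube_daemonset_status_number_misscheduled` is an instantaneous gauge, the `for: 10m` clause is what keeps brief scheduling churn from firing the alert; a condition that persists past ten minutes still fires.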
> Note this is a warning alert so seems to be doing its job, warning there is a problem, the alert is not critical so you/customer should not be paged on it. But as a warning this alert makes perfect sense.
I would like to know what a customer is expected to do when seeing such a warning?
> I would like to know what a customer is expected to do when seeing such a warning?
Open ticket to investigate what is wrong with the nodes. Are there any other alerts firing on the cluster, or only that one?
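As a concrete starting point for such an investigation, one way to see which DaemonSets are reporting misscheduled pods is to filter a kube-state-metrics scrape for the metric behind this alert. A minimal sketch; the sample metrics below are invented for illustration, and on a real cluster you would read the kube-state-metrics `/metrics` endpoint or run the equivalent PromQL query `kube_daemonset_status_number_misscheduled > 0` in the console:

```shell
# Invented sample of a kube-state-metrics scrape; real data comes from the
# /metrics endpoint of kube-state-metrics.
cat <<'EOF' > /tmp/ksm_sample.txt
kube_daemonset_status_number_misscheduled{namespace="openshift-monitoring",daemonset="node-exporter"} 1
kube_daemonset_status_number_misscheduled{namespace="openshift-sdn",daemonset="sdn"} 0
EOF

# Keep only series with a non-zero misscheduled count; these name the
# namespace/daemonset pairs to look at first.
awk '/^kube_daemonset_status_number_misscheduled/ && $NF > 0' /tmp/ksm_sample.txt
```

From there, describing the flagged DaemonSet and checking the labels and readiness of the node it runs on usually shows whether the node flapped or lost a selector label.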
> Open ticket to investigate what is wrong with the nodes.

Ack.

> Are there any other alerts firing on the cluster, or only that one?

No other alerts fired in the two-hour range before/after it.

I guess this can be closed; I think this answers all your questions?

Reopening after discussion on Slack. Would this be backported to 4.3?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409