Created attachment 1671140 [details]
prome.query

Description of problem:
https://coreos.slack.com/archives/CV1UZU53R/p1584521800135300

3 instances of this alert were fired this morning.

AlertManagerAPP 4:56 AM
[FIRING:3] KubeDaemonSetMisScheduled kube-state-metrics (https-main 10.131.62.127:8443 kube-state-metrics-6bcc97c9d6-mmrhm openshift-monitoring/k8s kube-state-metrics warning)

Version-Release number of selected component (if applicable):

oc --context build01 get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-04-222846   True        False         12d     Cluster version is 4.3.0-0.nightly-2020-03-04-222846

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Checked with the Prometheus UI (see the screenshot).

oc --context build01 get ds -n openshift-image-registry
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-ca   11        11        11      11           11          kubernetes.io/os=linux   48d

oc --context build01 get ds -n openshift-machine-config-operator
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
machine-config-daemon   11        11        11      11           11          kubernetes.io/os=linux   48d

oc --context build01 get ds -n openshift-dns
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   11        11        11      11           11          kubernetes.io/os=linux   48d

Are those DaemonSets supposed to run a pod on every node? How should a cluster admin act on this alert?

Expected results:
A document explaining how to deal with this alert would be great.

Additional info:
------------
alert: KubeDaemonSetMisScheduled
expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"} > 0
for: 10m
labels:
  severity: warning
annotations:
  message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.'
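For what it's worth, the misscheduled count the alert is based on can also be read straight from the DaemonSet status; a quick sketch (the field name is numberMisscheduled from the apps/v1 DaemonSetStatus API):

# Print namespace/name and the misscheduled count for every DaemonSet
oc --context build01 get ds -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" misscheduled="}{.status.numberMisscheduled}{"\n"}{end}'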
Were the DaemonSet pods actually running where they were supposed to be running, or not? Are you asking just for docs, or are you saying there is a bug in the alert?
If I understand the alert's message correctly, it is fired because "the pods of the DaemonSet are running where they are not supposed to run". However, all 3 DaemonSets have the node selector kubernetes.io/os=linux, and I think that label is present on every node of the cluster. So why is it fired in the first place? And yes, documentation about how to handle this alert would be nice too.
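For reference, the label coverage can be checked directly; if the two counts below match, the selector really does cover every node (a sketch against the same build01 context):

# Nodes carrying the DaemonSets' node selector
oc --context build01 get nodes -l kubernetes.io/os=linux --no-headers | wc -l

# Total node count, for comparison with the DESIRED column (11)
oc --context build01 get nodes --no-headers | wc -l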
Was a new node started on the cluster? Relevant issue: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/347
See the Prometheus query about the alerts and cluster_autoscaler_nodes_count: https://imgur.com/a/SePLZQG

We have the autoscaler set up on the cluster. It seems we had a node flapping between Ready and NotReady. I am not sure whether it was caused by new nodes or by the same node changing status.
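For reference, the correlation can be plotted in the Prometheus UI with queries along these lines (the first expression is taken from the alert definition above; cluster_autoscaler_nodes_count is the autoscaler metric shown in the screenshot):

# Misscheduled DaemonSet pods over time
kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)"}

# Node count as seen by the autoscaler, to spot scale-up/down or flapping
cluster_autoscaler_nodes_count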
Regarding https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/347#issuecomment-585075577: we already have a `10m` for clause in the alert definition, yet in this case the query still returned 1 for over 10m. Is this about the sensitivity of the alert?

This is the confusing part to me: the 3 DaemonSets should run on all nodes with the label kubernetes.io/os=linux. Why is the node status related to firing such an alert? I can imagine it making sense only if the label is removed while the DaemonSet pod is still running there.
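As a side note, one way to make the rule less sensitive to brief node flapping (in the spirit of the mixin issue above) would be to require the condition to hold across a whole window rather than at each evaluation; a sketch only, not the shipped rule:

# Fire only if the misscheduled count stayed above 0 for the entire 15m window
min_over_time(kube_daemonset_status_number_misscheduled{job="kube-state-metrics"}[15m]) > 0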
> The 3 DaemonSets should run on all nodes with the label kubernetes.io/os=linux.
> Why is the node status related to firing such an alert?

It seems like there are problems with the node or the autoscaler. The metric and the alert look at misscheduled DaemonSet pods on that node; the metric is documented as:

> The number of nodes running a daemon pod but are not supposed to.

Note this is a warning alert, so it seems to be doing its job: warning that there is a problem. The alert is not critical, so you/the customer should not be paged on it. But as a warning, this alert makes perfect sense.
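For anyone triaging this, a rough way to pinpoint the offending pods is to compare where the DaemonSet's pods actually run against the node list and status; a sketch using one of the DaemonSets from this report:

# Show each DaemonSet pod and the node it landed on
oc --context build01 -n openshift-dns get pods -o wide

# A node flapping between Ready and NotReady can transiently count as misscheduled
oc --context build01 get nodes

# Confirm whether the DaemonSet still reports misscheduled pods
oc --context build01 -n openshift-dns get ds dns-default -o jsonpath='{.status.numberMisscheduled}'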
> Note this is a warning alert, so it seems to be doing its job: warning that there is a problem. The alert is not critical, so you/the customer should not be paged on it. But as a warning, this alert makes perfect sense.

I would like to know what a customer is expected to do when seeing such a warning.
> I would like to know what a customer is expected to do when seeing such a warning.

Open a ticket to investigate what is wrong with the nodes. Are there no other alerts firing on the cluster, only that one?
> Open a ticket to investigate what is wrong with the nodes.

Ack.

> Are there no other alerts firing on the cluster, only that one?

No other alerts fired within a 2-hour range before/after it.
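For the record, the built-in ALERTS series is a quick way to verify that from the Prometheus UI (evaluated over the same 2-hour window):

ALERTS{alertstate="firing"}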
I guess this can be closed; I think this answers all your questions?
Reopening after discussion on slack.
Would this be backported to 4.3?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409