Bug 1814723
Summary: How to handle alert KubeDaemonSetMisScheduled

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Version | 4.3.0 |
| Target Release | 4.5.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | low |
| Priority | low |
| Reporter | Hongkai Liu <hongkliu> |
| Assignee | Lili Cosic <lcosic> |
| QA Contact | Junqi Zhao <juzhao> |
| CC | alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania |
| Keywords | Reopened |
| Type | Bug |
| Doc Type | No Doc Update |
| Last Closed | 2020-07-13 17:22:38 UTC |
Description

Hongkai Liu 2020-03-18 15:04:02 UTC

Were the daemonsets actually running where they were supposed to be running or not? Are you asking just for docs, or are you saying there is a bug in the alert?

If I understand the alert's message correctly, it fires because "the pods of the DS are running where they are not supposed to run". However, all 3 DSs have the node selector kubernetes.io/os=linux, and I think that label is present on every node of the cluster. Then why did it fire in the first place? And yes, documentation about how to handle this alert would be nice too.

Was there a new node started on the cluster? Relevant issue: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/347

See the Prometheus query about the alerts and cluster_autoscaler_nodes_count: https://imgur.com/a/SePLZQG. We have the autoscaler set up on the cluster. It seems we had a node flapping between Ready and NotReady. I am not sure whether it was caused by new nodes or by the same node changing status.

About the comment https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/347#issuecomment-585075577: we already have a `for: 10m` clause in the alert definition, but the alert query returned 1 over 10m in this case. Is it about the sensitivity of the alert?

This is the confusing part to me: the 3 DSs should run on all nodes with the label kubernetes.io/os=linux. Why is the node status related to firing such an alert? I can imagine it making sense only if the label were removed and the DS were still running there.

> The 3 DSs should run on all nodes with label kubernetes.io/os=linux.
> Why is the node status related to firing such an alert?

It seems like there are problems with the node or the autoscaler. The metric and alert look at the misscheduled DS pods on that node:

> The number of nodes running a daemon pod but are not supposed to.

Note this is a warning alert, so it seems to be doing its job: warning that there is a problem. The alert is not critical, so you/the customer should not be paged on it. But as a warning this alert makes perfect sense.
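For context, the alert under discussion comes from the upstream kubernetes-mixin project. A sketch of roughly what the rule looked like around this release, assuming the `for: 10m` clause mentioned above and paraphrasing the alert message quoted in the thread (exact annotation wording may differ by version):

```yaml
- alert: KubeDaemonSetMisScheduled
  expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    message: >-
      {{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }}
      are running where they are not supposed to run.
```

Since `kube_daemonset_status_number_misscheduled` is an instantaneous gauge, the `for: 10m` clause is what keeps brief scheduling churn from firing the alert; a condition that persists past ten minutes still fires.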
> Note this is a warning alert so seems to be doing its job, warning there is a problem, the alert is not critical so you/customer should not be paged on it. But as a warning this alert makes perfect sense.
I would like to know what a customer is expected to do when seeing such a warning?
> I would like to know what a customer is expected to do when seeing such a warning?
Open ticket to investigate what is wrong with the nodes. Are there any other alerts firing on the cluster, or only that one?
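As a concrete starting point for such an investigation, one way to see which DaemonSets are reporting misscheduled pods is to filter a kube-state-metrics scrape for the metric behind this alert. A minimal sketch; the sample metrics below are invented for illustration, and on a real cluster you would read the kube-state-metrics `/metrics` endpoint or run the equivalent PromQL query `kube_daemonset_status_number_misscheduled > 0` in the console:

```shell
# Invented sample of a kube-state-metrics scrape; real data comes from the
# /metrics endpoint of kube-state-metrics.
cat <<'EOF' > /tmp/ksm_sample.txt
kube_daemonset_status_number_misscheduled{namespace="openshift-monitoring",daemonset="node-exporter"} 1
kube_daemonset_status_number_misscheduled{namespace="openshift-sdn",daemonset="sdn"} 0
EOF

# Keep only series with a non-zero misscheduled count; these name the
# namespace/daemonset pairs to look at first.
awk '/^kube_daemonset_status_number_misscheduled/ && $NF > 0' /tmp/ksm_sample.txt
```

From there, describing the flagged DaemonSet and checking the labels and readiness of the node it runs on usually shows whether the node flapped or lost a selector label.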
> Open ticket to investigate what is wrong with the nodes.

Ack.

> Are there any other alerts firing on the cluster, or only that one?

No other alerts fired in the two-hour range before/after it.

I guess this can be closed; I think this answers all your questions?

Reopening after discussion on Slack. Would this be backported to 4.3?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409