Bug 1817774 - Alert if any node has: kubernetes.io/hostname: localhost
Summary: Alert if any node has: kubernetes.io/hostname: localhost
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-03-26 23:45 UTC by W. Trevor King
Modified: 2020-05-18 14:58 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-18 14:58:47 UTC
Target Upstream Version:
Embargoed:


Description W. Trevor King 2020-03-26 23:45:48 UTC
A node labeled kubernetes.io/hostname: localhost can break common anti-affinity patterns, as described in bug 1817769. We should alert on this condition so the cluster admin can easily discover the problem and fix it (and also so that we hear about this issue in Telemetry/Insights), without having to do a bunch of debugging and wondering about scheduler bugs.
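
For illustration only, the condition described above could be checked by hand roughly as follows. This is a minimal sketch, not the alert this bug asks for: it assumes the input is a node list in the shape that `oc get nodes -o json` produces (an `items` array whose entries carry `metadata.name` and `metadata.labels`). The function name and the sample node names are hypothetical.

```python
import json

HOSTNAME_LABEL = "kubernetes.io/hostname"


def nodes_with_localhost_hostname(node_list):
    """Return the names of nodes whose hostname label is 'localhost'.

    node_list is assumed to be a dict parsed from a node list such as
    the output of `oc get nodes -o json`.
    """
    flagged = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        # "labels" may be absent or null on a node, so fall back to {}.
        labels = metadata.get("labels") or {}
        if labels.get(HOSTNAME_LABEL) == "localhost":
            flagged.append(metadata.get("name", "<unnamed>"))
    return flagged


# Hypothetical sample data: one healthy node and one affected node.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "worker-0",
                  "labels": {"kubernetes.io/hostname": "worker-0"}}},
    {"metadata": {"name": "worker-1",
                  "labels": {"kubernetes.io/hostname": "localhost"}}}
  ]
}
""")

if __name__ == "__main__":
    for name in nodes_with_localhost_hostname(SAMPLE):
        print(f"node {name} has {HOSTNAME_LABEL}: localhost")
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with the node list piped in. An in-cluster alert would still be preferable, since it needs no one to think of running the check.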

Comment 1 Ryan Phillips 2020-03-31 20:07:12 UTC
I'm not a huge fan of creating alarms for bugs. Typically these metrics are wasteful.

Comment 2 W. Trevor King 2020-04-01 04:16:08 UTC
Alerting on this costs CPU.  Having devs/admins hunt for this costs salary.  People are more expensive than computers.  I'm open to alternatives to alerts for raising the visibility of this troubling condition, but lots of smart people looked at the must-gather for this cluster before Miciah noticed the localhost issue.  I'd like to have the machines chip in in a way that cuts that time down for admins on the next cluster that hits this.  Do you have alternative ideas?

Comment 3 Ryan Phillips 2020-05-18 14:58:47 UTC
Created a JIRA to track the feature request: https://issues.redhat.com/browse/OCPNODE-344
