1939728 – Alerts Should Exist for Excessive Steady-State Resource Consumption on Master Nodes


Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Harshal Patil
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-03-16 22:04 UTC by Steve Kuznetsov
Modified:	2021-03-29 14:35 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-29 14:35:24 UTC
Target Upstream Version:
Embargoed:


Description Steve Kuznetsov 2021-03-16 22:04:07 UTC

The control plane needs to have alerts on system usage, full stop. There has been no shortage of outages, fire drills, and system degradations caused by master VM instances running low on resources or by system components (kubelet, cri-o) being starved of resources. Regardless of the cause of these issues, alerts for high steady-state usage must exist, or administrators will never know that something needs to be done.

See the post-mortem here for more details:
https://docs.google.com/document/d/1VfwmECbpCnDTOb0JVE37wcEQm4KnGwbatgIynTa6Wvg/edit#
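
For illustration only, here is a minimal Python sketch of what a "high steady-state usage" alert could mean: it fires only after utilization has stayed above a threshold for a sustained hold period, so a short spike does not trigger it. The names `SteadyStateRule` and `firing_intervals`, the 0.9 threshold and the 120-second hold are all hypothetical and are not taken from this bug, from OpenShift, or from any monitoring API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SteadyStateRule:
    """Hypothetical rule: fire when usage stays above threshold long enough."""
    threshold: float     # utilization as a fraction, e.g. 0.9 (assumed value)
    hold_seconds: float  # how long usage must stay above the threshold


def firing_intervals(
    samples: Iterable[Tuple[float, float]], rule: SteadyStateRule
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps during which the alert would fire.

    samples are (timestamp_seconds, utilization) pairs in ascending time order.
    """
    intervals: List[Tuple[float, float]] = []
    breach_start = None  # when usage first went above the threshold
    fire_start = None    # when the breach had lasted hold_seconds
    last_ts = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts
            if fire_start is None and ts - breach_start >= rule.hold_seconds:
                fire_start = ts
        else:
            # Usage dropped back: close any firing interval and reset.
            if fire_start is not None:
                intervals.append((fire_start, ts))
            breach_start = None
            fire_start = None
        last_ts = ts
    if fire_start is not None:
        # Still firing at the end of the data.
        intervals.append((fire_start, last_ts))
    return intervals


if __name__ == "__main__":
    rule = SteadyStateRule(threshold=0.9, hold_seconds=120)
    samples = [(0, 0.5), (60, 0.95), (120, 0.5), (180, 0.95),
               (240, 0.96), (300, 0.97), (360, 0.98), (420, 0.5)]
    print(firing_intervals(samples, rule))
```

In this sample data, the one-sample spike at t=60 never fires, while the sustained breach that begins at t=180 starts firing once it has lasted the full hold period.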

Comment 1 Standa Laznicka 2021-03-19 11:59:40 UTC

I think that monitoring the state of a node, and alerting based on it, should be handled by the node team.

Comment 4 Martin Sivák 2021-03-22 15:02:47 UTC

I think we need to rename the Memory manager component, as it deals with hugepages. This goes to the kubelet subcomponent, I believe.
