Description of problem:
The PodDisruptionBudgetAtLimit alert may or may not be actionable depending on context. Take a PDB with maxUnavailable: 0, which requires every pod to stay healthy: even when all replicas are running and healthy, the alert still fires, and in that case it is probably not actionable. With maxUnavailable: 1, however, the alert only fires once a replica is actually unavailable, which makes it actionable. Opening this bug to track situations like this and to decide the future of this alert:
- Should we have this alert in the first place?
- If we decide to keep it, how do we handle the scenarios above? Should the alert stay based on PDB status tracking, or should it instead identify when a PDB is misconfigured?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
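For illustration, a minimal sketch of a PDB manifest (name, namespace, and selector are hypothetical) that sets maxUnavailable: 0; a spec like this keeps the PDB permanently at its limit even when every replica is healthy:

```yaml
# Hypothetical example PDB: with maxUnavailable: 0 no voluntary disruption is
# ever allowed, so the PDB sits at its limit even when all replicas are healthy
# and PodDisruptionBudgetAtLimit fires without anything actually being wrong.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb          # hypothetical name
  namespace: example-ns      # hypothetical namespace
spec:
  maxUnavailable: 0          # with 1 instead, the alert would only fire once a
                             # replica is actually down, making it actionable
  selector:
    matchLabels:
      app: example-app       # hypothetical label selector
```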
Some of the scenarios we discussed:
1) etcd is missing a replica because a node is being rebooted during an upgrade - not actionable
2) etcd is missing a replica because a node is not ready for some other reason - actionable (but should the node-not-ready alerts be sufficient? and/or should etcd have its own alerts?)
3) some user workload is misconfigured such that it will always be at/below its limit - semi-actionable (the user should fix it, or the admin should tell the user to fix it)
4) some user workload is deliberately configured to always be at its limit (e.g. builds) - not actionable

And the kinds of actions that might be taken are:
1) fix unready nodes
2) fix the workload configuration
3) don't upgrade the cluster while workloads are at risk because they are already at/below their limits
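As a sketch only (not the shipped PodDisruptionBudgetAtLimit definition), and assuming the standard kube-state-metrics PDB metrics are available, a rule along these lines would fire only when a PDB is both blocking disruptions and short of healthy pods, which would skip the maxUnavailable: 0 "always at limit" case while still catching scenarios 2 and 3 above:

```yaml
# Hypothetical alternative rule (a sketch, not the existing alert): fire only when
# the PDB allows no disruptions AND is below its desired healthy count, i.e. the
# workload is genuinely degraded rather than merely sitting at its limit.
groups:
  - name: example-pdb-rules              # hypothetical group name
    rules:
      - alert: PodDisruptionBudgetDegraded   # hypothetical alert name
        expr: |
          kube_poddisruptionbudget_status_pod_disruptions_allowed == 0
          and
          kube_poddisruptionbudget_status_current_healthy
            < kube_poddisruptionbudget_status_desired_healthy
        for: 15m
        labels:
          severity: warning
        annotations:
          description: PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} is blocking disruptions and is below its desired healthy pod count.
```

This still would not distinguish scenario 1 (a node rebooting during an upgrade) from scenario 2 (a node unhealthy for another reason); that distinction likely needs to come from node-level alerts or upgrade-aware silencing rather than from the PDB metrics alone.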
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
This bug is still important; we should not be alerting on things that do not require administrative action, particularly during upgrades.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Ported to https://issues.redhat.com/browse/WRKLDS-647