Description of problem:
The PodDisruptionBudgetAtLimit alert may or may not be actionable depending on context. Take a PDB with maxUnavailable: 0, which requires every pod to stay healthy: even when all replicas are running and healthy, the alert still fires, and in that case it is probably not actionable. With maxUnavailable: 1, however, the alert only fires once a replica is actually unavailable, which makes it actionable. Opening this bug to track situations like this and to decide the future of this alert:
- Should we have this alert in the first place?
- If we decide to keep it, how do we handle the scenarios above? Should the alert stay based on PDB status tracking, or should it instead identify when a PDB is misconfigured?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
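For illustration, a minimal sketch of a PDB manifest (name, namespace, and selector are hypothetical) that sets maxUnavailable: 0; a spec like this keeps the PDB permanently at its limit even when every replica is healthy:

```yaml
# Hypothetical example PDB: with maxUnavailable: 0 no voluntary disruption is
# ever allowed, so the PDB sits at its limit even when all replicas are healthy
# and PodDisruptionBudgetAtLimit fires without anything actually being wrong.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb          # hypothetical name
  namespace: example-ns      # hypothetical namespace
spec:
  maxUnavailable: 0          # with 1 instead, the alert would only fire once a
                             # replica is actually down, making it actionable
  selector:
    matchLabels:
      app: example-app       # hypothetical label selector
```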
Some of the scenarios we discussed:
1) etcd is missing a replica because a node is being rebooted during an upgrade - not actionable
2) etcd is missing a replica because a node is not ready for some other reason - actionable (but should the node-not-ready alerts be sufficient? and/or should etcd have its own alerts?)
3) some user workload is misconfigured such that it will always be at/below its limit - semi-actionable (the user should fix it, or the admin should tell the user to fix it)
4) some user workload is deliberately configured to always be at its limit (e.g. builds) - not actionable

And the kinds of actions that might be taken are:
1) fix unready nodes
2) fix the workload configuration
3) don't upgrade the cluster while workloads are at risk because they are already at/below their limits
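As a sketch only (not the shipped PodDisruptionBudgetAtLimit definition), and assuming the standard kube-state-metrics PDB metrics are available, a rule along these lines would fire only when a PDB is both blocking disruptions and short of healthy pods, which would skip the maxUnavailable: 0 "always at limit" case while still catching scenarios 2 and 3 above:

```yaml
# Hypothetical alternative rule (a sketch, not the existing alert): fire only when
# the PDB allows no disruptions AND is below its desired healthy count, i.e. the
# workload is genuinely degraded rather than merely sitting at its limit.
groups:
  - name: example-pdb-rules              # hypothetical group name
    rules:
      - alert: PodDisruptionBudgetDegraded   # hypothetical alert name
        expr: |
          kube_poddisruptionbudget_status_pod_disruptions_allowed == 0
          and
          kube_poddisruptionbudget_status_current_healthy
            < kube_poddisruptionbudget_status_desired_healthy
        for: 15m
        labels:
          severity: warning
        annotations:
          description: PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} is blocking disruptions and is below its desired healthy pod count.
```

This still would not distinguish scenario 1 (a node rebooting during an upgrade) from scenario 2 (a node unhealthy for another reason); that distinction likely needs to come from node-level alerts or upgrade-aware silencing rather than from the PDB metrics alone.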
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
This bug is still important; we should not be alerting on things that do not require administrative action, particularly during upgrades.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Ported to https://issues.redhat.com/browse/WRKLDS-647