Bug 1762888
| Summary: | OpenShift fails upgrade when a pod has a PodDisruptionBudget minAvailable set to 1 and disruptionsAllowed set to 0 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ryan Howe <rhowe> |
| Component: | kube-scheduler | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.1.z | CC: | aos-bugs, ccoleman, eparis, erich, kconner, mfojtik, scuppett, xtian, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1770303 (view as bug list) | Environment: | |
| Last Closed: | 2020-01-23 11:07:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1770303 | | |
Description
Ryan Howe
2019-10-17 18:23:58 UTC
Is https://bugzilla.redhat.com/show_bug.cgi?id=1747472 a side effect of this issue?

PDBs are working exactly as expected. There is a related, though NOT duplicate, bug: https://bugzilla.redhat.com/show_bug.cgi?id=1752111. For this bug I expect the workloads team to generate an Info-level alert any time there is a PDB in the system which has no ability to be disrupted for $some period of time (say 5m?). We need to proactively alert customers that they have something configured which is likely to cause them problems. This bug is tracking that Info-level proactive alert. Bug 1752111 is tracking a reactive MCO alert which will WARN a customer when such a PDB situation has broken the MCO's ability to do its job. If you would like to coordinate your efforts on these two alerts, that is fine; however, the workloads team owns the Info-level alert if the situation ever happens, and the MCO team owns the Warn-level alert if it affects the MCO. If you are unclear what is required here, please don't hesitate to ask me or Clayton.

Verified this against a test system by creating a PDB at its limit and verifying the alert fired, then switching the PDB to require more pods than were possible and verifying it failed. However, I noticed that the namespace of the failing PDB is not listed, which complicates finding the offending PDB. I think the reported alert needs to have that namespace label set in some form.

Opened https://github.com/openshift/cluster-kube-controller-manager-operator/pull/309 with the namespace.

Ge Liu, the upgrade failure seems to be expected; you need to check that the alert is displayed in the "Alerts" tab of the Prometheus page.

(In reply to Xingxing Xia from comment #10)
> Ge Liu, the upgrade failure seems to be expected; you need to check that
> the alert is displayed in the "Alerts" tab of the Prometheus page.

That is correct. Because of this problem, updates will fail; that is why we set up the alert in the first place.

Confirmed the upgrade from payload 4.3.0-0.nightly-2019-12-05-073829 to payload 4.3.0-0.nightly-2019-12-05-213858:
After upgrade:
[root@dhcp-140-138 ~]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.3.0-0.nightly-2019-12-05-213858 True False 68m Error while reconciling 4.3.0-0.nightly-2019-12-05-213858: the cluster operator ingress is degraded
[root@dhcp-140-138 ~]# oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-135-45.us-east-2.compute.internal Ready master 139m v1.16.2
ip-10-0-135-7.us-east-2.compute.internal Ready,SchedulingDisabled worker 129m v1.16.2
ip-10-0-147-20.us-east-2.compute.internal Ready worker 129m v1.16.2
ip-10-0-159-247.us-east-2.compute.internal Ready master 139m v1.16.2
ip-10-0-160-104.us-east-2.compute.internal Ready master 139m v1.16.2
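For reference, a PDB "at limit" of the kind described in the earlier comments might look like the following minimal sketch. The names and namespace are hypothetical, and it assumes a single-replica workload matching the selector, so minAvailable: 1 leaves disruptionsAllowed at 0.

```yaml
# Hypothetical example only: a single-replica workload protected by a PDB with
# minAvailable: 1. With just one ready pod, disruptionsAllowed stays at 0, so a
# node drain (and therefore a cluster upgrade) cannot evict the pod.
apiVersion: policy/v1beta1        # PDB API version in the Kubernetes 1.16 / OCP 4.3 era
kind: PodDisruptionBudget
metadata:
  name: example-pdb               # hypothetical name
  namespace: example-ns           # hypothetical namespace
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example                # assumes a single-replica Deployment carrying this label
```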
Check alert in Prometheus:
alert: PodDisruptionBudgetAtLimit
expr: kube_poddisruptionbudget_status_expected_pods == on(namespace, poddisruptionbudget, service) kube_poddisruptionbudget_status_desired_healthy
for: 15m
labels:
  severity: warning
annotations:
  message: The pod disruption budget is preventing further disruption to pods because it is at the minimum allowed level.
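As noted in the comments above, this message does not surface the namespace of the offending PDB. Since the expression already joins on the namespace and poddisruptionbudget labels, one way to expose them is standard Prometheus label templating in the annotation; the following is only an illustrative sketch, not necessarily the exact change merged for this bug.

```yaml
# Illustrative sketch: the same rule, with the PDB's namespace and name surfaced
# in the alert message via Prometheus label templating ({{ $labels.<name> }}).
alert: PodDisruptionBudgetAtLimit
expr: kube_poddisruptionbudget_status_expected_pods == on(namespace, poddisruptionbudget, service) kube_poddisruptionbudget_status_desired_healthy
for: 15m
labels:
  severity: warning
annotations:
  message: The pod disruption budget {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} is preventing further disruption to pods because it is at the minimum allowed level.
```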
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062