Bug 1762888 - OpenShift fails upgrade when a pod has a PodDisruptionBudget minAvailable set to 1 and disruptionsAllowed set to 0
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Importance: high / medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks: 1770303
 
Reported: 2019-10-17 18:23 UTC by Ryan Howe
Modified: 2019-12-06 06:25 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1770303 (view as bug list)
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-kube-controller-manager-operator pull 301 None closed Bug 1762888: alert when number of expected pods is equal or lower than desired healthy pods in a PDB 2019-12-02 17:25:08 UTC
Red Hat Bugzilla 1752111 None None None 2019-12-02 17:25:05 UTC

Description Ryan Howe 2019-10-17 18:23:58 UTC
Description of problem:

OpenShift upgrade fails when a pod has a PodDisruptionBudget minAvailable set to 1 and disruptionsAllowed set to 0.

The MCO fails to upgrade the OS version, but nothing shows as degraded.

Version-Release number of selected component (if applicable):
4.x

How reproducible:
100%

Steps to Reproduce:
1. Create a pod
 # oc new-app --template httpd-example 

2. Create poddisruptionbudget
 # oc create poddisruptionbudget --min-available=1 test --selector="name=httpd-example"

3. Go to upgrade. 
# oc adm upgrade --to-latest
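Step 2's `oc create poddisruptionbudget` command is equivalent to applying a manifest like the following (a sketch; the `policy/v1beta1` API group is assumed here, matching the 4.x releases discussed in this bug):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: test
spec:
  minAvailable: 1
  selector:
    matchLabels:
      name: httpd-example
```

With only one httpd-example pod running, `status.disruptionsAllowed` is 0, so evicting that pod during a node drain would violate the budget.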

Actual results:
- The upgrade never completes; the cluster loops between degraded states of seemingly random operators, mostly because of scheduling failures.
- Nothing tells you why the upgrade failed until you look at the logs of the machine-config-daemon:

# oc logs machine-config-daemon-v44gn 
  update.go:89] error when evicting pod "httpd-example-4-8fkpr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Then, looking at `oc get node`, we see all worker nodes on a different version than the masters.
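The eviction loop above can be diagnosed by finding PDBs whose status reports zero allowed disruptions. A minimal sketch of that check, operating on status fields shaped like the output of `oc get pdb -A -o json` (the sample data below is hypothetical, not taken from this cluster):

```python
# Flag PDBs that cannot tolerate any eviction (status.disruptionsAllowed == 0).
# In a real cluster these dicts would come from `oc get pdb -A -o json`.
sample_pdbs = [
    {"metadata": {"namespace": "test", "name": "test"},
     "status": {"disruptionsAllowed": 0, "currentHealthy": 1, "desiredHealthy": 1}},
    {"metadata": {"namespace": "openshift-monitoring", "name": "prometheus"},
     "status": {"disruptionsAllowed": 1, "currentHealthy": 2, "desiredHealthy": 1}},
]

def blocking_pdbs(items):
    """Return namespace/name of every PDB that would block a node drain."""
    return [f'{p["metadata"]["namespace"]}/{p["metadata"]["name"]}'
            for p in items
            if p["status"]["disruptionsAllowed"] == 0]

print(blocking_pdbs(sample_pdbs))  # only the PDB with no headroom is reported
```

Any PDB in this list explains a stuck drain: the machine-config-daemon retries the eviction forever rather than violating the budget.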

Expected results:

For the upgrade to succeed, or to fail on just the one node and tell us why it failed, explaining that it's due to the PodDisruptionBudget.

Additional info:

# oc get nodes
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-130-250.us-west-2.compute.internal   Ready                      master   115d   v1.13.4+a80aad556
ip-10-0-134-211.us-west-2.compute.internal   Ready                      worker   115d   v1.13.4+12ee15d4a
ip-10-0-150-54.us-west-2.compute.internal    Ready                      worker   115d   v1.13.4+12ee15d4a
ip-10-0-155-30.us-west-2.compute.internal    Ready                      master   115d   v1.13.4+a80aad556
ip-10-0-171-64.us-west-2.compute.internal    Ready,SchedulingDisabled   worker   51d    v1.13.4+12ee15d4a
ip-10-0-173-199.us-west-2.compute.internal   Ready                      master   115d   v1.13.4+a80aad556

Comment 1 Eric Rich 2019-10-17 19:22:36 UTC
Is https://bugzilla.redhat.com/show_bug.cgi?id=1747472 a side-effect of this issue?

Comment 4 Eric Paris 2019-10-17 21:29:44 UTC
PDBs are working exactly as expected. There is a related, though NOT duplicate, bug: https://bugzilla.redhat.com/show_bug.cgi?id=1752111

For this bug I expect the workload team to generate an Info level alert any time there is a PDB in the system which has no ability to be disrupted for $some period of time (say 5m?).

We need to proactively alert customers that they have something configured which is likely to cause them problems. This bug is tracking that Info level proactive alert.

1752111 is tracking a reactive MCO alert which will WARN a customer when such a PDB situation has broken the MCO's ability to do its job.

If you would like to coordinate your efforts on these two alerts, that is fine; however, the workloads team owns the info-level alert if the situation ever happens. The MCO team owns the warn-level alert if it affects the MCO.

If you are unclear what is required here, please don't hesitate to ask me or Clayton.

Comment 6 Clayton Coleman 2019-11-10 20:47:09 UTC
Verified this against a test system by creating a PDB at limit and verifying alert fired, then switching the PDB to require more pods than were possible and verifying it failed.

However, I noticed that the namespace of the failed PDB is not listed which complicates finding the offending PDB. I think the reported alert needs to have that namespace label set in some form.

Comment 7 Maciej Szulik 2019-11-12 15:54:26 UTC
Opened https://github.com/openshift/cluster-kube-controller-manager-operator/pull/309 with the namespace

Comment 10 Xingxing Xia 2019-11-27 10:01:09 UTC
Ge Liu, it seems the upgrade failure is expected; you need to check that the alert is displayed in the Prometheus page's "Alerts" tab.

Comment 11 Maciej Szulik 2019-12-02 17:37:25 UTC
(In reply to Xingxing Xia from comment #10)
> Ge Liu, seems upgrade failure is expected; you need to check the alert
> should be displayed in Prometheus page's "Alerts" tab.

That is correct. Updates will fail because of this problem; that's why we set up the alert in the first place.

Comment 13 zhou ying 2019-12-06 05:18:20 UTC
Confirmed with payload: 4.3.0-0.nightly-2019-12-05-073829 upgrade to payload: 4.3.0-0.nightly-2019-12-05-213858:

After upgrade:
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-05-213858   True        False         68m     Error while reconciling 4.3.0-0.nightly-2019-12-05-213858: the cluster operator ingress is degraded

[root@dhcp-140-138 ~]# oc get node
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-135-45.us-east-2.compute.internal    Ready                      master   139m   v1.16.2
ip-10-0-135-7.us-east-2.compute.internal     Ready,SchedulingDisabled   worker   129m   v1.16.2
ip-10-0-147-20.us-east-2.compute.internal    Ready                      worker   129m   v1.16.2
ip-10-0-159-247.us-east-2.compute.internal   Ready                      master   139m   v1.16.2
ip-10-0-160-104.us-east-2.compute.internal   Ready                      master   139m   v1.16.2

Check alert in Prometheus:
alert: PodDisruptionBudgetAtLimit
expr: kube_poddisruptionbudget_status_expected_pods
  == on(namespace, poddisruptionbudget, service) kube_poddisruptionbudget_status_desired_healthy
for: 15m
labels:
  severity: warning
annotations:
  message: The pod disruption budget is preventing further disruption to pods because
    it is at the minimum allowed level.
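The PromQL expression above fires when, for a given PDB, the number of pods it expects equals the number it requires healthy, i.e. evicting even one pod would violate the budget. A sketch of that condition on plain integers (the example values are hypothetical):

```python
# Sketch of the PodDisruptionBudgetAtLimit condition: the alert fires when
# kube_poddisruptionbudget_status_expected_pods equals
# kube_poddisruptionbudget_status_desired_healthy for the same PDB.
def at_limit(expected_pods: int, desired_healthy: int) -> bool:
    """True when the PDB has no disruption headroom left."""
    return expected_pods == desired_healthy

# httpd-example from this bug: 1 replica with minAvailable=1 is at the limit,
# so the alert fires after the 15m "for" window.
print(at_limit(1, 1))
# A 3-replica workload with minAvailable=1 still allows two disruptions.
print(at_limit(3, 1))
```

The `on(namespace, poddisruptionbudget, service)` matcher in the rule joins the two metrics per PDB, and (per comments 6 and 7) carrying the `namespace` label is what lets you locate the offending PDB.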

