Bug 1762888 - OpenShift fails upgrade when a pod has a PodDisruptionBudget minAvailable set to 1 and disruptionsAllowed set to 0
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Importance: high / medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks: 1770303
 
Reported: 2019-10-17 18:23 UTC by Ryan Howe
Modified: 2019-12-06 06:25 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1770303 (view as bug list)
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-kube-controller-manager-operator pull 301 None closed Bug 1762888: alert when number of expected pods is equal or lower than desired healthy pods in a PDB 2019-12-02 17:25:08 UTC
Red Hat Bugzilla 1752111 None None None 2019-12-02 17:25:05 UTC

Description Ryan Howe 2019-10-17 18:23:58 UTC
Description of problem:

OpenShift upgrade fails when a pod has a PodDisruptionBudget minAvailable set to 1 and disruptionsAllowed set to 0.

The MCO fails to upgrade the OS version, but nothing shows as degraded.

Version-Release number of selected component (if applicable):
4.x

How reproducible:
100%

Steps to Reproduce:
1. Create a pod
 # oc new-app --template httpd-example 

2. Create poddisruptionbudget
 # oc create poddisruptionbudget --min-available=1 test --selector="name=httpd-example"

3. Go to upgrade. 
# oc adm upgrade --to-latest
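Step 2's `oc create poddisruptionbudget` command is equivalent to applying a manifest like the following (a sketch; the `policy/v1beta1` API group is assumed here, matching the 4.x releases discussed in this bug):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: test
spec:
  minAvailable: 1
  selector:
    matchLabels:
      name: httpd-example
```

With only one httpd-example pod running, `status.disruptionsAllowed` is 0, so evicting that pod during a node drain would violate the budget.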

Actual results:
- The upgrade never completes; the cluster loops between degraded states of seemingly random operators, mostly because of scheduling failures.
- Nothing tells you why the upgrade failed until you look at the logs of the machine-config-daemon:

# oc logs machine-config-daemon-v44gn 
  update.go:89] error when evicting pod "httpd-example-4-8fkpr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Then, looking at `oc get node`, we see all worker nodes on a different version than the masters.
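The eviction loop above can be diagnosed by finding PDBs whose status reports zero allowed disruptions. A minimal sketch of that check, operating on status fields shaped like the output of `oc get pdb -A -o json` (the sample data below is hypothetical, not taken from this cluster):

```python
# Flag PDBs that cannot tolerate any eviction (status.disruptionsAllowed == 0).
# In a real cluster these dicts would come from `oc get pdb -A -o json`.
sample_pdbs = [
    {"metadata": {"namespace": "test", "name": "test"},
     "status": {"disruptionsAllowed": 0, "currentHealthy": 1, "desiredHealthy": 1}},
    {"metadata": {"namespace": "openshift-monitoring", "name": "prometheus"},
     "status": {"disruptionsAllowed": 1, "currentHealthy": 2, "desiredHealthy": 1}},
]

def blocking_pdbs(items):
    """Return namespace/name of every PDB that would block a node drain."""
    return [f'{p["metadata"]["namespace"]}/{p["metadata"]["name"]}'
            for p in items
            if p["status"]["disruptionsAllowed"] == 0]

print(blocking_pdbs(sample_pdbs))  # only the PDB with no headroom is reported
```

Any PDB in this list explains a stuck drain: the machine-config-daemon retries the eviction forever rather than violating the budget.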

Expected results:

For the upgrade to succeed, or to fail on just the one node and tell us why it failed, explaining that it's due to the PodDisruptionBudget.

Additional info:

# oc get nodes
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-130-250.us-west-2.compute.internal   Ready                      master   115d   v1.13.4+a80aad556
ip-10-0-134-211.us-west-2.compute.internal   Ready                      worker   115d   v1.13.4+12ee15d4a
ip-10-0-150-54.us-west-2.compute.internal    Ready                      worker   115d   v1.13.4+12ee15d4a
ip-10-0-155-30.us-west-2.compute.internal    Ready                      master   115d   v1.13.4+a80aad556
ip-10-0-171-64.us-west-2.compute.internal    Ready,SchedulingDisabled   worker   51d    v1.13.4+12ee15d4a
ip-10-0-173-199.us-west-2.compute.internal   Ready                      master   115d   v1.13.4+a80aad556

Comment 1 Eric Rich 2019-10-17 19:22:36 UTC
Is https://bugzilla.redhat.com/show_bug.cgi?id=1747472 a side-effect of this issue?

Comment 4 Eric Paris 2019-10-17 21:29:44 UTC
PDBs are working exactly as expected. There is a related, though NOT duplicate, bug: https://bugzilla.redhat.com/show_bug.cgi?id=1752111

For this bug I expect the workload team to generate an Info level alert any time there is a PDB in the system which has no ability to be disrupted for $some period of time (say 5m?).

We need to proactively alert customers that they have something configured which is likely to cause them problems. This bug is tracking that Info level proactive alert.

1752111 is tracking a reactive MCO alert which will WARN a customer when such a PDB situation has broken the MCO's ability to do its job.

If you would like to coordinate your efforts on these two alerts, that is fine; however, the workloads team owns the info-level alert if the situation ever happens. The MCO team owns the warn-level alert if it affects the MCO.

If you are unclear what is required here, please don't hesitate to ask me or Clayton.

Comment 6 Clayton Coleman 2019-11-10 20:47:09 UTC
Verified this against a test system by creating a PDB at limit and verifying alert fired, then switching the PDB to require more pods than were possible and verifying it failed.

However, I noticed that the namespace of the failed PDB is not listed which complicates finding the offending PDB. I think the reported alert needs to have that namespace label set in some form.

Comment 7 Maciej Szulik 2019-11-12 15:54:26 UTC
Opened https://github.com/openshift/cluster-kube-controller-manager-operator/pull/309 with the namespace

Comment 10 Xingxing Xia 2019-11-27 10:01:09 UTC
Ge Liu, it seems the upgrade failure is expected; you need to check that the alert is displayed in the Prometheus page's "Alerts" tab.

Comment 11 Maciej Szulik 2019-12-02 17:37:25 UTC
(In reply to Xingxing Xia from comment #10)
> Ge Liu, seems upgrade failure is expected; you need to check the alert
> should be displayed in Prometheus page's "Alerts" tab.

That is correct. Updates will fail because of this problem; that's why we set up the alert in the first place.

Comment 13 zhou ying 2019-12-06 05:18:20 UTC
Confirmed with payload: 4.3.0-0.nightly-2019-12-05-073829 upgrade to payload: 4.3.0-0.nightly-2019-12-05-213858:

After upgrade:
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-05-213858   True        False         68m     Error while reconciling 4.3.0-0.nightly-2019-12-05-213858: the cluster operator ingress is degraded

[root@dhcp-140-138 ~]# oc get node
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-135-45.us-east-2.compute.internal    Ready                      master   139m   v1.16.2
ip-10-0-135-7.us-east-2.compute.internal     Ready,SchedulingDisabled   worker   129m   v1.16.2
ip-10-0-147-20.us-east-2.compute.internal    Ready                      worker   129m   v1.16.2
ip-10-0-159-247.us-east-2.compute.internal   Ready                      master   139m   v1.16.2
ip-10-0-160-104.us-east-2.compute.internal   Ready                      master   139m   v1.16.2

Check alert in Prometheus:
alert: PodDisruptionBudgetAtLimit
expr: kube_poddisruptionbudget_status_expected_pods
  == on(namespace, poddisruptionbudget, service) kube_poddisruptionbudget_status_desired_healthy
for: 15m
labels:
  severity: warning
annotations:
  message: The pod disruption budget is preventing further disruption to pods because
    it is at the minimum allowed level.
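The PromQL expression above fires when, for a given PDB, the number of pods it expects equals the number it requires healthy, i.e. evicting even one pod would violate the budget. A sketch of that condition on plain integers (the example values are hypothetical):

```python
# Sketch of the PodDisruptionBudgetAtLimit condition: the alert fires when
# kube_poddisruptionbudget_status_expected_pods equals
# kube_poddisruptionbudget_status_desired_healthy for the same PDB.
def at_limit(expected_pods: int, desired_healthy: int) -> bool:
    """True when the PDB has no disruption headroom left."""
    return expected_pods == desired_healthy

# httpd-example from this bug: 1 replica with minAvailable=1 is at the limit,
# so the alert fires after the 15m "for" window.
print(at_limit(1, 1))
# A 3-replica workload with minAvailable=1 still allows two disruptions.
print(at_limit(3, 1))
```

The `on(namespace, poddisruptionbudget, service)` matcher in the rule joins the two metrics per PDB, and (per comments 6 and 7) carrying the `namespace` label is what lets you locate the offending PDB.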

