Bug 1762888

Summary: OpenShift upgrade fails when a pod has a PodDisruptionBudget with minAvailable set to 1 and disruptionsAllowed set to 0
Product: OpenShift Container Platform Reporter: Ryan Howe <rhowe>
Component: kube-scheduler Assignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA QA Contact: zhou ying <yinzhou>
Severity: medium Docs Contact:
Priority: high    
Version: 4.1.z CC: aos-bugs, ccoleman, eparis, erich, kconner, mfojtik, scuppett, xtian, xxia
Target Milestone: ---   
Target Release: 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1770303 (view as bug list) Environment:
Last Closed: 2020-01-23 11:07:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1770303    

Description Ryan Howe 2019-10-17 18:23:58 UTC
Description of problem:

OpenShift upgrade fails when a pod has a PodDisruptionBudget with minAvailable set to 1 and disruptionsAllowed set to 0.

The MCO fails to upgrade the OS version, but nothing shows as degraded.
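
In this state the worker MachineConfigPool typically sits mid-rollout without ever reporting Degraded. A sketch of what that can look like (illustrative output only; the rendered config names and column values depend on the cluster):

 # oc get machineconfigpool
 NAME     CONFIG                   UPDATED   UPDATING   DEGRADED
 master   rendered-master-<hash>   True      False      False
 worker   rendered-worker-<hash>   False     True       False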

Version-Release number of selected component (if applicable):
4.x

How reproducible:
100%

Steps to Reproduce:
1. Create a pod
 # oc new-app --template httpd-example 

2. Create poddisruptionbudget
 # oc create poddisruptionbudget --min-available=1 test --selector="name=httpd-example"

3. Go to upgrade. 
# oc adm upgrade --to-latest
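
After step 2, one way to confirm the blocking condition before starting the upgrade is to inspect the PDB status; with a single replica and --min-available=1, disruptionsAllowed drops to 0 (the excerpt below is an illustrative sketch, exact values depend on the deployment):

 # oc get poddisruptionbudget test -o yaml
 ...
 status:
   currentHealthy: 1
   desiredHealthy: 1
   disruptionsAllowed: 0
   expectedPods: 1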

Actual results:
- The upgrade appears to proceed but loops through degraded states on random operators, mostly because of scheduling failures.
- Nothing tells you why the upgrade failed until you look at the logs of the machine-config-daemon:

# oc logs machine-config-daemon-v44gn 
  update.go:89] error when evicting pod "httpd-example-4-8fkpr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Then, looking at `oc get node`, we see all worker nodes on a different version than the masters.
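
One way to locate the offending object without digging through the machine-config-daemon logs is to list every PDB that currently allows zero disruptions (a sketch; assumes jq is available on the host running oc):

 # oc get poddisruptionbudget --all-namespaces -o json \
     | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'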

Expected results:

The upgrade should either succeed, or fail on just the one node and report why it failed, explaining that it is due to the PodDisruptionBudget.

Additional info:

# oc get nodes
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-130-250.us-west-2.compute.internal   Ready                      master   115d   v1.13.4+a80aad556
ip-10-0-134-211.us-west-2.compute.internal   Ready                      worker   115d   v1.13.4+12ee15d4a
ip-10-0-150-54.us-west-2.compute.internal    Ready                      worker   115d   v1.13.4+12ee15d4a
ip-10-0-155-30.us-west-2.compute.internal    Ready                      master   115d   v1.13.4+a80aad556
ip-10-0-171-64.us-west-2.compute.internal    Ready,SchedulingDisabled   worker   51d    v1.13.4+12ee15d4a
ip-10-0-173-199.us-west-2.compute.internal   Ready                      master   115d   v1.13.4+a80aad556

Comment 1 Eric Rich 2019-10-17 19:22:36 UTC
Is https://bugzilla.redhat.com/show_bug.cgi?id=1747472 a side-effect of this issue?

Comment 4 Eric Paris 2019-10-17 21:29:44 UTC
PDBs are working exactly as expected. There is a related though NOT a dup bug https://bugzilla.redhat.com/show_bug.cgi?id=1752111

For this bug I expect the workload team to generate an Info level alert any time there is a PDB in the system which has no ability to be disrupted for $some period of time (say 5m?).

We need to pro-actively alert customers that they have something configured which is likely to cause them problems. This bug is tracking that Info level proactive alert.

1752111 is tracking a reactive MCO alert which will WARN a customer when such a PDB situation has broken the MCO's ability to do its job.

If you would like to coordinate your efforts on these 2 alerts that is fine; however, the workloads team owns an info level alert if the situation ever happens. The MCO team owns a warn level alert if it affects the MCO.
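
For the info-level rule, something along these lines would fire whenever a PDB has had zero allowed disruptions for the chosen window (a sketch only, built on the kube-state-metrics kube_poddisruptionbudget_status_pod_disruptions_allowed series; the rule name is hypothetical, and the expression that actually shipped is quoted in comment 13):

alert: PodDisruptionBudgetAtZeroDisruptions   # hypothetical name, sketch only
expr: kube_poddisruptionbudget_status_pod_disruptions_allowed == 0
for: 5m
labels:
  severity: info
annotations:
  message: PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} currently allows no disruptions.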

If you are unclear what is required here, please don't hesitate to ask me or Clayton.

Comment 6 Clayton Coleman 2019-11-10 20:47:09 UTC
Verified this against a test system by creating a PDB at the limit and verifying the alert fired, then switching the PDB to require more pods than were possible and verifying it failed.

However, I noticed that the namespace of the failed PDB is not listed, which complicates finding the offending PDB. I think the reported alert needs to have that namespace label set in some form.

Comment 7 Maciej Szulik 2019-11-12 15:54:26 UTC
Opened https://github.com/openshift/cluster-kube-controller-manager-operator/pull/309 to add the namespace.

Comment 10 Xingxing Xia 2019-11-27 10:01:09 UTC
Ge Liu, it seems the upgrade failure is expected; you need to check that the alert is displayed in the Prometheus page's "Alerts" tab.

Comment 11 Maciej Szulik 2019-12-02 17:37:25 UTC
(In reply to Xingxing Xia from comment #10)
> Ge Liu, it seems the upgrade failure is expected; you need to check that
> the alert is displayed in the Prometheus page's "Alerts" tab.

That is correct. Because of this problem updates will fail, which is why we set up the alert in the first place.

Comment 13 zhou ying 2019-12-06 05:18:20 UTC
Confirmed by upgrading from payload 4.3.0-0.nightly-2019-12-05-073829 to payload 4.3.0-0.nightly-2019-12-05-213858:

After upgrade:
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-05-213858   True        False         68m     Error while reconciling 4.3.0-0.nightly-2019-12-05-213858: the cluster operator ingress is degraded

[root@dhcp-140-138 ~]# oc get node
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-135-45.us-east-2.compute.internal    Ready                      master   139m   v1.16.2
ip-10-0-135-7.us-east-2.compute.internal     Ready,SchedulingDisabled   worker   129m   v1.16.2
ip-10-0-147-20.us-east-2.compute.internal    Ready                      worker   129m   v1.16.2
ip-10-0-159-247.us-east-2.compute.internal   Ready                      master   139m   v1.16.2
ip-10-0-160-104.us-east-2.compute.internal   Ready                      master   139m   v1.16.2

Check alert in Prometheus:
alert: PodDisruptionBudgetAtLimit
expr: kube_poddisruptionbudget_status_expected_pods
  == on(namespace, poddisruptionbudget, service) kube_poddisruptionbudget_status_desired_healthy
for: 15m
labels:
  severity: warning
annotations:
  message: The pod disruption budget is preventing further disruption to pods because
    it is at the minimum allowed level.
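
To confirm from the query side that this rule is actually firing rather than only loaded, Prometheus' built-in ALERTS series can be queried (a sketch; run it from the same Prometheus UI that hosts the "Alerts" tab):

 ALERTS{alertname="PodDisruptionBudgetAtLimit", alertstate="firing"}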

Comment 16 errata-xmlrpc 2020-01-23 11:07:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062