Bug 1787424 - Upgrade can not start due to version pod fail to request expected ephemeral-storage on master node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 1787334
Reported: 2020-01-02 19:18 UTC by W. Trevor King
Modified: 2020-05-04 11:22 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1787422
Environment:
Last Closed: 2020-05-04 11:22:00 UTC
Target Upstream Version:




Links:
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:22:46 UTC)

Description W. Trevor King 2020-01-02 19:18:15 UTC
This issue does not apply to 4.3 -> 4.4 updates, because we only support 4.y -> 4.(y+1) updates, so 4.4 CVOs will never run on nodes with 4.2 or earlier kubelets.

+++ This bug was initially created as a clone of Bug #1787422 +++

Although 4.3 kubelet capacity reporting works, we still need to drop the 4.3 request, to support flows like:

1. 4.2 cluster running with 4.2 CVO and 4.2 kubelets (so no capacity reporting).
2. Admin requests an update to 4.3.1.
3. 4.2 CVO launches a version pod without requests, because of the 4.2 reversion (#288). This works fine.
4. Update gets far enough to run a 4.3 CVO.
5. Update hangs on some 4.3.1 bug, while it's still running 4.2 kubelets.
6. Admin requests an update to 4.3.2.
7. 4.3 CVO launches a version pod with an ephemeral-storage request, which hangs because the 4.2 kubelets are still running and not reporting ephemeral-storage capacity.
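
The hang in step 7 comes down to resource fitting: a pod that requests a resource can only land on a node that advertises enough of it, and a node that does not report ephemeral-storage at all is assumed here to advertise zero. The following is a minimal Python sketch of that reasoning, not the actual scheduler or CVO code. The `fits` helper and the node and pod dictionaries are hypothetical, and quantities are simplified to plain integers (millicores and bytes).

```python
def fits(requests, reported):
    """Return True if every requested amount fits in what the node reports.

    A resource that the node does not report is treated as zero
    (an assumption made for this sketch).
    """
    return all(amount <= reported.get(name, 0) for name, amount in requests.items())


# Hypothetical node reports: a 4.2 kubelet omits ephemeral-storage, a 4.3 kubelet includes it.
node_42 = {"cpu": 4000, "memory": 8163844 * 1024}
node_43 = {**node_42, "ephemeral-storage": 30905324 * 1024}

# Hypothetical version pods: one without and one with the ephemeral-storage request.
pod_without_request = {"cpu": 10, "memory": 50 * 1024**2}
pod_with_request = {**pod_without_request, "ephemeral-storage": 2 * 1024**2}

print(fits(pod_without_request, node_42))  # step 3: schedulable
print(fits(pod_with_request, node_42))     # step 7: not schedulable, so the pod stays pending
print(fits(pod_with_request, node_43))     # schedulable once the kubelet reports capacity
```

Under these assumptions, only the combination in step 7 (a request for ephemeral-storage against a node that does not report it) fails to fit.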

Comment 1 W. Trevor King 2020-01-02 19:20:11 UTC
Current 4.3 nightlies can update to 4.4 nightlies without hitting this, so jumping straight to MODIFIED.

Comment 2 W. Trevor King 2020-01-02 19:33:16 UTC
No accepted 4.4 nightlies since the 20th [1], so I've launched a 4.3.0-0.nightly-2020-01-02-141332 -> 4.4.0-0.ci-2020-01-02-161748 job [2] to confirm the "does not apply to 4.3 -> 4.4" assertion.

[1]: https://openshift-release.svc.ci.openshift.org/#4.4.0-0.nightly
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3/3

Comment 4 W. Trevor King 2020-01-02 22:21:40 UTC
The 4.3.0-0.nightly-2020-01-02-141332 -> 4.4.0-0.ci-2020-01-02-161748 update job passed [1].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3/3

Comment 5 W. Trevor King 2020-01-02 23:45:13 UTC
Attempt to keep this breakage class from sneaking through CI again: https://github.com/openshift/release/pull/6542

Comment 6 liujia 2020-01-10 08:26:18 UTC
Ran an upgrade from 4.4.0-0.nightly-2020-01-08-233510 to 4.4.0-0.nightly-2020-01-09-013524; it succeeded.

Checked that the version pod requests the ephemeral-storage resource and that the node it was scheduled on reports ephemeral-storage capacity.
# ./oc get pod version--8hprx-2nbbg -ojson|jq .spec.containers[].resources
{
  "requests": {
    "cpu": "10m",
    "ephemeral-storage": "2Mi",
    "memory": "50Mi"
  }
}
# ./oc get node control-plane-0 -ojson| jq .status.capacity
{
  "cpu": "4",
  "ephemeral-storage": "30905324Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163844Ki",
  "pods": "250"
}
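
The comparison behind this check can be sketched in Python. This is only an illustration, assuming the two JSON documents above as input. `parse_quantity` and `unsatisfied` are hypothetical helpers that handle only the quantity forms seen here (plain integers, the milli suffix `m`, and binary suffixes such as `Ki` and `Mi`). Note that it compares against `.status.capacity` as in the commands above, whereas real scheduling decisions are based on the node's allocatable resources.

```python
from fractions import Fraction

# Binary suffixes used by the quantities in the output above (a subset of the full syntax).
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_quantity(text):
    """Parse a small subset of Kubernetes quantity strings into an exact number."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return Fraction(text[: -len(suffix)]) * factor
    if text.endswith("m"):  # milli-units, e.g. "10m" of CPU
        return Fraction(text[:-1]) / 1000
    return Fraction(text)


def unsatisfied(requests, capacity):
    """List the requested resources that exceed (or are missing from) the node's capacity."""
    return sorted(
        name
        for name, amount in requests.items()
        if parse_quantity(amount) > parse_quantity(capacity.get(name, "0"))
    )


# Values copied from the oc/jq output above.
requests = {"cpu": "10m", "ephemeral-storage": "2Mi", "memory": "50Mi"}
capacity = {
    "cpu": "4",
    "ephemeral-storage": "30905324Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "8163844Ki",
    "pods": "250",
}

print(unsatisfied(requests, capacity))

# The same request against a node that does not report ephemeral-storage.
capacity_without_storage = {k: v for k, v in capacity.items() if k != "ephemeral-storage"}
print(unsatisfied(requests, capacity_without_storage))
```

With the values above, nothing is unsatisfied on `control-plane-0`, while dropping the `ephemeral-storage` entry from the capacity leaves that one request unsatisfied.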

Since both 4.3 and 4.4 kubelets report ephemeral-storage capacity, a 4.3 -> 4.4 upgrade will not hit the 4.2 -> 4.3 issue from the description.

> The 4.3.0-0.nightly-2020-01-02-141332 -> 4.4.0-0.ci-2020-01-02-161748 update job passed [1].
The 4.3 to 4.4 upgrade also works, as the CI job above shows.

Verified the fix.

Comment 8 errata-xmlrpc 2020-05-04 11:22:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

