Bug 1787424

Summary: Upgrade cannot start: version pod's expected ephemeral-storage request cannot be satisfied on the master node
Product: OpenShift Container Platform
Component: Cluster Version Operator
Reporter: W. Trevor King <wking>
Assignee: W. Trevor King <wking>
QA Contact: liujia <jiajliu>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.4
Target Release: 4.4.0
Target Milestone: ---
Keywords: Regression, TestBlocker
CC: aos-bugs, bparees, jiajliu, jokerman, lmohanty, wking, wsun, zpeng
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clone Of: 1787422
Bug Blocks: 1787334
Last Closed: 2020-05-04 11:22:00 UTC

Description W. Trevor King 2020-01-02 19:18:15 UTC
This issue does not apply to 4.3 updates, because we only support 4.y -> 4.(y+1) updates, so 4.4 CVOs will never run on nodes with 4.2-and-earlier kubelets.

+++ This bug was initially created as a clone of Bug #1787422 +++

Although 4.3 kubelet capacity reporting works, we still need to drop the 4.3 request, to support flows like:

1. 4.2 cluster running with 4.2 CVO and 4.2 kubelets (so no capacity reporting).
2. Admin requests an update to 4.3.1.
3. 4.2 CVO launches a version pod without requests, because of the 4.2 reversion (#288). This works fine.
4. Update gets far enough to run a 4.3 CVO.
5. Update hangs on some 4.3.1 bug, while it's still running 4.2 kubelets.
6. Admin requests an update to 4.3.2.
7. 4.3 CVO launches a version pod with an ephemeral-storage request, which hangs because the 4.2 kubelets are still running and not reporting ephemeral-storage capacity.
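The hang in step 7 comes from the scheduler's resource-fit check: a pod requesting a resource can only land on a node whose status reports capacity for that resource. A minimal sketch of that check (hypothetical helper, not actual kube-scheduler code; node and pod values taken from this report):

```python
# Sketch of the scheduler's resource-fit predicate: a pod requesting a
# resource cannot schedule onto a node that does not report capacity for
# that resource. Hypothetical helper, not actual kube-scheduler code
# (the real predicate also compares requested quantities to allocatable).

def fits(pod_requests: dict, node_capacity: dict) -> bool:
    """Return True if every requested resource is reported by the node."""
    return all(resource in node_capacity for resource in pod_requests)

# A 4.2 kubelet does not report ephemeral-storage capacity...
node_42 = {"cpu": "4", "memory": "8163844Ki"}
# ...while a 4.3+ kubelet does.
node_43 = {"cpu": "4", "memory": "8163844Ki", "ephemeral-storage": "30905324Ki"}

# The 4.3 CVO's version pod requested ephemeral-storage, so on a cluster
# still running 4.2 kubelets it would sit Pending indefinitely.
version_pod = {"cpu": "10m", "memory": "50Mi", "ephemeral-storage": "2Mi"}

print(fits(version_pod, node_42))  # False: version pod hangs on 4.2 kubelets
print(fits(version_pod, node_43))  # True
```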

Comment 1 W. Trevor King 2020-01-02 19:20:11 UTC
Current 4.3 nightlies can update to 4.4 nightlies without hitting this, so jumping straight to MODIFIED.

Comment 2 W. Trevor King 2020-01-02 19:33:16 UTC
No accepted 4.4 nightlies since the 20th [1], so I've launched a 4.3.0-0.nightly-2020-01-02-141332 -> 4.4.0-0.ci-2020-01-02-161748 job [2] to confirm the "does not apply to 4.3 -> 4.4" assertion.

[1]: https://openshift-release.svc.ci.openshift.org/#4.4.0-0.nightly
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3/3

Comment 4 W. Trevor King 2020-01-02 22:21:40 UTC
The 4.3.0-0.nightly-2020-01-02-141332 -> 4.4.0-0.ci-2020-01-02-161748 update job passed [1].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3/3

Comment 5 W. Trevor King 2020-01-02 23:45:13 UTC
Attempt to keep this breakage class from sneaking through CI again: https://github.com/openshift/release/pull/6542

Comment 6 liujia 2020-01-10 08:26:18 UTC
Ran an upgrade from 4.4.0-0.nightly-2020-01-08-233510 to 4.4.0-0.nightly-2020-01-09-013524 successfully.

Checked that the version pod requests the ephemeral-storage resource and that the scheduled node reports the capacity:
# ./oc get pod version--8hprx-2nbbg -ojson|jq .spec.containers[].resources
{
  "requests": {
    "cpu": "10m",
    "ephemeral-storage": "2Mi",
    "memory": "50Mi"
  }
}
# ./oc get node control-plane-0 -ojson| jq .status.capacity
{
  "cpu": "4",
  "ephemeral-storage": "30905324Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163844Ki",
  "pods": "250"
}
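As a sanity check on the quantities above: Kubernetes binary suffixes are powers of two (Ki = 2^10, Mi = 2^20), so the pod's 2Mi request fits comfortably within the node's 30905324Ki capacity. A rough parser for those suffixes (illustrative only, not the full Kubernetes quantity grammar):

```python
# Rough parser for the binary-suffix Kubernetes quantities seen above.
# Illustrative only; the real Kubernetes quantity grammar also covers
# decimal suffixes (m, k, M, G) and scientific notation.

SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def to_bytes(quantity: str) -> int:
    """Convert a binary-suffixed quantity string to a byte count."""
    for suffix, factor in SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # plain integer, e.g. "0"

request = to_bytes("2Mi")          # version pod's ephemeral-storage request
capacity = to_bytes("30905324Ki")  # control-plane-0's reported capacity

print(request)             # 2097152
print(capacity > request)  # True: the request comfortably fits
```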

Since both 4.3 and 4.4 kubelets report ephemeral-storage capacity, this upgrade path will not hit the issue described for 4.2 -> 4.3 in the description.

> The 4.3.0-0.nightly-2020-01-02-141332 -> 4.4.0-0.ci-2020-01-02-161748 update job passed [1].
And the 4.3 -> 4.4 upgrade works well per the above CI job.

Verified the fix.

Comment 8 errata-xmlrpc 2020-05-04 11:22:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581