+++ This bug was initially created as a clone of Bug #1786315 +++

Description of problem:

Run upgrade from v4.2 to v4.3 failed.

# oc adm upgrade
info: An upgrade is in progress. Working towards registry.svc.ci.openshift.org/ocp/release@sha256:6ece1c63d87fb90a66b28c038920651464230f45712b389040445437d5aab82c: downloading update

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: currently installed version 4.2.0-0.nightly-2019-12-22-150714 not found in the "stable-4.2" channel

====================================================================
Checked that the version pod cannot run due to OutOfephemeral-storage.

# oc project openshift-cluster-version
# oc get pod
NAME                                           READY   STATUS                   RESTARTS   AGE
pod/cluster-version-operator-7447dc7fd-2thsb   1/1     Running                  2          20m
pod/version--vqztg-g68lx                       0/1     OutOfephemeral-storage   0          9s
pod/version--vqztg-ml7pg                       0/1     OutOfephemeral-storage   0          9s

# oc describe pod/version--vqztg-ml7pg
Name:           version--vqztg-ml7pg
Namespace:      openshift-cluster-version
Priority:       0
Node:           control-plane-0/
Start Time:     Tue, 24 Dec 2019 10:19:53 +0000
Labels:         controller-uid=ecc2f2e3-2636-11ea-b92f-0050568b75cf
                job-name=version--vqztg
Annotations:    <none>
Status:         Failed
Reason:         OutOfephemeral-storage
Message:        Pod Node didn't have enough resource: ephemeral-storage, requested: 2097152, used: 0, capacity: 0
IP:
IPs:            <none>
Controlled By:  Job/version--vqztg
Containers:
  payload:
    Image:      registry.svc.ci.openshift.org/ocp/release@sha256:6ece1c63d87fb90a66b28c038920651464230f45712b389040445437d5aab82c
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      mkdir -p /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g && mv /manifests /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g/manifests && mkdir -p /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g && mv /release-manifests /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g/release-manifests
    Requests:
      cpu:                10m
      ephemeral-storage:  2Mi
      memory:             50Mi
    Environment:          <none>
    Mounts:
      /etc/cvo/updatepayloads from payloads (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-246zg (ro)
Volumes:
  payloads:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cvo/updatepayloads
    HostPathType:
  default-token-246zg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-246zg
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age        From                      Message
  ----     ------                  ----       ----                      -------
  Warning  OutOfephemeral-storage  <invalid>  kubelet, control-plane-0  Node didn't have enough resource: ephemeral-storage, requested: 2097152, used: 0, capacity: 0

Checked that the master node does not report ephemeral-storage capacity.

# oc get node control-plane-0 -o json | jq .status.capacity
{
  "cpu": "4",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163844Ki",
  "pods": "250"
}

Version-Release number of the following components:
4.2.0-0.nightly-2019-12-22-150714

How reproducible:
always

Steps to Reproduce:
1. Run upgrade from 4.2.0-0.nightly-2019-12-22-150714 to 4.3.0-0.nightly-2019-12-24-053745
2.
3.

Actual results:
Upgrade hangs on creating the version pod.

Expected results:
Upgrade succeeds.

Additional info:
Likely related to https://github.com/openshift/cluster-version-operator/pull/286
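For context, that PR added resource requests (including ephemeral-storage) to the version pod's payload container. A minimal sketch of that kind of container spec using the upstream Kubernetes Go types, with the request values taken from the pod description above (illustrative only, not the actual CVO code):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// payloadContainer sketches the kind of resource stanza the payload
// container carries after pr286. The 2Mi ephemeral-storage request is
// tiny, but the kubelet still rejects the pod with
// OutOfephemeral-storage when the node reports no ephemeral-storage
// capacity at all (capacity: 0, as in the event above).
func payloadContainer() corev1.Container {
	return corev1.Container{
		Name:  "payload",
		Image: "registry.svc.ci.openshift.org/ocp/release@sha256:6ece1c63d87fb90a66b28c038920651464230f45712b389040445437d5aab82c",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:              resource.MustParse("10m"),
				corev1.ResourceMemory:           resource.MustParse("50Mi"),
				corev1.ResourceEphemeralStorage: resource.MustParse("2Mi"),
			},
		},
	}
}

func main() {
	fmt.Println(payloadContainer().Resources.Requests)
}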
--- Additional comment from liujia on 2019-12-24 11:07:09 UTC ---

This will block 4.2 upgrades, so adding testblocker.

--- Additional comment from liujia on 2019-12-30 08:49:03 UTC ---

To be clear, any upgrade from a 4.2 build with pr286 merged will hit the issue, for both v4.2 to v4.3 and v4.2 to the latest v4.2.

--- Additional comment from Wei Sun on 2020-01-02 07:38:52 UTC ---

pr286 was merged into 4.2.0-0.nightly-2019-12-20-184812. Per comment#2, the issue does not happen when upgrading from 4.2.12 to the latest 4.2.z.

Per the test result, upgrading from 4.2.12 to 4.2.0-0.nightly-2019-12-23-132554 works:

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      54m
cloud-credential                           4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
cluster-autoscaler                         4.2.0-0.nightly-2019-12-23-132554   True        False         False      67m
console                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      82s
dns                                        4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
image-registry                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      17m
ingress                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      61m
insights                                   4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
kube-apiserver                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      72m
kube-controller-manager                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      72m
kube-scheduler                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      71m
machine-api                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
machine-config                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      54s
marketplace                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      2m8s
monitoring                                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      3m58s
network                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
node-tuning                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      4m46s
openshift-apiserver                        4.2.0-0.nightly-2019-12-23-132554   True        False         False      81s
openshift-controller-manager               4.2.0-0.nightly-2019-12-23-132554   True        False         False      71m
openshift-samples                          4.2.0-0.nightly-2019-12-23-132554   True        False         False      23m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-12-23-132554   True        False         False      73s
service-ca                                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
service-catalog-apiserver                  4.2.0-0.nightly-2019-12-23-132554   True        False         False      70m
service-catalog-controller-manager         4.2.0-0.nightly-2019-12-23-132554   True        False         False      70m
storage                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      24m
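Since exposure depends on whether each control-plane kubelet reports ephemeral-storage capacity, a pre-upgrade check could be sketched with client-go as below. This is a hypothetical helper for illustration, not part of any shipped tooling:

package main

import (
	"context"
	"fmt"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the default kubeconfig location.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		// A node whose status.capacity has no ephemeral-storage entry
		// will reject any pod that requests that resource.
		if _, ok := node.Status.Capacity[corev1.ResourceEphemeralStorage]; !ok {
			fmt.Printf("node %s reports no ephemeral-storage capacity\n", node.Name)
		}
	}
}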
Per https://bugzilla.redhat.com/show_bug.cgi?id=1786315#c2, cloning this bug for 4.3.0.
*** Bug 1787422 has been marked as a duplicate of this bug. ***
The original issue from bz1786315 will not happen during an upgrade from v4.3. But the fix/workaround in bz1786315 may cause an inconsistency between v4.2 and v4.3, which will cause further issues when doing a continuous upgrade along the 4.2 -> 4.3 -> 4.3-latest path. So this bug is for the enhancement and consistency in v4.3. QE will run a regression test against the PR and check that there is no ephemeral-storage request from the v4.3 CVO.
You could also validate it against the interrupted-update flow from [1]; you'd just need to trigger the second 4.3->4.3 update (step 6) before the first 4.2->4.3 update (step 2) gets far enough to bump the control-plane kubelets. But I'm fine with more basic regression testing too ;).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1787422#c0
Upgrading from 4.3.0-0.nightly-2020-01-02-214950 to 4.3.0-0.nightly-2020-01-03-005054 succeeded. Checked that the version pod did not request the ephemeral-storage resource, even though the scheduled node had the capacity.

# oc get pod/version--hwwg5-wgpbq -o json -n openshift-cluster-version | jq .spec.containers[].resources
{
  "requests": {
    "cpu": "10m",
    "memory": "50Mi"
  }
}

# oc get node control-plane-1 -o json | jq .status.capacity
{
  "cpu": "4",
  "ephemeral-storage": "30905324Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163844Ki",
  "pods": "250"
}
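The same check done above with jq can be expressed as a small Go assertion for automated regression runs. A sketch, assuming the version pod has already been fetched (hasEphemeralStorageRequest is a hypothetical helper, not CVO code):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// hasEphemeralStorageRequest reports whether any container in the pod
// spec requests ephemeral-storage; the fixed 4.3 version pod should not.
func hasEphemeralStorageRequest(pod *corev1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if _, ok := c.Resources.Requests[corev1.ResourceEphemeralStorage]; ok {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder pod; in practice, fetch the version pod via client-go
	// from the openshift-cluster-version namespace.
	pod := &corev1.Pod{}
	fmt.Println("requests ephemeral-storage:", hasEphemeralStorageRequest(pod))
}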
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062