Description of problem:
Running an upgrade from v4.2 to v4.3 fails.

# oc adm upgrade
info: An upgrade is in progress. Working towards registry.svc.ci.openshift.org/ocp/release@sha256:6ece1c63d87fb90a66b28c038920651464230f45712b389040445437d5aab82c: downloading update

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: currently installed version 4.2.0-0.nightly-2019-12-22-150714 not found in the "stable-4.2" channel

====================================================================
Checked that the version pod cannot run due to OutOfephemeral-storage.

# oc project openshift-cluster-version
# oc get pod
NAME                                         READY   STATUS                   RESTARTS   AGE
pod/cluster-version-operator-7447dc7fd-2thsb 1/1     Running                  2          20m
pod/version--vqztg-g68lx                     0/1     OutOfephemeral-storage   0          9s
pod/version--vqztg-ml7pg                     0/1     OutOfephemeral-storage   0          9s

# oc describe pod/version--vqztg-ml7pg
Name:               version--vqztg-ml7pg
Namespace:          openshift-cluster-version
Priority:           0
Node:               control-plane-0/
Start Time:         Tue, 24 Dec 2019 10:19:53 +0000
Labels:             controller-uid=ecc2f2e3-2636-11ea-b92f-0050568b75cf
                    job-name=version--vqztg
Annotations:        <none>
Status:             Failed
Reason:             OutOfephemeral-storage
Message:            Pod Node didn't have enough resource: ephemeral-storage, requested: 2097152, used: 0, capacity: 0
IP:
IPs:                <none>
Controlled By:      Job/version--vqztg
Containers:
  payload:
    Image:      registry.svc.ci.openshift.org/ocp/release@sha256:6ece1c63d87fb90a66b28c038920651464230f45712b389040445437d5aab82c
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      mkdir -p /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g && mv /manifests /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g/manifests && mkdir -p /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g && mv /release-manifests /etc/cvo/updatepayloads/JMHZxYZNYuhAlnQmwW9a8g/release-manifests
    Requests:
      cpu:                10m
      ephemeral-storage:  2Mi
      memory:             50Mi
    Environment:          <none>
    Mounts:
      /etc/cvo/updatepayloads from payloads (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-246zg (ro)
Volumes:
  payloads:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cvo/updatepayloads
    HostPathType:
  default-token-246zg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-246zg
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age        From                      Message
  ----     ------                  ----       ----                      -------
  Warning  OutOfephemeral-storage  <invalid>  kubelet, control-plane-0  Node didn't have enough resource: ephemeral-storage, requested: 2097152, used: 0, capacity: 0

Checked that the master node does not report ephemeral-storage capacity.

# oc get node control-plane-0 -o json | jq .status.capacity
{
  "cpu": "4",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163844Ki",
  "pods": "250"
}

Version-Release number of the following components:
4.2.0-0.nightly-2019-12-22-150714

How reproducible:
always

Steps to Reproduce:
1. Run upgrade from 4.2.0-0.nightly-2019-12-22-150714 to 4.3.0-0.nightly-2019-12-24-053745
2.
3.

Actual results:
Upgrade hangs on creating the version pod.

Expected results:
Upgrade succeeds.

Additional info:
Should be related to https://github.com/openshift/cluster-version-operator/pull/286
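For anyone reproducing this, both conditions can be confirmed directly. A minimal diagnostic sketch, assuming the job-name label shown in the describe output above (the label value changes per update attempt) and the node name of the affected master:

# oc get pod -n openshift-cluster-version -l job-name=version--vqztg -o json | jq '.items[].spec.containers[].resources.requests'
# oc get node control-plane-0 -o json | jq '.status.capacity["ephemeral-storage"] // "no ephemeral-storage capacity reported"'

The first command shows the 2Mi ephemeral-storage request on the payload container; the second shows that the node reports no ephemeral-storage capacity at all, which is why the kubelet rejects the pod with OutOfephemeral-storage.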
This will block 4.2 upgrades, so adding testblocker.
To be clear, any upgrade from a 4.2 build with PR 286 merged will hit this issue, whether upgrading from v4.2 to v4.3 or from v4.2 to the latest v4.2.z.
PR 286 was merged into 4.2.0-0.nightly-2019-12-20-184812. Per comment#2, since 4.2.12 does not contain PR 286, the issue does not occur when upgrading from 4.2.12 to the latest 4.2.z. The test result confirms this: the upgrade from 4.2.12 to 4.2.0-0.nightly-2019-12-23-132554 works.

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      54m
cloud-credential                           4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
cluster-autoscaler                         4.2.0-0.nightly-2019-12-23-132554   True        False         False      67m
console                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      82s
dns                                        4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
image-registry                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      17m
ingress                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      61m
insights                                   4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
kube-apiserver                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      72m
kube-controller-manager                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      72m
kube-scheduler                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      71m
machine-api                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
machine-config                             4.2.0-0.nightly-2019-12-23-132554   True        False         False      54s
marketplace                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      2m8s
monitoring                                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      3m58s
network                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
node-tuning                                4.2.0-0.nightly-2019-12-23-132554   True        False         False      4m46s
openshift-apiserver                        4.2.0-0.nightly-2019-12-23-132554   True        False         False      81s
openshift-controller-manager               4.2.0-0.nightly-2019-12-23-132554   True        False         False      71m
openshift-samples                          4.2.0-0.nightly-2019-12-23-132554   True        False         False      23m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-12-23-132554   True        False         False      73m
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-12-23-132554   True        False         False      73s
service-ca                                 4.2.0-0.nightly-2019-12-23-132554   True        False         False      74m
service-catalog-apiserver                  4.2.0-0.nightly-2019-12-23-132554   True        False         False      70m
service-catalog-controller-manager         4.2.0-0.nightly-2019-12-23-132554   True        False         False      70m
storage                                    4.2.0-0.nightly-2019-12-23-132554   True        False         False      24m
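For completeness, a sketch of how an upgrade to a specific nightly build is typically started (not part of the test notes above; <target-release-pullspec> is a placeholder for the target release image, and --force is needed because nightlies are not in the channel graph; exact flags may differ between oc versions):

# oc adm upgrade --to-image=<target-release-pullspec> --force

Progress can then be watched with `oc adm upgrade` or `oc get clusterversion`.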
*** Bug 1786374 has been marked as a duplicate of this bug. ***
I've filed [1] with a narrow ephemeral-storage revert for 4.2.z. We're still trying to figure out if we need to do anything about master/4.3. [1]: https://github.com/openshift/cluster-version-operator/pull/288
[1] is a 4.2.0-0.nightly-2019-12-22-150714 -> 4.2.0-0.nightly-2019-12-23-132554 update failing in this failure mode. [1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/55
I spun the lack of capacity reporting out into bug 1787427.
Version: 4.2.0-0.nightly-2020-01-03-055246

Verified the upgrade from 4.2 to 4.3, since no newer 4.2 build was available as a target version. Ran the upgrade from 4.2.0-0.nightly-2020-01-03-055246 to 4.3.0-0.nightly-2020-01-03-005054 and checked that the version pod no longer requests the ephemeral-storage resource, even though the scheduled node still does not report ephemeral-storage capacity.

# oc get pod/version--k28tz-cfw45 -o json -n openshift-cluster-version | jq .spec.containers[].resources
{
  "requests": {
    "cpu": "10m",
    "memory": "50Mi"
  }
}

# oc get node control-plane-0 -o json | jq .status.capacity
{
  "cpu": "4",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8163860Ki",
  "pods": "250"
}
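As a supplementary check (a sketch, not part of the original verification steps), completion of the upgrade can also be confirmed from the ClusterVersion history, whose newest entry is first:

# oc get clusterversion version -o json | jq '.status.history[0] | {version, state}'

The state should read Completed for the 4.3.0-0.nightly-2020-01-03-005054 entry once the upgrade finishes.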
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0066