Description of problem:
An updatepayload Job is created continuously, every 120s, because the desiredUpdate image cannot be pulled. The CVO should have a clean-up mechanism that retains only a limited history of finished (failed or successful) Jobs for debugging and tracking, instead of keeping all of them. In this scenario, because the image can never be downloaded, the number of Jobs grows without bound, which puts pressure on the API server and leaves extra clean-up work for users.

[root@preserve-installer 091]# oc get po
NAME                                                  READY   STATUS             RESTARTS   AGE
cluster-version-operator-849bc5d7b4-sg29m             1/1     Running            2          20m
version-4.0.0-0.alpha-2019-01-07-215051-vf8jn-h4cjx   0/1     ImagePullBackOff   0          25s

[root@preserve-installer 091]# oc get job
NAME                                            DESIRED   SUCCESSFUL   AGE
version-4.0.0-0.alpha-2019-01-07-215051-5cqqp   1         0            5m
version-4.0.0-0.alpha-2019-01-07-215051-6kbxk   1         0            9m
version-4.0.0-0.alpha-2019-01-07-215051-7tvp9   1         0            7m
version-4.0.0-0.alpha-2019-01-07-215051-m8q6k   1         0            3m
version-4.0.0-0.alpha-2019-01-07-215051-ttxvn   1         0            11m
version-4.0.0-0.alpha-2019-01-07-215051-vf8jn   1         0            23s

[root@preserve-installer 091]# oc get job | wc -l
73

[root@preserve-installer 091]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-9   True        False         18m     Error while reconciling 4.0.0-0.alpha-2019-01-07-215051: could not download the update

[root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].spec.desiredUpdate"
{
  "payload": "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051",
  "version": "4.0.0-0.alpha-2019-01-07-215051"
}

[root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].status.availableUpdates"
[
  {
    "payload": "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-130851",
    "version": "4.0.0-0.alpha-2019-01-07-130851"
  },
  {
    "payload": "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051",
    "version": "4.0.0-0.alpha-2019-01-07-215051"
  }
]

[root@preserve-installer 091]# oc describe po version-4.0.0-0.alpha-2019-01-07-215051-ttxvn-t76lz
Name:               version-4.0.0-0.alpha-2019-01-07-215051-ttxvn-t76lz
Namespace:          openshift-cluster-version
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-0-6-158.ec2.internal/10.0.6.158
Start Time:         Thu, 10 Jan 2019 08:06:59 +0000
Labels:             controller-uid=b467d885-14ae-11e9-ab27-0aec6e4eca62
                    job-name=version-4.0.0-0.alpha-2019-01-07-215051-ttxvn
Annotations:        <none>
Status:             Pending
IP:                 10.128.0.42
Controlled By:      Job/version-4.0.0-0.alpha-2019-01-07-215051-ttxvn
Containers:
  payload:
    Container ID:
    Image:       registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051
    Image ID:
    Port:        <none>
    Host Port:   <none>
    Command:
      /bin/sh
    Args:
      -c
      mkdir -p /etc/cvo/updatepayloads/4.0.0-0.alpha-2019-01-07-215051 && mv /manifests /etc/cvo/updatepayloads/4.0.0-0.alpha-2019-01-07-215051/manifests && mkdir -p /etc/cvo/updatepayloads/4.0.0-0.alpha-2019-01-07-215051 && mv /release-manifests /etc/cvo/updatepayloads/4.0.0-0.alpha-2019-01-07-215051/release-manifests
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/cvo/updatepayloads from payloads (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nbxwj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  payloads:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cvo/updatepayloads
    HostPathType:
  default-token-nbxwj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nbxwj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master
Events:
  Type     Reason   Age               From                                 Message
  ----     ------   ----              ----                                 -------
  Normal   Pulling  27s (x3 over 1m)  kubelet, ip-10-0-6-158.ec2.internal  pulling image "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051"
  Warning  Failed   26s (x3 over 1m)  kubelet, ip-10-0-6-158.ec2.internal  Failed to pull image "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051": rpc error: code = Unknown desc = Error reading manifest 4.0.0-0.alpha-2019-01-07-215051 in registry.svc.ci.openshift.org/openshift/origin-release: manifest unknown: manifest unknown
  Warning  Failed   26s (x3 over 1m)  kubelet, ip-10-0-6-158.ec2.internal  Error: ErrImagePull
  Normal   BackOff  0s (x4 over 1m)   kubelet, ip-10-0-6-158.ec2.internal  Back-off pulling image "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051"
  Warning  Failed   0s (x4 over 1m)   kubelet, ip-10-0-6-158.ec2.internal  Error: ImagePullBackOff

Version-Release number of the following components:
[root@preserve-installer 091]# oc rsh cluster-version-operator-849bc5d7b4-sg29m
sh-4.2# cluster-version-operator version
ClusterVersionOperator 1.10.3

[root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].status.current"
{
  "payload": "quay.io/openshift-release-dev/ocp-release@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea",
  "version": "4.0.0-9"
}

How reproducible:
Always

Steps to Reproduce:
1. Install a new cluster with the nextgen installer.
2. Set an upstream in the ClusterVersion resource.
3. Set "--enable-auto-update=true" in the CVO deployment to enable auto-update.

Actual results:
Update-payload Jobs are created continuously, with no automatic clean-up.

Expected results:
The CVO should have a clean-up mechanism that limits the number of retained finished (failed or successful) update-payload Jobs.
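For illustration only: one generic way a controller can bound finished-Job lifetime is Kubernetes' TTL-after-finished mechanism, though at the time this bug was filed the TTLAfterFinished feature gate was still alpha, so the CVO could not have relied on it. The Go sketch below is hypothetical and is not the CVO's actual code; newPayloadJob, the image reference, and the TTL value are all made up:

// Sketch: a Job whose object (and its pods) are garbage-collected by the
// TTL-after-finished controller one hour after the Job completes or fails.
// Hypothetical example code, not taken from the cluster-version-operator.
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newPayloadJob builds an update-payload Job with a one-hour TTL.
func newPayloadJob(name, image string) *batchv1.Job {
	ttl := int32(3600) // seconds after completion or failure before deletion
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: "openshift-cluster-version",
		},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:  "payload",
						Image: image,
					}},
				},
			},
		},
	}
}

func main() {
	job := newPayloadJob("version-example", "registry.example/release:tag")
	fmt.Println("would create job:", job.Name)
}

Because the TTL feature was alpha in early 2019, operator-side reaping (as discussed in the comments below) was the more practical route at the time.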
> [root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].spec.desiredUpdate"
> {
>   "payload": "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051",
>   "version": "4.0.0-0.alpha-2019-01-07-215051"
> }
> ...
> [root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].status.current"
> {
>   "payload": "quay.io/openshift-release-dev/ocp-release@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea",
>   "version": "4.0.0-9"
> }

This may be because ci.openshift.org images are garbage-collected (after 48 hours?). I'm not against the CVO having a cap on retained jobs, but I don't expect this to be a problem for folks using update payloads that have been released to quay.io.
I've noticed the jobs lying around as well. It should be easy enough to reap these once they've completed, but there shouldn't be much of a downside to leaving them around.
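A reaper along the lines this comment suggests could look something like the following sketch, assuming recent client-go signatures. The namespace, the retention count, and every function name here are assumptions for illustration; the actual fix is in the PR linked below.

// Hypothetical reaper sketch: delete finished Jobs in the CVO namespace,
// keeping only the newest few for debugging. Not the CVO's actual code.
package main

import (
	"context"
	"fmt"
	"sort"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const (
	namespace = "openshift-cluster-version"
	keep      = 2 // finished Jobs to retain as debug history (assumed value)
)

// isFinished reports whether the Job has reached a terminal condition.
func isFinished(j batchv1.Job) bool {
	for _, c := range j.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	list, err := client.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Collect finished Jobs and sort them newest-first, so everything
	// past index `keep` is stale history.
	var done []batchv1.Job
	for _, j := range list.Items {
		if isFinished(j) {
			done = append(done, j)
		}
	}
	sort.Slice(done, func(i, j int) bool {
		return done[i].CreationTimestamp.After(done[j].CreationTimestamp.Time)
	})

	n := keep
	if len(done) < n {
		n = len(done)
	}
	policy := metav1.DeletePropagationBackground // also removes the Jobs' pods
	for _, j := range done[n:] {
		if err := client.BatchV1().Jobs(namespace).Delete(ctx, j.Name,
			metav1.DeleteOptions{PropagationPolicy: &policy}); err != nil {
			fmt.Printf("failed to delete job %s: %v\n", j.Name, err)
		}
	}
}

The propagation policy matters: deleting a Job with orphan semantics would leave its pods behind, so the ImagePullBackOff pods would keep accumulating even as the Jobs disappeared.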
https://github.com/openshift/cluster-version-operator/pull/186
https://github.com/openshift/cluster-version-operator/pull/186 merged
Still hitting this on the latest 4.1 nightly build. Since the target release for this bug is set to "4.2.0" and no 4.2 nightly build is available yet to verify against, changing the status to MODIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922