Description of problem:
An updatepayload Job is created continuously, every 120s, because the desiredUpdate image cannot be pulled. The CVO should have a clean-up mechanism that retains only a limited history of finished (failed or successful) Jobs for debugging and tracking, instead of keeping all of them. In this scenario, because the image can never be downloaded, the number of Jobs grows without bound, which puts pressure on the API server and leaves extra clean-up work for users.

[root@preserve-installer 091]# oc get po
NAME                                                  READY   STATUS             RESTARTS   AGE
cluster-version-operator-849bc5d7b4-sg29m             1/1     Running            2          20m
version-4.0.0-0.alpha-2019-01-07-215051-vf8jn-h4cjx   0/1     ImagePullBackOff   0          25s

[root@preserve-installer 091]# oc get job
NAME                                            DESIRED   SUCCESSFUL   AGE
version-4.0.0-0.alpha-2019-01-07-215051-5cqqp   1         0            5m
version-4.0.0-0.alpha-2019-01-07-215051-6kbxk   1         0            9m
version-4.0.0-0.alpha-2019-01-07-215051-7tvp9   1         0            7m
version-4.0.0-0.alpha-2019-01-07-215051-m8q6k   1         0            3m
version-4.0.0-0.alpha-2019-01-07-215051-ttxvn   1         0            11m
version-4.0.0-0.alpha-2019-01-07-215051-vf8jn   1         0            23s

[root@preserve-installer 091]# oc get job | wc -l
73

[root@preserve-installer 091]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-9   True        False         18m     Error while reconciling 4.0.0-0.alpha-2019-01-07-215051: could not download the update

[root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].spec.desiredUpdate"
{
  "payload": "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051",
  "version": "4.0.0-0.alpha-2019-01-07-215051"
}

[root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].status.availableUpdates"
[
  {
    "payload": "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-130851",
    "version": "4.0.0-0.alpha-2019-01-07-130851"
  },
  {
    "payload": "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051",
    "version": "4.0.0-0.alpha-2019-01-07-215051"
  }
]

[root@preserve-installer 091]# oc describe po version-4.0.0-0.alpha-2019-01-07-215051-ttxvn-t76lz
Name:               version-4.0.0-0.alpha-2019-01-07-215051-ttxvn-t76lz
Namespace:          openshift-cluster-version
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-0-6-158.ec2.internal/10.0.6.158
Start Time:         Thu, 10 Jan 2019 08:06:59 +0000
Labels:             controller-uid=b467d885-14ae-11e9-ab27-0aec6e4eca62
                    job-name=version-4.0.0-0.alpha-2019-01-07-215051-ttxvn
Annotations:        <none>
Status:             Pending
IP:                 10.128.0.42
Controlled By:      Job/version-4.0.0-0.alpha-2019-01-07-215051-ttxvn
Containers:
  payload:
    Container ID:
    Image:       registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051
    Image ID:
    Port:        <none>
    Host Port:   <none>
    Command:
      /bin/sh
    Args:
      -c
      mkdir -p /etc/cvo/updatepayloads/4.0.0-0.alpha-2019-01-07-215051 && mv /manifests /etc/cvo/updatepayloads/4.0.0-0.alpha-2019-01-07-215051/manifests && mkdir -p /etc/cvo/updatepayloads/4.0.0-0.alpha-2019-01-07-215051 && mv /release-manifests /etc/cvo/updatepayloads/4.0.0-0.alpha-2019-01-07-215051/release-manifests
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/cvo/updatepayloads from payloads (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nbxwj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  payloads:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cvo/updatepayloads
    HostPathType:
  default-token-nbxwj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nbxwj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master
Events:
  Type     Reason   Age               From                                 Message
  ----     ------   ----              ----                                 -------
  Normal   Pulling  27s (x3 over 1m)  kubelet, ip-10-0-6-158.ec2.internal  pulling image "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051"
  Warning  Failed   26s (x3 over 1m)  kubelet, ip-10-0-6-158.ec2.internal  Failed to pull image "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051": rpc error: code = Unknown desc = Error reading manifest 4.0.0-0.alpha-2019-01-07-215051 in registry.svc.ci.openshift.org/openshift/origin-release: manifest unknown: manifest unknown
  Warning  Failed   26s (x3 over 1m)  kubelet, ip-10-0-6-158.ec2.internal  Error: ErrImagePull
  Normal   BackOff  0s (x4 over 1m)   kubelet, ip-10-0-6-158.ec2.internal  Back-off pulling image "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051"
  Warning  Failed   0s (x4 over 1m)   kubelet, ip-10-0-6-158.ec2.internal  Error: ImagePullBackOff

Version-Release number of the following components:
[root@preserve-installer 091]# oc rsh cluster-version-operator-849bc5d7b4-sg29m
sh-4.2# cluster-version-operator version
ClusterVersionOperator 1.10.3

[root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].status.current"
{
  "payload": "quay.io/openshift-release-dev/ocp-release@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea",
  "version": "4.0.0-9"
}

How reproducible:
Always

Steps to Reproduce:
1. Install a new cluster with the nextgen installer.
2. Set an upstream in the ClusterVersion resource.
3. Set "--enable-auto-update=true" in the CVO deployment to enable auto-update.

Actual results:
Update-payload Jobs are created continuously, with no automatic clean-up.

Expected results:
The CVO should have a clean-up mechanism that limits the number of retained finished (failed or successful) update-payload Jobs.
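For illustration only: one generic way a controller can bound finished-Job lifetime is Kubernetes' TTL-after-finished mechanism, though at the time this bug was filed the TTLAfterFinished feature gate was still alpha, so the CVO could not have relied on it. The Go sketch below is hypothetical and is not the CVO's actual code; newPayloadJob, the image reference, and the TTL value are all made up:

// Sketch: a Job whose object (and its pods) are garbage-collected by the
// TTL-after-finished controller one hour after the Job completes or fails.
// Hypothetical example code, not taken from the cluster-version-operator.
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newPayloadJob builds an update-payload Job with a one-hour TTL.
func newPayloadJob(name, image string) *batchv1.Job {
	ttl := int32(3600) // seconds after completion or failure before deletion
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: "openshift-cluster-version",
		},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:  "payload",
						Image: image,
					}},
				},
			},
		},
	}
}

func main() {
	job := newPayloadJob("version-example", "registry.example/release:tag")
	fmt.Println("would create job:", job.Name)
}

Because the TTL feature was alpha in early 2019, operator-side reaping (as discussed in the comments below) was the more practical route at the time.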
> [root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].spec.desiredUpdate"
> {
>   "payload": "registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2019-01-07-215051",
>   "version": "4.0.0-0.alpha-2019-01-07-215051"
> }
> ...
> [root@preserve-installer 091]# oc get clusterversion -o json | jq ".items[0].status.current"
> {
>   "payload": "quay.io/openshift-release-dev/ocp-release@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea",
>   "version": "4.0.0-9"
> }

This may be because ci.openshift.org images are garbage-collected (after 48 hours?). I'm not against the CVO having a cap on retained jobs, but I don't expect this to be a problem for folks using update payloads that have been released to quay.io.
I've noticed the jobs lying around as well. It should be easy enough to reap these once they've completed, but there shouldn't be much of a downside to leaving them around.
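A reaper along the lines this comment suggests could look something like the following sketch, assuming recent client-go signatures. The namespace, the retention count, and every function name here are assumptions for illustration; the actual fix is in the PR linked below.

// Hypothetical reaper sketch: delete finished Jobs in the CVO namespace,
// keeping only the newest few for debugging. Not the CVO's actual code.
package main

import (
	"context"
	"fmt"
	"sort"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const (
	namespace = "openshift-cluster-version"
	keep      = 2 // finished Jobs to retain as debug history (assumed value)
)

// isFinished reports whether the Job has reached a terminal condition.
func isFinished(j batchv1.Job) bool {
	for _, c := range j.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	list, err := client.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Collect finished Jobs and sort them newest-first, so everything
	// past index `keep` is stale history.
	var done []batchv1.Job
	for _, j := range list.Items {
		if isFinished(j) {
			done = append(done, j)
		}
	}
	sort.Slice(done, func(i, j int) bool {
		return done[i].CreationTimestamp.After(done[j].CreationTimestamp.Time)
	})

	n := keep
	if len(done) < n {
		n = len(done)
	}
	policy := metav1.DeletePropagationBackground // also removes the Jobs' pods
	for _, j := range done[n:] {
		if err := client.BatchV1().Jobs(namespace).Delete(ctx, j.Name,
			metav1.DeleteOptions{PropagationPolicy: &policy}); err != nil {
			fmt.Printf("failed to delete job %s: %v\n", j.Name, err)
		}
	}
}

The propagation policy matters: deleting a Job with orphan semantics would leave its pods behind, so the ImagePullBackOff pods would keep accumulating even as the Jobs disappeared.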
https://github.com/openshift/cluster-version-operator/pull/186
https://github.com/openshift/cluster-version-operator/pull/186 merged
Still hitting this on the latest 4.1 nightly build. Since the target release for this bug is set to "4.2.0" and no 4.2 nightly build is available yet to verify against, changing the status to MODIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922