Description of problem:

image-registry goes offline between 2 and 4 hours after cluster creation. This is repeatable but intermittent.

Version-Release number of selected component (if applicable):

image-registry   4.5.0-0.nightly-s390x-2020-06-29-163732   True   False   True   15h

NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-s390x-2020-06-29-163732   True        False         15h     Error while reconciling 4.5.0-0.nightly-s390x-2020-06-29-163732: the cluster operator image-registry is degraded

How reproducible:

Reproducible but intermittent.

Steps to Reproduce:
1. Build a z/VM cluster and monitor the cluster operators (see the commands sketched below).
2.
3.

Actual results:

image-registry   4.5.0-0.nightly-s390x-2020-06-29-163732   True   False   True   15h

Message:
Degraded: The registry is removed
ImagePrunerDegraded: Job has reached the specified backoff limit

Expected results:

image-registry   4.5.0-0.nightly-s390x-2020-06-29-163732   True   False   False   10h

Additional info:
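The cluster operators were watched with standard oc commands along these lines (exact invocations may have differed):

# Overall operator health (AVAILABLE/PROGRESSING/DEGRADED columns shown above):
oc get clusteroperators

# Detailed conditions for the image-registry operator:
oc get clusteroperator image-registry -o yaml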
Please attach YAMLs for pod/image-pruner-* from the openshift-image-registry namespace. If these pods were able to start, also attach their logs.
Created attachment 1699347 [details] image-registry yaml file
Created attachment 1699348 [details] log file
Created attachment 1699350 [details] new log file. I may have attached the wrong one earlier.
The attached logs are for the operator. The problem is with the image pruner. Please attach YAMLs for the cronjobs, jobs, and pods in the openshift-image-registry namespace.
Created attachment 1699672 [details] image-pruner.yaml
I have attached the image-pruner cronjob YAML. I am not sure how to get the others you requested. If you could send me instructions on how to get what you need, I will upload those as well. Thanks, Christian
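For the record, commands along these lines should capture the requested resources (the output filenames are arbitrary):

oc -n openshift-image-registry get cronjobs -o yaml > cronjobs.yaml
oc -n openshift-image-registry get jobs -o yaml > jobs.yaml
oc -n openshift-image-registry get pods -o yaml > pods.yaml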
One other odd thing we saw: at 8 PM the status goes to not degraded for 6 minutes. I attached the logging file from last night. Any idea why that is? We also had another cluster whose image-registry went degraded at 8 PM.
Created attachment 1699673 [details] Log from last night
Created attachment 1699676 [details] jobs1
Created attachment 1699677 [details] jobs2
Created attachment 1699678 [details] jobs3
Created attachment 1699679 [details] pods1
Created attachment 1699680 [details] pods2
Created attachment 1699681 [details] pods3
Created attachment 1699682 [details] pods4
Created attachment 1699683 [details] pods5
Is it all pods? It's better to get all pods into a single file:

oc -n openshift-image-registry get pods -o yaml > pods.yaml
Yes, that is all the pods.
Created attachment 1699694 [details] all_pods.yaml
Can you run

oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --prune-registry=false

? Do you have `The following objects have invalid references:` in the output?
$ oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --prune-registry=false
Dry run enabled - no modifications will be made. Add --confirm to remove images
Only API objects will be removed. No modifications to the image registry will be made.
Deleted 0 objects.

Run as kubeadmin.
Has anyone looked into my earlier comment about the 8 PM timeframe? Every night at 8 PM the operator flips state for 6 minutes: if it is degraded it goes to not degraded, and if it is not degraded it goes to degraded. I attached a time log earlier showing the operator status over that window.
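To capture those transitions with timestamps, a simple polling loop like the following should work (the interval and log file name are arbitrary):

# Log the operator status once a minute with a UTC timestamp:
while true; do
  date -u
  oc get clusteroperator image-registry
  sleep 60
done >> image-registry-status.log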
There are some problems with the pruner, but it's hard to tell what exactly went wrong. There should be short-lived failing image-pruner pods. The engineering team will try to reproduce it and collect logs on their side, but it'll take some time.

The registry is Removed on your cluster. If you don't use imagestreams a lot, you can suspend the pruner and delete all pruner jobs from the openshift-image-registry namespace:

oc patch imagepruner.imageregistry/cluster --patch '{"spec":{"suspend":true}}' --type=merge
oc -n openshift-image-registry delete jobs --all

Does it help to mitigate the problem?
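If the pruner is needed again later, the same patch with "suspend": false should re-enable it (a sketch mirroring the command above; untested here):

oc patch imagepruner.imageregistry/cluster --patch '{"spec":{"suspend":false}}' --type=merge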
This did make the operator available again. I would like to know what the underlying issue is. Will the issue be fixed, or will the patch be incorporated into the install/upgrade path?
Also turned up this failure mode in 4.5 CI (on a PR presubmit) [1], so that run will have gathered assets with a must-gather and all the other usual bits.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25263/pull-ci-openshift-origin-release-4.5-e2e-gcp-upgrade/1281371515811532800
From the must-gather:

$ yaml2json <namespaces/openshift-image-registry/core/events.yaml | jq -r '[.items[] | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message' | grep pruner
2020-07-10T00:00:05Z - 2020-07-10T00:00:05Z (1) openshift-image-registry Successfully assigned openshift-image-registry/image-pruner-1594339200-jkbqd to ci-op-ygbhp5zt-49a5f-vs995-worker-c-pg7wk
2020-07-10T00:00:05Z - 2020-07-10T00:00:05Z (1) openshift-image-registry Created pod: image-pruner-1594339200-jkbqd
2020-07-10T00:00:05Z - 2020-07-10T00:00:05Z (1) openshift-image-registry Created job image-pruner-1594339200
2020-07-10T00:00:07Z - 2020-07-10T00:01:43Z (5) openshift-image-registry Created container image-pruner
2020-07-10T00:00:07Z - 2020-07-10T00:01:43Z (5) openshift-image-registry Started container image-pruner
2020-07-10T00:05:55Z - 2020-07-10T00:05:55Z (1) openshift-image-registry Deleted pod: image-pruner-1594339200-jkbqd
2020-07-10T00:06:02Z - 2020-07-10T00:06:02Z (1) openshift-image-registry Saw completed job: image-pruner-1594339200, status: Failed

$ tail -n13 namespaces/openshift-image-registry/batch/jobs.yaml
  status:
    conditions:
    - lastProbeTime: "2020-07-10T00:05:55Z"
      lastTransitionTime: "2020-07-10T00:05:55Z"
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: Failed
    startTime: "2020-07-10T00:00:05Z"
kind: JobList
metadata:
  resourceVersion: "60530"
  selfLink: /apis/batch/v1/namespaces/openshift-image-registry/jobs

$ yaml2json <namespaces/openshift-image-registry/core/pods.yaml | jq -r '.items[].metadata.name'
cluster-image-registry-operator-6dd4488f9d-dzkr4
image-registry-d5bbb6764-26blg
image-registry-d5bbb6764-2kxxb
node-ca-5kwbw
node-ca-6wdcf
node-ca-l7c7r
node-ca-wqkmt
node-ca-ww4j6
node-ca-xkscj

$ ls namespaces/openshift-image-registry/pods/
cluster-image-registry-operator-6dd4488f9d-dzkr4/  node-ca-5kwbw/  node-ca-wqkmt/
image-registry-d5bbb6764-26blg/                    node-ca-6wdcf/  node-ca-ww4j6/
image-registry-d5bbb6764-2kxxb/                    node-ca-l7c7r/  node-ca-xkscj/

Seems like there's not much of a record left after the failing pods got reaped.
Bug 1851414 is about the pruner vs. the OOM killer. Not sure if that applies here, because if it did, I'd expect the must-gather's kubelet logs to mention it. But maybe OOM kills don't show up in kubelet logs after all?
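If anyone has live access to an affected cluster, the kubelet logs could be checked for OOM kills directly, along these lines (the node name is a placeholder):

oc adm node-logs <node-name> -u kubelet | grep -i oom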
Hi, I see that there is a mitigation method available in comments 24 and 25. Is this bug still "Urgent", or can it be de-escalated to High or Medium?
Hi Christian, please see my inquiry above. Can we de-escalate this bug to High or Medium since a mitigation has been provided?
Yes, I think we can move this to High as it is still an issue. When do you expect this to be fixed? What version?
I don't have detailed knowledge of this bug, so I will leave that question for Oleg to answer. I will de-escalate this bug to High.
First we need to fix the cronjob so that failed pods stay around for debugging. There are several reasons why the pruner may fail, and we need logs to understand what's going on in this case.
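For context, how long failed runs stick around is governed by the CronJob's history limits. The current setting can be inspected with something like the following (failedJobsHistoryLimit is the standard batch CronJob field; the operator manages this spec, so manual edits may be reverted):

oc -n openshift-image-registry get cronjob image-pruner -o jsonpath='{.spec.failedJobsHistoryLimit}{"\n"}'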
Gathering some data:

1. We could not gather the job pods' logs.
   1.1 Needs https://bugzilla.redhat.com/show_bug.cgi?id=1857687 to be addressed first.
2. Jobs are failing with the image registry removed.
   2.1 There is a known issue with the pruner when the image registry is removed.
   2.2 Issue addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1867792.
   2.3 Maybe this issue is going to be resolved once the above one moves to CLOSED.
3. Operator status seems intermittent (sometimes it becomes healthy).
   3.1 This is a known issue.
   3.2 Addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1857684.

Moving this ahead to the next sprint.
Doug, could you please help verify whether the pruner works on s390x? Thanks!
I was working on another issue. I will start testing this.
(In reply to Douglas Slavens from comment #48)
> I was working on another issue. I will start testing this.

Thanks. When you finish testing, please mark this bug as Verified. :)
The pruner appears to work on s390x:

[dslavens@rock-kvmlp-3 ~]$ oc login
Authentication required for https://api.dslavens-ocp.ocp128.rockhopper:6443 (openshift)
Username: kubeadmin
Password:
Login successful.

You have access to 57 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "default".
[dslavens@rock-kvmlp-3 ~]$ oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --prune-registry=false
Dry run enabled - no modifications will be made. Add --confirm to remove images
Only API objects will be removed. No modifications to the image registry will be made.
Deleting istags openshift/cli: latest
Deleted 1 objects.
[dslavens@rock-kvmlp-3 ~]$ arch
s390x
[dslavens@rock-kvmlp-3 ~]$ oc version
Client Version: 4.5.6
Server Version: 4.5.0-0.nightly-s390x-2020-09-24-223849
Kubernetes Version: v1.18.3+47c0e71
[dslavens@rock-kvmlp-3 ~]$ oc status
In project default on server https://api.dslavens-ocp.ocp128.rockhopper:6443

svc/openshift - kubernetes.default.svc.cluster.local
svc/kubernetes - 172.30.0.1:443 -> 6443

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.
Facing a similar issue in 4.5.7:

[kni@provision ~]$ oc -n openshift-image-registry delete jobs --all
job.batch "image-pruner-1602547200" deleted

Worked.
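To confirm the operator recovers after deleting the jobs, checks along these lines should do (assuming the same namespace):

oc -n openshift-image-registry get jobs
oc get clusteroperator image-registry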
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
Added steps in OSE/workitem?id=OCP-33708 to check whether the pruner will retry after it fails.