Bug 1852501 - image-registry goes offline between 2 and 4 hours after cluster creation. This is repeatable but intermittent.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.5
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ricardo Maraschini
QA Contact: Wenjing Zheng
URL:
Whiteboard: multi-arch
Depends On:
Blocks:
Reported: 2020-06-30 14:39 UTC by Christian LaPolt
Modified: 2021-10-09 01:35 UTC
CC List: 31 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Z15 z/VM
Last Closed: 2020-10-27 16:10:31 UTC
Target Upstream Version:


Attachments
image-registry yaml file (1.74 KB, text/plain) - 2020-06-30 16:10 UTC, Christian LaPolt
log file (134.66 KB, application/octet-stream) - 2020-06-30 16:11 UTC, Christian LaPolt
newe log file.... I may have attached the wrong one. (128.00 KB, application/octet-stream) - 2020-06-30 16:22 UTC, Christian LaPolt
image-pruner.yaml (4.44 KB, text/plain) - 2020-07-02 17:34 UTC, Christian LaPolt
Log from last night (11.87 KB, text/plain) - 2020-07-02 17:45 UTC, Christian LaPolt
jobs1 (4.50 KB, text/plain) - 2020-07-02 18:05 UTC, Christian LaPolt
jobs2 (4.50 KB, text/plain) - 2020-07-02 18:06 UTC, Christian LaPolt
jobs3 (4.50 KB, text/plain) - 2020-07-02 18:06 UTC, Christian LaPolt
pods1 (12.23 KB, text/plain) - 2020-07-02 18:07 UTC, Christian LaPolt
pods2 (7.57 KB, text/plain) - 2020-07-02 18:08 UTC, Christian LaPolt
pods3 (7.57 KB, text/plain) - 2020-07-02 18:08 UTC, Christian LaPolt
pods4 (7.57 KB, text/plain) - 2020-07-02 18:08 UTC, Christian LaPolt
pods5 (7.57 KB, text/plain) - 2020-07-02 18:09 UTC, Christian LaPolt
all_pods.yaml (53.69 KB, text/plain) - 2020-07-02 19:40 UTC, Christian LaPolt


Links
Red Hat Knowledge Base (Solution) 5341521 (last updated 2020-08-24 11:56:29 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:11:08 UTC)

Description Christian LaPolt 2020-06-30 14:39:57 UTC
Description of problem:
image-registry goes offline between 2 and 4 hours after cluster creation. This is repeatable but intermittent.

Version-Release number of selected component (if applicable):
image-registry                             4.5.0-0.nightly-s390x-2020-06-29-163732   True        False         True       15h

NAME      VERSION                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-s390x-2020-06-29-163732   True        False         15h     Error while reconciling 4.5.0-0.nightly-s390x-2020-06-29-163732: the cluster operator image-registry is degraded

How reproducible:
Reproducible but intermittent.

Steps to Reproduce:
1. Build a z/VM cluster and monitor the cluster operators (co's).

Actual results:

image-registry                             4.5.0-0.nightly-s390x-2020-06-29-163732   True        False         True       15h

Message
Degraded: The registry is removed ImagePrunerDegraded: Job has reached the specified backoff limit

Expected results:

image-registry                             4.5.0-0.nightly-s390x-2020-06-29-163732   True        False         False      10h
Additional info:

Comment 1 Oleg Bulatov 2020-06-30 15:37:17 UTC
Please attach YAMLs for pod/image-pruner-* from the openshift-image-registry namespace. If these pods were able to start, also attach their logs.

Comment 2 Christian LaPolt 2020-06-30 16:10:40 UTC
Created attachment 1699347 [details]
image-registry yaml file

Comment 3 Christian LaPolt 2020-06-30 16:11:14 UTC
Created attachment 1699348 [details]
log file

Comment 4 Christian LaPolt 2020-06-30 16:22:13 UTC
Created attachment 1699350 [details]
newe log file.... I may have attached the wrong one.

Comment 5 Oleg Bulatov 2020-07-01 14:34:25 UTC
The attached logs are for the operator. The problem is with the image pruner.

Please attach YAMLs for cronjobs, jobs, and pods in openshift-image-registry namespace.
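[Editor's note: a minimal sketch of commands that would gather the requested objects; the pruner pod name below is illustrative, not taken from this cluster.]

```shell
# Dump cronjobs, jobs, and pods from the namespace into separate files
oc -n openshift-image-registry get cronjobs -o yaml > cronjobs.yaml
oc -n openshift-image-registry get jobs -o yaml > jobs.yaml
oc -n openshift-image-registry get pods -o yaml > pods.yaml

# If a pruner pod managed to start, capture its log too (example pod name)
oc -n openshift-image-registry logs image-pruner-1594339200-jkbqd > pruner.log
```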

Comment 6 Christian LaPolt 2020-07-02 17:34:23 UTC
Created attachment 1699672 [details]
image-pruner.yaml

Comment 7 Christian LaPolt 2020-07-02 17:36:59 UTC
I have attached the image-pruner cronjob YAML. I am not sure how to get the others you have requested. If you could send me instructions on how to get what you need, I will upload those as well.
Thanks,
Christian

Comment 8 Christian LaPolt 2020-07-02 17:45:10 UTC
One other odd thing we saw is that at 8 PM the status goes to not degraded for 6 minutes. I attached the log file from last night. Any idea why that is? We also had another cluster's image-registry go degraded at 8 PM.

Comment 9 Christian LaPolt 2020-07-02 17:45:43 UTC
Created attachment 1699673 [details]
Log from last night

Comment 10 Christian LaPolt 2020-07-02 18:05:56 UTC
Created attachment 1699676 [details]
jobs1

Comment 11 Christian LaPolt 2020-07-02 18:06:22 UTC
Created attachment 1699677 [details]
jobs2

Comment 12 Christian LaPolt 2020-07-02 18:06:47 UTC
Created attachment 1699678 [details]
jobs3

Comment 13 Christian LaPolt 2020-07-02 18:07:46 UTC
Created attachment 1699679 [details]
pods1

Comment 14 Christian LaPolt 2020-07-02 18:08:06 UTC
Created attachment 1699680 [details]
pods2

Comment 15 Christian LaPolt 2020-07-02 18:08:29 UTC
Created attachment 1699681 [details]
pods3

Comment 16 Christian LaPolt 2020-07-02 18:08:51 UTC
Created attachment 1699682 [details]
pods4

Comment 17 Christian LaPolt 2020-07-02 18:09:10 UTC
Created attachment 1699683 [details]
pods5

Comment 18 Oleg Bulatov 2020-07-02 19:01:16 UTC
Are these all the pods? It's better to get all pods into a single file:

oc -n openshift-image-registry get pods -o yaml > pods.yaml

Comment 19 Christian LaPolt 2020-07-02 19:39:35 UTC
Yes, that is all the pods.

Comment 20 Christian LaPolt 2020-07-02 19:40:55 UTC
Created attachment 1699694 [details]
all_pods.yaml

Comment 21 Oleg Bulatov 2020-07-03 13:43:23 UTC
Can you run

oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --prune-registry=false

?

Do you have `The following objects have invalid references:` in output?

Comment 22 Christian LaPolt 2020-07-07 10:17:51 UTC
 oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --prune-registry=false
Dry run enabled - no modifications will be made. Add --confirm to remove images
Only API objects will be removed.  No modifications to the image registry will be made.
Deleted 0 objects.

Run as kubeadmin

Comment 23 Christian LaPolt 2020-07-08 13:56:23 UTC
Has anyone looked into my earlier comment about the 8 PM timeframe, where the operator flips state for 6 minutes (if degraded it goes to not degraded, and if not degraded it goes to degraded)? I attached a time log earlier showing the status of the operator during that timeframe. It happens every night.

Comment 24 Oleg Bulatov 2020-07-08 16:01:44 UTC
There are some problems with the pruner, but it's hard to tell what exactly went wrong. There should be short-lived failing image-pruner pods. The engineering team will try to reproduce it and collect logs on their side, but it'll take some time.

The registry is Removed on your cluster. If you don't use imagestreams a lot, you can suspend the pruner and delete all pruner jobs from the openshift-image-registry namespace:

oc patch imagepruner.imageregistry/cluster --patch '{"spec":{"suspend":true}}' --type=merge
oc -n openshift-image-registry delete jobs --all

Does it help to mitigate the problem?
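[Editor's note: for completeness, the mitigation above can be reversed later with the inverse patch; a minimal sketch, assuming a fix has landed.]

```shell
# Re-enable the suspended image pruner
oc patch imagepruner.imageregistry/cluster --patch '{"spec":{"suspend":false}}' --type=merge

# Verify the pruner cronjob is scheduled again
oc -n openshift-image-registry get cronjobs
```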

Comment 25 Christian LaPolt 2020-07-09 18:54:14 UTC
This did make the operator available again. I would like to know what the underlying issue is. Will the issue be fixed, or will the patch be incorporated into the install/upgrade path?

Comment 26 W. Trevor King 2020-07-10 04:08:25 UTC
This failure mode also turned up in 4.5 CI (on a PR presubmit) [1], so that run will have gathered assets with a must-gather and all the other usual bits.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25263/pull-ci-openshift-origin-release-4.5-e2e-gcp-upgrade/1281371515811532800

Comment 27 W. Trevor King 2020-07-10 04:25:49 UTC
From the must-gather:

$ yaml2json <namespaces/openshift-image-registry/core/events.yaml | jq -r '[.items[] | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message' | grep pruner
2020-07-10T00:00:05Z - 2020-07-10T00:00:05Z (1) openshift-image-registry Successfully assigned openshift-image-registry/image-pruner-1594339200-jkbqd to ci-op-ygbhp5zt-49a5f-vs995-worker-c-pg7wk
2020-07-10T00:00:05Z - 2020-07-10T00:00:05Z (1) openshift-image-registry Created pod: image-pruner-1594339200-jkbqd
2020-07-10T00:00:05Z - 2020-07-10T00:00:05Z (1) openshift-image-registry Created job image-pruner-1594339200
2020-07-10T00:00:07Z - 2020-07-10T00:01:43Z (5) openshift-image-registry Created container image-pruner
2020-07-10T00:00:07Z - 2020-07-10T00:01:43Z (5) openshift-image-registry Started container image-pruner
2020-07-10T00:05:55Z - 2020-07-10T00:05:55Z (1) openshift-image-registry Deleted pod: image-pruner-1594339200-jkbqd
2020-07-10T00:06:02Z - 2020-07-10T00:06:02Z (1) openshift-image-registry Saw completed job: image-pruner-1594339200, status: Failed
$ tail -n13 namespaces/openshift-image-registry/batch/jobs.yaml 
  status:
    conditions:
    - lastProbeTime: "2020-07-10T00:05:55Z"
      lastTransitionTime: "2020-07-10T00:05:55Z"
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: Failed
    startTime: "2020-07-10T00:00:05Z"
kind: JobList
metadata:
  resourceVersion: "60530"
  selfLink: /apis/batch/v1/namespaces/openshift-image-registry/jobs
$ yaml2json <namespaces/openshift-image-registry/core/pods.yaml | jq -r '.items[].metadata.name'
cluster-image-registry-operator-6dd4488f9d-dzkr4
image-registry-d5bbb6764-26blg
image-registry-d5bbb6764-2kxxb
node-ca-5kwbw
node-ca-6wdcf
node-ca-l7c7r
node-ca-wqkmt
node-ca-ww4j6
node-ca-xkscj
$ ls namespaces/openshift-image-registry/pods/
cluster-image-registry-operator-6dd4488f9d-dzkr4/ node-ca-5kwbw/                                    node-ca-wqkmt/
image-registry-d5bbb6764-26blg/                   node-ca-6wdcf/                                    node-ca-ww4j6/
image-registry-d5bbb6764-2kxxb/                   node-ca-l7c7r/                                    node-ca-xkscj/

Seems like there's not much of a record left after the failing pods got reaped.
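[Editor's note: a rough equivalent of the pruner event timeline above can be pulled from a live cluster without yaml2json/jq; a sketch.]

```shell
# List namespace events in chronological order and filter for the pruner
oc -n openshift-image-registry get events --sort-by=.lastTimestamp | grep -i pruner
```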

Comment 28 W. Trevor King 2020-07-10 16:07:29 UTC
Bug 1851414 is about the pruner vs. the OOM killer.  Not sure if that applies here, because if it did, I'd expect the must-gather's kubelet logs to mention it.  But maybe OOM kills don't show up in kubelet logs after all?

Comment 29 Dan Li 2020-07-13 18:11:27 UTC
Hi, I see that there is a mitigation method available in comments 24 and 25. Is this bug still an "Urgent" or can it be de-escalated to High or Medium?

Comment 30 Dan Li 2020-07-20 19:55:24 UTC
Hi Christian, please see my inquiry above. Can we de-escalate this bug to High or Medium since a mitigation has been provided?

Comment 31 Christian LaPolt 2020-07-23 01:09:43 UTC
Yes, I think we can move this to High as it is still an issue. When do you expect this to be fixed? What version?

Comment 32 Dan Li 2020-07-23 12:00:35 UTC
I don't have knowledge in the bug so I will leave this question to Oleg to answer. I will de-escalate this bug to High.

Comment 33 Oleg Bulatov 2020-07-31 15:39:46 UTC
First we need to fix the cronjob so that failed pods stay there for debugging. There are several reasons why the pruner may fail and we need logs to understand what's going on in this case.
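[Editor's note: for context, the Kubernetes CronJob/Job fields that govern whether failed runs remain available for debugging look roughly like this. The values are illustrative, not the operator's actual settings.]

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: image-pruner
  namespace: openshift-image-registry
spec:
  schedule: "0 0 * * *"
  failedJobsHistoryLimit: 3        # keep the last few failed Jobs for inspection
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 6              # retries before "BackoffLimitExceeded"
      template:
        spec:
          # With OnFailure the kubelet restarts the container in place;
          # with Never, each failed attempt leaves a pod behind for debugging.
          restartPolicy: OnFailure
```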

Comment 38 Ricardo Maraschini 2020-08-21 13:09:42 UTC
Gathering some data:

1. We could not gather job pods log
   1.1 Needs https://bugzilla.redhat.com/show_bug.cgi?id=1857687 to be addressed first
2. Jobs are failing with image registry removed
   2.1 There is a known issue with the pruner when the image registry is removed
   2.2 Issue addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1867792
   2.3 This issue may be addressed once the above one moves to CLOSED
3. Operator status seems intermittent (sometimes it becomes healthy)
   3.1 This is a known issue
   3.2 Addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1857684


Moving this ahead to the next sprint.

Comment 47 Wenjing Zheng 2020-09-11 07:45:58 UTC
Doug, could you please help verify whether the pruner works on s390x? Thanks!

Comment 48 Douglas Slavens 2020-09-16 22:00:21 UTC
I was working on another issue. I will start testing this.

Comment 49 Wenjing Zheng 2020-09-22 07:07:23 UTC
(In reply to Douglas Slavens from comment #48)
> I was working on another issue. I will start testing this.

Thanks! When you finish testing, please mark this bug as Verified. :)

Comment 50 Douglas Slavens 2020-09-25 17:44:57 UTC
The pruner appears to work on s390x:

[dslavens@rock-kvmlp-3 ~]$ oc login
Authentication required for https://api.dslavens-ocp.ocp128.rockhopper:6443 (openshift)
Username: kubeadmin
Password: 
Login successful.

You have access to 57 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "default".

[dslavens@rock-kvmlp-3 ~]$ oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --prune-registry=false
Dry run enabled - no modifications will be made. Add --confirm to remove images
Only API objects will be removed.  No modifications to the image registry will be made.
Deleting istags openshift/cli: latest
Deleted 1 objects.

[dslavens@rock-kvmlp-3 ~]$ arch
s390x

[dslavens@rock-kvmlp-3 ~]$ oc version
Client Version: 4.5.6
Server Version: 4.5.0-0.nightly-s390x-2020-09-24-223849
Kubernetes Version: v1.18.3+47c0e71

[dslavens@rock-kvmlp-3 ~]$ oc status
In project default on server https://api.dslavens-ocp.ocp128.rockhopper:6443

svc/openshift - kubernetes.default.svc.cluster.local
svc/kubernetes - 172.30.0.1:443 -> 6443

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.

Comment 56 Anil Dhingra 2020-10-13 03:48:35 UTC
Facing a similar issue in 4.5.7.

[kni@provision ~]$ oc -n openshift-image-registry delete jobs --all
job.batch "image-pruner-1602547200" deleted

Worked

Comment 58 errata-xmlrpc 2020-10-27 16:10:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

