Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1518817

Summary:	oadm prune images fails on invalid image reference
Product:	OpenShift Container Platform	Reporter:	Justin Pierce <jupierce>
Component:	Image Registry	Assignee:	Alexey Gladkov <agladkov>
Status:	CLOSED ERRATA	QA Contact:	Dongbo Yan <dyan>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	3.7.0	CC:	agladkov, aos-bugs, bduffy, bmorriso, bparees, cdaley, dakini, ihorvath, jokerman, jupierce, marc.jadoul, maszulik, mmccomas, nraghava, tnozicka
Target Milestone:	---	Keywords:	OpsBlocker
Target Release:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1577356 1577357 (view as bug list)		Environment:
Last Closed:	2018-07-30 19:09:00 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1577356, 1577357

Description Justin Pierce 2017-11-29 15:37:06 UTC

Description of problem:
['oadm', 'prune', 'deployments', '--keep-complete', '2', '--keep-younger-than', '1h', '--keep-failed', '1', '--config', '/tmp/admin.kubeconfig', '--confirm']

Version-Release number of selected component (if applicable):
v3.7.9

How reproducible:
100%

Steps to Reproduce:
1. Attempt to prune deployments with invalid image references. On a large cluster, manually weeding these invalid deployments out does not scale for operations. 

Actual results:
prune cannot complete due to errors like:

NAME         SECRETS   AGE
autopruner   2         99d
Failed to build graph!

The following objects have invalid references:
ReplicationController[someproject/inov-2]: invalid docker image reference "https://github.com/luistarela/tecno.git": invalid reference format
DeploymentConfig[someproject/mongodb]: invalid docker image reference " ": invalid reference format

Expected results:
Prune should perform best effort work.

Comment 1 Tomáš Nožička 2017-12-06 10:39:06 UTC

I've tried to reproduce it but it didn't fail.

Are you sure you have run `oadm prune deployments` and not `oadm prune images`? I checked the sources and the error you provide ("Failed to build graph!") is only in pruning images subcommand. And I can't see why pruning Deployments would need to resolve images either.

Maybe I am wrong and this is some kind of bad wiring bug, but I though checking first would be better.

Otherwise please provide yaml outputs of affected objects (DC,RC) and I'll give it another shot.

Comment 2 ihorvath 2017-12-06 15:30:17 UTC

You are right, I messed up when following the code that runs these commands. The python script runs oadm prune deployments, oadm prune builds and finally oadm prune images. My bad on the confusion, I grabbed the wrong command associated with the output.

I ran it manually this time, and this is what we got:
[root@starter-ca-central-1-master-692e9 etc]# oadm prune images --keep-younger-than 24h --keep-tag-revisions 5  --token $TOKEN --confirm
Failed to build graph!

The following objects have invalid references:

  ReplicationController[tecnologias/tecnologias-inovadoras-2]: invalid docker image reference "https://github.com/luistarela/tecno.git": invalid reference format
  DeploymentConfig[1-q/cakephp-mysql-persistent]: invalid docker image reference " ": invalid reference format
  DeploymentConfig[1-q/mysql]: invalid docker image reference " ": invalid reference format
  DeploymentConfig[123-my-project/nodejs-mongo-persistent]: invalid docker image reference " ": invalid reference format
.
.
.
many more of these
.
.
Either fix the references or delete the objects to make the pruner proceed.
error: failed to build graph - no changes made

Comment 3 Ben Parees 2017-12-06 15:43:01 UTC

As I recall we debated this behavior and decided this was better than warning and proceeding but I don't recall the justification.

Comment 4 Ben Parees 2017-12-06 15:44:12 UTC

I think it was a concern that if the validation rules changed, we could ignore a bunch of references and prune everything.

Comment 5 Ben Parees 2017-12-06 15:45:36 UTC

Also why didn't validation prevent the creation of these invalid references?

Comment 6 Tomáš Nožička 2017-12-06 15:57:51 UTC

" " is valid image because that is what is used before image trigger controller injects the image resolved from triggers and validation forbids empty string. 

we use upstream validation for dc.spec.template and it checks only for empty image

upstream controls validation for RCs, so user can always edit it

Comment 7 Ben Parees 2017-12-06 16:25:44 UTC

Given the concern that if docker image validation changed in such a way that we suddenly treated all references as "invalid" we'd end up pruning everything, the only path forward I see is to introduce a "--ignore-invalid-refs" or "--proceed-on-error" flag which users would have to set if they wanted to prune in the face of these types of errors.

Comment 8 Ben Parees 2018-01-10 03:41:59 UTC

Marking this as an RFE because per my comment 7, that's effectively what it has to be (the current behavior is intentional and the only thing that's safe to do by default).

Comment 9 Maciej Szulik 2018-03-19 12:04:33 UTC

*** Bug 1555003 has been marked as a duplicate of this bug. ***

Comment 10 Brian Duffy 2018-04-14 11:56:08 UTC

Hi All, I ran into this issue when setting up a pruner cronjob;

$ oc logs openshift-pruners-1522800000-zz2m4 -c prune-images
Failed to build graph!

The following objects have invalid references:

  ReplicationController[dk0503-b/nginx-2]: invalid docker image reference "redacted.com/rhscl/nginx-18-rhel7@sha256:latest": invalid reference format

Either fix the references or delete the objects to make the pruner proceed.
Client version (v3.9.14) doesn't match master (v3.7.23), which may allow for different image references. Try to re-run this binary with the same version.
error: failed to build graph - no changes made

This seems related to this bug.

Comment 11 Ben Parees 2018-04-16 00:58:41 UTC

Michal, see comment 7, i think this is probably small enough that we just fix it as a bug.  (Related to the comment i just put in the bug Clayton opened about pruning).

Let me know if you think it's not so trivial to fix/add the flag/behavior.

Comment 12 Michal Minar 2018-04-16 07:21:27 UTC

> As I recall we debated this behavior and decided this was better than warning and proceeding but I don't recall the justification.

Full discussion: https://github.com/openshift/origin/pull/16821#discussion_r144269214

TL;DR:
1. upstream docker can suddenly relax the pull-spec format
2. master will allow it
3. older oadm client is called to prune the images
4. the pruner ignores the relaxed spec and deletes the images

> " " is valid image because that is what is used before image trigger controller injects the image resolved from triggers and validation forbids empty string. 

Oh, that's a terrible hack. " " is in no way a valid image to me.

> Let me know if you think it's not so trivial to fix/add the flag/behavior.

It looks trivial to me as well. 

My preference would be to make " " a special case and always proceed.
For --ignore-invalid-refs, apart from proceeding, I'd build a set of them from all the referrers and keep all the images having their dockerImageReference present in the set.

Comment 13 Tomáš Nožička 2018-04-16 08:44:22 UTC

>> " " is valid image because that is what is used before image trigger controller injects the image resolved from triggers and validation forbids empty string. 

> Oh, that's a terrible hack. " " is in no way a valid image to me.

I'd rephrase: " " is a valid, explicitly unspecified image. As we need to work with upstream and we don't want to allow "" as an image as that's usually a mistake (like bad yaml indentation), upstream validation allows " ". Given we have image streams and image injection, image may be unspecified during parts of its lifetime.

Comment 14 Michal Minar 2018-04-16 08:58:49 UTC

I see the need, but looking at the raw spec seeing " " instead of image name, and not know this, I'd be tempted to remove it.

Comment 15 Ben Parees 2018-04-16 11:54:04 UTC

> My preference would be to make " " a special case and always proceed.

We've already resolved this specific case I believe (I think Alexey put in that fix), so yes we already treat that as "valid" and proceed.


I'm concerned w/ the more general case where someone has bad references and needs to proceed w/ pruning.  Telling the online ops team to go fix up 0928390428490 bad references so they can prune their cluster isn't going to go over well. We need a path for them.


> For --ignore-invalid-refs, apart from proceeding, I'd build a set of them from 
> all the referrers and keep all the images having their dockerImageReference 
> present in the set.

Sorry I don't follow.

Comment 16 Ben Parees 2018-04-27 21:33:28 UTC

So it is not lost in the history, my suggestion to resolve this bug is:

Given the concern that if docker image validation changed in such a way that we suddenly treated all references as "invalid" we'd end up pruning everything, the only path forward I see is to introduce a "--ignore-invalid-refs" or "--proceed-on-error" flag which users would have to set if they wanted to prune in the face of these types of errors.

Comment 17 Alexey Gladkov 2018-05-03 10:16:11 UTC

> the only path forward I see is to introduce a "--ignore-invalid-refs" or "--proceed-on-error" flag which users would have to set if they wanted to prune in the face of these types of errors.

This approach will shift the problem from our shoulder to the user's shoulders. But by doing this we do not give him enough information to make such a decision.

$ oadm prune images --keep-younger-than 24h --confirm --ignore-invalid-refs

What I will lose after this command ? If we can not give an answer, then how can I use it. If I can get a list of objects with bad refs, then how can I protect those images that are important to me?

I think we need to answer these questions first.

Comment 18 Ben Parees 2018-05-03 11:23:56 UTC

> This approach will shift the problem from our shoulder to the user's shoulders. But by doing this we do not give him enough information to make such a decision.

We do give them that information:  They run prune, they get errors reporting the invalid references, they decide if they want to fix those references, or ignore them and allow pruning to proceed, knowing that some images will be pruned which, potentially, were supposed to be referenced (but were not because the reference was invalid).

Frankly I think this is an unlikely scenario anyway.  How many legitimate configurations do you expect are going to have invalid references to real images that need to be protected?  That workload is already broken/not running since the reference is invalid.

> What I will lose after this command ? 

That's what running the command without --prune will tell you.

> I think we need to answer these questions first.

I think i've answered them, but if you still have concerns, please make an alternative proposal for how this bug should be resolved.  But "require users to fix all their broken refs" is not a viable solution because cluster operators who need to prune their cluster can't be responsible for cleaning up thousands of bad references created by the cluster users.

Comment 19 Alexey Gladkov 2018-05-03 13:59:39 UTC

> They run prune, they get errors reporting the invalid references, they decide if they want to fix those references

> But "require users to fix all their broken refs" is not a viable solution

Do not you see a contradiction here?

> please make an alternative proposal for how this bug should be resolved

Yes, I have an alternative proposal: Instead of deleting objects that contain bad refs, we can create a fake node in the graph and add these objects to it. In this case we will preserve all objects with bad refs since we do not have enough infomation to delete them. Yes, I realize that these objects will stuck in the cluster, but in my opinion this is better than deleting objects in a situation where we are not sure that they are not needed.

Comment 20 Alexey Gladkov 2018-05-03 14:05:54 UTC

Why are we so tolerant of errors and so not tolerant of user data? ))

We so love to ignore errors and delete/update user data without the confidence that it can be done.

Comment 21 Alexey Gladkov 2018-05-03 14:21:22 UTC

(In reply to Alexey Gladkov from comment #19)
> Yes, I have an alternative proposal: Instead of deleting objects that
> contain bad refs, we can create a fake node in the graph and add these
> objects to it. In this case we will preserve all objects with bad refs since
> we do not have enough infomation to delete them. Yes, I realize that these
> objects will stuck in the cluster, but in my opinion this is better than
> deleting objects in a situation where we are not sure that they are not
> needed.

However, I thought that this will not always work. For example, this will not work with imagestreams.

Comment 22 Ben Parees 2018-05-03 14:38:50 UTC

> Do not you see a contradiction here?

you deleted my "or they decide to proceed w/ pruning" so no, there's no contradiction.

With my proposal, they can make an informed decision whether to fix the refs, or allow the deletion to proceed.  but they are not *required* to fix the refs, it is up to them.  If they are ok w/ the deletion, they do not have to fix the refs.  


> We so love to ignore errors and delete/update user data without the confidence that it can be done.

If the deploymentconfig references something that can't possibly be an image we know about and therefore can't possibly be a valid workload, what is the risk in proceeding w/ deletion?  This is a decision an administrator can make.

And if they see that same value 100 times, isn't it easier for them to have a "proceed w/ deletion" option than to go update 100 deploymentconfigs to make them "valid" just so they can prune?  (And depending on policies, the administrator may not be allowed to touch user's deploymentconfigs)

> Yes, I have an alternative proposal: Instead of deleting objects that contain bad refs, we can create a fake node in the graph and add these objects to it. In this case we will preserve all objects with bad refs since we do not have enough infomation to delete them. 

So to be clear, if there's a Dc/pod/etc w/ a bad ref, it's not going to be deleted by pruning images.  The point is that we should provide a way to proceed w/ pruning images in spite of the bad ref, not that we should delete the dc/pod (which isn't a thing pruning does).

With regards to imagestreamtag pruning it is more complicated I agree (I wasn't really thinking about those resources).  I'm ok w/ your fake node proposal but you later said it won't work for imagestreams, why is that?

Comment 23 Alexey Gladkov 2018-05-03 16:02:07 UTC

https://github.com/openshift/origin/pull/19611

Comment 24 Alexey Gladkov 2018-05-07 09:11:32 UTC

I added the `--ignore-invalid-refs`, but I'm very against that solution. I do not see the correct ways to use this option. If it is used in a cluster operator or in scripts without user control, then this is the direct path to data loss. I believe that all references should be readable for the garbage collector. The construction of dependency graphs does not work in another way.

With this comment, I disclaim responsibility for the consequences and impose them on person who will approve this PR.

Comment 26 Ben Parees 2018-05-11 19:10:10 UTC

Bug clones have been created to backport the change to 3.7 and 3.9:
https://bugzilla.redhat.com/show_bug.cgi?id=1577356
https://bugzilla.redhat.com/show_bug.cgi?id=1577357

Comment 28 Dongbo Yan 2018-05-16 07:41:54 UTC

Verified
openshift v3.10.0-0.41.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

Reproduce steps:
1. Edit dc with invalid image reference in 'image' field like below:
        image: docker-registry.default.svc:5000/dyan/cakephp-mysql-example@sha256:latest

2. Prune images
 $ oc adm prune images --keep-tag-revisions=0 --keep-younger-than=0 --token=jf9HRi58RqN0GqBerRyn9UHXHxAvQImhlr6tJ5RhDYE --confirm --loglevel=4

Failed to build graph!

The following objects have invalid references:

  Pod[dyan/cakephp-mysql-example-6-hook-pre]: invalid docker image reference "docker-registry.default.svc:5000/dyan/cakephp-mysql-example@sha256:latest": invalid reference format
  ReplicationController[dyan/cakephp-mysql-example-6]: invalid docker image reference "docker-registry.default.svc:5000/dyan/cakephp-mysql-example@sha256:latest": invalid reference format
  DeploymentConfig[dyan/cakephp-mysql-example]: invalid docker image reference "docker-registry.default.svc:5000/dyan/cakephp-mysql-example@sha256:latest": invalid reference format

Either fix the references or delete the objects to make the pruner proceed.
F0516 03:40:11.697121    1269 helpers.go:119] error: failed to build graph - no changes made

3. Prune images with option '--ignore-invalid-refs'
could prune images without errors

Comment 29 Marc Jadoul 2018-05-16 15:19:56 UTC

Hello guys,

In OCP v3.7 we are hit by this. Users use a template but the build fail. The dc has then image " ". 
The oc adm prune image (in jenkins v3.7 image) consider this as invalid reference and prune fail.... for all projects ?

If we delete the dc --> the deployment won't happen when the build finally work.

So as admin we have zero solution ?

Can we just have a quick fix for this please.

Regards,
Marc

Comment 30 Ben Parees 2018-05-16 16:20:01 UTC

Marc, the backport request is reflected by the presence of this bug:  https://bugzilla.redhat.com/show_bug.cgi?id=1577356

please follow that bug for updates on the progress of backporting it.

Comment 32 errata-xmlrpc 2018-07-30 19:09:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816