Description of problem: ['oadm', 'prune', 'deployments', '--keep-complete', '2', '--keep-younger-than', '1h', '--keep-failed', '1', '--config', '/tmp/admin.kubeconfig', '--confirm'] Version-Release number of selected component (if applicable): v3.7.9 How reproducible: 100% Steps to Reproduce: 1. Attempt to prune deployments with invalid image references. On a large cluster, manually weeding these invalid deployments out does not scale for operations. Actual results: prune cannot complete due to errors like: NAME SECRETS AGE autopruner 2 99d Failed to build graph! The following objects have invalid references: ReplicationController[someproject/inov-2]: invalid docker image reference "https://github.com/luistarela/tecno.git": invalid reference format DeploymentConfig[someproject/mongodb]: invalid docker image reference " ": invalid reference format Expected results: Prune should perform best effort work.
I've tried to reproduce it but it didn't fail. Are you sure you have run `oadm prune deployments` and not `oadm prune images`? I checked the sources and the error you provide ("Failed to build graph!") is only in pruning images subcommand. And I can't see why pruning Deployments would need to resolve images either. Maybe I am wrong and this is some kind of bad wiring bug, but I though checking first would be better. Otherwise please provide yaml outputs of affected objects (DC,RC) and I'll give it another shot.
You are right, I messed up when following the code that runs these commands. The python script runs oadm prune deployments, oadm prune builds and finally oadm prune images. My bad on the confusion, I grabbed the wrong command associated with the output. I ran it manually this time, and this is what we got: [root@starter-ca-central-1-master-692e9 etc]# oadm prune images --keep-younger-than 24h --keep-tag-revisions 5 --token $TOKEN --confirm Failed to build graph! The following objects have invalid references: ReplicationController[tecnologias/tecnologias-inovadoras-2]: invalid docker image reference "https://github.com/luistarela/tecno.git": invalid reference format DeploymentConfig[1-q/cakephp-mysql-persistent]: invalid docker image reference " ": invalid reference format DeploymentConfig[1-q/mysql]: invalid docker image reference " ": invalid reference format DeploymentConfig[123-my-project/nodejs-mongo-persistent]: invalid docker image reference " ": invalid reference format . . . many more of these . . Either fix the references or delete the objects to make the pruner proceed. error: failed to build graph - no changes made
As I recall we debated this behavior and decided this was better than warning and proceeding but I don't recall the justification.
I think it was a concern that if the validation rules changed, we could ignore a bunch of references and prune everything.
Also why didn't validation prevent the creation of these invalid references?
" " is valid image because that is what is used before image trigger controller injects the image resolved from triggers and validation forbids empty string. we use upstream validation for dc.spec.template and it checks only for empty image upstream controls validation for RCs, so user can always edit it
Given the concern that if docker image validation changed in such a way that we suddenly treated all references as "invalid" we'd end up pruning everything, the only path forward I see is to introduce a "--ignore-invalid-refs" or "--proceed-on-error" flag which users would have to set if they wanted to prune in the face of these types of errors.
Marking this as an RFE because per my comment 7, that's effectively what it has to be (the current behavior is intentional and the only thing that's safe to do by default).
*** Bug 1555003 has been marked as a duplicate of this bug. ***
Hi All, I ran into this issue when setting up a pruner cronjob; $ oc logs openshift-pruners-1522800000-zz2m4 -c prune-images Failed to build graph! The following objects have invalid references: ReplicationController[dk0503-b/nginx-2]: invalid docker image reference "redacted.com/rhscl/nginx-18-rhel7@sha256:latest": invalid reference format Either fix the references or delete the objects to make the pruner proceed. Client version (v3.9.14) doesn't match master (v3.7.23), which may allow for different image references. Try to re-run this binary with the same version. error: failed to build graph - no changes made This seems related to this bug.
Michal, see comment 7, i think this is probably small enough that we just fix it as a bug. (Related to the comment i just put in the bug Clayton opened about pruning). Let me know if you think it's not so trivial to fix/add the flag/behavior.
> As I recall we debated this behavior and decided this was better than warning and proceeding but I don't recall the justification. Full discussion: https://github.com/openshift/origin/pull/16821#discussion_r144269214 TL;DR: 1. upstream docker can suddenly relax the pull-spec format 2. master will allow it 3. older oadm client is called to prune the images 4. the pruner ignores the relaxed spec and deletes the images > " " is valid image because that is what is used before image trigger controller injects the image resolved from triggers and validation forbids empty string. Oh, that's a terrible hack. " " is in no way a valid image to me. > Let me know if you think it's not so trivial to fix/add the flag/behavior. It looks trivial to me as well. My preference would be to make " " a special case and always proceed. For --ignore-invalid-refs, apart from proceeding, I'd build a set of them from all the referrers and keep all the images having their dockerImageReference present in the set.
>> " " is valid image because that is what is used before image trigger controller injects the image resolved from triggers and validation forbids empty string. > Oh, that's a terrible hack. " " is in no way a valid image to me. I'd rephrase: " " is a valid, explicitly unspecified image. As we need to work with upstream and we don't want to allow "" as an image as that's usually a mistake (like bad yaml indentation), upstream validation allows " ". Given we have image streams and image injection, image may be unspecified during parts of its lifetime.
I see the need, but looking at the raw spec seeing " " instead of image name, and not know this, I'd be tempted to remove it.
> My preference would be to make " " a special case and always proceed. We've already resolved this specific case I believe (I think Alexey put in that fix), so yes we already treat that as "valid" and proceed. I'm concerned w/ the more general case where someone has bad references and needs to proceed w/ pruning. Telling the online ops team to go fix up 0928390428490 bad references so they can prune their cluster isn't going to go over well. We need a path for them. > For --ignore-invalid-refs, apart from proceeding, I'd build a set of them from > all the referrers and keep all the images having their dockerImageReference > present in the set. Sorry I don't follow.
So it is not lost in the history, my suggestion to resolve this bug is: Given the concern that if docker image validation changed in such a way that we suddenly treated all references as "invalid" we'd end up pruning everything, the only path forward I see is to introduce a "--ignore-invalid-refs" or "--proceed-on-error" flag which users would have to set if they wanted to prune in the face of these types of errors.
> the only path forward I see is to introduce a "--ignore-invalid-refs" or "--proceed-on-error" flag which users would have to set if they wanted to prune in the face of these types of errors. This approach will shift the problem from our shoulder to the user's shoulders. But by doing this we do not give him enough information to make such a decision. $ oadm prune images --keep-younger-than 24h --confirm --ignore-invalid-refs What I will lose after this command ? If we can not give an answer, then how can I use it. If I can get a list of objects with bad refs, then how can I protect those images that are important to me? I think we need to answer these questions first.
> This approach will shift the problem from our shoulder to the user's shoulders. But by doing this we do not give him enough information to make such a decision. We do give them that information: They run prune, they get errors reporting the invalid references, they decide if they want to fix those references, or ignore them and allow pruning to proceed, knowing that some images will be pruned which, potentially, were supposed to be referenced (but were not because the reference was invalid). Frankly I think this is an unlikely scenario anyway. How many legitimate configurations do you expect are going to have invalid references to real images that need to be protected? That workload is already broken/not running since the reference is invalid. > What I will lose after this command ? That's what running the command without --prune will tell you. > I think we need to answer these questions first. I think i've answered them, but if you still have concerns, please make an alternative proposal for how this bug should be resolved. But "require users to fix all their broken refs" is not a viable solution because cluster operators who need to prune their cluster can't be responsible for cleaning up thousands of bad references created by the cluster users.
> They run prune, they get errors reporting the invalid references, they decide if they want to fix those references > But "require users to fix all their broken refs" is not a viable solution Do not you see a contradiction here? > please make an alternative proposal for how this bug should be resolved Yes, I have an alternative proposal: Instead of deleting objects that contain bad refs, we can create a fake node in the graph and add these objects to it. In this case we will preserve all objects with bad refs since we do not have enough infomation to delete them. Yes, I realize that these objects will stuck in the cluster, but in my opinion this is better than deleting objects in a situation where we are not sure that they are not needed.
Why are we so tolerant of errors and so not tolerant of user data? )) We so love to ignore errors and delete/update user data without the confidence that it can be done.
(In reply to Alexey Gladkov from comment #19) > Yes, I have an alternative proposal: Instead of deleting objects that > contain bad refs, we can create a fake node in the graph and add these > objects to it. In this case we will preserve all objects with bad refs since > we do not have enough infomation to delete them. Yes, I realize that these > objects will stuck in the cluster, but in my opinion this is better than > deleting objects in a situation where we are not sure that they are not > needed. However, I thought that this will not always work. For example, this will not work with imagestreams.
> Do not you see a contradiction here? you deleted my "or they decide to proceed w/ pruning" so no, there's no contradiction. With my proposal, they can make an informed decision whether to fix the refs, or allow the deletion to proceed. but they are not *required* to fix the refs, it is up to them. If they are ok w/ the deletion, they do not have to fix the refs. > We so love to ignore errors and delete/update user data without the confidence that it can be done. If the deploymentconfig references something that can't possibly be an image we know about and therefore can't possibly be a valid workload, what is the risk in proceeding w/ deletion? This is a decision an administrator can make. And if they see that same value 100 times, isn't it easier for them to have a "proceed w/ deletion" option than to go update 100 deploymentconfigs to make them "valid" just so they can prune? (And depending on policies, the administrator may not be allowed to touch user's deploymentconfigs) > Yes, I have an alternative proposal: Instead of deleting objects that contain bad refs, we can create a fake node in the graph and add these objects to it. In this case we will preserve all objects with bad refs since we do not have enough infomation to delete them. So to be clear, if there's a Dc/pod/etc w/ a bad ref, it's not going to be deleted by pruning images. The point is that we should provide a way to proceed w/ pruning images in spite of the bad ref, not that we should delete the dc/pod (which isn't a thing pruning does). With regards to imagestreamtag pruning it is more complicated I agree (I wasn't really thinking about those resources). I'm ok w/ your fake node proposal but you later said it won't work for imagestreams, why is that?
https://github.com/openshift/origin/pull/19611
I added the `--ignore-invalid-refs`, but I'm very against that solution. I do not see the correct ways to use this option. If it is used in a cluster operator or in scripts without user control, then this is the direct path to data loss. I believe that all references should be readable for the garbage collector. The construction of dependency graphs does not work in another way. With this comment, I disclaim responsibility for the consequences and impose them on person who will approve this PR.
Bug clones have been created to backport the change to 3.7 and 3.9: https://bugzilla.redhat.com/show_bug.cgi?id=1577356 https://bugzilla.redhat.com/show_bug.cgi?id=1577357
Verified openshift v3.10.0-0.41.0 kubernetes v1.10.0+b81c8f8 etcd 3.2.16 Reproduce steps: 1. Edit dc with invalid image reference in 'image' field like below: image: docker-registry.default.svc:5000/dyan/cakephp-mysql-example@sha256:latest 2. Prune images $ oc adm prune images --keep-tag-revisions=0 --keep-younger-than=0 --token=jf9HRi58RqN0GqBerRyn9UHXHxAvQImhlr6tJ5RhDYE --confirm --loglevel=4 Failed to build graph! The following objects have invalid references: Pod[dyan/cakephp-mysql-example-6-hook-pre]: invalid docker image reference "docker-registry.default.svc:5000/dyan/cakephp-mysql-example@sha256:latest": invalid reference format ReplicationController[dyan/cakephp-mysql-example-6]: invalid docker image reference "docker-registry.default.svc:5000/dyan/cakephp-mysql-example@sha256:latest": invalid reference format DeploymentConfig[dyan/cakephp-mysql-example]: invalid docker image reference "docker-registry.default.svc:5000/dyan/cakephp-mysql-example@sha256:latest": invalid reference format Either fix the references or delete the objects to make the pruner proceed. F0516 03:40:11.697121 1269 helpers.go:119] error: failed to build graph - no changes made 3. Prune images with option '--ignore-invalid-refs' could prune images without errors
Hello guys, In OCP v3.7 we are hit by this. Users use a template but the build fail. The dc has then image " ". The oc adm prune image (in jenkins v3.7 image) consider this as invalid reference and prune fail.... for all projects ? If we delete the dc --> the deployment won't happen when the build finally work. So as admin we have zero solution ? Can we just have a quick fix for this please. Regards, Marc
Marc, the backport request is reflected by the presence of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1577356 please follow that bug for updates on the progress of backporting it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816