Bug 1469654 - image pruning doesn't work from outside the cluster
Summary: image pruning doesn't work from outside the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.7.0
Assignee: Michal Minar
QA Contact: Dongbo Yan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-11 15:11 UTC by Anton Sherkhonov
Modified: 2017-11-30 09:17 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The `oadm prune images` command used to print confusing errors (e.g. operation timed out), and several of its options were not documented. The help was also misleading (e.g. the --registry-url flag is necessary only when passed with --confirm). Consequence: Users did not know what to do when a timeout occurred and could not find the solution in the documentation. Fix: Errors are now printed with a hint. The documentation has been updated and the help text has been amended. Result: Even novice users should now be able to prune images, including from outside the OpenShift cluster.
Clone Of:
Environment:
Last Closed: 2017-11-28 22:00:46 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2017:3188 | 0 | normal | SHIPPED_LIVE | Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update | 2017-11-29 02:34:54 UTC

Description Anton Sherkhonov 2017-07-11 15:11:53 UTC
Description of problem:
When running the command:
# oc adm prune images
The `oc` client tries to GET /healthz from the OpenShift registry via the registry's internal IP address and times out, because that address is not reachable from outside the cluster.

Version-Release number of selected component (if applicable):
$ oc version
oc v3.6.0-alpha.1+16132e2-45
kubernetes v1.5.2+43a9be4
features: Basic-Auth

Server https://osemaster.sbu.lab.eng.bos.redhat.com:8443
openshift v3.5.5.15
kubernetes v1.5.2+43a9be4

How reproducible:
Always

Steps to Reproduce:
1. The cluster has a functional internal registry, and the user has cluster-admin privileges.
2. The client host is outside of the cluster (it doesn't have access to OpenShift's SDN).
3. Run the following command, specifying a <timeframe> that ensures there are some images to prune:
# oc adm prune images --keep-younger-than=<timeframe>
4. Run the same command with --confirm:
# oc adm prune images --keep-younger-than=<timeframe> --confirm --loglevel=8


Actual results:
The level 8 error output:
I0711 10:43:58.188971    2817 prune.go:832] Using registry: 172.22.77.97:5000
I0711 10:43:58.188985    2817 prune.go:193] Trying https for 172.22.77.97:5000
I0711 10:43:58.189026    2817 round_trippers.go:296] GET https://172.22.77.97:5000/healthz
I0711 10:43:58.189034    2817 round_trippers.go:303] Request Headers:
I0711 10:43:58.189041    2817 round_trippers.go:306]     User-Agent: oc/v1.5.2+43a9be4 (darwin/amd64) kubernetes/43a9be4
I0711 10:43:58.189048    2817 round_trippers.go:306]     Authorization: Basic dW51c2VkOnk3VzY1cldGRTdyVXhwdnNsV2twOUtlTXZWSW5aalJsY3F2UzVoU2ZhelU=
I0711 10:45:13.826431    2817 round_trippers.go:321] Response Status:  in 75637 milliseconds
I0711 10:45:13.826454    2817 round_trippers.go:324] Response Headers:
I0711 10:45:13.826477    2817 prune.go:198] Error with https for 172.22.77.97:5000: Get https://172.22.77.97:5000/healthz: dial tcp 172.22.77.97:5000: getsockopt: operation timed out
I0711 10:45:13.826496    2817 prune.go:193] Trying http for 172.22.77.97:5000
I0711 10:45:13.826522    2817 round_trippers.go:296] GET http://172.22.77.97:5000/healthz
I0711 10:45:13.826531    2817 round_trippers.go:303] Request Headers:
I0711 10:45:13.826538    2817 round_trippers.go:306]     User-Agent: oc/v1.5.2+43a9be4 (darwin/amd64) kubernetes/43a9be4
I0711 10:45:13.826546    2817 round_trippers.go:306]     Authorization: Basic dW51c2VkOnk3VzY1cldGRTdyVXhwdnNsV2twOUtlTXZWSW5aalJsY3F2UzVoU2ZhelU=
I0711 10:46:30.104671    2817 round_trippers.go:321] Response Status:  in 76278 milliseconds
I0711 10:46:30.104700    2817 round_trippers.go:324] Response Headers:
I0711 10:46:30.104718    2817 prune.go:198] Error with http for 172.22.77.97:5000: Get http://172.22.77.97:5000/healthz: dial tcp 172.22.77.97:5000: getsockopt: operation timed out
F0711 10:46:30.106273    2817 helpers.go:116] error: error communicating with registry: Get http://172.22.77.97:5000/healthz: dial tcp 172.22.77.97:5000: getsockopt: operation timed out

(172.22.77.97 is the IP address of the internal registry.)

The images are still not pruned.
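
The log above shows the client probing the registry's /healthz endpoint over https and then over http before giving up, with each attempt taking more than 75 seconds. A minimal shell sketch of that probe order, with a short connect timeout and a hint on failure, might look like the following. This is not the actual `oc` implementation; the `probe_registry` function, the `CURL` override variable, the 5-second timeout, and the wording of the messages are all assumptions made for illustration.

```shell
# Command used for the HTTP probe; overridable so the logic can be exercised
# without network access (an assumption of this sketch).
CURL="${CURL:-curl}"

# probe_registry HOST[:PORT]
# Tries https first, then http, against /healthz. Prints the scheme that
# answered, or an error plus a hint on stderr if neither did.
probe_registry() {
    host="$1"
    for scheme in https http; do
        # --connect-timeout limits how long an unreachable address can stall
        # the connection attempt; --insecure skips certificate verification.
        if $CURL --silent --fail --insecure --connect-timeout 5 \
                --output /dev/null "$scheme://$host/healthz"; then
            echo "$scheme"
            return 0
        fi
    done
    echo "error: registry $host is unreachable over https and http" >&2
    echo "hint: from outside the cluster, pass a reachable route with --registry-url" >&2
    return 1
}
```

Calling `probe_registry 172.22.77.97:5000` from a host without access to the SDN would be expected to fail after two short attempts rather than two long ones.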


Expected results:
The images are pruned

Additional info:
When running `oc adm prune images` on one of the cluster nodes as the same user, the command succeeds and the images are pruned.

Comment 1 Michal Minar 2017-07-12 08:23:44 UTC
Anton, for this purpose we offer the `--registry-url` option for the `oadm prune images` command. Could you please try it and report back?
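
As a sketch of what such an invocation could look like from a host outside the cluster: the route host `registry.example.com` and the `60m` retention period are placeholders, and the `OC` override variable and the `prune_images` function are hypothetical names used only for this example.

```shell
# Client binary; overridable (e.g. OC="echo oc") to preview the command line
# instead of running it.
OC="${OC:-oc}"

# prune_images REGISTRY_HOST KEEP_YOUNGER_THAN
# Runs the prune for real (--confirm) against an externally reachable
# registry address instead of the registry's internal IP.
prune_images() {
    # $OC is intentionally unquoted so that an override may contain arguments.
    $OC adm prune images \
        --keep-younger-than="$2" \
        --registry-url="$1" \
        --confirm
}

# Example call (placeholder values):
# prune_images registry.example.com 60m
```

Setting `OC="echo oc"` before the call prints the resulting command line, which is a cheap way to check the flags before pruning anything.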

Comment 2 Anton Sherkhonov 2017-07-12 12:22:55 UTC
`--registry-url` works, but I still think this is a bug:
It is very inconsistent with other usages of the `oc` command, where you don't need anything except the master API endpoint.
If you don't have `--registry-url`, the command without `--confirm` works just fine, which suggests that it should work with `--confirm` as well.
The error you get is "operation timed out", which doesn't make it easy to understand that you need `--registry-url`.
There is nothing in the documentation to support that.

Comment 3 Peter Portante 2017-07-12 16:06:00 UTC
So if "oc adm prune images" works from inside the cluster, doesn't that mean it is accessing the registry via the internal service name? If so, why not always use the external service name, if it exists, and only use the internal service name when it does not exist?

Then you would not need --registry-url for internal vs. external access, right?

Comment 4 Michal Minar 2017-07-17 15:30:10 UTC
The `--registry-url` flag is covered in PR [1]. But I see that the section about `--registry-url` will need to be back-ported to earlier versions. I'll take care of it.

[1] https://github.com/openshift/openshift-docs/pull/4471

> The error you get is "operation timed out" which doesn't make it easy to understand that you need `--registry-url`. There is nothing in the documentation to support that.

This really isn't a good user experience. It will be fixed by this bz.

> If you don't have `--registry-url` the command without `--confirm` works just fine, that suggests that it should work with `--confirm` as well.

I'm not really sure about this point. The `--registry-url` flag isn't really needed for the dry run. Would it be enough to just document this better in the command's help?

> If so, why not always use the external service name, if it exists, and only use the internal service name when it does not exist?

Unfortunately, it's pretty hard to determine the working external URL of the registry. We don't have a way to safely determine it. Recently, we started to allow the external registry name to propagate into image streams [2]. However, making use of it is still optional, which still makes the internal IP the safest option from inside the cluster.

[2] https://github.com/openshift/origin/pull/14882

For usage outside of the cluster, I don't see a better option than `--registry-url`, other than making the URL discoverable from the master, which has been discussed several times already.
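
Until the URL can be discovered automatically, one manual approach is to read the host of an existing registry route and pass it to `--registry-url`. The sketch below assumes an administrator has already exposed the registry through a route named `docker-registry` in the `default` project. Those two names, the `OC` variable, and the `registry_route_host` function are assumptions for illustration, not something this bug confirms for every cluster.

```shell
# Client binary; overridable so the lookup can be previewed or stubbed.
OC="${OC:-oc}"

# registry_route_host [ROUTE_NAME] [NAMESPACE]
# Prints the external host of the registry route, if such a route exists.
registry_route_host() {
    $OC get route "${1:-docker-registry}" \
        --namespace "${2:-default}" \
        --output jsonpath='{.spec.host}'
}

# Example: look up the host, then prune through it (placeholder retention).
# host="$(registry_route_host)" &&
#     $OC adm prune images --keep-younger-than=60m --registry-url="$host" --confirm
```

If no such route exists, the lookup fails and the prune is never attempted, so the address still has to be supplied by hand in that case.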

Comment 8 Ben Parees 2017-10-03 00:55:52 UTC
>> The error you get is "operation timed out" which doesn't make it easy to understand that you need `--registry-url`. There is nothing in the documentation to support that.

> This really isn't a good user experience. It will be fixed by this bz.

@michal, did this ever get fixed?


> The `--registry-url` flag is covered in PR [1]. But I see that the section about `--registry-url` will need to be back-ported to earlier versions. I'll take care of it.

did this documentation get backported?

> I'm not really sure about this point. The `--registry-url` flag isn't really needed for the dry run. Would it be enough to just document this better in the command's help?

yes, that seems reasonable, let's do that.


I am lowering the severity of this bug as all I see are:

1) better error/timeout logic
2) some better docs
3) some better help text

Certainly not a blocker.

Comment 9 Michal Fojtik 2017-10-03 08:07:14 UTC
I don't think this got fixed. The error really is not a good user experience; we should probably improve it to "operation timed out while contacting registry XYZ".

Comment 10 Michal Minar 2017-10-03 11:46:26 UTC
PR fixing the `oadm prune` help and error reporting: https://github.com/openshift/origin/pull/16655

Comment 11 Michal Minar 2017-10-05 12:32:48 UTC
The PR has been fixed. The only missing pieces are documentation back-port PRs [1] and [2] for the --registry-url flag.

[1] https://github.com/openshift/openshift-docs/pull/5535 (OCP 3.5)
[2] https://github.com/openshift/openshift-docs/pull/5536 (OCP 3.4)

Comment 13 Michal Minar 2017-10-10 15:18:07 UTC
The documentation PRs have been merged as well.

Comment 14 Dongbo Yan 2017-10-12 05:58:45 UTC
Verified
# oc version
oc v3.7.0-0.147.0
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://:8443
openshift v3.7.0-0.143.2
kubernetes v1.7.0+80709908fd

Pruning images from outside the cluster without the '--registry-url' flag now produces an error like the one below:
# oadm prune images --keep-younger-than=0  --confirm
 error: failed to ping registry docker-registry.default.svc:5000: [Get https://docker-registry.default.svc:5000/: dial tcp: lookup docker-registry.default.svc on 10.72.17.5:53: no such host, Get http://docker-registry.default.svc:5000/: dial tcp: lookup docker-registry.default.svc on 10.72.17.5:53: no such host]
* Please provide a reachable route to the integrated registry using --registry-url.

The docs and help text look good, so moving to VERIFIED.

Comment 17 errata-xmlrpc 2017-11-28 22:00:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

