Bug 1782852 - ImageContentSourcePolicy does not report causes of failures to use mirrors (+several authentication-related failures reported here)
Summary: ImageContentSourcePolicy does not report causes of failures to use mirrors (+...
Keywords:
Status: CLOSED DUPLICATE of bug 1773806
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-12 13:54 UTC by Jason Montleon
Modified: 2020-01-08 21:24 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-17 14:16:37 UTC
Target Upstream Version:
Embargoed:



Description Jason Montleon 2019-12-12 13:54:51 UTC
Description of problem:
When the original source host cannot be resolved the mirror is never tried

Version-Release number of selected component (if applicable):
OpenShift 4.2.9

How reproducible:
Always

Steps to Reproduce:
1. Mirror an image from an unresolvable host
2. Create an ImageContentSourcePolicy mapping that host to a reachable mirror, so the image should be available (see the sketch after these steps)
3. Start a pod pointing at the origin source for the image
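
(For reference, a minimal ImageContentSourcePolicy of the shape step 2 describes might look like the following sketch; all host and repository names here are hypothetical placeholders, not taken from this report.)

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-mirror
spec:
  repositoryDigestMirrors:
  - source: unresolvable.example.com/team/app        # original, unresolvable source
    mirrors:
    - mirror-registry.example.com:5000/team/app      # reachable mirror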

Actual results:
Pod gets stuck in ImagePullBackOff

Expected results:
Pod starts

Additional info:

Comment 3 Jason Montleon 2019-12-12 16:50:39 UTC
Why would it work when creating a deploymentconfig explicitly using the mirrored image in the same namespace on the same cluster, as in https://bugzilla.redhat.com/show_bug.cgi?id=1782852#c1, and yet not with the deployment pointing at the original source?

Why would the internal registry not be contactable?

Comment 4 Miloslav Trmač 2019-12-12 17:10:07 UTC
No idea. There may well be a bug in the mirroring code, all I’m saying is that this error message is not enough to point at a root cause.

Comment 6 Jason Montleon 2019-12-12 19:33:52 UTC
How do I adjust the CRI-O debug level? Is there documentation somewhere?

Comment 7 Peter Hunt 2019-12-12 19:55:24 UTC
You can create a ContainerRuntimeConfig custom resource: https://github.com/openshift/machine-config-operator/blob/master/docs/ContainerRuntimeConfigDesign.md
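
A minimal sketch of such a resource, assuming the worker pool carries the standard pools.operator label (adjust the selector to whichever pool you need; the design doc linked above is authoritative):

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: crio-debug-logging
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug       # bumps CRI-O to debug-level logging on matched nodes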

Comment 8 Jason Montleon 2019-12-12 20:12:40 UTC
I think I figured it out:

oc debug node/ip-10-0-133-241.ec2.internal
Starting pod/ip-10-0-133-241ec2internal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.133.241
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /etc/crio/crio.conf  | grep log_level
    log_level = "debug"


Am I still supposed to look at the pod logs?
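
(Aside: with log_level = "debug" set, CRI-O's output goes to the node's journal rather than to pod logs. A sketch, assuming systemd-managed crio, from the same chroot:)

# Follow CRI-O's journal while reproducing the pull; the debug-level GET
# lines show which hosts (source vs. mirror) are actually contacted.
sh-4.4# journalctl -u crio -f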

Comment 16 Stephen Cuppett 2019-12-17 14:16:37 UTC
Marking this a duplicate. It's not an unresolvable-host failure; it's getting 401 Unauthorized.

*** This bug has been marked as a duplicate of bug 1773806 ***

Comment 17 Kevin Chung 2019-12-31 14:24:45 UTC
I have a customer experiencing the original issue reported here in their OpenShift 4.2.9 disconnected cluster.  By design, their cluster cannot reach or resolve registry.redhat.io, so they use a mirror repository which is resolvable.

I asked the customer to perform:
podman pull --log-level=debug registry.redhat.io/fuse7/fuse-eap-openshift:1.0

The output shows that it cannot ping registry.redhat.io, and thus it doesn't try the mirror.  Curiously enough, the "reference rewritten from registry.redhat.io" line does not rewrite the reference to the mirror itself.  I don't see this resulting in a 401 Unauthorized, which is where this original bug seems to have ended up, but it does seem similar to the original issue reported.  I'll provide the customer-sensitive ImageContentSourcePolicy object and /etc/containers/registries.conf in additional private notes.

Results:
time="2019-12-30T17:37:18Z" level=info msg="running as rootless" 
time="2019-12-30T17:37:18Z" level=debug msg="Initializing boltdb state at /var/home/core/.local/share/containers/storage/libpod/bolt_state.db" 
time="2019-12-30T17:37:18Z" level=debug msg="Using graph driver overlay" 
time="2019-12-30T17:37:18Z" level=debug msg="Using graph root /var/home/core/.local/share/containers/storage" 
time="2019-12-30T17:37:18Z" level=debug msg="Using run root /run/user/1000" 
time="2019-12-30T17:37:18Z" level=debug msg="Using static dir /var/home/core/.local/share/containers/storage/libpod" 
time="2019-12-30T17:37:18Z" level=debug msg="Using tmp dir /run/user/1000/libpod/tmp" 
time="2019-12-30T17:37:18Z" level=debug msg="Using volume path /var/home/core/.local/share/containers/storage/volumes" 
time="2019-12-30T17:37:18Z" level=debug msg="Set libpod namespace to """ 
time="2019-12-30T17:37:18Z" level=debug msg="[graphdriver] trying provided driver "overlay"" 
time="2019-12-30T17:37:18Z" level=debug msg="overlay: mount_program=/bin/fuse-overlayfs" 
time="2019-12-30T17:37:18Z" level=debug msg="backingFs=xfs, projectQuotaSupported=false, useNativeDiff=false, usingMetacopy=false" 
time="2019-12-30T17:37:18Z" level=debug msg="Initializing event backend journald" 
time="2019-12-30T17:37:18Z" level=debug msg="parsed reference into "[overlay@/var/home/core/.local/share/containers/storage+/run/user/1000:overlay.mount_program=/bin/fuse-overlayfs]registry.redhat.io/fuse7/fuse-eap-openshift:1.0"" 
Trying to pull registry.redhat.io/fuse7/fuse-eap-openshift:1.0...time="2019-12-30T17:37:18Z" level=debug msg="reference rewritten from 'registry.redhat.io/fuse7/fuse-eap-openshift:1.0' to 'registry.redhat.io/fuse7/fuse-eap-openshift:1.0'" 
time="2019-12-30T17:37:18Z" level=debug msg="Trying to pull "registry.redhat.io/fuse7/fuse-eap-openshift:1.0"" 
time="2019-12-30T17:37:18Z" level=debug msg="Using registries.d directory /etc/containers/registries.d for sigstore configuration" 
time="2019-12-30T17:37:18Z" level=debug msg=" Using "default-docker" configuration" 
time="2019-12-30T17:37:18Z" level=debug msg=" No signature storage configuration found for registry.redhat.io/fuse7/fuse-eap-openshift:1.0" 
time="2019-12-30T17:37:18Z" level=debug msg="Looking for TLS certificates and private keys in /etc/docker/certs.d/registry.redhat.io" 
time="2019-12-30T17:37:18Z" level=debug msg="GET https://registry.redhat.io/v2/" 
time="2019-12-30T17:37:18Z" level=debug msg="Ping https://registry.redhat.io/v2/ err Get https://registry.redhat.io/v2/: dial tcp: lookup registry.redhat.io on 10.205.40.192:53: no such host (&url.Error{Op:"Get", URL:"https://registry.redhat.io/v2/", Err:(*net.OpError)(0xc0008c4550)})" 
time="2019-12-30T17:37:18Z" level=debug msg="GET https://registry.redhat.io/v1/_ping" 
time="2019-12-30T17:37:18Z" level=debug msg="Ping https://registry.redhat.io/v1/_ping err Get https://registry.redhat.io/v1/_ping: dial tcp: lookup registry.redhat.io on 10.205.40.192:53: no such host (&url.Error{Op:"Get", URL:"https://registry.redhat.io/v1/_ping", Err:(*net.OpError)(0xc0002f8140)})" 
time="2019-12-30T17:37:18Z" level=error msg="Error pulling image ref //registry.redhat.io/fuse7/fuse-eap-openshift:1.0: Error initializing source docker://registry.redhat.io/fuse7/fuse-eap-openshift:1.0: pinging docker registry returned: Get https://registry.redhat.io/v2/: dial tcp: lookup registry.redhat.io on 10.205.40.192:53: no such host" 
Failed
time="2019-12-30T17:37:18Z" level=error msg="error pulling image "registry.redhat.io/fuse7/fuse-eap-openshift:1.0": unable to pull registry.redhat.io/fuse7/fuse-eap-openshift:1.0: unable to pull image: Error initializing source docker://registry.redhat.io/fuse7/fuse-eap-openshift:1.0: pinging docker registry returned: Get https://registry.redhat.io/v2/: dial tcp: lookup registry.redhat.io on 10.205.40.192:53: no such host"

Comment 23 Dan Clark 2020-01-06 04:17:27 UTC
This doesn't really seem like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1773806

1773806 turns out to be about authenticating to different registries, AFAIK.

The issue in this bug seems to be that, in a disconnected environment, the ImageContentSourcePolicy isn't overriding registry.redhat.io because the cluster cannot talk to registry.redhat.io.
I'm having the same issue. I've tried 'oc edit image.config.openshift.io/cluster' to try to block registry.redhat.io:

spec:
  registrySources:
    blockedRegistries:
    - registry.redhat.io
    insecureRegistries:
    - registry.redhat.io


My next workaround is going to be running a small server to hijack registry.redhat.io, telling the cluster it is an insecure registry so I don't get TLS errors.

Anyone have a better workaround than the terrible idea I've suggested? Haha.

Comment 24 Jason Montleon 2020-01-06 13:18:56 UTC
It's not possible to see whether connecting to the mirror is attempted, or to get any hint of why it failed, without enabling debug logging for CRI-O and looking at the CRI-O logs.

At https://www.redhat.com/en/blog/red-hat-openshift-container-platform-4-now-defaults-cri-o-underlying-container-engine, under the "Demonstrating the Power of Replaceable Parts" header, there's a pretty straightforward example, though in my case I needed to apply it to the workers instead of the masters. It could be one, the other, or both, depending on what you need.

Once the nodes have been updated to the new rendered config, you can use 'oc debug node/$node' on the node where the pod is running to possibly get a better idea of what's going on. I've seen what looked like auth errors against one mirror and cert errors against another mirror this way, errors that weren't easy to diagnose otherwise.
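
A sketch of checking that rollout and then collecting the logs (pool and node names are examples):

$ oc get machineconfigpool worker
# wait for UPDATED=True / UPDATING=False, then grab CRI-O's journal from the node
$ oc debug node/$node -- chroot /host journalctl -u crio --no-pager --since "10 minutes ago"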

Comment 25 Miloslav Trmač 2020-01-06 14:21:32 UTC
Dan Clark or, anyone finding this bug because of
> Failed to pull image "…": rpc error: code = Unknown desc = pinging docker registry returned: Get [not mirror]: dial tcp: lookup [not mirror] on 10.0.0.2:53: no such host

Please read https://bugzilla.redhat.com/show_bug.cgi?id=1782852#c2 .  Seeing that message is NOT an indication that you are in the same situation; please file a SEPARATE bug, with CRI-O debug logs of the pull attempt, if possible.

I’m going to strongly prioritize improving the error messages, but in the meantime, just file bugs and I’ll help diagnose/triage as necessary.

Comment 26 Jason Montleon 2020-01-06 14:35:31 UTC
Comment 2 is marked private. The actual response doesn't appear to have anything that can't be shared:

"Unfortunately, there is a known error reporting bug: Even if mirrors are configured, and are actually tried, if using both the primary registry and all mirrors fails, only the error using the primary registry is reported (and the causes of failures using the mirrors are not returned to the user at all).

The code does not really distinguish between DNS failures and any other kinds of failures at this point, and disconnected operation is passing some tests at least, so it’s very likely that the failure is actually caused by inability to contact the _mirror_, not anything specific to DNS.

Running CRI-O with debug-level logging enabled should include the underlying HTTP requests. That would, at least, allow determining whether mirrors are contacted at all, or whether the code completely fails on DNS lookup, as claimed. (I’m afraid if the mirror is contacted but does not work, even the debug log may not currently contain the actual cause of the failure to use the mirror — it would contain the HTTP request but neither the response nor, necessarily, the description of what went wrong.)"

Comment 27 Dan Clark 2020-01-06 14:41:43 UTC
I'm using the exact same mirror to host the images as was used to build the cluster. The cluster can clearly talk to the mirror, and it has the auths to pull images from that mirror, or the cluster wouldn't have built. I've even gone so far as to add the mirrors to the original install-config.yaml:

imageContentSources:
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/server-operator-rhel8
  source: registry.redhat.io/codeready-workspaces/server-operator-rhel8
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/noobaa-core
  source: docker.io/noobaa/noobaa-core
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/noobaa-operator
  source: docker.io/noobaa/noobaa-operator
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/support-tools
  source: registry.redhat.io/rhel7/support-tools


I've made sure that both a version-number tag and a 'latest' tag are set in the repo. For example, I can go to any node, worker or master, in the cluster and run:

- sudo podman login openshift4-registry.redhatgovsa.io:5000  (login succeeds)
- sudo podman pull openshift4-registry.redhatgovsa.io:5000/ocp4/server-operator-rhel8  (pulls the latest tag)
- sudo podman pull openshift4-registry.redhatgovsa.io:5000/ocp4/server-operator-rhel8:2.0 (pulls the actual tag version of what the operator should pull)


Clearly the mirror is online, accepts connections, and images can be pulled. When I try something as simple as 'oc debug node', which should just pull the support-tools image, I get events saying it cannot be pulled from registry.redhat.io. If I run 'oc debug --image openshift4-registry.redhatgovsa.io:5000/ocp4/support-tools node' it works perfectly.

If I can pull those images from any node in the cluster, and they are part of the same mirror that the cluster was built from, then it's not an issue talking to the mirror. It seems something is up with not being able to talk to registry.redhat.io.

Comment 28 Miloslav Trmač 2020-01-06 14:53:38 UTC
Dan Clark, SEPARATE BUG please. I don’t want to have 4 independent conversations on a single page.

Comment 29 Dan Clark 2020-01-06 17:21:09 UTC
Miloslav: I'll certainly file another bug if that is what needs to be done. I'm a bit confused why it is a separate issue when the original description in this bug is:

Description of problem:
When the original source host cannot be resolved the mirror is never tried

which appears to be the exact error I'm running into. The original comments that I see in this thread don't appear to ever address the issue. Your original comment says there's not enough detail to figure out the root cause, followed by suggestions on how to debug the issue, followed by closing the bug as a duplicate of another bug that is not a duplicate of this issue.

I'm literally going to file a bug with the exact original message as this bug?

Comment 30 Miloslav Trmač 2020-01-06 17:35:25 UTC
(In reply to Dan Clark from comment #29)
> Miloslav: I'll certainly file another bug if that is what needs to be done.
> I'm a bit confused why it is a separate issue when the original description
> in this bug is:
> 
> Description of problem:
> When the original source host cannot be resolved the mirror is never tried
> 
> which appears to be the exact error I'm running into.

You’re extremely unlikely to be running into that problem. That problem basically does not exist at all, at least it has never been encountered so far. (Read the error reporting comment again.) In all cases that have ever been reported in this bug and the duplicate, the mirror _is_ tried when the original cannot be resolved and the problem observed by users is caused by something else, something entirely different (never so far related to DNS in any way, for example).



> I'm literally going to file a bug with the exact original message as this
> bug?

Yes, that would definitely be helpful.  You can also include YOUR steps to reproduce and expected/actual results, because I’d have to ask for that immediately afterwards (comments #23 and #27 discuss attempted workarounds and attempts to diagnose, but not the problem itself). CRI-O debug logs from the pull would be a possible next step.

