Description of problem:
When the original source host cannot be resolved, the mirror is never tried.

Version-Release number of selected component (if applicable):
OpenShift 4.2.9

How reproducible:
Always

Steps to Reproduce:
1. Mirror an image from an unresolvable host
2. Create an ImageContentSourcePolicy so that the image should be available from the mirror (a placeholder example is sketched below)
3. Start a pod pointing at the original source for the image

Actual results:
Pod gets stuck in ImagePullBackOff

Expected results:
Pod starts

Additional info:
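For step 2, a minimal sketch of the kind of ImageContentSourcePolicy involved; the host names and the object name are placeholders, not the actual registries from this report:

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-mirror                                      # placeholder name
spec:
  repositoryDigestMirrors:
  - source: unresolvable.example.com/myproject/myimage      # placeholder: host the cluster cannot resolve
    mirrors:
    - mirror-registry.example.com:5000/myproject/myimage    # placeholder: reachable mirror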
Why would it work when creating a DeploymentConfig that explicitly uses the mirrored image, in the same namespace on the same cluster as in https://bugzilla.redhat.com/show_bug.cgi?id=1782852#c1, and yet not with the deployment pointing at the original source? Why would the internal registry not be contactable?
No idea. There may well be a bug in the mirroring code; all I'm saying is that this error message is not enough to point at a root cause.
How do I adjust the CRI-O debug level? Is there documentation somewhere?
You can create a ContainerRuntimeConfig CR: https://github.com/openshift/machine-config-operator/blob/master/docs/ContainerRuntimeConfigDesign.md
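As a minimal sketch of such an object, based on my reading of that design doc (the object name and the pool label are placeholders; the label has to be applied to the MachineConfigPool you want to debug, e.g. the worker pool, for the selector to match):

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: crio-debug-loglevel              # placeholder name
spec:
  machineConfigPoolSelector:
    matchLabels:
      # placeholder label; apply it with something like
      #   oc label machineconfigpool worker debug-crio=crio-debug-loglevel
      debug-crio: crio-debug-loglevel
  containerRuntimeConfig:
    # switch CRI-O on the selected nodes to debug-level logging
    logLevel: debug

The machine-config operator should then roll out an updated crio.conf to the nodes in the selected pool.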
I think I figured it out:

oc debug node/ip-10-0-133-241.ec2.internal
Starting pod/ip-10-0-133-241ec2internal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.133.241
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /etc/crio/crio.conf | grep log_level
log_level = "debug"

Am I still supposed to look at the pod logs?
Marking this as a duplicate. It's not an "unresolved host" failure; it's getting 401 unauthorized.

*** This bug has been marked as a duplicate of bug 1773806 ***
I have a customer who is experiencing the original issue reported here on their OpenShift 4.2.9 disconnected cluster. By design, their cluster cannot reach out to or resolve registry.redhat.io, so they use a mirror repository which is resolvable. I asked the customer to perform:

podman pull --log-level=debug registry.redhat.io/fuse7/fuse-eap-openshift:1.0

The output shows that it cannot ping registry.redhat.io and thus doesn't try the mirror. Curiously enough, the "reference rewritten from registry.redhat.io" line does not rewrite to the mirror at all. I don't see this resulting in a 401 unauthorized as this original bug seems to have ended up with, but it looks similar to the original issue reported. I'll provide the customer-sensitive ImageContentSourcePolicy object and /etc/containers/registries.conf in additional private notes.

Results:

time="2019-12-30T17:37:18Z" level=info msg="running as rootless"
time="2019-12-30T17:37:18Z" level=debug msg="Initializing boltdb state at /var/home/core/.local/share/containers/storage/libpod/bolt_state.db"
time="2019-12-30T17:37:18Z" level=debug msg="Using graph driver overlay"
time="2019-12-30T17:37:18Z" level=debug msg="Using graph root /var/home/core/.local/share/containers/storage"
time="2019-12-30T17:37:18Z" level=debug msg="Using run root /run/user/1000"
time="2019-12-30T17:37:18Z" level=debug msg="Using static dir /var/home/core/.local/share/containers/storage/libpod"
time="2019-12-30T17:37:18Z" level=debug msg="Using tmp dir /run/user/1000/libpod/tmp"
time="2019-12-30T17:37:18Z" level=debug msg="Using volume path /var/home/core/.local/share/containers/storage/volumes"
time="2019-12-30T17:37:18Z" level=debug msg="Set libpod namespace to """
time="2019-12-30T17:37:18Z" level=debug msg="[graphdriver] trying provided driver "overlay""
time="2019-12-30T17:37:18Z" level=debug msg="overlay: mount_program=/bin/fuse-overlayfs"
time="2019-12-30T17:37:18Z" level=debug msg="backingFs=xfs, projectQuotaSupported=false, useNativeDiff=false, usingMetacopy=false"
time="2019-12-30T17:37:18Z" level=debug msg="Initializing event backend journald"
time="2019-12-30T17:37:18Z" level=debug msg="parsed reference into "[overlay@/var/home/core/.local/share/containers/storage+/run/user/1000:overlay.mount_program=/bin/fuse-overlayfs]registry.redhat.io/fuse7/fuse-eap-openshift:1.0""
Trying to pull registry.redhat.io/fuse7/fuse-eap-openshift:1.0...
time="2019-12-30T17:37:18Z" level=debug msg="reference rewritten from 'registry.redhat.io/fuse7/fuse-eap-openshift:1.0' to 'registry.redhat.io/fuse7/fuse-eap-openshift:1.0'"
time="2019-12-30T17:37:18Z" level=debug msg="Trying to pull "registry.redhat.io/fuse7/fuse-eap-openshift:1.0""
time="2019-12-30T17:37:18Z" level=debug msg="Using registries.d directory /etc/containers/registries.d for sigstore configuration"
time="2019-12-30T17:37:18Z" level=debug msg=" Using "default-docker" configuration"
time="2019-12-30T17:37:18Z" level=debug msg=" No signature storage configuration found for registry.redhat.io/fuse7/fuse-eap-openshift:1.0"
time="2019-12-30T17:37:18Z" level=debug msg="Looking for TLS certificates and private keys in /etc/docker/certs.d/registry.redhat.io"
time="2019-12-30T17:37:18Z" level=debug msg="GET https://registry.redhat.io/v2/"
time="2019-12-30T17:37:18Z" level=debug msg="Ping https://registry.redhat.io/v2/ err Get https://registry.redhat.io/v2/: dial tcp: lookup registry.redhat.io on 10.205.40.192:53: no such host (&url.Error{Op:"Get", URL:"https://registry.redhat.io/v2/", Err:(*net.OpError)(0xc0008c4550)})"
time="2019-12-30T17:37:18Z" level=debug msg="GET https://registry.redhat.io/v1/_ping"
time="2019-12-30T17:37:18Z" level=debug msg="Ping https://registry.redhat.io/v1/_ping err Get https://registry.redhat.io/v1/_ping: dial tcp: lookup registry.redhat.io on 10.205.40.192:53: no such host (&url.Error{Op:"Get", URL:"https://registry.redhat.io/v1/_ping", Err:(*net.OpError)(0xc0002f8140)})"
time="2019-12-30T17:37:18Z" level=error msg="Error pulling image ref //registry.redhat.io/fuse7/fuse-eap-openshift:1.0: Error initializing source docker://registry.redhat.io/fuse7/fuse-eap-openshift:1.0: pinging docker registry returned: Get https://registry.redhat.io/v2/: dial tcp: lookup registry.redhat.io on 10.205.40.192:53: no such host"
Failed
time="2019-12-30T17:37:18Z" level=error msg="error pulling image "registry.redhat.io/fuse7/fuse-eap-openshift:1.0": unable to pull registry.redhat.io/fuse7/fuse-eap-openshift:1.0: unable to pull image: Error initializing source docker://registry.redhat.io/fuse7/fuse-eap-openshift:1.0: pinging docker registry returned: Get https://registry.redhat.io/v2/: dial tcp: lookup registry.redhat.io on 10.205.40.192:53: no such host"
This doesn't really seem like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1773806; as far as I can tell, 1773806 turns out to be about authenticating to different registries. The issue in this bug seems to be that, in a disconnected environment, the ImageContentSourcePolicy isn't overriding registry.redhat.io because the cluster cannot talk to registry.redhat.io.

I'm having this same issue. I've tried 'oc edit image.config.openshift.io/cluster' to try to block registry.redhat.io:

spec:
  registrySources:
    blockedRegistries:
    - registry.redhat.io
    insecureRegistries:
    - registry.redhat.io

My next workaround is going to be running a small server to hijack registry.redhat.io, telling the cluster it is an insecure registry so I don't get TLS errors. Anyone have a better workaround than the terrible idea I've suggested, haha.
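For reference, the complete object that edit targets should look roughly like this; this is my understanding of the field layout, so double-check it against your cluster before applying:

apiVersion: config.openshift.io/v1
kind: Image
metadata:
  name: cluster
spec:
  registrySources:
    blockedRegistries:
    - registry.redhat.io
    insecureRegistries:
    - registry.redhat.io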
It's not possible to see whether connecting to the mirror is attempted, or to get any hint of why it failed, without enabling debugging for CRI-O and looking at the CRI-O logs. At https://www.redhat.com/en/blog/red-hat-openshift-container-platform-4-now-defaults-cri-o-underlying-container-engine, under the "Demonstrating the Power of Replaceable Parts" header, there's a pretty straightforward example, though in my case I needed to apply it to the workers instead of the masters. It could be one, the other, or both depending on what you need. Once the nodes have been updated to the new rendered config you can use 'oc debug node/$node' on the node where the pod is running to possibly get a better idea of what's going on. I've seen what look like auth errors against one mirror and cert errors against another mirror this way that weren't easy to diagnose otherwise.
Dan Clark, or anyone finding this bug because of

> Failed to pull image "…": rpc error: code = Unknown desc = pinging docker registry returned: Get [not mirror]: dial tcp: lookup [not mirror] on 10.0.0.2:53: no such host

please read https://bugzilla.redhat.com/show_bug.cgi?id=1782852#c2. Seeing that message is NOT an indication that you are in the same situation; please file a SEPARATE bug, with CRI-O debug logs of the pull attempt, if possible. I'm going to strongly prioritize improving the error messages, but in the meantime, just file bugs and I'll help diagnose/triage as necessary.
Comment 2 is marked private. The actual response doesn't appear to have anything that can't be shared:

"Unfortunately, there is a known error-reporting bug: even if mirrors are configured, and are actually tried, if using both the primary registry and all mirrors fails, only the error from the primary registry is reported (and the causes of the failures against the mirrors are not returned to the user at all).

The code does not really distinguish between DNS failures and any other kinds of failures at this point, and disconnected operation is passing at least some tests, so it's very likely that the failure is actually caused by an inability to contact the _mirror_, not anything specific to DNS.

Running CRI-O with debug-level logging enabled should include the underlying HTTP requests. That would, at least, allow determining whether the mirrors are contacted at all, or whether the code completely fails on the DNS lookup, as claimed. (I'm afraid that if the mirror is contacted but does not work, even the debug log may not currently contain the actual cause of the failure to use the mirror; it would contain the HTTP request but neither the response nor, necessarily, a description of what went wrong.)"
I'm using the exact same mirror to host the images as was used to build the cluster. The cluster can clearly talk to the mirror and has the auths to pull images from that mirror, or the cluster wouldn't have built. I've even gone so far as to add the mirrors to the original install-config.yaml:

imageContentSources:
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/server-operator-rhel8
  source: registry.redhat.io/codeready-workspaces/server-operator-rhel8
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/noobaa-core
  source: docker.io/noobaa/noobaa-core
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/noobaa-operator
  source: docker.io/noobaa/noobaa-operator
- mirrors:
  - openshift4-registry.redhatgovsa.io:5000/ocp4/support-tools
  source: registry.redhat.io/rhel7/support-tools

I've made sure that both a version number and a latest tag are set in the repo. For example, I can go to any node, worker or master, in the cluster and run:

- sudo podman login openshift4-registry.redhatgovsa.io:5000 (login succeeds)
- sudo podman pull openshift4-registry.redhatgovsa.io:5000/ocp4/server-operator-rhel8 (pulls the latest tag)
- sudo podman pull openshift4-registry.redhatgovsa.io:5000/ocp4/server-operator-rhel8:2.0 (pulls the actual tag version of what the operator should pull)

Clearly the mirror is online, accepts connections, and images can be pulled. When I try to do something as simple as 'oc debug node', which should just pull the support-tools image, I get events saying it cannot be pulled from registry.redhat.io. If I run 'oc debug --image openshift4-registry.redhatgovsa.io:5000/ocp4/support-tools node' it works perfectly. If I can pull those images from any node in the cluster and they are part of the same mirror that the cluster was built from, then it's not an issue talking to the mirror. Something seems to be wrong with not being able to talk to registry.redhat.io.
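For what it's worth, the ImageContentSourcePolicy I'd expect to cover support-tools would look something like the following (the object name is a placeholder); I still need to confirm whether an equivalent object actually exists on the cluster (oc get imagecontentsourcepolicy -o yaml) and whether the corresponding mirror entry shows up in /etc/containers/registries.conf on the nodes. Note that the field is called repositoryDigestMirrors, so as I understand it the mirrors configured this way are used for digest-based pulls:

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: support-tools-mirror              # placeholder name
spec:
  repositoryDigestMirrors:
  - source: registry.redhat.io/rhel7/support-tools
    mirrors:
    - openshift4-registry.redhatgovsa.io:5000/ocp4/support-tools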
Dan Clark, SEPARATE BUG please. I don’t want to have 4 independent conversations in a single page.
Miloslav: I'll certainly file another bug if that is what needs to be done. I'm a bit confused why it is a separate issue when the original description in this bug is:

Description of problem:
When the original source host cannot be resolved the mirror is never tried

which appears to be the exact error I'm running into. The original comments that I see in this thread don't appear to ever address the issue. Your original comment is that there's not enough detail to figure out the root cause, followed by suggestions on how to debug the issue, followed by closing the bug as a duplicate of another bug that is not a duplicate of this issue. I'm literally going to file a bug with the exact original message as this bug?
(In reply to Dan Clark from comment #29)
> Miloslav: I'll certainly file another bug if that is what needs to be done.
> I'm a bit confused why it is a separate issue when the original description
> in this bug is:
>
> Description of problem:
> When the original source host cannot be resolved the mirror is never tried
>
> which appears to be the exact error I'm running into.

You're extremely unlikely to be running into that problem. That problem basically does not exist at all; at least it has never been encountered so far. (Read the error-reporting comment again.) In all cases that have ever been reported in this bug and the duplicate, the mirror _is_ tried when the original cannot be resolved, and the problem observed by users is caused by something else entirely (never, so far, related to DNS in any way, for example).

> I'm literally going to file a bug with the exact original message as this
> bug?

Yes, that would definitely be helpful. You can also include YOUR steps to reproduce and expected/actual results, because I'd have to ask for that immediately afterwards (comments #23 and #27 discuss attempted workarounds and attempts to diagnose, but not the problem itself). CRI-O debug logs from the pull would be a possible next step.