Description of problem:
When trying to build a new app, the build fails at the pull-image-from-registry step with a "certificate signed by unknown authority" error.

Version-Release number of selected component (if applicable):
oc v3.11.0-0.11.0
kubernetes v1.11.0+d4cacc0
openshift v3.11.0-0.11.0
kubernetes v1.11.0+d4cacc0

How reproducible:
Always

Steps to Reproduce:
1. Create a new project: oc new-project test
2. Create a new app: oc new-app --template=cakephp-mysql-example
3. The app pod fails to create with ImagePullBackOff
4. Get the error from the failed event: oc get events

Actual results:
The creation of the app pod fails with the following error:

Failed to pull image "xx.xx.xx.xx:5000/@sha256:9a8a4ed1a62182b61c33bbe767240754c7486fafebdf3068ffd18a0def31338f": rpc error: code = Unknown desc = Get https://xxx.xx.xxx.xxx:5000/v2/: x509: certificate signed by unknown authority

Expected results:
The app pod should be created without any errors.

Additional info:
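One way to confirm this is a host-side trust problem rather than a registry outage is to inspect the certificate the registry actually presents. A quick check (the address is the masked internal registry IP from the error above):

# print the issuer/subject of the cert the registry serves; the issuer
# should be the cluster CA (e.g. openshift-signer)
openssl s_client -connect xx.xx.xx.xx:5000 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -subject

# the SANs should include the registry service IP/hostname being pulled from
openssl s_client -connect xx.xx.xx.xx:5000 </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'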
This means you're using an imagestream with pullthrough enabled. If you're using one of our imagestreams, it should be pointing to a registry that has a trusted certificate. If you're using your own imagestream that points to an untrusted registry, or you've modified our imagestreams to point to an untrusted registry, then you'll need to add the appropriate CA to your OpenShift registry pod so it can trust the upstream registry: https://docs.okd.io/latest/install_config/registry/extended_registry_configuration.html#middleware-repository-pullthrough

"You must ensure that your registry has appropriate certificates to trust any external registries you do a pullthrough against. The certificates need to be placed in the /etc/pki/tls/certs directory on the pod. You can mount the certificates using a configuration map or secret. Note that the entire /etc/pki/tls/certs directory must be replaced. You must include the new certificates and replace the system certificates in your secret or configuration map that you mount."
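A rough sketch of how that mount can be wired up with a config map (the name registry-certs, the local certs/ directory, and the CA filename are placeholders; per the quoted note, the system certificates must be included because the whole directory is replaced):

# build a directory holding the system bundle plus the upstream registry's CA
# (paths and filenames here are illustrative)
mkdir certs
cp /etc/pki/tls/certs/* certs/
cp upstream-registry-ca.crt certs/

# create a config map from it and mount it over /etc/pki/tls/certs in the registry pod
oc create configmap registry-certs --from-file=certs/ -n default
oc set volume dc/docker-registry --add --name=registry-certs \
  -t configmap --configmap-name=registry-certs -m /etc/pki/tls/certs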
The pull error is occurring while trying to pull from the internal OpenShift registry. Is that an untrusted registry? This is something that just started happening in 3.11.0-0.11.0; it did not occur in 3.11.0-0.9.0.

Pull error after running a successful s2i build of nodejs-mongodb-example:

10m  11m  2  nodejs-mongodb-example-1-6s525.154937464bb75a42  Pod  spec.containers{nodejs-mongodb-example}  Normal  Pulling  kubelet, ip-172-31-39-197.us-west-2.compute.internal  pulling image "172.27.101.217:5000/mff/nodejs-mongodb-example@sha256:0109f72e3fbd54f817c806d956121e09efeb6462b967390bd39983293fce8ad2"
10m  11m  2  nodejs-mongodb-example-1-6s525.154937464cfa925b  Pod  spec.containers{nodejs-mongodb-example}  Warning  Failed  kubelet, ip-172-31-39-197.us-west-2.compute.internal  Error: ErrImagePull
10m  11m  2  nodejs-mongodb-example-1-6s525.154937464cfa2c2a  Pod  spec.containers{nodejs-mongodb-example}  Warning  Failed  kubelet, ip-172-31-39-197.us-west-2.compute.internal  Failed to pull image "172.27.101.217:5000/mff/nodejs-mongodb-example@sha256:0109f72e3fbd54f817c806d956121e09efeb6462b967390bd39983293fce8ad2": rpc error: code = Unknown desc = Get https://172.27.101.217:5000/v2/: x509: certificate signed by unknown authority
10m 1
> The pull error is occurring while trying to pull from the internal OpenShift registry. Is that an untrusted registry?

In this context, by "untrusted" I mean a registry which serves content using a certificate that is not trusted by the default system CAs. I would have expected this to start happening in 3.10, because that's when we started using pullthrough for the default imagestreams we ship. I'll come by later and see exactly what you guys are doing.
I changed the referencePolicy to Source for all imagestreams in the openshift namespace (using registry.access.redhat.com as the registry), roughly as sketched below, and the same issue occurs: ImagePullBackOff pulling the newly built image from the local registry. Comparing this cluster to a working one now.
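For reference, the referencePolicy flip was applied with something along these lines (a jq-based sketch, not an official command):

# set every tag's referencePolicy to Source across the openshift namespace
for is in $(oc get is -n openshift -o name); do
  oc get "$is" -n openshift -o json \
    | jq '(.spec.tags[]?.referencePolicy.type) = "Source"' \
    | oc replace -n openshift -f -
done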
"Fixed" this by adding openshift_docker_hosted_registry_insecure=true to the inventory which adds an entry to /etc/sysconfig/docker for an insecure registry CIDR in the service network range. Still investigating why we have to add this now.
Sorry, I think I misunderstood the original issue you were having. It sounds like the certificate the registry is using (which I think is signed by the cluster CA) isn't trusted by your host, which means your host doesn't have the cluster CA for some reason.
Re-opening while this is investigated at https://github.com/openshift/origin/issues/20604
There seem to be two issues on the cluster (Mike's cluster):

1) REGISTRY_OPENSHIFT_SERVER_ADDR was not set on the registry DC (it needs to be set to docker-registry.default.svc:5000). I've fixed that (see the command sketch after this list), and Scott has a PR to revert the change that caused it to not be set: https://github.com/openshift/openshift-ansible/pull/9533. After triggering a new build, the new pod successfully deployed.

2) It's not clear to me why the certificate is not accepted, because the certificate does appear to contain the service IP as we'd expect.
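On the affected cluster, the manual fix for issue 1 amounts to setting the env var on the DC directly, e.g.:

# set the registry hostname env var; the DC's config-change trigger
# then rolls out a new registry pod with it
oc set env dc/docker-registry -n default \
  REGISTRY_OPENSHIFT_SERVER_ADDR=docker-registry.default.svc:5000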
OK, Scott figured out why the cert isn't accepted: docker's certs.d only contains an entry for the svc hostname. That said, there is also a missing trust anchor on the node, which should have the cluster CA and might(?) cause docker to trust the IP. Once I put that in place, the IP is trusted as expected (node-side steps sketched below).

So the resolution to this bug is:
1) put the logic back in place to set the registry hostname env var on the registry DC
2) determine why the cluster CA wasn't populated on the node

Unfortunately this still doesn't explain why in GCP we see certificates with no IP address, but (for better or worse) if the registry hostname var were set, that would not break anything. Handing over to Scott to shepherd the env var fix through, as that is the primary issue.
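For the record, the node-side pieces look roughly like this (the registry IP is the one from the events above, and /etc/origin/node/client-ca.crt is assumed to hold the cluster CA on a 3.x node; adjust paths for your install):

# 1) give docker a certs.d trust entry for the registry IP,
#    alongside the existing svc-hostname entry
mkdir -p /etc/docker/certs.d/172.27.101.217:5000
cp /etc/origin/node/client-ca.crt /etc/docker/certs.d/172.27.101.217:5000/ca.crt

# 2) add the cluster CA as a system trust anchor on the node
cp /etc/origin/node/client-ca.crt /etc/pki/ca-trust/source/anchors/cluster-ca.crt
update-ca-trust extract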
*** Bug 1615337 has been marked as a duplicate of this bug. ***
https://github.com/openshift/openshift-ansible/pull/9533 reverted the commit that introduced this problem.
Actually, this issue was introduced in openshift-ansible-3.11.0-0.13.0.git.0.16dc599None.noarch, not in openshift-ansible-3.11.0-0.11.0.git.0.3c66516None.noarch, so removing "3.11.0-0.11.0" from the summary.
The fix PR is already merged in openshift-ansible-3.11.0-0.14.0; after a retest, it works well now.

# oc get dc docker-registry -o yaml
<--snip-->
        - name: OPENSHIFT_DEFAULT_REGISTRY
          value: docker-registry.default.svc:5000
<--snip-->
        - name: REGISTRY_OPENSHIFT_SERVER_ADDR
          value: docker-registry.default.svc:5000
<--snip-->

# oc describe po nodejs-mongodb-example-6-48dkm -n install-test
<--snip-->
Events:
  Type    Reason     Age  From                                   Message
  ----    ------     ---  ----                                   -------
  Normal  Scheduled  1h   default-scheduler                      Successfully assigned install-test/nodejs-mongodb-example-6-48dkm to ip-172-18-15-97.ec2.internal
  Normal  Pulled     1h   kubelet, ip-172-18-15-97.ec2.internal  Container image "docker-registry.default.svc:5000/install-test/nodejs-mongodb-example@sha256:4b0722d367e47a43647acd84c91a81b03fa001bc2468f87bca1327c0aeb6a5be" already present on machine
  Normal  Created    1h   kubelet, ip-172-18-15-97.ec2.internal  Created container
  Normal  Started    1h   kubelet, ip-172-18-15-97.ec2.internal  Started container
In openshift-ansible-3.11.0-0.15.0
Verified this bug with openshift-ansible-3.11.0-0.15.0.git.0.842d3d1None.noarch, and it passes. Same result as comment 13.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652