Description of problem:
docker_image_availability check fails in a disconnected environment when internet / registry.access.redhat.com access is blocked by firewalls.

Version-Release number of the following components:
openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch
ansible-2.3.1.0-3.el7.noarch

ansible 2.3.1.0
  config file = /root/ocp-setup/conf/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

Actual results:
8. Host:     master.example.com
   Play:     Verify Requirements
   Task:     openshift_health_check
   Message:  One or more checks failed
   Details:  check "docker_image_availability":
             One or more required Docker images are not available:
                 registry.access.redhat.com/openshift3/registry-console,
                 registry.access.redhat.com/rhel7/etcd
             Configured registries: registry.access.redhat.com, registry.example.com:5000

These images are both available in the local repository, and prefixes are appropriately configured in the hosts file to use them, meaning that the actual installation succeeds even if this check fails.
The check is supposed to look for the images in the local docker index. Not sure why that wouldn't be happening. Is there any chance you could attach the log from running ansible-playbook with the -vvv option? It may help to see what's going on.
Created attachment 1311972 [details] ansible.hosts Inventory file used.
Created attachment 1311978 [details] ansible.log ansible.log from the failed installation.
As a workaround, you can easily disable the image check: openshift_disable_check=docker_image_availability
(In reply to Gan Huang from comment #4) > As a workaround, you can easily disable the image check: > > openshift_disable_check=docker_image_availability Correct, this is commented out in the attached hosts file, uncommenting it will allow the installation to proceed. Thanks.
The problem is in roles/openshift_health_checker/openshift_checks/docker_image_availability.py Here skopeo is called without a HTTPS_PROXY env variable set, and thus it fails...
(In reply to Krist van Besien from comment #6) > The problem is in > roles/openshift_health_checker/openshift_checks/docker_image_availability.py > > Here skopeo is called without a HTTPS_PROXY env variable set, and thus it > fails... Thanks for pointing that out. However, in some cases access to the internet is not allowed even through proxies so the registry/images specified in the inventory file should be used, if set.
Basically the test should do this, in order:

- If registries are set in the inventory, check that we can reach them and that they have the right images.
- If no registries are set (hence we use registry.access.redhat.com), check that we can reach it. I think we can assume here that registry.access.redhat.com has the correct images, so we do not need to check that.

If an https proxy is set, assume we can use it.
Hello, there is a workaround for this same issue. I reran the installer and, when this came up, added the override in the hosts file before answering "y":

--> Uncommented the following line in both files:
openshift_disable_check=docker_image_availability

Wrote atomic-openshift-installer config: /root/.config/openshift/installer.cfg.yml
Wrote Ansible inventory: /root/.config/openshift/hosts

Ready to run installation process.
If changes are needed please edit the config file above and re-run.
Are you ready to continue? [y/N]:
The check is designed to look for the necessary images first in the docker index on each host, and only if they are not available there, then look for them in the configured docker registries. In the attached ansible log it looks like most of the images were found, and just a couple were not there (registry-console, and on the masters, etcd). The check would proceed correctly if those were found in the hosts' docker indexes, so the check result should be viewed as correct. If directions for a disconnected environment don't include pulling those images, they should.

What's broken about the check is that in a disconnected environment, we don't want it to fall back to looking at the remote docker registries at all, or if it does (the install assumes a registry if not supplied; I'm not sure it's possible to configure the install to specify no registry, so I'm not sure how to detect a disconnected environment), we could impose a reasonably short timeout so it doesn't take forever to fail. I'll see what's to be done about that.

A separate problem is making sure proxies (if specified) are used in querying registries. That doesn't help the disconnected case, however.
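The two-phase flow described above can be sketched roughly as follows. This is a simplified illustration, not the actual code in docker_image_availability.py; `local_index` and `registry_has_image` are hypothetical stand-ins for the real helpers that query the host's docker index and run skopeo against a registry.

```python
# Sketch of the intended check flow: consult the host's local docker
# index first, and only fall back to configured registries for images
# that are not already present.

def find_unavailable(required, local_index, registries, registry_has_image):
    """Return the subset of required images found neither locally nor remotely."""
    missing = set()
    for image in required:
        if image in local_index:  # phase 1: host's docker index
            continue
        # phase 2: configured remote registries
        if any(registry_has_image(reg, image) for reg in registries):
            continue
        missing.add(image)
    return missing
```

In a fully disconnected environment, pre-pulling every required image onto each host makes phase 1 succeed for everything, so phase 2 (and any unreachable registry) is never consulted.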
https://github.com/openshift/openshift-ansible/pull/5228 sets a 10 second timeout on registry lookups, which should improve the experience a little.

Our documentation for disconnected installs (https://docs.openshift.com/container-platform/latest/install_config/install/disconnected_install.html) doesn't seem to say anything about containerized components. Since etcd was considered missing I assume at least the masters are containerized, but I'm surprised other components were not missing as well (ose, node, openvswitch) in this bug report. Regardless, the docs need updating for that.

The registry-console is mentioned, but as an optional component. I'm not aware of an option for not installing it -- I'm pretty sure the installer includes it whenever there is a registry deployed, and it looks like it could land on any node, so nodes need to have this image. Unless someone knows better, that seems like a docs update too.

I couldn't find any way to keep our installer from configuring registry.access.redhat.com into docker; it's hardcoded to be added on OCP installs. So we don't really have a way to indicate a disconnected install; there will always be registries configured, specifically registry.access.

What I could improve in the check is making sure it checks the registries in the order specified (currently it's whatever order they come out of a hash), so that at least you can put the images in a local registry, configure it first, and have those found without having to consult registry.access when it's unreachable. Finally, I could have the check stop querying any registry that has timed out 3 times, so we only waste 30 seconds on a host finding out it is unreachable.
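The timeout and give-up-after-3-timeouts ideas could look roughly like this. This is a sketch, not the PR code; the command shape mirrors the "Checked by: timeout 10 skopeo inspect ..." line that the check prints in its failure output, and the `RegistryTracker` class is a hypothetical helper.

```python
# Sketch: wrap each skopeo inspect in a coreutils timeout, and track
# per-registry timeouts so a registry that has timed out repeatedly
# is skipped instead of wasting 10 seconds per image.

TIMEOUT_SECONDS = 10
MAX_TIMEOUTS = 3

def make_skopeo_cmd(registry, image, timeout=TIMEOUT_SECONDS):
    # Matches the "Checked by:" line shown in the check's failure message.
    return ["timeout", str(timeout), "skopeo", "inspect",
            "--tls-verify=false", "docker://{}/{}".format(registry, image)]

class RegistryTracker:
    """Remember how often each registry timed out; skip it after MAX_TIMEOUTS."""
    def __init__(self):
        self.timeouts = {}

    def record_timeout(self, registry):
        self.timeouts[registry] = self.timeouts.get(registry, 0) + 1

    def should_skip(self, registry):
        return self.timeouts.get(registry, 0) >= MAX_TIMEOUTS
```

With this, an unreachable registry.access costs at most 3 x 10 seconds per host instead of 10 seconds for every missing image.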
(In reply to Luke Meyer from comment #12) > https://github.com/openshift/openshift-ansible/pull/5228 sets a 10 second > timeout on registry lookups, which should improve the experience a little. Thanks, 10 s should indeed be enough when everything is working ok. > Our documentation for disconnected installs > (https://docs.openshift.com/container-platform/latest/install_config/install/ > disconnected_install.html) doesn't seem to say anything about containerized > components. Since etcd was considered missing I assume at least the masters > are containerized, but I'm surprised other components were not missing as > well (ose, node, openvswitch) in this bug report. Regardless, the docs need > updating for that. Yes, the document seems to be a bit behind. > The registry-console is mentioned, but as an optional component. I'm not > aware of an option for not installing it -- I'm pretty sure the installer > includes it whenever there is a registry deployed, and it looks like it > could land on any node, so nodes need to have this image. Unless someone > knows better, that seems like a docs update too. RFE to make registry-console optional was rejected just a couple of days ago: https://bugzilla.redhat.com/show_bug.cgi?id=1425022 > I couldn't find any way to keep our installer from configuring > registry.access.redhat.com into docker; RFE to make this possible was rejected just a couple of days ago: https://bugzilla.redhat.com/show_bug.cgi?id=1461465 > installs. So we don't really have a way to indicate a disconnected install; > there will always be registries configured, specifically registry.access. > What I could improve in the check is that it could be sure it checks the > registries in the order specified (currently it's what order they come out > of a hash), so that at least you can put the images in a local registry, > configure it first, and have those found without having to consult > registry.access when it's unreachable. 
> Finally I could have the check stop
> querying any registry that has timed out 3 times so we only waste 30 seconds
> on a host finding out it is unreachable.

Perhaps you could also check whether you can actually access these registries at all; in many disconnected environments firewalls or the like prevent any internet access, so any connection attempt to registry.access.redhat.com will be rejected (or, in some cases, will eventually time out). But already using the registries in the order they are configured would be a step forward. Thanks.
Hi Marko, We discussed this bug during scrum today and decided it makes sense to fix this as well as Bug #1461465. Luke may have time to get to it this sprint after devcut.
Related: https://github.com/openshift/openshift-ansible/issues/5330
For the registry console and etcd image locations, the user has the following options to specify his/her own local registry URL:

openshift_cockpit_deployer_prefix
osm_etcd_image

Only when the user did not specify those options should the installer use registry.access as the default. But after going through docker_image_availability.py in openshift-ansible-roles-3.6.173.0.31-1.git.0.c9aeacc.el7.noarch, unfortunately the docker_image_availability check does NOT take those options into consideration; it just uses registry.access as the hard-coded registry location for the etcd and registry console images.

From my understanding, if the docker_image_availability function took the two options into consideration instead of hard-coding the registry, we would only need to update the docs to say that when a user specifies a local registry, they have to set openshift_cockpit_deployer_prefix and osm_etcd_image to point the image URLs there. Then everything should go well, and we would not even need the two RFE bugs for this bug.

BTW, according to https://docs.openshift.com/container-platform/latest/install_config/install/disconnected_install.html#disconnected-syncing-images, the registry console image also has a preferred tag to check, e.g. v3.6, not latest, while the docker_image_availability function is using the latest tag to check the image's availability. This should also be corrected.

<--snip-->
            # The registry-console is for some reason not prefixed with ose- like the other components.
            # Nor is it versioned the same, so just look for latest.
            # Also a completely different name is used for Origin.
            required.add(image_info["registry_console_image"])
<--snip-->
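A minimal sketch of how the check could honor those two inventory overrides instead of hard-coding registry.access. The variable names follow this comment; the defaults and the helper names are assumptions, not the actual check code.

```python
# Sketch: derive the required etcd and registry-console image names from
# the inventory overrides when set, falling back to the current
# hard-coded defaults (assumed here) otherwise.

def required_etcd_image(task_vars):
    # osm_etcd_image, if set, should override the hard-coded default.
    return task_vars.get("osm_etcd_image",
                         "registry.access.redhat.com/rhel7/etcd")

def required_console_image(task_vars, tag):
    prefix = task_vars.get("openshift_cockpit_deployer_prefix",
                           "registry.access.redhat.com/openshift3/")
    # Use the release tag (e.g. v3.6) rather than "latest", per the
    # disconnected-install docs.
    return "{}registry-console:{}".format(prefix, tag)
```

With logic like this, a disconnected inventory that sets both variables would cause the check to look only at the local registry, matching what the installer actually deploys.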
I'd like to keep the scope of this bug limited to improving the reliability / user experience of the check. Handling proxies (comment 6) and alternate image specifications (comment 16, see also https://github.com/openshift/openshift-ansible/issues/5330) are important fixes that deserve their own bugs. Also the docs need some updates. https://github.com/openshift/openshift-ansible/pull/5365 created to implement the following: 1. Check correctly for image in docker index (using all registry names). 2. Inspect registries in the order configured; we can't currently configure no registries for a disconnected install, but they could at least configure a local one and fill it with images so that public registries are never pulled. 3. Probe for connectivity to registries and don't inspect ones that we can't reach. 4. Retry skopeo inspect to work around external network blips.
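Items 2-4 of the PR can be sketched as follows. This is an illustrative outline, not the PR's implementation; `probe` and `inspect` are hypothetical stand-ins for the real connectivity probe and skopeo inspect calls.

```python
# Sketch: walk registries in configured order (item 2), probe each one
# for connectivity exactly once and cache the result (item 3), and
# retry inspections to ride out transient network blips (item 4).

def check_registries(registries, probe, inspect, image, retries=2):
    """Return the first registry that serves the image, or None."""
    reachable_cache = {}
    for registry in registries:                # item 2: configured order
        if registry not in reachable_cache:
            reachable_cache[registry] = probe(registry)  # item 3
        if not reachable_cache[registry]:
            continue                           # unreachable: never inspect
        for _ in range(1 + retries):           # item 4: retry on failure
            if inspect(registry, image):
                return registry
    return None
```

This keeps an unreachable registry.access from being queried per-image: it is probed once, marked unreachable, and skipped for every subsequent image on that host.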
Commit pushed to master at https://github.com/openshift/openshift-ansible https://github.com/openshift/openshift-ansible/commit/a3e1c2a819434acc2ce07467c322e12beeee8591 docker_image_availability: probe registry connectivity Probe whether the host has connectivity to the registry before trying to inspect it for images, and remember the result. Also if later inspection fails due to timeout, mark registry as unreachable. Note in failure output if any registries were unreachable. Registry order should match what is configured into docker now as well. Fixes bug 1480195 https://bugzilla.redhat.com/show_bug.cgi?id=1480195
Tested with openshift-ansible-3.7.0-0.126.4.git.0.3fc2b9b.el7.noarch.rpm

Now the check fails quickly in a disconnected environment. According to comment 17, this is the expected result. Moving to verified.
Apologies, I didn't realize that the bug is targeted at 3.6. Retested and verified with openshift-ansible-3.6.173.0.37-1.git.0.2774e0b.el7.noarch.rpm
Hello,

I have been watching this thread as I was hitting the same issue with the docker_image_availability check. I just did a pull of openshift-ansible to the latest code and tried to see if the issue was fixed, but I am still hitting the error (while performing a disconnected install).

These are my local registry settings:

openshift_docker_additional_registries=161.192.1.1:5000
openshift_docker_insecure_registries=161.192.1.1:5000
openshift_docker_blocked_registries=docker.io,registry.access.redhat.com
osm_etcd_image=161.192.1.1:5000/rhel7/etcd

And the output when I run the check:

"docker_image_availability": {
    "changed": true,
    "failed": true,
    "failures": [
        [
            "OpenShiftCheckException",
            "One or more required Docker images are not available:\n    registry.access.redhat.com/rhel7/etcd\nConfigured registries: 161.192.1.1:5000, docker.io\nChecked by: timeout 10 skopeo inspect --tls-verify=false docker://{registry}/{image}"
        ]
    ],
    "msg": "One or more required Docker images are not available:\n    registry.access.redhat.com/rhel7/etcd\nConfigured registries: 161.192.1.1:5000, docker.io\nChecked by: timeout 10 skopeo inspect --tls-verify=false docker://{registry}/{image}"
},

Is there anything I need to do to fix the above error?
In addition to the above (comment 22), the following change works for me, but unsure if this would break a regular install. index 98372d9..62c5efd 100644 --- a/roles/openshift_health_checker/openshift_checks/docker_image_availability.py +++ b/roles/openshift_health_checker/openshift_checks/docker_image_availability.py @@ -131,7 +131,7 @@ class DockerImageAvailability(DockerHostMixin, OpenShiftCheck): for component in components: required.add("{}/{}:{}".format(image_info["namespace"], component, image_tag)) if 'etcd' in host_groups: # special case, note it is the same for origin/enterprise - required.add("registry.access.redhat.com/rhel7/etcd") # and no image tag + required.add("rhel7/etcd") # and no image tag return required
(In reply to Pradeep D from comment #22) > Anything that I need to do to fix the above error? It is reporting that the "registry.access.redhat.com/rhel7/etcd" image is not available. So if you want to fix it, make that image available. "rhel7/etcd" is not the same image tag, even though it is presumably the right image. Just docker tag it and the check will pass. The reason the check looks specifically for the fully qualified name is because AFAICS that is what Ansible deploys for containerized etcd, and the install would fail without it being available. If I am wrong about that, then the check should be updated to look for whatever image tag is actually going to be deployed.
(In reply to Luke Meyer from comment #24) > (In reply to Pradeep D from comment #22) > > > Anything that I need to do to fix the above error? > > It is reporting that the "registry.access.redhat.com/rhel7/etcd" image is > not available. So if you want to fix it, make that image available. > "rhel7/etcd" is not the same image tag, even though it is presumably the > right image. Just docker tag it and the check will pass. > > The reason the check looks specifically for the fully qualified name is > because AFAICS that is what Ansible deploys for containerized etcd, and the > install would fail without it being available. If I am wrong about that, > then the check should be updated to look for whatever image tag is actually > going to be deployed. Unfortunately I had already tried that and does not help either. Here are a list of images (and the various docker tags that I have tried :)) in my internal registry [ec2-user@ip-161-192-1-1 openshift-ansible]$ docker images | grep -i rhel7 localhost:5000/registry.access.redhat.com/rhel7/etcd latest 93e8e4932fa2 3 weeks ago 238.6 MB localhost:5000/rhel7/etcd latest 93e8e4932fa2 3 weeks ago 238.6 MB registry.access.redhat.com/rhel7/etcd latest 93e8e4932fa2 3 weeks ago 238.6 MB 161.192.1.1:5000/registry.access.redhat.com/rhel7/etcd latest 93e8e4932fa2 3 weeks ago 238.6 MB 161.192.1.1:5000/rhel7/etcd latest 93e8e4932fa2 3 weeks ago 238.6 MB and I get the same error as above - Play: Verify Requirements Task: openshift_health_check Message: One or more checks failed Details: check "docker_image_availability": One or more required Docker images are not available: registry.access.redhat.com/rhel7/etcd Configured registries: 161.192.1.1:5000, docker.io Checked by: timeout 10 skopeo inspect --tls-verify=false docker://{registry}/{image}
(In reply to Pradeep D from comment #25) > > Unfortunately I had already tried that and does not help either. Here are a > list of images (and the various docker tags that I have tried :)) in my > internal registry > > [ec2-user@ip-161-192-1-1 openshift-ansible]$ docker images | grep -i rhel7 > localhost:5000/registry.access.redhat.com/rhel7/etcd latest > 93e8e4932fa2 3 weeks ago 238.6 MB > localhost:5000/rhel7/etcd latest > 93e8e4932fa2 3 weeks ago 238.6 MB > registry.access.redhat.com/rhel7/etcd latest > 93e8e4932fa2 3 weeks ago 238.6 MB > 161.192.1.1:5000/registry.access.redhat.com/rhel7/etcd latest > 93e8e4932fa2 3 weeks ago 238.6 MB > 161.192.1.1:5000/rhel7/etcd latest > 93e8e4932fa2 3 weeks ago 238.6 MB This is not a support channel but let me point out that latest tag is not the one that is used when installing OpenShift, see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1480443 for pointers.
Hi Marko,

I understand that this is not a support channel; I am just raising a point about something that is not working (and that was supposed to be fixed as per the comments above). I am more than happy to raise a separate request if needed.

I do specify the image tag for all openshift* images, but I am unable to specify the tag for the etcd image. As already mentioned above, when I use the script in a regular AWS environment the installation goes through fine, but it throws errors in a disconnected install, and the issue seems to be in the docker_image_availability check, where it verifies that the image exists before continuing with the install. In fact, disabling this check gives me a successful install, which goes to show that just the code checking for image availability is not obtaining the etcd image information correctly.
(In reply to Pradeep D from comment #27) > > I do specify the image tag for all openshift* images. I am unable to specify > the tag for the etcd image. Just a quick guess (I'm on the road and not able to inspect in more details during the near future), but you have: osm_etcd_image=161.192.1.1:5000/rhel7/etcd Whereas I have it working with something like: openshift_docker_additional_registries=registry.example.com:5000 #openshift_docker_insecure_registries=registry.example.com:5000 oreg_url=registry.example.com:5000/openshift3/ose-${component}:${version} osm_etcd_image=rhel7/etcd openshift_metrics_image_prefix=registry.example.com:5000/openshift3/ openshift_logging_image_prefix=registry.example.com:5000/openshift3/ There are some possibly helpful scripts and playbooks in my github repo to facilitate disconnected installations, see https://github.com/myllynen/openshift-automation-tools. Thanks.
(In reply to Marko Myllynen from comment #28)

> Whereas I have it working with something like:
> 
> openshift_docker_additional_registries=registry.example.com:5000
> #openshift_docker_insecure_registries=registry.example.com:5000
> oreg_url=registry.example.com:5000/openshift3/ose-${component}:${version}
> osm_etcd_image=rhel7/etcd
> openshift_metrics_image_prefix=registry.example.com:5000/openshift3/
> openshift_logging_image_prefix=registry.example.com:5000/openshift3/
> 
> There are some possibly helpful scripts and playbooks in my github repo to
> facilitate disconnected installations, see
> https://github.com/myllynen/openshift-automation-tools.

Thanks Marko. Changing osm_etcd_image to just rhel7/etcd did the trick. The documentation here - https://docs.openshift.com/container-platform/3.6/install_config/install/rpm_vs_containerized.html - which specifies that it needs to be a fully qualified registry name, is what threw me off.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2900
I don't see this documented, but looking at the git code below, skopeo does take into account these three advanced settings:

openshift_http_proxy
This variable specifies the HTTP_PROXY environment variable for masters and the Docker daemon.

openshift_https_proxy
This variable specifies the HTTPS_PROXY environment variable for masters and the Docker daemon.

openshift_no_proxy
This variable is used to set the NO_PROXY environment variable for masters and the Docker daemon. This value should be set to a comma-separated list of host names or wildcard host names that should not use the defined proxy. This list will be augmented with the list of all defined OpenShift Container Platform host names by default.

https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/openshift_health_checker/openshift_checks/docker_image_availability.py#L87
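The mapping from those inventory variables to the environment handed to skopeo could look roughly like this. This is a sketch of the idea only; the exact variable handling in the release-3.7 check may differ, and `skopeo_proxy_env` is a hypothetical helper name.

```python
# Sketch: build the proxy environment for the skopeo subprocess from
# the three inventory variables, setting only the ones that are defined.

def skopeo_proxy_env(task_vars):
    env = {}
    mapping = {
        "openshift_http_proxy": "HTTP_PROXY",
        "openshift_https_proxy": "HTTPS_PROXY",
        "openshift_no_proxy": "NO_PROXY",
    }
    for inventory_var, env_var in mapping.items():
        value = task_vars.get(inventory_var)
        if value:
            env[env_var] = value
    return env
```

The resulting dict would then be merged into the environment of the skopeo inspect subprocess, so registry lookups traverse the same proxy that the Docker daemon is configured to use.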