Bug 1480195

Summary: docker_image_availability check fails in disconnected environment
Product: OpenShift Container Platform Reporter: Marko Myllynen <myllynen>
Component: Installer Assignee: Luke Meyer <lmeyer>
Status: CLOSED ERRATA QA Contact: Gan Huang <ghuang>
Severity: medium Docs Contact:
Priority: high    
Version: 3.6.0 CC: adietish, aos-bugs, bleanhar, dsulliva, hgomes, jialiu, jokerman, jswensso, kvanbesi, lmeyer, mmccomas, myllynen, pradeep.k.dhananjaya, sreber
Target Milestone: --- Keywords: NeedsTestCase
Target Release: 3.6.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The docker_image_availability ansible health check did not apply any intelligence for when registries are not reachable, and also did not search for images in the local index with all fully-qualified names.

Consequence: This check takes a very long time to run in disconnected installs if not all of the required images are imported and tagged a certain way, causing it to consult the default registry which is not reachable and taking a long time to timeout for each image.

Fix: Update docker_image_availability check to:
1. Check correctly for image in docker index (using all registry names).
2. Inspect registries in the order configured, to enable finding required images in a local registry before consulting a public one.
3. Probe for connectivity to registries and don't continue to inspect ones that we can't reach.
4. Retry failed registry inspections to add robustness in case of transient network problems.

Result: This check should be a lot more robust and performance in disconnected scenarios should be much improved.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-10-17 11:45:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1461465, 1541226    
Bug Blocks:    
Attachments:
Description Flags
ansible.hosts none
ansible.log none

Description Marko Myllynen 2017-08-10 11:28:45 UTC
Description of problem:
docker_image_availability check fails in disconnected environment when internet / registry.access.redhat.com access is blocked by firewalls.

Version-Release number of the following components:
openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch
ansible-2.3.1.0-3.el7.noarch
ansible 2.3.1.0
  config file = /root/ocp-setup/conf/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

Actual results:
  8. Host:     master.example.com
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "docker_image_availability":
               One or more required Docker images are not available:
                   registry.access.redhat.com/openshift3/registry-console,
                   registry.access.redhat.com/rhel7/etcd
               Configured registries: registry.access.redhat.com, registry.example.com:5000

These images are both available in the local repository, and prefixes are appropriately configured in the hosts file to use them, meaning that the actual installation succeeds even if this check fails.

Comment 1 Luke Meyer 2017-08-10 13:23:40 UTC
The check is supposed to look for the images in the local docker index. Not sure why that wouldn't be happening. Is there any chance you could attach the log from running ansible-playbook with the -vvv option? It may help to see what's going on.

Comment 2 Marko Myllynen 2017-08-11 06:06:10 UTC
Created attachment 1311972 [details]
ansible.hosts

Inventory file used.

Comment 3 Marko Myllynen 2017-08-11 06:07:19 UTC
Created attachment 1311978 [details]
ansible.log

ansible.log from the failed installation.

Comment 4 Gan Huang 2017-08-11 06:55:41 UTC
As a workaround, you can easily disable the image check:

openshift_disable_check=docker_image_availability

Comment 5 Marko Myllynen 2017-08-11 06:57:42 UTC
(In reply to Gan Huang from comment #4)
> As a workaround, you can easily disable the image check:
> 
> openshift_disable_check=docker_image_availability

Correct, this is commented out in the attached hosts file; uncommenting it will allow the installation to proceed. Thanks.

Comment 6 Krist van Besien 2017-08-15 19:09:50 UTC
The problem is in 
roles/openshift_health_checker/openshift_checks/docker_image_availability.py

Here skopeo is called without a HTTPS_PROXY env variable set, and thus it fails...

Comment 7 Marko Myllynen 2017-08-15 19:12:54 UTC
(In reply to Krist van Besien from comment #6)
> The problem is in 
> roles/openshift_health_checker/openshift_checks/docker_image_availability.py
> 
> Here skopeo is called without a HTTPS_PROXY env variable set, and thus it
> fails...

Thanks for pointing that out. However, in some cases access to the internet is not allowed even through proxies so the registry/images specified in the inventory file should be used, if set.

Comment 8 Krist van Besien 2017-08-16 08:44:16 UTC
Basically the test should do this, in order:
- If registries are set in the inventory, check we can reach them and if they have the right images.
- If no registries are set (hence we use registry.access.redhat.com) check that we can reach that. I think we can assume here that registry.access.redhat.com has the correct images. So we do not need to check that. If a https proxy is set, assume we can use that.

Comment 9 hgomes 2017-08-21 20:48:27 UTC
Hello,

There is a workaround found regarding this same issue:

Reran installer, when this came up I went and added the override in the hosts file before saying "y":

--> Uncommented the following line in both files below:
openshift_disable_check=docker_image_availability

Wrote atomic-openshift-installer config: /root/.config/openshift/installer.cfg.yml
Wrote Ansible inventory: /root/.config/openshift/hosts

Ready to run installation process.

If changes are needed please edit the config file above and re-run.

Are you ready to continue? [y/N]:

Comment 11 Luke Meyer 2017-08-25 19:29:59 UTC
The check is designed to look for the necessary images first in the docker index on each host, and only if they are not available, then look for them in the configured docker registries.

In the attached ansible log it looks like most of the images were found, and just a couple were not there (registry-console, and on the masters, etcd). The check would proceed correctly if those were found in the hosts' docker indexes, so the check result should be viewed as correct. If directions for a disconnected environment don't include pulling those images, they should.

What's broken about the check is that in a disconnected environment, we don't want it to fall back to looking at the remote docker registries at all, or if it does (the install assumes a registry if not supplied; I'm not sure it's possible to configure the install to specify no registry so I'm not sure how to detect a disconnected environment), we could impose a reasonably short timeout so it doesn't take forever to fail. I'll see what's to be done about that.

A separate problem is making sure proxies (if specified) are used in querying registries. That doesn't help the disconnected case however.

Comment 12 Luke Meyer 2017-08-25 21:19:23 UTC
https://github.com/openshift/openshift-ansible/pull/5228 sets a 10 second timeout on registry lookups, which should improve the experience a little.

Our documentation for disconnected installs (https://docs.openshift.com/container-platform/latest/install_config/install/disconnected_install.html) doesn't seem to say anything about containerized components. Since etcd was considered missing I assume at least the masters are containerized, but I'm surprised other components were not missing as well (ose, node, openvswitch) in this bug report. Regardless, the docs need updating for that.

The registry-console is mentioned, but as an optional component. I'm not aware of an option for not installing it -- I'm pretty sure the installer includes it whenever there is a registry deployed, and it looks like it could land on any node, so nodes need to have this image. Unless someone knows better, that seems like a docs update too.

I couldn't find any way to keep our installer from configuring registry.access.redhat.com into docker; it's hardcoded to be added on OCP installs. So we don't really have a way to indicate a disconnected install; there will always be registries configured, specifically registry.access. What I could improve in the check is that it could be sure it checks the registries in the order specified (currently it's what order they come out of a hash), so that at least you can put the images in a local registry, configure it first, and have those found without having to consult registry.access when it's unreachable. Finally I could have the check stop querying any registry that has timed out 3 times so we only waste 30 seconds on a host finding out it is unreachable.
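
The ordering and give-up behaviour proposed in this comment can be sketched roughly as follows. This is only an illustration in Python 3, not the code from the pull request: RegistryTracker, find_image and the inspect callback are hypothetical names, and the limit of 3 timeouts is the figure suggested above.

```python
class RegistryTracker:
    """Hypothetical helper: keeps registries in configured order and
    stops offering a registry once it has timed out too often."""

    def __init__(self, registries, max_timeouts=3):
        # A list keeps the configured order, unlike iterating over a hash.
        self.registries = list(registries)
        self.max_timeouts = max_timeouts
        self.timeouts = {registry: 0 for registry in self.registries}

    def usable(self):
        """Registries that have not yet reached the timeout limit."""
        return [r for r in self.registries
                if self.timeouts[r] < self.max_timeouts]

    def record_timeout(self, registry):
        self.timeouts[registry] += 1


def find_image(image, tracker, inspect):
    """Return the first usable registry that has `image`, else None.

    `inspect(registry, image)` is an assumed callback that returns
    True/False and raises TimeoutError when the lookup times out.
    """
    for registry in tracker.usable():
        try:
            if inspect(registry, image):
                return registry
        except TimeoutError:
            tracker.record_timeout(registry)
    return None
```

With a 10 second timeout per lookup, a registry that never answers would cost a host at most about 30 seconds before the sketch stops asking it.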

Comment 13 Marko Myllynen 2017-08-28 10:17:04 UTC
(In reply to Luke Meyer from comment #12)
> https://github.com/openshift/openshift-ansible/pull/5228 sets a 10 second
> timeout on registry lookups, which should improve the experience a little.

Thanks, 10 s should indeed be enough when everything is working ok.

> Our documentation for disconnected installs
> (https://docs.openshift.com/container-platform/latest/install_config/install/
> disconnected_install.html) doesn't seem to say anything about containerized
> components. Since etcd was considered missing I assume at least the masters
> are containerized, but I'm surprised other components were not missing as
> well (ose, node, openvswitch) in this bug report. Regardless, the docs need
> updating for that.

Yes, the document seems to be a bit behind.

> The registry-console is mentioned, but as an optional component. I'm not
> aware of an option for not installing it -- I'm pretty sure the installer
> includes it whenever there is a registry deployed, and it looks like it
> could land on any node, so nodes need to have this image. Unless someone
> knows better, that seems like a docs update too.

RFE to make registry-console optional was rejected just a couple of days ago:

https://bugzilla.redhat.com/show_bug.cgi?id=1425022

> I couldn't find any way to keep our installer from configuring
> registry.access.redhat.com into docker;

RFE to make this possible was rejected just a couple of days ago:

https://bugzilla.redhat.com/show_bug.cgi?id=1461465

> installs. So we don't really have a way to indicate a disconnected install;
> there will always be registries configured, specifically registry.access.
> What I could improve in the check is that it could be sure it checks the
> registries in the order specified (currently it's what order they come out
> of a hash), so that at least you can put the images in a local registry,
> configure it first, and have those found without having to consult
> registry.access when it's unreachable. Finally I could have the check stop
> querying any registry that has timed out 3 times so we only waste 30 seconds
> on a host finding out it is unreachable.

Perhaps you could also check whether you can actually access these registries; in many disconnected environments firewalls or such prevent any internet access, so any connection attempt to registry.access.redhat.com will be rejected (or, in some cases, will eventually time out). But already using the registries in the order they are configured should be a step forward.

Thanks.

Comment 14 Brenton Leanhardt 2017-08-28 14:06:46 UTC
Hi Marko,

We discussed this bug during scrum today and decided it makes sense to fix this as well as Bug #1461465.  Luke may have time to get to it this sprint after devcut.

Comment 15 Luke Meyer 2017-09-07 18:55:13 UTC
Related: https://github.com/openshift/openshift-ansible/issues/5330

Comment 16 Johnny Liu 2017-09-11 06:35:42 UTC
For the registry console and etcd image locations, the user has the following options to specify his/her own local registry URL.

openshift_cockpit_deployer_prefix
osm_etcd_image

The installer uses registry.access as the default only when the user does not specify those options. However, after going through docker_image_availability.py in openshift-ansible-roles-3.6.173.0.31-1.git.0.c9aeacc.el7.noarch, it appears that the docker_image_availability check unfortunately does NOT take those options into consideration; it simply uses registry.access as a hard-coded registry location for the etcd and registry console images.

From my understanding, if the docker_image_availability check took these two options into consideration instead of hard-coding the location, we would only need to update the docs to say that when users specify a local registry, they have to set openshift_cockpit_deployer_prefix and osm_etcd_image to point to the image URL there. Then everything should go well.

We would not even need the two RFE bugs for this bug.

BTW, according to https://docs.openshift.com/container-platform/latest/install_config/install/disconnected_install.html#disconnected-syncing-images, the registry console image also has a preferred tag to check, e.g. v3.6 rather than latest, while the docker_image_availability check uses the latest tag to check the image's availability. This should also be corrected.
<--snip-->
            # The registry-console is for some reason not prefixed with ose- like the other components.
            # Nor is it versioned the same, so just look for latest.
            # Also a completely different name is used for Origin.
            required.add(image_info["registry_console_image"])
<--snip-->

Comment 17 Luke Meyer 2017-09-12 15:37:32 UTC
I'd like to keep the scope of this bug limited to improving the reliability / user experience of the check. Handling proxies (comment 6) and alternate image specifications (comment 16, see also https://github.com/openshift/openshift-ansible/issues/5330) are important fixes that deserve their own bugs. Also the docs need some updates.

https://github.com/openshift/openshift-ansible/pull/5365 created to implement the following:

1. Check correctly for image in docker index (using all registry names).
2. Inspect registries in the order configured; we can't currently configure no registries for a disconnected install, but they could at least configure a local one and fill it with images so that public registries are never pulled.
3. Probe for connectivity to registries and don't inspect ones that we can't reach.
4. Retry skopeo inspect to work around external network blips.
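
As a rough illustration of item 1, the sketch below expands a required image into every name it could carry in the local docker index before comparing against what is present. It is written in Python 3 and is not taken from the pull request: candidate_names and locally_available are hypothetical helpers, and the rule used to recognise an already-qualified name (first path component contains "." or ":" or is "localhost") follows the usual docker naming convention.

```python
def candidate_names(image, registries):
    """Every name under which `image` might appear in the local index."""
    first = image.split("/", 1)[0]
    # A leading component that looks like a host means the name is
    # already fully qualified, so there is nothing to expand.
    if "/" in image and ("." in first or ":" in first or first == "localhost"):
        return [image]
    return [image] + ["{}/{}".format(reg, image) for reg in registries]


def locally_available(required, local_names, registries):
    """Subset of `required` found locally under any candidate name.

    `local_names` is assumed to be the list of name:tag strings already
    present in the host's docker index.
    """
    local = set(local_names)
    return {img for img in required
            if local.intersection(candidate_names(img, registries))}
```

Only images missing from this result would then need to be looked up in the registries, in configured order.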

Comment 18 openshift-github-bot 2017-09-13 03:15:41 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/a3e1c2a819434acc2ce07467c322e12beeee8591
docker_image_availability: probe registry connectivity

Probe whether the host has connectivity to the registry before trying to
inspect it for images, and remember the result. Also if later inspection
fails due to timeout, mark registry as unreachable. Note in failure
output if any registries were unreachable.

Registry order should match what is configured into docker now as well.

Fixes bug 1480195
https://bugzilla.redhat.com/show_bug.cgi?id=1480195
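
For readers who want a feel for the "probe and remember" idea in the commit message above, here is a minimal Python 3 sketch. It is not the committed code: ConnectivityProbe and its methods are hypothetical, the probe is a plain TCP connect, port 443 is assumed when the registry name carries no port, and IPv6 literals are not handled.

```python
import socket


class ConnectivityProbe:
    """Hypothetical helper: probes each registry once and caches the result."""

    def __init__(self, connect=socket.create_connection, timeout=10):
        self.connect = connect      # injectable so it can be faked in tests
        self.timeout = timeout
        self.reachable = {}         # registry -> bool, remembered per host

    def is_reachable(self, registry):
        if registry not in self.reachable:
            host, _, port = registry.partition(":")
            try:
                conn = self.connect((host, int(port or 443)), self.timeout)
                conn.close()
                self.reachable[registry] = True
            except OSError:
                # Covers refused connections, DNS failures and timeouts.
                self.reachable[registry] = False
        return self.reachable[registry]

    def mark_unreachable(self, registry):
        """Record a registry as unreachable after a later inspect timeout."""
        self.reachable[registry] = False
```

Because the result is cached, an unreachable registry costs one probe per host rather than one timeout per image.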

Comment 20 Gan Huang 2017-09-18 10:00:32 UTC
Tested with openshift-ansible-3.7.0-0.126.4.git.0.3fc2b9b.el7.noarch.rpm

The check now fails quickly in a disconnected environment.

According to comment 17, this is the expected result.

Moving to verified.

Comment 21 Gan Huang 2017-09-19 03:31:15 UTC
Apologies, I didn't realize that the bug is targeted to 3.6.

Retested and verified with openshift-ansible-3.6.173.0.37-1.git.0.2774e0b.el7.noarch.rpm

Comment 22 Pradeep D 2017-09-21 16:31:56 UTC
Hello,

I have been watching this thread as I was hitting the same issue with the docker_image_availability tests. I just did a pull of openshift-ansible to the latest code and tried to test to see if the issue was fixed, but I am still hitting the error (while performing a disconnected install)

These are my local registry settings -
openshift_docker_additional_registries=161.192.1.1:5000
openshift_docker_insecure_registries=161.192.1.1:5000
openshift_docker_blocked_registries=docker.io,registry.access.redhat.com
osm_etcd_image=161.192.1.1:5000/rhel7/etcd

And the output when I run the test - 

        "docker_image_availability": {
            "changed": true,
            "failed": true,
            "failures": [
                [
                    "OpenShiftCheckException",
                    "One or more required Docker images are not available:\n    registry.access.redhat.com/rhel7/etcd\nConfigured registries: 161.192.1.1:5000, docker.io\nChecked by: timeout 10 skopeo inspect --tls-verify=false docker://{registry}/{image}"
                ]
            ],
            "msg": "One or more required Docker images are not available:\n    registry.access.redhat.com/rhel7/etcd\nConfigured registries: 161.192.1.1:5000, docker.io\nChecked by: timeout 10 skopeo inspect --tls-verify=false docker://{registry}/{image}"
        },

Anything that I need to do to fix the above error?

Comment 23 Pradeep D 2017-09-21 18:47:55 UTC
In addition to the above (comment 22), the following change works for me, but unsure if this would break a regular install. 

index 98372d9..62c5efd 100644
--- a/roles/openshift_health_checker/openshift_checks/docker_image_availability.py
+++ b/roles/openshift_health_checker/openshift_checks/docker_image_availability.py
@@ -131,7 +131,7 @@ class DockerImageAvailability(DockerHostMixin, OpenShiftCheck):
             for component in components:
                 required.add("{}/{}:{}".format(image_info["namespace"], component, image_tag))
             if 'etcd' in host_groups:  # special case, note it is the same for origin/enterprise
-                required.add("registry.access.redhat.com/rhel7/etcd")  # and no image tag
+                required.add("rhel7/etcd")  # and no image tag

         return required

Comment 24 Luke Meyer 2017-09-21 19:03:14 UTC
(In reply to Pradeep D from comment #22)

> Anything that I need to do to fix the above error?

It is reporting that the "registry.access.redhat.com/rhel7/etcd" image is not available. So if you want to fix it, make that image available. "rhel7/etcd" is not the same image tag, even though it is presumably the right image. Just docker tag it and the check will pass.

The reason the check looks specifically for the fully qualified name is because AFAICS that is what Ansible deploys for containerized etcd, and the install would fail without it being available. If I am wrong about that, then the check should be updated to look for whatever image tag is actually going to be deployed.

Comment 25 Pradeep D 2017-09-21 22:18:29 UTC
(In reply to Luke Meyer from comment #24)
> (In reply to Pradeep D from comment #22)
> 
> > Anything that I need to do to fix the above error?
> 
> It is reporting that the "registry.access.redhat.com/rhel7/etcd" image is
> not available. So if you want to fix it, make that image available.
> "rhel7/etcd" is not the same image tag, even though it is presumably the
> right image. Just docker tag it and the check will pass.
> 
> The reason the check looks specifically for the fully qualified name is
> because AFAICS that is what Ansible deploys for containerized etcd, and the
> install would fail without it being available. If I am wrong about that,
> then the check should be updated to look for whatever image tag is actually
> going to be deployed.

Unfortunately I had already tried that and it does not help either. Here is a list of images (and the various docker tags that I have tried :)) in my internal registry

[ec2-user@ip-161-192-1-1 openshift-ansible]$ docker images | grep -i rhel7
localhost:5000/registry.access.redhat.com/rhel7/etcd         latest              93e8e4932fa2        3 weeks ago         238.6 MB
localhost:5000/rhel7/etcd                                    latest              93e8e4932fa2        3 weeks ago         238.6 MB
registry.access.redhat.com/rhel7/etcd                        latest              93e8e4932fa2        3 weeks ago         238.6 MB
161.192.1.1:5000/registry.access.redhat.com/rhel7/etcd   latest              93e8e4932fa2        3 weeks ago         238.6 MB
161.192.1.1:5000/rhel7/etcd                              latest              93e8e4932fa2        3 weeks ago         238.6 MB

and I get the same error as above - 

     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "docker_image_availability":
               One or more required Docker images are not available:
                   registry.access.redhat.com/rhel7/etcd
               Configured registries: 161.192.1.1:5000, docker.io
               Checked by: timeout 10 skopeo inspect --tls-verify=false docker://{registry}/{image}

Comment 26 Marko Myllynen 2017-09-22 06:13:06 UTC
(In reply to Pradeep D from comment #25)
> 
> Unfortunately I had already tried that and does not help either. Here are a
> list of images (and the various docker tags that I have tried :)) in my
> internal registry
> 
> [ec2-user@ip-161-192-1-1 openshift-ansible]$ docker images | grep -i rhel7
> localhost:5000/registry.access.redhat.com/rhel7/etcd         latest         
> 93e8e4932fa2        3 weeks ago         238.6 MB
> localhost:5000/rhel7/etcd                                    latest         
> 93e8e4932fa2        3 weeks ago         238.6 MB
> registry.access.redhat.com/rhel7/etcd                        latest         
> 93e8e4932fa2        3 weeks ago         238.6 MB
> 161.192.1.1:5000/registry.access.redhat.com/rhel7/etcd   latest             
> 93e8e4932fa2        3 weeks ago         238.6 MB
> 161.192.1.1:5000/rhel7/etcd                              latest             
> 93e8e4932fa2        3 weeks ago         238.6 MB

This is not a support channel, but let me point out that the latest tag is not the one that is used when installing OpenShift; see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1480443 for pointers.

Comment 27 Pradeep D 2017-09-25 14:40:30 UTC
Hi Marko, I understand that this is not a support channel, I am just raising a point about something that is not working (supposed to be fixed as per the comments above). I am more than happy to raise a different request if needed. 

I do specify the image tag for all openshift* images. I am unable to specify the tag for the etcd image. 

As already mentioned above, when I use the script in a regular AWS environment, the installation goes through fine, but it throws errors in a disconnected install, and the issue seems to be in the "docker_image_availability" check, where it tries to check whether the image already exists before it continues with the install. In fact, disabling this check gives me a successful install, which goes on to show that only the code that checks for image availability is not obtaining the etcd image information correctly.

Comment 28 Marko Myllynen 2017-09-25 16:39:28 UTC
(In reply to Pradeep D from comment #27)
> 
> I do specify the image tag for all openshift* images. I am unable to specify
> the tag for the etcd image. 

Just a quick guess (I'm on the road and not able to inspect this in more detail in the near future), but you have:

osm_etcd_image=161.192.1.1:5000/rhel7/etcd

Whereas I have it working with something like:

openshift_docker_additional_registries=registry.example.com:5000
#openshift_docker_insecure_registries=registry.example.com:5000
oreg_url=registry.example.com:5000/openshift3/ose-${component}:${version}
osm_etcd_image=rhel7/etcd
openshift_metrics_image_prefix=registry.example.com:5000/openshift3/
openshift_logging_image_prefix=registry.example.com:5000/openshift3/

There are some possibly helpful scripts and playbooks in my github repo to facilitate disconnected installations, see https://github.com/myllynen/openshift-automation-tools.

Thanks.

Comment 29 Pradeep D 2017-10-05 20:06:36 UTC
(In reply to Marko Myllynen from comment #28)
> Whereas I have it working with something like:
> 
> openshift_docker_additional_registries=registry.example.com:5000
> #openshift_docker_insecure_registries=registry.example.com:5000
> oreg_url=registry.example.com:5000/openshift3/ose-${component}:${version}
> osm_etcd_image=rhel7/etcd
> openshift_metrics_image_prefix=registry.example.com:5000/openshift3/
> openshift_logging_image_prefix=registry.example.com:5000/openshift3/
> 
> There are some possibly helpful scripts and playbooks in my github repo to
> facilitate disconnected installations, see
> https://github.com/myllynen/openshift-automation-tools.

Thanks Marko. Changing the osm_etcd_image to be just rhel7/etcd did the trick. The documentation here - https://docs.openshift.com/container-platform/3.6/install_config/install/rpm_vs_containerized.html - specifies that it needs to be a fully qualified name including the registry, which threw me off.

Comment 31 errata-xmlrpc 2017-10-17 11:45:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2900

Comment 32 Dave Sullivan 2018-05-23 20:38:56 UTC
I don't see this documented, but looking at the git code linked below, skopeo does take these three advanced settings into account:

openshift_http_proxy
This variable specifies the HTTP_PROXY environment variable for masters and the Docker daemon.
openshift_https_proxy
This variable specifies the HTTPS_PROXY environment variable for masters and the Docker daemon.
openshift_no_proxy
This variable is used to set the NO_PROXY environment variable for masters and the Docker daemon. This value should be set to a comma separated list of host names or wildcard host names that should not use the defined proxy. This list will be augmented with the list of all defined OpenShift Container Platform host names by default.


https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/openshift_health_checker/openshift_checks/docker_image_availability.py#L87
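
To make the idea concrete, here is one way the three inventory variables listed above could be turned into an environment for a skopeo subprocess. This is a hedged Python 3 sketch, not the code at the link: skopeo_env and its task_vars argument are hypothetical, and only the variable names and the environment variable names come from the description above.

```python
import os

# Inventory variable -> environment variable, as described above.
PROXY_VARS = {
    "openshift_http_proxy": "HTTP_PROXY",
    "openshift_https_proxy": "HTTPS_PROXY",
    "openshift_no_proxy": "NO_PROXY",
}


def skopeo_env(task_vars, base_env=None):
    """Build an environment dict carrying any configured proxy settings.

    `task_vars` is assumed to be a plain dict of inventory variables;
    unset or empty values are skipped so existing settings survive.
    """
    env = dict(os.environ if base_env is None else base_env)
    for inventory_var, env_name in PROXY_VARS.items():
        value = task_vars.get(inventory_var)
        if value:
            env[env_name] = value
    return env
```

The resulting dict could be passed as the env argument of subprocess.run when invoking skopeo; as noted earlier in this bug, that helps proxied environments but not fully disconnected ones.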